RuDOLPH: One Hyper-Modal Transformer can be creative as DALL-E and smart as CLIP

Related tags

Deep Learningru-dolph
Overview

[Paper] [Хабр] [Model Card] [Colab] [Kaggle]

RuDOLPH 🦌 🎄 ☃️

One Hyper-Modal Transformer can be creative as DALL-E and smart as CLIP


Russian Diffusion On Language Picture Hyper-modality (RuDOLPH) is a fast and light text-image-text transformer (350M GPT-3) designed for a quick and easy fine-tuning setup for the solution of various tasks: from generating images by text description and image classification to visual question answering and more. This model demonstrates the power of Hyper-modality Transformers.

(!!!) Hyper-modality means generalized multi-modal, e.g., model that consists of two multi-modal parts: text-2-image and image-2-text becomes text and image hyper-modality model

Sparse Attention Mask

row - col - row - [last] conv

Models

Installing

pip install rudolph==0.0.1rc8

Usage

Fine-Tuning example by @Alex Wortega Open In Colab

Init models

from rudalle import get_tokenizer, get_vae
from rudalle.utils import seed_everything
from rudalle.image_prompts import ImagePrompts

from rudolph.model import get_rudolph_model
from rudolph.pipelines import zs_clf, generate_codebooks, self_reranking_by_image, self_reranking_by_text, show, generate_captions, generate_texts
from rudolph import utils

device = 'cuda'
model = get_rudolph_model('350M', fp16=True, device=device)
model.to(device);
tokenizer = get_tokenizer()
vae = get_vae(dwt=False).to(device)

Setup for Fast Image Generation

text = 'старинный будильник многоугольной формы'
bs, images_num = 48, 48
top_k, top_p = 512, 0.9
with torch.no_grad():
    codebooks = generate_codebooks(text, tokenizer, model, top_k=top_k, images_num=images_num, top_p=top_p, bs=bs)
    ppl_text, ppl_image = self_reranking_by_text(text, codebooks, tokenizer, model, bs=bs)
    images = vae.decode(codebooks[ppl_text.argsort()[:9]])
images = torchvision.utils.make_grid(images, nrow=3)
img = torchvision.transforms.functional.to_pil_image(images)
img

Text Generation

generate_texts(
    tokenizer,
    model,
    template='красивый пейзаж ',
    top_k=32, top_p=0.8, texts_num=32, bs=32, seed=42
)[:8]

[{'text': 'красивый пейзаж и деревья в горах с синим небом и облаками в солнечный день. карпаты украина', 'ppl': 155.72},
 {'text': 'красивый пейзаж с горным озером и красивым пейзажем на восходе солнца', 'ppl': 195.81},
 {'text': 'красивый пейзаж с горными вершинами и чистым небом', 'ppl': 219.57},
 {'text': 'красивый пейзаж с горами в тумане, покрывающими горы', 'ppl': 221.36},
 {'text': 'красивый пейзаж и водопад в национальном парке пхутта в таиланде', 'ppl': 248.82},
 {'text': 'красивый пейзаж с голубым небом и белым облаком', 'ppl': 260.76},
 {'text': 'красивый пейзаж с рекой, горы и голубое небо', 'ppl': 273.1},
 {'text': 'красивый пейзаж с зелеными деревьями и голубым небом', 'ppl': 286.22}]

Image Generation + Self Reranking

text = 'красивый пейзаж с озером и лесом на заднем плане'
images_num, bs = 256, 32
seed_everything(42)
codebooks = []
for top_k, top_p, images_num in [
    (2048, 0.975, images_num),
    (1536, 0.975, images_num),
    (1024, 0.975, images_num),
]:
    codebooks.append(generate_codebooks(text, tokenizer, model, top_k=top_k, images_num=images_num, top_p=top_p, bs=bs))

codebooks = torch.cat(codebooks)

ppl_text, ppl_image = self_reranking_by_text(text, codebooks, tokenizer, model, bs=bs)
with torch.no_grad():
    images = vae.decode(codebooks[ppl_text.argsort()[:16]])

pil_images = utils.torch_tensors_to_pil_list(images)
show(pil_images, 8)

text = 'зимнее время года'

ppl_text, ppl_image = self_reranking_by_text(text, codebooks, tokenizer, model, bs=32)
with torch.no_grad():
    images = vae.decode(codebooks[ppl_text.argsort()[:16]])

pil_images = utils.torch_tensors_to_pil_list(images)
show(pil_images, 8)

text = 'ночное время суток'

ppl_text, ppl_image = self_reranking_by_text(text, codebooks, tokenizer, model, bs=32)
with torch.no_grad():
    images = vae.decode(codebooks[ppl_text.argsort()[:16]])

pil_images = utils.torch_tensors_to_pil_list(images)
show(pil_images, 8)

Image Prompt (like Inpainting)

text = 'лодка с алыми парусами'

images_num = 1024
bs = 32

borders = {'up': 6, 'left': 4, 'right': 6, 'down': 2}
image_prompts = ImagePrompts(pil_img, borders, vae, device, crop_first=True)

seed_everything(42)
codebooks = []
for top_k, top_p, images_num in [
    (1024, 0.99, images_num),
]:
    codebooks.append(
        generate_codebooks(text, tokenizer, model, top_k=top_k, images_num=images_num, top_p=top_p, bs=bs, image_prompts=image_prompts)
    )

codebooks = torch.cat(codebooks)

ppl_text, ppl_image = self_reranking_by_text(
    text,
    codebooks,
    tokenizer,
    model,
    bs=bs,
)
with torch.no_grad():
    images = vae.decode(codebooks[ppl_text.argsort()[:16]])

pil_images = utils.torch_tensors_to_pil_list(images)
show(pil_images, 8)

Diffusion (TODO, see Colab)

Image Captioning + Self Reranking

texts = generate_captions(pil_img, tokenizer, model, vae, template='на картинке ', top_k=16, captions_num=128, bs=32, top_p=0.6, temperature=0.8, seed=43, limit_eos=False)
ppl_text, ppl_image = self_reranking_by_image(texts, pil_img, tokenizer, model, vae, bs=32, seed=42)
for idx in ppl_image.argsort()[:8]:
    print(f'-{texts[idx]}')

-на картинке изображено - каяк с плавающей на нем женщиной
-на картинке - лодка с призраками
-на картинке корабль « », вид с воздуха
-на картинке лодка с парусом и 3d эффектом, вид с воздуха
-на картинке лодка с привидениями, вид сверху
-на картинке подводная лодка «акула», вид с воздуха
-на картинке изображено - надувная лодка с жестким дном
-на картинке с сайта esquire, изображен маленький красный корабль

-на картинке собака с длинными ушами, вид спереди
-на картинке собака с большими ушами и с длинными лапами, вид спереди
-на картинке собака с большими ушами и мордой собаки, вид спереди
-на картинке собака с белой гривой, вид спереди собака с коричневым цветом
-на картинке собака с большими ушами и собака с большими ушами, вид спереди
-на картинке собака с большими ушами и коричневым мехом, вид спереди
-на картинке собака с белой гривой, вид спереди собака с белой гривой
-на картинке собака с большими ушами и длинными ушами, вид спереди

-на картинке изображен жилой комплекс «арбат»
-на картинке видно здание с окнами в центре города
-на картинке изображен жилой дом с видом на улицу
-на картинке виднеется здание в центре города
-на картинке изображен вид на жилой комплекс, вид с улицы
-на картинке видна башня банка сбербанка
-на картинке изображен фасад здания с окнами в центре города
-на картинке виднеется здание с балконом

-на картинке мотоцикл иж юпитер вариант с мотором от иж юпитер, вид сзади
-на картинке мотоцикл с мотором и мотором с мотором от мотоцикла, вид сбоку
-на картинке изображен мотоцикл с кузовом из фильма «бэтмен против супермена», вид спереди
-на картинке велосипед с велосипедом в гараже, вид спереди
-на картинке мотоцикл с мотоциклом «мотоцикл» вид сзади, вид спереди
-на картинке велосипед с корзиной для покупок, вид сзади
-на картинке велосипед с мотором от мотоцикла иж юпитер вариант 2 варианта, вид сбоку
-на картинке мотоцикл с мотоциклом « », вид спереди

Zero-Shot Image Classification using PPL

import base64
import requests
from PIL import Image
from io import BytesIO

bs4_urls = requests.get('https://raw.githubusercontent.com/sberbank-ai/ru-dolph/master/pics/pipelines/cats_vs_dogs_bs4.json').json()

f, ax = plt.subplots(2,4, figsize=(12,6))

for i, bs4_url in enumerate(bs4_urls):
    pil_img = Image.open(BytesIO(base64.b64decode(bs4_url)))
    
    classes = ['кошка', 'собака']
    preds = zs_clf(
        pil_img, 
        classes,
        model, 
        tokenizer,
        vae,
        template = '{}', 
    )
    ax[i//4, i%4].imshow(pil_img)
    ax[i//4, i%4].set_title(preds['class'])

Linear Probe (TODO, see Colab)

Authors:

Drawing Drawing

Citation

@article{shonenkov2022ruDolph,
  title         = {RuDOLPH: One Hyper-Modal Transformer can be creative as DALL-E and smart as CLIP},
  author        = {Alex Shonenkov and Michael Konstantinov},
  year          = {2022},
  eprint        = {...},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}
@misc{github2022ruDolph,
  title         = {RuDOLPH: One Hyper-Modal Transformer can be creative as DALL-E and smart as CLIP},
  author        = {Alex Shonenkov and Michael Konstantinov},
  year          = {2022},
  howpublished  = {\url{https://github.com/sberbank-ai/ru-dolph}},
}

Supported by

Owner
AI Forever
Creating ML for the future. AI projects you already know. We are non-profit organization with members from all over the world.
AI Forever
Official implementation of the NRNS paper: No RL, No Simulation: Learning to Navigate without Navigating

No RL No Simulation (NRNS) Official implementation of the NRNS paper: No RL, No Simulation: Learning to Navigate without Navigating NRNS is a heriarch

Meera Hahn 20 Nov 29, 2022
Compare neural networks by their feature similarity

PyTorch Model Compare A tiny package to compare two neural networks in PyTorch. There are many ways to compare two neural networks, but one robust and

Anand Krishnamoorthy 181 Jan 04, 2023
QA-GNN: Question Answering using Language Models and Knowledge Graphs

QA-GNN: Question Answering using Language Models and Knowledge Graphs This repo provides the source code & data of our paper: QA-GNN: Reasoning with L

Michihiro Yasunaga 434 Jan 04, 2023
Official implementation of the paper 'Efficient and Degradation-Adaptive Network for Real-World Image Super-Resolution'

DASR Paper Efficient and Degradation-Adaptive Network for Real-World Image Super-Resolution Jie Liang, Hui Zeng, and Lei Zhang. In arxiv preprint. Abs

81 Dec 28, 2022
TriMap: Large-scale Dimensionality Reduction Using Triplets

TriMap TriMap is a dimensionality reduction method that uses triplet constraints to form a low-dimensional embedding of a set of points. The triplet c

Ehsan Amid 235 Dec 24, 2022
Neon-erc20-example - Example of creating SPL token and wrapping it with ERC20 interface in Neon EVM

Example of wrapping SPL token by ERC2-20 interface in Neon Requirements Install

7 Mar 28, 2022
GULAG: GUessing LAnGuages with neural networks

GULAG: GUessing LAnGuages with neural networks Classify languages in text via neural networks. Привет! My name is Egor. Was für ein herrliches Frühl

Egor Spirin 12 Sep 02, 2022
A 35mm camera, based on the Canonet G-III QL17 rangefinder, simulated in Python.

c is for Camera A 35mm camera, based on the Canonet G-III QL17 rangefinder, simulated in Python. The purpose of this project is to explore and underst

Daniele Procida 146 Sep 26, 2022
Make Watson Assistant send messages to your Discord Server

Make Watson Assistant send messages to your Discord Server Prerequisites Sign up for an IBM Cloud account. Fill in the required information and press

1 Jan 10, 2022
DSTC10 Track 2 - Knowledge-grounded Task-oriented Dialogue Modeling on Spoken Conversations

DSTC10 Track 2 - Knowledge-grounded Task-oriented Dialogue Modeling on Spoken Conversations This repository contains the data, scripts and baseline co

Alexa 51 Dec 17, 2022
MNE: Magnetoencephalography (MEG) and Electroencephalography (EEG) in Python

MNE-Python MNE-Python software is an open-source Python package for exploring, visualizing, and analyzing human neurophysiological data such as MEG, E

MNE tools for MEG and EEG data analysis 2.1k Dec 28, 2022
Music Generation using Neural Networks Streamlit App

Music_Gen_Streamlit "Music Generation using Neural Networks" Streamlit App TO DO: Make a run_app.sh Introduction [~5 min] (Sohaib) Team Member names/i

Muhammad Sohaib Arshid 6 Aug 09, 2022
A simple program for training and testing vit

Vit This is a simple program for training and testing vit. Key requirements: torch, torchvision and timm. Dataset I put 5 categories of the cub classi

xiezhenyu 2 Oct 11, 2022
PyTorch code for DriveGAN: Towards a Controllable High-Quality Neural Simulation

PyTorch code for DriveGAN: Towards a Controllable High-Quality Neural Simulation

76 Dec 24, 2022
The ICS Chat System project for NYU Shanghai Fall 2021

ICS_Chat_System [Catenger] This is the ICS Chat System project for NYU Shanghai Fall 2021 Creators: Shavarsh Melikyan, Skyler Chen and Arghya Sarkar,

1 Dec 20, 2021
PyTorch implementation of "Continual Learning with Deep Generative Replay", NIPS 2017

pytorch-deep-generative-replay PyTorch implementation of Continual Learning with Deep Generative Replay, NIPS 2017 Results Continual Learning on Permu

Junsoo Ha 127 Dec 14, 2022
50-days-of-Statistics-for-Data-Science - This repository consist of a 50-day program

50-days-of-Statistics-for-Data-Science - This repository consist of a 50-day program. All the statistics required for the complete understanding of data science will be uploaded in this repository.

komal_lamba 22 Dec 09, 2022
OpenCV, MediaPipe Pose Estimation, Affine Transform for Icon Overlay

Yoga Pose Identification and Icon Matching Project Goal Detect yoga poses performed by a user and overlay a corresponding icon image. Running the main

Anna Garverick 1 Dec 03, 2021
STYLER: Style Factor Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech

STYLER: Style Factor Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech Keon Lee, Ky

Keon Lee 114 Dec 12, 2022
A PyTorch implementation of Mugs proposed by our paper "Mugs: A Multi-Granular Self-Supervised Learning Framework".

Mugs: A Multi-Granular Self-Supervised Learning Framework This is a PyTorch implementation of Mugs proposed by our paper "Mugs: A Multi-Granular Self-

Sea AI Lab 62 Nov 08, 2022