DALL·E: Creating Images from Text Captions

Ali Abbaszadeh

Senior Data Scientist - GenAI

Published: January 9, 2021

GPT-3 showed that language can be used to instruct a large neural network to perform a variety of text generation tasks. Image GPT showed that the same type of neural network can also be used to generate images with high fidelity. We extend these findings to show that manipulating visual concepts through language is now within reach.

Like GPT-3, DALL·E is a transformer language model. It receives both the text and the image as a single stream of data containing up to 1280 tokens, and is trained using maximum likelihood to generate all of the tokens, one after another. This training procedure allows DALL·E to not only generate an image from scratch, but also to regenerate any rectangular region of an existing image that extends to the bottom-right corner, in a way that is consistent with the text prompt.
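The maximum-likelihood objective described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `NextTokenModel` and `uniform_model` are hypothetical stand-ins for the real network, and the toy stream and vocabulary size are made up for the example.

```python
import math
from typing import Callable, Sequence

# Hypothetical interface (not DALL·E's actual API): given the tokens seen so
# far, return one probability per token id in the vocabulary.
NextTokenModel = Callable[[Sequence[int]], Sequence[float]]


def negative_log_likelihood(stream: Sequence[int], model: NextTokenModel) -> float:
    """Sum of -log p(token_t | tokens before t) over the whole stream."""
    total = 0.0
    for t, token in enumerate(stream):
        probs = model(stream[:t])       # prediction conditioned on the prefix only
        total -= math.log(probs[token])
    return total


def uniform_model(vocab_size: int) -> NextTokenModel:
    """Toy stand-in that ignores the prefix and predicts uniformly."""
    return lambda prefix: [1.0 / vocab_size] * vocab_size


# Toy stream: 3 "text" tokens followed by 4 "image" tokens in one sequence.
toy_stream = [2, 0, 1, 5, 7, 6, 4]
nll = negative_log_likelihood(toy_stream, uniform_model(vocab_size=8))
```

Training by maximum likelihood amounts to minimizing this quantity over the training streams. Because text and image tokens sit in the same sequence, the same loss covers both.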

DALL·E is able to create plausible images for a great variety of sentences that explore the compositional structure of language.

DALL·E is a simple decoder-only transformer that receives both the text and the image as a single stream of 1280 tokens—256 for the text and 1024 for the image—and models all of them autoregressively. The attention mask at each of its 64 self-attention layers allows each image token to attend to all text tokens. DALL·E uses the standard causal mask for the text tokens, and sparse attention for the image tokens with either a row, column, or convolutional attention pattern, depending on the layer. We plan to provide more details about the architecture and training procedure in an upcoming paper.
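To make the masking rules concrete, here is a minimal sketch of one possible attention mask. It assumes the 1024 image tokens form a 32×32 grid in raster order, and it uses a simplified causal row pattern for the image tokens. The exact sparse patterns are not specified in this post, so `build_attention_mask` is a hypothetical helper, not the model's actual implementation.

```python
from typing import List


def build_attention_mask(n_text: int = 256, grid: int = 32) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k.

    Assumed layout: n_text text tokens first, then grid * grid image tokens
    in raster order.
    """
    n = n_text + grid * grid
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        if q < n_text:
            # Text tokens: standard causal mask.
            for k in range(q + 1):
                mask[q][k] = True
        else:
            # Image tokens: every text token is visible...
            for k in range(n_text):
                mask[q][k] = True
            # ...plus earlier image tokens in the same row (simplified,
            # assumed row-attention pattern) and the token itself.
            row_start = n_text + ((q - n_text) // grid) * grid
            for k in range(row_start, q + 1):
                mask[q][k] = True
    return mask


small = build_attention_mask(n_text=2, grid=2)   # 2 text + 4 image tokens
full = build_attention_mask()                    # 256 text + 1024 image tokens
```

A column or convolutional layer would change only the image-to-image part of the mask. The text-causal part and the image-to-text part stay the same in every layer.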

Text-to-image synthesis has been an active area of research since the pioneering work of Reed et al. [1], whose approach uses a GAN conditioned on text embeddings. The embeddings are produced by an encoder pretrained using a contrastive loss, not unlike CLIP. StackGAN [3] and StackGAN++ [4] use multi-scale GANs to scale up the image resolution and improve visual fidelity. AttnGAN [5] incorporates attention between the text and image features, and proposes a contrastive text-image feature matching loss as an auxiliary objective. This is interesting to compare to our reranking with CLIP, which is done offline. Other work [2, 6, 7] incorporates additional sources of supervision during training to improve image quality. Finally, work by Nguyen et al. [8] and Cho et al. [9] explores sampling-based strategies for image generation that leverage pretrained multimodal discriminative models.

Similar to the rejection sampling used in VQ-VAE-2, we use CLIP to rerank the 512 samples generated for each caption and keep the top 32 in all of the interactive visuals. This procedure can also be seen as a kind of language-guided search [16], and can have a dramatic impact on sample quality.
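The reranking step can be sketched as a similarity sort. The sketch below assumes each sample and the caption have already been embedded into a shared vector space, which is what CLIP's image and text encoders would provide. The function names and the toy two-dimensional embeddings are hypothetical and used only for illustration.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def rerank(caption_embedding: Sequence[float],
           sample_embeddings: Sequence[Sequence[float]],
           keep: int = 32) -> List[int]:
    """Return indices of the `keep` samples most similar to the caption."""
    scores = [cosine(caption_embedding, e) for e in sample_embeddings]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:keep]


# Toy data: four "image" embeddings scored against one "caption" embedding.
caption = [1.0, 0.0]
samples = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]
best = rerank(caption, samples, keep=2)
```

In the setting described above, `sample_embeddings` would hold 512 entries and `keep` would be 32. The generator itself is untouched, which is why the reranking can run offline.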

Example text prompt: "an illustration of a baby daikon radish in a tutu walking a dog"
