Generating Images with Large Language Models (GILL)

By now, we all know that LLMs work by representing sentences as embeddings: vectors in a large, high-dimensional semantic space.
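To make the idea concrete, here is a toy PyTorch sketch of what "a sentence becomes vectors" means. Everything here is made up for illustration (the vocabulary size, embedding width, and token ids are not taken from any real model); actual LLMs learn these tables over huge corpora.

```python
import torch
import torch.nn as nn

# Toy illustration: each token id maps to a learned vector, so a sentence
# becomes a sequence of points in a high-dimensional space.
vocab_size, d_model = 50_000, 768     # made-up sizes
embed = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[101, 2023, 2003, 1037, 6251, 102]])  # fake ids
vectors = embed(token_ids)            # (1, 6, 768): one vector per token
sentence_vec = vectors.mean(dim=1)    # crude sentence embedding (mean-pool)
print(sentence_vec.shape)             # torch.Size([1, 768])
```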

While LLMs like GPT-4 are able to take in multiple types of data, including text, images, audio, and video, their outputs are text-only.

Text-to-image generation models, meanwhile, have their own, separate embedding spaces that map text to images.

Is there a way to bridge this gap?

It turns out there IS a way to make two different embedding spaces talk to each other. It is akin to building converters between, say, the Google and Apple universes. Impossible, you say? Not in this particular instance.

A recent paper titled "Generating Images with Multimodal Language Models" shows the way. So how do they do it?

They use a pre-trained LLM together with a pre-trained image generator to produce both image and text outputs. The vast majority of the parameters are kept frozen, with only a small number fine-tuned on image-caption data. There are four main challenges that they had to overcome:

  • The model needs to learn to process interleaved image-and-text content.
  • It needs to learn to produce images (retrieved or generated).
  • It needs to decide whether to produce text or an image at each step.
  • Finally, when producing an image, it needs to decide whether to generate a new one or retrieve an existing one through similarity search.

  1. First, image and associated text inputs are taken in and converted to text interleaved with image embeddings. The important piece to note is that this mapping translates images into embedding vectors that live in the token embedding space of the LLM being used (see the first sketch below).
  2. In the next stage, the model's output embeddings are passed through a decision module that decides whether to return an existing image from a database via similarity search, or to hand the output to the GILLMapper module to generate a new image (see the second sketch below).
  3. This last piece is done by mapping the hidden states at the [IMG] token positions from the LLM's embedding space to a semantically meaningful region of the input space of an image generation model like Stable Diffusion (see the third sketch below).
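To make the three steps concrete, here are three small PyTorch sketches. First, the input mapping. This assumes a frozen vision encoder (e.g. a CLIP ViT) emitting a 768-d image feature and an LLM with 4096-d token embeddings; the class name, the dimensions, and the number of visual tokens k are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ImageToTokenMapper(nn.Module):
    """Projects a frozen vision encoder's feature into k vectors that live
    in the LLM's token embedding space, so they can be interleaved with
    ordinary text token embeddings. (Hypothetical sizes throughout.)"""
    def __init__(self, vis_dim=768, llm_dim=4096, k=4):
        super().__init__()
        self.k, self.llm_dim = k, llm_dim
        self.proj = nn.Linear(vis_dim, k * llm_dim)   # the only trained part

    def forward(self, img_feat):                      # (batch, vis_dim)
        out = self.proj(img_feat)                     # (batch, k * llm_dim)
        return out.view(-1, self.k, self.llm_dim)     # (batch, k, llm_dim)

mapper = ImageToTokenMapper()
img_feat = torch.randn(2, 768)        # stand-in for CLIP image features
visual_tokens = mapper(img_feat)
print(visual_tokens.shape)            # torch.Size([2, 4, 4096])
```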
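Second, the retrieve-or-generate decision. A small head scores the two options from the model's output embedding, and the retrieval branch is a cosine-similarity search. The two-way linear head, the dimensions, and the random "image bank" are stand-ins for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

llm_dim = 4096
decide = nn.Linear(llm_dim, 2)        # logits for [retrieve, generate]

h = torch.randn(1, llm_dim)           # stand-in for an LLM output embedding
choice = decide(h).argmax(dim=-1).item()

if choice == 0:
    # Retrieval branch: cosine similarity against a pre-embedded image bank.
    image_bank = F.normalize(torch.randn(10_000, llm_dim), dim=-1)
    scores = F.normalize(h, dim=-1) @ image_bank.T    # (1, 10_000)
    best = scores.argmax(dim=-1)
    print("retrieve image at index", best.item())
else:
    print("hand off to the generation mapper (next sketch)")
```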
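Third, a rough stand-in for GILLMapper itself: a small transformer that maps the LLM's hidden states at the [IMG] token positions into the conditioning space Stable Diffusion expects (the 77 x 768 shape of its CLIP text-encoder output). The layer sizes and the number of [IMG] tokens here are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class GILLMapperSketch(nn.Module):
    """Maps LLM hidden states at the [IMG] positions to a 77 x 768
    conditioning tensor for a Stable-Diffusion-style generator.
    (Rough sketch; sizes and depth are assumptions.)"""
    def __init__(self, llm_dim=4096, cond_len=77, cond_dim=768):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(cond_len, cond_dim))
        self.in_proj = nn.Linear(llm_dim, cond_dim)
        layer = nn.TransformerDecoderLayer(d_model=cond_dim, nhead=8,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)

    def forward(self, img_hidden):                    # (batch, r, llm_dim)
        mem = self.in_proj(img_hidden)                # (batch, r, cond_dim)
        q = self.queries.expand(img_hidden.size(0), -1, -1)
        return self.decoder(q, mem)                   # (batch, 77, cond_dim)

mapper = GILLMapperSketch()
img_hidden = torch.randn(1, 8, 4096)  # hidden states at 8 [IMG] positions
cond = mapper(img_hidden)
print(cond.shape)                     # torch.Size([1, 77, 768]); feed to SD
```

In the paper, a mapper of this kind is trained so that its output matches the frozen text encoder's embedding of the ground-truth caption, which is what makes the mapped region "semantically meaningful" to the image generator.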

This process shows that, starting from off-the-shelf LLM and image-generation models, a mapping can be learned that lets a "conversation" happen between two models with different embedding spaces. Isn't that something!

Pawan Dhail

Talent & Tech Capability Development

1y

This section of Gen-AI is so exciting to explore and try implementing yourself. It's like talking pictures; or, going the other way, making your words portray your thoughts as you speak.

Sriram S

AI Strategy | Advisory | Google PMLE | Star Performer | Learning Catalyst | Data Science Mentor

1y

Lovely work
