Generating Images with Large Language Models (GILL)
Arun Krishnan
By now, we all know that LLMs work by mapping text into embeddings in a large, high-dimensional space.
While LLMs like GPT-4 can take in multiple types of data, including text, images, audio and video, their outputs are text-only.
Text-to-image generation models, on the other hand, have their own text embeddings, learned in an entirely separate space.
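To make the mismatch concrete, here is a minimal sketch that pulls sentence embeddings from an LLM-style encoder and from the CLIP text encoder used by Stable Diffusion, and simply prints their shapes. The checkpoints are only examples (the GILL paper pairs an OPT-family LLM with Stable Diffusion); the point is that the two models produce embeddings of different sizes in unrelated spaces.

```python
import torch
from transformers import AutoTokenizer, AutoModel, CLIPTokenizer, CLIPTextModel

prompt = "a watercolor painting of a lighthouse at dusk"

# Embeddings from a generic LLM-style encoder (example checkpoint).
llm_tok = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
llm = AutoModel.from_pretrained("facebook/opt-1.3b")
with torch.no_grad():
    llm_emb = llm(**llm_tok(prompt, return_tensors="pt")).last_hidden_state

# Embeddings from the text encoder Stable Diffusion v1.x conditions on (CLIP ViT-L/14).
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_text = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
with torch.no_grad():
    clip_emb = clip_text(**clip_tok(prompt, return_tensors="pt")).last_hidden_state

print(llm_emb.shape)   # e.g. (1, seq_len, 2048) for OPT-1.3B
print(clip_emb.shape)  # e.g. (1, seq_len, 768) for CLIP ViT-L/14
```

The two tensors cannot be fed into each other's models directly; something has to translate between them.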
Is there a way to bridge this gap?
It turns out there IS a way to make two different embedding spaces talk to each other. This is akin to building converters between, say, the Google and Apple universes. Impossible, you would say? Not in this particular instance.
A recent paper titled "Generating Images with Multimodal Language Models" shows the way. So how do they do it?
They use a pre-trained LLM together with a pre-trained image-generation model to produce both text and image outputs. The vast majority of the parameters are kept frozen; only a small mapping network is fine-tuned, using image-caption data. There were three main challenges that they had to overcome.
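The core mechanism can be sketched roughly as follows. This is a deliberately simplified stand-in for the paper's transformer-based "GILLMapper": a small trainable network that maps the frozen LLM's hidden states at special [IMG] token positions into the space the frozen image generator conditions on, trained with a simple regression loss on image-caption pairs. All dimensions and the loss setup here are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

LLM_DIM = 4096       # hidden size of the frozen LLM (illustrative)
GEN_DIM = 768        # embedding size the frozen image generator conditions on (illustrative)
NUM_IMG_TOKENS = 8   # number of learned [IMG] tokens appended after the caption

class EmbeddingMapper(nn.Module):
    """Small trainable network: LLM hidden states -> generator conditioning space."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LLM_DIM, 1024),
            nn.GELU(),
            nn.Linear(1024, GEN_DIM),
        )

    def forward(self, img_token_states):
        # img_token_states: (batch, NUM_IMG_TOKENS, LLM_DIM)
        return self.net(img_token_states)  # (batch, NUM_IMG_TOKENS, GEN_DIM)

mapper = EmbeddingMapper()
optimizer = torch.optim.AdamW(mapper.parameters(), lr=1e-4)

def training_step(llm_img_states, target_embeds):
    """One step on an image-caption batch.

    llm_img_states: hidden states of the *frozen* LLM at the [IMG] token
                    positions for a caption, shape (batch, NUM_IMG_TOKENS, LLM_DIM).
    target_embeds:  the *frozen* generator's text-encoder embeddings of the same
                    caption; assumed here to be reduced to
                    (batch, NUM_IMG_TOKENS, GEN_DIM) so the shapes line up.
    """
    pred = mapper(llm_img_states)
    loss = nn.functional.mse_loss(pred, target_embeds)  # pull the two spaces together
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Only the mapper's few million parameters are updated; both large models stay untouched, which is why image-caption pairs alone are enough supervision.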
This process ensures that, using off-the-shelf pre-trained LLM and image-generation models, a mapping can be learned that lets the two models "converse" despite their different embedding spaces. Isn't that something!
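At inference time, that "conversation" looks roughly like the sketch below. It reuses the mapper from the previous sketch; get_img_token_states is a hypothetical helper standing in for "run the frozen LLM and collect its hidden states at the [IMG] token positions", and the checkpoint name is just an example, not the paper's released code.

```python
import torch
from diffusers import StableDiffusionPipeline

# Frozen image generator; checkpoint name is illustrative.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def text_to_image(llm, tokenizer, mapper, prompt):
    # 1. Run the frozen LLM on the prompt and grab its hidden states at the
    #    learned [IMG] token positions (hypothetical helper).
    img_token_states = get_img_token_states(llm, tokenizer, prompt)

    # 2. Translate those hidden states into the generator's conditioning space
    #    with the small trained mapper.
    with torch.no_grad():
        cond = mapper(img_token_states)  # (1, NUM_IMG_TOKENS, GEN_DIM)

    # 3. Condition the frozen diffusion model directly on the mapped embeddings
    #    instead of a text prompt.
    image = pipe(prompt_embeds=cond.to(device="cuda", dtype=torch.float16)).images[0]
    return image
```

The LLM never sees pixels at generation time and the diffusion model never sees the LLM's vocabulary; the mapper is the only bridge between them.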