Generating Images with Large Language Models (GILL)

By now, we all know that LLMs work by representing sentences as embeddings: vectors in a large, high-dimensional semantic space.
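To make the idea concrete, here is a toy PyTorch sketch of what "a sentence becomes vectors" means. Everything here is made up for illustration (the vocabulary size, embedding width, and token ids are not taken from any real model); actual LLMs learn these tables over huge corpora.

```python
import torch
import torch.nn as nn

# Toy illustration: each token id maps to a learned vector, so a sentence
# becomes a sequence of points in a high-dimensional space.
vocab_size, d_model = 50_000, 768     # made-up sizes
embed = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[101, 2023, 2003, 1037, 6251, 102]])  # fake ids
vectors = embed(token_ids)            # (1, 6, 768): one vector per token
sentence_vec = vectors.mean(dim=1)    # crude sentence embedding (mean-pool)
print(sentence_vec.shape)             # torch.Size([1, 768])
```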

While LLMs like GPT-4 are able to take in multiple types of data, including text, images, audio, and video, their outputs are text-only.

Text-to-image generation models, meanwhile, have their own, separate embedding spaces that map text to images.

Is there a way to bridge this gap?

It turns out there IS a way to make two different embedding spaces talk to each other. It is akin to building converters between, say, the Google and Apple universes. Impossible, you say? Not in this particular instance.

A recent paper titled "Generating Images with Multimodal Language Models" shows the way. So how do they do it?

They use a pre-trained LLM together with a pre-trained image generator to produce both image and text outputs. The vast majority of the parameters are kept frozen, with only a small number fine-tuned on image-caption data. There are four main challenges that they had to overcome:

  • The model needs to learn to process interleaved image-and-text content.
  • It needs to learn to produce images (retrieved or generated).
  • It needs to decide whether to produce text or an image at each step.
  • Finally, when producing an image, it needs to decide whether to generate a new one or retrieve an existing one through similarity search.

  1. First, image and associated text inputs are taken in and converted to text interleaved with image embeddings. The important piece to note is that this mapping translates images into embedding vectors that live in the token embedding space of the LLM being used (see the first sketch below).
  2. In the next stage, the model's output embeddings are passed through a decision module that decides whether to return an existing image from a database via similarity search, or to hand the output to the GILLMapper module to generate a new image (see the second sketch below).
  3. This last piece is done by mapping the hidden states at the [IMG] token positions from the LLM's embedding space to a semantically meaningful region of the input space of an image generation model like Stable Diffusion (see the third sketch below).
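To make the three steps concrete, here are three small PyTorch sketches. First, the input mapping. This assumes a frozen vision encoder (e.g. a CLIP ViT) emitting a 768-d image feature and an LLM with 4096-d token embeddings; the class name, the dimensions, and the number of visual tokens k are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ImageToTokenMapper(nn.Module):
    """Projects a frozen vision encoder's feature into k vectors that live
    in the LLM's token embedding space, so they can be interleaved with
    ordinary text token embeddings. (Hypothetical sizes throughout.)"""
    def __init__(self, vis_dim=768, llm_dim=4096, k=4):
        super().__init__()
        self.k, self.llm_dim = k, llm_dim
        self.proj = nn.Linear(vis_dim, k * llm_dim)   # the only trained part

    def forward(self, img_feat):                      # (batch, vis_dim)
        out = self.proj(img_feat)                     # (batch, k * llm_dim)
        return out.view(-1, self.k, self.llm_dim)     # (batch, k, llm_dim)

mapper = ImageToTokenMapper()
img_feat = torch.randn(2, 768)        # stand-in for CLIP image features
visual_tokens = mapper(img_feat)
print(visual_tokens.shape)            # torch.Size([2, 4, 4096])
```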
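Second, the retrieve-or-generate decision. A small head scores the two options from the model's output embedding, and the retrieval branch is a cosine-similarity search. The two-way linear head, the dimensions, and the random "image bank" are stand-ins for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

llm_dim = 4096
decide = nn.Linear(llm_dim, 2)        # logits for [retrieve, generate]

h = torch.randn(1, llm_dim)           # stand-in for an LLM output embedding
choice = decide(h).argmax(dim=-1).item()

if choice == 0:
    # Retrieval branch: cosine similarity against a pre-embedded image bank.
    image_bank = F.normalize(torch.randn(10_000, llm_dim), dim=-1)
    scores = F.normalize(h, dim=-1) @ image_bank.T    # (1, 10_000)
    best = scores.argmax(dim=-1)
    print("retrieve image at index", best.item())
else:
    print("hand off to the generation mapper (next sketch)")
```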
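Third, a rough stand-in for GILLMapper itself: a small transformer that maps the LLM's hidden states at the [IMG] token positions into the conditioning space Stable Diffusion expects (the 77 x 768 shape of its CLIP text-encoder output). The layer sizes and the number of [IMG] tokens here are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class GILLMapperSketch(nn.Module):
    """Maps LLM hidden states at the [IMG] positions to a 77 x 768
    conditioning tensor for a Stable-Diffusion-style generator.
    (Rough sketch; sizes and depth are assumptions.)"""
    def __init__(self, llm_dim=4096, cond_len=77, cond_dim=768):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(cond_len, cond_dim))
        self.in_proj = nn.Linear(llm_dim, cond_dim)
        layer = nn.TransformerDecoderLayer(d_model=cond_dim, nhead=8,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)

    def forward(self, img_hidden):                    # (batch, r, llm_dim)
        mem = self.in_proj(img_hidden)                # (batch, r, cond_dim)
        q = self.queries.expand(img_hidden.size(0), -1, -1)
        return self.decoder(q, mem)                   # (batch, 77, cond_dim)

mapper = GILLMapperSketch()
img_hidden = torch.randn(1, 8, 4096)  # hidden states at 8 [IMG] positions
cond = mapper(img_hidden)
print(cond.shape)                     # torch.Size([1, 77, 768]); feed to SD
```

In the paper, a mapper of this kind is trained so that its output matches the frozen text encoder's embedding of the ground-truth caption, which is what makes the mapped region "semantically meaningful" to the image generator.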

This process shows that, starting from off-the-shelf LLM and image-generation models, a mapping can be learned that lets a "conversation" happen between two models with different embedding spaces. Isn't that something!

Pawan Dhail

Talent & Tech Capability Development

1y

This section of Gen-AI is so exciting to explore and try implementing yourself. It's like talking pictures; or, going the other way, making your words portray your thoughts as you speak.

Sriram S

AI Strategy | Advisory | Google PMLE | Star Performer | Learning Catalyst | Data Science Mentor

1y

Lovely work
