登录查看更多内容

Exploring GPT Visual Capabilities and Multimodal AI: A Journey to Solving Wordle

Jennifer Marsman

发布日期: 2024年12月17日

AI investments are forecasted to approach $200 billion globally by 2025 which offers great potential across industries to strike while the iron is hot. With 75% of global knowledge workers using AI, it's clear that leveraging these models and making AI education more accessible is vital for teams and leaders looking to stay ahead. (You can find some interesting scenarios on my Github.)

Multimodality in AI is a game-changer. Combining text, visuals, and interactive data is becoming more prominent in discussions about advanced AI. In this article, I delve into how GPT's visual capabilities function and their potential applications in real-world scenarios, using the popular game Wordle as a fun and practical example. You can view my full video series below, where I dive deep into my journey exploring GPT's visual capabilities, along with demos, code, and tips to optimize your prompt engineering.

What Are GPT Visual Capabilities and Their Real-World Applications??

My journey started with a simple question: How far can multimodal AI go in analyzing and processing complex visual information? I tested its limitations and explored various examples, from identifying fashion to assessing a skier’s form and interpreting circuit diagrams. Through these examples, we learn how AI can support people, such as the blind and low-vision community, through tools like Seeing AI.?

I was attracted to the Wordle scenario because it seemed significantly challenging. To solve a puzzle, we need to combine both the logic (of the rules of the game) with interpreting a lot of visual information. This is complex image understanding because the model needs to produce essentially 3 distinct task results: OCR-like identification of the distinct characters/letters in the puzzle, the position of the letter within the word, and the color behind each letter after a guess. Historically, positional information has been difficult for LLMs, and all of this is made more complicated by needing to perform localized for each letter. I wasn't sure if it would be possible, but I love Wordle and I wanted to try!

The Differences Between Zero-Shot, One-Shot, and Few-Shot Prompt Engineering?

Next, let's think about the way we provide language models with information and how to better speak the language of AI. Prompt engineering is key to maximizing the effectiveness of language models. When a zero-shot prompt to initiate a response without prior examples doesn't perform well, a great first step is to try a few-shot approach with structured prompts that guide the model in how to respond.?I also share a research paper by some of my coworkers in Microsoft Research that explores GPT's visual capabilities.

Tokenization: How Do Models Represent Language??

Simple few-shot examples were not sufficient to solve a problem of Wordle's complexity, and identify the character, position, and color for each letter. So, I reflected on how the model may be "seeing" those characters - models use not characters or words, but tokens as their units to represent language. Let's take a sidebar to discuss tokens and the tokenization process of your prompts.??

领英推荐

AutoML-GPT; Causal Reasoning and LLMs; MetaGPT; Free…

Danny Butvinik 1 年前

OpenAI's AI Model Aims for "Ph.D.-Level" Intelligence

Innovation Incubator Advisory 8 个月前

GenAI Weekly — Edition 31

Shuveb Hussain 6 个月前

Tokenization is an essential aspect of how models interpret input. By mapping words to tokens and understanding how words are broken down, we will try to guide the model to respond more accurately. This technique was particularly effective in enhancing model comprehension of extracting characters from my Wordle board.?

Understanding How AI Recognizes Color in Multimodal Settings?

To comprehend a Wordle puzzle, an image understanding model would need to extract the characters, their relative positions, and their background colors on a Wordle board.?Now that we were able to extract the characters consistently correctly, we will focus on positions and color. I shared some insights around chaining and classification that could potentially help. Then, much like my thinking process for characters and how the model represents language, I had to understand how the model identifies and understands color.? We saw in the first post's video that this model can identify specific hues in images, in the bridesmaid dress example. But solving Wordle requires very precise localized color classification within the image to identify the color behind each letter. ?

Using Multimodality to Solve Wordle: The Final Test?

Now for the fun part! Let’s put our code to the test.??

See the AI model in action, where I highlight the culmination of everything we've learned in prompt engineering to extract characters, positions, and colors for each letter. By leveraging the power of multimodal AI, the model processes and evaluates visual data to solve the Wordle puzzle!?

What’s next:?

AI-related investment could peak at 2.5% to 4% of GDP in the U.S. The future is AI. Now I may have shown you a fun Wordle example, but I see us using GPT's visual capabilities to solve real-world problems industry-wide. Research specific use cases in your industry and experiment with available models to uncover insights.?
AI education will see a significant rise over the next year, and honing prompt engineering skills will set teams up for success. For me, the best way to learn is by doing...you can use free resources like Copilot to test the boundaries of what's possible and deepen your understanding. My GitHub also has excellent resources.?
The journey doesn’t stop at solving Wordle. It’s about empowering teams and individuals to embrace AI tools confidently and responsibly. Together, we’re only scratching the surface of what’s possible.?

Thank you for following along on this journey! All my research, code, and development work are now available for you to explore and build upon in my WordleGPT GitHub repo.

Chinwendu Iwuorie ??

AI/ML Practitioner | Customer Centric | Project Management

2 个月

Oh wow this is really cool!

1 次回应

Virginie Pontruché

Strategic Customer Executive Microsoft; SheSays & Cloud Seeders board member

3 个月

Oh yay, I remember when AZURE came up on the Wordle but didn’t capture my screenshot. Such a fun experiment!

1 次回应

Lynn Langit

Cloud Architect | Linked In Learning Instructor

3 个月

thanks for writing this up, so informative to follow your journey here...

1 次回应

查看更多评论

要查看或添加评论，请登录

Jennifer Marsman的更多文章

Inside the New Microsoft, Where Lie Detection is a Killer App

2016年2月25日

Inside the New Microsoft, Where Lie Detection is a Killer App

I'm very honored to represent Microsoft's innovation in machine learning in this Bloomberg article.

2 条评论

Exploring GPT Visual Capabilities and Multimodal AI: A Journey to Solving Wordle

Jennifer Marsman

领英推荐

Jennifer Marsman的更多文章

社区洞察

其他会员也浏览了

Mitigating AI Hallucinations: Best Practices for Reliable AI Systems

Custom AI Solutions: Tailoring Transformer Model Development Services to Your Business Needs

Going Beyond Prompts: Advanced Model Customization using Fine-tuning, Embedding and Function Calling.

Real-World Examples of AI Products in Action- From Start to Finish

How LoRA Streamlines AI Fine-Tuning

Open-source or closed-source AI

What Do You Need to Know About Llama 3.3 and How It Differs From Older AI Models?

Number of exciting new developments in generative AI

?? Understanding the Landscape of AI: RAG vs. KAG

GPT-5 Unleashed Soon: New Era for Agentic-AI

领英推荐

Jennifer Marsman的更多文章

Inside the New Microsoft, Where Lie Detection is a Killer App

社区洞察

其他会员也浏览了

Mitigating AI Hallucinations: Best Practices for Reliable AI Systems

Custom AI Solutions: Tailoring Transformer Model Development Services to Your Business Needs

Going Beyond Prompts: Advanced Model Customization using Fine-tuning, Embedding and Function Calling.

Real-World Examples of AI Products in Action- From Start to Finish

How LoRA Streamlines AI Fine-Tuning

Open-source or closed-source AI

What Do You Need to Know About Llama 3.3 and How It Differs From Older AI Models?

Number of exciting new developments in generative AI

?? Understanding the Landscape of AI: RAG vs. KAG

GPT-5 Unleashed Soon: New Era for Agentic-AI