Exploring GPT Visual Capabilities and Multimodal AI: A Journey to Solving Wordle

Exploring GPT Visual Capabilities and Multimodal AI: A Journey to Solving Wordle

AI investments are forecasted to approach $200 billion globally by 2025 which offers great potential across industries to strike while the iron is hot. With 75% of global knowledge workers using AI, it's clear that leveraging these models and making AI education more accessible is vital for teams and leaders looking to stay ahead. (You can find some interesting scenarios on my Github.)

Multimodality in AI is a game-changer. Combining text, visuals, and interactive data is becoming more prominent in discussions about advanced AI. In this article, I delve into how GPT's visual capabilities function and their potential applications in real-world scenarios, using the popular game Wordle as a fun and practical example. You can view my full video series below, where I dive deep into my journey exploring GPT's visual capabilities, along with demos, code, and tips to optimize your prompt engineering.


What Are GPT Visual Capabilities and Their Real-World Applications??

My journey started with a simple question: How far can multimodal AI go in analyzing and processing complex visual information? I tested its limitations and explored various examples, from identifying fashion to assessing a skier’s form and interpreting circuit diagrams. Through these examples, we learn how AI can support people, such as the blind and low-vision community, through tools like Seeing AI.?

I was attracted to the Wordle scenario because it seemed significantly challenging. To solve a puzzle, we need to combine both the logic (of the rules of the game) with interpreting a lot of visual information. This is complex image understanding because the model needs to produce essentially 3 distinct task results: OCR-like identification of the distinct characters/letters in the puzzle, the position of the letter within the word, and the color behind each letter after a guess. Historically, positional information has been difficult for LLMs, and all of this is made more complicated by needing to perform localized for each letter. I wasn't sure if it would be possible, but I love Wordle and I wanted to try!


The Differences Between Zero-Shot, One-Shot, and Few-Shot Prompt Engineering?

Next, let's think about the way we provide language models with information and how to better speak the language of AI. Prompt engineering is key to maximizing the effectiveness of language models. When a zero-shot prompt to initiate a response without prior examples doesn't perform well, a great first step is to try a few-shot approach with structured prompts that guide the model in how to respond.?I also share a research paper by some of my coworkers in Microsoft Research that explores GPT's visual capabilities.


Tokenization: How Do Models Represent Language??

Simple few-shot examples were not sufficient to solve a problem of Wordle's complexity, and identify the character, position, and color for each letter. So, I reflected on how the model may be "seeing" those characters - models use not characters or words, but tokens as their units to represent language. Let's take a sidebar to discuss tokens and the tokenization process of your prompts.??

Tokenization is an essential aspect of how models interpret input. By mapping words to tokens and understanding how words are broken down, we will try to guide the model to respond more accurately. This technique was particularly effective in enhancing model comprehension of extracting characters from my Wordle board.?


Understanding How AI Recognizes Color in Multimodal Settings?

To comprehend a Wordle puzzle, an image understanding model would need to extract the characters, their relative positions, and their background colors on a Wordle board.?Now that we were able to extract the characters consistently correctly, we will focus on positions and color. I shared some insights around chaining and classification that could potentially help. Then, much like my thinking process for characters and how the model represents language, I had to understand how the model identifies and understands color.? We saw in the first post's video that this model can identify specific hues in images, in the bridesmaid dress example. But solving Wordle requires very precise localized color classification within the image to identify the color behind each letter. ?


Using Multimodality to Solve Wordle: The Final Test?

Now for the fun part! Let’s put our code to the test.??

See the AI model in action, where I highlight the culmination of everything we've learned in prompt engineering to extract characters, positions, and colors for each letter. By leveraging the power of multimodal AI, the model processes and evaluates visual data to solve the Wordle puzzle!?


What’s next:?

  • AI-related investment could peak at 2.5% to 4% of GDP in the U.S. The future is AI. Now I may have shown you a fun Wordle example, but I see us using GPT's visual capabilities to solve real-world problems industry-wide. Research specific use cases in your industry and experiment with available models to uncover insights.?
  • AI education will see a significant rise over the next year, and honing prompt engineering skills will set teams up for success. For me, the best way to learn is by doing...you can use free resources like Copilot to test the boundaries of what's possible and deepen your understanding. My GitHub also has excellent resources.?
  • The journey doesn’t stop at solving Wordle. It’s about empowering teams and individuals to embrace AI tools confidently and responsibly. Together, we’re only scratching the surface of what’s possible.?

?

Thank you for following along on this journey! All my research, code, and development work are now available for you to explore and build upon in my WordleGPT GitHub repo.

Chinwendu Iwuorie ??

AI/ML Practitioner | Customer Centric | Project Management

2 个月

Oh wow this is really cool!

Virginie Pontruché

Strategic Customer Executive Microsoft; SheSays & Cloud Seeders board member

3 个月

Oh yay, I remember when AZURE came up on the Wordle but didn’t capture my screenshot. Such a fun experiment!

Lynn Langit

Cloud Architect | Linked In Learning Instructor

3 个月

thanks for writing this up, so informative to follow your journey here...

要查看或添加评论,请登录

Jennifer Marsman的更多文章

社区洞察

其他会员也浏览了