An LLM With A Visual Sketchpad Can Now Smash Its Competitors Without One (Even GPT-4o)
Ashish Mradul Bamania
I Help You To Level Up With AI | Tech & AI Writer With 1M+ views | Software Engineer | Emergency Physician
Humans have used sketching as a tool to formulate ideas, communicate them, and solve problems for ages.
Think about all the cave paintings whose meaning we can still make out today.
Or the first images you created as a child, haphazardly drawing with multiple crayons on a blank canvas when you did not yet know how to speak.
Sketching somehow preserves and propagates knowledge like text never can.
This insight stuck with the researchers behind a recent pre-print on arXiv.
They introduced a framework called Sketchpad, which gives multi-modal LLMs a visual sketchpad and the tools to draw on it.
The framework allows these LLMs to draw intermediary sketches to boost their reasoning ability when prompted.
And yes, it works wonders!
Sketchpad significantly enhances task performance compared to strong baseline LLMs that do not use sketching, resulting in an average improvement of 12.7% on math tasks and 8.6% on vision tasks.
Notably, when Sketchpad is used with GPT-4o, it sets a new state-of-the-art performance on all tasks, including V*Bench (80.3%), BLINK spatial reasoning (83.9%) and visual correspondence (80.8%) benchmarks.
Human evaluations also show high agreement with Sketchpad-enabled GPT-4o’s plans, with 80% matching on geometry tasks and a 92.8% validity rating on vision tasks!
Here is a story in which we deep-dive into how the Sketchpad framework works, how it unlocks new insights into the inner workings of LLMs, and how it supercharges the performance of state-of-the-art LLMs like never before.
But First, Why Do LLMs Struggle With Mathematical & Visual Tasks?
Many LLMs perform well on tasks that can be solved from linguistic context alone, but they inherently lack an understanding of mathematics and visuospatial data.
Mathematical tasks often require step-by-step reasoning, the ability to handle abstract concepts, and meticulous logic application.
For visual tasks, the models need to be able to recognize complex multi-dimensional objects and relate them spatially.
Often, the training data lacks such examples, or the model’s architecture is not well suited to capturing these patterns.
Researchers have previously tried to address these problems that LLMs face in mathematical tasks with better prompting techniques. One such technique is Chain-of-Thought Prompting.
Let’s talk about it.
What Is Chain-of-Thought Prompting?
Published on arXiv in 2022, Chain-of-Thought (CoT) is a prompting technique that allows LLMs to decompose a complicated reasoning task into small intermediate sub-problems.
Each of these sub-problems is tackled before the LLM gives the final answer.
Chain-of-thought prompting is conceptually similar to the Divide-and-Conquer algorithmic technique in that both methods break down complex tasks into simpler components and work on them before arriving at the final solution.
Both of these processes are similar to how the human thought process works when solving a complex problem.
CoT prompting involves crafting a prompt to guide an LLM through a series of intermediate reasoning steps before arriving at the solution.
This contrasts with the standard prompting approach, where reasoning steps are not explicitly included in the prompt.
CoT prompting has been shown to significantly improve an LLM’s performance on complex reasoning tasks, such as arithmetic, commonsense, and symbolic reasoning.
(Note that CoT described here is Few-shot CoT prompting.)
This approach is improved with Zero-shot Chain-of-Thought prompting, which simply involves adding “Let’s think step by step” to the original prompt.
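To make the difference concrete, here is a minimal sketch of standard prompting versus zero-shot CoT prompting. It assumes an OpenAI-style Python client; the model name and the example question are placeholders, not from the paper.

```python
# Minimal sketch: standard prompting vs. zero-shot Chain-of-Thought prompting.
# Assumes an OpenAI-style client; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()

question = "A shop sells pens at 3 for $2. How much do 12 pens cost?"

# Standard prompting: ask for the answer directly.
standard_prompt = question

# Zero-shot CoT: append the trigger phrase so the model reasons step by step.
cot_prompt = question + "\nLet's think step by step."

for prompt in (standard_prompt, cot_prompt):
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)
```

With the CoT version, the model typically spells out the unit price first (3 pens for $2) before multiplying, instead of jumping straight to a final number.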
Later in 2022, more work was done to devise another approach called Automatic Chain-of-Thought Prompting.
This approach automatically constructs demonstrations for Chain-of-Thought prompting, rather than writing them by hand as in earlier approaches, by combining diversity-based clustering with Zero-shot (“Let’s think step by step”) prompts.
But What About Improving Performance On Visual Tasks?
Similar to CoT prompting, researchers have previously explored decomposing complex vision tasks into smaller and simpler sub-steps that can be solved using specialized vision tools.
Such approaches use an LLM to generate Python code that invokes the required vision tools to solve each sub-problem, roughly like the sketch below.
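Here is an illustrative, hedged example of the kind of program such systems generate. The vision "tools" are trivial stubs standing in for real models (an object detector and a visual question answering model); they are not from any specific library.

```python
# Illustrative sketch of the kind of Python program such systems generate.
# The vision "tools" below are trivial stubs standing in for real models;
# they are not from any real library.
from typing import List, Tuple

def detect_objects(image, label: str) -> List[Tuple[int, int, int, int]]:
    """Stub detector: would return bounding boxes (x1, y1, x2, y2)."""
    return [(10, 20, 110, 140)]

def crop(image, box: Tuple[int, int, int, int]):
    """Stub crop: would return the image patch inside the box."""
    return image

def vqa(image_patch, question: str) -> str:
    """Stub visual question answering model."""
    return "blue"

# The generated program follows a fixed plan: detect -> crop -> ask.
def answer_query(image) -> str:
    boxes = detect_objects(image, "mug")
    if not boxes:
        return "No mug found."
    patch = crop(image, boxes[0])
    return vqa(patch, "What color is this mug?")

print(answer_query(image=None))  # -> "blue"
```

Notice that the plan (detect, then crop, then ask) is fixed up front, which is exactly the limitation discussed next.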
Although quite admirable, these approaches do not completely address the problem.
They follow a pre-defined plan and do not adapt it according to the intermediate visual cues produced, which frequently causes them to give incorrect results.
Researchers have thus combined and improved upon these ideas to create the Sketchpad framework.
Let’s talk about it next.
Here Comes “Sketchpad”
Borrowing insights from previous research work, the Sketchpad framework enables multi-modal LLMs to draw sketches.
These sketches act as intermediate reasoning steps that the models use while answering a query.
Think of it like Chain-of-Thought prompting but with intermediate visual reasoning steps, or call it “Visual Chain-of-Thought prompting”.
The framework can be used on any multi-modal LLM out of the box and requires no fine-tuning of the baseline model.
It is built upon the open-source AutoGen framework that allows developers to build LLM applications via multiple agents that can converse and coordinate with each other to accomplish tasks.
This is how it works with an LLM interactively: the model first writes down a thought, then an action in the form of code that draws or manipulates a sketch (for example, with matplotlib or a vision tool), and then observes the resulting image before planning its next step.
This thought–action–observation interaction continues until the LLM determines that it has enough information in its context to answer the given query.
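Conceptually, the loop looks something like the sketch below. The two callables are hypothetical placeholders, not the paper's actual AutoGen agents; a real implementation would wire in a multi-modal LLM and a Python execution sandbox.

```python
# Illustrative sketch of Sketchpad's thought-action-observation loop.
# query_llm and run_code are hypothetical placeholders, not the paper's
# actual AutoGen implementation.
from typing import Callable, Optional, Tuple

def sketchpad_loop(
    task: str,
    query_llm: Callable[[list], Tuple[str, Optional[str]]],
    run_code: Callable[[str], object],
    max_turns: int = 5,
) -> str:
    context: list = [task]
    for _ in range(max_turns):
        # Thought + Action: the model reasons, then emits code that draws
        # or edits a sketch (e.g. with matplotlib or networkx).
        thought, action_code = query_llm(context)
        context.append(thought)
        if action_code is None:
            break  # the model decided its context is already sufficient
        # Observation: execute the code and feed the resulting image back.
        context.append(run_code(action_code))
    # Final answer once the model has gathered enough visual evidence.
    final_answer, _ = query_llm(context + ["Give the final answer."])
    return final_answer
```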
Performance On Mathematical Problem-Solving Tasks
Sketchpad-integrated LLMs are evaluated on different mathematical tasks, and the results are shown below.
Geometry Problems
Problems from the Geometry3K dataset are used for this evaluation.
A problem example is shown below.
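The paper’s figures are not reproduced here, but to give a flavor of the sketching step, here is a rough illustration (my own, not the paper’s code) of how an action might redraw a figure and add an auxiliary line with matplotlib; the triangle coordinates are hypothetical.

```python
# Illustrative sketch (not from the paper): for geometry problems the model
# can emit matplotlib code that redraws the figure and adds an auxiliary
# line, then reason over the updated image.
import matplotlib.pyplot as plt

# Hypothetical triangle vertices taken from a problem description.
A, B, C = (0, 0), (4, 0), (1, 3)

fig, ax = plt.subplots(figsize=(4, 4))
xs, ys = zip(A, B, C, A)
ax.plot(xs, ys, color="black")                        # original triangle
ax.plot([C[0], C[0]], [C[1], 0], "r--", label="auxiliary altitude")
ax.legend()
ax.set_aspect("equal")
plt.savefig("geometry_sketch.png")                    # next observation
plt.close(fig)
```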
Mathematical Function Solving
Problems from the IsoBench datasets are used for this evaluation.
The following prompt is given to the model for this task.
The intermediate thought, action and observation steps are shown below.
2. Identifying Convexity/Concavity: To determine whether a function is convex or concave
The following prompt is given to the model for this task.
The intermediate steps for this task are not shown in the original research paper.
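Still, to give a sense of the kind of action involved, here is a minimal sketch (my own illustration, not the paper’s code) of plotting a function so its curvature can be judged visually from the resulting image.

```python
# Illustrative sketch (not the paper's code): to reason about convexity,
# the model can emit matplotlib code that plots the function and then
# inspect the saved image in its next observation step.
import numpy as np
import matplotlib.pyplot as plt

def plot_function(f, lo=-5.0, hi=5.0, path="function_sketch.png"):
    x = np.linspace(lo, hi, 400)
    plt.figure(figsize=(4, 3))
    plt.plot(x, f(x))
    plt.axhline(0, color="gray", linewidth=0.5)
    plt.axvline(0, color="gray", linewidth=0.5)
    plt.title("f(x)")
    plt.savefig(path)   # the saved image becomes the model's observation
    plt.close()

# Example: x**2 + 1 bends upward everywhere, so the plot reveals convexity.
plot_function(lambda x: x**2 + 1)
```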
Graph Problem Solving
Problems from the IsoBench datasets are used for this evaluation.
The following prompt is given to the model for this task.
The intermediate thought, action and observation steps are shown below.
2. Graph Maximum Flow: To determine the maximum flow that can be sent through a network from a source to a sink vertex, considering the capacity constraints on the edges
The following prompt is given to the model for this task.
3. Graph Isomorphism Task: To figure out if two graphs are structurally equivalent
The prompt given to the model for this task is shown below.
The original research paper does not show the intermediate steps to both of the above tasks.
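Still, to give a flavor of the sketching action for graph tasks, here is a minimal illustration (my own, not the paper’s code) of rendering a graph from an adjacency matrix with networkx so the model can inspect its structure visually; the matrix itself is a made-up example.

```python
# Illustrative sketch (not from the paper): the model can emit networkx
# code that renders the graph described in the prompt, then reason over
# the rendered image in its next step.
import networkx as nx
import matplotlib.pyplot as plt

adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],   # vertex 3 is isolated, which the drawing makes obvious
]

G = nx.Graph()
G.add_nodes_from(range(len(adjacency)))
for i, row in enumerate(adjacency):
    for j, connected in enumerate(row):
        if connected and i < j:
            G.add_edge(i, j)

nx.draw(G, with_labels=True, node_color="lightblue")
plt.savefig("graph_sketch.png")  # the saved image becomes the observation
plt.close()
```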
Game Strategy Formulation & Analysis
Problems from the IsoBench datasets are used for this evaluation, where the goal is to predict the outcome of a chess game.
The following prompt is given to the LLM, which uses Python’s chess library to draw chess boards based on the Forsyth–Edwards Notation.
Again, the original research paper does not show the intermediate steps for this task.
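For a rough idea of what such an action could look like, here is a minimal sketch (my own illustration, not the paper’s code) that renders a board from Forsyth–Edwards Notation with the chess library; the FEN string is just the standard starting position.

```python
# Illustrative sketch (not the paper's code): render a chess position from
# Forsyth-Edwards Notation so the model can inspect the board visually.
import chess
import chess.svg

# Example FEN: the standard starting position.
fen = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
board = chess.Board(fen)

# chess.svg.board() returns an SVG string, which can be rasterized and
# fed back to the multi-modal LLM as its next observation.
svg = chess.svg.board(board, size=400)
with open("board_sketch.svg", "w") as f:
    f.write(svg)
```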
Results For Mathematical Problem-Solving Tasks
Sketchpad leads to large performance gains for GPT-4 models across almost all tasks, allowing them to outperform all other baseline models.
Performance On Computer Vision Tasks
Sketchpad-integrated LLMs are evaluated on different complex visual reasoning tasks based on the V*Bench, BLINK and MMVP benchmarks.
A few examples of these tasks are shown below.
LLMs are prompted to use different specialist vision tools to sketch and manipulate the given images to solve these tasks, as displayed below.
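As one concrete, hedged illustration (my own, not the paper’s code), an action might overlay a detection tool’s bounding box on the image and zoom into that region; the blank image and box coordinates below are placeholders standing in for a real query image and detector output.

```python
# Illustrative sketch (not from the paper): draw a bounding box returned by
# a detection tool onto the image, then crop to that region for a closer look.
# The image and coordinates here are placeholders for real inputs/outputs.
from PIL import Image, ImageDraw

image = Image.new("RGB", (640, 480), "white")   # stand-in for the query image
box = (220, 140, 360, 300)                      # hypothetical detector output

annotated = image.copy()
ImageDraw.Draw(annotated).rectangle(box, outline="red", width=3)
annotated.save("annotated_sketch.png")          # observation for the LLM

# Zoom in on the detected region so fine details become visible.
image.crop(box).save("zoomed_sketch.png")
```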
Results For Visual Reasoning Tasks
Sketchpad enhances the performance of GPT-4 Turbo and GPT-4o, which outshine the other baseline models and reach new state-of-the-art performance on all tasks.
Cost Of Running Sketchpad
Sketchpad’s per-sample cost with GPT-4o ranges from $0.011 to $0.133.
Costs are higher for visual tasks than for mathematical tasks due to their greater token usage.
Although Sketchpad increases the computational resources required to answer queries, its results are mind-blowing, and this research could be a significant step towards more human-like multi-modal intelligence in LLMs.
What are your thoughts about it? Let me know in the comments below.
Further Reading