Pure Vision Based GUI Agent: OmniParser V2 (aka: cursor control)

For years, researchers have been working to create AI agents that can effectively navigate and interact with user interfaces. While large vision-language models such as GPT-4V have shown tremendous potential in this space, they have been significantly limited by their inability to reliably identify interactive elements and to accurately ground actions to specific regions of the screen.

This is where OmniParser comes in - a breakthrough approach that dramatically enhances the ability of vision models to operate across diverse interfaces without requiring access to underlying HTML or other structural information.

The Problem: Why GUI Agents Struggle

Current GUI agents face two critical challenges:

  1. Element Identification: They struggle to reliably detect interactive elements (buttons, icons, links) across different platforms and applications
  2. Action Grounding: Even when they can identify what action to take, they often can't precisely locate where on the screen to perform that action

Previous solutions like Set-of-Mark prompting have made progress by overlaying bounding boxes on screenshots, but most implementations still require access to HTML or view hierarchies - limiting their use to specific platforms like web browsers or certain mobile apps.

OmniParser: A Vision-Only Approach

Microsoft Research has introduced OmniParser, a comprehensive vision-based screen parsing method that significantly enhances the ability of models like GPT-4V to understand user interfaces and generate accurately grounded actions.


Example of screen parsing.

Key Components

OmniParser integrates several specialized models:

  1. Interactable Region Detection Model: Trained on 67,000+ screenshots with labeled interactable regions from popular websites
  2. Icon Description Model: A BLIP-2 model fine-tuned on 7,000+ icon-description pairs to extract functional semantics from visual elements
  3. OCR Module: Detects and extracts text elements from the interface

By combining these elements, OmniParser provides a structured, DOM-like representation of any UI without needing access to the underlying code.
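
To make this concrete, here is a minimal sketch of how the three stages could be wired together in Python. It assumes a YOLO-style detector checkpoint for interactable regions, an off-the-shelf BLIP-2 captioner, and EasyOCR for text; the weights file name, model choices, and confidence threshold are illustrative stand-ins rather than the exact components OmniParser ships.

```python
# Minimal sketch: detection + icon captioning + OCR merged into one element list.
# "interactable_regions.pt" is a hypothetical checkpoint; BLIP-2 and EasyOCR stand
# in for OmniParser's fine-tuned captioner and OCR module.
import easyocr
from PIL import Image
from ultralytics import YOLO
from transformers import Blip2Processor, Blip2ForConditionalGeneration

def parse_screenshot(path: str) -> list[dict]:
    image = Image.open(path).convert("RGB")
    elements = []

    # 1. Detect candidate interactable regions.
    detector = YOLO("interactable_regions.pt")
    boxes = detector(path)[0].boxes.xyxy.tolist()  # [[x1, y1, x2, y2], ...]

    # 2. Generate a short functional description for each cropped element.
    processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
    captioner = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")
    for x1, y1, x2, y2 in boxes:
        crop = image.crop((int(x1), int(y1), int(x2), int(y2)))
        inputs = processor(images=crop, return_tensors="pt")
        out = captioner.generate(**inputs, max_new_tokens=20)
        caption = processor.batch_decode(out, skip_special_tokens=True)[0].strip()
        elements.append({"id": len(elements), "bbox": [x1, y1, x2, y2],
                         "description": caption})

    # 3. Add OCR text regions as additional elements.
    reader = easyocr.Reader(["en"])
    for points, text, confidence in reader.readtext(path):
        if confidence < 0.5:
            continue
        xs, ys = [p[0] for p in points], [p[1] for p in points]
        elements.append({"id": len(elements),
                         "bbox": [min(xs), min(ys), max(xs), max(ys)],
                         "description": f"text: '{text}'"})
    return elements
```

The output is a flat list of numbered elements, each with a bounding box and a short functional description, which is the structured representation the next step overlays on the screenshot.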

How It Works

  1. When presented with a screenshot, OmniParser first identifies all potentially interactive elements
  2. It overlays bounding boxes with unique numeric IDs on these elements
  3. For each element, it generates functional descriptions (e.g., "a settings icon" rather than "a gray gear symbol")
  4. This enhanced representation is provided to GPT-4V alongside the original task request

The result? GPT-4V can now understand both what is on the screen and where specific actions should be performed.
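
The set-of-mark step itself is easy to sketch. Assuming an element list like the one produced above, the snippet below overlays numbered boxes on the screenshot with Pillow and serializes the elements into a short text block that can be sent to GPT-4V alongside the annotated image; the prompt wording and field names are assumptions for illustration, not OmniParser's exact format.

```python
# Minimal sketch: set-of-mark overlay plus a text rendering of the parsed elements.
from PIL import Image, ImageDraw

def annotate_and_prompt(screenshot_path: str, elements: list[dict], task: str):
    image = Image.open(screenshot_path).convert("RGB")
    draw = ImageDraw.Draw(image)

    lines = []
    for el in elements:
        x1, y1, x2, y2 = el["bbox"]
        # Numbered bounding box the model can refer back to by ID.
        draw.rectangle([x1, y1, x2, y2], outline="red", width=2)
        draw.text((x1 + 2, y1 + 2), str(el["id"]), fill="red")
        lines.append(f'[{el["id"]}] {el["description"]}')

    prompt = (
        f"Task: {task}\n"
        "The screenshot is annotated with numbered interactive elements:\n"
        + "\n".join(lines)
        + "\nReply with the ID of the element to act on and the action to perform."
    )
    image.save("annotated.png")
    return "annotated.png", prompt
```

Because the model only has to name an element ID instead of raw pixel coordinates, the grounding problem collapses into a multiple-choice question.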

Impressive Results

OmniParser doesn't just incrementally improve GUI agent performance - it dramatically transforms it. Testing on three major benchmarks shows:

ScreenSpot Benchmark:

  • GPT-4V baseline: 16.2% accuracy
  • OmniParser: 73.0% accuracy (+56.8 percentage points)
  • Outperformed specialized models like SeeClick, CogAgent, and Fuyu

Mind2Web Benchmark:

  • OmniParser outperformed GPT-4V baselines that rely on HTML information
  • Achieved 42.0% success rate on cross-domain tasks (compared to 36.8% for the previous best approach)

AITW Benchmark (mobile):

  • Improved the overall score from 53.0% to 57.7% over the best GPT-4V baseline
  • Successfully generalized from web detection to mobile interfaces

Why This Matters

OmniParser represents a fundamental shift in how AI can interact with user interfaces:

  1. Cross-Platform Capability: Works across different operating systems (Windows, macOS, iOS, Android) and applications
  2. HTML-Free Operation: Doesn't require access to the underlying code or DOM structure
  3. Enhanced Understanding: Provides richer semantic information about interface elements
  4. Better Grounding: Significantly improves action precision by providing clear reference points

This approach opens the door for truly generalized GUI agents that can operate across virtually any interface they encounter.
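
Closing the loop from model output back to the cursor (the "cursor control" of the title) is simple in principle: map the element ID the model chose back to its bounding box and click its center. The sketch below assumes the reply contains a line like "element_id: 7" and that the screenshot was captured at native screen resolution; the reply format and the pyautogui-based executor are illustrative assumptions, not part of OmniParser itself.

```python
# Minimal sketch: turn a model-chosen element ID into an actual mouse click.
# Assumes the model reply contains a line such as "element_id: 7" and that the
# screenshot coordinates match the live screen resolution.
import re

import pyautogui

def execute_click(model_reply: str, elements: list[dict]) -> None:
    match = re.search(r"element_id:\s*(\d+)", model_reply)
    if match is None:
        raise ValueError("No element ID found in the model reply")
    element = next(el for el in elements if el["id"] == int(match.group(1)))

    # Ground the action at the center of the chosen bounding box.
    x1, y1, x2, y2 = element["bbox"]
    pyautogui.click((x1 + x2) / 2, (y1 + y2) / 2)
```

The same ID-to-box lookup works for typing, scrolling, or dragging; only the final pyautogui call changes.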

The Future of GUI Agents

OmniParser demonstrates that the capabilities of vision-language models like GPT-4V have been substantially underestimated due to limitations in parsing techniques. With robust screen parsing, these models can achieve far greater accuracy and usefulness in real-world scenarios.

As the researchers note: "We hope OmniParser can serve as a general and easy-to-use tool that has the capability to parse general user screens across both PC and mobile platforms without any dependency on extra information such as HTML and view hierarchy."

The implications extend beyond just improving existing applications. This approach could enable entirely new categories of AI assistants that can navigate complex interfaces on behalf of users - from helping with technical tasks to making applications more accessible to those with disabilities.



At PREDICTif, we vigilantly monitor emerging GenAI technologies on your behalf. Our team of young and talented data science engineers stands ready to help you unleash the power of this new technological wave.


References:

Yadong Lu, Jianwei Yang, Yelong Shen, and Ahmed Awadallah. "OmniParser for Pure Vision Based GUI Agent." Microsoft Research, 2024. This article is based on that publication.

