The state of AI web Agents

When will AI finally do everything for us online? Or at least manage to perform basic web actions properly? As part of the browser we are building, I've spent a lot of time working with web AI agents (among other things), so for the public benefit, here is a short article on what can currently be done, what can't, recommended tools, and where we are heading.

AI assistants

One of the most talked about yet unrealized directions in AI is that of the personal assistant. An AI tool that knows me, has memory, can perform many tasks for me, and removes all the uninteresting things from my mind so I can focus on what really matters (like thinking about the Roman Empire). The most natural place to start is the web world since that is where most (if not all) of our activity already takes place. So what can really be done today? Let's go from easy to hard.

Search and Retrieval

Using LLMs for basic search is an almost solved problem. Beyond the many existing tools (Perplexity, You, etc.), this can be implemented in code relatively simply. Some good examples can be found here, and here. Basically, what these tools do is search Google via an API, extract snippets from the results, and summarize them (the same mechanism as Bing Chat).
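The search-then-summarize pipeline described above can be sketched in a few lines. This is a minimal illustration, not any particular tool's implementation: the credential names are placeholders, and I assume Google's Programmable Search JSON API as the search backend; the final prompt would go to whatever chat-completion endpoint you use.

```python
import json
import urllib.parse
import urllib.request

# Placeholder credentials for Google's Programmable Search JSON API.
API_KEY = "YOUR_GOOGLE_API_KEY"
CX = "YOUR_SEARCH_ENGINE_ID"

def build_summary_prompt(query: str, snippets: list[str]) -> str:
    """Pack numbered search snippets into a single summarization prompt."""
    sources = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets))
    return (
        "Answer the question using only the numbered sources below.\n"
        f"Question: {query}\nSources:\n{sources}"
    )

def google_search(query: str, api_key: str = API_KEY, cx: str = CX) -> list[str]:
    """Fetch result snippets from Google's Programmable Search JSON API."""
    url = (
        "https://www.googleapis.com/customsearch/v1?"
        + urllib.parse.urlencode({"key": api_key, "cx": cx, "q": query})
    )
    with urllib.request.urlopen(url) as resp:
        items = json.load(resp).get("items", [])
    return [item.get("snippet", "") for item in items]
```

Feeding `build_summary_prompt(query, google_search(query))` to an LLM gives you a basic Bing-Chat-style answer with numbered sources.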

There are also code examples for more advanced scraping that clicks on links, then clicks further links, and summarizes everything at the end. Here is an excellent project example (Israeli). One problem with scraping is that many sites we want to search require login or block bots that don't come in through a real browser (remember ChatGPT's failed Browsing mode? This is one reason they removed it).
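The core of such a link-following scraper is extracting the links on each page so the agent can decide which to follow next, up to some depth limit. A minimal stdlib-only sketch (not the linked project's code):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from every <a href="..."> on a page."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative hrefs against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html: str, base_url: str) -> list[str]:
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

A crawler would fetch each extracted link, ask the LLM which ones look relevant, recurse to a fixed depth, and summarize the accumulated text at the end.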

Actions on sites

Regarding actions on the sites themselves, the tools currently on the market mainly offer textual capabilities (search, explain, rewrite, etc.). Two tools I really like in this area: WebPilot, which lets you prompt any site you are on (I use it a lot for summaries and explanations), and Mano, a sidebar that also lets you prompt sites and search the web. It additionally lets you save templates for future use and perform actions within sites (like answering emails, for example).

The problems with performing complex actions on the web

What about more complex actions? Here we can divide the problem in two: general complex actions (booking flights, price comparisons) and personal complex actions (scheduling meetings, answering emails, summarizing open issues under my name on GitHub). Before we get to solutions, we should preface by saying that one of the biggest difficulties in giving LLMs web access is that, due to the complexity of how sites are built, on most sites it is not possible to perform meaningful actions without opening a browser and physically accessing them.

In addition, many sites are dynamic: they contain pop-up menus, items that can only be acted upon in a certain order, and scripts that run and change the page on the fly. So simply handing the HTML to an LLM and asking it to work with it will not help. Moreover, on many sites the full HTML is far larger than the model's context window.
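To see why the context window is the bottleneck, and what a workaround looks like, here is one common mitigation: strip the HTML down to just the interactive elements before showing it to the model. This is a toy sketch of the general idea, not the compression scheme any specific tool uses; the set of kept tags and attributes is my own assumption.

```python
from html.parser import HTMLParser

# Interactive elements an agent can act on; everything else is layout noise.
KEEP_TAGS = {"a", "button", "input", "select", "textarea", "form"}
KEEP_ATTRS = {"id", "name", "href", "type", "placeholder", "aria-label"}

class DOMCompressor(HTMLParser):
    """Emit one line per interactive element, dropping scripts, styles,
    and layout markup entirely."""

    def __init__(self):
        super().__init__()
        self.out: list[str] = []
        self._skip_depth = 0  # inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in KEEP_TAGS and not self._skip_depth:
            kept = {k: v for k, v in attrs if k in KEEP_ATTRS}
            self.out.append(f"<{tag} {kept}>")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

def compress_html(html: str) -> str:
    c = DOMCompressor()
    c.feed(html)
    return "\n".join(c.out)
```

On a real page this kind of filtering can shrink the input by an order of magnitude, since scripts, styles, and container divs dominate the raw HTML.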

Solution approaches

There are two approaches to solving this problem:

  1. The API approach. For example, the Google Calendar API, or the ChatGPT Plugins. The big advantage here is determinism (function calls) and relatively small token counts. The big disadvantages are limited capabilities and the fact that very few sites develop such integrations. In addition, the results are unfortunately still not good enough (try, for example, checking flight prices through Expedia's GPT plugin; spoiler: there is usually no connection between what's on the site and the plugin's results).
  2. The "browsing" approach, in which the agent uses the site the way a human user does. This splits into two sub-approaches. The first takes the site's HTML, compresses it cleverly, and writes small code snippets with libraries like Selenium or Puppeteer to perform actions in real time. Here is an excellent DeepMind paper demonstrating this approach. Another tool using this sub-approach is MultiOn, which I've written about before. They use a different HTML compression than the paper but, similarly, have started training their own models for the problem. HyperWrite's Personal Assistant also appears (likely) to use the same approach. The big advantage is that, in theory, anything is possible. The disadvantage is that beyond the token limitation, this approach is extremely slow and requires many LLM calls for every simple action. Even if that were solved, the complexity of how sites are built still causes agents to fail most tasks they are given. The second sub-approach takes faking the user to the extreme: a multimodal LLM looks at the rendered page itself and uses physical screen coordinates to click, enter text, scroll, and so on. Currently, no major player seems to be using this approach officially, but there is a decent chance Adept is building a model this way.
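To make the API approach concrete, here is what a function-calling tool definition and dispatcher look like. The schema below follows the JSON-schema style used by OpenAI-compatible function calling, but the tool name, fields, and handler are illustrative stubs, not the real Google Calendar API surface.

```python
import json

# An illustrative calendar "tool" definition the model can choose to call.
CREATE_EVENT_TOOL = {
    "type": "function",
    "function": {
        "name": "create_calendar_event",
        "description": "Create an event on the user's primary calendar.",
        "parameters": {
            "type": "object",
            "properties": {
                "summary": {"type": "string", "description": "Event title"},
                "start": {"type": "string", "description": "RFC 3339 start time"},
                "end": {"type": "string", "description": "RFC 3339 end time"},
            },
            "required": ["summary", "start", "end"],
        },
    },
}

def dispatch(tool_call: dict) -> str:
    """Route a model-produced tool call to a concrete handler.

    The handler is a stub; a real agent would hit the calendar API here.
    """
    handlers = {
        "create_calendar_event": lambda a: f"created event: {a['summary']}",
    }
    args = json.loads(tool_call["arguments"])  # model emits arguments as JSON text
    return handlers[tool_call["name"]](args)
```

The determinism the article mentions comes from this shape: the model only fills in a validated JSON payload, and your own code performs the action.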

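The browsing approach's "small code snippets in real time" usually means parsing the model's chosen action into a structured command and executing it with a browser driver. A minimal sketch, where the one-line action format (`CLICK`, `TYPE`, `GOTO`) is my own assumption, not any tool's actual protocol:

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str           # "click" | "type" | "navigate"
    selector: str = ""  # CSS selector of the target element
    text: str = ""      # text to type, or URL to navigate to

def parse_action(line: str) -> Action:
    """Parse a model-emitted line like 'CLICK #submit' or 'TYPE #q hello'."""
    op, _, rest = line.partition(" ")
    if op == "CLICK":
        return Action("click", selector=rest)
    if op == "TYPE":
        selector, _, text = rest.partition(" ")
        return Action("type", selector=selector, text=text)
    if op == "GOTO":
        return Action("navigate", text=rest)
    raise ValueError(f"unknown action: {line!r}")

def execute(action: Action, driver) -> None:
    """Run one parsed action against a Selenium WebDriver."""
    from selenium.webdriver.common.by import By
    if action.kind == "navigate":
        driver.get(action.text)
    elif action.kind == "click":
        driver.find_element(By.CSS_SELECTOR, action.selector).click()
    elif action.kind == "type":
        driver.find_element(By.CSS_SELECTOR, action.selector).send_keys(action.text)
```

The slowness complaint follows directly from this loop: every `parse_action`/`execute` round trip typically costs one LLM call, so a multi-step task on a dynamic site means dozens of calls.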
The need to learn the user

One of the biggest problems with all the proposed tools is that, ultimately, we each have our own ways of doing things, whether it's our preferred sites, our intents when taking actions, or our browsing history. None of this currently reaches these agents. An approach that could work much better is learning the user: identifying their important repeated actions, the way they perform them, and the context of each action, and then automating those actions as fully as possible. This (along with many other things) can be done in a browser truly built for it from the start.

The future

Where are we headed? There are three main developments that will bring this future closer:

  1. Advances in language models that allow them to take in larger texts and perform more complex reasoning, faster.
  2. Development of a conversational interface for many sites that will allow accessing them in a much more intuitive way.
  3. Creation of a platform that translates sites into easy-to-use APIs, allowing LLMs (and software in general) to access them and perform actions in a much more deterministic way.

Summary

In terms of the current state, MultiOn leads on performance, but even there, at least for now, the product is not viable (very slow, and it mostly does not work). Adept raised hundreds of millions but has not yet released a product. HyperWrite promises a lot but so far delivers little. But, and this is a big but, I foresee significant progress in the field within six months to a year, and I think much of how we do things today will change significantly. This is one of the most promising areas of AI right now, and I highly suggest following it closely.
