The State of AI Web Agents
When will AI finally do everything for us online? Or at least manage to perform basic web actions properly? As part of the browser we are building, I've had to deal a lot with web AI agents (though not only them), so for the public benefit, here is a short article on what can currently be done, what can't, recommended tools and where we are heading.
AI assistants
One of the most talked about yet unrealized directions in AI is that of the personal assistant. An AI tool that knows me, has memory, can perform many tasks for me, and removes all the uninteresting things from my mind so I can focus on what really matters (like thinking about the Roman Empire). The most natural place to start is the web world since that is where most (if not all) of our activity already takes place. So what can really be done today? Let's go from easy to hard.
Search and Retrieval
Using LLMs for basic search is an almost solved problem. Beyond the many existing tools (Perplexity, You, etc.), this can be implemented in code relatively simply. Some good examples can be found here and here. Basically, what they do is search Google via API, extract snippets from the results, and summarize them (a mechanism identical to Bing Chat's).
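To make the search, extract, and summarize flow concrete, here is a minimal Python sketch. It is not taken from the linked examples: `SearchResult`, `build_prompt`, and `answer` are hypothetical names, and the `search` and `llm` callables are assumed to wrap whatever search API and model you actually use.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SearchResult:
    title: str
    url: str
    snippet: str


def build_prompt(question: str, results: List[SearchResult], max_results: int = 5) -> str:
    """Number the snippets so the model can cite its sources."""
    lines = [
        f"[{i}] {r.title} ({r.url}): {r.snippet}"
        for i, r in enumerate(results[:max_results], start=1)
    ]
    sources = "\n".join(lines)
    return (
        "Answer the question using only the sources below, citing them by number.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}"
    )


def answer(
    question: str,
    search: Callable[[str], List[SearchResult]],  # assumption: wraps a search API
    llm: Callable[[str], str],                    # assumption: wraps an LLM call
) -> str:
    """Search, pack the snippets into one prompt, and let the model summarize."""
    results = search(question)
    if not results:
        return "No results found."
    return llm(build_prompt(question, results))
```

Passing the search and model calls in as plain functions keeps the sketch independent of any specific provider, and makes it easy to try with canned results before wiring in real APIs.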
There are also code examples for more advanced scraping that clicks on links and then clicks further links, summarizing everything at the end. Here is an excellent project example (Israeli). One problem with scraping is that many sites we want to search require login or block bots that don't enter through our browser (remember ChatGPT's failed Browsing mode? This is one reason they removed it).
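The link-following idea can be sketched as a small breadth-first crawl with a depth limit. This is an illustration and not the code of the project mentioned above: `crawl` and `LinkParser` are hypothetical names, and `fetch` is an assumed callable that returns a page's HTML for a URL (in practice it would have to cope with logins and bot blocking, which this sketch ignores).

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Dict
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(
    start_url: str,
    fetch: Callable[[str], str],  # assumption: returns the HTML for a URL
    max_depth: int = 2,
    max_pages: int = 10,
) -> Dict[str, str]:
    """Visit pages breadth-first, following links up to max_depth hops."""
    pages: Dict[str, str] = {}
    queue = deque([(start_url, 0)])
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        if url in pages:
            continue  # already visited
        html = fetch(url)
        pages[url] = html
        if depth < max_depth:
            parser = LinkParser()
            parser.feed(html)
            for href in parser.links:
                # Resolve relative links against the current page.
                queue.append((urljoin(url, href), depth + 1))
    return pages
```

The returned mapping of URL to HTML can then be summarized page by page, or in one pass if it fits in the model's context.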
Actions on sites
As for actions on the sites themselves, the tools currently on the market mainly offer textual capabilities (search, explain, rewrite, etc.). Two tools I really like in this area are WebPilot, which allows prompting any site you are on (I use it a lot for summaries and explanations), and Mano, a sidebar that also allows prompting sites and, additionally, searching the web. Mano also allows saving templates for future use and performing actions within sites (answering emails, for example).
The problems with performing complex actions on the web
What about more complex actions? Here we can distinguish two kinds: general complex actions (booking flights, price comparisons) and personal complex actions (scheduling meetings, answering emails, summarizing open issues under my name on GitHub). Before we get to solutions, it is worth noting that one of the biggest difficulties in giving LLMs web access is that, because of the complexity of how sites are built, on most sites it is not possible to perform meaningful actions without opening a browser and actually visiting them.
In addition, many sites are dynamic: they contain pop-up menus, items that can only be acted upon in a certain order, and scripts that run and change the site on the fly. So simply giving the HTML to an LLM and asking it to work with it will not help. Also, on many sites the full HTML is much larger than the model's context window.
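One common way to ease the context-window problem is to hand the model a compact list of the page's interactive elements instead of the raw HTML. The sketch below uses only Python's standard `html.parser`; the names `InteractiveExtractor` and `compact_view`, the choice of tags and attributes, and the character limit are all assumptions for illustration. It works on static HTML only, so it does nothing about scripts that change the page after load.

```python
from html.parser import HTMLParser

INTERACTIVE = {"a", "button", "input", "select", "textarea"}
KEEP_ATTRS = ("id", "name", "type", "href", "placeholder", "aria-label", "value")
VOID = {"input"}  # tags that never have a closing tag


class InteractiveExtractor(HTMLParser):
    """Keeps links and form controls; ignores everything else."""

    def __init__(self):
        super().__init__()
        self.elements = []  # one dict per interactive element
        self._open = None   # element currently collecting text

    def handle_starttag(self, tag, attrs):
        if tag not in INTERACTIVE:
            return
        attr_map = dict(attrs)
        kept = {k: attr_map[k] for k in KEEP_ATTRS if attr_map.get(k)}
        element = {"tag": tag, "attrs": kept, "text": ""}
        self.elements.append(element)
        if tag not in VOID:
            self._open = element

    def handle_endtag(self, tag):
        if self._open is not None and self._open["tag"] == tag:
            self._open = None

    def handle_data(self, data):
        # Text outside interactive elements (including script bodies) is dropped.
        if self._open is not None:
            self._open["text"] += data


def compact_view(html: str, max_chars: int = 2000) -> str:
    """Render interactive elements as short numbered lines."""
    parser = InteractiveExtractor()
    parser.feed(html)
    parser.close()
    lines = []
    for i, el in enumerate(parser.elements):
        attrs = " ".join(f'{k}="{v}"' for k, v in el["attrs"].items())
        head = f"<{el['tag']} {attrs}>" if attrs else f"<{el['tag']}>"
        text = " ".join(el["text"].split())  # collapse whitespace
        lines.append(f"[{i}] {head} {text}".rstrip())
    # Crude cut-off: may truncate the last line mid-element.
    return "\n".join(lines)[:max_chars]
```

The numeric index in front of each line gives the model a short handle to refer back to ("click element 2"), which the calling code can map to the real element in the browser.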
Solution approaches
There are two approaches to solving this problem:
The need to learn the user
One of the biggest problems with all the proposed tools is that ultimately we all have our own ways of doing things, whether it's our preferred sites, our intents when taking actions, or our browsing history. None of these things currently feed into these agents. An approach that could work much better is learning the user: identifying their important repeated actions, the way they perform them, and the context of each action, and then automating those actions as fully as possible. This (along with many other things) can be done in a browser truly built for this from the start.
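As a toy illustration of "identifying repeated actions", one simple starting point is to count action sequences that recur across browsing sessions. This is only a sketch under strong assumptions, not a description of how any product does it: it assumes the browser already logs each session as a list of action strings (the `open:mail` style labels in the usage note are made up), and `repeated_sequences` is a hypothetical helper.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def repeated_sequences(
    sessions: Sequence[Sequence[str]],
    length: int = 3,
    min_count: int = 2,
) -> List[Tuple[Tuple[str, ...], int]]:
    """Return action sequences of a given length seen in at least min_count sessions."""
    counts: Counter = Counter()
    for session in sessions:
        seen = set()
        for i in range(len(session) - length + 1):
            seen.add(tuple(session[i:i + length]))
        counts.update(seen)  # count each sequence once per session
    return [(seq, n) for seq, n in counts.most_common() if n >= min_count]
```

For example, given the sessions `["open:mail", "click:compose", "type:to", "click:send"]` and `["open:news", "open:mail", "click:compose", "type:to"]`, the only three-step sequence shared by both is opening mail, clicking compose, and typing a recipient, which makes it a candidate for automation. A real system would also need the context of each action, which plain counting does not capture.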
The future
Where are we headed? There are three main things that will nonetheless bring this future closer to us:
Summary
In terms of the current state, MultiOn is the leader in performance, but even there, at least currently, the product is not viable (very slow and mostly does not work). Adept raised hundreds of millions but has not yet released a product. Hyperwrite promises a lot but, at least currently, delivers little. But, and this is a big but, I foresee significant progress in the field within 6 months to a year from today, and I think much of how we do things today will change significantly. I think this is one of the most promising areas of AI right now and highly recommend following it closely.
Write clearly, be understood
1 year ago: Watching this, great summary, concise but thorough, technical but understandable, nicely done! Agree this is an important watch, actually maybe the most important thing to watch for Consumer AI. Getting to this place (and I agree we will get there, it’s a question of when), is what will truly revolutionize the way we interact with the digital world.
Product @ Fiverr Pro | GTM
1 year ago: Great read, very insightful! What about cost efficiency? Do you think it is worth using "rockets" like AI agents for relatively simple tasks that can be done in a short amount of time? Thank you for sharing.
Very insightful Amit Mandelbaum thanks for writing it up