An AI took control of my computer ...

...and it didn't exactly go to plan!

Tl;dr: This feature is in beta, so it has plenty of problems, but the potential is enormous. It's easy to set up and works as a basic proof of concept; the idea is incredible, but don't expect it to change your life this week.


Background

Anthropic recently released a beta feature called Computer Use, which lets an AI take control of your computer and carry out tasks on its own from plain-text instructions. It's like all the effort they saved on coming up with a creative name went straight into making the feature itself great. The setup is relatively easy; I had it running in under five minutes!


Setup Steps

Anthropic recommends testing this feature in a virtual machine (VM) since it's still in beta, i.e., not fully polished and likely buggy. Because it can literally control your computer (e.g., send a random email or even delete your operating system), a VM is the safest option. Letting the AI use a VM is like letting a kid play with a dollhouse: they might burn it to the ground, but your real house stays intact.
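For reference, Anthropic's quickstart actually ships the whole sandboxed desktop as a Docker container rather than a traditional VM. The launch command below is roughly what the quickstart README gave at the time of the beta launch; check the repo for the current image tag and port list before relying on it:

# Launch Anthropic's reference computer-use sandbox (image name as of
# the October 2024 beta; verify against the anthropic-quickstarts repo).
docker run \
    -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
    -v $HOME/.anthropic:/home/computeruse/.anthropic \
    -p 5900:5900 -p 8501:8501 -p 6080:6080 -p 8080:8080 \
    -it ghcr.io/anthropics/anthropic-quickstarts:computer-use-demo-latest

Once it's running, the agent's desktop and the chat panel are both served in the browser, so nothing the model does ever touches your host machine.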

My "computer"

Once I set up the VM and ran the feature, it booted into a 2000s-style Ubuntu (?) desktop with basic apps like a browser and spreadsheet. On the left, there was a simple chatbot-style interface where I could type in tasks in plain text, and the AI would then control the screen to complete them.

P.S. If you want to see a detailed tutorial for setting this up from scratch, let me know in the comments below.


The Tests

I asked the AI to do two things: first, find and download an image of Albert Einstein; second, find Nvidia's stock price from Yahoo Finance for the last three months and download the data.


How Does It Work?

Screenshot, thinking, action, repeat

The feature works by taking screenshots of the current screen, identifying buttons, text, etc., and using their coordinates to navigate and click. For example, if it wants to close a tab, it takes a screenshot, locates the "close tab" button, finds its coordinates, navigates there, and clicks.

At each step, whether navigating or clicking, it takes a picture. You can see its thinking process in the chat on the left:

Tool Use: computer
Input: {'action': 'mouse_move', 'coordinate': [511, 101]}

Tool Use: computer
Input: {'action': 'left_click'}

Tool Use: computer
Input: {'action': 'key', 'text': 'Return'}        
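To make that loop concrete, here's a minimal sketch of the screenshot-think-act cycle in Python. The take_screenshot and execute_action helpers are hypothetical stubs for whatever runs inside your VM; the tool type, beta flag, and model name follow Anthropic's documentation at the time of the beta launch and may have changed since:

# Minimal sketch of the screenshot -> model -> action loop.
# take_screenshot() and execute_action() are hypothetical stubs.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def take_screenshot() -> str:
    """Capture the VM's screen and return it as base64-encoded PNG (stub)."""
    ...

def execute_action(action: dict) -> None:
    """Perform the requested mouse/keyboard action inside the VM (stub)."""
    ...

messages = [{"role": "user", "content": "Download an image of Albert Einstein."}]

while True:
    response = client.beta.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        tools=[{
            "type": "computer_20241022",   # beta tool version at launch
            "name": "computer",
            "display_width_px": 1024,
            "display_height_px": 768,
        }],
        messages=messages,
        betas=["computer-use-2024-10-22"],  # beta header at launch
    )
    # Collect the model's tool calls; stop once it replies in plain text only.
    tool_uses = [block for block in response.content if block.type == "tool_use"]
    if not tool_uses:
        break
    messages.append({"role": "assistant", "content": response.content})
    results = []
    for call in tool_uses:
        execute_action(call.input)  # e.g. {'action': 'mouse_move', 'coordinate': [511, 101]}
        results.append({
            "type": "tool_result",
            "tool_use_id": call.id,
            "content": [{"type": "image", "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": take_screenshot(),  # fresh screenshot after every action
            }}],
        })
    messages.append({"role": "user", "content": results})

The key part is the end of the loop: after every single action, a fresh screenshot goes back to the model, which is exactly the behavior that makes the whole thing slow.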


How Did It Go?

"Promising proof-of-concept" is a fair summary. It's promising because this feels like the first step in a seismic shift in how we use computers. This must be how people must've felt when the Apple II came out. It's building on familiar technology but changing how we interact with it entirely.

Constant rate-limiting really hinders the experience

In terms of current status, though, it didn't complete either task successfully. For the image download, it got stuck in a loop trying to close the initial popup you see when opening Google. Eventually, I hit a rate limit, which meant I couldn't use the feature for the next 30 minutes.

For the stock price task, it managed to find the Yahoo Finance page but struggled to reach the historical data section and chose the wrong date range. It couldn't find the download button and kept trying various actions (except scrolling, for some reason) until I hit another rate limit. I tried both tasks a couple of times each and gave up in the end.


The Problems

  • It can be INCREDIBLY slow. It takes a screenshot at every step and then processes it, which adds a few seconds each time. On a perfectly static page with no ads or moving parts, it works relatively fast (though still much slower than an adult human). I'm not sure why it needs screenshots for both navigating and clicking; presumably it's to verify the interface hasn't changed and the target element is still where it expects, which would help with dynamic pages and surprise popups. But if it already knows the download button is at (x=200, y=300), taking another picture before clicking feels redundant, and in stable contexts this step could probably be skipped (see the sketch after this list).
  • It makes mistakes. When I initially asked it to find the last month's prices (instead of three months), it still clicked the "three months" button.
  • Anthropic has severe rate limiting. I don't know if it's just for non-enterprise users, but I couldn't get even one basic task done without hitting rate limits every time and being told to wait for 15-30 minutes.
  • Current use cases are limited. Since it relies on static screenshots, it can't handle tasks involving dynamic content; for instance, you couldn't tell it to pause a YouTube video at a specific moment, since it takes too long to process each step. Of course, this is barely v1 of the feature, so we can expect this to improve in the coming months.
  • Cost. These two simple tasks cost me around $0.60, including retries after failures. That might not sound like much, but it adds up: at roughly $0.30 per task, downloading stock prices for 10,000 companies would run about $3,000 in API fees, and that's before any analysis.
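On the redundant-screenshot point above, one obvious (and purely speculative) optimization would be to skip the capture-and-reprocess step whenever the screen hasn't changed. A toy sketch of that idea, not anything Anthropic has said it does:

# Toy sketch: only re-run the vision step when the screen actually changed.
# This is speculation about a possible optimization, not Anthropic's design.
import hashlib

_last_hash: str | None = None

def screenshot_if_changed(capture) -> bytes | None:
    """Return fresh screenshot bytes, or None if the frame is unchanged.

    `capture` is a hypothetical callable returning raw PNG bytes.
    """
    global _last_hash
    frame = capture()
    digest = hashlib.sha256(frame).hexdigest()
    if digest == _last_hash:
        return None   # screen unchanged: reuse cached coordinates
    _last_hash = digest
    return frame      # screen changed: take the slow vision round-trip

Even this naive check would skip the vision round-trip for back-to-back actions like the mouse_move/left_click pair shown earlier, as long as the page stays static.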


Where Do We Go From Here?

The potential here is enormous. Imagine a future where you could ask your computer to prepare a report, automate repetitive tasks, or even navigate complex software just by describing what you need. We're likely looking at a whole new way to interact with computers and information. While computer automation tools have existed before, they were rule-based and struggled to adapt to varied scenarios. LLMs, with their nuanced understanding, could be a game changer (I know, overused phrase, but it fits here). The improvements needed are obvious: faster image processing, better navigation, more safeguards, and cost reduction to make this viable.
