Why the Large Action Model (LAM) Will Shape the Future of Tech
Steve Jobs believed that devices would someday become a "bicycle for the mind," but their effect on some of us is closer to that of smoking or junk food.
We're addicted to our screens; the average person spends 6 hours and 58 minutes per day on screens connected to the internet.
On average, people check their phones 58 times per day. And almost 52% of phone checks (30 per day) occur during work hours.
All of us have, at one point, questioned whether the device inside our pocket has made us more productive or lazier.
When Jesse Lyu got on stage to present rabbit inc.'s new device, he made a statement that resonated with me. He said, "Our smart devices have become the best way to kill time instead of saving it."
If you examine the list of the most downloaded applications for 2023, over half are designed with a singular purpose: to kill your time. Of course, these apps will claim otherwise.
But more recently, a host of new platforms have started leveraging artificial intelligence to make us more productive.
OpenAI might have opened Pandora's box as far as ease of finding information is concerned, but the Large Action Model (LAM) will finally close the lid on it.
What is LAM?
The Large Action Model (LAM) is designed to understand how humans interact with computer programs. Unlike previous approaches, LAM learns directly how different programs work and what users do in them, without needing text as an intermediate step.
The question I've asked myself is why a Large Action Model (LAM) is needed when we have made so many recent advances in natural language models and computer vision.
Neural language models have enabled machines to better understand and respond to human language. Speech recognition and synthesis technologies have also improved, making it possible to build machines that understand human intentions deeply and contextually in real time.
This progress has led to a new way of interacting with devices using spoken language rather than touch. It started with smart speakers and has expanded to AI chatbots and operating systems with natural language interfaces.
However, designing these devices poses challenges, such as the lack of application programming interfaces (APIs) from major service providers.
To overcome this, platforms like rabbit inc. use neuro-symbolic programming to learn user interactions directly without relying on rigid APIs.
The Large Action Model (LAM) aims to better understand human intentions expressed through actions on computers and, by extension, in the physical world. The emphasis is on learning and interpreting user actions rather than relying on predefined interfaces.
What Problem is LAM Trying to Solve?
The way people interact with computers is different from how they use natural language or vision. An application's interface is more structured than an image, yet more detailed and messier than a sentence or a paragraph.
Rabbit's Large Action Model (LAM) needed different qualities compared to a model that only understands language or vision.
For example, while it's fine for a smart chatbot to be creative, actions learned by LAM on applications should be very regular, simple, stable (not changing too much), and easy to explain. This approach aligns with Occam's razor, which suggests that simpler explanations are often better.
Let's consider a specific example of an action performed on a computer application. Imagine a user asking a photo editing app to improve a photo:
When the same request is put to a chatbot, it is expected to be creative and come up with a unique suggestion, just as the user asked.
The LAM-learned action, by contrast, is highly regular and minimalistic. Instead of improvising, it sticks to a straightforward, predictable step, such as adjusting brightness. This aligns with the idea that actions on applications learned by LAM should be regular, minimalistic, and stable.
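To make the contrast concrete, here is a minimal, purely illustrative Python sketch. The UIAction class, the hard-coded brightness value, and the canned chatbot reply are all hypothetical stand-ins rather than Rabbit's actual types or behavior; the point is only that the LAM-style output is a small, fixed, inspectable step, while the chatbot's output is open-ended text.

```python
from dataclasses import dataclass

# Hypothetical, simplified sketch: a LAM-style action is a small, fixed,
# inspectable step, in contrast to a chatbot's open-ended text response.

@dataclass(frozen=True)
class UIAction:
    """One deterministic step against an application's interface."""
    target: str      # the UI element the action operates on
    operation: str   # what to do to that element
    value: float     # the parameter of the operation

def chatbot_response(request: str) -> str:
    # A chatbot is free to be creative; its output is unconstrained prose.
    return ("You could warm up the tones, lift the shadows slightly, "
            "and add a touch of vignette for a cozy evening feel.")

def lam_action(request: str) -> UIAction:
    # A LAM-learned action is regular and minimalistic: the same request
    # maps to the same small, predictable step every time.
    return UIAction(target="brightness_slider", operation="set", value=0.7)

if __name__ == "__main__":
    request = "Make this photo look better"
    print(chatbot_response(request))  # open-ended suggestion
    print(lam_action(request))        # stable step, easy to inspect and replay
```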
Language models face challenges in understanding applications when presented with raw text.
Even the most advanced large language models, with their current tokenizers, struggle to fit a raw-text representation of an application within their context window.
In simpler terms, these models find it hard to fully grasp the content and structure of applications when they are in raw text format, like the HTML of a webpage.
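To see roughly why, consider a back-of-the-envelope sketch. The four-characters-per-token ratio is a common rule of thumb, and the page size and context-window size below are illustrative assumptions, not measurements of any particular model or site.

```python
# Rough illustration of why raw application markup strains an LLM's context
# window. The ~4 characters-per-token ratio and the page/window sizes below
# are assumptions for illustration only.

CHARS_PER_TOKEN = 4              # common rule of thumb for English-like text
PAGE_HTML_CHARS = 2_000_000      # a heavy web app's DOM can reach megabytes
CONTEXT_WINDOW_TOKENS = 128_000  # a generous window for a modern LLM

estimated_tokens = PAGE_HTML_CHARS / CHARS_PER_TOKEN
print(f"Estimated tokens for the raw page: {estimated_tokens:,.0f}")
print(f"Fits in a {CONTEXT_WINDOW_TOKENS:,}-token window: "
      f"{estimated_tokens <= CONTEXT_WINDOW_TOKENS}")
# ~500,000 tokens for one page: it does not fit, before the model has even
# begun to reason about the page's structure or the actions it affords.
```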
How Does the Large Action Model (LAM) Work?
LAM's way of modeling is based on imitation or learning by demonstration. It watches how a person uses an interface and aims to accurately replicate the process, even if the interface changes a bit.
Unlike a black-box model that outputs actions without control, LAM's approach is more transparent. Once it learns from a demonstration, it directly applies the learned routine to the target application without the need for continuous observation or adaptation.
This makes the process more understandable, and any technically trained person can inspect and understand the "recipe" or steps involved.
As LAM keeps learning from demonstrations, it builds a comprehensive understanding of every aspect of an application's interface. It essentially creates a "conceptual blueprint" of the underlying service provided by the application.
In simpler terms, LAM acts like a bridge, connecting users to the services offered by an application through its interface.
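As a rough illustration of the "recipe" mentioned above, here is a minimal Python sketch. The step names, the recorded routine, and the toy replay function are hypothetical stand-ins, not Rabbit's actual implementation; they only show how a demonstration can be stored as readable, symbolic steps and replayed against an application.

```python
from typing import Callable

# Minimal sketch of the "recipe" idea: a demonstration is recorded as a list
# of symbolic steps that a technically trained person can read, and later
# replayed against the target application. Everything here is illustrative.

Recipe = list[dict]

def record_demonstration() -> Recipe:
    # In practice these steps would be captured while watching a user;
    # here they are hard-coded for illustration.
    return [
        {"step": "open_app", "app": "music_player"},
        {"step": "search",   "query": "lo-fi beats"},
        {"step": "click",    "element": "first_result"},
        {"step": "press",    "element": "play_button"},
    ]

def replay(recipe: Recipe, perform: Callable[[dict], None]) -> None:
    """Apply the learned routine step by step to the target application."""
    for step in recipe:
        perform(step)

if __name__ == "__main__":
    # A stand-in "application driver" that just prints each step.
    replay(record_demonstration(), perform=lambda s: print("performing:", s))
```

Because the routine is just data, anyone can read it, audit it, and see exactly what will be performed before it runs, which is what makes the approach more transparent than a black box.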
Why the Large Action Model (LAM) May Shape the Future of Tech
We've all been sold on the promise of AI assistants and, more recently, AI workers, be it a physical device or a Chrome-based extension that can serve as your personal secretary. However, neither has lived up to our expectations.
When Humane made its announcement last year, it finally felt like a device with serious potential. However, the pricing put it out of reach for many of the people who were considering it.
The subscription on top was a complete bummer: you want us to pay a recurring fee for a device we're not even sure we actually need.
But the idea of an AI companion or AI worker is promising; if LAM can fill this void, it could disrupt the market. Large Action Models could perhaps be trained to perform complex tasks that span multiple platforms.
When Jesse Lyu was presenting the Rabbit R1, it reminded me of the presentation Dag Kittlaus gave at TechCrunch Disrupt seven years ago. Dag and his team were building Viv Labs after selling Siri to Apple.
However, Viv was later sold to Samsung Electronics and became Bixby. When Dag was asked why they decided to sell to Samsung, he replied, "They ship 500 million devices a year. You asked me onstage about our real goal, and I said ubiquity."
Of course, Samsung Electronics has spent the last few years improving Bixby, but it is nowhere near what Dag's team imagined it would become.
You can only hope that Jesse Lyu has learned from his experience of selling Raven Tech to Baidu, considering that fulfilling the promise of AI workers or AI assistants requires powerful hardware and software.
If you manage to fulfill that promise, you're possibly looking at a $14.77 billion market, maybe even larger, considering that the use cases keep evolving.
Message for the Reader: AI workers and assistants are pretty clever, but discerning human emotions is not their strong suit. Hence, I'm reaching out to you, the discerning reader, with a humble request: if you found this article delightful, consider giving it a thumbs up or sharing it.
Your human touch is the true test of its likability!