Action Jack or the problem with Large Action Models in the enterprise
The next frontier for AI companies is building “Large Action Models” (LAMs), with pioneers such as Adept, H-Company, and Orby. It is a straightforward yet compelling idea: train models on enough user actions, and they will learn to act just like humans, clicking buttons and using software. In the last 12 months, more than $500m has been raised by companies trying to build LAMs in different variations.
There is immense potential for better automation, as the last generation of legacy SaaS has largely failed to deliver the potential for cost savings for companies. Gartner estimates that enterprises will spend $1tn on software in 2024 (4x in 10 years) while revenue per employee is flat and IT and shared service headcount has exploded. We need less software, and we need fewer people doing meaningless work. Large Action Models could fix the software industry's current challenges.
The hypothesis is simple: Use human actions, rules, and constraints instead of just language as inputs to the models, and you get models that can take actions or use tools. These are often implemented as multiple agents with different specializations. Simple enough. Except it is much harder than it seems.
The self-driving car problem
Orchestrating enterprise software and having AI take action sounds much easier than creating a self-driving car; everything is digital and lives inside the computer - a controlled world.
Except, it is not. Most enterprises have every kind of system, data format, and hack you can imagine - just to make the everyday world work. Mainframes, images of handwritten text in PDFs, shared inboxes, SharePoint folders, fax, and 25-year-old Excel spreadsheets with custom macros are all part of this world.
To compare it to the self-driving car problem: after 20 years and $150bn+ in investments, we have cars that can “almost” drive as well as a human, except when they hit lighting poles on straight roads in Phoenix. Actions are hard, especially if they have real-world consequences.
And the enterprise landscape is even harder to “drive” in than a dystopian version of San Francisco, never mind Phoenix, with another big difference: it will actively fight back. If you want to install an agent to “look” at all users' browsers or desktops so you have inputs for your model, talk to IT (evil laugh). Did anyone say GDPR or the EU AI Act? Integrate with back-end systems, and you speak to IT again (even more evil laugh). Make decisions on finance data that goes to the ERP, and now you have to talk to the CFO (not laughing). This is not self-driving cars in San Francisco; it is an active warzone with everyone shooting at you every step of the way.
Costly Data and even costlier IP challenges
Imagine you solved the above problems and now have your input data. What is meaningful data for your action model: process data, rules, screenshots, or videos?
Running images through LLMs and understanding them is expensive, and interpreting user actions takes a lot of work. You can’t just rely on reading the browser with an extension (maybe for consumers, but not in the enterprise). Secondly, if you want this to be a LARGE action model, you need lots and lots of data: data that is hard to get and owned by many vendors (370+ SaaS vendors in the enterprise on average today, not counting legacy applications). This data costs many times as much as scraping the consumer internet and building ChatGPT, before you even make the model.
Further, why should most data silos even participate when they are trying to make their own copilots, and what data can be legally used? It’s a bit like the early days of music streaming, before Spotify and similar platforms emerged and solved the IP and rights issues.
Pesky Humans!
Once all that is solved, the most complex challenge is humans! We are annoying, non-linear pattern machines with a lot of knowledge stored in our brains that is invisible to machines.
Current LLMs depend on human feedback and data labeling to reach their performance levels. For consumer LLMs, large outsourced workforces do this, but that won’t work for enterprise use cases, where the context and quality of labeling and feedback matter a lot. Most existing legacy enterprise applications are not built for that, so you need an additional UX layer for feedback and validation loops that keeps the context of the data relevant to the labeling or feedback. Before you know it, you have replicated most legacy applications from scratch to gather user feedback.
Look, no hands!
It’s easy to make a convincing demo; just look at Adept's early concept videos showing an agent using Salesforce or buying plane tickets, just as the first self-driving car demonstrations were compelling 20 years ago. Making it work in the real world is extremely hard. Adept just got acquired by Amazon (not Microsoft), showing that even if you are one of the co-authors of the seminal “Attention Is All You Need” paper with $400m in funding, you don’t get far trying to do actions top-down in the enterprise.
Apple’s Intelligence
Apple is never a first mover, but they are often correct. It’s been interesting to watch as they waited to move in AI until a few weeks ago. Their vision for the future is as simple as it is elegant.
Everything is tied together on the phone, leveraging the power of tight hardware and software integration. Apple is forcing app vendors to add new LLM “tool” handles in the form of App Intents, which will allow Apple’s voice-activated “AI” to orchestrate them.
Apple is using its distribution power and ecosystem to generate the world rather than trying to make a self-driving car. This is smart as it is much easier to do with the current tools, gives a better end-user experience, and requires significantly less compute than “seeing,” “reasoning,” and then “actioning”, which makes it device and battery-friendly.
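To make the contrast concrete, here is a minimal sketch of the "declared tool handle" idea, written in Python for brevity. It is not Apple's App Intents API: `ToolHandle`, `Orchestrator`, and the two example functions are hypothetical names invented for illustration. The point is only that the orchestrator calls actions that apps have declared up front, rather than looking at a screen and clicking.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ToolHandle:
    """A declared action an app exposes to an orchestrator (hypothetical type)."""
    name: str
    description: str
    run: Callable[..., str]


class Orchestrator:
    """Dispatches requests to declared handles instead of reading the screen."""

    def __init__(self) -> None:
        self._handles: Dict[str, ToolHandle] = {}

    def register(self, handle: ToolHandle) -> None:
        # Each app vendor declares its actions up front.
        self._handles[handle.name] = handle

    def available(self) -> List[str]:
        # The list a model would choose from, sorted for stable output.
        return sorted(self._handles)

    def invoke(self, name: str, **kwargs: str) -> str:
        if name not in self._handles:
            raise KeyError(f"no app has declared a handle named {name!r}")
        return self._handles[name].run(**kwargs)


# Two stand-in "apps" (assumed examples, not real integrations).
def send_message(to: str, body: str) -> str:
    return f"sent to {to}: {body}"


def create_reminder(title: str) -> str:
    return f"reminder created: {title}"


orchestrator = Orchestrator()
orchestrator.register(ToolHandle("send_message", "Send a text message", send_message))
orchestrator.register(ToolHandle("create_reminder", "Create a reminder", create_reminder))

result = orchestrator.invoke("send_message", to="Ana", body="Running late")
print(result)
```

In this sketch there is no screenshot to interpret and no button to find; the only work left for a model would be picking a handle name and filling in its arguments.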
Generating the world vs. understanding it
So, a lesson learned from Apple is: what if we didn’t have to automate the existing software? Of course, it is much harder to control and generate the world through distribution and ecosystem control in software, but there is another solution. Just generate the application the end user is using, pre-baked for automation.
We can’t remake the real world to make it easier for self-driving cars to get around, but software is digital, not bricks and asphalt.
AI is accelerating code generation, task management, and tool use. What if we could start with a contained problem and generate a solution for that problem with all the necessary learning and ability to take action baked in? We would start with data, defined APIs, and input/output and then generate an applicable solution.
We believe this is cheaper and faster, but we also think it is a better solution in the long run, especially if each of these solutions, which we call “work blocks,” is composable and human-validated, ensuring trust in outputs and work done by AI and humans.
The goal is not to create a single large action model but a platform to execute millions of small ones, linked together into larger ones when necessary. We want to develop new bottom-up systems of work rather than training AI to click buttons in old applications and reinforce existing silos.
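As a rough illustration of what "composable and human-validated" could mean in code, here is a small Python sketch. It is an assumption about the shape of a work block, not a description of an actual product: `WorkBlock`, `compose`, and the invoice example are hypothetical, and the `validate` callback merely stands in for a human reviewer approving or rejecting an output.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class WorkBlock:
    """A small unit of work with a defined input and output (hypothetical type)."""
    name: str
    run: Callable[[Any], Any]
    # Stand-in for a human review step; returns True to approve the output.
    validate: Optional[Callable[[Any], bool]] = None

    def execute(self, data: Any) -> Any:
        output = self.run(data)
        if self.validate is not None and not self.validate(output):
            raise ValueError(f"{self.name}: output rejected by reviewer")
        return output


def compose(name: str, blocks: List[WorkBlock]) -> WorkBlock:
    """Link small blocks into a larger one; the result is itself a WorkBlock."""

    def run_all(data: Any) -> Any:
        for block in blocks:
            data = block.execute(data)  # each block's output feeds the next
        return data

    return WorkBlock(name=name, run=run_all)


# Assumed example: total an invoice, then flag it if it needs approval.
extract_total = WorkBlock(
    name="extract_total",
    run=lambda invoice: invoice["net"] + invoice["tax"],
)
flag_for_approval = WorkBlock(
    name="flag_for_approval",
    run=lambda total: {"total": total, "needs_approval": total > 1000},
    validate=lambda out: out["total"] >= 0,  # reviewer rejects negative totals
)

pipeline = compose("invoice_check", [extract_total, flag_for_approval])
outcome = pipeline.execute({"net": 900, "tax": 180})
print(outcome)
```

Because `compose` returns another `WorkBlock`, a composed pipeline can be nested inside a larger one, which is one way to read "small ones linked together into larger ones".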
Technology Business Leader
3 months
The long tail result of your hypothesis is a fundamental change in the economics of “IT”… an intent-based interaction, with “generated” services to deliver the intent, based on guardrails and good data.
Oh, compliance reporting and attestation just became the next level of automation. Checkbox compliance to be delivered by AI :-) It will be interesting to see when LAMs will be used to clean the signal from the noise - and used as an active feedback loop to developers - to eradicate all useless functionality. Which then leaves a good question - when checkbox compliance becomes (really) useless - what will replace it?
AI Whisperer and Head Honcho at Faktory
3 months
It's going to be extremely hard to beat the generalized models. Bloomberg tried and was superseded in no time. You can see how the new Claude can more or less do the same as Devin, the code agent. My suggestion is to focus on orchestration rather than trying to make a better model, at least from what I have seen so far.