Arrival of the AI Smartphone, AI OS and generative user interface
Intro
You may or may not have spotted the arrival of some strange mammals, fashion items and new types of smartphones, either as generally available consumer products, as concept prototypes, or in the form of announcements. Among the buzzwords are next-gen AI smartphone, AI OS, and generative user interface.
What is new, what is coming our way, and what might we overlook that could leave us feeling left behind?
Machine Learning and forms of Artificial Intelligence have been around for decades, and ‘AI-powered’ system features have been present on smartphones for quite some time. Think of auto-completion of text or sophisticated image processing.
Remember the arrival of ChatGPT by OpenAI in Nov 2022. It gave digital industries a new boost (massive hope and just a little bit of hype too), thanks to the real potential and promise of generative AI (genAI for short). Newspaper journalists suddenly spoke about foundation models and large language models (LLMs). Even my grandmother, if she were still alive, would by now eloquently talk with her neighbours about how LLMs work. Everybody pretends to know. Excitement is sky-high. 2023 became prime time for LLMs, certainly when used from a PC, laptop, or tablet computer (remember: from Microsoft Copilot in Feb 2023 to OpenAI GPT-4 in Mar 2023 to Google Gemini in Dec 2023). High-speed trains are slow compared to the pace of this technology.
The impact on smartphones and consumer devices is still to come, driven by technological progress on multiple fronts: hardware, foundation models and other AI technologies. While IDC summarises the big new thing we are witnessing as ‘on-device genAI support’, I go further. I call it ... well, I don’t know. I like to avoid ‘next-gen’. Too many things are next-gen these days. Maybe ‘genAI user interaction and user interface’, or ‘intent-driven user interface creation and service execution’. So far, I have badly failed to find a super-elegant, long-lived three-letter acronym like CLI or GUI.
As the new technologies mature to support different deployment scenarios (on-device, in the cloud), new use cases, product types, product features and interaction modes become feasible. A few examples:
Use cases: body-worn voice-controlled camera as part of a wearable gadget, more natural language processing incl. live translation, more sophisticated computational photography, direct execution of e-commerce purchase wishes without fiddling with an app.
Product types: the app-less smartphone (e.g. Deutsche Telekom’s T Phone / T-Mobile REVVL), screen-less cellular wearable (AI Pin), square-formatted intelligent companion (Rabbit R1).
Product features: real-time graphical user interface (UI) generation and rendering (generative UI as per Brain.ai), concierge for an app-free smartphone, support for on-device genAI model execution at low power consumption for inferencing, smart user intent recognition and dispatching of requests and tasks to the most suitable genAI models in the cloud (Humane.com AI Pin), and AI OS.
Interaction modes: (exclusively) voice and gesture control, touch and text, circle to search (cf. Samsung S24).
Let’s look a bit into progress on the technical front and see what it enables.
Hardware
Consumer devices like smartphones and other gadgets get a boost in hardware performance through hardware accelerators, in particular neural processing units (NPUs). We all know multi-core CPUs (central processing units), perfectly fine for running the average app. While energy-hungry, high-performing GPUs (graphics processing units) have complemented CPUs, particularly in the cloud, e.g. to enable training of large language models, multi-core NPUs are there to effectively and efficiently speed up the execution of neural network models, including genAI models, on devices like smartphones, while keeping power consumption low. Techniques like int8 quantization (using 8-bit integers instead of floating-point numbers) and model compression also help get LLMs to run on phones and consumer gadgets with small form factors.
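To make the quantization idea concrete, here is a minimal, self-contained sketch of symmetric int8 weight quantization in plain Python/NumPy. It is purely illustrative and not tied to any particular NPU or framework; real toolchains quantize per-channel, calibrate activations, and fuse the scale into optimised kernels.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 quantization: map float weights onto [-127, 127]."""
    scale = float(np.max(np.abs(weights))) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights (in practice the scale is fused into the matmul)."""
    return q.astype(np.float32) * scale

# Toy "layer" of float32 weights: storing it as int8 cuts its memory footprint to roughly 1/4.
w = np.random.randn(1024, 1024).astype(np.float32)
w_q, s = quantize_int8(w)
print("max abs reconstruction error:", np.max(np.abs(w - dequantize(w_q, s))))
```

That factor-of-four reduction in memory (and the cheaper integer arithmetic that comes with it) is one reason multi-billion-parameter models start to fit on phones at all.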
Some examples of chipsets that include NPU technology are the Qualcomm Snapdragon 8 Gen 3, Apple A17 Pro, and MediaTek Dimensity 9300. More will follow without any doubt.
New hardware on devices and tricks like quantization (for which you do not necessarily need new hardware) help with application responsiveness and therefore user experience. Equally revolutionary is what runs on the hardware. Namely ...
Software
Notable in my view are software concepts as prototyped and being brought to market now by various startup companies.
The generative user interface
Currently, the most illustrative example is Natural AI, a user interface generated on the fly in real time, created by Brain.ai and demonstrated together with Deutsche Telekom in early 2024 for use in an app-free smartphone. The origins of Natural AI go back to 2020. It is a bit difficult to describe and best understood through a demo. Imagine, however, a smartphone screen that is rather white and empty (like the bare-bones Google search web page). All you can do is express your wish or intent, e.g. via voice control. You say: “I want to book a flight from Munich to London, for 2 people, departure in the morning of April 29, return flight in the evening of May 2. My preference is the cheapest direct flight.” As you speak and the spoken words appear on the still rather empty smartphone screen, the next suitable user interface component is generated and rendered, keeping pace with your emerging sentence. You don’t see a traditional app.
The newness of the concept becomes clear if you have a new phone with none of the e-commerce apps installed on it. In the above case, you would normally search an app store for the Lufthansa app, the British Airways app, or another one to find your optimal flight. After choosing an app, you have to download it, install it, fire it up and so on. You go after the apps. With Natural AI, ‘you no longer go to apps, apps come to you’, just as your thought process evolves.
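To illustrate the interaction pattern (and only that), here is a hypothetical Python sketch of an intent-driven UI loop: as the speech recogniser surfaces new intent slots, matching interface components are assembled and re-rendered. All names here (UIComponent, SLOT_TO_COMPONENT, on_partial_transcript) are my own assumptions; Brain.ai has not published how Natural AI is actually built.

```python
from dataclasses import dataclass

@dataclass
class UIComponent:
    kind: str   # e.g. "route_picker", "date_picker"
    label: str

# Hypothetical mapping from recognised intent slots to UI building blocks.
SLOT_TO_COMPONENT = {
    "route":      UIComponent("route_picker", "Munich -> London"),
    "passengers": UIComponent("passenger_count", "2 travellers"),
    "dates":      UIComponent("date_picker", "29 Apr - 2 May, morning out / evening back"),
    "preference": UIComponent("result_filter", "cheapest direct flight"),
}

def render(components):
    # Stand-in for real on-screen rendering.
    for c in components:
        print(f"[{c.kind}] {c.label}")

def on_partial_transcript(slots_so_far):
    """Re-render the screen each time speech recognition yields new intent slots."""
    render([SLOT_TO_COMPONENT[s] for s in slots_so_far if s in SLOT_TO_COMPONENT])

# The interface grows as the sentence unfolds.
on_partial_transcript(["route"])
on_partial_transcript(["route", "passengers", "dates", "preference"])
```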
There is of course the question of where to run this generative user interface process: on the device (with the help of an accelerator), driving up the cost of the device, or in the cloud over a fast 5G connection, lowering the device’s bill of materials.
Given the above user interface concept, different interpretations are possible. Two that I have spotted are:
The AI Operating System (AI OS)
The example I pick is Rabbit OS which is used for the Rabbit R1 device. The Rabbit R1 is a square-shaped slim device, half the size of a smartphone, with a screen, a rotating camera and a push/scroll button.
Rabbit uses a special type of foundation model as a universal controller for apps. They call their foundation model LAM, which stands for Large Action Model. The goal is similar to that of the generative user interface created by Brain.ai: to simplify the interaction between the user and various digital services (which the user individually doesn’t need to be aware of or know how to use).
In the case of Rabbit’s approach, we end up with a single, unified interface towards end users, which, as usual, has pros and cons. In the case of Natural AI from Brain.ai, you end up with a continuously evolving graphical user interface, created on the fly and changing in line with your intent.
Among the benefits: With both Rabbit OS and Natural AI, you don’t need to worry about whether there are 10 or 1000 good apps out there. The AI does the job for you and interfaces with what is most suitable for you (that’s at least the idea).
A clear downside is that today these app-free user interfaces are miles away from the role-model user interfaces we find in the market (a favourite of mine is the Spotify UI). The app-free UIs still lack elegance and good design. If this improves over time, great; otherwise it could turn into an Achilles heel for these concepts. Time and user feedback will tell.
A different type of AI OS is Cosmos from Humane. I haven’t found much public information about Cosmos yet; however, to quote from the website:
“Cosmos introduces the Ai Bus, a groundbreaking Ai framework that transforms how cloud software interacts. As an ultra-fast conduit enabling Ai-driven experiences, the era of searching for, downloading, managing, or launching apps is over. [You see the relation to the app-free smartphone] Our Ai Bus quickly understands what you need, connecting you to the right Ai experience or service instantly, all at the speed of thought.”
In my interpretation, this points to an architecture for smart dispatching of user requests to genAI models in the cloud. An implementation in the form of a new device type is Humane’s AI Pin. Humane calls the concept “contextual computing”. The idea is to capture user context (e.g. through a picture or video of the environment, of what you see or hold in your hand, taken by the device camera) together with a user command, fire a request from the device out to the cloud, and get a response from the most suitable, up-to-the-task large language model. The response is then tailored to the prompt and system prompt given to the LLM, which carries at least part of the user’s context.
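A hypothetical sketch of such a dispatch flow follows: the device packages a camera frame and a spoken command, a crude router picks a model family, and the request goes out to the cloud. The endpoints, the routing rule and the payload format are all my assumptions, meant only to illustrate the pattern; Humane has not published the AI Bus interface.

```python
import base64
import requests  # generic HTTP client; not a Humane SDK

# Hypothetical endpoints -- placeholders, not real services.
MODEL_ENDPOINTS = {
    "vision":   "https://example-cloud.invalid/vision-model",
    "shopping": "https://example-cloud.invalid/shopping-agent",
    "general":  "https://example-cloud.invalid/general-llm",
}

def pick_model(command: str) -> str:
    """Crude intent routing: choose the model family that best fits the request."""
    text = command.lower()
    if "buy" in text or "order" in text:
        return "shopping"
    if "what am i looking at" in text or "what is this" in text:
        return "vision"
    return "general"

def handle_request(command: str, camera_jpeg: bytes) -> str:
    """Package the user command plus captured context and dispatch it to the cloud."""
    payload = {
        "command": command,
        "context_image": base64.b64encode(camera_jpeg).decode("ascii"),
    }
    response = requests.post(MODEL_ENDPOINTS[pick_model(command)], json=payload, timeout=10)
    return response.json()["answer"]  # assumed response format
```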
All of the above is about technologies, and they are progressing very fast indeed. Devices like the app-free smartphone (Deutsche Telekom, Brain.ai, Qualcomm), the Rabbit R1, and the AI Pin present rather new and bold concepts. Only a few years ago, they would not have been feasible at all. For now, these product concepts still appear immature to me. Although I marvel at the technologies (and, in the case of the AI Pin, at how much high-tech, including gesture recognition and a laser projector, has been packed into a small form factor), fantastic technologies do not necessarily translate straight away into loved user experiences, as early user feedback shows, e.g. in the case of the AI Pin and other products.
We may hope that history is not repeating itself. To quote from Wikipedia: “The Newton was considered technologically innovative at its debut, but a combination of factors, including its high price and early problems with its handwriting recognition feature, limited its sales. This led to Apple ultimately discontinuing the platform at the direction of Steve Jobs in 1998, a year after his return to the company.” But then, a repeat of history may not be that bad either, since after the Newton, great and outstanding devices appeared.
More important
What I actually find more important than the sometimes overly bold claims by startup companies are two things:
First, what Jerry Yue explains in this video on the Brain.ai website, and similarly in this interview.
Maybe something interesting and fundamentally new is indeed emerging from how foundation models and large language models can be leveraged for end-user devices. Something that rings in a new era, perhaps, where the change is in fact noticeable to end users and makes life easier and more enjoyable?
Second, I want to share how much I liked the research section of the rabbit website. Congratulations; this web page is worth reading.
The page explains what rabbit’s Large Action Model (LAM) does, namely understand human intentions on computers (whatever the form factor).
In a nutshell: Rabbit OS captures your intent through a natural language interface (just talk to the device or write your intent as a phrase). However, instead of feeding your natural language prompt to a large language model (like GPT), which would then use a plugin to interface with an API of an online web service, the LAM does not rely on any APIs. Instead, it operates the interface of the online service on your behalf, in the form of learnt actions. To quote: “The LAM can learn to see and act in the world like humans do”. Here is my analogy: just as children learn from their parents and imitate them, a LAM learns from demonstrations and imitates as well. In the case of rabbit, the LAM completes tasks in a virtual environment in the cloud.
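To make the “learning by demonstration” idea tangible, here is a hypothetical sketch of a recorded demonstration being replayed against an application’s interface. The Action type, the hand-written trace and the ui.find() abstraction are all illustrative assumptions of mine; this is not Rabbit’s LAM, which learns such traces from observation rather than having them scripted.

```python
from dataclasses import dataclass

@dataclass
class Action:
    verb: str      # "click" or "type"
    target: str    # a semantic label, not pixel coordinates
    value: str = ""

# A hand-written stand-in for a recorded human demonstration (booking a ride
# in some hypothetical web app); a LAM would learn such a trace by watching.
demonstration = [
    Action("click", "pickup field"),
    Action("type",  "pickup field", "Munich Hauptbahnhof"),
    Action("click", "destination field"),
    Action("type",  "destination field", "Airport"),
    Action("click", "request ride button"),
]

def replay(actions, ui):
    """Replay learnt actions against a (possibly slightly changed) interface.

    `ui.find(label)` is an assumed abstraction that resolves a semantic label to
    whatever widget currently plays that role, so minor layout changes do not
    break the flow.
    """
    for a in actions:
        widget = ui.find(a.target)
        if a.verb == "click":
            widget.click()
        elif a.verb == "type":
            widget.type(a.value)
```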
As Rabbit explains, their LAM builds on advances not just in neural networks, like the transformer models used in OpenAI’s GPT-4, Google’s Gemini, Meta’s Llama and many others, but also in neuro-symbolic programming (yes, still more to learn). Importantly, “the LAM allows for the direct modeling of the structure of various applications and user actions performed on them without a transitory representation, such as text.”
Rabbit’s LAM has, as they say, been trained on the most popular apps. The more apps they train it on, the more powerful it will become.
Might all this lead to something very impactful?
When it comes to interfacing with operating systems (or their kernels), it is worth noting the evolution we have witnessed over time, as Jerry Yue reminds us: from command-line interfaces (CLI) via graphical user interfaces (GUI) towards natural-language, intent-driven interaction.
The approach taken by Rabbit is remarkable: they argue that the “unavailability of application programming interfaces (API) for major service providers” is a real issue that limits what natural language interfaces can actually access. “To address this issue, we take advantage of neuro-symbolic programming to directly learn user interactions with applications, bypassing the need to translate natural language user requests into rigid APIs” – rigid like a natural-language-to-SQL interface, to give an example.
Rabbit also explains an important difference between the deep neural networks used for foundation models, from which we derive multi-modal LLMs (for text, images and videos), and a LAM: “The characteristics we desire from a LAM are also different from a foundation model that understands language or vision alone: while we may want an intelligent chatbot to be creative [and accept that it sometimes hallucinates and makes things up], LAM-learned actions on applications should be highly regular, minimalistic (per Occam’s razor), stable, and explainable”, which makes perfect sense. You don’t want a LAM that is expected to drive actions like navigating through a website, pressing buttons and entering text and numeric values to become overly creative.
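One way to picture “regular, minimalistic, stable and explainable” is to constrain an agent to a small, closed action vocabulary and to keep a human-readable trace of every step before anything is executed. The sketch below is my own illustration of that design principle, not Rabbit’s implementation.

```python
# A small, closed vocabulary of permitted UI actions.
ALLOWED_VERBS = {"click", "type", "select", "scroll"}

def validate_and_trace(plan):
    """Reject anything outside the whitelist and keep a human-readable trace,
    so every step that would be executed can be inspected afterwards."""
    trace = []
    for step in plan:  # each step: {"verb": ..., "target": ..., "value": ...}
        if step["verb"] not in ALLOWED_VERBS:
            raise ValueError(f"refusing non-whitelisted action: {step['verb']}")
        line = f"{step['verb']} -> {step['target']}"
        if step.get("value"):
            line += f" = '{step['value']}'"
        trace.append(line)
    return trace

plan = [
    {"verb": "click", "target": "search box"},
    {"verb": "type",  "target": "search box", "value": "cheapest direct flight"},
]
for line in validate_and_trace(plan):
    print(line)
```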
As mentioned, LAMs learn actions from demonstration: “LAM‘s modeling approach is rooted in imitation, or learning by demonstration: it observes a human using the interface and aims to reliably replicate the process, even if the interface is presented differently or slightly changed.”
Whether or not it proves super successful in the short term, also in terms of commercial products, this direction of research has significant potential in my opinion. As Rabbit explains, their LAM rides on the frontier of interdisciplinary scientific research in language modelling (which benefits so greatly from deep neural networks and machine learning) and in programming languages and formal methods, which open the door to symbolic techniques enabling heuristic search and logical reasoning such as induction and deduction. Rabbit’s LAM is a good example of progress in neuro-symbolic techniques.
As Gary Marcus wrote in 2020 in relation to robust AI (p44): We cannot construct rich cognitive models in an adequate, automated way without the triumvirate of hybrid architecture, rich prior knowledge, and sophisticated techniques for reasoning.
It seems we are moving in this direction, though the road might be a bit rocky and bumpy.
Update - 5 May 2024
As the "AI Smartphone" is still a young concept with some early implementations in the market, we can expect that concept to evolve over time. There might be multiple directions in which this could go, and you might have some ideas, wishes and suggestions as a user of smartphones as well. In this context it is good to see what smartphone manufacturers envisage (like Apple, Samsung, and others).
Regarding a vision or outlook for the AI Smartphone: see e.g. the "AI Smartphone White Paper" from IDC and OPPO from Feb 2024, which has been published in the form of a (PowerPoint) presentation. It's very nicely done.
Some envisaged features look futuristic and, in my opinion, will not become a reality with excellent user experience in 2024. Other features appear technically feasible, but I would not use a smartphone for them. Take video editing: why would video editing on a smartphone (outdoors in the sunshine, or indoors while I sit opposite a friend in a coffee place) make my use of time more efficient than doing it on a modern laptop or a PC with a 4K monitor? And in case you wonder, I would never do video editing with the draft video projected onto my palm from an AI OS-powered gadget. When it comes to screen size and usability, I also don't want an AI Smartphone so big that I can no longer fit it into my trouser pocket.
So, I fully agree with the statement on p4 that "Phone use needs to be more efficient", but for me that does not imply that I'm going to spend significantly more time engaged with my phone, largely gesturing and speaking to it. I'll never create spreadsheets on my phone, nor will I try software development on it, even if some code gets generated automatically for me on the phone.
Though the white paper is inspiring in its vision, I struggle with some of its messages: e.g. on p17 it more or less claims that, in comparison with today's (call them legacy) smartphones, the AI Smartphone delivers as a unique benefit a user experience that is "free of hallucinations". Really? My legacy smartphone does not hallucinate in the slightest. If I now purchase an AI Smartphone, will I be freed from the hallucinations of LLMs? Hmm, reality is different, at least in 2024.
Interestingly, while AI agents and LLMs are present in this vision paper, one piece is absent: Small Language Models (SLMs). While the word "large" shows up 15 times, my search didn't find the word "small" in the white paper. Why, given Apple's work in this area, Stability AI's Stable LM, and Microsoft's Phi-3 small language model announced on Apr 23, 2024? Let's see.
References
[4] Rabbit R1 vs Humane AI Pin – Which Should You Buy? (Design, Features, Specs, Price And Availability), YouTube.
[15] Gary Marcus, The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence, arXiv:2002.06177.