Apple WWDC - The impact on AI in the Cloud - Part 1.

The Future of AI Inference in the Cloud Part 1

At WWDC, Apple outlined its vision for AI. If you haven’t seen these announcements yet, you can read about them here. This blog explores the implications for cloud computing based on Apple’s AI developments.

Following announcements from Microsoft and Google, Apple revealed plans to enable AI models to run locally on devices. Apple has trained a 3B-parameter foundation model for on-device use, fine-tuned specifically for summarization tasks. Normally, a 3B-parameter model in FP16 would require roughly 6GB of memory for the weights alone. However, Apple’s approach of keeping the fine-tuned adapters at 16 bits, plus the other techniques outlined here, reduces memory usage to roughly 3GB. That is still a significant portion of most iPhones’ memory, which is why only high-powered devices can support Apple Intelligence.
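For intuition, here is a back-of-the-envelope memory calculation (my own sketch; it ignores activations and KV cache, and is not Apple’s published math):

```python
# Rough weight-memory estimate for a 3B-parameter model at different precisions.
# Illustrative only: ignores activations, KV cache, and Apple's exact quantization scheme.

def model_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the weights alone, in GB."""
    bytes_per_weight = bits_per_weight / 8
    return params_billions * 1e9 * bytes_per_weight / 1e9

print(f"3B @ 16-bit (FP16): ~{model_memory_gb(3, 16):.1f} GB")  # ~6 GB
print(f"3B @ 8-bit:         ~{model_memory_gb(3, 8):.1f} GB")   # ~3 GB
print(f"3B @ 4-bit:         ~{model_memory_gb(3, 4):.1f} GB")   # ~1.5 GB
```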

There is less information about the model size in the cloud. However, according to Apple’s announcements, these models will run on Apple silicon. Given the expected release schedule for Apple Intelligence, there are no reports of new Apple chips coming to market in time, so the working assumption is that this will be based on the Apple M2 Ultra. It features a 1024-bit memory bus, 800GB/s of memory bandwidth, and 192GB of memory capacity. This is a stark contrast to the Nvidia H200 (6144-bit bus, 4,800GB/s, 141GB), which will be available in most cloud providers in the same timeframe. This is most likely why Apple partnered with OpenAI (more providers to come) to deliver the more enriched experiences.
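One way to see why those specs matter: single-stream token generation is usually memory-bandwidth bound, because each decoded token effectively streams the model’s weights through memory. A crude upper-bound sketch (my own simplification; the 70B FP16 example model and the one-pass-per-token assumption are purely illustrative):

```python
# Crude upper bound on single-stream decode speed for a memory-bandwidth-bound model.
# Illustrative only: ignores batching, KV-cache reads, quantization, and compute limits.

def max_tokens_per_second(weight_gb: float, mem_bandwidth_gbs: float) -> float:
    """Assume each decoded token streams the full weight set through memory once."""
    return mem_bandwidth_gbs / weight_gb

weights_gb = 70 * 2  # a hypothetical 70B-parameter model at FP16 (~140 GB of weights)

print(f"M2 Ultra (800 GB/s): ~{max_tokens_per_second(weights_gb, 800):.0f} tokens/s")
print(f"H200 (4,800 GB/s):   ~{max_tokens_per_second(weights_gb, 4800):.0f} tokens/s")
```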

Advancements in model design have reduced the size of usable models. Additionally, a new architecture with a ‘router’ on the device decides whether a prompt is handled locally or sent to the cloud, transforming the future of inference infrastructure. When ChatGPT was first released roughly 18 months ago, the only logical place for inference was the cloud: it offered ample memory for any model size, and the primary use cases were not latency-sensitive. Since then, models have shrunk while use cases have become more complex and latency-sensitive, such as summarizing information across applications. This shift doesn’t just change where inference can occur; it also changes how platforms are built, potentially impacting cloud economics significantly.
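Apple hasn’t detailed how its router decides. Conceptually, though, it could resemble the hypothetical heuristic below (the types, thresholds, and task names are my own, purely for illustration; this is not Apple’s actual routing logic):

```python
# Hypothetical on-device router: keep simple, latency-sensitive prompts local,
# escalate larger or knowledge-heavy ones to the cloud. Conceptual sketch only.

from dataclasses import dataclass
from enum import Enum, auto

class Target(Enum):
    ON_DEVICE = auto()
    PRIVATE_CLOUD = auto()
    PARTNER_MODEL = auto()  # e.g. a third-party model, gated on user consent

@dataclass
class Request:
    prompt_tokens: int
    task: str                    # "summarize", "rewrite", "open_ended", ...
    needs_world_knowledge: bool
    user_approved_partner: bool = False

def route(req: Request, on_device_token_limit: int = 4096) -> Target:
    if req.needs_world_knowledge:
        return Target.PARTNER_MODEL if req.user_approved_partner else Target.PRIVATE_CLOUD
    if req.prompt_tokens > on_device_token_limit or req.task not in {"summarize", "rewrite"}:
        return Target.PRIVATE_CLOUD
    return Target.ON_DEVICE

# Example: a short summarization prompt stays on the device.
print(route(Request(prompt_tokens=800, task="summarize", needs_world_knowledge=False)))
```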

I believe the ship has sailed on delivering small models out of the cloud. The economics don’t make sense when you can run them on-device for “free.” But what happens when you want to deliver experiences beyond simple summarization or image generation? You lean on the cloud, in this case Apple’s cloud. While it can handle larger tasks, the latency to that cloud can undermine the AI experiences you want to deliver. For instance, Apple has data centers across the US. From my home in LA, the nearest is in Northern California. The round-trip latency from my mobile device to the AWS region in Northern California is currently 103ms. Studies indicate that latencies over 100ms are perceptible to users, while latencies over 300ms cause people to context-switch away. In this case, minimizing compute latency for inference is crucial to maintaining the quality of the user experience.
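To make that latency budget concrete, here is a simple sketch (the 300ms budget and the edge RTT figure are illustrative assumptions; only the 103ms measurement is the one quoted above):

```python
# Simple latency-budget sketch: how much of a ~300ms responsiveness budget is left
# for inference after the network round trip? The edge RTT is an assumed figure.

def compute_budget_ms(total_budget_ms: float, rtt_ms: float) -> float:
    """Time left for model compute and queuing after one network round trip."""
    return total_budget_ms - rtt_ms

for label, rtt_ms in [("Regional cloud (LA -> Northern California)", 103),
                      ("Hypothetical metro edge (e.g. an LA Local Zone)", 15)]:
    print(f"{label}: {rtt_ms}ms RTT leaves ~{compute_budget_ms(300, rtt_ms):.0f}ms for inference")
```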

This is where I believe the future of the cloud lies (at least for now) in the AI inference space. Hyperscaler clouds tend to have more globally deployed infrastructure than the average device manufacturer, which in theory lowers latency. However, in my specific case, my nearest region sits right next to Apple’s, and I am still seeing over 100ms of Round Trip Time (RTT). This is where the cloud edge can play a major part in inference going forward. Each cloud provider has its own offering here. At AWS, we have Local Zones, which place infrastructure in major metro areas, and Wavelength, which places infrastructure in telecommunications providers’ data centers. These edge locations enable AI experiences richer than what on-device models can deliver, at lower latency than centralized regions can provide. This hybrid approach ensures that as models shrink, we can still deliver robust user experiences on devices and wearables.
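As a starting point, you can already enumerate which Local Zones and Wavelength Zones are visible to an account. A minimal sketch using boto3 (assumes AWS credentials are configured, and that most of these zones still require explicit opt-in before use):

```python
# List AWS Local Zones and Wavelength Zones visible from a parent region using boto3.

import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

response = ec2.describe_availability_zones(
    AllAvailabilityZones=True,
    Filters=[{"Name": "zone-type", "Values": ["local-zone", "wavelength-zone"]}],
)

for zone in response["AvailabilityZones"]:
    print(f'{zone["ZoneName"]:<30} {zone["ZoneType"]:<16} opt-in: {zone["OptInStatus"]}')
```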

Some of these edge locations can already support the instances needed for inference, so there’s no reason this work cannot start immediately. I believe the tipping point will come when as-a-service solutions, such as AWS Bedrock or SageMaker, let you train models and then one-click deploy them to the edge, globally or regionally, much as we do today with web services and Content Delivery Networks (CDNs). This is the kind of value the cloud needs to offer if it is to hold on to inference revenue, especially for consumer-facing applications, which demand a polished user experience and will be moving to even smaller form factors such as Meta’s Ray-Bans (which, as a Dad, I am a huge fan of, by the way).

Over time, today’s more powerful models are expected to shrink to fit on devices. Meanwhile, user expectations for enriched AI experiences will likely require even larger models. This is why I think the hybrid approach could be around for a long time to come.

Part two of this blog will explore how the new model for on-device inference, supported by cloud infrastructure, could create new opportunities in the multi-cloud space.


