LLM Inference War Begins

The battle for inference and training performance is intensifying by the day. SambaNova, Cerebras, and Groq are pushing the limits of token speed with record-breaking performance on Meta’s Llama models.

Meanwhile, OpenAI’s o1 is looking to slow down inference to enhance reasoning capabilities, aka “thinking”, by allocating more compute resources to inference and allowing models to interact with external tools for deeper analysis, rather than relying solely on pre-training on large datasets.

This development comes against the backdrop of Oracle launching the world’s first zettascale cloud computing cluster, powered by NVIDIA’s Blackwell GPUs. It offers up to 131,072 NVIDIA B200 GPUs, roughly six times more than other cloud hyperscalers like AWS, Azure, and Google Cloud, and delivers 2.4 zettaFLOPS of peak performance, while competitors have only reached the exascale level.
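As a rough sanity check (a back-of-envelope sketch, assuming the 2.4 zettaFLOPS figure is aggregate peak low-precision AI throughput rather than sustained FP64 performance), the cluster-level number implies a per-GPU peak in the high teens of petaFLOPS:

```python
# Back-of-envelope check of Oracle's zettascale claim (assumption: the
# 2.4 zettaFLOPS figure is peak low-precision AI throughput across the
# full 131,072-GPU cluster, not sustained FP64 performance).
cluster_flops = 2.4e21    # 2.4 zettaFLOPS
num_gpus = 131_072        # maximum B200 GPUs in the cluster

per_gpu_flops = cluster_flops / num_gpus
print(f"Implied per-GPU peak: {per_gpu_flops / 1e15:.1f} petaFLOPS")
# -> roughly 18 petaFLOPS per B200, i.e. a low-precision peak rating
```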

The need for speed in AI is only growing (see the infographic below).

The infographic shows emerging trends in the AI hardware sector. At the forefront, companies like AMD, NVIDIA, and Intel (shown in vibrant RGB colours) continue to drive advancements in GPUs and PCs. They are supported by foundational design and foundry firms such as TSMC, SMIC, UMC, Arm, and SiFive (illustrated in white), whose expertise in chip manufacturing underpins the latest AI innovations.

Companies like Nokia, Cisco, and Fujitsu (represented in aqua-blue) play an essential role in data management and transmission, ensuring seamless integration across this complex ecosystem.

Meanwhile, an exciting development is taking place in the training and inference arena (highlighted in yellow), with firms like Oracle, Azure, GCP, and AWS, together with SambaNova, Cerebras, and Groq, spearheading support for the next generation of AI systems and pushing technological boundaries like never before.

The Battle for LLM Inference is Heating Up

SambaNova recently launched its cloud inference platform, giving developers access to Llama 3.1 models, including the 8B, 70B, and 405B versions, on its custom AI chip, the SN40L. The platform has set a new record for inference on Meta’s Llama 3.1 405B, serving the model in native 16-bit precision and achieving 132 output tokens per second.

Notably, among the three—Groq, Cerebras, and SambaNova—it is the only platform offering Llama 3.1 405B. “The ecosystem around Llama is continuing to push the limits. SambaNova Cloud is setting a new bar for inference on 405B and it’s available for developers to start building today,” posted AI at Meta on X.        

“Fast inference is no longer a nice-to-have demo, it will be the driving force behind future frontier models. Time to switch over to custom AI hardware and short NVIDIA,” said Zoltan Csaki, machine learning engineer at SambaNova.

The API inference offering is powered by SambaNova’s SN40L custom AI chip, which features the company’s Reconfigurable Dataflow Unit (RDU) architecture. Manufactured on TSMC’s 5 nm process, the SN40L combines DRAM, HBM3, and SRAM on each chip.

The RDU’s architecture is built around streaming dataflow, which allows multiple operations to be combined into one process, removing the need for manual programming. This delivers faster performance by using a blend of different parallelism techniques, such as pipeline, data, and tensor parallelism, all supported by the hardware.
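To get an intuition for what combining multiple operations into one process means, here is a conceptual sketch in plain Python/NumPy. It is not SambaNova’s API, and in NumPy both versions execute the same way; the point is only the programming-model idea of treating a chain of operations as a single dataflow graph rather than separate kernels.

```python
import numpy as np

def unfused(x, w1, w2):
    h = x @ w1            # intermediate materialised in memory
    h = np.maximum(h, 0)  # separate pass over the data
    return h @ w2         # third pass

def fused(x, w1, w2):
    # The same maths expressed as one pipeline; a dataflow compiler can keep
    # intermediates on-chip instead of round-tripping them through DRAM.
    return np.maximum(x @ w1, 0) @ w2

x = np.random.randn(4, 512).astype(np.float32)
w1 = np.random.randn(512, 2048).astype(np.float32)
w2 = np.random.randn(2048, 512).astype(np.float32)
assert np.allclose(unfused(x, w1, w2), fused(x, w1, w2), atol=1e-3)
```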

Cerebras recently announced that its inference service, Cerebras Inference, delivers 1,800 tokens per second on the Llama 3.1 8B model and 450 tokens per second on the Llama 3.1 70B model, making it 20 times faster than NVIDIA GPU-based hyperscale clouds.

According to Artificial Analysis, Llama 3.1 8B models running on NVIDIA H100 systems across hyperscalers and specialised cloud providers delivered speeds ranging from 72 to 257 tokens per second, with AWS reporting 93 tokens per second for the same workload.
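Taking the published figures above at face value, a quick comparison shows where the “20 times faster” claim comes from:

```python
# Quick comparison of the throughput figures cited above
# (numbers as reported by Cerebras and Artificial Analysis).
cerebras_8b = 1800          # tokens/s, Cerebras Inference, Llama 3.1 8B
aws_h100_8b = 93            # tokens/s, AWS H100 figure cited above
fastest_h100_8b = 257       # tokens/s, fastest H100-based provider cited above

print(f"vs AWS:     {cerebras_8b / aws_h100_8b:.1f}x")     # ~19x, roughly the '20 times faster' claim
print(f"vs fastest: {cerebras_8b / fastest_h100_8b:.1f}x")  # ~7x against the best H100 result
```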

Cerebras Inference is powered by the Cerebras CS-3 system and its advanced AI processor, the Wafer Scale Engine 3 (WSE-3). Unlike traditional GPUs, which require trade-offs between speed and capacity, the CS-3 offers top-tier performance for individual users while maintaining high throughput.

The WSE-3’s massive size allows it to support many users simultaneously while delivering impressive speed. With 7,000 times more memory bandwidth than NVIDIA’s H100, the WSE-3 addresses memory bandwidth, the core technical challenge of generative AI.

Cerebras addresses the inherent memory bandwidth limitations of GPUs, which require model weights to be moved from memory to the compute cores for every output token. This process results in slow inference speeds, particularly for large language models like Llama 3.1 70B, which has 70 billion parameters and requires 140 GB of memory at 16-bit precision.
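To see why bandwidth, rather than raw compute, sets the ceiling here, consider a minimal back-of-envelope sketch. It assumes batch size 1, 16-bit weights, every weight read once per generated token, KV-cache traffic ignored, and roughly 3.35 TB/s of HBM bandwidth on a single H100 SXM; real deployments use batching, multi-GPU setups, and quantisation to do better.

```python
# Rough illustration of why single-stream decode speed is memory-bandwidth bound.
params = 70e9                      # Llama 3.1 70B parameters
bytes_per_param = 2                # 16-bit precision
hbm_bandwidth = 3.35e12            # bytes/s, H100 SXM HBM3 (assumed)

weight_bytes = params * bytes_per_param           # ~140 GB, as in the text
seconds_per_token = weight_bytes / hbm_bandwidth  # ~0.042 s per token
print(f"Upper bound: ~{1 / seconds_per_token:.0f} tokens/s per GPU for a single stream")
```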

Cerebras Inference supports models from billions to trillions of parameters. For models exceeding the memory capacity of a single wafer, Cerebras splits them at layer boundaries and maps them to multiple CS-3 systems. Larger models, such as Llama 3.1 405B and Mistral Large, are expected to be supported in the coming weeks.
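For illustration only (this is not Cerebras’ software), splitting a model at layer boundaries amounts to assigning contiguous blocks of layers to each system and streaming activations between them in sequence:

```python
# Illustrative sketch of layer-boundary partitioning across multiple systems:
# each system holds a contiguous slice of layers.
def partition_layers(num_layers: int, num_systems: int) -> list[range]:
    """Assign contiguous blocks of layers to each system."""
    per_system, remainder = divmod(num_layers, num_systems)
    slices, start = [], 0
    for i in range(num_systems):
        size = per_system + (1 if i < remainder else 0)
        slices.append(range(start, start + size))
        start += size
    return slices

# e.g. an 80-layer model spread over 3 systems
for system_id, layers in enumerate(partition_layers(80, 3)):
    print(f"system {system_id}: layers {layers.start}-{layers.stop - 1}")
```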

Nothing like Groq

Groq recently achieved a speed of 544 tokens per second on the Llama 3.1 70B model and 752 tokens per second on the Llama 3.1 8B model, according to Artificial Analysis.

Founded in 2016 by Jonathan Ross, Groq distinguishes itself by eschewing GPUs in favour of its proprietary hardware, the language processing unit (LPU). The company recently raised $640 million in a Series D funding round, bringing its valuation to $2.8 billion. Most recently, it announced a partnership with Aramco Digital to establish the world’s largest inferencing data centre in Saudi Arabia.

Groq’s LPU challenges traditional GPU makers like NVIDIA, AMD, and Intel with its tensor streaming processor, built solely for faster deep learning computations. The LPU is designed to overcome the two main LLM bottlenecks: compute density and memory bandwidth.

For LLM workloads, an LPU has greater compute capacity than a GPU or CPU, reducing the time taken to compute each word and allowing text sequences to be generated much faster. Additionally, eliminating external memory bottlenecks enables the LPU inference engine to deliver orders-of-magnitude better performance on LLMs compared to GPUs.

The LPU is designed to prioritise the sequential processing of data, which is inherent in language tasks. This contrasts with GPUs, which are optimised for parallel processing tasks such as graphics rendering.
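A simple way to see that sequential dependency (a sketch, with `model` as a hypothetical stand-in for the forward pass of any autoregressive LLM) is the token-by-token generation loop, where each step must finish before the next can begin:

```python
# Minimal sketch of why text generation is inherently sequential: each new
# token depends on all previously generated tokens, so per-token latency
# (not just raw parallel throughput) dominates.
def generate(model, prompt_tokens: list[int], max_new_tokens: int) -> list[int]:
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = model(tokens)   # must complete before the next step can start
        tokens.append(next_token)
    return tokens
```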

Enjoy the full story here.


Top Stories of the Week >>

Oracle’s Multi-Cloud Era Begins

Oracle has officially entered the multi-cloud era. At Oracle CloudWorld 2024 in Las Vegas, the company revealed its latest partnership with AWS, completing its multi-cloud strategy. This follows its earlier collaborations with Microsoft Azure and Google Cloud.

“We’re entering a new phase where services on different clouds work gracefully together. The clouds are becoming open; they’re no longer walled gardens. Customers will have choices and can use multiple clouds,” said Oracle CTO Larry Ellison in his keynote speech at the event, which AIM was a part of.

Read the full story here.

Can OpenAI o1 Save GitHub Copilot from Cursor?

Tech X has been on fire since OpenAI dropped o1, or what some call Strawberry or Q*. The new model from OpenAI is solving complex problems with its improved reasoning capabilities in areas such as science, coding, and maths. However, the real treat for developers was o1-mini, a smaller model designed specifically for advanced coding.

Meanwhile, Cursor and Claude continue to dominate developers’ minds and have been successful in making people shift away from GitHub Copilot for certain use cases. According to sources, Microsoft plans to upgrade its capabilities on the VS Code IDE, which would help it compete with Cursor, but what about GitHub Copilot? Read to find out.


People & Tech >>

How Lightstorm is Transforming India’s Network Infrastructure

For years, India’s terrestrial networks were a challenge, with frequent outages causing global enterprises to avoid routing through the country. Lightstorm, a leading cloud and data centre connectivity solution provider, set out to change this narrative by tackling the problem from the ground up. Read on.


AIM Research >>

Here’s Your Ultimate AI Vendor Database for Strategic Business Decisions

AIM Research has launched VendorAI, a comprehensive database profiling over 200 AI vendors, offering businesses valuable insights for strategic decision-making across mergers and acquisitions, competitive intelligence, and vendor selection. Access it here.


AI Events >>

Breaking Down Data Silos

Join the DECODE Webinar to explore how cloud technology can break down data silos and enable seamless data integration, featuring insights from industry leaders at DBS Bank and Google. Register here.


AI Bytes >>

  • AI godmother Fei-Fei Li, alongside Justin Johnson, Christoph Lassner, and Ben Mildenhall, has co-founded World Labs, securing $230 million to develop AI capable of understanding and interacting with the 3D physical world.
  • Google DeepMind has unveiled ALOHA Unleashed and DemoStart, groundbreaking AI systems that dramatically enhance robot dexterity, enabling tasks like tying shoelaces and inserting gears with human-like precision, while using advanced reinforcement learning to cut training time by 100x.
  • Y Combinator, led by Garry Tan, is expanding to four batches a year starting in 2025, marking its biggest shift since 2005 to accommodate the growing influx of AI startups.
  • Google has introduced DataGemma, a new open model that integrates LLMs with real-world data from its Data Commons repository, using retrieval-augmented methods like RIG and RAG to reduce AI hallucinations and improve the accuracy of generative AI outputs in research and decision-making contexts.
  • AWS has selected seven Indian startups—Converse, House of Models, Neural Garage, Orbo.ai, Phot.ai, Unscript AI, and Zocket—for its Global Generative AI Accelerator program, offering up to $1 million in credits, mentorship, and technical support to scale their AI innovations.
  • Progress has acquired ShareFile, a cloud-based collaboration platform, for $875 million, adding over $240 million in annual revenue and 86,000 customers to its portfolio.
  • Klarna CEO Sebastian Siemiatkowski recently announced the company’s decision to end relationships with Salesforce and Workday, streamlining its tech stack through AI initiatives to enhance efficiency and reduce the workforce by 50%.
  • Hume AI introduced EVI 2, a voice-to-voice AI model designed for human-like, emotionally intelligent conversations with multilingual support, adaptive personalities, and a focus on preserving voice identity without cloning.
