LLM Inference War Begins

The battle for inference and training performance is intensifying by the day. SambaNova, Cerebras, and Groq are pushing the limits of token speed with record-breaking performance on Meta’s Llama models.

Meanwhile, OpenAI’s o1 is looking to slow down inference to enhance reasoning capabilities, aka “thinking”, by allocating more compute resources to inference and allowing models to interact with external tools for deeper analysis, rather than relying solely on pre-training on large datasets.

This development comes against the backdrop of Oracle launching the world’s first zettascale cloud computing cluster, powered by NVIDIA’s Blackwell GPUs. It offers up to 131,072 NVIDIA B200 GPUs, roughly six times more than other cloud hyperscalers like AWS, Azure, and Google Cloud, and delivers 2.4 zettaFLOPS of peak performance, while competitors have only reached the exascale level.
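As a rough sanity check (a back-of-envelope sketch, assuming the 2.4 zettaFLOPS figure is aggregate peak low-precision AI throughput rather than sustained FP64 performance), the cluster-level number implies a per-GPU peak in the high teens of petaFLOPS:

```python
# Back-of-envelope check of Oracle's zettascale claim (assumption: the
# 2.4 zettaFLOPS figure is peak low-precision AI throughput across the
# full 131,072-GPU cluster, not sustained FP64 performance).
cluster_flops = 2.4e21    # 2.4 zettaFLOPS
num_gpus = 131_072        # maximum B200 GPUs in the cluster

per_gpu_flops = cluster_flops / num_gpus
print(f"Implied per-GPU peak: {per_gpu_flops / 1e15:.1f} petaFLOPS")
# -> roughly 18 petaFLOPS per B200, i.e. a low-precision peak rating
```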

The need for speed in AI is only growing (see the infographic below).

The infographic shows emerging trends in the AI hardware sector. At the forefront, companies like AMD, NVIDIA, and Intel (shown in vibrant RGB colours) continue to drive advancements in GPUs and PCs. They are supported by foundational design and foundry firms such as TSMC, SMIC, UMC, Arm, and SiFive (illustrated in white), whose expertise in chip manufacturing underpins the latest AI innovations.

Companies like Nokia, Cisco, and Fujitsu (represented in aqua-blue) play an essential role in data management and transmission, ensuring seamless integration across this complex ecosystem.

Meanwhile, an exciting development is taking place in the training and inference arena (highlighted in yellow), with firms like Oracle, Azure, GCP, and AWS, together with SambaNova, Cerebras, and Groq, spearheading support for the next generation of AI systems and pushing technological boundaries like never before.

The Battle for LLM Inference is Heating Up

SambaNova recently launched its cloud inference platform, giving developers access to Llama 3.1 models, including the 8B, 70B, and 405B versions, on its custom AI chip, the SN40L. The platform has set a new record for inference on Meta’s Llama 3.1 405B, serving the model in native 16-bit precision and achieving 132 output tokens per second.

Notably, among the three—Groq, Cerebras, and SambaNova—it is the only platform offering Llama 3.1 405B. “The ecosystem around Llama is continuing to push the limits. SambaNova Cloud is setting a new bar for inference on 405B and it’s available for developers to start building today,” posted AI at Meta on X.        

“Fast inference is no longer a nice-to-have demo, it will be the driving force behind future frontier models. Time to switch over to custom AI hardware and short NVIDIA,” said Zoltan Csaki, machine learning engineer at SambaNova.

The API inference offering is powered by SambaNova’s SN40L custom AI chip, which features the company’s Reconfigurable Dataflow Unit (RDU) architecture. Manufactured on TSMC’s 5 nm process, the SN40L combines DRAM, HBM3, and SRAM on each chip.

The RDU’s architecture is built around streaming dataflow, which allows multiple operations to be combined into one process, removing the need for manual programming. This delivers faster performance by using a blend of different parallelism techniques, such as pipeline, data, and tensor parallelism, all supported by the hardware.
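To get an intuition for what combining multiple operations into one process means, here is a conceptual sketch in plain Python/NumPy. It is not SambaNova’s API, and in NumPy both versions execute the same way; the point is only the programming-model idea of treating a chain of operations as a single dataflow graph rather than separate kernels.

```python
import numpy as np

def unfused(x, w1, w2):
    h = x @ w1            # intermediate materialised in memory
    h = np.maximum(h, 0)  # separate pass over the data
    return h @ w2         # third pass

def fused(x, w1, w2):
    # The same maths expressed as one pipeline; a dataflow compiler can keep
    # intermediates on-chip instead of round-tripping them through DRAM.
    return np.maximum(x @ w1, 0) @ w2

x = np.random.randn(4, 512).astype(np.float32)
w1 = np.random.randn(512, 2048).astype(np.float32)
w2 = np.random.randn(2048, 512).astype(np.float32)
assert np.allclose(unfused(x, w1, w2), fused(x, w1, w2), atol=1e-3)
```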

Cerebras recently announced that its inference service, Cerebras Inference, delivers 1,800 tokens per second on the Llama 3.1 8B model and 450 tokens per second on the Llama 3.1 70B model, making it 20 times faster than NVIDIA GPU-based hyperscale clouds.

According to Artificial Analysis, Llama 3.1 8B models running on NVIDIA H100 systems across hyperscalers and specialised cloud providers delivered speeds ranging from 72 to 257 tokens per second, with AWS reporting 93 tokens per second for the same workload.
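Taking the published figures above at face value, a quick comparison shows where the “20 times faster” claim comes from:

```python
# Quick comparison of the throughput figures cited above
# (numbers as reported by Cerebras and Artificial Analysis).
cerebras_8b = 1800          # tokens/s, Cerebras Inference, Llama 3.1 8B
aws_h100_8b = 93            # tokens/s, AWS H100 figure cited above
fastest_h100_8b = 257       # tokens/s, fastest H100-based provider cited above

print(f"vs AWS:     {cerebras_8b / aws_h100_8b:.1f}x")     # ~19x, roughly the '20 times faster' claim
print(f"vs fastest: {cerebras_8b / fastest_h100_8b:.1f}x")  # ~7x against the best H100 result
```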

Cerebras Inference is powered by the Cerebras CS-3 system and its advanced AI processor, the Wafer Scale Engine 3 (WSE-3). Unlike traditional GPUs, which require trade-offs between speed and capacity, the CS-3 offers top-tier performance for individual users while maintaining high throughput.

The WSE-3’s massive size allows it to support many users simultaneously while delivering impressive speed. With 7,000 times more memory bandwidth than NVIDIA’s H100, the WSE-3 addresses memory bandwidth, the core technical challenge of generative AI.

Cerebras addresses the inherent memory bandwidth limitations of GPUs, which require model weights to be moved from memory to the compute cores for every output token. This process results in slow inference speeds, particularly for large language models like Llama 3.1 70B, which has 70 billion parameters and requires 140 GB of memory at 16-bit precision.
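To see why bandwidth, rather than raw compute, sets the ceiling here, consider a minimal back-of-envelope sketch. It assumes batch size 1, 16-bit weights, every weight read once per generated token, KV-cache traffic ignored, and roughly 3.35 TB/s of HBM bandwidth on a single H100 SXM; real deployments use batching, multi-GPU setups, and quantisation to do better.

```python
# Rough illustration of why single-stream decode speed is memory-bandwidth bound.
params = 70e9                      # Llama 3.1 70B parameters
bytes_per_param = 2                # 16-bit precision
hbm_bandwidth = 3.35e12            # bytes/s, H100 SXM HBM3 (assumed)

weight_bytes = params * bytes_per_param           # ~140 GB, as in the text
seconds_per_token = weight_bytes / hbm_bandwidth  # ~0.042 s per token
print(f"Upper bound: ~{1 / seconds_per_token:.0f} tokens/s per GPU for a single stream")
```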

Cerebras Inference supports models from billions to trillions of parameters. For models exceeding the memory capacity of a single wafer, Cerebras splits them at layer boundaries and maps them to multiple CS-3 systems. Larger models, such as Llama 3.1 405B and Mistral Large, are expected to be supported in the coming weeks.
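For illustration only (this is not Cerebras’ software), splitting a model at layer boundaries amounts to assigning contiguous blocks of layers to each system and streaming activations between them in sequence:

```python
# Illustrative sketch of layer-boundary partitioning across multiple systems:
# each system holds a contiguous slice of layers.
def partition_layers(num_layers: int, num_systems: int) -> list[range]:
    """Assign contiguous blocks of layers to each system."""
    per_system, remainder = divmod(num_layers, num_systems)
    slices, start = [], 0
    for i in range(num_systems):
        size = per_system + (1 if i < remainder else 0)
        slices.append(range(start, start + size))
        start += size
    return slices

# e.g. an 80-layer model spread over 3 systems
for system_id, layers in enumerate(partition_layers(80, 3)):
    print(f"system {system_id}: layers {layers.start}-{layers.stop - 1}")
```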

Nothing like Groq

Groq recently achieved a speed of 544 tokens per second on the Llama 3.1 70B model and 752 tokens per second on the Llama 3.1 8B model, according to Artificial Analysis.

Founded in 2016 by Jonathan Ross, Groq distinguishes itself by eschewing GPUs in favour of its proprietary hardware, the language processing unit (LPU). The company recently raised $640 million in a Series D funding round, bringing its valuation to $2.8 billion. Most recently, it announced a partnership with Aramco Digital to establish the world’s largest inferencing data centre in Saudi Arabia.

Groq’s LPU challenges traditional GPU makers like NVIDIA, AMD, and Intel with its tensor streaming processor, built solely for faster deep learning computations. The LPU is designed to overcome the two main LLM bottlenecks: compute density and memory bandwidth.

For LLM workloads, an LPU has greater compute capacity than a GPU or CPU, reducing the time taken to compute each word and allowing text sequences to be generated much faster. Additionally, eliminating external memory bottlenecks enables the LPU inference engine to deliver orders-of-magnitude better performance on LLMs compared to GPUs.

The LPU is designed to prioritise the sequential processing of data, which is inherent in language tasks. This contrasts with GPUs, which are optimised for parallel processing tasks such as graphics rendering.
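A simple way to see that sequential dependency (a sketch, with `model` as a hypothetical stand-in for the forward pass of any autoregressive LLM) is the token-by-token generation loop, where each step must finish before the next can begin:

```python
# Minimal sketch of why text generation is inherently sequential: each new
# token depends on all previously generated tokens, so per-token latency
# (not just raw parallel throughput) dominates.
def generate(model, prompt_tokens: list[int], max_new_tokens: int) -> list[int]:
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = model(tokens)   # must complete before the next step can start
        tokens.append(next_token)
    return tokens
```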

Enjoy the full story here.


Top Stories of the Week >>

Oracle’s Multi-Cloud Era Begins

Oracle has officially entered the multi-cloud era. At Oracle CloudWorld 2024 in Las Vegas, the company revealed its latest partnership with AWS, completing its multi-cloud strategy. This follows its earlier collaborations with Microsoft Azure and Google Cloud.

“We’re entering a new phase where services on different clouds work gracefully together. The clouds are becoming open; they’re no longer walled gardens. Customers will have choices and can use multiple clouds,” said Oracle CTO Larry Ellison in his keynote speech at the event, which AIM was a part of.

Read the full story here.

Can OpenAI o1 Save GitHub Copilot from Cursor?

Tech X has been on fire since OpenAI dropped o1, or what some call Strawberry or Q*. The new model from OpenAI is solving complex problems with its improved reasoning capabilities in areas such as science, coding, and maths. However, the real treat for developers was o1-mini, a smaller model designed specifically for advanced coding.

Meanwhile, Cursor and Claude continue to dominate developers’ minds and have been successful in making people shift away from GitHub Copilot for certain use cases. According to sources, Microsoft plans to upgrade its capabilities on the VS Code IDE, which would help it compete with Cursor, but what about GitHub Copilot? Read to find out.


People & Tech >>

How Lightstorm is Transforming India’s Network Infrastructure

For years, India’s terrestrial networks were a challenge, with frequent outages causing global enterprises to avoid routing through the country. Lightstorm, a leading cloud and data centre connectivity solution provider, set out to change this narrative by tackling the problem from the ground up. Read on.


AIM Research >>

Here’s Your Ultimate AI Vendor Database for Strategic Business Decisions

AIM Research has launched VendorAI, a comprehensive database profiling over 200 AI vendors, offering businesses valuable insights for strategic decision-making across mergers and acquisitions, competitive intelligence, and vendor selection. Access it here.


AI Events >>

Breaking Down Data Silos

Join the DECODE Webinar to explore how cloud technology can break down data silos and enable seamless data integration, featuring insights from industry leaders at DBS Bank and Google. Register here.


AI Bytes >>

  • AI godmother Fei-Fei Li, alongside Justin Johnson, Christoph Lassner, and Ben Mildenhall, has co-founded World Labs, securing $230 million to develop AI capable of understanding and interacting with the 3D physical world.
  • Google DeepMind has unveiled ALOHA Unleashed and DemoStart, groundbreaking AI systems that dramatically enhance robot dexterity, enabling tasks like tying shoelaces and inserting gears with human-like precision, while using advanced reinforcement learning to cut training time by 100x.
  • Y Combinator, led by Garry Tan, is expanding to four batches a year starting in 2025, marking its biggest shift since 2005 to accommodate the growing influx of AI startups.
  • Google has introduced DataGemma, a new open model that integrates LLMs with real-world data from its Data Commons repository, using retrieval-augmented methods like RIG and RAG to reduce AI hallucinations and improve the accuracy of generative AI outputs in research and decision-making contexts.
  • AWS has selected seven Indian startups—Converse, House of Models, Neural Garage, Orbo.ai, Phot.ai, Unscript AI, and Zocket—for its Global Generative AI Accelerator program, offering up to $1 million in credits, mentorship, and technical support to scale their AI innovations.
  • Progress has acquired ShareFile, a cloud-based collaboration platform, for $875 million, adding over $240 million in annual revenue and 86,000 customers to its portfolio.
  • Klarna CEO Sebastian Siemiatkowski recently announced the company’s decision to end relationships with Salesforce and Workday, streamlining its tech stack through AI initiatives to enhance efficiency and reduce the workforce by 50%.
  • Hume AI introduced EVI 2, a voice-to-voice AI model designed for human-like, emotionally intelligent conversations with multilingual support, adaptive personalities, and a focus on preserving voice identity without cloning.
