Some reflections from Hot Chips 2024 (part I)

I attended the Hot Chips conference from Sunday to Tuesday at Stanford and was thoroughly impressed by the gathering of many of the world's processor designers, "freely" sharing their technical accomplishments (quotation marks because the degree of sharing varies). Hot Chips was established 35 years ago, in 1989, as a conference focused on high-performance microprocessors and related integrated circuits. It brings together chip designers, computer architects, system engineers, and other industry professionals to exchange ideas and showcase new developments. To me, it is a truly unique forum, and I am glad that two Celesta Capital portfolio companies, Eliyan Corporation and SambaNova Systems (both with strong ties to my alma mater, Stanford), are sponsors of the conference.

Some reflections:

Nvidia AI for Chip Design Research

Assisted Hardware Design (Bryan Chin, Stelios Diamantidis, Mark H. R., Hanxian Huang): The title was intended to be provocative: "Will AI Elevate or Replace Hardware Engineers?" As is usually the case, the answer is most likely YES to both. It is safe to say that most chip companies will adopt AI in their design flow (e.g., Nvidia above) or run the risk of falling behind in productivity, especially given the massive complexity and cost of building a state-of-the-art SoC or, in the case of many AI/ML processors, a complex system. I would not be too quick to replace design steps that are well handled by existing EDA tools, but would instead focus on obvious areas where innovation is necessary. I am excited for all the startups pursuing this area, including our portfolio company, Normal Computing, tackling some of the most difficult problems.


Direct Liquid Cooling (Supermicro)

Cooling of Hot Chips: The power consumption of AI processors is mind-boggling and now a limiting factor for overall system performance. For example, the Nvidia H100 consumes up to 700W and the Blackwell up to 1200W; racks can range from 30kW to 120kW, and an AI cluster with 22,000 H100 GPUs can demand about 31 MW of power. The adoption of liquid cooling, with its higher thermal conductivity, provides an opportunity to reduce cooling power, electricity costs, and noise. Direct Liquid Cooling is becoming a key consideration for racks over 20kW. Immersion Liquid Cooling takes it even further: the entire server is immersed in a dielectric coolant that cools all the components. Another implication is the reliability of large AI systems relative to cooling, and how that factors into the training of larger and larger models; this was covered in OpenAI's session later as well. I am actively investigating startups with novel solutions to power management and cooling.
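To put these numbers in perspective, here is a back-of-the-envelope estimate in Python. The per-GPU host overhead and the PUE (power usage effectiveness) figures are my own illustrative assumptions, not vendor data; only the GPU count and TDP come from the figures above.

```python
# Back-of-the-envelope cluster power estimate using the figures above.
# Host overhead and PUE values are assumptions for illustration,
# not vendor-published numbers.

GPU_COUNT = 22_000
GPU_POWER_W = 700          # H100 TDP cited above
HOST_OVERHEAD = 0.60       # assumed: CPUs, NICs, fans, etc., as a fraction of GPU power
PUE_AIR = 1.5              # assumed PUE with traditional air cooling
PUE_LIQUID = 1.15          # assumed PUE with direct liquid cooling

it_power_mw = GPU_COUNT * GPU_POWER_W * (1 + HOST_OVERHEAD) / 1e6
print(f"IT load:           {it_power_mw:.1f} MW")
print(f"Facility (air):    {it_power_mw * PUE_AIR:.1f} MW")
print(f"Facility (liquid): {it_power_mw * PUE_LIQUID:.1f} MW")
```

Under these assumptions the facility totals bracket the ~31 MW cited above, and the gap between the air-cooled and liquid-cooled lines is exactly the prize that cooling startups are chasing.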

High-Performance Processors

Qualcomm Oryon CPUs

Gerard Williams III presented the Snapdragon X Elite Qualcomm Oryon CPU, which is largely based on the NUVIA Inc acquisition, another Celesta Capital-led investment. The CPU consists of three 4-CPU clusters, each with its own L2 cache and bus interface unit. Without going into details, the Oryon represents a well-balanced CPU with careful attention to branch prediction, prefetching algorithms, and overall throughput, with features such as a 600+ entry re-order buffer designed to retire 8 instructions per cycle. The vector execution supports a variety of data types, important for AI/ML workloads. Overall, the Oryon is an impressive CPU that Qualcomm will apply to a broad range of applications by integrating it into a variety of SoCs.
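As a rough sanity check on those two numbers, Little's law relates window size, retire width, and the average latency the core can hide: entries ≈ width × latency. The interpretation below is mine, using only the figures from the talk.

```python
# Little's law applied to the out-of-order window: sustaining a retire
# rate of W instructions/cycle while tolerating an average in-flight
# latency of L cycles needs roughly W * L re-order buffer entries.
# Figures are from the talk; the calculation is my own back-of-envelope.

ROB_ENTRIES = 600      # "600+" per the presentation
RETIRE_WIDTH = 8       # instructions retired per cycle

latency_hidden = ROB_ENTRIES / RETIRE_WIDTH
print(f"A {ROB_ENTRIES}-entry ROB retiring {RETIRE_WIDTH}/cycle can cover "
      f"~{latency_hidden:.0f} cycles of average instruction latency")
```

Roughly 75 cycles of latency tolerance at full width, which is comfortably enough to ride out an L2 hit and helps explain the emphasis on prefetching to keep misses from going further out.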


Intel Lunar Lake

Arik Gihon from Intel presented Lunar Lake, Intel's flagship SoC for the next generation of AI PCs. The first thing I noticed is the emphasis on leveraging advanced packaging with a memory stack, compute tile, controller tile, and base tile. The base tile sits on the fiberglass substrate with fine, high-density wiring to the 2 LPDDR5X memory chips; the benefit is a 40% reduction in memory physical-layer power. Power delivery is done with 4 PMICs providing more power rails, dynamic voltage, and telemetry, with claimed ML-based workload classification and frequency control. The use of glass substrates, with their advantages, is an important trend to keep an eye on.
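Intel did not disclose how its ML-based classifier works, so purely as a sketch of the general idea, here is a toy rule-based stand-in: classify the workload from telemetry, then pick a clock accordingly. All thresholds, classes, and frequencies below are invented for illustration.

```python
# Minimal sketch of telemetry-driven frequency selection. This is NOT
# Intel's algorithm; it is a toy rule-based stand-in for a learned
# classifier. All thresholds and frequencies are made-up placeholders.

def classify_workload(ipc: float, mem_stall_frac: float) -> str:
    """Crude stand-in for an ML workload classifier."""
    if mem_stall_frac > 0.5:
        return "memory-bound"
    if ipc > 2.0:
        return "compute-bound"
    return "mixed"

def pick_frequency_mhz(workload: str) -> int:
    # Memory-bound code gains little from high clocks, so save power.
    return {"memory-bound": 1800, "mixed": 2800, "compute-bound": 3800}[workload]

for ipc, stall in [(2.6, 0.1), (0.7, 0.7), (1.5, 0.3)]:
    w = classify_workload(ipc, stall)
    print(f"ipc={ipc}, stalls={stall:.0%} -> {w}: {pick_frequency_mhz(w)} MHz")
```

The appeal of doing this with a learned model rather than fixed thresholds is that the classifier can adapt to workload mixes the designers never enumerated.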

Specialized Processors


Tenstorrent Blackhole


Tenstorrent Software Ecosystem

Jasmina Vasiljevic and Davor Capalija from Tenstorrent presented Blackhole and TT-Metalium: The Standalone AI Computer and Programming Model. The Blackhole includes 140 Tensix++ cores at 6nm achieving 745 FP8 TOPS, 512 GB/s of GDDR6, 10x400Gbps Ethernet, and 16 RISC-V CPU cores. The architecture incorporates 752 "baby" RISC-V CPUs, matched with compute, data movement, and storage blocks as "user kernel microcontrollers". Each Tensix core includes 5 baby RISC-Vs with a 32-bit ISA. Two independent 2-dimensional torus NoCs are then used to connect the Tensix++ cores. The Ethernet links are used to scale out multiple Blackholes; with 32 chips in a 4x8 mesh, the Blackholes can serve as AI, memory, or switching appliances. Tenstorrent is leveraging open-source software with an MLIR-based compiler, a 3rd-party training compiler, and vLLM. I can definitely see Jim Keller's fingerprints throughout, with a modular, well-designed architecture and framework. The simplicity, scalability, and modularity enable the software to efficiently map to the underlying hardware.
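As a small illustration of why a torus topology is attractive for a NoC, here is a sketch of wraparound neighbor addressing and shortest-path hop counts. The grid dimensions are assumed for illustration; this is not the actual Tensix core layout or routing algorithm.

```python
# Neighbor addressing on a 2D torus NoC: unlike a plain mesh, edge
# nodes wrap around, so worst-case hop counts are halved. The grid
# dimensions are illustrative assumptions, not the Blackhole layout.

ROWS, COLS = 10, 14   # assumed grid (Blackhole has 140 Tensix cores)

def torus_neighbors(r: int, c: int) -> list[tuple[int, int]]:
    """North/south/west/east neighbors, with wraparound."""
    return [((r - 1) % ROWS, c), ((r + 1) % ROWS, c),
            (r, (c - 1) % COLS), (r, (c + 1) % COLS)]

def torus_hops(a: tuple[int, int], b: tuple[int, int]) -> int:
    """Shortest hop count between two nodes, taking the wrap if shorter."""
    dr = min(abs(a[0] - b[0]), ROWS - abs(a[0] - b[0]))
    dc = min(abs(a[1] - b[1]), COLS - abs(a[1] - b[1]))
    return dr + dc

print(torus_neighbors(0, 0))        # corner wraps to the far edges
print(torus_hops((0, 0), (9, 13)))  # 2 hops thanks to wraparound, not 22
```

That uniformity (every node looks the same, no special edge cases) is exactly the kind of regularity that lets a compiler map kernels onto the fabric mechanically.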


SK Hynix AiMX

Guhyun Kim from SK hynix presented in-memory computing research targeting the memory-bound problem of LLM processing. Based on GDDR6, AiMX integrates processing directly adjacent to the memory banks, with internal bandwidth of 512 GB/s per die. AiM also exploits data placement optimization to further balance compute/memory locality. Interesting to me is how AiMX proposes a hybrid model in which the system consists of an H100 GPU and the AiMX ASIC. SK hynix is now applying the same technique to LPDDR targeting mobile applications, where, in my opinion, there is the highest potential for adoption given the obvious advantages in energy efficiency.
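A quick roofline-style estimate shows why LLM decode is memory-bound and why bandwidth close to the banks matters: at batch size 1, generating each token streams essentially every weight once, so throughput is capped near bandwidth divided by model size. The model size and precision below are my assumptions for illustration; only the 512 GB/s figure comes from the talk.

```python
# Why LLM decode is memory-bound: batch-1 token generation reads
# roughly all weights once per token, so tokens/s <= BW / model_bytes.
# Model size and precision are assumptions; bandwidth is from the talk.

MODEL_PARAMS = 7e9      # assumed 7B-parameter model
BYTES_PER_PARAM = 2     # assumed FP16 weights
AIMX_BW = 512e9         # 512 GB/s internal bandwidth per die

model_bytes = MODEL_PARAMS * BYTES_PER_PARAM
print(f"Upper bound: {AIMX_BW / model_bytes:.0f} tokens/s per die")
```

The point of processing-in-memory is that this internal bandwidth is available without crossing the external memory interface, which is where both the bottleneck and most of the data-movement energy live.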


Intel Xeon6

Praveen Mosur from Intel presented the Intel Xeon6 SoC, based on the Intel 4 process node. Personally, the most interesting part of the architecture is the bifurcated design strategy offering both Performance-core (P-core) and Efficiency-core (E-core) variants. This approach allows Intel to address a wider range of workloads and customer needs. The chiplet-based design allows for flexible configurations of compute and I/O, employing Intel's next-gen packaging technology, including EMIB (embedded multi-die interconnect bridge) for high-performance die-to-die connections. (My 2 cents:) I for one believe talk of the imminent demise of Intel is premature, and that this foundational company can make a strong comeback by reverting to its technology-excellence roots.


Predictable Scaling


Training Compute Scaling

Trevor Cai from OpenAI delivered the keynote on the topic of Predictable Scaling and Infrastructure. ChatGPT collects a large dataset of text, code, images, audio, and math, and uses the data to pre-train a model to predict the next word. Post-training then enables the model to follow instructions and be conversational. Synchronous Stochastic Gradient Descent (SGD) is used to distribute the training workload across multiple accelerators, exploiting many forms of data and model parallelism. Reinforcement learning with a human in the loop then further refines the model.

Trevor described the evolution from GPT-1 (June 2018) to GPT-4 (March 2023), when GPT became actually "useful". He also strongly advocated the idea that "scale works", pointing to understanding of multiple languages, diagrams, and physics. He showed results from an OpenAI technical report with a power-law fit of model size versus loss. This is substantiated by the 4-5x per year growth of training compute (FLOP), from 10^14 in 2010 to 10^26 in 2024. He further promoted the idea that inference demand is driven by the intelligence of the model.

As training systems continue to grow in scale, cluster-level RAS becomes critical, since the cost of failure scales up as well. Power management also becomes paramount, as power is now a limiting factor. The case against ever-growing LLMs includes performance limitations (which could be made up by improvements in training algorithms and hardware), economic feasibility, and data availability and quality. Obviously, the continuing growth and optimization of LLM models will push the hardware leading edge, which is great for chip and system designers.
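To illustrate what such a power-law fit looks like, here is a minimal sketch: loss(C) = a · C^(-b) becomes a straight line in log-log space, which is what makes scaling "predictable". The coefficients are placeholders I made up for illustration; they are not OpenAI's published values.

```python
# Power-law scaling in the style of the fit mentioned above:
# loss(C) = a * C**(-b), a straight line on log-log axes.
# Coefficients are made-up placeholders, not OpenAI's values.

import numpy as np

a, b = 50.0, 0.05                    # assumed fit coefficients
compute = np.logspace(14, 26, 7)     # training FLOP, spanning 2010 -> 2024
loss = a * compute ** (-b)

for c, l in zip(compute, loss):
    print(f"C = {c:.0e} FLOP -> predicted loss {l:.2f}")
```

The practical value of such a fit is that you can extrapolate from cheap small-scale runs to decide whether a 10^26-FLOP training run is worth the hardware before you build it.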

To be continued!

