Photonics Upgrading AI Computing to the Speed of Light

We’ve talked before about how high-speed optical interconnects enable a more decentralized system of data centers across different geographical areas, interconnected to ease the strain that data center clusters put on power grids. But even within a data center, a rack, or a server, the shift to all-optical interconnects between AI nodes is already underway. Why?

A Whole AI Cluster Like One GPU

GPT-4 has more than a trillion parameters (1.76 trillion, according to a report from THE DECODER). For reference, GPT-3 had 175 billion parameters in 2020 and GPT-2 had 1.54 billion in 2019. Inferencing with GPT-4 requires tens of GPUs, fine-tuning requires hundreds, and training requires thousands. As these models continue to grow rapidly in size and complexity, those numbers are likely to increase by an order of magnitude. Since a typical chassis holds at most eight GPUs and a rack holds 16-32, even inferencing is a rack-scale operation.
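As a rough sanity check, here is a minimal Python sketch of that arithmetic; the workload sizes (32, 512, and 4,096 GPUs) are illustrative assumptions matching the "tens / hundreds / thousands" figures above, not measured requirements:

```python
import math

GPUS_PER_CHASSIS = 8   # typical maximum per server chassis (from the text)
GPUS_PER_RACK = 32     # upper end of the 16-32 range cited above

def racks_needed(gpu_count: int) -> int:
    """Racks required to host a given number of GPUs."""
    return math.ceil(gpu_count / GPUS_PER_RACK)

# Hypothetical workload sizes, chosen to match "tens / hundreds / thousands"
for task, gpus in [("inference", 32), ("fine-tuning", 512), ("training", 4096)]:
    print(f"{task:>11}: {gpus:>5} GPUs -> {racks_needed(gpus):>4} rack(s)")
```

Even the smallest of these hypothetical workloads fills a rack, which is the point: the interconnect has to reach beyond the chassis.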

Generative AI systems need global communication across many GPUs, and therefore demand low latency and high bandwidth. GPUs need to be able to talk to each other beyond a single chassis or rack in a data center. Training, fine-tuning, and inference each demand substantial resources: one rack for inference, tens of racks for fine-tuning, and hundreds of racks for training. All of these GPUs need to interconnect to complete tasks. In a sense, the entire distributed compute operation needs to look like a single virtual GPU (as described in Ayar Labs’ whitepaper).

Most organizations currently run generative AI tasks on systems that rely on traditional interconnects: electrical I/O combined with pluggable optics that convert electrical signals into optical signals and vice versa. However, these systems cannot provide the efficiency that generative AI workloads require, due to signal degradation, high energy demand, and a large physical footprint.

According to Semi Vision reporting on Nov. 29, 2024: “NVIDIA's GB200 introduces backplane connectors to link 72 Blackwell GPUs with 5,000 NVLink copper cables, totaling over two miles in length. Amphenol, NVIDIA's primary copper cable supplier, now produces components similar to splitters called "cartridges" instead of individual copper cables. Each cartridge contains thousands of wires, achieving a data rate of 112G per wire under GH200 specifications. However, GB200 specifications are expected to upgrade to 224G per wire, resulting in poor yields and failed testing.” This has led to additional delays in Nvidia’s GB200 mass production, with Microsoft reducing its GB200 orders to Foxconn from 20,000 units to 12,000 units and shifting some orders to the GB300, which is scheduled for delivery in June 2025.

Credit: Semi Vision (Product images and names belong to Nvidia and Amphenol)


Go For All Optics?

Effect Photonics wrote: “As explained by Andrew Alduino and Rob Stone of Meta in a talk at the 2022 OCP Global Summit, interconnecting AI nodes via optics will be vital to decreasing the power per bit transmitted inside data center racks. This means changing the traditional architecture inside these racks. Instead of using an electro-optical switch that converts the electrical interconnects between AI nodes to optical interconnects with another, the switching inside the data center rack would be entirely optical… these new optical interconnections might need co-packaged coherent optics.”

Credit: Effect Photonics

NLM Photonics has demonstrated a further evolution of all-optical interconnects inside servers, shown below for reference.

Credit: NLM Photonics

Converting electrical data to an optical format and back during data transmission is energy-intensive. Communicating with high bandwidth density at the lowest energy cost per bit is the key metric to benchmark here. Current pluggables-based GPU-to-GPU links consume a total of about 30 picojoules per bit (pJ/b). According to Ayar Labs, its in-package optical I/O solution directly connects two packages using less than 5 pJ/b. Pluggables are also bulky modules whose cost is too high to scale.
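The pJ/b gap translates directly into link power, since 1 Tbps at 1 pJ/b works out to exactly 1 W. A minimal sketch, assuming an illustrative 4 Tbps link rate (not a vendor spec):

```python
def link_power_watts(bandwidth_tbps: float, energy_pj_per_bit: float) -> float:
    """Power = bit rate x energy per bit; 1 Tbps x 1 pJ/b = 1 W exactly."""
    return bandwidth_tbps * energy_pj_per_bit

BW_TBPS = 4.0  # illustrative link rate, for comparison only

print(f"Pluggables at ~30 pJ/b: {link_power_watts(BW_TBPS, 30):.0f} W per link")
print(f"Optical I/O at <5 pJ/b: {link_power_watts(BW_TBPS, 5):.0f} W per link")
```

At that rate the same link drops from roughly 120 W to under 20 W, and the difference multiplies across every GPU-to-GPU connection in a cluster.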

Next-Generation Optics Emerging

A replacement strategy, co-packaged optics (CPO), has matured and is increasingly being deployed in interconnect fabrics. CPO moves the I/O module away from the faceplate by integrating the module components into a single package alongside compute or switch chips. However, CPO manufacturing is more expensive and challenging than traditional optical module manufacturing, and its adoption may require significant changes to network infrastructure and design.

Ayar Labs’ in-package optical I/O aims to tackle all of these bottlenecks. Its optical I/O chiplet, TeraPHY, fabricated in GlobalFoundries’ standard CMOS process and compatible with existing system-in-package architectures, has achieved 4 Tbps bidirectional bandwidth (technical comparison with CPO here). The company has raised around $219.7M in total from Nvidia, Intel Capital, GlobalFoundries, Applied Ventures, Lockheed Martin Ventures, BlueSky Capital, and more (34 total investors, including grants from NSF and DARPA). Ayar Labs is already shipping thousands of engineering samples of its chiplets. “Based on what our customers need to solve problems within data centers and their progress on the ecosystem to support it, between 2026 and 2028, you’ll see commercial offerings with optical I/O,” according to the company.

NLM Photonics aims to enable faster, more energy-efficient movement of data across endpoints with its hybrid electro-optic (EO) modulation technology, which combines organic materials with semiconductor photonics platforms. The company is at an early stage, having raised $4.4M in total from strategic investors like Hamamatsu Photonics and Tokyo Ohka Kogyo, plus grants from NSF and the US DOE. Its partner, Switzerland-based Polariton Technologies, designs and manufactures plasmonic PICs and claims to have commercialized the world’s fastest and smallest EO modulators. Polariton also partners with Lightwave Logic (NASDAQ: LWLG) for its proprietary engineered electro-optic (EO) polymers; at ECOC in September 2024, they demonstrated 400 Gbps for data center optical transceiver modules.

Lasers with different wavelengths don’t interfere with each other. So, to parallelize data transmission, you don’t have to scale physical hardware; you can instead run multiple lasers of different colors (i.e., different wavelengths) through the same optical waveguide. A single laser source can generate hundreds of distinct wavelengths of light, each simultaneously carrying an independent stream of data between AI nodes. More wavelengths per fiber also improves energy efficiency.
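A back-of-envelope sketch of that scaling argument; both the wavelength count and the per-wavelength data rate below are illustrative assumptions, not vendor figures:

```python
WAVELENGTHS = 256         # "hundreds of distinct wavelengths" on one fiber
GBPS_PER_WAVELENGTH = 32  # assumed per-wavelength data rate (illustrative)

# Wavelengths don't interfere, so per-fiber bandwidth adds up linearly
aggregate_tbps = WAVELENGTHS * GBPS_PER_WAVELENGTH / 1000
print(f"{WAVELENGTHS} wavelengths x {GBPS_PER_WAVELENGTH} Gbps each "
      f"= {aggregate_tbps:.1f} Tbps over a single fiber")
```

Doubling the wavelength count doubles the fiber’s capacity without laying any new fiber, which is why wavelength-division multiplexing is the preferred scaling axis.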

Xscape Photonics claims that its fully programmable, multi-color photonics platform, ChromX, can handle hundreds of colors on a single fiber, letting AI data center fabrics boost the “escape bandwidth” out of GPUs by 10X while cutting power consumption by a similar factor. The company just raised a $44M Series A led by IAG Capital Partners with participation from Altair, Cisco Investments, Fathom Fund, Kyra Ventures, LifeX Ventures, NVIDIA, and OUP. The funding will go toward ramping up mass production of the platform. (Source)

Enlightra, backed by Y Combinator and Intel Ignite, raised a $4M seed round from Creator Fund and Aloniq in 2023. It claims to have built, certified, and launched sales of the first-ever product using multi-color lasers on a silicon chip driven by a single continuous-wave laser, and says it has received very positive feedback from customers. (Source)

Fast-Growing Si Photonics Market

As AI systems do more parallelized processing, photonics will play an increasingly large role. It enables the integration of optical components like lasers, modulators, and detectors on a single chip or substrate. Photonics spans many semiconductor technologies beyond Si, but Si photonics (SiPho) is in high demand because it leverages Si semiconductor economies of scale to create photonic integrated circuits (PICs) that use light to transmit and process data. Many developments are pushing light-speed data transmission from the links between data centers, clusters, and racks down to the links between individual GPUs. SiPho isn’t just used for AI/HPC computing but also for quantum computing and telecommunications.

The SiPho market, valued at around $2 billion in 2024, is widely projected to grow at a compound annual growth rate of 25% to 30%, reaching $7-11 billion between 2029 and 2031. The Asia Pacific region is anticipated to show the fastest growth. (Source 1, Source 2, Source 3, Source 4)
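Those endpoints are roughly consistent with simple compounding from the 2024 base, as this quick check shows (differing base-year estimates and rounding explain why the extremes fall slightly outside the $7-11 billion range):

```python
BASE_2024_USD_B = 2.0  # ~$2B estimated 2024 market size

# Compound the base forward at each CAGR to each projection year
for cagr in (0.25, 0.30):
    for year in (2029, 2031):
        projected = BASE_2024_USD_B * (1 + cagr) ** (year - 2024)
        print(f"CAGR {cagr:.0%} -> {year}: ${projected:.1f}B")
```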

The SiPho ecosystem includes PIC designers (including IP providers and design service providers), foundries and manufacturing fabs (including chiplet manufacturing), epi wafer suppliers, silicon-on-insulator (SOI) substrate providers, transceiver integrators, equipment providers, end users, and more. Advanced semiconductor packaging technologies play a crucial role in packing chips and components compactly on Si platforms. For example, Marvell Technology’s SiPho Engine leverages advanced 3D packaging to integrate hundreds of components into a single device. Among pure-play foundries, TSMC and GlobalFoundries are well-known players, but many startups are also partnering with them to push boundaries in the space.

Credit: TSMC

All posts so far in this thesis-based series on emerging AI infrastructure are here:

We are tracking hundreds of companies in the AI infrastructure space. If you would like to get our full list of AI Data Center Cooling Tech companies and important signal reviews for 2024, please sign up here. We'll share them in your inbox.

To startups within our thesis focus: you’re welcome to share your updates with us.

To investors: we collect and clean data/signals so you can focus on deal-making.
