To Co-Package or not to Co-package Ethernet Switch Chips

I am writing this paper to initiate an open and professional discussion about the trend of driving switch-silicon co-packaging for next-generation switching solutions.

Co-packaging is a very complex initiative, and all aspects of it need to be considered as an industry and a community. Every one of the subjects discussed in this paper can and should be deep-dived by the appropriate parties to ensure we drive the industry in the right direction.

Please note that everything in this paper is my personal view and does not represent any past, present, or future employers.

The problem

Switching silicon technology has reached very high density, with 25.6T switch silicon entering the mainstream now and 51.2T silicon coming soon. We see a firm push from end-users to integrate the switching silicon with the optics into a single co-packaged solution, with fibers attached directly between the co-packaged module and the switch system's front panel. This push is widespread and firm, and it is driving a significant effort at most silicon suppliers.

One might ask: what is the reason for taking this initiative for switching technology? The general reasoning for the co-packaged solution is mainly the following:

1.    Connecting high-speed silicon to front panel modules is becoming very complex at 50G PAM4 and 100G PAM4 speeds

2.    IO power reduction due to the proximity of the switching silicon and the electrical part of the optical chiplet

3.    Increased faceplate density of LC fiber connectors vs. QSFP-DD or OSFP modules

4.    Potential cost reduction in large radix switches

This paper will address all of these concerns and open up the discussion about the price we will pay, as a community and as data center operators, when we go to a co-packaged solution.

Addressing the reasoning

High-speed connectivity

Copper-based connectivity for high-speed signaling has been evolving for 20+ years, breaking barrier after barrier. We have reached a point where we know very well how to make copper (PCB-based) connections for 25G NRZ and 50G PAM4 over short distances of 3-10" without a need for repeaters. We will connect 100G PAM4 with either advanced PCB technology or a cabling system. We have a system-level copper solution to connect a chip to the front panel for the foreseeable interconnect technology of choice.

The QSFP-DD and OSFP sockets and modules are rated for 50G PAM4, and it is only a matter of time before we will be able to support 100G PAM4 with a pluggable socket.

For 25.6T, 51.2T, and 102.4T switches, we do not need an interconnect faster than 100G PAM4; existing silicon packaging technology supports 256/512/1024 (the last still to be developed) SerDes channels per package.
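As a quick sanity check, here is a minimal, purely illustrative Python sketch showing how those SerDes lane counts follow from the switch capacity, assuming each electrical lane runs at 100G PAM4:

```python
# Minimal sketch: SerDes lanes needed per switch package, assuming
# 100G PAM4 electrical lanes (capacities taken from the text above).
LANE_RATE_GBPS = 100  # assumed per-lane rate (100G PAM4)

for capacity_tbps in (25.6, 51.2, 102.4):
    lanes = round(capacity_tbps * 1000 / LANE_RATE_GBPS)
    print(f"{capacity_tbps}T switch -> {lanes} SerDes lanes per package")

# 25.6T switch -> 256 SerDes lanes per package
# 51.2T switch -> 512 SerDes lanes per package
# 102.4T switch -> 1024 SerDes lanes per package
```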

As described above, there is no high-speed signaling requirement that forces us to integrate the optics with the switching engine for the current and next two generations of switches.

IO power reduction

We all strive for an overall power reduction in our data centers, looking for a more sustainable solution and lower power OpEx. Overall switch power savings for the co-packaged solution have been estimated at between 10% and 30%, all of it coming from the IO power of the switching ASIC and the IO power of the optical modules (including all the logic associated with that interface).

[Figure: a traditional four-plane, five-stage Clos network for up to 98,304 servers, 48 servers per rack]

Most mid-sized and larger data centers deploy a Clos network to support scaling up and out. The diagram above represents a traditional four-plane, five-stage Clos network supporting up to 98,304 servers non-blocking.

Assuming a ToR (Top of Rack) based architecture, the typical switch-to-server ratio in the rack is between 1:32 and 1:80, meaning that for every ToR switch you have 32-80 servers associated with it, at a rack power rating of 19.2-35 kW in modern racks (the diagram above shows 48 servers per rack).
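For readers who want to trace the numbers, here is a small back-of-envelope Python sketch (my own reconstruction, assuming 48 servers per rack as in the diagram and the total switch count quoted a little further below):

```python
# Back-of-envelope reconstruction of the Clos example, assuming
# 48 servers per rack (as in the diagram) and the 2,432 total
# switches quoted later in this section.
SERVERS = 98_304
SERVERS_PER_RACK = 48      # assumption taken from the diagram
TOTAL_SWITCHES = 2_432     # total switches in the 4-plane, 5-stage Clos

tor_switches = SERVERS // SERVERS_PER_RACK       # one ToR switch per rack
non_tor_switches = TOTAL_SWITCHES - tor_switches

print(f"Racks / ToR switches: {tor_switches}")            # 2048
print(f"Non-ToR (fabric) switches: {non_tor_switches}")   # 384
```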

Even assuming the ToR were targeted for the co-packaged switching solution, which is not the case at this time, the power saving per rack is around 50-100 W, roughly 0.1-0.3% of the overall rack power; this indicates an extremely low impact on data center power at the rack level.

The non-ToR switches in the above example, in a fully non-blocking design, add up to an additional 384 switches (out of a total of 2,432 switches). Their combined power savings come to 19.2-38.4 kW (384 x 50-100 W) for a 40 MW data center, totaling 0.05%-0.1% of the overall power, not including any WAN connectivity power and assuming perfect PUE.

The OpEx cost saving associated with that solution is at most around $23,500/year for the non-ToR switches, which are the target for the co-packaged solution. At a typical electricity cost of about $0.07/kWh, that is roughly 0.1% of the overall data center power cost of about $24.5M/year.
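The arithmetic behind these estimates can be reproduced with a short, illustrative sketch; the per-switch savings, switch count, facility size, and the assumed $0.07/kWh electricity price are the figures quoted above, and the rest is straightforward multiplication:

```python
# Illustrative reproduction of the power and OpEx estimates above,
# using the figures quoted in the text (50-100 W saved per non-ToR
# switch, 384 such switches, a 40 MW facility at perfect PUE, and an
# assumed electricity price of $0.07/kWh).
HOURS_PER_YEAR = 8760
PRICE_PER_KWH = 0.07       # USD, assumed typical rate
DC_POWER_KW = 40_000       # 40 MW facility
NON_TOR_SWITCHES = 384

for saving_w in (50, 100):
    saved_kw = NON_TOR_SWITCHES * saving_w / 1000
    share_of_dc = saved_kw / DC_POWER_KW
    opex_saving = saved_kw * HOURS_PER_YEAR * PRICE_PER_KWH
    print(f"{saving_w} W/switch -> {saved_kw:.1f} kW saved "
          f"({share_of_dc:.2%} of DC power), ~${opex_saving:,.0f}/year")

total_energy_cost = DC_POWER_KW * HOURS_PER_YEAR * PRICE_PER_KWH
print(f"Total DC energy cost: ~${total_energy_cost / 1e6:.1f}M/year")

# 50 W/switch -> 19.2 kW saved (0.05% of DC power), ~$11,773/year
# 100 W/switch -> 38.4 kW saved (0.10% of DC power), ~$23,547/year
# Total DC energy cost: ~$24.5M/year
```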

As you can see, the power consumption impact and the OpEx cost saving can hardly justify such a herculean effort to develop and integrate a co-packaged switching solution into the data center.

Faceplate density increase

Today, the faceplate density with QSFP-DD or OSFP is around 32 ports per 1RU; with 400G modules that translates to 12.8T per 1RU, and with 800G modules to 25.6T per 1RU, assuming we can scale QSFP-DD or OSFP to support 800G.

For standard LC connectors, we can fit 72-128 connectors per 1RU; this more than doubles the number of ports per 1RU compared to pluggable modules.
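A quick, illustrative calculation of the faceplate comparison, using the port counts and module rates quoted above (treat the LC counts as rough per-1RU estimates):

```python
# Faceplate-density comparison, using the numbers quoted in the text.
PLUGGABLE_PORTS_PER_RU = 32      # QSFP-DD / OSFP ports per 1RU

for module_gbps in (400, 800):
    tbps = PLUGGABLE_PORTS_PER_RU * module_gbps / 1000
    print(f"{PLUGGABLE_PORTS_PER_RU} x {module_gbps}G pluggables = {tbps:.1f} Tb/s per 1RU")

for lc_ports in (72, 128):
    ratio = lc_ports / PLUGGABLE_PORTS_PER_RU
    print(f"{lc_ports} LC connectors per 1RU -> {ratio:.2f}x the port count of pluggables")

# 32 x 400G pluggables = 12.8 Tb/s per 1RU
# 32 x 800G pluggables = 25.6 Tb/s per 1RU
# 72 LC connectors per 1RU -> 2.25x the port count of pluggables
# 128 LC connectors per 1RU -> 4.00x the port count of pluggables
```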

With that said, as you can see above, the ratio of racks needed for the networking portion of the data center vs. compute is about 1:200 (10 racks of networking vs. over 2,000 racks of compute); this indicates that sensitivity to RU utilization for 1RU networking gear in the core of the network is not significant enough to justify the extremely high density of LC plugs vs. pluggable modules. Moreover, very high-density faceplate connectors create major operational issues with cabling and maintenance.

Cost Reduction

While there is potential for cost reduction, it is not yet clear what the yield of the co-packaged solution and the system will be, so it is too early to predict the potential for cost reduction at this point.

Technology disaggregation

For the last ten years, the data center industry has been working towards disaggregation of technologies that do not share the same development pace and innovation cycles. We are doing it for CPUs, memory, disk drives, and specialty technologies.

Switching chip design and optical technologies have historically been developed on very different cadences and with different methodologies. Switching ASICs operate on a 12-18 month capacity-doubling cycle, while optical technology develops in roughly 2-4 year cycles with 4x capacity steps.

Combining the two technologies will cause a significant delay in the switching silicon deployment cycles, since the silicon will have to operate on the optical development timelines. The direct implication for the industry is a roughly 2x slowdown in new switching technology. To justify this, we need excellent reasons and a very significant problem to solve.

Innovation

Innovation is created by need and competition. With co-packaging, we will no longer generate innovation in optics, since the optical solution will always be integrated with the switching chip, and there will be no room to disrupt the industry with new types of optical technology that can be easily inserted and tested in a modular solution. This will significantly slow our industry's growth and create barriers for small companies trying to succeed with new optical design innovation; without access to switching-silicon integration, they will have no path into the networks. In the long run, this will work against the growth of our industry.

Driving future SerDes development to focus on ultra-short-reach, low-power solutions will inhibit advanced SerDes development for copper-based interconnect and will prevent the industry from continuing to leverage, or even obtain, next-generation copper interconnect solutions for in-system and cross-system connectivity.

New challenges with Co-Packaging

I will touch on some of the chip design challenges in creating a combined module that can be integrated into a system. I will focus mostly on the challenges outside of the integrated module. I would expect a deep dive from the chip design teams and the optical suppliers on the co-package integration difficulties.

Integrated chip design

Chip design with multi-die, chiplets, or discrete components is a well-known and proven technology; however, until now it has only been done at large scale around the integration of electrical sub-components. Optical co-packaging introduces new challenges and requires new innovation to ensure chip-to-chip interconnect and power integrity while mixing technologies.

Integrated co-packaged chip power

When dealing with hybrid technologies, each with its own special requirements, it is essential to ensure that we can create a mixed power distribution that provides power integrity and stability, especially under high-intensity transient conditions.

The integrated co-packaged chip/module will be rated between 750-1500 W, which means over 1,000 A of current flowing through the package, assuming off-package power step-down. Delivering such high current can have a severe impact on power integrity and stability at the silicon level and at the substrate/package level; this is a much more complicated problem to solve than in any integrated solution to date and will require special design and conditioning.
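To put the current figure in perspective, here is a rough back-of-envelope sketch; the core-rail voltages used (0.75 V and 0.85 V) are my own illustrative assumptions, not vendor specifications:

```python
# Rough current estimate for a 750-1500 W co-packaged module, assuming
# an off-package step-down to a low core voltage. The rail voltages
# below are illustrative assumptions only.
for power_w in (750, 1500):
    for v_core in (0.75, 0.85):
        current_a = power_w / v_core
        print(f"{power_w} W at {v_core} V -> ~{current_a:,.0f} A")

# 750 W at 0.75 V -> ~1,000 A
# 750 W at 0.85 V -> ~882 A
# 1500 W at 0.75 V -> ~2,000 A
# 1500 W at 0.85 V -> ~1,765 A
```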

Module Cooling & Laser integration – Thermal

One of the biggest challenges in system design with chips rated at 750-1500 W is cooling. It is clear that a pure air-cooling solution will not work here, and new technologies will have to be developed to cool such a high-power module. Such technologies exist today (like direct-to-chip liquid cooling) but require special design to support such a chip/module.

On top of this, the switching silicon and the optical side have different operating temperatures and temperature sensitivities, specifically around the lasers. A new cooling system or a technology split will have to happen to make the solution viable; putting a 500 W chip (an oven) in close proximity to the laser will not yield good results. While split laser/optical-module solutions are in progress, this creates a significant challenge for system designers.

This is a very complex problem and can dramatically delay the availability of the solution.

System integration and testing

System integration with tens, hundreds, and in some cases thousands of fiber strands in the box will create significant challenges in manufacturing, assembly, and test. This capability is not mainstream on switching suppliers' manufacturing floors and will require investment, training, and very complex test and debug procedures. In the short run, this will hurt the system's reliability.

Sourcing

This will be a significant issue for the end-user/customer. Today we have the luxury of purchasing switches from a selection of suppliers, OEMs, ODMs, and our manufacturing partners. We have interoperable plug-in ports for optical modules, and we have a slew of suppliers to fill these module slots with a very diversified set of modules, technologies, and prices. With the integration of the optical and electrical parts of the solution, we will probably end up locked into 1-2 optical module suppliers, with only 1-2 suppliers having the assembly and test capability. As customers, we will lose any way to diversify our supply chain, maintain multiple sources for each part we need, and optimize cost.

On top of this, we will now have to purchase a switch based on the type of optics integrated into it and will not be able to mix and match optical types in the same switch. This will dramatically increase the number of SKUs we need to work with and will not enable us to balance innovation with risk.

Operations

On average, in our industry, more than 35% of data center integration failures are due to miscabling or cable mishandling. We address this with channel-by-channel tracing and by replacing optical modules or re-routing cables. With the integrated solution, this is no longer an option. To a certain extent, ANY issue with the optical part of the interconnect (even a single lane) will force us to replace the whole switch, requiring time, cost, network downtime, and special training, while at the same time hurting our network availability. If you look at the diagram above, you can see that every path is redundant, and in theory a technician can do a non-disruptive switch replacement in case of a lane failure. Still, the complexity and cost associated with switch replacement, provisioning, and test are unreasonable, especially in a large data center environment with millions of active lanes.

Summary

In summary, optical and electrical co-packaging is a technology that should be researched and investigated to find the right insertion point for it in our industry. In my opinion, the 51.2T and 102.4T switches are not the right place to do so; as long as we can continue to use optical modules, enabling innovation and competition in the market, we should continue to pursue pluggable data center switching solutions.

This paper is not written to challenge co-packaging efforts but to start an open and professional discussion about the technology, its values, and its drawbacks.

I will be happy to get your feedback on the above and start an open and professional discussion on every one of the items I discussed above as a group or privately.

Asaf Radai

Yuval, thanks for sharing!

Han Arets

enjoying retirement

Supplier differentiation upstream may maintain the number of competitors (and their margin opportunity), but standardization for end users with huge purchasing power (like Microsoft) means a boost to their business model in data services through improved competitiveness (revenue) and improved cost competitiveness (margin). Performance improvement: more data at higher speeds, lower power consumption, higher reliability, and simpler vendor management. Switch capacity may improve as well in the near future. It is not just a cost and amortization issue; more important are the competitive advantages it would deliver, IMO.

Karen Liu

Product Management at Nubis Communications

Thank you for contributing your expertise and time to foster open community discussion of a complex issue.

Hesham Taha

CEO and Co-Founder at Teramount Ltd.

Indeed, the overall data center power saving of CPO is a minor fraction. However, a 10-30% power saving in a 51.2T switch with pluggable optics is still significant for the proper operation of components inside a ~2 kW "hot oven." Such high power will challenge existing thermal management solutions and will require expensive repeaters and PCB technologies. I totally agree that the industry has yet to establish a scalable CPO ecosystem and make it more economic for the successful adoption of CPO switches, and for the next generations of high-speed connectivity, e.g. chip-to-chip optical connectivity. In addition, a scalable and reliable CPO alternative to pluggable optics is a must for cost reduction, especially given the high cost of >$50k for the 64x800G transceivers of a 51.2T switch with pluggable optics (assuming $1/G).

Yuval, thanks for stimulating this dialog! I am not deeply embedded in the space inside the box, but listening to my colleagues in Ethernet & OIF forums, I have come to the opinion that it may be difficult to justify & adopt CPO in the world of 100G SerDes. There we have "workable," if less efficient, solutions within the current paradigm. It seems to me that 200G electrical lanes may require 2-3 cm electrical traces, perhaps demanding CPO as a paradigm, but maybe not until then. In the meantime, I believe that 100G electrical lanes will have a long lifetime. So 100G VCSEL SR8 downlinks to servers, from a high radix switch, using MMF cabling, will be the lower cost & power method for server attachment over 30m for 5 years to come. (https://ieee802.org/3/db) LEAF to SPINE connections might be served well by DR4 and FR4, using a gearbox when 4x200G wavelengths are needed. This should give us plenty of time to work out the questions you pose carefully.
