The Case Against Memory Upgradeability
Crucial 2 x 16GB DDR4

The picture above is a pair of 16GB DIMMs; each DIMM has four packages of two 16Gbit die. Consider that most client systems can run fine at 16GB total memory, and almost all requirements are met at 32GB. The question: why is memory not in the processor package? Doing so would give up the memory upgrade path, as the processor and memory combination must be set at the time of initial purchase. Why give up a feature that we already have? Of course, brain-dead pundits incapable of whole-picture, real-world thinking would complain loudly, as they always have. The fact is: an intelligent and practical decision on the processor and memory combination can be made at purchase that will meet requirements through the system's productive lifetime in all but a very few cases. The imposition of memory upgradeability has a cost that is not commonly understood. The driving reason to bring memory into the processor package is to remove the obstructions to critical performance gains posed by externally connected memory.

For more than a decade, we have not had anything close to 40% year-on-year performance improvement at either the core or the processor level, except for the very few applications that leverage the new SSE/AVX SIMD instructions in each generation. One major reason is the huge disparity between the processor core clock cycle and the round-trip memory access time. The clock cycle time of a core at 4.0GHz is 0.25ns versus a round-trip memory access of 70ns, a factor of 280X. Any software that does not run inside the on-die cache or rely on streaming memory access will hit a hard performance wall on memory latency.
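A minimal pointer-chasing sketch (illustrative, not from the article) shows what that wall looks like in practice: each load depends on the result of the previous one, so with a working set far larger than the on-die caches, the loop pays close to the full memory round-trip latency per access, hundreds of core cycles. The array size, the xorshift generator, and the Sattolo shuffle are arbitrary choices for demonstration.

```c
/* Pointer-chasing latency sketch (illustrative only).
 * Each load depends on the previous one, so the loop pays close to the
 * full memory round-trip latency per access once the working set is far
 * larger than the on-die caches. Build: gcc -O2 chase.c (Linux/POSIX). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static size_t rnd_below(size_t bound)          /* crude xorshift64, illustrative */
{
    static unsigned long long x = 88172645463325252ULL;
    x ^= x << 13; x ^= x >> 7; x ^= x << 17;
    return (size_t)(x % bound);
}

int main(void)
{
    size_t n = 32u * 1024 * 1024;              /* 32M entries x 8 bytes = 256MB */
    size_t *next = malloc(n * sizeof *next);
    if (!next) return 1;

    /* Sattolo's algorithm: a single random cycle, so the hardware
     * prefetcher cannot predict the access pattern. */
    for (size_t i = 0; i < n; i++) next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = rnd_below(i);
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (size_t i = 0; i < n; i++) p = next[p]; /* serially dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("average latency per access: %.1f ns (p=%zu)\n", ns / n, p);
    free(next);
    return 0;
}
```

On a typical desktop this reports something in the neighborhood of the 70ns figure above, versus a few cycles when the array is shrunk to fit in cache.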

Both the processor and DIMM signal pads seem small at less than 1 millimeter, but this is around 10,000 times greater than transistor linear dimensions in the low tens of nanometers. Note that a semiconductor manufacturing process name such as 7nm really just means that transistor density is nominally double (about 1.6X after accounting for non-scaling elements) that of the previous 10nm process; the correlation to transistor gate length was given up long ago.

The image below shows the metal levels above the silicon die for one of Intel’s 14nm processes. The metal traces at the lowest level, connecting transistors, are around 50nm wide. The upper-level signal traces are around 10,000nm or 10μm (micrometer or micron) wide, much larger than the first level, though still smaller than the signals that leave the chip.

The diagram below shows the silicon die in a package. The die sits on a substrate, and signals travel from the die to the substrate before leaving the package to the external world.

The image below shows additional detail on the difference between signal connections from the silicon to the substrate and the outside world (not shown), versus connections from one silicon die to a second silicon die via a silicon bridge.

To send a signal off the silicon die, then off the substrate, and out of the package to the outside world, it must pass through a series of buffering circuits (on the silicon die) that greatly amplify the current from the source transistors (at the 10 nanometer scale) up to what is needed to drive millimeter-scale circuit board wires and connectors. This has a cost in die area, and is one of the non-scaling elements. The higher current of the external signals impacts power consumption, and buffering adds signal propagation delay when latency is critical.

In moving memory from outside the processor package to inside, there is some reduction in the length and size of the wires between the processor and DRAM. If we use existing components designed for off-package signals, the bumps at the silicon die are currently around 100μm, and the silicon is designed to drive current at a level sufficient for external signaling. If components were designed for in-package connections, the bumps could be reduced to 50μm (perhaps further in future designs) and would operate at lower current. Intel and others have already done this with HBM and FPGAs in certain products, and Apple puts the DRAM packages on the M1 processor package.

The current Intel desktop processor LGA1200 package dimensions are 37.5mm x 37.5mm. The next generation LGA1700 package will be 37.5 x 45mm. DRAM vendors do not like to say what their die size is, but the DRAM package is 9 or 10mm x 11mm; the 16Gbit DRAM die should be smaller, and it changes with each incremental process density gain. A consolidated processor + DRAM package could fit in comparable dimensions, though an actual layout would want to optimize the memory controller to DRAM path.
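A rough back-of-envelope check supports this (a sketch only: the package and DRAM dimensions come from the paragraph above, while the CPU die area of roughly 200mm² is my own assumption for illustration).

```c
/* Area-budget sketch: does a CPU die plus enough DRAM packages for 16GB fit
 * on an LGA1700-sized substrate? Package and DRAM dimensions are from the
 * text; the CPU die area (~200 mm^2) is an assumed figure for illustration. */
#include <stdio.h>

int main(void)
{
    double substrate_mm2 = 37.5 * 45.0;  /* LGA1700 package, 37.5 x 45 mm      */
    double cpu_die_mm2   = 200.0;        /* assumed client CPU die area        */
    double dram_pkg_mm2  = 10.0 * 11.0;  /* one DRAM package, ~10 x 11 mm      */
    int    dram_packages = 4;            /* 4 x (2 x 16Gbit) = 128Gbit = 16GB  */

    double used = cpu_die_mm2 + dram_packages * dram_pkg_mm2;
    printf("substrate area:          %6.0f mm^2\n", substrate_mm2);
    printf("CPU die + %d DRAM pkgs:  %6.0f mm^2 (%.0f%% of substrate)\n",
           dram_packages, used, 100.0 * used / substrate_mm2);
    return 0;
}
```

By this crude estimate, the CPU die plus 16GB of DRAM packages occupy well under half of the LGA1700 substrate area, leaving room for power delivery and routing keep-out zones.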

Intel client processors largely fall into four major groups: the Celeron and Pentium lines have prices below $100, the Core i3 in the low $100s, the Core i5 in the $150-250 range, and the i7 and i9 above $250. The current retail price of 32GB of memory is $160 (somewhat elevated due to the supply-demand situation; it was about $110 last year). It is not difficult to argue that the low-end processors should be configured with 4 and 8GB, the midrange with 8 and 16GB, and the high-end with 16 and 32GB. The brand and sub-brands could be restructured into divisions of 4, 8, 16 and 32GB DRAM, plus a third memory channel.
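A simple cost sketch using the figures above ($160 retail for 32GB, about $5/GB; using retail price as a stand-in for added BOM cost is my assumption):

```c
/* Memory cost sketch: $160 retail for 32GB implies ~$5/GB, applied to the
 * proposed in-package capacity tiers. Retail price is used as a rough
 * stand-in for the added BOM cost (an assumption). */
#include <stdio.h>

int main(void)
{
    double dollars_per_gb = 160.0 / 32.0;   /* ~$5/GB at current retail */
    int capacities[] = { 4, 8, 16, 32 };    /* proposed in-package tiers */
    for (int i = 0; i < 4; i++)
        printf("%2dGB in-package DRAM adds roughly $%.0f\n",
               capacities[i], capacities[i] * dollars_per_gb);
    return 0;
}
```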

The Core i9 should have specially binned DRAM parts for lower latency. This set would cover the very large majority of use cases without affecting the number of SKUs necessary to cover the market spectrum. Note: a heat sink on top of both CPU and memory could narrow the memory operating temperature range, possibly allowing a lower latency setting, depending on how much heat from the processor spills over.

That said, there is still legitimate demand for flexibility, specifically very large memory configurations. This could be handled by a processor with a third memory channel that goes off package to DIMM slots on the motherboard. Perhaps a more important future direction is a processor with in-package eDRAM (DRAM manufactured on a logic process, which has lower density but also lower latency) or even SRAM. Memory upgradeability was once an important system feature. Going forward, it has become a ball and chain, blocking continued progress in performance.

Appendix

Below is a 4MB memory board for the VAX 8600, made with 152(?) x 256Kbit DRAM chips. (4MB at 256Kbit per chip works out to 128 chips for data; the remainder would presumably hold ECC check bits.)

Below is the board cage for the VAX 11/785. In that era, 1MB of memory cost several thousand dollars, and memory upgradeability was absolutely essential. Today, memory upgradeability is a legacy artifact holding us back from making progress in computing performance.

The Apple M1 with 8GB, presumably 2 packages of 2 x 16Gbit DRAM die (2 x 2 x 16Gbit = 64Gbit = 8GB).
