Intel's P-Core & E-Core Hybrid Architecture For Alder Lake CPUs Detailed
Intel deep-dived into its upcoming E-core and P-core architectural design, and it looks like a true generational leap in power and efficiency.
Stephen Robinson (Microprocessor Architect at Intel) said:
"Our primary goal was to build the world's most efficient x86 CPU core. We wanted to do that while still delivering more IPC than Intel's most prolific CPU microarchitectures to date: Skylake. We also set an aggressive silicon area target so that multi-core workloads could be scaled out using as many cores as necessary with these architectural anchors in place. We also wanted to deliver a wide frequency range. This allows us to save power by running at low voltage and creates headroom to increase frequency and ramp up performance for more demanding workloads."
Intel wanted to provide rich ISA features, such as advanced vector and AI instructions, that accelerate modern workloads. Thanks to a deep front end, a wide back end, and a design optimized to take advantage of the Intel 7 process, this CPU core delivers a breakthrough in multi-core performance. Let's now dive deeper into the details, starting with the front end.
The first aspect of driving efficient IPC is making sure the CPU can process instructions as quickly as possible. This starts with accurate branch prediction. Without it, much of the speculative work ends up being discarded, which is wasteful. Intel implemented a 5,000-entry branch target cache and complemented it with long-history-based branch prediction. This helps the CPU quickly generate accurate instruction pointers. With accurate branch prediction, events such as instruction cache misses can be discovered and remedied early, before they become critical to program execution. Workloads such as web browsers, databases, and packet processing all benefit from these capabilities.
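To make the idea of history-based branch prediction concrete, here is a minimal sketch of a textbook "gshare" predictor. It is not Intel's design: the table size, the class name GsharePredictor, and the helper accuracy are all assumptions chosen for illustration. It only shows how mixing recent branch outcomes into the table index lets a predictor learn a repeating pattern.

```python
class GsharePredictor:
    """Toy history-based predictor (textbook gshare, not Intel's design).

    A table of 2-bit saturating counters is indexed by the branch address
    XORed with a register holding the most recent branch outcomes.
    """

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.counters = [1] * (1 << index_bits)  # 1 = weakly not-taken
        self.history = 0                         # global outcome history

    def _index(self, pc):
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        # Counter values 2 and 3 mean "predict taken".
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the real outcome into the history register.
        self.history = ((self.history << 1) | int(taken)) & self.mask


def accuracy(predictor, pc, outcomes):
    """Fraction of outcomes predicted correctly, training as we go."""
    hits = 0
    for taken in outcomes:
        if predictor.predict(pc) == taken:
            hits += 1
        predictor.update(pc, taken)
    return hits / len(outcomes)


# A loop-like branch: taken, taken, not taken, repeated.
pattern = [True, True, False] * 200
print(accuracy(GsharePredictor(), 0x40, pattern))
```

After a short warm-up, each distinct history value maps to its own counter, so the repeating pattern is predicted almost perfectly. A predictor without history would keep mispredicting the "not taken" iteration.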
Alder Lake's E-core has a 64-kilobyte instruction cache, which keeps the most useful instructions close without expending power in the memory subsystem. This microarchitecture features Intel's first on-demand instruction length decoder, which generates pre-decode information that is stored alongside the instruction cache. Code that has never been seen before is decoded quickly; the next time it is executed, the length decoder is bypassed, which saves energy. The new core also features Intel's clustered out-of-order decoder, which enables decoding up to six instructions per cycle while maintaining the energy efficiency of a much narrower core.
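The on-demand length decoder can be pictured as a cache of instruction boundaries. The sketch below is a toy model only: the encoding (low two bits of the first byte give the length minus one), the function slow_length_decode, and the class PredecodeCache are hypothetical and unrelated to real x86 encoding or Intel's hardware. It shows the principle: pay for length decoding the first time code is seen, then reuse the stored result.

```python
def slow_length_decode(code, start):
    """Hypothetical encoding: low 2 bits of the first byte = length - 1."""
    return (code[start] & 0b11) + 1


class PredecodeCache:
    """Toy model of pre-decode info stored next to the instruction bytes."""

    def __init__(self):
        self.lengths = {}        # address -> instruction length
        self.length_decodes = 0  # how often the slow decoder actually ran

    def length_at(self, code, addr):
        if addr not in self.lengths:
            # First sighting: run the length decoder and remember the result.
            self.length_decodes += 1
            self.lengths[addr] = slow_length_decode(code, addr)
        return self.lengths[addr]

    def walk(self, code):
        """Return the start address of every instruction in the buffer."""
        addr, starts = 0, []
        while addr < len(code):
            starts.append(addr)
            addr += self.length_at(code, addr)
        return starts


code = bytes([0x01, 0xAA, 0x00, 0x02, 0xBB, 0xCC, 0x00])
cache = PredecodeCache()
first_pass = cache.walk(code)
decodes_after_first_pass = cache.length_decodes
second_pass = cache.walk(code)  # served entirely from stored pre-decode info
print(first_pass, decodes_after_first_pass, cache.length_decodes)
```

On the second pass the slow decoder is never invoked, which is the software analogue of bypassing the length decoder to save energy.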
The second main aspect of achieving performance is ensuring the CPU extracts any parallelism inherent in the program. With five-wide allocation, a wide retire, a 256-entry out-of-order window, and 17 execution ports, this microarchitecture delivers more general IPC than Intel's Skylake core while consuming a fraction of the power. The execution ports are scaled to the unique requirements of each unit, which maximizes both performance and energy efficiency.
Four general-purpose integer execution ports are complemented by dual integer multipliers and dividers. For vector operations, the CPU has three SIMD ALUs. The integer multiplier supports Intel's Vector Neural Network Instructions (VNNI). Two symmetric floating-point pipelines allow executing two independent add or multiply operations (Advanced Vector Extensions).
The core can also execute two floating-point multiply-add instructions per cycle. Advanced crypto units, which provide AES and SHA acceleration, round out the vector stack. The final aspect of achieving efficient performance is a fast memory subsystem. Two load pipelines plus two store pipelines enable 32 bytes of read and 32 bytes of write bandwidth at the same time. The L2 cache, which is shared among four cores, can be 2 or 4 megabytes depending on product-level requirements. The large L2 provides high performance and power efficiency for single-threaded workloads by keeping data close.
It also provides enough bandwidth to serve all four cores. The L2 can deliver 64 bytes per cycle with a 17-cycle latency. The memory subsystem has deep buffering, and each four-core module can have up to 64 outstanding misses to the last-level cache and beyond. Advanced prefetchers exist at all cache levels to automatically detect a wide variety of streaming behavior. Intel Resource Director Technology lets software control how resources are shared among the cores.
Now, comparing four of the new cores against two Skylake cores running four threads, Intel claims 80 percent more performance while still consuming less power. Alternatively, the new cores deliver the same throughput while consuming 80% less power. This means that Skylake would need to consume five times the power for the same performance. As you can imagine, these are very exciting results! This is incredible when you consider that Intel can deliver four of the new cores in a footprint similar to that of a single Skylake core!
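The "five times the power" figure follows directly from the 80% saving quoted above. The short check below is just that arithmetic; the function name relative_power_of_baseline is made up for this illustration.

```python
from fractions import Fraction


def relative_power_of_baseline(power_saving):
    """How many times more power the baseline needs at equal throughput.

    If the new design uses (1 - saving) of the baseline's power, the
    baseline uses 1 / (1 - saving) times the new design's power.
    """
    new_power = 1 - power_saving
    return 1 / new_power


# 80% less power at the same throughput (figure quoted in the article).
ratio = relative_power_of_baseline(Fraction(80, 100))
print(ratio)
```

Exact fractions are used so the result is not blurred by floating-point rounding: 1 / (1 - 4/5) = 5.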
In single-threaded applications, a single P-Core (Golden Cove) delivers 50% more performance than an E-Core (Gracemont) within the same die area and power envelope. Intel's hybrid design, on the other hand, shows its prowess in multi-threaded performance and delivers a 50% increase over a 4 P-Core solution.
The hybrid design in this comparison features 2 P-Cores (Golden Cove) and 8 E-Cores (Gracemont). It does have 50% more threads than the standard 4 P-Core design (12 threads vs 8 threads) to obtain its 50% lead, but it achieves this within the same package and power constraints.
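The thread counts quoted above can be reproduced with a few lines. The per-core thread counts used here (two per P-Core, one per E-Core) are an assumption inferred from the article's own 12-versus-8 figures, and hardware_threads is a hypothetical helper name.

```python
# Assumed from the 12-vs-8 thread figures above: a P-Core exposes two
# hardware threads, an E-Core exposes one.
THREADS_PER_P_CORE = 2
THREADS_PER_E_CORE = 1


def hardware_threads(p_cores, e_cores):
    """Total hardware threads for a given mix of core types."""
    return p_cores * THREADS_PER_P_CORE + e_cores * THREADS_PER_E_CORE


hybrid = hardware_threads(p_cores=2, e_cores=8)
p_only = hardware_threads(p_cores=4, e_cores=0)
extra_threads = (hybrid - p_only) / p_only  # fractional increase in threads

print(hybrid, p_only, extra_threads)
```

The hybrid configuration ends up with 12 threads against 8, which is the 50% thread advantage behind the 50% multi-threaded lead.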
Intel also went into a little more depth on its Thread Director technology and, based on the pictures below, Alder Lake workloads will be segmented into specific IPC groups. This is a sensible approach, as IPC does not necessarily remain the same across all workloads or cores. The OS scheduler needs to adapt to the performance and efficiency characteristics of the architecture, and this is where Thread Director plays a crucial role.
For instance, in some scenarios, scheduling a thread on a small core can bring better overall performance than scheduling it on a large core, and the same is true with regard to efficiency. Intel's Alder Lake CPU will be the first hybrid x86 outing on mainstream consumer platforms and requires a lot of work on the scheduling end to make the architecture work as intended. Both Microsoft and Intel are working to deliver stable performance at launch with Alder Lake and the Windows 11 OS.
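As a rough illustration of why per-class information helps a scheduler, consider the toy policy below. Everything in it is hypothetical: the class names, the scores in CLASS_TABLE, and the function pick_core are invented for this sketch and do not describe Thread Director's real interface or Intel's data. The point is only that the best core type depends on both the kind of work and the goal (speed or efficiency).

```python
# Hypothetical relative scores per workload class and core type.
# These numbers are made up for illustration; they are NOT Intel data.
CLASS_TABLE = {
    "scalar": {"P": {"perf": 1.3, "eff": 1.0}, "E": {"perf": 1.0, "eff": 1.6}},
    "vector": {"P": {"perf": 2.0, "eff": 1.2}, "E": {"perf": 1.0, "eff": 1.3}},
}


def pick_core(thread_class, free_cores, prefer_efficiency=False):
    """Pick the free core type with the best score for this thread class.

    free_cores maps core type ("P" or "E") to the number of idle cores.
    Returns "P", "E", or None when nothing is free.
    """
    key = "eff" if prefer_efficiency else "perf"
    candidates = [c for c in ("P", "E") if free_cores.get(c, 0) > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda c: CLASS_TABLE[thread_class][c][key])


print(pick_core("vector", {"P": 1, "E": 4}))
print(pick_core("scalar", {"P": 1, "E": 4}, prefer_efficiency=True))
print(pick_core("scalar", {"P": 0, "E": 4}))
```

A real scheduler would also weigh priorities, migration cost, and thermal state; this sketch leaves all of that out.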
Note: This Intel architecture reminds me of AMD's architecture!
View the "7nm APU : The Ryzen Mobile 4000 Series" article published on March 25, 2020 (https://www.dhirubhai.net/pulse/7nm-apu-ryzen-mobile-4000-series-farshad-vahidpour/).
What do you think?
Special thanks to Stephen Robinson (Intel CPU Lead Architect) and Adi Yoaz (Intel Core CPU Chief Architect).