On HPC, FP16, Sapphire Rapids, AVX512 and how Vendors can actively hurt their users

For the past year I've been developing the initial infrastructure, the documentation, and the underlying code base to support IEEE-754 compliant FP16 on Intel's upcoming Sapphire Rapids Xeon CPUs (also available on their Alder Lake consumer CPUs), as part of my open source (#foss) contributions in the world of #hpc. All the while, #Intel made active efforts to block developers, gamers and everyday users from accessing these feature sets.

This mainly meant developing #BLAS (Basic Linear Algebra Subprograms) compute kernels. Think of a library you call that makes your #python, #matlab, #C, #C++ or #rust code much faster for #science or #ai math like matrix multiplies, dot products etc. This is done by leveraging special register-level features built into modern #CPUs, using #assembly and specialized compiler tricks to shortcut around the naive implementations that users normally write. For example, #OpenBLAS, where my reference implementation is being upstreamed, is commonly used as the default BLAS implementation within #numpy.

Half the battle towards being able to actually develop for these unreleased/bleeding edge devices has been getting my hands on CPUs that support the relevant instructions in hardware, so I can properly test and optimize this code.

Here, ladies and gents, is where the tale truly begins. I'm pulling the following directly from a gist I wrote up while talking to the head of Compute over at University College London and discussing use cases for custom Linux kernels, specialized CPU microcode, and how hard it can be to develop for a hardware vendor when that same vendor is actively working against you.

The original, unedited post follows:

A not so brief discussion of Alder Lake, the new AVX512 FP16 extensions, Sapphire Rapids, its history, and why it requires a custom kernel

Warning: This is going to be a long one.

I'm assuming general knowledge of x86_64 hardware extensions, and some insight into the workings of large hardware vendors.

Understanding why AVX512 is useful not only in HPC, but also for gamers in emulation, or for more efficient use of execution ports, is a bonus. You don't need to have published 2 dozen papers on optimizing compute architecture.

Corrections/clarifications/suggestions greatly appreciated

TL;DR

Intel had what I’d kindly describe as a “silly” moment. It resulted in a situation where, if you want to develop for their own next generation Xeons and relevant floating point extensions you need:

  • A chip that isn't supported
  • Special revisions of BIOSes
  • Specific versions of specific motherboards
  • CPUs made before a certain date
  • A custom kernel built with specific older, special revisions of microcode that may be vulnerable to existing undisclosed CVEs.

And you need to do all that while Intel actively tries to remove all of those capabilities.

Some Brief* background

Soooooo you may have heard about Intel's newest desktop chips, the 12th Generation Core series, codenamed Alder Lake. The headline feature is that they're made up of both E(fficiency) cores and P(erformance) cores. The Pcores use the Golden Cove microarchitecture, the same one slated to be used in Intel's next generation "Sapphire Rapids" Xeons. These cores support some wicked features for HPC/library devs, in addition to anyone wanting fast simulations on a workstation. Namely, these cores feature AVX512, and are the first x86 chips to support native IEEE-754 FP16 datatypes with 16-bit vector slices in SIMD at line rate speed. This is Huge! Especially with C23 and C++23 bringing FP16 types into the languages (_Float16 and std::float16_t). There are some subtle differences in cache and port layout between the Alder Lake and Sapphire Rapids Golden Cove cores, but they’re so minor as to be negligible in this context.

The Ecores are fast (roughly on par with the original Skylake), only support up to AVX2, and forgo Hyper-Threading. They also have some really interesting design choices, such as having 17 execution ports, but that’s a different discussion. Think of it like having 1 or 2 Skylake i5-6600K (4c4t) CPUs bolted on to a very modern 6 to 8 core CPU, except those Skylake chips draw in the 5-50 W range.

Because the Ecores don't support AVX512, we have a problem. If you were to run a program compiled with AVX512 op codes, and the kernel were to schedule that thread/job on an Ecore, it would fault the moment it hit the unsupported opcode, the same way that trying to run AVX2 codes on an old Pentium is going to result in a bad time. This applies to everything from CFD/AI codes to file system memory copy codes that use AVX512 acceleration in the kernel. If those op codes were scheduled onto an Ecore, it wouldn’t be pleasant.
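To make the mismatch concrete, here is a minimal sketch (mine, not part of the original gist) that scans text in the Linux /proc/cpuinfo format and lists the logical CPUs whose flags line lacks a given feature. The flag names avx512f and avx512_fp16 are the ones Linux uses; the function name and the sample text are made up for illustration, and whether a given kernel actually reports different flags per core on a hybrid part is an assumption of this sketch.

```python
# Hypothetical sample in /proc/cpuinfo format: CPU 0 plays the Pcore,
# CPU 1 plays the Ecore. Real output has many more fields per CPU.
SAMPLE_CPUINFO = """\
processor\t: 0
flags\t\t: fpu sse2 avx2 avx512f avx512_fp16

processor\t: 1
flags\t\t: fpu sse2 avx2
"""


def cores_missing_flag(cpuinfo_text, flag):
    """Return the logical CPU ids whose 'flags' line lacks `flag`."""
    missing = []
    cpu_id = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "processor":
            cpu_id = int(value)
        elif key == "flags" and cpu_id is not None:
            # Flags are whitespace separated; compare whole tokens so that
            # "avx512f" does not accidentally match "avx512fp16"-style names.
            if flag not in value.split():
                missing.append(cpu_id)
    return missing


if __name__ == "__main__":
    for wanted in ("avx512f", "avx512_fp16"):
        print(wanted, "missing on CPUs:", cores_missing_flag(SAMPLE_CPUINFO, wanted))
```

On a real Linux box you would pass in the contents of /proc/cpuinfo instead of the sample. Any CPU id in the returned list is a core where an AVX512 binary would fault if the scheduler placed it there.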

Intel’s solution for this, seeing as Alder Lake is a "consumer" chip, was to disable the AVX512 execution port at 2 and later 3 levels: BIOS, firmware, and finally during OS bootstrapping (by re-inserting updated microcode that re-disables the execution port if it's somehow been enabled).

Yet I’m typing this today from an Alder Lake CPU with 8 Pcores running AVX512, so something must have worked out. Things get a little cheeky from here on in.

When the initial boards were launching alongside these CPUs, some of the vendors realized “Well, one option to get around the op code on the wrong core issue would be to disable the Ecores!”. They themselves would write BIOS and firmware level enablements to get around Intel's microcode at those 2 levels. It would allow for AVX512 on Alder Lake, bonus! There is, however, a cost. For one, you’re losing out on 4 to 8 cores depending on your chip. The i5 was 6P+4e, i7 8P+4e, i9 8P+8e.

Part of the goal with Alder Lake was a hybrid power model, similar to what ARM created with big.LITTLE. When you don’t need the performance, spin down the performance cores to near sleep and run everything on the lower power cores. At medium-high load, shift non-focused tasks to the efficiency cores and shift the main tasks to the Pcores. If you have a single high-power app, spin down the Ecores to give more shared L3 and power/thermal headroom to the Pcores to stretch their legs.

Back to the timeline: Reviews come out; we have footnotes. At launch, Intel hadn't (officially or not) taken a stance on AVX512, outside of stating a year prior that it would not be supported. For a few months nothing major happens outside of a few Intel devs on mailing lists (namely GCC) posting patches for both Sapphire Rapids and Alder Lake.

When Alder Lake comes out, some reviewers mention how in the BIOS, under advanced settings, if you first disable all Ecores and then go into the AVX submenus, a new option appears. This option is for AVX512 enablement. Not only do we now have AVX512, but also a new extension to the x86 ISA.

Around here is the second part of where things get “juicy”. The first sign that Intel had decided that AVX512 on Alder Lake was not going to be “a thing” came with mailing list updates post Alder Lake launch. Of note was a patch series adding cost tables for Alder Lake to GCC-12, during which an Intel compiler engineer was asked what, if any, position Intel held on the topic of enabling AVX512 on the platform. The engineer replied that AVX512 would be explicitly unsupported.

At this point the speculation had been that there would be 3 options for Intel to move forward with:

Option 1:

Intel would not validate the feature, but not disavow its usage. If you want to risk using it, that’s on you. Similar in concept to how overclocking is “allowed” on K series CPUs, but not explicitly endorsed, nor recommended. Overclocking to this day does technically void your warranty. This was the “ideal” option.

Option 2:

Intel would mandate/sternly/strongly request that vendors remove the feature ASAP, and only allow for re-enablement on future workstation class motherboards. This is what is happening at this very moment with ECC support on Alder Lake. If your motherboard is sporting the W680 chipset, the i7-12700k supports the full suite of ECC related features, but on any other chipset, those features are disabled. Speculation was that on W series you would have an explicit option for AVX512 enablement with platform validation.

Option 3:

Intel mandates/sternly/strongly requests that all vendors remove the feature ASAP, creates new microcode revisions that explicitly disable the feature upon loading, and pushes that firmware/MC as part of the Linux and Windows upstreaming efforts to keep microcode up to date (broadly speaking a good thing). That MC is then auto-loaded on top of whatever “normal” MC is built into your MB's system. After all those tools are in place, Intel also decides that it isn’t enough, and has the fabs create a new lithographic revision to explicitly fuse off the execution port so that, no matter what trickery/clever work devs/enthusiasts put in, the platform stays exactly as technical marketing envisioned.

Can you guess what option they chose?

Dramatic tension.

More dramatic tension.

MORE DRAMATIC TENSION.

If it wasn’t obvious, it was 3. To put it kindly, it was “silly”.

Intel Folks, this part is for you

Side Bar: Intel folks that may be reading this, I love you, the work you do, and the amazing contributions to openness and FOSS in general.

But this was a questionable choice at best, and though I haven't said it myself, it has been characterized as an "ass backward decision."

It hurt workstation users who haven’t been properly served in years. You’ve pushed them towards AMD's Threadripper to get their work done.

11th gen was bad, let’s not go there, and Ice Lake-W is a server product rebadged as a workstation chip if we aren’t kidding ourselves.

The decision hurt the emulator community who thrived off AVX512 chips at the low to medium end, not for the large registers, but for the masking capabilities. These allowed for massive performance uplifts when simulating platforms like the PS3, finally allowing people to experience their favourite games on what may be their favourite platform.

It hurt the student gamer community trying to get started in modern development. Students to whom I wanted to recommend a modern development platform with modern instruction sets were stuck getting 11th gen at best. I think we can agree it wasn’t Intel’s best showing. Tiger Lake mobile was a good step. Side bar to the side bar: why in the world are there three different “11th” gen chips? Tiger Lake, Ice Lake and Rocket Lake are all “11th Gen”, and it’s a mess.

There’s a plethora of issues that I’m sure played into the decision. One point in favour of removing the feature would be Intel's reputation for stability. Another is that Intel was already receiving criticism of the hybrid architecture due to scheduler issues in Windows 10 and 11 and, to a smaller extent, in the Linux community.

Some reviewers had come out in favour of disabling ecores in general, even without the AVX512 angle, mainly due to early scheduler teething issues. The thinking may have been that “for those on the fence, another feature may have tipped the balance”.

Even if you, Intel, had worked with vendors to lock it to Z690 motherboards only, and even K series CPUs only, as an unvalidated feature, that would have been a reasonable compromise. I’d argue it would have been seen as part of the “K SKU overclocking perks”. A silent “We’re not getting involved” would have served as a wink of “just don’t ask us if something breaks”. Instead of taking a hands-off approach, there was an active decision to remove the capabilities from any chips made after the 3rd week of 2022.

I still like you, still think you do great things. But please embrace the same spirit of innovation, love of exploration and new ideas that I see pouring out of so many Intel folks I interact with, and pursue that everywhere. The same way Intel has shown they want to try new things in software, in graphics and so on, bring that to how you work with hardware and the community that truly embraces those products. Side bar over.


GTTFP

OK I KNOW I KNOW, I'M GETTING THERE!

Onto the main reason I’m writing this darn thing:

I am an ASM(asochist) developer, with a love for HPC, CFD, FOSS, etc. If I had a super hero style catchphrase it would be

“FOSS+BLAS, I make things FAST!”

Alder Lake seemed like a fascinating platform, especially since it would finally allow for true IEEE-754 compliant 16-bit floating point on x86 instead of relying on heterogeneous compute and aliasing when moving data back to the host system.

In the past, if you needed FP16 compliance on x86, you needed a platform that was Ivy Bridge or newer (3rd Gen Core, the generation before Haswell) and that supported the F16C conversion instructions, which convert to and from FP16 in 32-bit slices of 256-bit AVX1 floating point registers. Problem is, those instructions are slow to the point of being useless (latency of 7-8 clocks at best, without any density benefits like doing twice the amount of arithmetic per cycle).
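If you haven't played with half precision before, it helps to see what a round trip through IEEE-754 binary16 does to a value. The sketch below is an illustration I'm adding here, using the 'e' (half precision) format of Python's struct module rather than the F16C instructions themselves; the helper names to_fp16 and fp16_bits are made up.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE-754 binary16 value."""
    # Packing with 'e' stores the value as a 16-bit half float; unpacking
    # widens it back to a Python float without further loss.
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fp16_bits(x):
    """Return the raw 16-bit pattern of x stored as binary16."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


if __name__ == "__main__":
    # 0.1 is not representable; only 11 significant bits survive.
    print(to_fp16(0.1))
    # Between 4096 and 8192 the spacing between neighbours is 4.
    print(to_fp16(4097.0))
    # 65504 is the largest finite binary16 value.
    print(to_fp16(65504.0))
    # 1.0 is sign 0, biased exponent 15, mantissa 0.
    print(hex(fp16_bits(1.0)))
```

The takeaway is that FP16 buys you density (twice as many lanes per vector register as FP32) at the cost of a very short significand and a small range, which is why native arithmetic on the type matters more than conversion alone.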

It was also the first platform to introduce DDR5, full stop. Speaking to its capabilities as an HPC/development box, the value was great. The i7-12700k provided 8 Pcores with Hyper-Threading, 25MiB of L3, the newest generation of memory, and access to the new 5th generation of PCIe. All of this was super cool. And for board, CPU and RAM you could be out for 800 maple syrup dollars! Toss it into an old case with a PSU and drive of your choice, load up your favorite distribution, install VIM, make sure the scourge of emacs isn't installed (definitely not targeting anyone in particular with this one, James ;)), and you're ready to rock!

It also seemed like a great tool for those of us preparing for Sapphire Rapids servers, which were at that point expected in early second half 2022, but are now slated for first half 2023.

So I bought into the platform in ~November 2021, put my head down, and got to work. I had one of the first Alder Lake i7-12700k's off the production line, a week 36 2021 chip. That means that it did support AVX512, and by carefully choosing my motherboard, I also had access to AVX512 enablement options.

Now to call back to that timeline from earlier: Around January, Intel pushed new microcode upstream that explicitly made it impossible to enable AVX512, as well as pushing vendors to update all their new BIOSes to include the new microcode.

In practice this means that if you:

A) Update your BIOS? You lose AVX512

B) Update your kernel? You lose AVX512

Solution A? Never update your BIOS.

Solution B? Dealing with the kernel is a little more complicated. There are flags to disable early microcode loading in grub/boot-loaders, but they also taint the kernel, which is not ideal. The way around this is a custom kernel. By building the kernel with either the old microcode embedded in it, or with the ability to load microcode disabled, we can get around this issue.

Unfortunately, my board would eventually die and take my chip with it around April-May. An RMA later and I thankfully lucked into getting an older chip from week 47 of 2021. As a reminder, chips from the 3rd week of 2022 onward had the ports physically fused off during lithography.

Let's get going

“I’m back in business”

Specifically for my purposes and passion for CFD, I’ve been working on new homogeneous platform BLAS libraries and reduced precision CFD solvers, while also publishing small snippets and guides on FP16 on x86 around the web, on GitHub, Stack Overflow and tech forums, in addition to sharing what little sanity I have left on Twitter.

It’s key for this knowledge to be in place and accessible with C/C++23 being on the doorstep and standardizing FP16 types (optional for an implementation to provide, rather than mandatory). With LLVM supporting the type, Rust won't be far behind. Also little plug for the venerable FORTRAN, which has had FP16 support since FORTRAN-18.

Since then, it’s been pedal to the Silicon metal developing compute kernels for FP16 in OpenBLAS. The infrastructure PRs for detection and enablement are in, both for Alder Lake and Sapphire Rapids.

Branching off from there (wink wink. Branch predictor/code branching? Anyone? Anyone? Runs away from rotten tomatoes.)

FP16 and BLAS do introduce new problems: Turns out that there’s no official standard for what to call BLAS kernels in FP16, even if we’ve had them on ARM and GPUs + ASICs for years now.

The nomenclature is mostly HGEMM, HGEMV etc. where H is representing Half, the short form of “half precision floating point”, but it’s not standardized.
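For anyone who hasn't met the GEMM family: an HGEMM computes C = alpha*A*B + beta*C with the matrices stored in half precision. Here is a deliberately naive reference sketch I'm adding for illustration. The name hgemm_reference and its argument order are my own (as noted above, nothing here is standardized), matrices are row-major lists of lists, and accumulating in a wider type before rounding the result back to FP16 is one possible design choice, not a claim about what OpenBLAS or any other library does.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE-754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def hgemm_reference(alpha, a, b, beta, c):
    """Naive C = alpha*A*B + beta*C with FP16 storage (hypothetical helper).

    a is m x k, b is k x n, c is m x n, all row-major lists of lists.
    Inputs are rounded to FP16, products are accumulated in a Python
    float (double precision), and each output element is rounded to FP16.
    """
    m, k, n = len(a), len(b), len(b[0])
    out = []
    for i in range(m):
        row = []
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += to_fp16(a[i][p]) * to_fp16(b[p][j])
            row.append(to_fp16(alpha * acc + beta * to_fp16(c[i][j])))
        out.append(row)
    return out


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    c = [[0.0, 0.0], [0.0, 0.0]]
    print(hgemm_reference(1.0, a, b, 0.0, c))
```

A tuned kernel does the same arithmetic but blocks the loops for cache and keeps whole tiles of A, B and C in vector registers, which is where the AVX512 FP16 instructions come in.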

Oh, and complex? Don’t even think about it.

There is no name that’s agreed upon between any two implementations. Want to know why? Because nobody has bothered to implement proper complex plane BLAS kernels for FP16. Some implementations treat it as a specialty case of 2 different HGEMM kernels using relevant transforms, but it's not standard, and it doesn't follow the standard for both single and double precision set in the main BLAS specification in the first place. This forces the user to implement their own transforms, something the library is supposed to take care of in the first place.
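To show what "the user implements their own transforms" looks like in practice, here is a sketch of the textbook way to build a complex matrix product out of real ones: split A and B into real and imaginary parts and combine four real products. I don't know which transform any particular implementation uses, so treat this as a generic illustration. real_matmul stands in for a real-valued kernel such as an HGEMM, and both function names are made up.

```python
def real_matmul(a, b):
    """Plain real matrix product on row-major lists of lists.

    Stand-in for a real-valued GEMM kernel (hypothetical helper).
    """
    k, n = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)]
            for row in a]


def complex_matmul_via_real(a_re, a_im, b_re, b_im):
    """Complex product from four real products.

    (A_re + i*A_im)(B_re + i*B_im)
        = (A_re*B_re - A_im*B_im) + i*(A_re*B_im + A_im*B_re)
    Returns (c_re, c_im) as two real matrices.
    """
    rr = real_matmul(a_re, b_re)
    ii = real_matmul(a_im, b_im)
    ri = real_matmul(a_re, b_im)
    ir = real_matmul(a_im, b_re)
    rows, cols = len(rr), len(rr[0])
    c_re = [[rr[i][j] - ii[i][j] for j in range(cols)] for i in range(rows)]
    c_im = [[ri[i][j] + ir[i][j] for j in range(cols)] for i in range(rows)]
    return c_re, c_im


if __name__ == "__main__":
    # A = [[1+2i, 3-1i]], B = [[2-1i], [0+1i]]
    c_re, c_im = complex_matmul_via_real(
        [[1.0, 3.0]], [[2.0, -1.0]], [[2.0], [0.0]], [[-1.0], [1.0]])
    print(c_re, c_im)
```

Every caller has to carry the split layout, the four calls and the recombination themselves, and in FP16 each of those intermediate results is another place to lose precision. That bookkeeping is exactly what a proper complex kernel with an agreed name would hide.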

I’ve written to the BLAS Technical Forum on this topic to hopefully get that rectified, but don’t hold your breath. There’s also the ISO C++ working group, AKA WG21 that is working on moving the BLAS standard into C++ as a language spec. Problem is they explicitly avoided adding any extensions beyond the current standard. It also didn’t make it in for C++23, which means we’re looking at C++26 at the earliest for standard naming in C++, then presumably C++29 for extended types.

Anyway, I’ve got another post I’m working on to discuss that, but I’ve also got more papers from the '70s through the '00s on the old BLAS extended precision specification decisions and justifications to dig through first.

There’s also some work being done for enabling AVX512 on Pcores while also having the ecores active.

Working title if I get the chance to publish is something along the lines of “Exploration of approaches to Hybrid ISA Schedulers”. Examples of approaches that can be taken:

Boot the device as 2 NUMA nodes at the Pcore/ Ecore boundary, segregate different PCIe roots to each one and go from there. This is a silly approach.

A second is to (ab)use the power of process/thread pinning, which already exists in the scheduler. Move all of kernel space to the Ecores, which handle IO well due to their relatively massive parallel IO access, then dedicate the Pcores to user space. Limit the kernel's ISA to that of the Ecores, and user space to that of the Pcores. It’s not great, but a nice bonus of this is you can disable many of the mitigations for speculative execution, since kernel branches are never in the same space as user (AKA the attacker) branches. You also get less frequent needs to context switch, a fundamental performance good that’s often being fought. There is a penalty from passing data from one core type to another, but it’s not the worst thing.
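The user space half of that idea can be tried today with plain CPU affinity. Below is a minimal sketch I'm adding for illustration, using Python's os.sched_getaffinity and os.sched_setaffinity (Linux only). The logical CPU numbering in P_CORE_CPUS is an assumption (Pcore hyperthreads enumerated first as 0-15 on an 8 Pcore part), so check your own topology before trusting it. The helper names are made up, and keeping kernel work on the Ecores is not covered here.

```python
import os

# ASSUMPTION: logical CPUs 0-15 are the 8 Pcores with Hyper-Threading.
# The real numbering depends on the chip, firmware and kernel.
P_CORE_CPUS = set(range(16))


def choose_cpus(wanted, available):
    """Return the CPUs from `wanted` that are actually in `available`."""
    target = set(wanted) & set(available)
    if not target:
        raise ValueError("none of the requested CPUs are available")
    return target


def pin_to_cpus(wanted, pid=0):
    """Restrict `pid` (0 means the calling process) to the `wanted` CPUs."""
    if not hasattr(os, "sched_setaffinity"):
        raise OSError("CPU affinity is not exposed on this platform")
    target = choose_cpus(wanted, os.sched_getaffinity(pid))
    os.sched_setaffinity(pid, target)
    return target


if __name__ == "__main__":
    # Pin this process before it starts any AVX512 work, so the scheduler
    # can never place it on a core that lacks the instructions.
    print("now restricted to CPUs:", sorted(pin_to_cpus(P_CORE_CPUS)))
```

Affinity is inherited by child processes, so pinning a launcher process once is enough to keep a whole job on the chosen cores.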

There are another 5/6 approaches I’ve done preliminary work on but not enough to ask for review.

Also, I was ghosted by some of Intel’s CPU team after they had reached out offering help to support hybrid ISA scheduling, so that might be playing into why I’m not exactly impressed with them.

Another discussion for another day.

Feel free to catch me talking about compute, hardware, or making some truly horrid puns over on Twitter @FelixCLC_

This is where the post ended, but because LinkedIn and SEO are a thing

#Fortran #HPC #supercomputers #AVX512 #coding #ISO #IEEE #CPU #Linux #opensource #GPU #CFD #heterogeneouscompute #homogeneouscompute #FCLC
