Setting up Your AI/ML Home Lab: Newbie Challenges and Solutions
Photo by Matheus Bertelli: https://www.pexels.com/photo/smartphone-laptop-working-industry-16094066/

Earlier this year, we were thinking about getting a new machine for our home studio, one that doesn't produce noise like the HPE ProLiant lab server we currently own. We ended up buying a new M2 Mac Studio. With all the recent Apple advancements in the M2 chip, and with the configuration maxed out, we thought this machine would be our one-stop shop for everything: coding, video production, and even our AI/ML learning.

However, we learned the hard way that, long story short, you need an Nvidia GPU to make real progress and to benefit from the community-wide support around Nvidia hardware.

Long story long, we want to share our journey here in case you're getting started in this domain like us, to save you some of the hassle and confusion, and to walk through the choices to be made along with the reasoning behind them.

Why or Why Not M1/M2 Chip for AI/ML self-learning and explorations?

Apple provides Metal Performance Shaders (MPS), a framework of GPU primitives for tasks such as linear algebra, ray tracing, and machine learning, which AI/ML libraries like PyTorch and TensorFlow can benefit from. However, there are a few caveats to that promise from a market-wide usability perspective as of today (which may or may not change in the near future).

PyTorch limits MPS support to macOS host machines only (as of the stable version, 2.0.1).

This means that if you need to run Docker containers with AI software that uses PyTorch, such as Haystack, you simply can't do so with GPU acceleration!

PyTorch supported configurations for MPS, per the PyTorch.org prerequisites tool
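To make the caveat above concrete, here is a minimal sketch of how a script might pick a PyTorch device at startup. The helper names `pick_device` and `detect_device` are our own (hypothetical); the only PyTorch calls it relies on are `torch.cuda.is_available()` and `torch.backends.mps.is_available()`. Inside a Linux Docker container on a Mac we would expect it to fall back to the CPU, since MPS is only exposed to macOS hosts.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple's MPS backend, then plain CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Ask PyTorch what is available; report 'cpu' if PyTorch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # Older PyTorch builds may not ship the MPS backend module at all.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps_backend is not None and mps_backend.is_available())
    return pick_device(torch.cuda.is_available(), mps_ok)


if __name__ == "__main__":
    print(f"Selected device: {detect_device()}")
```

Keeping the decision in a pure function makes the preference order easy to test without any GPU attached.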

CUDA Compute Platform is the way to go, but... you need to know this!

From the above, we came to learn about CUDA, the Nvidia toolkit that provides the development environment and runtimes you need for GPU parallel processing, which is what you need for your AI models.

At that point, obviously, we decided to get an Nvidia GPU.

From the PyTorch.org prerequisites tool, we learned that we would need CUDA 11.8, or at least 11.7, to run PyTorch on a Linux host. That adds one more complexity factor to the equation: would it work with any Nvidia GPU?

Or is there some sort of CUDA/GPU compatibility aspect to consider?

That turned out to be the case. In a nutshell, your CUDA version has to be compatible with your GPU's Compute Capability version.

Source: https://docs.nvidia.com/deploy/cuda-compatibility/
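The compatibility rule can be sketched as a simple lookup. The table below is a small illustrative subset written from memory, not a copy of Nvidia's documentation, and the names `MIN_CUDA_FOR_CAPABILITY` and `toolkit_supports` are hypothetical. It only checks one direction (whether a toolkit is new enough for a GPU); older GPUs being dropped by newer toolkits is not modeled. We believe the RTX 30 series consumer cards report compute capability 8.6, but please confirm against the Nvidia page linked above.

```python
# Earliest CUDA toolkit release that can target a given compute capability.
# ASSUMPTION: illustrative subset only; verify each row in Nvidia's docs.
MIN_CUDA_FOR_CAPABILITY = {
    (7, 5): (10, 0),  # Turing
    (8, 0): (11, 0),  # Ampere (data center)
    (8, 6): (11, 1),  # Ampere (consumer)
    (8, 9): (11, 8),  # Ada Lovelace
}


def toolkit_supports(capability: tuple, cuda_version: tuple) -> bool:
    """Return True if the CUDA toolkit version is new enough for the GPU."""
    minimum = MIN_CUDA_FOR_CAPABILITY.get(capability)
    if minimum is None:
        raise KeyError(f"Compute capability {capability} is not in the table")
    # Tuples compare element by element, so (11, 8) >= (11, 1) is True.
    return cuda_version >= minimum


if __name__ == "__main__":
    for version in [(11, 7), (11, 8)]:
        ok = toolkit_supports((8, 6), version)
        print(f"CUDA {version[0]}.{version[1]} with capability 8.6: {ok}")
```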

After a bit of research, and being on the hunt for an Nvidia GPU that supports the latest CUDA versions, we decided on the GeForce RTX 3060 Ti 8 GB. Yes, for our purposes that is better than the RTX 3060 12 GB, simply because it has 4,864 CUDA cores versus the 3,584 in its non-Ti competitor. More RAM is not the only factor, especially with a performance boost of about 28% in our case according to gpu.userbenchmark.com.
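As a quick worked calculation on the core counts quoted above (the function name `percent_more` is ours):

```python
def percent_more(larger: int, smaller: int) -> float:
    """How many percent more units `larger` has compared to `smaller`."""
    return (larger - smaller) / smaller * 100


if __name__ == "__main__":
    ti_cores = 4864      # RTX 3060 Ti
    non_ti_cores = 3584  # RTX 3060
    print(f"{percent_more(ti_cores, non_ti_cores):.1f}% more CUDA cores")
    # prints "35.7% more CUDA cores"
```

The core-count gap works out to roughly 36%, while the benchmark figure we quoted is about 28%, a reminder that core count alone does not translate one-to-one into measured performance.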

Interesting Thought!

Can you run an Nvidia GPU on an M1/M2 Mac host?

We decided to go the extra mile and figure out how we could run an Nvidia GPU on macOS and kill two birds with one stone, but then we hit a snag!

First of all, how would you get it connected? The Mac Studio is a little box, and the GPU is most likely wider than the machine itself. Looking around for a solution, we found eGPU (external GPU) enclosures: external cases where you can fit your GPU along with its power supply and connect it to your computer through a USB-C/Thunderbolt connection!

Miraculous, right? Not quite! There is yet another challenge: the Nvidia GPU driver for M1/M2 Macs. We could find a driver for Intel-based Macs, because those have long been supported. For M1/M2 Macs, however, Apple seems to have dropped that support and has not mentioned any plans to support third-party GPUs in the future, so it was back to square one.

We decided to cut our losses on the M1/M2 Mac and go back to AMD/Intel PCs.

AMD CPU was a better bang for the buck

We put together an AM4 build with an AMD Ryzen 5 5600X 6-core, 12-thread CPU on an MSI motherboard with 128 GB of DDR4 RAM, and thought we were ready for showtime, having figured out multiple pieces of the puzzle. But no, a couple more issues blocked our way!

The first, which was resolved quickly, was that the motherboard gave us a black screen. After a bit of research, we figured out that it was due to the GPU being newer than the motherboard: although the board physically supports AM4, its firmware needed to be upgraded. We did exactly that with the help of the motherboard manual, the onboard CMOS button for flashing the motherboard, and an MBR/FAT32-formatted USB drive (DON'T USE GPT-FORMATTED PARTITIONS).

GPU Passthrough

The second problem: we learned that in order to share the Nvidia GPU with our virtual machine, we need to configure what is called GPU passthrough, a technology implemented in most hypervisors that allows the Linux kernel to present the internal PCI GPU directly to a virtual machine. And since we are not in a data center kind of setup, Nvidia doesn't really allow sharing the GPU across multiple hosts; that is only allowed for high-end GPUs like Tesla, for enterprise data mining setups.

This means that at any given point in time, a single GPU can only be attached to a single host, whether physical or virtual! We could trick the Linux OS into thinking it's a physical host rather than a virtual machine so that it recognizes the GPU card, but that was not the real challenge!
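One host-side check that is commonly done before attempting passthrough on Linux is to look at how the kernel has grouped PCI devices into IOMMU groups, since a device is typically handed to a VM together with the rest of its group. The following is a minimal sketch under assumptions: it relies on the standard sysfs layout `/sys/kernel/iommu_groups/<group>/devices/<PCI address>`, it is not something we covered in this post, and the function names `group_devices` and `scan` are hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def group_devices(device_paths):
    """Map IOMMU group number -> sorted list of PCI addresses.

    Each path is expected to look like
    /sys/kernel/iommu_groups/<group>/devices/<PCI address>.
    """
    groups = defaultdict(list)
    for raw_path in device_paths:
        parts = Path(raw_path).parts
        idx = parts.index("iommu_groups")
        group_number = int(parts[idx + 1])
        pci_address = parts[idx + 3]  # skips the literal "devices" segment
        groups[group_number].append(pci_address)
    return {group: sorted(devs) for group, devs in sorted(groups.items())}


def scan(root="/sys/kernel/iommu_groups"):
    """Read the groups from sysfs; returns {} if the directory is absent."""
    return group_devices(str(p) for p in Path(root).glob("*/devices/*"))


if __name__ == "__main__":
    found = scan()
    if not found:
        print("No IOMMU groups found (IOMMU may be disabled on this host)")
    for group, devices in found.items():
        print(f"group {group}: {', '.join(devices)}")
```

If the GPU shares a group with unrelated devices, passthrough tends to get more complicated, so this listing is a useful first look before going further.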

The real challenge, as you must have guessed by now, was that with this constraint we needed another GPU: one for our physical box's display and the other for the VMs and Docker containers running on top. That took us back to the built-in graphics output on our motherboard, but wait a second!

We had an AMD Ryzen 5 5600X, and to turn on the motherboard's integrated graphics we needed an AMD Ryzen 5 5600G instead, as that is the one that does the magic: it carries the GPU on the chip that drives the motherboard's display outputs. So we got one!

Now we're ready

The next steps are all about deciding on the virtualization OS, deciding which VMs we need, and getting GPU passthrough working from the physical host to the VM to the Docker Engine on top of that. That will introduce us to the Nvidia Linux GPU drivers, the CUDA Toolkit, and the Nvidia Container Toolkit.
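As a small preview, one sanity check that can be repeated at each layer (physical host, VM, container) is to ask `nvidia-smi` which GPUs it can see. This is a minimal sketch, assuming the Nvidia driver utilities are installed wherever it runs; the helper names `parse_gpu_lines` and `query_gpus` are hypothetical, and it simply returns an empty list when `nvidia-smi` is not on the PATH.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi's --query-gpu option.
QUERY_FIELDS = ["name", "driver_version", "memory.total"]


def parse_gpu_lines(csv_text: str) -> list:
    """Turn 'csv,noheader' output into one dict per GPU line."""
    gpus = []
    for line in csv_text.strip().splitlines():
        values = [value.strip() for value in line.split(",")]
        if len(values) == len(QUERY_FIELDS):
            gpus.append(dict(zip(QUERY_FIELDS, values)))
    return gpus


def query_gpus() -> list:
    """Run nvidia-smi if present; return [] when no GPU is reachable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        [
            "nvidia-smi",
            f"--query-gpu={','.join(QUERY_FIELDS)}",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_gpu_lines(result.stdout) if result.returncode == 0 else []


if __name__ == "__main__":
    gpus = query_gpus()
    print(gpus if gpus else "No Nvidia GPU visible from this environment")
```

Running the same script on the host, inside the VM, and inside a container should show where along the chain the GPU stops being visible.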

We'll discuss all that along with the tips and tricks to have an AI/ML multipurpose Home Lab Server up and running in our next post.

If you've enjoyed this article please consider sharing it with others!

Acknowledgment

The author would like to thank Ayman H. (Homains CTO) for the support and contributions provided as part of this journey.

About the Author

Amr Salem was born in Cairo, Egypt. He is a technology geek. He currently holds the position of Principal Engineer-System Architecture with Verizon, Temple Terrace, FL, USA. He plays a pivotal role in providing innovative solutions across Verizon’s Network Systems. Before joining Verizon, he was with the IBM Clients Innovation Center, where he honed his skills and expertise in the technology field. His diverse talents and dedication make him a valuable asset in the technology industry and a source of inspiration for aspiring writers.
