HPC Hardware and Cloud GPU Hosting for the Development and Operation of Artificial Intelligence: A Conversation with the Founders of AIME GmbH
Interview by Sigurd Schacht and Carsten Lanquillon with Henri Hagenow and Toine Diepstraten, the founders of AIME GmbH, as part of episode 88 of the 'Knowledge Science' podcast on artificial intelligence (in German), July 2023.
(Translated, condensed, slightly expanded version)
Overview
Interview
Hello and welcome Henri Hagenow and Toine Diepstraten from AIME GmbH. We have the opportunity to gain insight into a company that specializes in providing HPC hardware and cloud GPU hosting. Please introduce yourselves briefly.
Henri: Hello Sigurd and Carsten. I'm Henri, CEO of AIME. I'm a physicist, and artificial intelligence is a topic I'm very interested in and passionate about.
Toine: Exactly - my name is Toine Diepstraten. I'm the CTO of AIME. We founded the company together. I have a degree in computer science, with a focus on distributed systems and AI. We started AIME four years ago and we're very pleased with all the developments in the field of AI over the past four years.
Before we learn what AIME is and what you offer, I find it fascinating to hear how the company was founded. Maybe you could tell us a bit about the story?
Henri: Toine and I had a venture before AIME, where we worked on encrypted communication, but we quickly shifted our focus to AI because it's an area of interest that overlaps for us. We started training a speech-to-text engine based on deep learning, specifically the DeepSpeech model. We sat down and discussed how to turn this into a product. At that time, we gathered what we believed to be the largest German language corpus, normalized the data, and started training.
Toine: We thought about the best way to train the model. That's when we started looking into GPUs and built our first AI PC around them. Our goal was to fit as many GPUs as possible into a workstation. At that time, we used the Nvidia RTX 2080 Ti. With that workstation, we trained our first deep learning model. We were very proud when we achieved our first successful training. Then we thought about what to do with the trained language model. We tried to sell it as a product, but we also saw a demand for training deep learning models and realized that other people were also looking for suitable hardware. So, we shifted our focus from the original idea of training models to hardware sales. We started marketing the PC system we had developed for our own purposes, and quickly had the first interested customers who wanted to buy our HPC workstations based on Nvidia GPUs.
Henri: Initially, our workstations used liquid cooling. We modified the RTX 2080 Ti GPUs and tried to optimize the systems as much as possible because we quickly realized that off-the-shelf PCs didn't deliver the performance that such a system is capable of. Cooling was mainly the issue. We had the impression that systems were being sold that throttled their performance due to overheating. Four GPUs didn't mean four times the performance. We were shocked and frustrated because we felt it was unfair to the customers. Most people who want to run models don't necessarily have a deep understanding of the hardware specifications required to operate such a system at maximum performance 24/7. We saw that as a challenge and put a lot of thought into cooling - initially, a lot of liquid cooling. However, we eventually moved away from liquid cooling because it required a lot of maintenance and caused shipping issues. We shipped throughout Europe and encountered several cases of damage. Liquid cooling is also very maintenance-intensive. In continuous operation with heat generation, it lasts about a year, or two if well maintained. Then the pumps get clogged, and the water needs to be refilled. Many customers don't do this themselves, so they return the systems.
I'm really interested to know, if it's not a trade secret: What kind of cooling solution do you use now?
Henri: The AIME Workstation we currently offer uses pure air cooling.
Okay - is that sufficient?
Henri: We have an intelligent, well-designed cooling concept that is sufficient. However, we also configure the fan profiles and optimize the systems as much as possible. The quality and combination of components play a major role. But it should be noted that air-cooled workstations become louder as the performance increases.
Toine: Exactly, the systems are becoming more powerful and require more power. More power always means more heat that needs to be effectively dissipated and cooled. Over time, the models being trained became larger, and the servers needed to become more powerful as well, requiring them to be operated in a properly equipped server room. So, for our own training, we transitioned from our first workstation to servers and realized that our customers also faced the challenge of finding a suitable location to run their servers 24/7 without overheating. Standard data centers are not prepared for the significant heat generation and high power consumption. This was the third turning point in our development: we began offering colocation hosting for the servers purchased from us and renting out our own AIME servers as a GPU cloud service.
In the early days of your company, was it also during the time of the crypto hype when the crypto market went crazy, and there were probably hardly any graphics cards available?
Henri: Exactly, we had the crypto boom, the pandemic, and the Suez Canal incident. We noticed all of that, but we had already positioned ourselves well by maintaining direct contact with manufacturers. We had also built a good reputation. So, we never really had trouble finding GPUs. During that time, we were the reliable source for getting HPC / Multi-GPU servers and workstations quickly.
To sum up briefly, you turned a necessity into an opportunity and developed it into a business idea. Could you summarize what you offer in detail?
Henri: Certainly! We provide an HPC cloud service where users can rent multi-GPU systems, ranging from the A5000 to the H100. Additionally, we sell AMD accelerator cards, mainly the MI100, but we hope to offer the MI300 soon when it becomes available. We also sell HPC servers and multi-GPU workstations with up to 8 GPUs per system, including HGX systems and entire clusters with storage solutions. Lastly, we develop software and have a small lab for conducting our own experiments and testing open-source models.
You mentioned that you are developing software. Is that the AIME-MLC, the Machine Learning Container solution?
Toine: Yes, exactly. It started with a problem we wanted to solve: Customers faced the challenge of managing multiple projects and users concurrently on one server without having to reinstall the necessary NVIDIA drivers and various frameworks each time. For every new project or experiment, they had to find the right driver, configure the Linux version for optimal performance, install the correct framework version compatible with the driver, and so on. So, we decided to deliver all our systems pre-installed with a solution that quickly and easily allows users to install and instantiate PyTorch and TensorFlow, the most widely used deep learning frameworks. We call this solution AIME-MLC: AIME Machine Learning Containers. It serves as a version control for the most popular PyTorch and TensorFlow versions, which can be easily instantiated with a short terminal command. This is especially advantageous for systems used by multiple users, allowing each user to have their favorite TensorFlow or PyTorch versions without affecting others' systems.
Does it work based on Docker, as a wrapper, or is it a custom solution? How did you develop it?
Toine: It is based on Docker. However, we build our own containers and look at what the official containers, which are mostly based on Conda or pip, can do. Essentially, it's an interface that makes Docker easy to use, with the extension that the containers are already available as version-controlled packages, eliminating the need to search the internet for them. These are proper containers, some of which are GPU-dependent, as different GPU generations require different compatible CUDA versions and various TensorFlow and PyTorch versions. The entire system is designed to give quick access to a Docker container. Users get a sort of virtual machine layer where they can use a Docker container as if it were a VM. They log into the container and can use the selected PyTorch version in the shell as if they had installed it natively.
Henri: The containers are lightweight and slim, eliminating the need for virtual environments like 'venv.' Users can simply start their projects in such containers. It's very convenient.
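To make the idea more concrete, here is a minimal, hypothetical sketch of the container-as-workspace pattern described above, written with the Docker SDK for Python. It is not AIME's actual MLC code; the image tag, container name, and paths are placeholders.

```python
# Hypothetical sketch (not AIME's MLC code): start a version-pinned deep
# learning container with GPU access via the Docker SDK for Python, then use
# it like a lightweight VM. Image tag, names and paths are placeholders.
import docker

client = docker.from_env()

IMAGE = "example/pytorch:2.1.0-cuda12.1"   # framework + version + CUDA generation

container = client.containers.run(
    IMAGE,
    command="sleep infinity",              # keep the container alive
    name="ml-project-alice",               # one container per user/project
    detach=True,
    device_requests=[                      # expose the GPUs to the container
        docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])
    ],
    volumes={"/home/alice/workspace": {"bind": "/workspace", "mode": "rw"}},
    working_dir="/workspace",
)

# "Logging in": execute commands inside the running container as if the
# framework were installed natively on the host.
exit_code, output = container.exec_run(
    "python -c 'import torch; print(torch.__version__)'"
)
print(output.decode())
```

In AIME-MLC, these steps are wrapped in short terminal commands, so users work with version-controlled containers without having to handle Docker options like these themselves.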
We had the opportunity to test AIME-MLC thanks to your sponsorship of an AIME 2xH100 server. At our university, each student team has its own Docker container. We were extremely impressed with how lightweight AIME-MLC is compared to the traditional NVIDIA containers we've been using. With 20 groups working with those containers, the server's storage was almost a terabyte full, and performance was slow. But with your container framework, the speed and size were incomparable. We were really enthusiastic about it.
Toine: Thank you.
Henri: Usability is essential to us. Our idea is for customers to buy a system from us, unpack it, and get started immediately. It can't get any easier than that.
You can tell that you built it out of your own use and necessity: you know what is needed and what is missing, and you implement it accordingly.
Henri: Exactly, that's what we stand for, so to speak: we build systems where we put a lot of thought into the specific use case, and the system is optimized for that use case.
Toine: There's a saying, "Eat your own dog food." So, we also use the systems for our own research. That way, we know what really matters.
Research is a good keyword. What are you currently researching?
Henri: Well, there are many open-source models available now, and we try out a lot of them. We have a strong background in audio, so we are interested in everything related to audio. Currently, a significant portion of generative models focuses on audio, including speech recognition, speech generation, and music generation. And of course, image generation, video, and more – also very interesting. As for large language models, we have our own LLaMA 65B up and running, and we are experimenting with it. We recently published a blog article describing how to distribute this model across multiple GPUs in a multi-GPU system. We are currently working on making this large language model accessible via an API, which will enable multi-user access.
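As an illustration of what such a multi-GPU setup involves, the following sketch uses the Hugging Face transformers and accelerate libraries to shard a large checkpoint across all visible GPUs. This is one common approach, not necessarily the one used in the AIME blog article, and the checkpoint identifier is a placeholder.

```python
# Sketch only: one common way to spread a large language model across several
# GPUs, using Hugging Face transformers with accelerate (device_map="auto").
# The checkpoint name is a placeholder; the AIME blog article may use a
# different method. Requires: pip install transformers accelerate sentencepiece
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "huggyllama/llama-65b"   # placeholder checkpoint identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,   # ~2 bytes per parameter, roughly 130 GB of weights
    device_map="auto",           # accelerate shards the layers across all visible GPUs
)

prompt = "Explain in one sentence what a multi-GPU server is."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```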
Toine: Exactly. The general idea is that we try out these models and then write about them in the AIME blog, explaining how to install the models and what the user can expect, especially in terms of the required system configuration. We already have an article about LLaMA 65B and two about Stable Diffusion.
Henri: Soon, there will be more articles, for example, about MusicGen for music generation or Bark for speech generation. Naturally, there are many other models we would like to compare as well. And we are big fans of benchmarking. We benchmark how the tested model performs on different GPUs. We have an article about GPU benchmarking in our blog and also in the current special edition of iX magazine on artificial intelligence. We benchmark every GPU we can get our hands on, compare them against each other, and analyze their performance.
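For readers curious what such a measurement can look like in its simplest form, here is a small PyTorch sketch that times FP16 matrix multiplications and reports the effective TFLOPS. It is only an illustration, not the benchmark suite behind the AIME blog or the iX article.

```python
# Minimal sketch of a GPU throughput benchmark (not the AIME benchmark suite):
# time a series of FP16 matrix multiplications and report effective TFLOPS.
import time
import torch

def matmul_tflops(n=8192, iters=50, dtype=torch.float16):
    a = torch.randn(n, n, dtype=dtype, device="cuda")
    b = torch.randn(n, n, dtype=dtype, device="cuda")
    for _ in range(5):                 # warm-up: initialize kernels and caches
        a @ b
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    flops = 2 * n ** 3 * iters         # ~2*n^3 floating-point operations per matmul
    return flops / elapsed / 1e12

if __name__ == "__main__":
    print(f"{torch.cuda.get_device_name(0)}: {matmul_tflops():.1f} TFLOPS")
```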
Cool. What can we say about which graphics card LLaMA performs best on?
Toine: For inference, which means applying LLaMA, you need a lot of memory. The 65B variant of the model is 122 GB in size. That means you would need at least two A100 80 GB GPUs. We are currently measuring which setup is faster: two A100 80 GB GPUs or several smaller ones, such as the RTX A5000 or even the RTX 3090, which have 24 GB each. We are still testing to see which GPU can handle more requests per second. I would almost bet that six or eight A5000s perform faster than two A100 80 GB GPUs. But it's not confirmed yet – we really want to benchmark it: where do we get more requests per second? These are the tests we conduct in the AIME-Lab to be able to recommend to our customers which GPU – or how many – is optimal for each application (also considering price-performance). Sometimes, many smaller GPUs are faster (and more cost-effective) than the largest ones available.
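Toine's figures can be roughly sanity-checked with a back-of-the-envelope calculation, sketched below under simple assumptions (FP16 weights at 2 bytes per parameter plus a flat 20% margin for activations, KV cache, and framework overhead; real requirements depend on batch size and context length).

```python
# Back-of-the-envelope check of the memory figures above. Assumptions: FP16
# weights at 2 bytes per parameter plus a flat 20% margin for activations,
# KV cache and framework overhead; real requirements depend on batch size
# and context length.
import math

def gpus_needed(params_billion, bytes_per_param=2, overhead=1.2, gpu_mem_gb=80):
    weights_gb = params_billion * bytes_per_param          # 65B params -> ~130 GB
    total_gb = weights_gb * overhead
    return weights_gb, math.ceil(total_gb / gpu_mem_gb)

weights, n_80gb = gpus_needed(65, gpu_mem_gb=80)   # A100 / H100 80 GB
_, n_24gb = gpus_needed(65, gpu_mem_gb=24)         # RTX A5000 / RTX 3090, 24 GB each
print(f"~{weights:.0f} GB of FP16 weights -> {n_80gb}x 80 GB or {n_24gb}x 24 GB GPUs")
```

With these assumptions, a 65B model comes to roughly 130 GB of weights, which lands at two 80 GB cards or around seven 24 GB cards, consistent with the numbers mentioned above.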
So, in a way, you distribute the model across eight RTX A5000 GPUs. You convert the weights and then distribute them to the eight graphics cards, and then you perform inference across the eight GPUs?
Toine: Exactly, you partition the model, and each GPU only sees and computes its portion of the model. And when you add up the teraflops delivered by these eight A5000 GPUs, their performance can indeed be higher than that of an A100 or H100. But we would like to provide more detailed information on this.
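The partitioning Toine describes can be illustrated with a toy example: split a weight matrix column-wise so that each GPU stores and multiplies only its own slice, then gather the partial results. Real tensor-parallel implementations add communication primitives, bias handling, and much more; this is only a sketch.

```python
# Toy illustration of the partitioning idea: split a linear layer's weight
# matrix column-wise so each GPU stores and multiplies only its own slice,
# then gather the partial outputs. Real tensor-parallel frameworks add
# communication, bias handling and far more.
import torch

def column_parallel_matmul(x_cpu, weight_cpu):
    n_gpus = torch.cuda.device_count()
    assert n_gpus >= 1, "requires at least one CUDA GPU"
    shards = weight_cpu.chunk(n_gpus, dim=1)     # one block of columns per GPU
    partial_outputs = []
    for rank, shard in enumerate(shards):
        device = f"cuda:{rank}"
        x = x_cpu.to(device)
        w = shard.to(device)                     # each GPU only sees its portion
        partial_outputs.append((x @ w).cpu())    # compute the partial result there
    return torch.cat(partial_outputs, dim=1)     # reassemble the full output

x = torch.randn(4, 1024)        # a small batch of activations
W = torch.randn(1024, 4096)     # full weight matrix, kept on the CPU
y = column_parallel_matmul(x, W)
print(y.shape)                  # torch.Size([4, 4096])
```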
So, we should keep an eye on the AIME blog?
Toine: Yes, exactly.
Henri: Certainly. On LinkedIn, we always announce new blog posts. So those are our main channels. Twitter has become somewhat uninteresting lately. So, at the moment, LinkedIn is a good place to find out what we are up to, what models we are currently testing, and so on.
What are your favorite use cases for AI models? It seems like audio is your background and something you find exciting. Is that the area where you are doing or planning to do the most work when it comes to training, research, or similar activities?
Henri: Well, in these times, it's challenging to commit to a specific area because something entirely new and exciting might emerge in two weeks, like generative image generation or the large language models that have become popular in recent months. So, is your question about the company's focus or our personal interests?
Sure, you can choose. First, on a personal level.
Henri: Personally, I enjoy comparing our LLaMA chatbot with ChatGPT to discover their differences. Validating LLMs (large language models) is a unique and analytically complex challenge these days.
Toine: I find it interesting to see the unfiltered responses from our LLaMA chat, without the Microsoft filter in front of it. It allows us to gain insights into the depths of such models and also test non-politically-correct statements.
Henri: Yes, LLaMA tends to become aggressive quickly. It depends on the input context set for it, of course. But even if you simply state that it is an AI and not just an assistant, it often goes off the rails quickly.
Toine: That might be due to the science fiction books used as training data, which usually depict apocalyptic scenarios when it comes to AIs. Those elements are embedded in the model's information.
Henri: Exactly. Besides that, I like to play around with audio loopers. I have an artistic background, as I used to produce music. I'd love to do more in that area, but unfortunately, time is scarce. I have some ideas for AI models that could support music production, such as how to integrate them into the artistic process. And, of course, images – we use AI to generate images for every blog post, which sometimes takes longer than the traditional way.
Toine: To get the right image, so to speak.
Henri: Exactly, to get the exact image we want. Cherry-picking the right image can be a bit more time-consuming than one would assume. Prompting is crucial here, as with all text-based generative models.
Perhaps from a professional perspective - you are indeed professionals when it comes to hardware and optimization. Regarding ChatGPT, do you think each response is generated directly from an inference on the GPUs, or is there a caching mechanism in place that stores the most common requests to achieve such response speed?
Toine: Well, it depends on the size of the model. With LLaMA 65B on an H100, we could generate real-time responses as well. It's impressive how many GPUs are needed to handle a large number of users' requests. Caching is an interesting idea, but I haven't thought about it much. Certainly, there are standard questions like "What's the weather today?" similar to Siri or Alexa, where caching could be applied. I can imagine that some tricks are used in that regard. That's also why it's unfortunate that Microsoft has chosen not to disclose any information about their LLM system and what they exactly have running.
Yes, it's a shame. There was a time when people tried to outsmart ChatGPT with repeated prompts to get consistent responses. That led to the idea of caching answers for frequently asked questions to save energy and computational resources by directly providing responses for such cases.
Toine: That's definitely an interesting idea. It's almost patent-worthy if it's still possible.
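The caching idea discussed here can be sketched in a few lines: normalize the incoming prompt, look it up in a bounded cache, and only fall through to the expensive model call on a miss. The generate function below is a stand-in, not a real inference backend.

```python
# Minimal sketch of answer caching for recurring prompts. The generate()
# function is a stand-in for an actual (expensive, GPU-bound) inference call.
from functools import lru_cache

def generate(prompt: str) -> str:
    # placeholder for the real multi-GPU inference step
    return f"(model output for: {prompt})"

def normalize(prompt: str) -> str:
    # crude normalization so trivial variations hit the same cache entry
    return " ".join(prompt.lower().split())

@lru_cache(maxsize=10_000)
def cached_answer(normalized_prompt: str) -> str:
    return generate(normalized_prompt)   # only reached on a cache miss

def answer(prompt: str) -> str:
    return cached_answer(normalize(prompt))

print(answer("What's the weather today?"))
print(answer("what's   the weather TODAY?"))   # served from the cache, no model call
```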
Henri: Another interesting aspect is the combination of agent systems linked to language models. And why not connect different models together to create a communication partner you can interact with and that then responds – this is not far off now.
Yes, that's indeed the case. We have a project called 'Social Robot speaks to Social Robot,' an installation featuring a Furhat robot head from the Swedish company Furhat Robotics. In this project, we utilize a WizardLM model with 30 billion parameters. One robot head speaks, and the other listens and responds. The aim is to create a dialogue between the two robots, and it's fascinating to listen to the conversations that emerge from their interaction. Due to the acoustic audio transmission, errors occur in the recognition process, leading to diverse and intriguing dialogues.
Henri: Like the game of Telephone, so to speak.
Yes, exactly. There is also a virtual digital twin of the Furhat, which we have connected to GPT-3.5 to demonstrate a dialogue with an AI. Our students are very enthusiastic about it. And that's exactly what happens – sometimes the Furhat hears things it shouldn't and responds to them. If it talks for too long, it even starts listening to itself and begins to comment on its own statements. If you have multiple Furhat robots interacting with each other, you would need to incorporate a few mechanisms. Just like with humans, some may interrupt, and others may not wait for their turn – so, some level of control might be necessary. But it's really fascinating to see what happens.
Toine: It definitely sounds interesting.
Henri: Yes, connecting the models and providing feedback to the model is all very intriguing.
You mentioned that you test and operate models and that you're working on an API. Will there be an offering from AIME in the future that allows users to access models in your GPU cloud through an API or even integrate them into their own processes?
Henri: Absolutely, that's what we're currently working on, and we'll be able to introduce something soon.
Toine: We see that there are more and more models that are truly usable. A few years ago, it wasn't the case, and it was more of a research field. People were happy if the model's output was correct in one out of ten cases. But the models available now are already interesting for practical applications. So, we definitely see a market moving away from pure training and research, where you could try something out and train or fine-tune a model yourself, to a focus on inference solutions. More and more people want to operate models, and we need to keep up and offer our customers something. Currently, we're working on a solution that allows easy deployment of models, enabling users to extract their models from pure Python console applications or Jupyter notebooks. This solution could be used for a web service with API integration, allowing users to access and operate their models as products in a web context. It's one of the most important topics we're focusing on right now – transitioning from pure research to model deployment. We're working hard to release an initial version within this year.
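As a rough idea of the deployment pattern Toine outlines, the sketch below wraps a placeholder inference function in a small FastAPI service. This is not AIME's forthcoming product, just the general shape of turning a model script into a web API.

```python
# Sketch of the general deployment pattern described above (not AIME's product):
# expose a model's inference function as a small web API with FastAPI.
# Could be started with, e.g.:  uvicorn serve:app --host 0.0.0.0 --port 8000
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 64

def run_inference(prompt: str, max_new_tokens: int) -> str:
    # placeholder for the actual (multi-GPU) model call
    return f"(completion for: {prompt!r}, up to {max_new_tokens} new tokens)"

@app.post("/generate")
def generate(request: GenerateRequest):
    return {"completion": run_inference(request.prompt, request.max_new_tokens)}
```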
The past has shown that deployment has always been a huge challenge in the machine learning field. We are very, very excited, and we would be delighted if, when you are ready, you visit us again on the podcast to present it to us.
Henri: We would love to do that.
Toine: Absolutely, that would be great.
It's indeed an important topic. Also, it would be interesting to know where you actually host your GPU cloud - are your data centers located entirely in Europe?
Henri: They are located in the beautiful city of Berlin, Germany.
I believe every German company would like to hear that.
Toine: Exactly, we hope that by being GDPR-compliant and having our data centers in Germany, we can score some points there.
Henri: Yes, we represent the European values here. We are committed to data protection, and it's not just a business decision, we truly believe that it's important. Especially in the context of AI, data, especially personal data, should be secure and kept private.
If the waste heat is sold as district heating, then I believe you are in good business.
Henri: Exactly, that is definitely part of our plan. When we change data centers, that will be the case. Currently, we are already using green energy, which is also important to us - sustainability matters. For example, we use as little plastic as possible when shipping our products, almost none. The only plastic we use is what comes with the barebones we receive, and we reuse it. If the servers are too heavy, we use plastic straps and film to secure them on pallets for shipping. Otherwise, for smaller pallets, we use cardboard straps. It's very important to us. We aim to be as sustainable as possible; even though it costs money, it's worth it. It would be great if the component manufacturers would take back the plastic packaging from us to reuse it. We collect all the plastic waste and give it to the "Precious Plastic" project, where it is recycled.
Perfect, great. We have now gained a nice insight into AIME company. We are delighted that you were here today, and we would be happy to talk again once the API is ready, and we can tell you more about it. Otherwise, I can only recommend your blog, it's really great, especially the benchmark articles comparing all the graphics cards. It's fascinating, and you get a great overview - thank you for that.
Henri: Thank you very much.
Toine: Thank you for having us here.
Henri: Exactly, and we really like your podcast. It's great.
Toine: It's great, definitely.
Thank you very much, and see you next time!