Where’s my underwear? High performance storage in the age of gen AI.


The advent of generative AI is causing a dramatic rethink of the types of compute architectures required to support the enormous volumes of data it consumes. Training and tuning foundation models to adapt them to specific tasks, and then creating the applications themselves, require the adoption of technologies that can deliver high performance and low latency while also ensuring scalability.

Beyond the obvious mass adoption of GPUs and AI-specific custom silicon, such as the Arm Ethos-N series, there are other areas to consider: the distributed computing frameworks employed, the networking infrastructure supporting them, the extract, transform, and load (ETL) pipelines that prepare the data for consumption or storage, and finally the storage mechanism itself. That last one is what I’m going to dig into in this post.

Having just returned from a family vacation, I can positively attest to the fact that not all storage techniques are equal. Take a look at my suitcase versus my wife’s versus my kids' and you’ll see a distinct disparity in packing techniques. With mine, items are easily accessible but do not make optimum use of space, while the opposite is true for my wife, who’s admittedly a lot less OCD than me. Is that important? Well, it depends on if you need low-latency access to items or scalability.

The same is true for storage. There are different approaches that can be adopted based on the type of data being deposited and the characteristics required for its retrieval. Is it structured like a spreadsheet, unstructured like text documents and media files, or semi-structured like code? Is it more important to access data quickly or to scale? Does cost efficiency outweigh ease of management? Ultimately, all these attributes - and more - must be weighed to determine the optimal solution for each application.

While file and database storage options are ideal for classic content management and transactional systems, respectively, we typically look to block and object storage when it comes to generative AI. Block storage is favored for the training phase of a foundation model, where fast, consistent, and reliable data access is required. The net result is improved training efficiency and better utilization of the all-important (and expensive) underlying resources. Built from clean, high-quality, and highly curated records, training datasets are usually labeled and standardized in structured formats, making them ideal for block storage.

As the name suggests, block storage works by dividing data into fixed-size blocks. Each block has a unique address, allowing direct access to the data stored there. If this all sounds familiar, it’s because it is: block storage is essentially the same technique that’s been employed since 1956, from early IBM mainframes to today’s server SSDs. Consequently, you’ll recognize the various block storage standards, such as SCSI, SATA, and NVMe - each flavor with its own key features, but almost all requiring some form of third-party software overlay or hardware underlay to meet the requirements for error correction, corruption prevention, encryption, and replication.
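To make the "fixed-size blocks with unique addresses" idea concrete, here is a toy in-memory sketch. It is purely illustrative - real block storage lives in drivers, controllers, and firmware - and every name in it (`ToyBlockDevice`, `write_block`, and so on) is invented for this example.

```python
# Illustrative sketch only: a toy in-memory "block device" showing the
# core idea of block storage -- fixed-size blocks addressed by number.
# All class and method names here are invented for illustration.

BLOCK_SIZE = 4096  # 4 KiB, a common block size

class ToyBlockDevice:
    def __init__(self, num_blocks: int):
        # One flat byte array stands in for the raw medium.
        self._store = bytearray(num_blocks * BLOCK_SIZE)

    def write_block(self, lba: int, data: bytes) -> None:
        # Each write occupies exactly one block; short data is padded.
        if len(data) > BLOCK_SIZE:
            raise ValueError("data exceeds block size")
        offset = lba * BLOCK_SIZE  # direct address: block number * size
        self._store[offset:offset + BLOCK_SIZE] = data.ljust(BLOCK_SIZE, b"\x00")

    def read_block(self, lba: int) -> bytes:
        # Direct access: no directory walk, just arithmetic on the address.
        offset = lba * BLOCK_SIZE
        return bytes(self._store[offset:offset + BLOCK_SIZE])

dev = ToyBlockDevice(num_blocks=8)
dev.write_block(3, b"training shard 0042")
print(dev.read_block(3).rstrip(b"\x00"))
```

The point of the sketch is the address arithmetic: given a logical block address, the location of the data is computed directly, which is exactly the property that makes block storage fast and predictable.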

I’m no boomer, but a common thread through all my posts is this notion of old fundamental technologies continuing to find purpose in modern applications. I always get a kick out of it. And when correctly implemented, block storage can provide not only the performance, consistency, and reliability noted previously but also the fine-grained control and integration with modern high-performance compute environments that generative AI model training demands. These environments include the hyperscale clouds, which all have block storage offerings: Amazon’s Elastic Block Store, Google’s Persistent Disk, Microsoft’s Azure Managed Disks, plus IBM’s and Oracle’s block volume services. Now get off my lawn.

Oh - wait - we’re not finished yet. Indeed, we’ve barely started. The training of foundation models is, of course, only a small part of the generative AI story. It’s their application that will continue to revolutionize how we live and work. We can loosely categorize these in buckets that include text, image, audio, and video generation, 3D modeling and animation production, document and report creation, plus numerous other AI-powered creative tools. These applications make fundamentally different demands of their underlying infrastructure - not least in the area of storage.

Unlike training data, generative AI applications run a little more rogue. Their datasets are drawn from a variety of sources and arrive in diverse formats, making them more unstructured in nature. The data is more real-time, noisy, and incomplete and, unlike training data, forgoes rigorous preprocessing. With origins tracing back to the late 1990s, object storage techniques scale elastically to accommodate the large volumes of unstructured data typical of generative AI applications - and they do so far more cost-effectively than block storage.

Object storage implementations also support extensive metadata, aiding the organization and retrieval of information, which can be performed using standard (and very familiar) HTTP/HTTPS. Along with robust RESTful APIs and SDKs, this aids integration with web-based interfaces and supporting services. The downside to adopting these protocols is generally higher overhead, and therefore higher latency, than block storage, but this is offset by high aggregate throughput. Once again, all the public cloud providers have object storage services, including Azure Blob Storage, Google Cloud Storage, and Oracle’s OCI Object Storage, to name a few. However, with its robust programming interfaces, Amazon’s S3 (Simple Storage Service) is generally viewed as the most widely adopted of the crowd.

But any developer embracing a hyperscaler’s object storage implementation faces a dilemma: be shackled to a single hosting platform or rewrite their codebase for each. While the multi-cloud debate still rages, it’s fair to say that software vendors and their customers generally feel more comfortable with an application that’s easily ported - if not run simultaneously - across disparate clouds. The answer may be to adopt an independent object storage implementation that can be spun up on any cloud instance, public or private. Throw in support for the S3 interface, the heir apparent to de facto standard status, and the issue of cloud supplier lock-in can be largely negated.

At this point, a self-destructive degree of honesty compels me to make a disclosure: This post came about because of research I performed when interviewing for a technical marketing position at MinIO – an open-source, multi-cloud, S3-compatible object storage offering. As anyone reading this superficial post with any actual experience in this space can probably guess (coupled with the admission that I had to research it in the first place) I bombed out of the hiring process, rather unceremoniously but expectedly, in the first round.

MinIO is by no means the only open source, multi-platform, S3-compatible object storage solution. Alternatives such as Ceph, OpenIO, and SeaweedFS all promote the same fundamental features, give or take. Reed-Solomon coding (developed in 1960) provides high-performance error correction by adding redundant parity symbols, in much the same way data transmissions are protected. Hashing algorithms are employed to detect the gradual degradation and corruption of information over time so it can be repaired. Large-scale replication and strong encryption protect and secure data, while federation allows multiple clusters to operate and be managed as a single system.
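The parity and hashing ideas above can be shown in miniature. Real Reed-Solomon coding works over finite fields and can survive the loss of multiple shards; the sketch below uses the far simpler single-parity XOR scheme (as in RAID 5) together with a SHA-256 content hash, purely to illustrate the two primitives. The function names are invented for this example.

```python
# Illustrative sketch only: real Reed-Solomon coding uses finite-field
# arithmetic and tolerates multiple lost shards. Here we show the simpler
# single-parity XOR idea plus a content hash for integrity checking.

import hashlib

def xor_parity(shards: list[bytes]) -> bytes:
    # Parity shard: byte-wise XOR of all data shards.
    parity = bytearray(len(shards[0]))
    for shard in shards:
        for i, b in enumerate(shard):
            parity[i] ^= b
    return bytes(parity)

def reconstruct(surviving: list[bytes], parity: bytes) -> bytes:
    # XOR-ing the parity with the surviving shards recovers the lost one,
    # because every byte of the lost shard cancels out of the parity.
    return xor_parity(surviving + [parity])

data_shards = [b"aaaa", b"bbbb", b"cccc"]
parity = xor_parity(data_shards)
digest = hashlib.sha256(data_shards[1]).hexdigest()  # fingerprint to detect corruption

# Simulate losing shard 1, then rebuild it from the survivors plus parity.
recovered = reconstruct([data_shards[0], data_shards[2]], parity)
assert recovered == b"bbbb"
assert hashlib.sha256(recovered).hexdigest() == digest  # integrity verified
print("recovered:", recovered)
```

The hash plays the role the article assigns to hashing in these systems: it does not prevent bit rot, but it lets the system notice a corrupted shard and trigger reconstruction from the redundant data.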

So, like most evolutions, we take some technology from the past and apply a liberal sprinkling of modern innovations to create something new. The stakes are obviously a little higher than my underwear, though mine tend to ride up a little, if I’m totally honest – and, as previously ascertained, I am to a detrimental degree. Generative AI requires more than a boatload of specialized compute hardware. Other aspects, like storage, must be carefully considered and will vary based on specific requirements. It’s fair to say, though, that object storage solutions are about to enjoy a renaissance, of sorts, as more AI applications evolve. Which variant ultimately gains the largest share of this space remains to be seen.
