Don't Let Data Hold You Back: Understanding AI-Ready Data Infrastructure

As mentioned in the previous article, AI-ready data infrastructure has transformative power for all industries.

Today, let's have a look at what exactly AI-ready data infrastructure is and the major characteristics it needs to have to address challenges in the AI era.


What is "AI-Ready Data Infrastructure"?


AI-ready data infrastructure refers to data storage software and hardware systems designed for AI applications and services.

This infrastructure offers large-scale data ingestion and preprocessing, high performance and consistency, and superb resilience to enable AI compute clusters to efficiently analyze data and learn.


Must-have features to turbocharge your AI


To keep up with AI trends, many organizations are stacking computing power for performance gains, which leads to ever-increasing computing concurrency. This in turn requires larger datasets to be quickly loaded into compute clusters for faster deep learning, which ultimately enhances the capabilities of large AI models.

Therefore, building AI-ready data infrastructure requires comprehensive preparation to ensure that it offers:

  1. Large-scale data ingestion and preprocessing
  2. High performance and consistency
  3. Superb resilience
  4. Data intrinsic resilience

Today, let's dive into the first two of them.


1. Large-scale data ingestion and preprocessing

Many enterprises store their data in different data centers or on different storage devices in the same data center. Service O&M personnel know what data they have, but do not care where the data is stored. IT O&M personnel know where and how much data is stored, but do not care what the data is.

As a result, enterprises lack a unified view for managing scattered data, let alone carrying out effective data ingestion and preprocessing to support the use of AI computing power for training.

Therefore, AI-ready data infrastructure must enable data owners to perform large-scale data ingestion and preprocessing.

  • Ingestion of dispersed data from multiple sources

A unified namespace enables visualized data asset management and policy-based data ingestion from multiple sources and different locations. This approach keeps data visible, manageable, and available, which in turn lets GPUs/NPUs access it efficiently.

  • Data preprocessing

To ensure that there are high-quality datasets for training, consider adopting a framework that simplifies data cleansing, conversion, and standardization.
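As a rough illustration of the three preprocessing steps named above, the following minimal sketch chains cleansing, conversion, and standardization over a few in-memory records. The record layout (`text`, `label`), the function names, and the sample data are all assumptions made for this example. They do not represent any particular framework or product.

```python
import unicodedata

def cleanse(records):
    """Drop records with missing or empty text and remove exact duplicates."""
    seen = set()
    for rec in records:
        text = (rec.get("text") or "").strip()
        if not text or text in seen:
            continue
        seen.add(text)
        yield {**rec, "text": text}

def convert(records):
    """Convert every record to one schema: {'text': str, 'label': int}."""
    for rec in records:
        yield {"text": rec["text"], "label": int(rec.get("label", 0))}

def standardize(records):
    """Normalize Unicode (NFKC), collapse whitespace, and lowercase the text."""
    for rec in records:
        text = unicodedata.normalize("NFKC", rec["text"])
        yield {**rec, "text": " ".join(text.split()).lower()}

def preprocess(records):
    """Run the three stages in order and return the finished dataset."""
    return list(standardize(convert(cleanse(records))))

# Hypothetical raw records gathered from different sources.
raw = [
    {"text": "  Hello   World ", "label": "1"},
    {"text": "", "label": "0"},                 # empty: dropped
    {"text": "  Hello   World ", "label": "1"}, # duplicate: dropped
    {"text": "Ｆｕｌｌ width", "source": "site-b"},  # full-width letters, no label
    {"text": None},                             # missing text: dropped
]
clean = preprocess(raw)
print(clean)
```

Because each stage is a generator, records stream through the pipeline one at a time, which is one simple way to keep memory use flat on large inputs.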


2. High performance and consistency

In the model training phase, the following processes are greatly influenced by data infrastructure:

  • Loading training datasets

  • Reading and writing checkpoints

These processes are critical to computing power utilization. As the GPU/NPU quantity in an AI cluster increases from tens of thousands to hundreds of thousands, storage performance needs to keep up to ensure these processes are executed smoothly and efficiently.

  • Loading training datasets mainly involves accessing massive numbers of small files. Minimizing the loading time requires a performance density on the order of millions of operations per second (OPS) per PB.

  • Reading and writing checkpoints is bandwidth-intensive. Minimizing the fault recovery time of a compute cluster requires a performance density on the order of TB/s of bandwidth per PB.

This performance density must hold at both small and large capacities (for example, tens or even hundreds of PB) to meet the performance needs of ever-growing compute clusters. This requires data infrastructure to have strong scale-out capabilities for non-disruptive expansion and a near-linear increase in both performance and capacity.
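To make the idea of performance density concrete, here is a back-of-the-envelope sketch of what a fixed density implies as capacity grows. Every input is a hypothetical value chosen for round numbers: a density of 1 million OPS and 1 TB/s per PB, a dataset of one billion small files, one storage operation per file, and a 5 TB checkpoint. None of these figures is a measurement or a product specification.

```python
def required_performance(capacity_pb, ops_per_pb, tbps_per_pb):
    """Aggregate OPS and bandwidth needed to hold a fixed density at a capacity."""
    return capacity_pb * ops_per_pb, capacity_pb * tbps_per_pb

def dataset_load_seconds(num_small_files, aggregate_ops):
    """Lower bound on load time, assuming one storage operation per file."""
    return num_small_files / aggregate_ops

def checkpoint_seconds(checkpoint_tb, aggregate_tbps):
    """Lower bound on the time to write or read one checkpoint."""
    return checkpoint_tb / aggregate_tbps

# Hypothetical inputs (assumptions for illustration only).
OPS_PER_PB = 1_000_000       # "millions of OPS per PB", taken as exactly 1 million
TBPS_PER_PB = 1.0            # "TB/s per PB", taken as exactly 1 TB/s
NUM_FILES = 1_000_000_000    # one billion small files
CHECKPOINT_TB = 5.0          # size of one checkpoint

results = {}
for capacity_pb in (10, 100):
    ops, tbps = required_performance(capacity_pb, OPS_PER_PB, TBPS_PER_PB)
    results[capacity_pb] = {
        "ops": ops,
        "tbps": tbps,
        "load_s": dataset_load_seconds(NUM_FILES, ops),
        "checkpoint_s": checkpoint_seconds(CHECKPOINT_TB, tbps),
    }
    print(capacity_pb, results[capacity_pb])
```

Under these assumptions, growing from 10 PB to 100 PB raises the required aggregate performance tenfold. That is why near-linear scale-out matters: a system whose performance stays flat while capacity grows loses density with every expansion.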

In addition to high performance, the strong consistency of checkpoints written by a compute cluster to the storage is critical. If a compute cluster fails, the latest checkpoint[N] needs to be read to resume training. However, solutions like distributed caching cannot ensure strong consistency of checkpoint data. To resume training in these cases, a compute cluster has to revert to the latest complete and available checkpoint[N – x] (where x is an integer ≥ 1).

This renders all training after checkpoint[N – x] invalid, which wastes time and resources.
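The fallback from checkpoint[N] to checkpoint[N – x] can be sketched on a single local file system as follows. This is only an illustration under stated assumptions: the file naming scheme, the JSON manifest with a SHA-256 digest, and both function names are invented for this example, and a local temp-file-then-rename write is not how a distributed storage system provides strong consistency. The sketch simply shows how a reader can detect a damaged latest checkpoint and fall back to an older one.

```python
import hashlib
import json
import os
import tempfile

def _atomic_write(path, data):
    """Write to a temp file in the same directory, fsync, then rename into place."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path), suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # readers see the old file or the new one, never a mix

def write_checkpoint(directory, step, payload):
    """Store the payload, then a manifest holding its SHA-256 digest."""
    os.makedirs(directory, exist_ok=True)
    data_path = os.path.join(directory, f"checkpoint_{step:08d}.bin")
    _atomic_write(data_path, payload)
    manifest = {"step": step, "sha256": hashlib.sha256(payload).hexdigest()}
    _atomic_write(data_path + ".json", json.dumps(manifest).encode())

def latest_valid_checkpoint(directory):
    """Return (step, payload) of the newest checkpoint whose digest matches."""
    names = sorted(
        (n for n in os.listdir(directory) if n.endswith(".bin.json")),
        reverse=True,  # zero-padded step numbers sort newest first
    )
    for name in names:
        with open(os.path.join(directory, name), "rb") as f:
            manifest = json.load(f)
        try:
            with open(os.path.join(directory, name[: -len(".json")]), "rb") as f:
                payload = f.read()
        except FileNotFoundError:
            continue
        if hashlib.sha256(payload).hexdigest() == manifest["sha256"]:
            return manifest["step"], payload
    return None

with tempfile.TemporaryDirectory() as d:
    for step in (1, 2, 3):
        write_checkpoint(d, step, f"weights-{step}".encode())
    # Simulate an inconsistent checkpoint[N]: step 3 holds only partial data.
    with open(os.path.join(d, "checkpoint_00000003.bin"), "wb") as f:
        f.write(b"partial")
    resumed = latest_valid_checkpoint(d)

print(resumed)  # (2, b'weights-2')
```

In this run the cluster would resume from step 2, so the work done between steps 2 and 3 has to be repeated.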

During data loading or recovery, GPUs/NPUs remain idle. This reduces AI cluster utilization, resulting in a massive waste of resources. Analysis shows that proper storage performance improvement (not just capacity expansion) can greatly reduce the GPU/NPU idle time caused by training dataset loading and checkpoint reads and writes. This can improve cluster utilization by about 10%.
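The link between idle time and utilization can be expressed with a simple ratio: the time accelerators spend computing divided by the total wall-clock time. The sketch below uses made-up durations purely to show the shape of the calculation. It does not reproduce the analysis cited above, and the function name and all numbers are assumptions.

```python
def utilization(compute_s, load_s, checkpoint_stall_s):
    """Fraction of wall-clock time that accelerators spend computing."""
    return compute_s / (compute_s + load_s + checkpoint_stall_s)

# Hypothetical durations per training interval, in seconds.
before = utilization(compute_s=900, load_s=60, checkpoint_stall_s=40)
# Same compute work with storage that is assumed to be 10x faster.
after = utilization(compute_s=900, load_s=6, checkpoint_stall_s=4)

print(f"before: {before:.3f}, after: {after:.3f}")
```

The compute time is unchanged in both cases. Only the time spent waiting on storage shrinks, which is the point of the paragraph above.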

The absolute value of FLOPS can be increased either by improving the performance and consistency of data infrastructure or by stacking computing power, but the former approach is greener, more efficient, and twice as cost-effective.


Summary


Now, we know what AI-ready data infrastructure is and the major characteristics it should have to help you address challenges in the AI era.

In the next post, we will look at Brilliance in Resilience: How AI-Ready Infrastructure Is Shaping Tomorrow's World.

Huawei is an industry leader with over 20 years of extensive investment in data infrastructure. It offers a broad range of products, solutions, and case studies to help you handle AI workloads with ease. Learn more about our award-winning OceanStor Data Storage and how to unleash the full potential of your data.


Download Huawei's AI-Ready Data Infrastructure Reference Architecture White Paper.



Muchiu (Henry) Chang, PhD. Cantab (Cambridge, UK)

Consultant in Patent Intelligence and Engineering Management

1 month ago

Huawei IT Products & Solutions If you were NOT able to find the data you want from the data storage, you will be crazy, NOT just held back. Metadata is a must for data search. Let's try a realistic go/no-go test which AI failed last year (2023). Is there any data tool that can answer the following questions of business intelligence? "Who, in the Ontario province of Canada, have new US patents granted on the nearest Tuesday (Eastern Time), when the USPTO releases the newly granted US patents on a weekly basis?" "Who, in the "江蘇" province of China, have new US patents granted on the nearest Tuesday (Eastern Time), when the USPTO releases the newly granted US patents on a weekly basis?" With our intellectual property (IP), a Chinese-English multilingual metadata, we can get the full list answers for the above questions. This is a fact. Do you or any of your contacts need our expertise/IP to do the data analysis that AI can't do? Metadata is an enabler. It is like a treasure map for treasure hunting. Without metadata, NO data can be found/retrieved, even by the most advanced technologies, like AI, high-end chips, supercomputers, etc. https://lnkd.in/g-aJFnXR

Oluwatobiloba Olatunji

Aspiring Computer Hardware Engineer | Semiconductor & IoT Enthusiast | Covenant University Student

2 months ago

Insightful breakdown of AI-ready data infrastructure! The high-scale data ingestion and high performance have rightly been pointed out. I have worked with AI systems and can vouch for how important all these features are. The unified namespace approach for data ingestion remains one of the most interesting, and this might alleviate a lot of headache concerning data management. Wondering about the performance density numbers here: millions of OPS level per PB for small file access and TB/s bandwidth per PB for checkpointing sound pretty good - is it one or more specific technologies / architectures make it possible to reach these levels? That's a very important point, it is always so easy to let consistency in checkpoints slide. Great reminder: Data infrastructures have direct implications on the effectiveness of AI model training. Any post about resiliency, please, is another key component of doing this for me when scaling workloads of AI. Great work, Huawei! #AIInfrastructure #DataManagement #MachineLearning
