Defeating Data Gravity? - Hammerspace
I worked in life sciences for a few years. During my time in the industry, we strove to overcome data gravity: identifying and accessing life science data globally is a significant challenge. The data gravity theory holds that it is cheaper to move processing resources closer to the data than to move the data closer to the processor. That theory is being tested in the era of AI, where access to accelerated computing is limited. At AI Field Day 4 (#AIFD4), Hammerspace argued for intelligently moving the data closer to your accelerated computing, creating an AI pipeline.
In my conversations with data scientists, these practitioners report spending several weeks organizing and preparing data for either model training or inferencing. The preparation phase may stretch even longer if that data is spread across several sources. Hammerspace offers a parallel file system that acts as a proxy in front of multiple data sources.
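To make that idea concrete, here is a minimal sketch, not Hammerspace's API, of how a single metadata namespace might be assembled from several backing stores. The source names, mount paths, and FileMeta structure are all illustrative assumptions.

```python
# Hypothetical sketch: present file metadata from several sources as one namespace.
# SOURCES, FileMeta, and list_metadata are illustrative, not Hammerspace's API.
import os
from dataclasses import dataclass

@dataclass
class FileMeta:
    logical_path: str   # path as the data scientist sees it
    source: str         # which backing store holds the bytes
    size: int
    mtime: float

def list_metadata(source_name: str, root: str) -> list[FileMeta]:
    """Walk one backing store and return its metadata (no file contents are read)."""
    entries = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            st = os.stat(full)
            logical = os.path.join("/", source_name, os.path.relpath(full, root))
            entries.append(FileMeta(logical, source_name, st.st_size, st.st_mtime))
    return entries

# Unified view across two example sources: one namespace, many backing stores.
SOURCES = {"lab-nas": "/mnt/lab_nas", "cloud-bucket": "/mnt/s3_mirror"}
namespace = [m for name, root in SOURCES.items() for m in list_metadata(name, root)]
```

The point of the sketch is that the preparation phase works against metadata only; no bytes move until the training job actually reads a file.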
So, data scientists get a single view of the metadata across the various file systems during the preparation phase of AI. How does this solve the data gravity problem? If your GPUs are located 150ms away, the latency may prove too high to be useful. This is where Hammerspace's replication features come into play.
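A rough back-of-envelope calculation shows why 150ms matters at dataset scale. This sketch assumes strictly serial access and one round trip per file; the file count and the local latency figure are illustrative assumptions, not measurements.

```python
# Back-of-envelope: cost of remote round trips for many small files
# (assumed serial access, one round trip per file).
rtt_remote_s = 0.150      # 150 ms to the remote GPU site
rtt_local_s = 0.0005      # ~0.5 ms on a local network (assumption)
num_files = 1_000_000     # e.g., an image dataset made of small files (assumption)

remote_hours = num_files * rtt_remote_s / 3600
local_hours = num_files * rtt_local_s / 3600
print(f"Remote: ~{remote_hours:.0f} h of pure latency; local: ~{local_hours:.1f} h")
# Remote: ~42 h of pure latency; local: ~0.1 h
```

Real training pipelines parallelize and batch their reads, so the gap is smaller in practice, but the direction of the problem is the same: per-file latency compounds.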
Hammerspace allows data syncing between locations while maintaining a consistent metadata file system. Paraphrasing Floyd Christofferson, Hammerspace VP of Marketing: it's not a copy of the data but a local cache presented by the global filesystem.
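Here is a minimal read-through cache sketch in that spirit, not Hammerspace's implementation: the global namespace resolves the file, and the bytes are pulled into a local cache on first access. The fetch_from_origin helper, the cache directory, and the origin mount path are hypothetical.

```python
# Minimal read-through cache sketch (not Hammerspace's implementation).
# fetch_from_origin() is a hypothetical stand-in for whatever moves the bytes.
import os
import shutil

CACHE_ROOT = "/var/cache/global_fs"   # local cache directory (assumption)

def fetch_from_origin(logical_path: str, cache_path: str) -> None:
    """Placeholder: copy bytes from the authoritative site into the local cache."""
    origin_path = os.path.join("/mnt/origin_site", logical_path.lstrip("/"))
    os.makedirs(os.path.dirname(cache_path), exist_ok=True)
    shutil.copyfile(origin_path, cache_path)

def open_global(logical_path: str):
    """Open a file through the global namespace; pull it locally on first access."""
    cache_path = os.path.join(CACHE_ROOT, logical_path.lstrip("/"))
    if not os.path.exists(cache_path):        # cache miss: instantiate locally
        fetch_from_origin(logical_path, cache_path)
    return open(cache_path, "rb")             # cache hit: served at local latency
```

The key property is that the namespace and metadata stay globally consistent while the bytes materialize only where and when they are read.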
A solution to Data Gravity and complexity?
So, is Hammerspace the ultimate solution to data gravity when creating AI pipelines? I have yet to put the solution into production to know all the nuanced challenges. However, on the surface, Hammerspace does offer a better user experience. The solution isn't magic: if you have petabytes of data across a global landscape, you still have to place GPUs strategically around your centers of data.
As data administrators, you must consider data movement and the governance of that data. While the solution may allow viewing and accessing data across geopolitical boundaries, you must consider the repercussions of caching that data near your processing centers.
On the surface, it's another powerful tool in the toolbox as you look to overcome the challenges of data gravity and GPU availability.
You have to split the discussion into two parts. First, the metadata, holding the file system structure and details on individual files, including how they map to physical disk. Second, the physical content itself.

Metadata optimisation is essential for distributed file systems. Hammerspace has lots of lazy reading techniques for onboarding content without needing to walk the file system tree. Metadata syncing and searching also need to be super-efficient. It's no surprise, BTW, that InfiniteIO founder Mark Cree is now at Hammerspace.

Then there's the physical content. All data can be broken down into chunks, which can then be deduplicated and fingerprinted. Imagine having a copy of data on GCP and AWS. If either copy changes, only the differences need to be moved. Of course, you could also cache a copy as read on-demand.

The next level of data mobility is to (accurately) predict data I/O profiles. That's what I was working on in 2016 (and had working). There's an argument for building AI into this process and enabling the file system to "learn" over time. Then you only move data around where it's needed, attempting to predict the process to reduce latency. So far, Hammerspace seems to have the best solution.
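To illustrate the chunk-and-fingerprint idea from the comment above, here is a minimal sketch using fixed-size chunks and SHA-256, not any vendor's implementation. Real systems typically use content-defined chunking and more sophisticated indexes; the chunk size and function names are assumptions for the example.

```python
# Sketch of chunk fingerprinting and delta detection (fixed-size chunks, SHA-256).
# Illustrative only; production systems use content-defined chunking and smarter indexes.
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024   # 4 MiB chunks (arbitrary choice for the sketch)

def fingerprints(path: str) -> list[str]:
    """Return one SHA-256 fingerprint per fixed-size chunk of the file."""
    fps = []
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            fps.append(hashlib.sha256(chunk).hexdigest())
    return fps

def changed_chunks(local_fps: list[str], remote_fps: list[str]) -> list[int]:
    """Indexes of chunks whose fingerprints differ: only these need to move."""
    longest = max(len(local_fps), len(remote_fps))
    return [i for i in range(longest)
            if i >= len(local_fps) or i >= len(remote_fps) or local_fps[i] != remote_fps[i]]

# If the copy on GCP and the copy on AWS differ in a single chunk,
# comparing fingerprint lists means only that one chunk is transferred.
```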