ApertureData Problem of the Month: Duplicate Data
ApertureData
Database purpose-built for Multimodal AI: Combine scalable vector search with memory-optimized graph and data management
Keep reading for my latest attempt at making a new cocktail!
Tackling the Problem of the Month: Cost of Multimodal Data Copies
It’s no surprise that ML models require huge amounts of storage for the complex data they handle. But there’s another cost that’s often overlooked: the cost of making copies of that data. Data engineers often make copies of different datasets in order to tweak and train their models, leading to duplicate data and ballooning cloud storage costs.
Since today’s most advanced models work on multiple data types - think PDFs, images, videos, and more - we’re talking about potentially terabytes, even petabytes, of data to store! Multiply that by how often your data evolves, and scaling your storage quickly becomes an incredibly expensive problem.
Improving accuracy with data
The truth is you can get pretty far with today’s off-the-shelf models, particularly if your needs are relatively generic. But for many teams with specialized training goals - like recognizing shopper attributes in videos or identifying medical conditions from MRI scans - ML models rarely deliver the accuracy they need right off the bat. That’s why you need so much data to begin with.
While the exact target for accuracy varies by use case, in order for a model to be useful in production, most organizations strive for 90%+ accuracy. But what happens to model accuracy when real-world variables evolve? Sure, ML teams can choose to iterate on algorithms and model parameters. But for many teams, the most significant improvements come from improving the underlying data by:
- Acquiring more representative data
- Using synthetic data
- Augmenting existing data to avoid over-fitting on training data
How teams wind up with duplicate data
Looking specifically at augmentations, in a perfect world, you’re applying these on-the-fly in order to provide the most statistical variation in input data. But if your datasets contain large unstructured data types (or are otherwise very large), this can be a massive drain on compute resources.
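To make that trade-off concrete, here’s a minimal sketch of on-the-fly augmentation using PyTorch and torchvision (a generic setup for illustration, not anything ApertureData-specific; the dataset path is hypothetical). The random transforms are recomputed every time a sample is loaded, so no augmented copies are ever written to storage, but your data-loading CPUs pay for that work on every epoch:

```python
# Minimal sketch: on-the-fly augmentation with torchvision (generic example,
# not ApertureData-specific). Transforms are applied each time a sample is
# loaded, so no augmented copies are written to storage -- the cost shows up
# as extra CPU work in the data-loading path instead.
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224),                    # different crop every epoch
    transforms.RandomHorizontalFlip(),                    # random flip adds variation
    transforms.ColorJitter(brightness=0.2, contrast=0.2), # light color jitter
    transforms.ToTensor(),
])

# Assumes an ImageFolder-style dataset at ./data/train (hypothetical path).
train_set = datasets.ImageFolder("./data/train", transform=train_transforms)
loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=8)

for images, labels in loader:
    ...  # feed the freshly augmented batch to the model
```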
That’s why ML teams tend to create copies of their data with the relevant operations already applied, typically at storage time. For datasets measured in terabytes or petabytes, these copies can make cloud storage costs balloon!
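For contrast, here’s a rough sketch of that copy-at-storage-time pattern (plain Python with Pillow; the paths and variant count are made up for illustration). Each source image gets written out several times with augmentations baked in, so the stored footprint grows roughly linearly with the number of variants you keep around:

```python
# Rough sketch of the "bake augmentations in at storage time" pattern
# (hypothetical paths and variant count). Every variant is a full copy,
# so 5 variants of a 10 TB image corpus is roughly 50 TB of extra storage.
from pathlib import Path
from PIL import Image, ImageOps

SRC = Path("./data/train")            # original images (hypothetical path)
DST = Path("./data/train_augmented")  # augmented copies land here
VARIANTS = 5                          # stored footprint ~= VARIANTS x original

DST.mkdir(parents=True, exist_ok=True)
for img_path in SRC.glob("*.jpg"):
    img = Image.open(img_path)
    for i in range(VARIANTS):
        variant = ImageOps.mirror(img) if i % 2 else img  # simple fixed augmentations
        variant = variant.rotate(15 * i)
        variant.save(DST / f"{img_path.stem}_aug{i}.jpg")  # yet another full copy
```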
Messy data provenance and the resulting compliance burden
Beyond storage costs, there’s also the cost of tracking the origin of all these copies, especially as duplicate datasets are saved as different versions. If you have sensitive data, this makes it incredibly difficult to maintain compliance. Sometimes, data science teams even create copies of subsets of data on their own machine instances, which are even harder to track or share!
The opportunity cost of not experimenting with different training techniques
The fear of ballooning cloud costs from data copies, plus the complexity of keeping track of where data came from, can make ML teams hesitate to experiment with different augmentations. Sticking with static data, though, simply isn’t an option in today’s hypercompetitive world.
What if you had a database that allowed you to specify such augmentation operations on complex multimodal data as part of its query language? What if there was zero or very low overhead to this pre-processing because it was parallelized and run close to the storage? What if it also frequently lowered your network bandwidth requirements and reduced the compute load on your training and inference servers, without adding delays?
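As a purely illustrative sketch of that idea (the client, method names, and query shape below are hypothetical stand-ins, not actual syntax; see our documentation for the real query language), pushing the preprocessing into the query itself might look something like this:

```python
# Purely illustrative sketch -- the client, method names, and query shape
# below are hypothetical stand-ins, not ApertureData's actual API. The point
# is that resize/crop/flip run next to storage, so only the final,
# already-augmented bytes travel over the network to the training servers.
hypothetical_query = [{
    "FindImage": {
        "constraints": {"dataset": ["==", "shopper_video_frames"]},
        "operations": [                       # applied server-side, near storage
            {"type": "resize", "width": 224, "height": 224},
            {"type": "crop", "x": 0, "y": 0, "width": 200, "height": 200},
            {"type": "flip"},
        ],
        "blobs": True,                        # return the processed pixels
    }
}]

# client = SomeMultimodalDBClient("db-host:port")       # hypothetical connection
# responses, blobs = client.query(hypothetical_query)
# `blobs` would already contain the augmented images, ready for training.
```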
This is what we, at ApertureData, envision for all data engineers and scientists. To learn more about how we do that, check out our documentation.
What we are listening to:
- From Cost Center to Value Center: Aligning Your Data Team with Business Initiatives - https://open.spotify.com/episode/5y6zHoO3SVmoddVQnUzlCk?si=09a34ce91269475e
- Another podcast filled with nuggets on the latest in AI - https://open.spotify.com/show/2sU6BQhwZxCA4CVzWQEiAL?si=5cef62a0f0b14ff5&nd=1&dlsi=a7102ca836ec4e15
Now for the cocktail, Lemon Drop - never disappoints:
Lemon Drop - Invented in San Francisco, the lemon drop can be served in a cocktail glass or as a shot. There are plenty of variations on the drink out there, but you can't go wrong with the classic formulation - vodka, triple sec, fresh lemon juice, and simple syrup. And of course, the sugar rim.