ApertureData Problem of the Month: Duplicate Data
ApertureData
Database purpose-built for Multimodal AI: Combine scalable vector search with memory-optimized graph and data management
Keep reading for my latest attempt at making a new cocktail!
Tackling the Problem of the Month: Cost of Multimodal Data Copies
It’s no surprise that ML models require huge amounts of storage for the complex data they handle. But there’s another cost that’s often overlooked: the cost of making copies of that data. Data engineers often make copies of different datasets in order to tweak and train their models, leading to duplicate data and ballooning cloud storage costs.
Since today’s most advanced models work on multiple data types - think PDFs, images, videos, and more - we’re talking about potentially terabytes, even petabytes, of data to store! Multiply that by how often your data evolves, and scaling your storage quickly becomes an incredibly expensive problem.
Improving accuracy with data
The truth is you can get pretty far with today’s off-the-shelf models, particularly if your needs are relatively generic. But for many teams with specialized training goals - like recognizing shopper attributes in videos or identifying medical conditions from MRI scans - ML models rarely deliver the accuracy they need right off the bat. That’s why you need so much data to begin with.
While the exact target for accuracy varies by use case, in order for a model to be useful in production, most organizations strive for 90%+ accuracy. But what happens to model accuracy when real-world variables evolve? Sure, ML teams can choose to iterate on algorithms and model parameters. But for many teams, the most significant improvements come from improving the underlying data by:
- Acquiring more representative data
- Using synthetic data
- Augmenting existing data to avoid over-fitting on training data
How teams wind up with duplicate data
Looking specifically at augmentations, in a perfect world, you’re applying these on-the-fly in order to provide the most statistical variation in input data. But if your datasets contain large unstructured data types (or are otherwise very large), this can be a massive drain on compute resources.
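To make that trade-off concrete, here’s a minimal sketch of on-the-fly augmentation using PyTorch and torchvision (a generic setup for illustration, not anything ApertureData-specific; the dataset path is hypothetical). The random transforms are recomputed every time a sample is loaded, so no augmented copies are ever written to storage, but your data-loading CPUs pay for that work on every epoch:

```python
# Minimal sketch: on-the-fly augmentation with torchvision (generic example,
# not ApertureData-specific). Transforms are applied each time a sample is
# loaded, so no augmented copies are written to storage -- the cost shows up
# as extra CPU work in the data-loading path instead.
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224),                    # different crop every epoch
    transforms.RandomHorizontalFlip(),                    # random flip adds variation
    transforms.ColorJitter(brightness=0.2, contrast=0.2), # light color jitter
    transforms.ToTensor(),
])

# Assumes an ImageFolder-style dataset at ./data/train (hypothetical path).
train_set = datasets.ImageFolder("./data/train", transform=train_transforms)
loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=8)

for images, labels in loader:
    ...  # feed the freshly augmented batch to the model
```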
That’s why ML teams tend to create copies of their data with the relevant operations already applied, typically at storage time. For datasets measured in terabytes or petabytes, these copies can make cloud storage costs balloon!
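For contrast, here’s a rough sketch of that copy-at-storage-time pattern (plain Python with Pillow; the paths and variant count are made up for illustration). Each source image gets written out several times with augmentations baked in, so the stored footprint grows roughly linearly with the number of variants you keep around:

```python
# Rough sketch of the "bake augmentations in at storage time" pattern
# (hypothetical paths and variant count). Every variant is a full copy,
# so 5 variants of a 10 TB image corpus is roughly 50 TB of extra storage.
from pathlib import Path
from PIL import Image, ImageOps

SRC = Path("./data/train")            # original images (hypothetical path)
DST = Path("./data/train_augmented")  # augmented copies land here
VARIANTS = 5                          # stored footprint ~= VARIANTS x original

DST.mkdir(parents=True, exist_ok=True)
for img_path in SRC.glob("*.jpg"):
    img = Image.open(img_path)
    for i in range(VARIANTS):
        variant = ImageOps.mirror(img) if i % 2 else img  # simple fixed augmentations
        variant = variant.rotate(15 * i)
        variant.save(DST / f"{img_path.stem}_aug{i}.jpg")  # yet another full copy
```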
Messy data provenance and the resulting compliance burden
Beyond storage costs, there’s also the cost of tracking the origin of all these copies, especially as duplicate datasets are saved as different versions. If you have sensitive data, this makes it incredibly difficult to maintain compliance. Sometimes, data science teams even create copies of subsets of data on their own machine instances, which are even harder to track or share!
The opportunity cost of not experimenting with different training techniques
The fear of ballooning cloud costs from data copies, plus the complexity of keeping track of where data came from, can make ML teams hesitate to experiment with different augmentations. Sticking with static data, though, simply isn’t an option in today’s hypercompetitive world.
What if you had a database that allowed you to specify such augmentation operations on complex multimodal data as part of its query language? What if there was zero or very low overhead to this pre-processing because it was parallelized and run close to the storage? What if it also frequently lowered your network bandwidth requirements and reduced the compute load on your training and inference servers, without adding delays?
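As a purely illustrative sketch of that idea (the client, method names, and query shape below are hypothetical stand-ins, not actual syntax; see our documentation for the real query language), pushing the preprocessing into the query itself might look something like this:

```python
# Purely illustrative sketch -- the client, method names, and query shape
# below are hypothetical stand-ins, not ApertureData's actual API. The point
# is that resize/crop/flip run next to storage, so only the final,
# already-augmented bytes travel over the network to the training servers.
hypothetical_query = [{
    "FindImage": {
        "constraints": {"dataset": ["==", "shopper_video_frames"]},
        "operations": [                       # applied server-side, near storage
            {"type": "resize", "width": 224, "height": 224},
            {"type": "crop", "x": 0, "y": 0, "width": 200, "height": 200},
            {"type": "flip"},
        ],
        "blobs": True,                        # return the processed pixels
    }
}]

# client = SomeMultimodalDBClient("db-host:port")       # hypothetical connection
# responses, blobs = client.query(hypothetical_query)
# `blobs` would already contain the augmented images, ready for training.
```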
This is what we, at ApertureData, envision for all data engineers and scientists. To learn more about how we do that, check out our documentation.
What we are listening to:
- From Cost Center to Value Center: Aligning Your Data Team with Business Initiatives - https://open.spotify.com/episode/5y6zHoO3SVmoddVQnUzlCk?si=09a34ce91269475e
- Another podcast filled with nuggets on the latest in AI - https://open.spotify.com/show/2sU6BQhwZxCA4CVzWQEiAL?si=5cef62a0f0b14ff5&nd=1&dlsi=a7102ca836ec4e15
Now for the cocktail, Lemon Drop - never disappoints:
Lemon Drop - Invented in San Francisco, the lemon drop can be served in a cocktail glass or as a shot. There are plenty of variations on the drink out there, but you can't go wrong with the classic formulation - vodka, triple sec, fresh lemon juice, and simple syrup. And of course, the sugar rim.