ApertureData Problem of the Month: Debugging Data in the Dark

Keep reading to hear about my latest attempt at making a new (deadly) cocktail!

Tackling the Problem of the Month: Debugging Data in the Dark

When you’re training a model, or when a trained model fails, one of the first things you’d probably do is try to understand the data and annotations you’re working with. Do we need more representative data? Is anything off with our annotations? If there’s something awry with our training data, we want to know as soon as possible!

Training and fine-tuning models, both to improve them and to accommodate new data, is a fundamental requirement for AI teams. Regardless of whether a dataset pre-existed, was collected in-house, or was purchased, data science and analytics teams need to view and navigate it to understand what the data looks like. Better understanding leads to better models, faster.

And yet, I’ve spoken to data scientists and ML engineers who often embark blindly on model training without truly understanding the underlying datasets. It’s not for lack of trying! Even though it may seem like a simple requirement - being able to visualize and analyze your dataset - it turns out to be far from straightforward when you’re dealing with complex multimodal data types like images and videos.

Why visualizing and analyzing multimodal data is tough

It’s one thing to click on a URL and view a file in your browser, but a whole can of worms to search for specific information or labels and visualize your dataset efficiently, at scale. It gets even harder when the data you are dealing with is large, comes in domain-specific formats, lives in storage systems that are slow or expensive to access, or needs its annotations verified.

When your data is complex, even querying a subset to scan through a few records can feel like a Herculean effort: we’ve heard countless stories of folks grappling with finding relevant images, downloading them to local folders, hunting for the right viewers, struggling with encodings in some cases – and then suffering even more trying to see the results of every round of augmentations they apply to the data.

We’ve heard all the hacky workarounds, from writing scripts that generate HTML files to display images in the desired format, to spinning up web pages just to filter by one identifying metadata property so the images pop up – barely meeting the definition of a UI (a sketch of the first hack appears below). And that’s just for images!
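
For the curious, here’s a minimal sketch of that HTML-gallery hack in Python. The folder path, glob pattern, and output filename are all hypothetical; yours would differ:

    # gallery.py - the classic "generate an HTML file to eyeball your images" hack.
    # The folder path, glob pattern, and output filename are hypothetical.
    from pathlib import Path

    IMAGE_DIR = Path("downloaded_images")  # images already pulled down locally
    OUT_FILE = Path("gallery.html")

    figures = []
    for img in sorted(IMAGE_DIR.glob("*.jpg")):
        # One thumbnail per image, with the filename as a caption
        figures.append(
            f'<figure><img src="{img.as_posix()}" width="256">'
            f'<figcaption>{img.name}</figcaption></figure>'
        )

    OUT_FILE.write_text(
        "<html><body style='display:flex;flex-wrap:wrap'>"
        + "\n".join(figures)
        + "</body></html>"
    )
    print(f"Open {OUT_FILE} in a browser to scan {len(figures)} images.")

It works, right up until you want to filter by a different property, view the next thousand images, or share results with a teammate.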

This problem is even worse with videos. With videos, folks often have to wrangle a zoo of encodings and run into trouble when processing them or when the files themselves are too large.
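
To give a flavor of that wrangling: often the first chore is just finding out what codec and resolution each file actually uses, for example by shelling out to ffprobe from the FFmpeg suite. A minimal sketch, assuming ffprobe is installed, with a hypothetical filename:

    # probe.py - check a video's codec and resolution with ffprobe (requires FFmpeg).
    import json
    import subprocess

    def probe_video(path: str) -> dict:
        # Ask ffprobe for the first video stream's codec name, width, and height as JSON
        result = subprocess.run(
            ["ffprobe", "-v", "error",
             "-select_streams", "v:0",
             "-show_entries", "stream=codec_name,width,height",
             "-of", "json", path],
            capture_output=True, text=True, check=True,
        )
        return json.loads(result.stdout)["streams"][0]

    print(probe_video("clip_0001.mp4"))  # e.g. {'codec_name': 'h264', 'width': 1920, 'height': 1080}

And that’s before you’ve decoded a single frame, let alone searched or annotated anything.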

The opportunity cost of not understanding your data

Given how prominent the need for visualizing data is, if organizations don’t spend resources building even the bare-minimum tools described above, users tend to find (often poor) substitutes to work around the problem. Teams try to repurpose model testing, data curation, or labeling tools in an attempt to visualize and understand parts of their workflows and how their data fits in. After all, you can’t debug what you can’t see! And if teams can’t debug effectively, they might conclude, perhaps unfairly, that their models don’t work. Blame the data and the tools, not the models that fail to perform!

Regardless of what method you choose for visualizing your data, it’s important to consider how easily you can properly analyze and inspect the data to identify root causes of model underperformance. How will you associate labels or access data for training or inference? Can you examine the metadata information to start filtering? Can you see what the pre-processed or augmented version of the data would look like? Will you be able to create custom queries which could then be used within ML pipelines?

The ApertureDB UI gives our users an easy way to visualize and analyze data stored in ApertureDB. Like any database UI, it lets them query and explore the supported data types, and the same queries can be issued programmatically, as sketched below. Check out our demo video to see the UI in action, and read more about it.
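
For a taste of the programmatic side, here is a minimal sketch using the ApertureDB Python client to pull a handful of labeled images. The host, credentials, and the "category" property are made up for illustration; see the ApertureDB documentation for the exact client API:

    # find_images.py - a sketch of querying ApertureDB for a few labeled images.
    # Host, credentials, and the "category" property are hypothetical.
    from aperturedb.Connector import Connector

    db = Connector(host="localhost", user="admin", password="admin")

    # Find up to 5 images whose "category" property equals "dog",
    # returning both their metadata and the image blobs themselves.
    query = [{
        "FindImage": {
            "constraints": {"category": ["==", "dog"]},
            "results": {"all_properties": True, "limit": 5},
            "blobs": True,
        }
    }]

    response, blobs = db.query(query)
    print(response)                       # per-command status and matched properties
    print(len(blobs), "images returned")  # raw image bytes, ready for a viewer

The same JSON query that drives this snippet can be reused inside an ML pipeline, which is exactly the kind of custom query the questions above are getting at.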


Actively growing communities we are part of:

  • MLOps - https://mlops.community/ : Not only is the Slack channel filled with practical nuggets, you can also learn about new tools, use cases, and technologies from their blogs and newsletters. Thank you for publishing our blog on the need for purpose-built databases! Don’t miss their in-person conference, AIQCon, in SF this June!
  • AICamp - a thriving global community with in-person meetups in many big cities. They draw a crowd of curious folks and host some great talks and workshops to help keep up with the AI hype. We were lucky to present here in SF to an audience of 130+ AI folks! We made the slides from our presentation available for free here: Are Vector Databases Enough for Multimodal Data?


Now for the cocktail, Corpse Reviver #2!

Corpse Reviver #2: Don’t let the deadly name worry you. It tastes delicious and is quite strong! I made it with Lillet Blanc as the white wine aperitif. It’s a fresher, citrus-led drink compared to Corpse Reviver #1.



Understanding and visualizing complex datasets can indeed be challenging. How do you tackle debugging data in your AI projects, and what strategies have you found most effective for visualization and analysis?
