ApertureData Problem of the Month: Debugging Data in the Dark

Keep reading to hear about my latest attempt at making a new (deadly) cocktail!

Tackling the Problem of the Month: Debugging Data in the Dark

When you’re training a model, or when a trained model fails, one of the first things you’d probably do is try to understand the data and annotations you’re working with. Do we need more representative data? Is anything off with our annotations? If there’s something awry with our training data, we want to know as soon as possible!

Training and fine-tuning models, both to improve them and to accommodate new data, is a fundamental requirement for AI teams. Regardless of whether a dataset pre-existed, was collected in-house, or was purchased, data science and analytics teams need to view and navigate it to understand what the data looks like. Better understanding leads to better models, faster.

And yet, I’ve spoken to data scientists and ML engineers who often embark blindly on model training without truly understanding the underlying datasets. It’s not for lack of trying! Even though it may seem like a simple requirement - being able to visualize and analyze your dataset - it turns out to be far from straightforward when you’re dealing with complex multimodal data types like images and videos.

Why visualizing and analyzing multimodal data is tough

It’s one thing to click on a URL and view a file in your browser, but a whole can of worms to search for specific information or labels and visualize your dataset efficiently, at scale. It gets even harder when the data you are dealing with is large, comes in domain-specific formats, lives in storage systems that are slow or expensive to access, or needs its annotations verified.

When your data is complex, even querying a subset to scan through a few records can feel like a Herculean effort: we’ve heard countless stories of folks grappling with finding relevant images, downloading them to local folders, hunting for the right viewers, struggling with encodings in some cases – and then suffering even more trying to see the results of every round of augmentations they apply to the data.

We’ve heard all the hacky workarounds, from writing scripts that generate HTML files to display images in the desired format, to spinning up web pages just to filter by one identifying metadata property so the images pop up – barely meeting the definition of a UI (a sketch of the first hack appears below). And that’s just for images!
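
For the curious, here’s a minimal sketch of that HTML-gallery hack in Python. The folder path, glob pattern, and output filename are all hypothetical; yours would differ:

    # gallery.py - the classic "generate an HTML file to eyeball your images" hack.
    # The folder path, glob pattern, and output filename are hypothetical.
    from pathlib import Path

    IMAGE_DIR = Path("downloaded_images")  # images already pulled down locally
    OUT_FILE = Path("gallery.html")

    figures = []
    for img in sorted(IMAGE_DIR.glob("*.jpg")):
        # One thumbnail per image, with the filename as a caption
        figures.append(
            f'<figure><img src="{img.as_posix()}" width="256">'
            f'<figcaption>{img.name}</figcaption></figure>'
        )

    OUT_FILE.write_text(
        "<html><body style='display:flex;flex-wrap:wrap'>"
        + "\n".join(figures)
        + "</body></html>"
    )
    print(f"Open {OUT_FILE} in a browser to scan {len(figures)} images.")

It works, right up until you want to filter by a different property, view the next thousand images, or share results with a teammate.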

This problem is even worse with videos. With videos, folks often have to wrangle a zoo of encodings and run into trouble when processing them or when the files themselves are too large.
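
To give a flavor of that wrangling: often the first chore is just finding out what codec and resolution each file actually uses, for example by shelling out to ffprobe from the FFmpeg suite. A minimal sketch, assuming ffprobe is installed, with a hypothetical filename:

    # probe.py - check a video's codec and resolution with ffprobe (requires FFmpeg).
    import json
    import subprocess

    def probe_video(path: str) -> dict:
        # Ask ffprobe for the first video stream's codec name, width, and height as JSON
        result = subprocess.run(
            ["ffprobe", "-v", "error",
             "-select_streams", "v:0",
             "-show_entries", "stream=codec_name,width,height",
             "-of", "json", path],
            capture_output=True, text=True, check=True,
        )
        return json.loads(result.stdout)["streams"][0]

    print(probe_video("clip_0001.mp4"))  # e.g. {'codec_name': 'h264', 'width': 1920, 'height': 1080}

And that’s before you’ve decoded a single frame, let alone searched or annotated anything.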

The opportunity cost of not understanding your data

Given how prominent the need for visualizing data is, if organizations don’t spend resources building even the bare-minimum tools described above, users tend to find (often poor) substitutes to work around the problem. Teams try to repurpose model testing, data curation, or labeling tools in an attempt to visualize and understand parts of their workflows and how their data fits in. After all, you can’t debug what you can’t see! And if teams can’t debug effectively, they might conclude, perhaps unfairly, that their models don’t work. Blame the data and the tools, not the models that fail to perform!

Regardless of what method you choose for visualizing your data, it’s important to consider how easily you can properly analyze and inspect the data to identify root causes of model underperformance. How will you associate labels or access data for training or inference? Can you examine the metadata information to start filtering? Can you see what the pre-processed or augmented version of the data would look like? Will you be able to create custom queries which could then be used within ML pipelines?

The ApertureDB UI gives our users an easy way to visualize and analyze data stored in ApertureDB. Like any database UI, it lets them query and explore the supported data types, and the same queries can be issued programmatically, as sketched below. Check out our demo video to see the UI in action, and read more about it.
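
For a taste of the programmatic side, here is a minimal sketch using the ApertureDB Python client to pull a handful of labeled images. The host, credentials, and the "category" property are made up for illustration; see the ApertureDB documentation for the exact client API:

    # find_images.py - a sketch of querying ApertureDB for a few labeled images.
    # Host, credentials, and the "category" property are hypothetical.
    from aperturedb.Connector import Connector

    db = Connector(host="localhost", user="admin", password="admin")

    # Find up to 5 images whose "category" property equals "dog",
    # returning both their metadata and the image blobs themselves.
    query = [{
        "FindImage": {
            "constraints": {"category": ["==", "dog"]},
            "results": {"all_properties": True, "limit": 5},
            "blobs": True,
        }
    }]

    response, blobs = db.query(query)
    print(response)                       # per-command status and matched properties
    print(len(blobs), "images returned")  # raw image bytes, ready for a viewer

The same JSON query that drives this snippet can be reused inside an ML pipeline, which is exactly the kind of custom query the questions above are getting at.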


Actively growing communities we are part of:

  • MLOps - https://mlops.community/ : Not only is the Slack channel filled with practical nuggets, you can also learn about new tools, use cases, and technologies from their blogs and newsletters. Thank you for publishing our blog on the need for purpose-built databases! Don’t miss their in-person conference, AIQCon, in SF this June!
  • AICamp - a thriving global community with in-person meetups in many big cities. They draw a crowd of curious folks and host some great talks and workshops to help keep up with the AI hype. We were lucky to present here in SF to an audience of 130+ AI folks! We made the slides from our presentation available for free here: Are Vector Databases Enough for Multimodal Data?


Now for the cocktail, Corpse Reviver #2!

Corpse Reviver #2: Don’t let the deadly name worry you. It tastes delicious and is quite strong! I made it with Lillet Blanc as the white wine aperitif. It’s a fresher, citrus-led drink compared to Corpse Reviver #1.



Understanding and visualizing complex datasets can indeed be challenging. How do you tackle debugging data in your AI projects, and what strategies have you found most effective for visualization and analysis?
