ApertureData Problem of the Month: Accessing Meaty (But Sensitive!) Data for Model Training

Craving something dessert-like? Keep reading for a simple but very chocolatey cocktail!

Tackling the Problem of the Month: Accessing Meaty (But Sensitive!) Data for Model Training

What kind of data are we talking about?

Companies love hoarding data over their lifetime: saving screenshots, emails, and logs from customers, employees, and vendors, "just in case" they need them one day. And sure, maybe they will; there may well be a treasure trove of useful information hidden across the many modalities collected over the years. Hospitals keep extensive patient records, state governments hold large amounts of geological data, and stores record endless hours of video. In the last few years, deep learning models began unlocking the possibility of processing live data for insights. LLMs, RAG, and especially the recent multimodal-capable models have taken this a step further: they can accept and glean information from multimodal data, old and new, and surface the insights hidden within it for the company and its customers. Perhaps a tumor appeared a certain way on all the relevant CT scans; it can now be found and included in analysis, building on old knowledge and new to help fight disease.

Why do we need this data, anyway?

While off-the-shelf models have gotten pretty good for most generic ML use cases, it's rare to find one that's perfect for your business. Techniques like RAG and GraphRAG can mitigate some of these challenges, but the ultimate solution, even if it's applied to a smaller subset of data, will likely be to train your own model, tailored to your business. The best way to do that? Use your own rich, complex datasets for training! But training on this treasure trove of rich (and often sensitive!) data is much easier said than done, due in large part to the data permissions required to access it.

Why is it hard to access this data?

Security and privacy best practices at any medium-to-large company, especially one that deals with PII (personally identifiable information), make it deliberately hard to access this data. Heard of the principle of least privilege? Never grant unnecessary permissions, make people jump through hoops to get the ones they need, and revoke them as soon as possible! And yet, what's inherently unsafe to security professionals is an opportunity for data scientists to do their very best work. The more relevant data they can access, the better they can deploy AI techniques to deliver what their organizations and customers need.

Is it really worth the trouble to tap into this data?

If you care about winning, yes. In the race to build the best-performing AI applications, companies need to evolve or be left behind. Who would you bet on: a company that trains state-of-the-art AI models on both its historical and new knowledge, or one that relies on manual scanning and halfway solutions and hopes for the best?

How do we safely unlock the treasure?

Pick a database solution that not only makes the firehose of multimodal information accessible to anyone deploying ML, but also maintains data provenance, supports governance, is searchable, and stays consistent. Easier said than done, we know :-)
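To make that checklist a little more concrete, here is a minimal, purely illustrative Python sketch (not ApertureDB's actual API; every class, field, and role name below is hypothetical) of how provenance tracking and role-based governance can be enforced at query time over a multimodal catalog:

```python
from dataclasses import dataclass, field

@dataclass
class Asset:
    """One multimodal record with provenance and governance metadata."""
    asset_id: str
    modality: str                   # e.g. "image", "text", "video"
    source: str                     # provenance: which system it came from
    labels: set = field(default_factory=set)
    required_role: str = "analyst"  # governance: minimum role to view it

class Catalog:
    # Higher rank = broader access (hypothetical role hierarchy)
    ROLE_RANK = {"analyst": 1, "data_scientist": 2, "admin": 3}

    def __init__(self):
        self._assets = []

    def ingest(self, asset: Asset):
        self._assets.append(asset)

    def search(self, role: str, modality=None, label=None):
        """Return only the assets the caller's role is allowed to see."""
        rank = self.ROLE_RANK.get(role, 0)
        return [
            a for a in self._assets
            if rank >= self.ROLE_RANK[a.required_role]
            and (modality is None or a.modality == modality)
            and (label is None or label in a.labels)
        ]

catalog = Catalog()
catalog.ingest(Asset("scan-001", "image", "hospital_pacs",
                     {"ct", "chest"}, "data_scientist"))
catalog.ingest(Asset("note-042", "text", "ehr_export",
                     {"radiology"}, "analyst"))

# An analyst sees only the less-restricted note; a data scientist sees both.
print(len(catalog.search("analyst")))          # 1
print(len(catalog.search("data_scientist")))   # 2
```

The point of the sketch is that access control and provenance live on the record itself, so every search is filtered by policy rather than relying on each consumer to remember the rules.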

Even with the right solution in place, navigating permissions can still be a challenge. Even as one of these multimodal database solutions, it took us at ApertureData months to help one of our Fortune 50 customers query their data, simply because we needed different permissions to set up the database in a cloud project accessible to the relevant teams, and to set up regular ingestion from various tables into this unified source for the data scientists.

It was a price worth paying, but it was gnarly to get there! At least the results speak for themselves: it's now 10x easier for them to access whatever data they need and enrich it with whatever new information their ever-improving models afford them.


Don’t miss these events in June:

AIQCon: The MLOps Community is thriving on Slack, their newsletters give voice to a wide array of issues in the ML / GenAI space, and now they are hosting their first in-person conference in SF. If you're working on AI systems, you don't want to miss this one! Use the discount code "APERTUREDATA" for 15% off: https://www.aiqualityconference.com/

GenAI Zoo: Given the number of tools in the LLM / RAG space, "zoo" really is the best word for it. A bunch of founders will talk about the various tools, all in one day. https://www.dhirubhai.net/events/2024midyeargenaizoo7189825259305349120/theater/

OSS4AI talks and meetups: Tuesday Tech Talks hosted by Yujian Tang: Multimodal Search for Generative AI. Multimodal search is one of the hottest buzzwords in generative AI. Come learn how ApertureData is revolutionizing multimodal data search by leveraging graphs, vectors, labels, and more, from the founder herself, Vishakha Gupta. RSVP here: https://lu.ma/y30zcpeb


Now for the cocktail that is my current favorite:

Chocolate Martini: The Godiva liqueur is the crown jewel of this luxurious take on the chocolate martini. The flavor is so elegant that you don't even need to add a sugary chocolate syrup like most recipes do! (But of course, a little chocolate garnish never hurts.)


