Why Machine Learning Projects Fail: Data
Introduction
I’m in an almost decade-long love affair with computer vision. If AI enables computers to think, computer vision enables computers to see, observe, and understand. Artificial Intelligence (AI) is huge. Computer vision is just one of the many pieces of AI, but it is my favorite piece for sure. What interests me about computer vision is that it has as much potential to save lives as it has to destroy them. One can compare its power and potential to nuclear power or quantum computing. What do I mean? Just like nuclear power, computer vision has the potential to save the world while at the same time ruining it. Nuclear power gave us the microwave oven… and the most destructive weapon in human history. Quantum computing promises to cure diseases we never dreamed of attacking… while at the same time breaking all of today's cryptography in minutes, effectively rendering all internet security impotent.
But this article is not about the ethics of AI. This article is about why we are not solving more of humanity’s cancers and diseases; why we are not saving more of the world with machine learning projects. The answer is not what you might expect.
The Problem
While thousands of experts worldwide are working on some of the most complex problems and making huge progress on the algorithms required for machine learning, many still lack access to the large amounts of quality data that a machine learning project requires.
With any machine learning project, computer vision or not, you need a significant amount of quality data to train on. The problem is that it is very difficult to produce the quantity and quality of data that machine learning requires. In computer vision projects, the problem is exacerbated by the requirement to annotate those huge amounts of data with high consistency and accuracy. For instance, every picture labeled “fire hydrant” needs to really contain a fire hydrant. Every picture labeled “Coke” needs to really show a Coke. Perfection in quantity and accuracy matters when it comes to training an AI model.
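To make that concrete, here is a minimal sketch of the kind of pre-training label audit a team might run over a dataset. The record format and label taxonomy are my own illustrative choices, not any particular tool’s:

```python
# Minimal sketch of a pre-training label audit (hypothetical record format).
# Flags labels outside the agreed taxonomy and images with no annotations at all.

ALLOWED_LABELS = {"fire hydrant", "coke", "pepsi", "other soda"}

dataset = [
    {"image": "img_0001.jpg", "labels": ["coke"]},
    {"image": "img_0002.jpg", "labels": ["fire hidrant"]},  # typo: inconsistent label
    {"image": "img_0003.jpg", "labels": []},                # unlabeled image
]

def audit(records):
    problems = []
    for rec in records:
        if not rec["labels"]:
            problems.append((rec["image"], "no labels"))
        for label in rec["labels"]:
            if label.lower() not in ALLOWED_LABELS:
                problems.append((rec["image"], f"unknown label: {label!r}"))
    return problems

for image, issue in audit(dataset):
    print(f"{image}: {issue}")
```

Checks like this catch the easy inconsistencies; the harder question of whether the label is actually correct still needs human review, which is where the costs below come in.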
Use Cases and Barriers
Let me elaborate on just some of the barriers to acquiring data for machine learning projects:
Annotation
Assuming you can acquire the thousands, if not millions, of pictures you need for a computer vision project, you now face the Herculean effort of drawing a bounding box around each thing you want recognized and annotating that box with what’s inside. Take a simple retail use case. Ignoring potential privacy issues, say you want to recognize when someone is holding a Coke or a Pepsi in a store. Forget about why you’d want to do that for now, because it can get really creepy, quickly. But trust me, it is being done. Regretfully, I’m that guy who used to work on computer vision projects that bordered on the ethical bounds of privacy, which is the reason most of my keynotes these days focus on the ethics of AI. But I digress…
OK, to annotate a picture and prepare it to be machine-learned, you use your mouse to trace a rectangle around that Coke or Pepsi in someone’s hand. Then you choose “Coke” or “Pepsi” to label it. You will most likely also need pictures of people holding other types of soda, so that the machine learning algorithm can be trained on what a Coke or Pepsi is not. Great, now you need to do that 10,000 times. Also consider this: you have to find and acquire these pictures in the first place. If you do a simple internet search for “Coke” and go to the images tab, you’ll get a few hundred pictures. The problem is that many of them, if not the majority, will be on a white or transparent background. Computer vision projects need pictures of the item to be recognized against as many different backgrounds as possible for accuracy. OK, so you switch your internet search to “person holding a Coke.” Feel free to try it. You’ll only get 10 or so quality pictures you can use. This is where I used to scratch my head, then try other desperate internet searches, only to end up failing.
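For readers who have never annotated, the output of all that tracing and labeling is typically stored as structured records. Here’s a minimal sketch of what one record might look like, in a simplified, COCO-style layout; the field names are illustrative, and the exact schema varies by tool:

```python
# Simplified, COCO-style annotation record for one image (schema varies by tool).
# Bounding boxes are [x, y, width, height] in pixels, one entry per traced rectangle.

annotation = {
    "image": "store_cam_0042.jpg",
    "width": 1920,
    "height": 1080,
    "objects": [
        {"label": "coke",  "bbox": [812, 340, 96, 210]},   # can in shopper's hand
        {"label": "pepsi", "bbox": [1410, 355, 90, 205]},  # negative example for "coke"
    ],
}

# Multiply this by 10,000 images and the scale of the manual effort becomes clear.
for obj in annotation["objects"]:
    x, y, w, h = obj["bbox"]
    print(f'{obj["label"]}: box at ({x}, {y}), {w}x{h} px')
```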
Costs
In the use case above, internet searches will ultimately lead you to several online picture databases that cost money. That is fine and valuable for onesies and twosies, like a picture of good-looking young businesspeople for your website. But you need thousands of pictures. It just wouldn’t be cost-effective, even if these online companies had thousands of pictures of people holding Pepsis… They don’t.
Regulation
Another barrier to the data is regulation. Consider your health data or your financial data. It’s protected by regulation as a result of privacy law. Say you want to do a “save the world” computer vision project to identify a specific type of cancer from an x-ray or MRI image. Those images are protected by HIPAA (the Health Insurance Portability and Accountability Act). An internet search for x-rays and MRIs containing cancer won’t yield much. And even if you get access to cancer researchers, physicians, and others who have the types of images you need, they cannot legally release them to you because of HIPAA restrictions.
We once did a project to detect the early onset of Alzheimer’s disease. With some simple testing on a small dataset we acquired from the researchers, and with an algorithm we worked on, we had some confidence we could succeed with a solution that could save lives. The initial proof of concept yielded results that were impressive and exciting. But to overcome the HIPAA restrictions, certified professionals (i.e., doctors) needed to scrub the images of any identifying text before release. That was a huge manual effort that could not be overcome. The project failed because we could not get the amount of data we needed.
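For what it’s worth, part of that scrubbing can be assisted by automation, though nothing like this replaces the certified reviewers HIPAA demands. Here is a rough sketch, assuming the pytesseract OCR wrapper (and the underlying Tesseract binary) plus Pillow; the file names are hypothetical:

```python
# Rough sketch: black out OCR-detected text regions in a scan before human review.
# This can assist, but does not replace, certified de-identification under HIPAA.
# Requires: pip install pillow pytesseract (plus the Tesseract binary on the system).

from PIL import Image, ImageDraw
import pytesseract

def preblack_text_regions(in_path: str, out_path: str, min_conf: float = 40.0) -> int:
    """Cover every OCR-detected word with a black box; return the box count."""
    img = Image.open(in_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
    boxes = 0
    for i, text in enumerate(data["text"]):
        if text.strip() and float(data["conf"][i]) >= min_conf:
            x, y = data["left"][i], data["top"][i]
            w, h = data["width"][i], data["height"][i]
            draw.rectangle([x, y, x + w, y + h], fill="black")
            boxes += 1
    img.save(out_path)
    return boxes

# Hypothetical usage:
# n = preblack_text_regions("mri_slice_017.png", "mri_slice_017_scrubbed.png")
```

Even with a pre-pass like this, a qualified human still has to verify every image, which is exactly the manual bottleneck that killed our project.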
Hope for the Solution
I cannot tell you how often an entrepreneur has come to me with the statement (paraphrasing), “I have a great business idea. If you can build the software, I know it’s a winner,” and then proceeded to describe a system that requires computer vision. I typically said something like, “Yeah, we can build the application. Do you have access to the data that needs to be machine-learned?” Typically I’d get a blank stare, followed by something like, “I thought that is what you guys do.” I’d have to explain that we build enterprise-grade software solutions. Then I’d explain what a data scientist does and recommend firms I like that specialize in the creation and training of computer vision algorithms and models. But I would always say, “They are going to need you to find the data too.” And that is when the great ideas die.
It was late in 2021, as I walked off stage after presenting “The Ethics of AI,” that Peter Harlan of Innodata approached me to introduce himself. The highlight of the 10-minute conversation can be summed up nicely in two statements:
Peter: Tim, I’ve been in ML since 2004, and we agree that the data has always been the hardest part. Well, we’re that company that does the data.
Tim: I have been looking for you forever.
Full disclosure: my relationship with Innodata since that day meeting Peter has blossomed into a small consulting project, where my mission is to advise from the developer’s point of view and to help get the word out about this best-kept secret in machine learning… one that has been around for over 25 years.
Innodata is really a “secret weapon” behind some of today’s truly powerful ML and AI initiatives. You don’t typically hear the Innodata name because their work is protected by NDA; what you hear about are the great successes their customers have. As I understand it, they originally started back in the ’80s as the “data curation and conversion backend” of large, syndicated data providers. They built their own natural language processing AI to do it, learned a whole lot about AI, and learned how to make great training data in the process. Now they help a lot of companies, large and small, with their AI development.
It has been clear to me for a while that a company needed to do what Innodata is doing: taking on the mundane but critical job of sourcing, sometimes synthesizing, and labeling training data with accuracy, at serious scale. This is crucial so that AI/ML developers can focus on their expertise: feature engineering, model iteration and tuning, and deploying and testing AI apps and features. It’s a different approach from the historical workflow, and sure, you probably need a certain volume of AI development activity to warrant spinning up a new partner like Innodata and taking data sourcing and prep out of the hands of your DS/ML teams. But the largest AI companies, like the Big 3 clouds, and a large number of startup and growth-stage AI/ML ISVs have figured it out. I am confident more will figure it out in the coming years. It just makes sense.
There are famous stories of computer vision screw-ups, like the Kinect for Xbox having significant trouble recognizing people of a specific ethnic group, and the US military training tank recognition on white backgrounds and failing completely.
Expertise
To build ethically responsible vision AI, you need not just quantity and quality in your training images, but also real data diversity: diversity in ethnicity and gender, but also in camera angle, lighting, resolution, frame rate, setting, props and uniforms, languages, dialects and accents spoken, and image “occlusions” (like glasses or a mask).
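A simple way to see whether a dataset actually has that diversity is to tabulate coverage across its metadata axes. Here is a minimal sketch; the metadata fields and values are hypothetical:

```python
# Minimal sketch of a diversity audit over dataset metadata (hypothetical fields).
# Tabulating coverage per attribute makes gaps (e.g., only one lighting condition,
# no occluded faces) stand out before training ever starts.

from collections import Counter

metadata = [
    {"lighting": "daylight", "camera_angle": "eye-level", "occlusion": "none"},
    {"lighting": "daylight", "camera_angle": "overhead",  "occlusion": "mask"},
    {"lighting": "low",      "camera_angle": "eye-level", "occlusion": "glasses"},
]

for attribute in ("lighting", "camera_angle", "occlusion"):
    counts = Counter(sample[attribute] for sample in metadata)
    print(attribute, dict(counts))
```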
To achieve precision annotation of medical images like x-rays, MRIs, and ultrasounds (assuming you’ve acquired a large enough quantity and overcome the privacy laws involved), you likely need labelers who already “speak medical”: experts who know the acronyms and industry terminology.
Likewise, to achieve precision annotation with the kind of accuracy that finance folks need from Bloomberg feeds, or that lawyers need from case law data on LexisNexis, you typically need labelers who already “speak legal” or “speak finance”: experts who know the acronyms and industry terminology.
To achieve consistency (which is what accuracy really means in training data), you need something called “multi-pass, arbitrated labeling,” where every labeled item has been reviewed and labeled by at least two skilled humans, and any disagreement is settled by an even more skilled human (see the sketch below). Sometimes, just to get enough data in the first place that isn’t owned by anyone else or restricted in its usage rights, you have to actually make the data: synthetically generating “reality-accurate” documents, or doing “scenario-based ground truth data capture,” where you create the setting to capture the data or “stage” images with the diversity needed. This is exactly what Innodata does.
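As a rough sketch of what multi-pass, arbitrated labeling means in code: two independent annotators label the same object, agreement auto-accepts, and disagreement escalates to a senior reviewer. The 0.5 IoU threshold and record shapes here are my own illustrative choices, not a standard:

```python
# Sketch of multi-pass, arbitrated labeling: two independent annotators label the
# same object; agreement auto-accepts, disagreement escalates to a senior arbiter.

def iou(a, b):
    """Intersection-over-union of two [x, y, w, h] boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    ix = max(0, min(ax2, bx2) - max(a[0], b[0]))
    iy = max(0, min(ay2, by2) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def arbitrate(pass1, pass2, iou_threshold=0.5):
    """Accept when both passes agree on label and box; otherwise escalate."""
    if pass1["label"] == pass2["label"] and iou(pass1["bbox"], pass2["bbox"]) >= iou_threshold:
        return {"status": "accepted", "label": pass1["label"]}
    return {"status": "escalate to senior labeler", "passes": (pass1, pass2)}

a = {"label": "coke",  "bbox": [810, 338, 98, 212]}
b = {"label": "pepsi", "bbox": [812, 340, 96, 210]}
print(arbitrate(a, b))  # labels disagree -> escalated
```

The point of the second pass is not speed but consistency: a single labeler’s systematic mistake never gets a second opinion, while a two-pass-plus-arbiter workflow catches it every time it matters.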
As far as I know, Innodata has no competition that does what they do end to end… yet.
Summary
Typically, the requirements for quality machine learning data are way too much to ask of your data science or ML team, or of folks from a “crowd.” For the last 20 years or so, Innodata has built significant expertise in exactly that, with efficiency and at scale.
The Achilles’ heel of machine learning projects is access to, and precision annotation of, the data required for developing great models. We have the raw CPU power. We have experts working on the algorithms. It’s the expertise and tools of companies like Innodata that will pave the way for AI solutions that are not only innovative and world-improving, but also lifesaving.