Machine Learning approaches to reduce Domain expert's time

As with any system that depends on data inputs, Machine Learning (ML) is subject to the axiom of "garbage in, garbage out." Clean, accurately labeled data is the foundation for building any ML model. An ML training algorithm learns patterns from ground-truth data and, from there, learns to generalize to unseen data. If the quality of your training data is low, it will be very difficult for the ML algorithm to keep learning and extrapolating.

Think about it in terms of training a pet dog. If you fail to properly train the dog with fundamental behavioural commands (inputs), or do it incorrectly, you can never expect the dog to expand through observation into more complex positive behaviours, because the underlying inputs were absent or flawed to begin with. Proper training is time-intensive and even costly if you bring in an expert, but the payoff is great if you do it right from the start.

When training an ML model, creating quality data requires a domain expert to spend time annotating the data. This may include drawing a box around the desired object in an image, or assigning a label to a text entry or a database record. Particularly for unstructured data like images, videos, and text, annotation quality plays a major role in determining model quality. Unlabeled data like raw images and text is usually abundant – labeling is where effort needs to be optimized. This is the human-in-the-loop part of the ML lifecycle and is usually the most expensive and labor-intensive part of any ML project.

Data annotation tools like Prodigy, Amazon SageMaker Ground Truth, NVIDIA RAPIDS, and DataRobot human-in-the-loop are constantly improving in quality and providing intuitive interfaces for domain experts. However, minimizing the time domain experts spend annotating data is still a significant challenge for enterprises today – especially in an environment where data science talent is limited yet in high demand. This is where two newer approaches to data preparation come into play.

Active Learning

Active learning is a method where an ML model actively queries a domain expert for specific annotations. Here, the focus is not on getting a complete annotation of all the unlabeled data, but on getting the right data points annotated so that the model can learn better. Take, for example, healthcare and life sciences: a diagnostics company specializing in early cancer detection helps clinicians make informed, data-driven decisions about patient care. As part of its diagnostic process, it needs to annotate CT scan images, highlighting the tumors they contain.

With active learning, after the model learns from a few images with tumor regions marked, it will only ask users to annotate images where it is unsure whether a tumor is present. These are boundary points, and annotating them increases the model's confidence the most. Where the model is confident above a particular threshold, it will self-annotate rather than ask the user. This is how active learning helps build accurate models while reducing the time and effort required to annotate data. Frameworks like modAL can help increase classification performance by intelligently querying domain experts to label the most informative instances.
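The query loop described above can be sketched in a few lines of plain Python. This is a minimal, illustrative sketch – the function and item names are hypothetical, and a real pipeline (e.g. with modAL) would wrap an actual classifier – but it shows the core idea: self-annotate confident predictions, and send only the points nearest the decision boundary to the expert.

```python
# Minimal sketch of uncertainty sampling for active learning.
# Assumes a binary classifier has produced P(tumor) scores for a pool
# of unlabeled scans; names and thresholds here are illustrative.

def select_queries(probabilities, confidence_threshold=0.9, budget=2):
    """Split unlabeled items into self-annotated and expert-query sets.

    probabilities: dict mapping item id -> P(positive class).
    Items the model is confident about (probability near 0 or 1) are
    self-annotated; the most uncertain ones (near 0.5) are sent to the
    domain expert, up to `budget` queries per round.
    """
    confident, uncertain = {}, []
    for item, p in probabilities.items():
        if p >= confidence_threshold or p <= 1 - confidence_threshold:
            confident[item] = int(p >= 0.5)         # self-annotate
        else:
            uncertain.append((abs(p - 0.5), item))  # distance from boundary
    # Query the expert on the points closest to the decision boundary.
    queries = [item for _, item in sorted(uncertain)[:budget]]
    return confident, queries

scores = {"scan_1": 0.97, "scan_2": 0.52, "scan_3": 0.05, "scan_4": 0.40}
labels, to_annotate = select_queries(scores)
# scan_1 and scan_3 are self-annotated; scan_2 and scan_4 go to the expert.
```

Each round, the expert's answers for the queried points are added to the training set and the model is retrained, so every annotated example is one the model could not resolve on its own.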

Weak Supervision

Weak supervision is an approach where noisy, imprecise signals or abstract domain heuristics are used to label a large amount of unlabeled data. It usually makes use of several weak labelers and combines them in an ensemble to build quality annotated data. The effort is to incorporate domain knowledge into an automated labeling activity.

For example, if an Internet Service Provider (ISP) needed a system to flag emails as spam or not spam, we could write weak rules such as checking for phrases like "offer", "congratulations", or "free", which are mostly associated with spam emails. Other rules could match patterns in sender addresses using regular expressions. These weak labeling functions could then be combined by a weak supervision framework like Snorkel or skweak to build improved-quality training data.
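A toy version of these labeling functions can be written in plain Python. The rules, names, and address patterns below are invented for illustration, and the combination step here is a simple majority vote – frameworks like Snorkel instead learn a weighted model over the labelers from their agreements and conflicts – but the shape is the same: each function votes or abstains, and the votes are aggregated into a training label.

```python
# Minimal sketch of weak supervision via labeling functions.
# Rules and addresses are illustrative; Snorkel-style frameworks would
# learn weights for each labeler rather than use a plain majority vote.
import re

SPAM, NOT_SPAM, ABSTAIN = 1, 0, -1

def lf_keywords(email):
    """Flag emails containing common promotional phrases."""
    phrases = ("offer", "congratulations", "free")
    return SPAM if any(p in email["body"].lower() for p in phrases) else ABSTAIN

def lf_sender(email):
    """Flag sender addresses matching a suspicious domain pattern."""
    return SPAM if re.search(r"@[\w.]+\.promo$", email["sender"]) else ABSTAIN

def lf_known_contact(email):
    """Trust mail from the user's own (hypothetical) company domain."""
    return NOT_SPAM if email["sender"].endswith("@mycompany.com") else ABSTAIN

LABELING_FUNCTIONS = [lf_keywords, lf_sender, lf_known_contact]

def weak_label(email):
    """Majority vote over the non-abstaining labeling functions."""
    votes = [lf(email) for lf in LABELING_FUNCTIONS if lf(email) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return SPAM if votes.count(SPAM) > votes.count(NOT_SPAM) else NOT_SPAM

emails = [
    {"sender": "deals@shop.promo", "body": "Congratulations! Free offer inside"},
    {"sender": "alice@mycompany.com", "body": "Meeting notes attached"},
]
weak_labels = [weak_label(e) for e in emails]  # first email SPAM, second NOT_SPAM
```

Each individual rule is noisy on its own; the value comes from aggregating many of them, which is exactly the ensemble step the weak supervision frameworks automate.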

ML at its core is about helping companies scale processes in ways that are physically impossible to achieve manually. However, ML is not magic, and it still relies on humans to a) set up and train the models properly from the start and b) intervene when needed to ensure the model doesn't drift so far that its results are no longer useful, or are even counterproductive.

The goal is to find ways to streamline and automate parts of the human involvement to improve time-to-market and results, while staying within the guardrails of optimal accuracy. It is universally accepted that getting quality annotated data is the most expensive yet most important part of an ML project. This is an evolving space, and a lot of effort is underway to reduce the time spent by domain experts and improve the quality of data annotations. Exploring and leveraging active learning and weak supervision is a solid strategy to achieve this across many industries and use cases.

Originally published at https://www.unite.ai/
