Machine Learning approaches to reduce Domain expert's time

As with any system that depends on data inputs, Machine Learning (ML) is subject to the axiom of "garbage in, garbage out." Clean, accurately labeled data is the foundation for building any ML model. An ML training algorithm learns patterns from ground-truth data and, from there, learns to generalize to unseen data. If the quality of your training data is low, it will be very difficult for the ML algorithm to keep learning and extrapolating.

Think about it in terms of training a pet dog. If you fail to properly train the dog with fundamental behavioural commands (inputs), or do it incorrectly, you can never expect the dog to expand through observation into more complex positive behaviours, because the underlying inputs were absent or flawed to begin with. Proper training is time-intensive and even costly if you bring in an expert, but the payoff is great if you do it right from the start.

When training an ML model, creating quality data requires a domain expert to spend time annotating the data. This may include drawing a box around the desired object in an image, or assigning a label to a text entry or a database record. Particularly for unstructured data like images, videos, and text, annotation quality plays a major role in determining model quality. Unlabeled data like raw images and text is usually abundant – labeling is where effort needs to be optimized. This is the human-in-the-loop part of the ML lifecycle and is usually the most expensive and labor-intensive part of any ML project.

Data annotation tools like Prodigy, Amazon SageMaker Ground Truth, NVIDIA RAPIDS, and DataRobot human-in-the-loop are constantly improving in quality and providing intuitive interfaces for domain experts. However, minimizing the time domain experts spend annotating data is still a significant challenge for enterprises today – especially in an environment where data science talent is limited yet in high demand. This is where two newer approaches to data preparation come into play.

Active Learning

Active learning is a method where an ML model actively queries a domain expert for specific annotations. Here, the focus is not on getting a complete annotation of all the unlabeled data, but on getting the right data points annotated so that the model can learn better. Take, for example, healthcare and life sciences: a diagnostics company specializing in early cancer detection helps clinicians make informed, data-driven decisions about patient care. As part of its diagnostic process, it needs to annotate CT scan images, highlighting the tumors they contain.

With active learning, after the model learns from a few images with tumor regions marked, it will only ask users to annotate images where it is unsure whether a tumor is present. These are boundary points, and annotating them increases the model's confidence the most. Where the model is confident above a particular threshold, it will self-annotate rather than ask the user. This is how active learning helps build accurate models while reducing the time and effort required to annotate data. Frameworks like modAL can help increase classification performance by intelligently querying domain experts to label the most informative instances.
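The query loop described above can be sketched in a few lines of plain Python. This is a minimal, illustrative sketch – the function and item names are hypothetical, and a real pipeline (e.g. with modAL) would wrap an actual classifier – but it shows the core idea: self-annotate confident predictions, and send only the points nearest the decision boundary to the expert.

```python
# Minimal sketch of uncertainty sampling for active learning.
# Assumes a binary classifier has produced P(tumor) scores for a pool
# of unlabeled scans; names and thresholds here are illustrative.

def select_queries(probabilities, confidence_threshold=0.9, budget=2):
    """Split unlabeled items into self-annotated and expert-query sets.

    probabilities: dict mapping item id -> P(positive class).
    Items the model is confident about (probability near 0 or 1) are
    self-annotated; the most uncertain ones (near 0.5) are sent to the
    domain expert, up to `budget` queries per round.
    """
    confident, uncertain = {}, []
    for item, p in probabilities.items():
        if p >= confidence_threshold or p <= 1 - confidence_threshold:
            confident[item] = int(p >= 0.5)         # self-annotate
        else:
            uncertain.append((abs(p - 0.5), item))  # distance from boundary
    # Query the expert on the points closest to the decision boundary.
    queries = [item for _, item in sorted(uncertain)[:budget]]
    return confident, queries

scores = {"scan_1": 0.97, "scan_2": 0.52, "scan_3": 0.05, "scan_4": 0.40}
labels, to_annotate = select_queries(scores)
# scan_1 and scan_3 are self-annotated; scan_2 and scan_4 go to the expert.
```

Each round, the expert's answers for the queried points are added to the training set and the model is retrained, so every annotated example is one the model could not resolve on its own.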

Weak Supervision

Weak supervision is an approach where noisy, imprecise signals or abstract domain heuristics are used to label a large amount of unlabeled data. It usually makes use of several weak labelers and combines them in an ensemble to build quality annotated data. The effort is to incorporate domain knowledge into an automated labeling activity.

For example, if an Internet Service Provider (ISP) needed a system to flag emails as spam or not spam, we could write weak rules such as checking for phrases like "offer", "congratulations", or "free", which are mostly associated with spam emails. Other rules could match patterns in sender addresses using regular expressions. These weak labeling functions could then be combined by a weak supervision framework like Snorkel or skweak to build improved-quality training data.
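A toy version of these labeling functions can be written in plain Python. The rules, names, and address patterns below are invented for illustration, and the combination step here is a simple majority vote – frameworks like Snorkel instead learn a weighted model over the labelers from their agreements and conflicts – but the shape is the same: each function votes or abstains, and the votes are aggregated into a training label.

```python
# Minimal sketch of weak supervision via labeling functions.
# Rules and addresses are illustrative; Snorkel-style frameworks would
# learn weights for each labeler rather than use a plain majority vote.
import re

SPAM, NOT_SPAM, ABSTAIN = 1, 0, -1

def lf_keywords(email):
    """Flag emails containing common promotional phrases."""
    phrases = ("offer", "congratulations", "free")
    return SPAM if any(p in email["body"].lower() for p in phrases) else ABSTAIN

def lf_sender(email):
    """Flag sender addresses matching a suspicious domain pattern."""
    return SPAM if re.search(r"@[\w.]+\.promo$", email["sender"]) else ABSTAIN

def lf_known_contact(email):
    """Trust mail from the user's own (hypothetical) company domain."""
    return NOT_SPAM if email["sender"].endswith("@mycompany.com") else ABSTAIN

LABELING_FUNCTIONS = [lf_keywords, lf_sender, lf_known_contact]

def weak_label(email):
    """Majority vote over the non-abstaining labeling functions."""
    votes = [lf(email) for lf in LABELING_FUNCTIONS if lf(email) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return SPAM if votes.count(SPAM) > votes.count(NOT_SPAM) else NOT_SPAM

emails = [
    {"sender": "deals@shop.promo", "body": "Congratulations! Free offer inside"},
    {"sender": "alice@mycompany.com", "body": "Meeting notes attached"},
]
weak_labels = [weak_label(e) for e in emails]  # first email SPAM, second NOT_SPAM
```

Each individual rule is noisy on its own; the value comes from aggregating many of them, which is exactly the ensemble step the weak supervision frameworks automate.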

ML at its core is about helping companies scale processes in ways that are physically impossible to achieve manually. However, ML is not magic, and it still relies on humans to a) set up and train the models properly from the start and b) intervene when needed to ensure the model doesn't drift so far that its results are no longer useful, or are even counterproductive.

The goal is to find ways to streamline and automate parts of the human involvement to improve time-to-market and results, while staying within the guardrails of optimal accuracy. It is universally accepted that getting quality annotated data is the most expensive yet most important part of an ML project. This is an evolving space, and a lot of effort is underway to reduce the time spent by domain experts and improve the quality of data annotations. Exploring and leveraging active learning and weak supervision is a solid strategy to achieve this across many industries and use cases.

Originally published at https://www.unite.ai/
