Why you should be annotating in-house for NLP
Raza Habib
CEO and Cofounder Humanloop (YC S20) | Host of High Agency: The Podcast for AI Builders
Deep learning is only as good as the data it’s trained on, so text labeling is a crucial step for ensuring the accuracy of Natural Language Processing (NLP) models. Most data science teams used to outsource data labeling; it was the only way to achieve the scale they needed. However, recent breakthroughs in techniques like programmatic supervision, active learning, and transfer learning have reduced how much annotated data you need and made in-house annotation more practical. The quality of your data matters more than the quantity, so if you can annotate in-house, you probably should. It gives you better control over quality, makes privacy easier to handle, and lets you make better use of subject-matter expertise.
In this blog post, we review how people used to label and discuss the recent breakthroughs that are making in-house labeling practical and affordable.
What is text labeling, and why is it important?
Most of the machine learning methods applied today are supervised. This means that to train an artificial intelligence (AI) model, you need a dataset where a human has provided examples of what you want the AI model to do. For text, you usually need some amount of domain expertise. Some common types of annotation are:

- Text classification (assigning a topic or intent label to a whole document)
- Sentiment labeling (positive, negative, or neutral)
- Named entity recognition (marking spans such as people, organizations, and dates)
The reason that text annotation is so important is that your machine learning model is trained to mimic what the annotators do. If the annotations are inconsistent or incorrect, then these mistakes will degrade the performance of your model.
Text annotation might seem simple, but there are often nuances to what the correct annotation should be. For example, in sentiment labeling, different people might reasonably disagree about the sentiment of an example sentence. Getting an entire team of annotators to label a dataset consistently and quickly is usually the biggest challenge in text labeling.
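Annotator consistency can be measured rather than guessed at. A common way to do this is Cohen's kappa, which scores how often two annotators agree beyond what chance alone would produce. Here is a minimal sketch with made-up sentiment labels (the annotator lists are illustrative, not real data):

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Fraction of examples where the two annotators gave the same label
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n**2
    return (observed - expected) / (1 - expected)

# Two annotators label the same ten reviews
ann1 = ["pos", "pos", "neg", "pos", "neg", "neg", "pos", "neg", "pos", "pos"]
ann2 = ["pos", "neg", "neg", "pos", "neg", "pos", "pos", "neg", "pos", "pos"]
print(round(cohen_kappa(ann1, ann2), 2))  # 0.58
```

A kappa near 1 means near-perfect agreement; values much below ~0.6 are usually a sign that the labeling guidelines need tightening before more data is annotated.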
Why did teams outsource annotation in the past?
Until recently, most teams building NLP models would outsource their annotation to service providers like Mechanical Turk or Scale AI.
This was because data scientists often needed vast quantities of labeled data: potentially hundreds of thousands of human-provided annotations. Getting this many labels from an in-house team was often too much work. It would usually mean hiring a lot of people, training them, and creating custom annotation software. Few teams other than well-funded mega-tech companies were willing to do it.
There are big downsides to outsourced annotation, but the largest are that it’s much harder to control quality, and private data requires special data sharing agreements. Outsourcing companies also don’t usually have the domain knowledge you need. However, when you needed hundreds of thousands of annotations, it just wasn’t practical to do it in-house.
What’s changed to make in-house annotation possible?
In the past two to three years, there have been three big breakthroughs that have made in-house annotation not just feasible but the best choice for most teams. The big breakthroughs were:

- Programmatic supervision: instead of labeling examples one at a time, domain experts write labeling functions (simple rules or heuristics) that annotate large volumes of data automatically.
- Active learning: the model selects the most informative unlabeled examples for humans to annotate, so far fewer labels are needed to reach the same accuracy.
- Transfer learning: large pretrained language models already capture a lot of linguistic knowledge, so they can be fine-tuned to a new task with only a small labeled dataset.

Taken together, these changes make it possible to train state-of-the-art models with datasets of only a few hundred labels. Getting this many labels is much more feasible and is driving a trend towards in-house annotation.
How can you do this in practice?
If you want to annotate data for NLP in-house, you’ll probably need annotation software. There are plenty of tools that offer annotation interfaces, but if you want the benefits of active learning and programmatic labeling, your options are more limited.
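The core idea behind active learning is simple enough to sketch in a few lines. Assuming a model that outputs class probabilities for each unlabeled example (the probabilities below are made up for illustration), least-confidence sampling picks the example the model is least sure about as the next one to hand to a human annotator:

```python
def least_confident(probs):
    """Index of the unlabeled example whose top class probability is lowest."""
    return min(range(len(probs)), key=lambda i: max(probs[i]))

# Hypothetical model confidences over four unlabeled examples (two classes)
pool_probs = [
    [0.95, 0.05],  # very confident: labeling this teaches the model little
    [0.55, 0.45],  # near the decision boundary: the most informative label
    [0.80, 0.20],
    [0.70, 0.30],
]
print(least_confident(pool_probs))  # 1
```

By repeatedly labeling the example the model is least confident about and retraining, teams typically reach a target accuracy with a fraction of the labels that random sampling would need.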
At Humanloop, we built Programmatic to solve some of the UI/UX issues associated with other solutions. Programmatic lets machine learning engineers get a large volume of annotated data quickly by providing a simple UI to rapidly iterate on labeling functions with fast feedback. From there, you can export your data to our team-based annotation tools to annotate with active learning, or use the data with other popular libraries like Hugging Face.
Programmatic is free forever! Try it now here.