Sifting Through Data for Gold
An ounce of gold today is worth about $1,750. To get that single ounce, mining companies process an astonishing 2 to 91 tons of rock. People panning for gold in rivers might be put off by that statistic. In data terms, the “signal-to-noise ratio” of gold to rock is so low that you have to be very good at filtering sediment.
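A rough back-of-the-envelope calculation makes that ratio concrete. The sketch below assumes a metric ton and rounds to about 32,150 troy ounces per ton; the exact figures are illustrative, not from the source.

```python
# Rough signal-to-noise arithmetic for the gold analogy.
# Assumes ~32,150 troy ounces per metric ton (1,000,000 g / 31.1 g).
TROY_OZ_PER_TON = 32_150

for tons_of_rock in (2, 91):
    # One troy ounce of "signal" per this many ounces of "noise".
    noise_parts = tons_of_rock * TROY_OZ_PER_TON
    print(f"{tons_of_rock} tons of rock ≈ 1 part gold in {noise_parts:,}")
```

Even at the low end, the gold is one part in roughly 64,000; at the high end, closer to one in three million.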
Discovering gold in text
Applying artificial intelligence (AI) to sift through text is called Natural Language Processing (NLP), and thanks to the massive amount of text produced on the Internet each day, NLP has become a celebrity among machine learning techniques.
Unfortunately, many projects using NLP have failed to meet expectations. The facts might be boring. The graphs look dubious. Inexplicable clusters of data claw their way to the top of the charts like an auto-tuned boy band with a questionable link to the Kardashians. What gives?
More Noise than Signal
Noisy text is the likely culprit. Cleaning noise prior to processing is often ignored in NLP projects, but it’s essential.
Text sources often contain duplicates, emojis, headers, multiple authors, and mixed languages. We may expect fancy AI algorithms to magically ignore garbage. Unfortunately, this isn’t the case. The AI blissfully processes the gold together with the rock and spits out poor results without any indication that something is wrong. Signal and noise become camouflaged in the output.
Unless text is written by people with degrees in Literature, it likely contains noise. At Halosight, we’ve discovered that more than 80% of data feeds can be noise. This leads to significant signal loss that hides insights.
For example, an email may contain a header block, a salutation, the body, and a legal disclaimer nobody really wants to include or read. It may even contain threaded duplicate replies. The junk obscures the meaning. Wouldn’t it be nice if you could teach a computer to recognize junk?
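To see what that teaching might look like in its simplest form, here is a minimal sketch that strips the email noise described above with hand-written patterns. The marker patterns and the sample email are hypothetical; a production cleanser would recognize these regions far more robustly than a few regular expressions can.

```python
import re

# Hypothetical noise markers; a real pipeline would learn or configure
# these patterns rather than hard-coding three of them.
HEADER_RE = re.compile(r"^(From|To|Subject|Date|Cc):", re.IGNORECASE)
QUOTED_REPLY_RE = re.compile(r"^(>|On .* wrote:)")
DISCLAIMER_RE = re.compile(r"confidential|intended recipient", re.IGNORECASE)

def strip_email_noise(raw_email: str) -> str:
    """Keep only the lines that look like the message body."""
    body_lines = []
    for line in raw_email.splitlines():
        if HEADER_RE.match(line):
            continue  # header block
        if QUOTED_REPLY_RE.match(line):
            continue  # threaded duplicate reply
        if DISCLAIMER_RE.search(line):
            continue  # legal disclaimer
        body_lines.append(line)
    return "\n".join(body_lines).strip()

email = """From: pat@example.com
Subject: Renewal question

Hi team, can we extend the contract by one quarter?

On Tue, Jan 4, Sam wrote:
> Sure, let's discuss.

This message is confidential and intended solely for the recipient."""

print(strip_email_noise(email))
# → Hi team, can we extend the contract by one quarter?
```

Only the body sentence survives; the headers, the quoted reply, and the disclaimer all fall away before the text reaches an NLP model.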
Teach Computers to Filter Junk
Luckily, computers can be taught to sift the rock away from the gold. Modern approaches based on AI can attack the junk using several strategies.
Noise Canceling AI
Most communication formats are simple but lack standards, so variation is high. Textual clutter is like the background sound filtered out by noise-canceling headphones: smart filtering removes the clutter and leaves the music intact.
Simplistic approaches like keywords and search expressions yield mediocre results. Layering artificial intelligence adds flexibility and power to filtration.
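A tiny sketch shows why keyword matching alone falls short. The phrase list below is hypothetical; the point is that fixed phrases catch only the boilerplate they anticipate.

```python
# A keyword-only noise detector, to illustrate why search expressions
# yield mediocre results on their own. The phrase list is hypothetical.
NOISE_PHRASES = ["unsubscribe", "confidential", "intended recipient"]

def looks_like_noise(line: str) -> bool:
    lowered = line.lower()
    return any(phrase in lowered for phrase in NOISE_PHRASES)

# Catches the obvious boilerplate...
assert looks_like_noise("This email is CONFIDENTIAL.")

# ...but misses a trivial rewording, which a model that reads
# context rather than exact strings could still flag.
assert not looks_like_noise("Please treat this message as private.")
```

The keyword filter is brittle under rewording; layering AI on top lets the filter generalize from patterns in context instead of exact strings.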
What To Look For in a Solution
The right solution will free you to focus on your business problem instead of data hygiene. Ask the following questions when evaluating NLP data cleansing:
- Content Types - Can it spot types of text and change strategies?
- Threads - Can it handle threaded conversations?
- Speakers/Authors - Can it separate content from multiple authors within a document?
- Scale - Can the pipeline scale to the volume and variety of data?
- Languages - How does it deal with language and embedded languages?
- Side Effects - Will the cleaned output be compatible with your UI?
- Purpose Built - Is it flexible and targeted specifically for cleansing scenarios?
- Configurable - Can the workflow be tailored to your use case?
Data Cleaning: Innovation At Work
Textual data cleaning doesn’t get as much attention as the algorithms, but it’s critically important for maximizing the results of any text-based data mining project. Prioritize cleansing or tap a vendor who can help. It’s the single best thing you can do to find the surprising insights you seek.