Sifting Through Data for Gold

An ounce of gold today is worth about $1,750. To get that ounce, companies process an astonishing 2 to 91 tons of rock. People panning for gold in rivers might be put off by that statistic. In signal-processing terms, the “signal-to-noise ratio” of gold to rock is so low that you have to be really good at filtering sediment.

Discovering Gold in Text

Artificial Intelligence (AI) for sifting through text is called Natural Language Processing (NLP). Thanks to the massive amount of text produced on the Internet each day, NLP has become a celebrity among machine learning disciplines.

Unfortunately, many projects using NLP have failed to meet expectations. The facts they surface might be boring. The graphs look dubious. Inexplicable data clusters its way to the top of the charts like an auto-tuned boy band with a questionable link to the Kardashians. What gives?

More Noise than Signal

Noisy text is the likely culprit. Cleaning noise before processing is often skipped in NLP projects, but it’s essential.

Text sources often contain duplicates, emojis, headers, multiple authors, and mixed languages. We may expect fancy AI algorithms to magically ignore the garbage. Unfortunately, they don’t. The AI blissfully processes the gold along with the rock and spits out poor results without any indication that something is wrong. Signal and noise become indistinguishable in the output.
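
To make two of those noise types concrete, here is a minimal Python sketch that strips emojis and drops duplicate documents before anything reaches an NLP model. The function names and the emoji ranges are assumptions chosen for illustration; this is a simple rule-based baseline, not a description of any particular product's approach.

```python
import re
import unicodedata

# Rough emoji ranges. This is an illustrative assumption, not a complete list.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector often attached to emojis
    "]+"
)


def strip_emojis(text: str) -> str:
    """Remove characters that fall in the emoji ranges above."""
    return EMOJI_PATTERN.sub("", text)


def normalize(text: str) -> str:
    """Build a comparison key: unify Unicode forms, drop emojis,
    collapse whitespace, and ignore case."""
    text = unicodedata.normalize("NFKC", text)
    text = strip_emojis(text)
    return " ".join(text.split()).casefold()


def drop_duplicates(docs):
    """Keep the first occurrence of each document, comparing normalized keys.
    Documents that are empty after normalization are dropped."""
    seen = set()
    unique = []
    for doc in docs:
        key = normalize(doc)
        if key and key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


if __name__ == "__main__":
    feed = ["Great product! 🎉", "great  product!", "Shipping was slow."]
    print(drop_duplicates(feed))
```

Exact-match deduplication like this only catches copies that differ in case, spacing, or emojis. Near-duplicates with small wording changes would slip through, which is one reason simple rules tend to fall short on real feeds.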

Unless text is written by people with degrees in Literature, it likely contains noise. At Halosight, we’ve discovered that noise can make up more than 80% of a data feed. This leads to significant signal loss that hides insights.

For example, an email may contain a header block, a salutation, the body, and a legal disclaimer nobody really wants to include or read. It may even carry duplicate copies of earlier replies in the thread. The junk obscures the meaning. Wouldn’t it be nice if you could teach a computer to recognize junk?
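
As a rough illustration of the email case, the Python sketch below uses a handful of regular expressions to skip header lines, a leading salutation, quoted replies, and a trailing disclaimer. Every pattern and name here (`extract_body`, `HEADER_LINE`, and so on) is a hypothetical choice for this example. It is exactly the kind of simplistic rule set that breaks down on real-world variation, so treat it as a picture of the task rather than a recommended solution.

```python
import re

# Hypothetical patterns for illustration; real emails vary far more than this.
HEADER_LINE = re.compile(r"^(From|To|Cc|Bcc|Subject|Date|Sent):", re.IGNORECASE)
SALUTATION = re.compile(r"^(hi|hello|dear|hey)\b.{0,40}$", re.IGNORECASE)
REPLY_MARKER = re.compile(
    r"^(On .+ wrote:|-+\s*Original Message\s*-+)$", re.IGNORECASE
)
DISCLAIMER_MARKER = re.compile(
    r"^(confidentiality notice|disclaimer|legal notice)\b", re.IGNORECASE
)


def extract_body(raw_email: str) -> str:
    """Return the lines that look like the author's own message."""
    kept = []
    for line in raw_email.splitlines():
        stripped = line.strip()
        # Everything after a reply marker or disclaimer is treated as noise.
        if REPLY_MARKER.match(stripped) or DISCLAIMER_MARKER.match(stripped):
            break
        # Quoted lines from earlier messages in the thread.
        if stripped.startswith(">"):
            continue
        # Header block lines such as "From:" or "Subject:".
        if HEADER_LINE.match(stripped):
            continue
        # Blank lines and a greeting before the body has started.
        if not kept and (not stripped or SALUTATION.match(stripped)):
            continue
        kept.append(stripped)
    return "\n".join(kept).strip()
```

A body sentence that happens to start with "Date:" or a disclaimer phrased in an unexpected way would defeat these rules, which hints at why hand-written patterns alone rarely hold up.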

Teach Computers to Filter Junk

Luckily, computers can be taught to sift the rock away from the gold. Modern approaches based on AI can attack the junk using several strategies.

Noise-Canceling AI

Most communication formats are simple but lack standards, so variation is high. Textual clutter is similar to the background sound filtered out by noise-canceling headphones. Smart filtering removes the clutter and leaves the music intact.

Simplistic approaches like keywords and search expressions yield mediocre results. Layering AI on top of them adds flexibility and power to the filtering.

What To Look For in a Solution

The right solution will free you to focus on your business problem instead of data hygiene. Ask the following questions when evaluating NLP data cleansing:

  • Content Types - Can it recognize different types of text and switch strategies accordingly?
  • Threads - Can it handle threaded conversations?
  • Speakers/Authors - Can it handle documents that contain multiple speakers or authors?
  • Scale - Can the pipeline scale to the volume and variety of your data?
  • Languages - How does it handle different languages, including languages embedded within a document?
  • Side Effects - Will the cleaned output be compatible with your UI?
  • Purpose Built - Is it flexible and targeted specifically at cleansing scenarios?
  • Configurable - Can the workflow be tailored to your use case?

Data Cleaning: Innovation At Work

Textual data cleaning doesn’t get much attention, but it’s critical to getting the most out of any text-based data mining project. Prioritize cleansing or tap a vendor who can help. It’s the single best thing you can do to find the surprising insights you seek.
