A Look at Text Analytics
Mining for Gold
The first step in text analytics is text mining: the process of identifying and collecting high-quality information from unstructured text. Text mining itself begins with information retrieval, that is, building a database of text to analyze. This database can hold virtually any type of text, from a mass of Twitter posts to a collection of scientific papers, depending on the focus of the organization conducting the analysis.
In the case of my company, Lexalytics, once a database of text has been established, we engage a number of sophisticated text analytics systems that aim to answer three broad questions:
- Who is talking and who/what are they discussing?
- What are they saying?
- How do they feel?
These three questions map roughly onto our core features:
- Named entity extraction
- Themes
- Categories
- Intentions
- Sentiment analysis
- Summarization
Named Entity Extraction
Recognizing named entities means identifying the named things mentioned in a text: most often people, places, organizations, products, and brands, but Named Entity Extraction can be configured to recognize whatever your organization requires. Names of traded stocks, specific abbreviations, even specific strains of a disease can be identified and tagged as entities. In addition to specific named entities, Lexalytics identifies pattern-based entities such as street addresses, phone numbers, and email addresses. Once the entities have been extracted, you can get the sentiment, themes, and summary associated with each one.
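To make the idea of pattern-based entities concrete, here is a minimal sketch using regular expressions. It is not how Lexalytics implements extraction; the two patterns and the function name `extract_pattern_entities` are assumptions chosen for illustration, and real-world phone and email formats are far more varied.

```python
import re

# Illustrative patterns only (assumed for this sketch); production systems
# handle many more formats and locales.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
}


def extract_pattern_entities(text):
    """Return (entity_type, matched_text) pairs in order of appearance."""
    found = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), entity_type, match.group()))
    # Sort by position in the text, then drop the position.
    return [(entity_type, value) for _, entity_type, value in sorted(found)]


print(extract_pattern_entities(
    "Call 555-123-4567 or email jane.doe@example.com today."
))
```

Named entities such as people or brands generally cannot be captured by fixed patterns like these, which is why they need configurable lists or trained models.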
Themes
Contextual clues can be vital when dealing with words that have multiple meanings: the word crane, for instance, could refer to a machine used to lift heavy objects, a type of bird, or even a movement of someone’s neck. Lexalytics determines the context of entities through themes and facets, identifying the topics of discussion. Our context determination involves highly complex text mining techniques that will show you what consumers are saying and why they feel the way they do.
Themes are lexically important noun phrases. Think of them as the “buzz” from the document. They are extracted completely automatically, and they work especially well when rolled up across many documents, so you can get a feel for what, exactly, people are saying. We can also tell you the themes that are lexically associated with an entity, not just the themes that are important inside a document.
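Rolling themes up across documents can be sketched in a few lines. The example assumes the themes for each document have already been extracted (the noun-phrase extraction itself is not shown), and the function name `roll_up_themes` is hypothetical.

```python
from collections import Counter


def roll_up_themes(themes_per_document):
    """Count how many documents mention each theme, ignoring case."""
    counts = Counter()
    for themes in themes_per_document:
        # A set ensures a theme is counted once per document.
        counts.update({theme.lower() for theme in themes})
    return counts


# Assumed sample input: one list of extracted themes per document.
documents = [
    ["battery life", "Screen"],
    ["Battery Life"],
    ["screen", "battery life", "battery life"],
]
print(roll_up_themes(documents).most_common(2))
```

Counting documents rather than raw mentions keeps one long, repetitive document from dominating the roll-up.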
Categories
Categories are the other side of the "determining context" coin from Themes. Themes are extracted completely automatically, whereas categories need to be configured ahead of time. This is useful for sorting content into buckets that are relevant to a business. A retail establishment, for example, might be interested in categories like staff, location, parking, stock availability, lighting, and pricing. We also offer automatic categories: very high-level buckets that give you a preliminary view of the content. With several different ways of categorizing content, from search queries to machine learning classifiers to Wikipedia-based categories, we provide the tools necessary to segment content in the way that is most relevant to any business.
Categories are pre-configured classification buckets that let you answer “what is this content about?” or “what concepts does this content mention?”
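The simplest of these approaches, query-based categories, can be sketched as a set of keyword lists configured ahead of time. The category names come from the retail example above; the keyword lists and the function name `categorize` are assumptions for illustration, not Lexalytics configuration syntax.

```python
import re

# Assumed keyword "queries" for a retail business, configured in advance.
CATEGORY_QUERIES = {
    "staff": {"staff", "cashier", "employee", "manager"},
    "parking": {"parking", "garage"},
    "pricing": {"price", "prices", "expensive", "cheap"},
}


def categorize(text, queries=CATEGORY_QUERIES):
    """Return the sorted names of all categories whose keywords appear."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return sorted(name for name, terms in queries.items() if words & terms)


print(categorize("The cashier was friendly but parking was a nightmare."))
```

A document can land in several buckets at once, or in none, which is usually the desired behavior for this kind of sorting.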
Intentions
Intentions are "predictions of future behavior." A very simple example is "Hey, I dropped my camera, guess I need to buy a new one." That's a buy intent. We have four intent types out of the box: Buy, Sell, Recommend, and Quit. Using intentions lets you find new customers as well as prevent customer churn. Unlike any other text analytics system that provides intention extraction, we don't just tell you that there is an intention; we tell you who the "intender" is, what the object of their intention is, and what the intention itself is. This lets customers act on the information immediately, jumping on opportunities to build business and responding to problems without delay.
In Salience Server, you can create your own intention types as well. Say you want to configure a "desire" or a "vote" intention: you have complete control over the intentions. Intentions are an important part of our Industry packs, because the language for an intention varies widely from industry to industry. The word "return" is a "Buy" or "Recommend" intention in the hospitality space, but a "Quit" intention in the consumer packaged goods space.
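The three parts of an intention, and the way one cue word can mean different things in different industries, can be sketched as a small data structure. This is an illustration only: the `Intention` class, the `CUE_INTENTS` table, and `intent_type_for` are hypothetical names, not the Salience Server API, and the sketch arbitrarily picks "Buy" for the hospitality reading of "return".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Intention:
    intender: str      # who holds the intention
    intent_type: str   # e.g. "Buy", "Sell", "Recommend", "Quit"
    target: str        # the object of the intention


# Assumed per-industry cue table, mirroring the "return" example above.
CUE_INTENTS = {
    "hospitality": {"return": "Buy"},
    "consumer packaged goods": {"return": "Quit"},
}


def intent_type_for(cue: str, industry: str) -> Optional[str]:
    """Look up the intent type a cue word signals in a given industry."""
    return CUE_INTENTS.get(industry, {}).get(cue.lower())


camera = Intention(intender="I", intent_type="Buy", target="camera")
print(camera)
print(intent_type_for("return", "hospitality"))
print(intent_type_for("return", "consumer packaged goods"))
```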
Sentiment Analysis
Our Sentiment Analysis feature answers the third question: how do consumers feel about the subject they are discussing? Our sentiment analysis is the most powerful, accurate, and reliable in the business: beyond telling you whether a given document is positive, negative, or neutral, we assign a specific score to show just how strong that sentiment is. What’s more, we attach sentiment scores to entities, themes, and facets, in addition to providing a general document sentiment score. This multi-level analysis can be configured and optimized to match your individual needs.
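The difference between a polarity label and a strength score can be illustrated with a toy lexicon-based scorer. This is a deliberately simple stand-in, not the Lexalytics method: the word weights in `LEXICON`, the averaging rule, and the 0.1 threshold are all assumptions.

```python
import re

# Assumed toy lexicon: word -> sentiment weight in [-1, 1].
LEXICON = {
    "love": 0.8, "great": 0.6, "good": 0.4,
    "slow": -0.3, "bad": -0.5, "terrible": -0.9,
}


def score_sentiment(text):
    """Average the weights of the lexicon words found; 0.0 if none."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = [LEXICON[word] for word in words if word in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0


def label(score, threshold=0.1):
    """Map a numeric score to positive, negative, or neutral."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


for sentence in ["The room was great", "Terrible service", "The room had a window"]:
    score = score_sentiment(sentence)
    print(sentence, score, label(score))
```

Scoring an entity or theme rather than the whole document would follow the same pattern, restricted to the text surrounding that entity or theme.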
Summarization
Summarization is meant to help humans get a quick grasp of a long document. “Long” could be a 200-page analyst report you’re reading on your laptop, or a missive from your boss that you’re trying to scan along with another 20 emails on your phone. Lexalytics has highly tunable summarization technology to give the right results for your application. The summaries we provide are based on the words actually in the document: we give you the most important sentences. One of the most interesting features is the ability to give Entity Summaries, which contain only the material relevant to one entity. These are very useful if you’re trying to crank through a few hundred large research reports to understand just what they’re saying about the one company (of dozens) that you need to learn about.
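Summaries built from the document's own sentences are usually called extractive summaries. The sketch below ranks sentences by average word frequency and optionally restricts them to those mentioning an entity. The scoring rule, the stopword list, and the function name `summarize` are assumptions for illustration rather than a description of the Lexalytics algorithm.

```python
import re
from collections import Counter

# Assumed minimal stopword list for the sketch.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}


def tokenize(sentence):
    """Lowercase words in a sentence, minus stopwords."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [word for word in words if word not in STOPWORDS]


def summarize(text, max_sentences=1, entity=None):
    """Return the top-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies come from the whole document.
    freq = Counter(word for s in sentences for word in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    candidates = [
        i for i, s in enumerate(sentences)
        if entity is None or entity.lower() in s.lower()
    ]
    best = sorted(candidates, key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(best[:max_sentences])]


report = (
    "Acme shipped a new phone. "
    "The phone has a great camera and the camera beats every rival phone. "
    "Globex cut prices."
)
print(summarize(report))
print(summarize(report, entity="Acme"))
```

Passing `entity` mimics an entity summary: only sentences that mention the entity are eligible, while importance is still judged against the whole document.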
The Upshot
Text analytics is no mean feat. At Lexalytics we've spent over a decade refining our systems so that you, the busy professional, can sit back and let our products save you time, money, and headaches. That matters because we're quickly moving into a world where, if you can't hear what your audience is saying, you cannot adapt. If you can't adapt, your company will die. Text analytics is no longer a luxury; it's a necessity. Major companies around the world now deploy text analytics in some capacity, allowing them to capitalize on trends, avoid pitfalls, and better serve their increasingly global customer base. Perhaps it's time for you to explore the text mining tools your business needs to unlock the insights hidden in unstructured text.