A text analytics view of extracting actionable business intelligence
Text analytics is the process of analyzing a large text corpus to discover information of strategic value to an organization. Sources of text include customer feedback, blogs, reviews, and interactions on social networks; some of these are openly available, while others are company proprietary. For example, text analytics can discover people’s opinions about a company’s new product across blog sites, meaningfully segment documents, articles, notes, and blogs to extract topics, or analyze customer sentiment from text surveys. Text analytics applications include sentiment analysis, business and military intelligence, e-service, scientific discovery, and search and information access.
[Disclaimer: The content is heavily borrowed from my book Computational Business Analytics, published by Chapman & Hall/CRC Press last year, and also commercial materials around Machine Analytics’ text analytics tool aText]
Discovering actionable business intelligence is much more than just looking for some pre-defined set of keywords in a corpus. The two most fundamental capabilities that provide foundations for text analytics are: 1) information structuring and extraction; and 2) text classification and topic extraction. Information structuring is hard. Why? Let us take a wider view along the information continuum. First of all, information structuring implies the presence of unstructured data. The following definition of unstructured data is perhaps the most succinct among those found on the web:
“Unstructured Data (or unstructured information) refers to information that either does not have a pre-defined data model or is not organized in a pre-defined manner.” – Wikipedia
There are several points to be noted here. First, the definition above is itself full of “information” that you absorb while reading it, yet it is certainly not “usable” by a computer program. Second, the concept of data needs to be distinguished from that of information: data is not usable by humans without a proper context (or meta-information). Information is the semantic interpretation of data; it represents relationships among data with meaning and purpose. Such relationships can be captured well in unstructured natural language or in figures.
Structured data has become synonymous with relational data. Structured relational data are organized and searchable by data type within the actual content and can be queried with SQL, while highly unstructured data is commonly associated with file servers, bitmap images/objects, and document management systems. Data “in between,” which includes XML data, HTML pages, PDF documents, emails, HTTP traffic and clickstream data, search results, and application log files, is in a state of transition to a structured form. By some recent estimates, unstructured data accounts for approximately 85% of enterprise data.
The structure of some data may not be defined formally, but can still be implied by exploiting the linguistic, auditory, and visual structures present in the data. Moreover, data with some form of structure may still be characterized as unstructured if that structure is not helpful for the desired processing task. One should also be aware of the data-information-knowledge continuum/hierarchy; the concept of unstructuredness is applicable at every level of this hierarchy. So what is structuring, and what do structures look like? A concrete example of structuring is shown in the figure below as a data structuring continuum.
A textual description of the picture is “Homer is sitting in a chair drinking beer.” A human observer may discover more objects in the picture than just Homer and a beer bottle, and may infer much more information from the “context” of the picture, including the possibility that Homer is depressed. Structuring involves representing this information using a suitable syntax. The example here uses both RDF triples and relational tables. Note that an added advantage of such a declarative syntax is that a human, in addition to a machine, can read, add, and update it if necessary.
Once we have a structured representation, a machine can interpret and reason with it based on its semantic interpretation and positional knowledge of attributes.
For example, the name of a person would appear in the first position of an RDF-representation, and in the column headed by “Person” of a relational representation. This type of position-based convention is not feasible for unstructured texts, since the same picture can be described in multiple ways due to the free-form nature of natural languages.
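The position-based convention described above can be made concrete with a minimal sketch in Python. The triples and predicate names below are illustrative inventions for the Homer scene, not drawn from any particular ontology:

```python
# Minimal sketch: the scene captured as subject-predicate-object triples.
# The predicate names ("isA", "sittingIn", "drinking") are illustrative.
triples = [
    ("Homer", "isA", "Person"),
    ("Homer", "sittingIn", "chair"),
    ("Homer", "drinking", "beer"),
]

# Position-based convention: the subject always occupies the first slot,
# so a machine can answer "what is Homer doing?" without parsing prose.
def predicates_for(subject, store):
    return [(p, o) for s, p, o in store if s == subject]

print(predicates_for("Homer", triples))
# [('isA', 'Person'), ('sittingIn', 'chair'), ('drinking', 'beer')]
```

Because the subject, predicate, and object each occupy a fixed position, a query is a simple positional match; the same question asked of free-form prose would require full linguistic analysis.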
Deep Natural Language Processing (NLP) techniques, in conjunction with some Artificial Intelligence (AI) heuristics, are needed to structure information “semantically” in the form of subject-predicate-object triples as illustrated above. These triples, combined with a domain ontology specific to a vertical application, yield structured tuples such as those in relational databases. Most text classification techniques, on the other hand, are syntactic in the sense that they rely primarily on word counts, associations, and co-occurrences, and do not require any NLP. aText is Machine Analytics’ patent-pending text analytics tool for automatically extracting information from text corpora and categorizing text documents.
Text processing includes tokenization, stemming, tagging, named-entity recognition, co-reference resolution, and relation extraction, with results represented as Resource Description Framework (RDF) triples via deep linguistic processing. Categorization techniques include the following mixture of supervised and unsupervised machine learning techniques: Naïve Bayesian Classifier (NBC), k-dependence NBC (kNBC), SVM Classifier on Fisher Kernel (FK), Latent Semantic Analysis (LSA), Probabilistic LSA (PLSA), and Latent Dirichlet Allocation (LDA).
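To make the syntactic side of this concrete, here is a generic sketch using scikit-learn: a Naïve Bayes classifier for supervised categorization and LDA for unsupervised topic extraction, both driven purely by word counts. The toy documents and labels are invented for illustration; this is the same class of techniques named above, not the aText implementation:

```python
# Syntactic text classification and topic extraction with scikit-learn.
# The documents and labels are toy examples for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.decomposition import LatentDirichletAllocation

docs = ["great product love it", "terrible service never again",
        "love the new release", "awful product waste of money"]
labels = ["pos", "neg", "pos", "neg"]

# Bag-of-words counts: pure word counting and co-occurrence, no deep NLP.
vec = CountVectorizer()
X = vec.fit_transform(docs)

# Supervised: a Naive Bayes classifier over the word counts.
nbc = MultinomialNB().fit(X, labels)
print(nbc.predict(vec.transform(["love this product"])))

# Unsupervised: LDA discovers latent topics as distributions over words.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.components_.shape)  # (number of topics, vocabulary size)
```

Note that neither model sees word order or grammar; this is exactly the sense in which such classifiers are “syntactic,” in contrast to the triple-producing deep NLP pipeline.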
aText builds on these fundamental capabilities for its powerful built-in sentiment and social network analyses, topic extraction, document summarization, and semantic search. We will cover these applications in a follow-on post. aText can also build a corpus by automatically extracting textual content from various web and social media sites (e.g., Twitter, Facebook). The tool is available in trial, academic, full, and developer API versions. Send an email to [email protected] for more information.