A text analytics view of extracting actionable business intelligence
Text analytics is the process of analyzing a large text corpus to discover information of strategic value to an organization. Sources of text include customer feedback, blogs, reviews, and interactions on social networks; some of these are openly available, while others are company proprietary. For example, text analytics can discover people’s opinions about a company’s new product across blog sites, meaningfully segment documents, articles, notes, and blogs to extract topics, or analyze customer sentiment from text surveys. Text analytics applications include sentiment analysis, business and military intelligence, e-service, scientific discovery, and search and information access.
[Disclaimer: The content is heavily borrowed from my book Computational Business Analytics, published by Chapman & Hall/CRC Press last year, and also commercial materials around Machine Analytics’ text analytics tool aText]
Discovering actionable business intelligence is much more than just looking for some pre-defined set of keywords in a corpus. The two most fundamental capabilities that provide foundations for text analytics are: 1) information structuring and extraction; and 2) text classification and topic extraction. Information structuring is hard. Why? Let us take a wider view along the information continuum. First of all, information structuring implies the presence of unstructured data. The following definition of unstructured data is perhaps the most succinct among those found on the web:
“Unstructured Data (or unstructured information) refers to information that either does not have a pre-defined data model or is not organized in a pre-defined manner.” – Wikipedia
There are several points to be noted here. First, the definition above is itself full of “information” that you absorb while reading it, yet it is certainly not “usable” by a computer program. Second, the concept of data needs to be distinguished from that of information: data is not usable by humans without a proper context (or meta-information). Information is the semantic interpretation of data; it represents relationships among data with meaning and purpose. Such relationships can be captured well in unstructured natural language or in figures.
Structured data has become synonymous with relational data. Structured relational data are organized and searchable by data type within the actual content and can be queried with SQL, while highly unstructured data is commonly associated with file servers, bitmap images/objects, and document management systems. Data “in between,” which includes XML data, HTML pages, PDF documents, emails, HTTP traffic and clickstream data, search results, and application log files, is in a state of transition to a structured form. By some recent estimates, unstructured data accounts for approximately 85% of enterprise data.
The structure of some data may not be defined formally, but can still be implied by exploiting the linguistic, auditory, and visual structures present in the data. Moreover, data with some form of structure may still be characterized as unstructured if that structure is not helpful for the desired processing task. One should also be aware of the data-information-knowledge continuum/hierarchy; the concept of unstructuredness is applicable at every level of this hierarchy. So what is structuring, and what do structures look like? A concrete example of structuring is shown in the figure below as a data structuring continuum.
A textual description of the picture is “Homer is sitting in a chair drinking beer.” A human observer may discover more objects in the picture than just Homer and a beer bottle, and may infer much more information from the “context” of the picture, including the possibility that Homer is depressed. Structuring involves representing this information using a suitable syntax. The example here uses both RDF triples and relational tables. Note that an added advantage of such a declarative syntax is that a human, in addition to a machine, can read, add, and update it if necessary.
Once we have a structured representation, a machine can interpret and reason with it based on its semantic interpretation and positional knowledge of attributes.
For example, the name of a person would appear in the first position of an RDF-representation, and in the column headed by “Person” of a relational representation. This type of position-based convention is not feasible for unstructured texts, since the same picture can be described in multiple ways due to the free-form nature of natural languages.
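The position-based convention described above can be made concrete with a minimal sketch in Python. The triples and predicate names below are illustrative inventions for the Homer scene, not drawn from any particular ontology:

```python
# Minimal sketch: the scene captured as subject-predicate-object triples.
# The predicate names ("isA", "sittingIn", "drinking") are illustrative.
triples = [
    ("Homer", "isA", "Person"),
    ("Homer", "sittingIn", "chair"),
    ("Homer", "drinking", "beer"),
]

# Position-based convention: the subject always occupies the first slot,
# so a machine can answer "what is Homer doing?" without parsing prose.
def predicates_for(subject, store):
    return [(p, o) for s, p, o in store if s == subject]

print(predicates_for("Homer", triples))
# [('isA', 'Person'), ('sittingIn', 'chair'), ('drinking', 'beer')]
```

Because the subject, predicate, and object each occupy a fixed position, a query is a simple positional match; the same question asked of free-form prose would require full linguistic analysis.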
Deep Natural Language Processing (NLP) techniques, in conjunction with some Artificial Intelligence (AI) heuristics, are needed to structure information “semantically” in the form of subject-predicate-object triples as illustrated above. These triples, combined with a domain ontology specific to a vertical application, yield structured tuples such as those in relational databases. Most text classification techniques, on the other hand, are syntactic in the sense that they rely primarily on word counts, associations, and co-occurrences, and do not require any NLP. aText is Machine Analytics’ patent-pending text analytics tool for automatically extracting information from text corpora and categorizing text documents.
Text processing includes tokenization, stemming, tagging, named-entity recognition, co-reference resolution, and relation extraction, with results represented as Resource Description Framework (RDF) triples via deep linguistic processing. Categorization techniques include the following mixture of supervised and unsupervised machine learning techniques: Naïve Bayesian Classifier (NBC), k-dependence NBC (kNBC), SVM Classifier on Fisher Kernel (FK), Latent Semantic Analysis (LSA), Probabilistic LSA (PLSA), and Latent Dirichlet Allocation (LDA).
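To make the syntactic side of this concrete, here is a generic sketch using scikit-learn: a Naïve Bayes classifier for supervised categorization and LDA for unsupervised topic extraction, both driven purely by word counts. The toy documents and labels are invented for illustration; this is the same class of techniques named above, not the aText implementation:

```python
# Syntactic text classification and topic extraction with scikit-learn.
# The documents and labels are toy examples for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.decomposition import LatentDirichletAllocation

docs = ["great product love it", "terrible service never again",
        "love the new release", "awful product waste of money"]
labels = ["pos", "neg", "pos", "neg"]

# Bag-of-words counts: pure word counting and co-occurrence, no deep NLP.
vec = CountVectorizer()
X = vec.fit_transform(docs)

# Supervised: a Naive Bayes classifier over the word counts.
nbc = MultinomialNB().fit(X, labels)
print(nbc.predict(vec.transform(["love this product"])))

# Unsupervised: LDA discovers latent topics as distributions over words.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.components_.shape)  # (number of topics, vocabulary size)
```

Note that neither model sees word order or grammar; this is exactly the sense in which such classifiers are “syntactic,” in contrast to the triple-producing deep NLP pipeline.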
aText builds on these fundamental capabilities for its powerful built-in sentiment and social network analyses, topic extraction, document summarization, and semantic search. We will cover these applications in a follow-on post. aText can also build a corpus by automatically extracting textual content from various web and social media sites (e.g., Twitter, Facebook). The tool is available in trial, academic, full, and developer API versions. Send an email to [email protected] for more information.