How do we use structured queries to tackle unstructured (big) data?
I recently attended a talk by Linguamatics CTO David Milward on Structured Queries for Unstructured Data, delivered to the Data Insights Cambridge Meetup group.
The data science community wants to know:
- How can we deliver insights from big data?
- What are the optimal approaches to ‘handle’ (store, capture) and analyze (query, structure, repurpose) big data?
The amount of data we can generate and store is many times what we could capture or store just 10 years ago. SQL database technology handles structured data well and has not changed significantly since the 1980s. For basic queries, it is easier to deliver insights from structured data than from unstructured data in free-text sources.
Unstructured data is the new frontier for data science
What drew so many people to David’s talk was the promise of the ‘data insights’ locked away in unstructured data. The audience spanned various industries, from astronomy and finance to health and life sciences. Many industries rely heavily on data to inform their day-to-day business decisions. For healthcare and life science, where Linguamatics is the text mining leader, transforming how we understand and improve population health and patient outcomes will primarily entail extracting insights from unstructured data sources.
Effectively mining unstructured data requires Natural Language Processing (NLP) technology
Unstructured data is challenging to analyze for insights that are critical to business and health outcomes. It contains syntactic constructions and patterns that do not appear in structured data, which makes it difficult to identify entities and relations within the text and to identify relationships across different documents.
David illustrated how the upcoming version of Linguamatics’ NLP-driven text mining tool, I2E, addresses these challenges by normalizing data values. I2E maps equivalent concepts to each other no matter how they are expressed (e.g. “non-smoker” is treated the same as “does not smoke”).
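To make the idea of concept normalization concrete, here is a minimal sketch in Python. It is not how I2E works internally (the talk did not go into that level of detail); the pattern list, the concept labels and the function name `normalize_smoking_status` are all assumptions made up for illustration. It only shows the general principle of mapping different surface expressions to one canonical value.

```python
import re

# Hypothetical pattern table: each entry maps a family of surface
# expressions to one canonical concept label. Negated forms come first
# so that "non-smoker" is never mistaken for a positive mention.
SMOKING_PATTERNS = [
    (re.compile(r"\b(non-?smoker|does not smoke|never smoked)\b", re.I),
     "smoking_status=non-smoker"),
    (re.compile(r"\b(current smoker|smokes)\b", re.I),
     "smoking_status=smoker"),
]


def normalize_smoking_status(text):
    """Return a canonical concept label for the text, or None if no pattern matches."""
    for pattern, concept in SMOKING_PATTERNS:
        if pattern.search(text):
            return concept
    return None


# Two differently worded sentences map to the same normalized value.
print(normalize_smoking_status("Patient is a non-smoker."))
print(normalize_smoking_status("She does not smoke."))
```

A real NLP system would rely on linguistic analysis and curated terminologies rather than a handful of regular expressions, but the output is the same in spirit: one normalized value that a structured query can filter on.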
If you query a large amount of unstructured data with a relatively straightforward question like “Which, What, Who?”, I2E can take you directly to the answers that matter. If you ask a broader question like “Tell me everything about X,” I2E search will provide only the most relevant documents mentioning X by clustering facts extracted from all documents. This text mining approach allows the user to search for key information (e.g. a particular date, mutation or measurement) in unstructured source data regardless of how the information is expressed or formatted in the text.
Linguamatics’ upcoming release introduces Normalized Values and Advanced Range Search, which enable powerful range searches over unstructured data. For example, a range search of "between 0.5kg and 2kg" will find weights expressed in the source text in different units, e.g. 1.5lb, 600g and 1.5kg.
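The weight example can be sketched in a few lines of Python. Again, this is an illustration under stated assumptions and not the I2E implementation: the regular expression, the `TO_KG` table and the function `find_weights_in_range` are hypothetical names, and only three units are handled. The point is that each value is converted to a common unit before the range comparison is applied.

```python
import re

# Conversion factors to kilograms (1 lb = 0.45359237 kg by definition).
TO_KG = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}

# A number followed by one of the supported units, e.g. "1.5lb" or "600 g".
WEIGHT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(kg|g|lb)\b", re.I)


def find_weights_in_range(text, low_kg, high_kg):
    """Return weight mentions whose normalized value lies in [low_kg, high_kg]."""
    hits = []
    for match in WEIGHT_RE.finditer(text):
        value = float(match.group(1))
        unit = match.group(2).lower()
        kilograms = value * TO_KG[unit]  # normalize to one unit
        if low_kg <= kilograms <= high_kg:
            hits.append(match.group(0))  # keep the original wording
    return hits


text = "Samples weighed 1.5lb, 600g, 1.5kg, 3kg and 100g."
print(find_weights_in_range(text, 0.5, 2.0))
# prints ['1.5lb', '600g', '1.5kg']
```

Returning the original wording alongside the normalized comparison mirrors the idea of structured results that still link back to the raw text.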
All of these insights coming from unstructured data sources are presented as structured results that draw your attention to the answer while linking back easily to the raw data.
David presented excellent examples of how structured querying can enable us to tap into the gold nuggets hidden within unstructured text. I look forward to seeing more examples of how NLP-based text mining is being applied at the upcoming Linguamatics Text Mining Summit, October 17-19 in Cape Cod.