Unlocking the Secrets of Unstructured Data
Today, information is power, and knowing how to harness it can have a significant impact on any business’s profit margins. Typically, when we think of data analytics, we imagine data in tables: digging into Excel files, databases and weblogs to create graphs and charts, often bringing to light patterns we might not have expected. However, not all information sits in neat tables. Think about the documents on your shared drives and in your document management systems: a good proportion (if not most) are what is known as unstructured data. These unstructured files can be anything: Word documents, PowerPoint presentations, PDFs. Using machine learning techniques, Aiimi can help unlock this unstructured data to visualise things you didn’t even know you had!
Imagine the following sentence in a Word document:
For a person, this is easy enough to read, but what if you have thousands of such documents? If you want to view statistics relating to people, organisations and settlements across your entire document set, machine learning can help you to analyse them. Using an entity extraction toolkit (such as Apache OpenNLP), this is how a computer might see the same sentence:
A machine can be trained to identify people, organisations and monetary values in a sentence. This technique can then be applied to every sentence in your document and every document in your file system, marking them up with the identified entities. Even more impressively, it is also possible to train an algorithm to notice that the word “settled” here implies that the £12,000 is a settlement fee in relation to Mr Wilson and Big Corp, so that it can be tagged as such for the purposes of your statistics! Such techniques could be used to inform further machine learning algorithms that predict and advise on settlement fees.
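To make the idea of marking up a sentence concrete, here is a minimal Python sketch. It does not use a trained model such as OpenNLP; a few hand-written regular expressions stand in for one, purely for illustration. The sentence is a hypothetical one assembled from the details mentioned above (it is not the original example), and the function names and labels are assumptions of this sketch.

```python
import re

# Hypothetical sentence built from the details in the text above;
# it is NOT the original example sentence.
SENTENCE = "Mr Wilson settled with Big Corp for £12,000."

# Toy hand-written patterns standing in for a trained entity extraction model.
PATTERNS = [
    ("PERSON", re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.? [A-Z][a-z]+")),
    ("ORGANISATION", re.compile(r"\b[A-Z][a-z]+ (?:Corp|Ltd|Inc|plc)\b")),
    ("MONEY", re.compile(r"£\d+(?:,\d{3})*(?:\.\d{2})?")),
]


def extract_entities(text):
    """Return (label, matched text, start offset) tuples in reading order."""
    found = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((label, match.group(), match.start()))
    return sorted(found, key=lambda item: item[2])


def mark_up(text):
    """Wrap each entity in [LABEL: ...], working right to left so that
    earlier offsets stay valid while the string grows."""
    for label, value, start in reversed(extract_entities(text)):
        text = text[:start] + f"[{label}: {value}]" + text[start + len(value):]
    return text


def tag_settlement(text):
    """If the word 'settled' appears, treat the first monetary value
    as a settlement fee linked to the first person and organisation."""
    if not re.search(r"\bsettled\b", text, re.IGNORECASE):
        return None
    first = {}
    for label, value, _ in extract_entities(text):
        first.setdefault(label, value)
    if "MONEY" not in first:
        return None
    return {
        "settlement_fee": first["MONEY"],
        "person": first.get("PERSON"),
        "organisation": first.get("ORGANISATION"),
    }


print(mark_up(SENTENCE))
# [PERSON: Mr Wilson] settled with [ORGANISATION: Big Corp] for [MONEY: £12,000].
print(tag_settlement(SENTENCE))
```

A real system would learn these patterns from annotated examples rather than rely on fixed rules, but the output has the same shape: spans of text labelled with an entity type, plus a relationship inferred from the surrounding words.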
As an example of entity extraction in action, I have taken a set of Amazon movie reviews (provided by SNAP). This dataset contains millions of reviews, each with a raw text element to demonstrate on. I ingested the dataset directly into the powerful Elasticsearch datastore. Then, using InsightMaker (an Aiimi data enrichment service), I applied entity extraction algorithms to identify people, locations and organisations. We can show this easily in a Kibana dashboard:
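For readers curious about what the ingestion step might involve, the sketch below parses reviews and prepares them for the Elasticsearch bulk API. It is not the method used for this article. It assumes the reviews arrive as "key: value" lines separated by blank lines, with keys such as review/score and review/time (taken here to be Unix seconds); the sample records, the index name movie-reviews and the output field names are all made up for illustration.

```python
import json

# Made-up sample in the assumed "key: value" layout, one blank line per review.
SAMPLE = """\
product/productId: B000000001
review/userId: A1EXAMPLE
review/score: 5.0
review/time: 1182729600
review/text: A classic western. John Wayne at his best.

product/productId: B000000002
review/userId: A2EXAMPLE
review/score: 2.0
review/time: 1230768000
review/text: Too long, and very Hollywood.
"""


def parse_reviews(raw):
    """Split the raw text into one dict per review."""
    reviews, current = [], {}
    for line in raw.splitlines():
        if not line.strip():
            # A blank line closes the current review, if there is one.
            if current:
                reviews.append(current)
                current = {}
            continue
        key, _, value = line.partition(": ")
        # Keep only the part after the slash, e.g. "review/score" -> "score".
        current[key.split("/")[-1]] = value
    if current:
        reviews.append(current)
    return reviews


def to_bulk_lines(reviews, index_name="movie-reviews"):
    """Build the action/document line pairs expected by the bulk API."""
    lines = []
    for review in reviews:
        doc = {
            "product_id": review.get("productId"),
            "user_id": review.get("userId"),
            "star_rating": float(review["score"]),
            "review_time": int(review["time"]),  # assumed to be Unix seconds
            "text": review.get("text", ""),
        }
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    return lines


# The bulk request body is newline-delimited JSON and must end with a newline.
body = "\n".join(to_bulk_lines(parse_reviews(SAMPLE))) + "\n"
print(body)
```

The resulting body could then be sent to the _bulk endpoint with any HTTP or Elasticsearch client; that step is left out here so the sketch runs without a cluster.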
Here, I have produced some simple visualisations to show reviews over time, their star ratings and the entities within them. The graph on the bottom left shows links between entities that appear in the same reviews, and from the bar charts we can see that the most-mentioned location is “Hollywood” and the most-mentioned person is “John Wayne”.
Combining the review metadata (such as date, star rating or reviewer) with these extracted entities in Elasticsearch opens the door to a host of interesting explorations: which actors tend to be rated most highly, and which review authors are more generous in their film ratings?
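As a rough illustration of this kind of exploration, the Python sketch below averages star ratings per extracted person and per reviewer. The records, reviewer names and field names are invented for the example and do not come from the dataset. In Elasticsearch, a comparable result would typically come from a terms aggregation with an avg sub-aggregation.

```python
from collections import defaultdict

# Invented records: review metadata combined with extracted person entities.
REVIEWS = [
    {"reviewer": "alice", "star_rating": 5, "people": ["John Wayne"]},
    {"reviewer": "bob", "star_rating": 3, "people": ["John Wayne", "Actor B"]},
    {"reviewer": "alice", "star_rating": 4, "people": ["Actor B"]},
]


def average_rating(reviews, keys_for):
    """Average star rating per group; keys_for maps a review to its group keys."""
    totals = defaultdict(lambda: [0, 0])  # key -> [sum of ratings, review count]
    for review in reviews:
        for key in keys_for(review):
            totals[key][0] += review["star_rating"]
            totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Which people appear in the most highly rated reviews?
by_person = average_rating(REVIEWS, lambda r: r["people"])
# Which reviewers are more generous?
by_reviewer = average_rating(REVIEWS, lambda r: [r["reviewer"]])

print(by_person)    # {'John Wayne': 4.0, 'Actor B': 3.5}
print(by_reviewer)  # {'alice': 4.5, 'bob': 3.0}
```

A review that mentions several people counts towards each of them, which is why the grouping function returns a list of keys rather than a single value.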
In a business context, even the simple metadata found on all documents, such as author and creation date, can enable powerful insights. Aiimi can help you identify the root causes for spikes of activity and the most effective combination of roles in a team, all based on the hidden patterns in your data.
If you have ever wondered what secrets are locked away in your unstructured data or would like to hear more about what Aiimi can offer, please don’t hesitate to get in touch.
As published on the Aiimi website