Unlocking the Secrets of Unstructured Data

Today, information is power, and knowing how to harness it can have a significant impact on any business’s profit margins. Typically, when we think of data analytics, we imagine data in tables: digging into Excel files, databases and weblogs to create graphs and charts, often bringing to light patterns you might not have expected. However, not all information sits in neat tables. Think about the documents on your shared drives and document management systems: a good proportion (if not most) are what is known as unstructured data. These unstructured files can be anything: Word documents, PowerPoint presentations, PDFs. Using machine learning techniques, Aiimi can help unlock this unstructured data to visualise things you didn’t even know you had!

Imagine the following sentence in a Word document:

For a person, this is easy enough to read, but what if you have thousands of such documents? If you want to view statistics relating to people, organisations and settlements across your entire document set, machine learning can help you analyse them. Using an entity extraction tool (such as OpenNLP), this is how a computer might see the same sentence:

A machine can be trained to identify people, organisations and monetary values in a sentence. This technique can then be applied to every sentence in your document and every document in your file system, marking them up with the identified entities. Even more impressively, it is also possible to train an algorithm to notice that the word “settled” here implies that the £12,000 is a settlement fee in relation to Mr Wilson and Big Corp, so it can be tagged as such for the purposes of your statistics! Such techniques could be used to inform further machine learning algorithms that predict and advise on settlement fees.
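As a rough illustration of the markup step, the sketch below tags people, organisations and monetary values with hand-written regular expressions. It is a rule-based stand-in, not a trained model such as OpenNLP, and the sample sentence, the `PATTERNS` table and both helper functions are hypothetical, assembled only from the details mentioned above.

```python
import re

# Hypothetical sentence built from details in the text; not the original example.
SENTENCE = "Mr Wilson settled with Big Corp for £12,000."

# Hand-written patterns standing in for a trained entity extraction model.
PATTERNS = {
    "PERSON": re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.? [A-Z][a-z]+"),
    "ORGANISATION": re.compile(r"\b[A-Z][a-z]+ (?:Corp|Ltd|Inc|plc)\b"),
    "MONEY": re.compile(r"£\d[\d,]*(?:\.\d{2})?"),
}


def tag_entities(text):
    """Return (label, matched text) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    return [(label, value) for _, label, value in sorted(found)]


def tag_settlement(text):
    """Relabel money values as settlement fees when the text mentions settling."""
    entities = tag_entities(text)
    if re.search(r"\bsettle(?:d|s|ment)?\b", text, re.IGNORECASE):
        return [
            ("SETTLEMENT_FEE" if label == "MONEY" else label, value)
            for label, value in entities
        ]
    return entities


if __name__ == "__main__":
    for label, value in tag_settlement(SENTENCE):
        print(f"{label}: {value}")
```

A trained model would learn these cues from annotated examples rather than from fixed patterns, which is what lets it cope with names and phrasings the rules above would miss.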

As an example of entity extraction in action, I have taken a set of Amazon movie reviews (provided by SNAP). This dataset contains millions of reviews, each with a raw text element to demonstrate on. I ingested the dataset directly into the powerful Elasticsearch datastore. Then, using InsightMaker (an Aiimi data enrichment service), I applied entity extraction algorithms to identify people, locations and organisations. We can show this easily in a Kibana dashboard:
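The ingestion step might look something like the sketch below. It assumes the SNAP file stores each review as `key: value` lines (such as `review/score` and `review/text`) separated by blank lines; the sample records, the `movie-reviews` index name and the helper functions are hypothetical. The dictionaries follow the action shape accepted by the `elasticsearch` Python client's `helpers.bulk`, but no cluster is contacted here, and the InsightMaker enrichment step is not shown.

```python
# Hypothetical sample in the assumed "key: value" layout, blank line between reviews.
SAMPLE = """product/productId: B000000001
review/profileName: Example Reviewer
review/score: 5.0
review/time: 1100000000
review/text: John Wayne at his best.

product/productId: B000000002
review/profileName: Another Reviewer
review/score: 3.0
review/time: 1200000000
review/text: Typical Hollywood fare.
"""


def parse_reviews(lines):
    """Yield one dict per review from 'key: value' lines split by blank lines."""
    record = {}
    for line in lines:
        line = line.strip()
        if not line:
            if record:
                yield record
                record = {}
            continue
        key, _, value = line.partition(": ")
        record[key] = value
    if record:
        yield record


def to_bulk_actions(reviews, index="movie-reviews"):
    """Turn parsed reviews into bulk-index actions (index name is an assumption)."""
    for review in reviews:
        yield {
            "_index": index,
            "_source": {
                "product_id": review.get("product/productId"),
                "reviewer": review.get("review/profileName"),
                "stars": float(review["review/score"]),
                "timestamp": int(review["review/time"]),  # Unix seconds
                "text": review.get("review/text", ""),
            },
        }


if __name__ == "__main__":
    for action in to_bulk_actions(parse_reviews(SAMPLE.splitlines())):
        print(action)
```

With a running cluster, the generated actions could be passed to `elasticsearch.helpers.bulk(client, actions)`; the connection details would depend on your own deployment.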

Here, I have produced some simple visualisations to show reviews over time, their star ratings and the entities within them. The graph on the bottom left links entities that appear in the same reviews, and from the bar charts we can see that the most mentioned location is “Hollywood” and the most mentioned person is “John Wayne”.
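The counts behind charts like these can be sketched in a few lines. The example below assumes each review has already been reduced to a set of extracted entities; the sample data and function names are hypothetical, and this is not how Kibana computes its visualisations internally.

```python
from collections import Counter
from itertools import combinations

# Hypothetical extracted entities, one set per review.
REVIEW_ENTITIES = [
    {"John Wayne", "Hollywood"},
    {"John Wayne", "Hollywood", "Example Studio"},
    {"Hollywood"},
]


def mention_counts(reviews):
    """Count the number of reviews in which each entity appears."""
    return Counter(entity for entities in reviews for entity in entities)


def cooccurrence_links(reviews):
    """Count how often each pair of entities appears in the same review."""
    links = Counter()
    for entities in reviews:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(entities), 2):
            links[pair] += 1
    return links


if __name__ == "__main__":
    print(mention_counts(REVIEW_ENTITIES).most_common(1))
    print(cooccurrence_links(REVIEW_ENTITIES).most_common(1))
```

Each pair count becomes the weight of an edge in the entity graph, while the single-entity counts feed the bar charts.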

Combining the review metadata (such as date, star rating or reviewer) with these extracted entities in Elasticsearch opens the door to a host of interesting explorations: which actors tend to be rated most highly, and which reviewers are more generous in their film ratings?
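To make the first of those questions concrete, here is a minimal sketch that averages star ratings per extracted person. The records and the `average_stars_by_person` helper are hypothetical; in Elasticsearch itself this kind of grouping would typically be expressed as a terms aggregation with an average sub-aggregation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical enriched reviews: star rating plus extracted people.
REVIEWS = [
    {"stars": 5, "people": ["John Wayne"]},
    {"stars": 4, "people": ["John Wayne", "Example Actor"]},
    {"stars": 2, "people": ["Example Actor"]},
]


def average_stars_by_person(reviews):
    """Return the mean star rating of the reviews mentioning each person."""
    ratings = defaultdict(list)
    for review in reviews:
        # Use a set so a person mentioned twice in one review counts once.
        for person in set(review["people"]):
            ratings[person].append(review["stars"])
    return {person: mean(stars) for person, stars in ratings.items()}


if __name__ == "__main__":
    ranked = sorted(
        average_stars_by_person(REVIEWS).items(),
        key=lambda item: item[1],
        reverse=True,
    )
    for person, stars in ranked:
        print(f"{person}: {stars:.1f}")
```

Grouping by reviewer instead of by person would answer the second question in the same way.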

In a business context, even the simple metadata found on all documents, such as author and creation date, can enable powerful insights. Aiimi can help you identify the root causes of spikes in activity and the most effective combination of roles in a team, all based on the hidden patterns in your data.

If you have ever wondered what secrets are locked away in your unstructured data or would like to hear more about what Aiimi can offer, please don’t hesitate to get in touch.  

As published on the Aiimi website

Paul Sliwinski

Director of Customer Success | Aiimi

7 years ago

Great read Jack, some powerful stuff here
