The cyclical nature of enrichment and analysis
Matthew Hardman
Hybrid Cloud | High Performance Applications | Data Ops | Strategy | Leadership
When most people think about data analytics, they picture a linear journey: gathering data from a variety of sources, cleansing it, blending it, visualising it, and finally drawing some sort of insight from it to present to the user.
The insight gained, while "insightful", is never really related back to the data. In a crude way, it's like making a sandwich: if you choose to put lettuce in the sandwich, you don't usually go back to the head of lettuce and leave a comment that this particular head of lettuce was used for a sandwich. It just doesn't seem necessary.
If we continue down this linear journey of understanding and developing insights, then unless they are published to a wider community, the insights stay with the person who uncovered them, and the acceleration of knowledge is hampered. However, there is a way we can start to address this, and metadata is key to making it happen.
If you have ever read any of my articles, you will know of my fondness for metadata, and how it is just as valuable and useful as the data itself. Put simply, metadata is data that describes data: the GPS coordinates of where a photo was taken, the date a file was created, the level of radiation used to produce a cranial scan. But metadata need not only describe the data it is attached to. It can be any data that enriches a particular piece of data, such as a record of how it has been used or additional "insights" about it.
A Practical Example
Let's imagine for a second the analysis flow for a medical research project. In a simple flow, you might do the following:
- Process a series of medical scans to identify a set of particular scans that contain the necessary information for your analysis.
- Run the actual analysis.
If you stop and think purely about the first step, this might involve identifying 100 particular scans from a repository of a million such scans, something that might take a massive amount of processing time. That might be a tax you are willing to pay and plan for; after all, you want the correct scans for your project. But what happens if you want to run through the same process again? Unless you copied all the identified images to your own repository, you would go through the same process again, consuming unnecessary time.
Maybe you could look at it in reverse: once you have identified all the images that are right for your study, you delete all the other images from the repository, so you focus only on what is important. OK, so we know that is not going to work. Just because you don't see value in those other scans doesn't mean that another researcher won't.
Enter Metadata
Metadata provides the value here. Once your scans have been identified, they could be tagged with information indicating their status as suitable data for your analysis. In fact, the tag need not indicate that a scan is good only for "your" analysis, but for all future projects based around similar analysis. Such a tag could save other researchers massive amounts of time finding the right sources of information for their projects as well.
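To make the tagging idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `Scan` record, its field names, the `tag_suitable_scans` helper, and the `suitable_for` tag key are hypothetical, and the suitability check is passed in as a function because the real check would be whatever costly image processing your project requires.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class Scan:
    """A scan record. The field names are hypothetical."""
    scan_id: str
    location: str
    metadata: Dict[str, str] = field(default_factory=dict)


def tag_suitable_scans(
    scans: Iterable[Scan],
    is_suitable: Callable[[Scan], bool],
    key: str,
    value: str,
) -> List[Scan]:
    """Run the (expensive) suitability check once and record the result
    as metadata on each matching scan. Returns the scans that were tagged."""
    tagged = []
    for scan in scans:
        if is_suitable(scan):
            scan.metadata[key] = value
            tagged.append(scan)
    return tagged


if __name__ == "__main__":
    repository = [
        Scan("scan-001", "/repo/scan-001"),
        Scan("scan-002", "/repo/scan-002"),
        Scan("scan-003", "/repo/scan-003"),
    ]
    # Placeholder rule standing in for real, costly image analysis.
    matches = tag_suitable_scans(
        repository,
        lambda s: s.scan_id.endswith(("1", "3")),
        key="suitable_for",
        value="cranial-study",
    )
    for scan in matches:
        print(scan.scan_id, scan.metadata)
```

The point of the sketch is that the result of the check lives with the scan rather than with the researcher, so anyone who later needs scans for a similar study can read the tag instead of repeating the work.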
This is where the cyclical nature of enrichment and data analysis is realised. Once we have identified some sort of insight from a type of data, we can tag that source information with the insight itself to help reduce the need for additional processing at a later date. An index that maintains both the locations of the data and the information about the data is key here.
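One way to picture such an index is sketched below in Python. It is an in-memory toy under stated assumptions, not a description of any particular product: the `MetadataIndex` class and its `register`, `enrich`, `describe`, and `find` methods are hypothetical names, and a real index would need persistence and access control.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class MetadataIndex:
    """Toy index that tracks where each item lives and what is known about it."""

    def __init__(self) -> None:
        self._locations: Dict[str, str] = {}             # item id -> location
        self._metadata: Dict[str, Dict[str, str]] = {}   # item id -> {key: value}
        self._by_tag: Dict[Tuple[str, str], Set[str]] = defaultdict(set)

    def register(self, item_id: str, location: str) -> None:
        """Record where an item is stored."""
        self._locations[item_id] = location
        self._metadata.setdefault(item_id, {})

    def enrich(self, item_id: str, key: str, value: str) -> None:
        """Attach an insight to an item, replacing any earlier value for the key."""
        if item_id not in self._locations:
            raise KeyError(f"unknown item: {item_id}")
        previous = self._metadata[item_id].get(key)
        if previous is not None:
            self._by_tag[(key, previous)].discard(item_id)
        self._metadata[item_id][key] = value
        self._by_tag[(key, value)].add(item_id)

    def describe(self, item_id: str) -> Dict[str, str]:
        """Return a copy of everything recorded about an item."""
        return dict(self._metadata[item_id])

    def find(self, key: str, value: str) -> List[str]:
        """Return the locations of all items carrying the given tag."""
        ids = self._by_tag.get((key, value), set())
        return sorted(self._locations[i] for i in ids)
```

A researcher who finishes an analysis calls `enrich` to write the insight back; the next researcher calls `find` and gets the locations directly, without touching the underlying data.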
This continual cycle of analysis and enrichment not only speeds up the identification of relevant information for the individual, but can help accelerate the time to insight for an entire body of researchers in an organisation or a community itself.
Coming back to our example, we can get to a point where we eliminate the first step, meaning that organisations can focus on the analysis and research that comes from having good data, rather than spending expensive cycles identifying the good data.
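The saving can be illustrated with a small Python sketch that counts how often the costly identification step actually runs. The `make_selector` helper, the dictionary-based repository, and the odd/even relevance rule are all hypothetical stand-ins chosen for the example.

```python
from typing import Callable, Dict, List, Tuple


def make_selector(
    repository: Dict[str, int],
    is_relevant: Callable[[int], bool],
) -> Tuple[Callable[[], List[str]], Dict[str, int]]:
    """Build a selection function that records each verdict as metadata,
    so the costly check runs at most once per item across all runs."""
    verdicts: Dict[str, bool] = {}   # item id -> recorded result of the check
    counters = {"expensive_checks": 0}

    def select() -> List[str]:
        selected = []
        for item_id, data in repository.items():
            if item_id not in verdicts:
                # First time this item is seen: pay the processing cost once.
                counters["expensive_checks"] += 1
                verdicts[item_id] = is_relevant(data)
            if verdicts[item_id]:
                selected.append(item_id)
        return selected

    return select, counters


if __name__ == "__main__":
    repo = {"scan-a": 1, "scan-b": 2, "scan-c": 3}
    select, counters = make_selector(repo, lambda value: value % 2 == 1)
    print(select(), counters)  # first run performs the checks
    print(select(), counters)  # second run reuses the recorded verdicts
```

On the second call the counter does not move: the first step has effectively been replaced by a metadata lookup.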
In Closing
As we continue to consume a greater variety of data from a more diverse set of data sources, metadata will be the key to ensuring that we identify the right data to drive the important outcomes.
Thanks for stopping by, and I hope you all have a great weekend!