Drowning In Data: Fundamentals First
Right now, this very second, there are billions of data transactions happening. Data flows freely in a maelstrom of microwaves and bits. For those of us who commute to work on the infamous London Underground it's self-evident: how many people have their attention absorbed by the little piece of plastic and aluminium in their hands?
Most of these transactions you are completely unaware of, as the phone in your pocket relays a wealth of data points to the applications that make it the cornerstone of modern reality. This could be considered both a blessing and a curse, but irrespective of your personal perspective on the subject, it simply IS.
We have absent-mindedly bought into the reality of a digital existence; the services provided are now a significant part of our day-to-day lives, and data is the fuel in the tank that keeps the whole engine running. For better or worse, this presents a unique problem for the data-driven professional - what on earth do we do with it all?
The big data revolution saw a fundamental shift in data acquisition strategies. Unstructured systems removed some of the headache inherent in storing data within tightly defined relational databases, but all that really did was kick the can down the road.
At some point the data must be analysed, understood, and refined information extracted. Big data solutions like the ubiquitous Hadoop (and its various cousins like Spark, Storm etc.) did make it possible to capture and process vast amounts of data without fully appreciating what it was; and more importantly without assessing whether there's value in capturing it at all.
We need to consider our old friend GDPR too, as it has a stake in this conversation. Capturing reams of information without a clearly defined purpose will, in many cases, be illegal come the end of May. Gone are the days of 'grab everything possible and work it out later'; you need a reason to capture every interaction with your customers or it will quickly be time for some fines.
So, considering the legal implications and having validated your data processes accordingly, what next? There may be a signal in there somewhere, but how do you detect it when there's so much noise, so many distractions? How do you tell what's real? Correlation, after all, does not always equate to causation; just because two data points seem to have a relationship does not mean they are truly linked.
To give an example, consider honey production at a beehive. It rises as the volume of flights from Heathrow airport to the Caribbean rises. So, there must be a direct relationship, right?
Well no, they both increase due to seasonality in the UK - flowering plants blossom at the same time Brits are looking to top up their tans - but a computer doesn't understand this. An algorithm doesn't have that intangible human facet known colloquially as common sense.
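To make the honey and flights example concrete, here is a minimal Python sketch. Every number in it is invented for illustration: `season`, `honey` and `flights` are hypothetical monthly series, not real measurements. It shows two series that correlate strongly only because both follow the same seasonal cycle, and how that relationship fades once the shared seasonal component is removed from each.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def residuals(ys, xs):
    """What is left of ys after a least-squares straight-line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Twenty years of made-up monthly data driven by one shared seasonal cycle.
season = [math.sin(2 * math.pi * (m % 12) / 12) for m in range(240)]
honey = [50 + 30 * s + rng.gauss(0, 5) for s in season]      # hypothetical kg per month
flights = [200 + 80 * s + rng.gauss(0, 15) for s in season]  # hypothetical flights per month

# Naive view: the two series look tightly linked.
raw = pearson(honey, flights)

# Control for the season: correlate only what the season does not explain.
adjusted = pearson(residuals(honey, season), residuals(flights, season))

print(f"raw correlation:               {raw:.2f}")
print(f"after removing the seasonality: {adjusted:.2f}")
```

The catch, of course, is that somebody had to know the season was the common driver in the first place; the code only confirms what common sense suggested.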
Throwing all your data in a bucket and running a boatload of neural networks and other wonderfully abstract processes to pick through it might sound fun, and indeed could be worthwhile if you are a gargantuan data processor like Google or Amazon - but for the rest of us mere mortals it's simply not cost-effective.
The harsh reality is that the returns do not justify the expense of running immense clusters churning through abstract data sets, only for them to detect a 'relationship' between the purchase of jelly babies and voting preferences in exit poll data from the Barnsley Central by-election in 2011.
Sure, there may be a pattern, but unless Bassett's sweets funded Labour's campaign for that seat then I think it's safe to say this is a coincidence. As I alluded to earlier, correlation does not equate to causation.
There is a bigger question too: why would it even make sense to search for patterns in such abstract sources? What's the business benefit in knowing that this relationship exists? How will it help?
Just because something is possible doesn't mean it's plausible, or indeed sensible.
This is where scientific rigour and a good dollop of rational judgement come to the rescue. Data Science is, not surprisingly, a science, and whilst there is an art to it there's a lot to be said for a methodical approach. Validating and classifying whether data is reliable, useful and relatable at the beginning of any data-centric endeavour can save you a massive amount of effort further down the line.
Just as vital is the need to structure the effort logically, with consideration for thorough analysis at each stage. I think this can be best summed up in four steps:
Hypothesise
What's the outcome? What are you trying to solve? What are you trying to prove?
Defining the challenge may seem like an obvious step but it's often overlooked, or not explored in sufficient detail. I have been guilty of this myself earlier in my career, rushing headlong into solving a problem that nobody cared about - quite simply because it looked like a cool problem to solve!
A few crestfallen experiences later, and an older (and allegedly wiser) version of myself always asks a simple question before anything else: so what?
Being honest about whether anyone will give a damn about your amazing idea to look for relationships between weather patterns in the Outer Hebrides and purchases of tinned spam across the region is vital. Cognitively we as humans have biases, and we always think our own ideas are amazing. Why wouldn't we? We thought of them!
Metacognition (thinking about how you think) is a difficult skill to master, and a great data scientist must be able to understand and account for biases created by their own and other people’s assumptions. Framing the problem logically helps no end in this process.
A data science initiative will live and die by its ability to prove demonstrable value, so any hypothesis must link to a value proposition of some kind or a defined business KPI. A couple of days of thorough initial research and analysis, backed up by evidence and subject matter expertise from partners in your own business or from customers, will go a long way towards squashing some of these biases. At the centre, though, we must be honest about whether our endeavours bring a return on investment (ROI) for our customers, be they internal or external.
Time is money, so before you claw through gigabytes of weather data and sales ledgers, first validate that this is a problem that even needs solving.
Who knows, you may even find a more interesting challenge along the way…
Acquire
What data do we need? Do we have it? If not, how do we get it?
So, you have a hypothesis that's been validated by business representatives, and you have the green light. Great! Now let's get some data.
As I mentioned earlier in this article, data is voluminous currently. We are practically drowning in it; but data is not information, and not all data can become information either. Also, we have a legal obligation as processors to only use what we need and nothing more - we can't hoard personal data for customers (be they past, present or future) on the off chance it might be useful in future.
Our data acquisition strategy must link with our hypothesis, but also must be mindful of data quality.
Anyone who's worked in a large company can testify to the fact that data quality is often the elephant in the room. It’s hardly surprising when you consider the average £100m plus revenue company can have upward of 20-30 separate primary and secondary systems, each with different purposes and owned by different departments.
Some of this problem is solved by data warehousing, pulling together all related data points into singular, business-process-oriented data marts, but even this incurs risk. How do you know that the data was correct at the point of entry? Has the integration process introduced a rounding error that exponentially affects the result set? Are you looking at the true master? Are there any systems in use that aren't being included in the data mart?
Acquisition seems simple on the face of it, but in reality you can spend a lot of time here validating the integrity of the sources and cleansing the data into a usable format.
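As a rough illustration of the rounding question, the Python sketch below compares a total taken from a source system with the total of the same rows after a load step that rounds each row to two decimal places. The data, the `integrate` function and the tolerance are all hypothetical, made up for this example; a real integration process will have its own rules.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical source rows: 1,000 line items priced at 0.125 each.
source_rows = [Decimal("0.125")] * 1000


def integrate(rows):
    """Hypothetical load step that rounds every row to 2 d.p. on the way in."""
    return [r.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP) for r in rows]


def reconcile(source_total, mart_total, tolerance):
    """True when the two totals agree to within the stated tolerance."""
    return abs(mart_total - source_total) <= tolerance


mart_rows = integrate(source_rows)

source_total = sum(source_rows)  # total in the system of record
mart_total = sum(mart_rows)      # total after the rounding load step
drift = mart_total - source_total

print(f"source total: {source_total}")
print(f"mart total:   {mart_total}")
print(f"drift:        {drift}")
print("reconciles:  ", reconcile(source_total, mart_total, Decimal("0.01")))
```

Half a penny per row looks harmless until it is multiplied by every row in the table. A simple reconciliation like this, run against each source before any analysis starts, is a cheap way to find out whether the data mart can be trusted.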
By being smart in your data acquisition, by considering which features will add value ahead of time and working with domain specialists to identify the best sources before going and hunting for them, you can save yourself countless hours of SQL hacking only to discover that the real master "was in another castle".
Analyse
What does the data look like? Is everything I need here? What are its characteristics?
Fantastic! We have data, it’s been cleaned and we have a problem to solve. Happy days, just crack on and see what you can find, right? Simple!
Well, not quite. Analysis can quickly become haphazard and misdirected if it's not performed with discipline and rigour. The reality of a data science project is that there will be a deadline for this and, whether you have visibility of it or not, a budget.
It is also likely that the hypothesis you have defined has a series of smaller sub-questions attached to it, and a crucial step in expressing the value of any initiative quickly is to prioritise these smaller outcomes based on which can express immediate business value.
It's vital to work with your stakeholders at the customer or within your organisation to prioritise these sub-queries; whilst it may take time to complete the full analysis outlined, providing some interim insights could help steer the business strategy in the right direction.
To use a metaphor, if you're flying to America, you may as well head west and pick a destination as you get closer to the shores. This is much quicker than sitting on the tarmac for 6 hours then taking off directly.
You can always change course in flight, but you'll already be headed in the right direction.
Socialise
Are the outcomes clear? Will people understand? What do other people think of the results?
The numbers are crunched, ensembles aggregated and the Hadoops have finished hadooping. Congratulations, it's a dataset. You should be proud, and rightly so.
However, remember this one truth: everyone thinks their own creations look amazing. Someone coming to your analysis for the first time may have only glimpses of what it's taken to get this far - if any context at all.
Storytelling is a vital skill in data science, and the ability to take stakeholders on a journey from hypothesis to outcome using qualitative insights derived from in-depth quantitative analysis without ambiguity is important; let's face it, that last sentence could quite easily send the uninitiated professional to sleep.
I think a little sales craft goes a long way in making a truly great data scientist, and having some charisma to keep your audience engaged can be the difference between adoption of innovative ideas due to your analysis, or the quiet storage of your comprehensive yet confusing analysis in a digital filing cabinet for all eternity.
Clarity is king. The recommendations should be short and concise, with visualisations that are crisp and clear to demonstrate your point in ways that are cognitively appealing to the human mind. Both Stephen Few and Edward Tufte are great visionaries in visualisation, and I urge those of you who wish to explore this aspect further to look at their work.
When all is said and done, it all boils down to this: does your analysis answer the question? Does it address the hypothesis in one sentence? Do you understand why?
In some cases, the answer is no and that's fine - data science by its very nature is exploratory. I think many business professionals out there see the disproving of a hypothesis as intrinsically negative - and if the hypothesis was poorly defined or misunderstood they would be correct.
However, if the steps outlined above have been observed, business representatives have been involved from the start, and it has survived peer review at every stage, then the analysis is valid; it will still inform business strategy and may even lead to further questions and analysis.
Thomas Edison best encapsulated this ethos for me:
“Negative results are just what I want. They’re just as valuable to me as positive results. I can never find the thing that does the job best until I find the ones that don’t.”
Each answered hypothesis, no matter the outcome, adds to our collective understanding and knowledge, and from there we can continue to learn, adapt and improve.
Because if there's one universal truth, it's that there is always something more to learn.