Data Observations - the most important aspect of fixing data quality
If you work with or rely upon data to perform your role, you probably hear phrases such as 'garbage in, garbage out' and face questions on data quality day in and day out. Sometimes you hear variations of the GIGO phrase expressed more colourfully, or perhaps not expressed at all but left tacit in the room, like an unwelcome, silent emission in an elevator.
My point is that GIGO is always there. Understandably so: if we're making decisions to change, replace, implement, stop or fund something, we want confidence that the data we're using to suggest and influence that decision is reliable and accurate.
Before I move on to the most important aspect of fixing data quality, two observations about the GIGO phrase itself are worth making.
Firstly, GIGO itself these days is perhaps too simplistic a representation when talking about data quality. The underlying principles of GIGO were amply illustrated by Charles Babbage when he wrote:
"On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question."
In terms of programming, where the input and the output dance more closely together than my recently separated Uncle did on his first visit to a Latin dance class with his unfortunate partner, GIGO will always hold true.
But there are many other instances where there's a significant gap of processing and steps between the GI and the GO. My company, Apptio, as an example, takes in multiple raw data sources when our ApptioOne software builds a cost model, but the raw data doesn't instantly become output. It goes through a series of integrity checks, lookups, joins, logical conditioning and other transformations before it becomes output, which allows garbage to be identified, cleansed and filtered out, or indeed improved. The same can be true for other products focused on different topics in the Business or Decision Intelligence space.
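To make that concrete, here's a deliberately simplified sketch in Python of what I mean by processing sitting between the GI and the GO. This is not ApptioOne's implementation; the rules, field names and lookup are all invented. It just shows how a staged pipeline of integrity checks and enrichment gives garbage somewhere to be caught before it reaches the output:

```python
# Illustrative only: a generic staged pipeline, NOT ApptioOne's implementation.
# Garbage is surfaced and rejected between input and output rather than passed through.
from dataclasses import dataclass, field

@dataclass
class PipelineResult:
    clean_rows: list = field(default_factory=list)
    rejected_rows: list = field(default_factory=list)  # the "garbage", surfaced for review

def integrity_check(row: dict) -> bool:
    """Reject rows missing the fields a cost model would need (hypothetical rules)."""
    return bool(row.get("asset_id")) and isinstance(row.get("cost"), (int, float)) and row["cost"] >= 0

def enrich(row: dict, cost_centre_lookup: dict) -> dict:
    """A lookup/join step: attach a cost centre, flagging anything unmapped."""
    row["cost_centre"] = cost_centre_lookup.get(row["asset_id"], "UNMAPPED")
    return row

def run_pipeline(raw_rows: list, cost_centre_lookup: dict) -> PipelineResult:
    result = PipelineResult()
    for row in raw_rows:
        if integrity_check(row):
            result.clean_rows.append(enrich(row, cost_centre_lookup))
        else:
            result.rejected_rows.append(row)
    return result

raw = [{"asset_id": "srv-01", "cost": 1200.0},
       {"asset_id": None, "cost": 450.0},      # garbage in...
       {"asset_id": "srv-02", "cost": -99.0}]  # ...and more garbage
out = run_pipeline(raw, {"srv-01": "CC-NETWORKS"})
print(len(out.clean_rows), "clean,", len(out.rejected_rows), "rejected")  # 1 clean, 2 rejected
```

The point of the sketch is simply that the rejected rows don't vanish; they become visible, which is what makes garbage identifiable and improvable rather than inevitable.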
Secondly, when you hear (or smell) the GIGO phrase, I don't mean to shock anyone, but sometimes it is uttered by people who would rather not use, see or share the output for reasons other than the data quality. Not always, but you should at least be aware that occasionally this is a motivation for raising it. I'd never immediately assume that was the intent, but it's always worth drilling into the topic to see if you can tease out more about why it was said and genuinely understand the objection.
Let's get back to the topic in hand: the single most important aspect of fixing data quality. It's not an earth-shattering revelation I'm afraid, but it is one that is universally true. It can be summed up by the following:
If you want to improve data quality, you have to have the intent and reason to do so and act with a focused plan.
There you go: data quality doesn't improve without a reason to improve it or a plan to achieve it.
Why is that so important? It's because of how people behave.
Here's a summary of the typical progression which can occur when dealing with data quality, and how the process should work:

[Figure: the typical progression when dealing with data quality, contrasted with how the process should work]
You can stop reading there if you want, or you can carry on for more of the context and experience behind it.
Years ago, I took part in a worldwide meeting where various large organisations talked about the decisions they were supporting across their technology offerings using a data-driven approach that combined financial, operational, consumption and utilization, and service-style data. One of the presenters, from a large global networking company, talked about an experiment they ran on data quality: one division was given the freedom to do whatever it wanted, while a separate division was given an approach, data guidance and tooling. Both divisions were tasked with putting together "Total Cost of Ownership" data, which involved combining data with the appropriate context and nuance. The division with the freedom to do what it wanted didn't achieve anything. The division with the guidance, tooling and intent made its data defensible within three months and was already taking decisions and driving value from it.
Data quality is much like life. You can either let it happen to you, or you can drive it with a goal. But the goal and the approach have to be relevant to each other. It's no good setting a goal of getting into crypto investments, gathering research and data to help you do that, and then applying it all to your aim of learning a foreign language.
Having worked with several hundred organisations over the past 20 years, with data spanning their financials, assets, cloud, people, applications, services and products, I'm yet to meet one that starts its journey with me happy or confident in the quality of its CMDB, its Fixed Asset Register, or any other data source that helps describe what it does, why, and how much it costs. I'm also yet to meet one that thinks it clearly understands what all of its Cloud investments do and provide, let alone one that feels it has a good handle on how its technology offerings support its business goals and outcomes at the start of that journey. I'm pleased about that, as it's a key driver in why they want help.
Sometimes a purpose comes along which will drive an improvement. In the earlier days of my career, we focused our intent on understanding questions such as: which elements of performance in our service supply chains are failing or running slowly, and how can we meet the requirements and expectations of our businesses more effectively? This drove related questions, such as: what is impacted if we make a change, or if something falls over?
That drove a certain amount of improvement. I worked with innovative software companies who built ways of understanding end-to-end transactions through the IT supply chain. I worked with smart people, and we worked out ways to discover and collect data and build a topology of a service so that we could answer: where does a transaction start; which platforms, applications, databases, assets and datacenters does it run through; how long does it take; what is our capacity; when does it slow down; and what happens if a piece of the puzzle falls over? Our goal was really to help companies understand and improve the quality and performance of their offerings and meet their SLAs for availability and performance, and it felt like important work for mission-critical offerings that helped IT support the business more effectively.
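If you've not worked in that space, a toy example might help. This isn't the tooling we used, just a hypothetical sketch of the underlying idea: once you've discovered the topology as a directed graph of dependencies, questions like "what does this transaction touch?" and "what's impacted if a piece falls over?" become graph traversals:

```python
# A toy, hypothetical service topology - component names are invented.
topology = {
    "web-frontend": ["app-server"],
    "app-server":   ["orders-db", "payments-api"],
    "payments-api": ["payments-db"],
    "orders-db":    [],
    "payments-db":  [],
}

def downstream(component: str, graph: dict) -> set:
    """Everything a transaction touches after entering at `component`,
    i.e. what is impacted if any of these pieces falls over."""
    seen, stack = set(), [component]
    while stack:
        node = stack.pop()
        for dep in graph.get(node, []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

print(sorted(downstream("web-frontend", topology)))
# ['app-server', 'orders-db', 'payments-api', 'payments-db']
```

The hard part in practice was never the traversal; it was discovering and keeping the graph accurate, which is a data quality problem in its own right.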
Then the economy became more digital, and the data, and the need to understand how technology underpins it, became even more critical. Technology was now embedded more firmly in direct revenue streams - it almost always was as soon as it existed - but this was a new level of tech and business working tightly together. Some of the early work in this space involved combining data on revenues through e-commerce sites with tech performance data. A simple example would be to analyse, in near real-time, the monetary value in 'online shopping baskets' as people selected goods and services and compare it to the monetary value at the committed payment/transaction stage. We would then correlate that with the tech performance data: network throughput, query response times and status, workload contention and utilization from the server and storage architectures. When the tech broke or slowed down, you could see the impact on revenue as online shopping baskets were abandoned and sessions closed. Better still, you could identify improvements that had an impact on revenue in terms of capacity or speed of response. Many of my customers at the time didn't take it to the level of combining financial or revenue data with the tech data; for them it was enough to have a really robust way of looking at the performance of their tech assets in the context of business services.
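As a deliberately simplified illustration of that analysis (the numbers are invented, and it uses statistics.correlation, which needs Python 3.10+), you can line up basket abandonment against a performance signal such as response time and look at the correlation:

```python
# Illustrative sketch, not any specific customer's system: compare the value
# selected into baskets with the value committed at payment, then correlate
# the abandonment rate with a tech performance signal (response time).
import statistics

# Hypothetical per-interval samples:
# (basket_value_selected, basket_value_committed, avg_response_ms)
intervals = [
    (10_000, 9_400, 250),
    (12_500, 11_800, 300),
    (11_000, 7_100, 1_900),   # slow responses, baskets abandoned
    (9_800, 9_200, 280),
    (13_200, 8_000, 2_400),   # another slowdown
]

abandon_rate = [(sel - com) / sel for sel, com, _ in intervals]
response_ms = [ms for _, _, ms in intervals]

# Pearson correlation between response time and abandonment rate
r = statistics.correlation(response_ms, abandon_rate)
print(f"correlation(response time, abandonment) = {r:.2f}")  # strongly positive here
```

In the real systems this ran continuously rather than on five samples, but the principle is the same: when the performance signal degrades, the revenue signal degrades with it, and you can put a monetary value on the slowdown.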
Some of you may be surprised that this work was taking place in the early-to-mid 2000s, way before I joined Apptio in 2010. It also helps explain why I joined Apptio: it was a natural progression for technology management to mature into combining business, financial, operational and performance data.
That need has never gone away - it's just become more complex and, I'd suggest, more urgent and more relevant. The two biggest trends I've seen in the last 5 or 10 years have been the adoption of Cloud and the transition to product strategies, with the adoption of Agile methodologies to support development and product lifecycles. There's never been a bad time to know what a technology service costs and why, but the agility and elasticity of Cloud services, with the freedom to change capacity and cost quickly, plus more business-objective-aligned Product/Agile strategies, have added a new nuance to these questions of cost. Increasingly, cost wants to be framed against revenue. All of my major work this year with clients has a profitability angle to it, some more explicit than others, but always there.
Sometimes it's amusing to look over a long period in your career and see the same themes again and again. Sometimes it's good validation that topics you felt were important are reaching a stage of maturity and acceptance. For other topics, seeing them again and again is less amusing and more frustrating. Data quality issues fall more into the frustration camp because the lessons learned and experiences of having worked with data for so long don't always find a receptive audience.
Let's finish by summarizing some of those lessons:
As I said around 5 or 6 years ago, 50% of the people who raise data quality concerns have genuine reasons for doing so. 30% don't agree with the data or don't want others to see it. 15% of people raising concerns think that owning data and keeping it to themselves gives them security or power and 45% just don't understand data or statistical analysis!
Thanks for reading, and feel free to share your own observations on data quality topics in the comments or by getting in touch.