Data Observations - the most important aspect of fixing data quality

If you work with or rely upon data to perform your role, you probably hear phrases such as 'garbage in, garbage out' and face questions on data quality day in and day out. Sometimes you hear variations of the GIGO phrase expressed more colourfully, or perhaps not expressed at all but left tacit in the room, like an unwelcome, silent emission in an elevator.

My point is that GIGO is always there. Understandably so, if we're making decisions to change, replace, implement, stop or fund something - we want to do so with confidence that the data which we're using to suggest and influence our decision is reliable and accurate.

Before I move on to the most important aspect of fixing data quality, there are two salient factors worth mentioning.

Firstly, GIGO itself is perhaps too simplistic a representation these days when talking about data quality. The underlying principles of GIGO were amply illustrated by Charles Babbage when he wrote:

"On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question."

In terms of programming, where the input and the output dance more closely together than my recently separated Uncle did on his first visit to a Latin dance class with his unfortunate partner, GIGO will always hold true.

But there are many other instances where a significant gap of processing and steps sits between the GI and the GO. My company, Apptio, as an example, takes in multiple raw data sources when our ApptioOne software builds a cost model. The raw data doesn't instantly become output; it goes through a series of integrity checks, lookups, joins, logical conditioning and other transformations first, which allows garbage to be identified, cleansed, filtered out or indeed improved. The same can be true for other products in the Business or Decision Intelligence space and in other sectors.
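
To make that concrete, here's a minimal, hypothetical sketch of the pattern - not Apptio's actual pipeline, and the field names, checks and lookup table are invented: raw rows pass integrity checks and lookup-based enrichment, and garbage is identified and filtered out before anything becomes output.

```python
# A minimal, hypothetical sketch: raw input passes through integrity checks,
# lookups and filtering before it ever becomes output. None of this reflects
# Apptio's actual implementation; names and rules are invented.

raw_rows = [
    {"asset_id": "A-100", "cost": 1200.0, "cost_centre": "CC-01"},
    {"asset_id": "A-101", "cost": -50.0,  "cost_centre": "CC-02"},   # negative cost: suspect
    {"asset_id": None,    "cost": 300.0,  "cost_centre": "CC-01"},   # missing key: garbage
]

cost_centre_lookup = {"CC-01": "Infrastructure", "CC-02": "Applications"}

def is_valid(row):
    """Basic integrity checks: required keys present and values plausible."""
    return row["asset_id"] is not None and row["cost"] >= 0

def enrich(row):
    """Join-style lookup that adds context from a reference table."""
    row = dict(row)
    row["department"] = cost_centre_lookup.get(row["cost_centre"], "Unknown")
    return row

# Garbage is identified and filtered out (or routed for remediation),
# and only cleansed, enriched rows become output.
rejected = [r for r in raw_rows if not is_valid(r)]
output = [enrich(r) for r in raw_rows if is_valid(r)]

print(f"{len(output)} rows kept, {len(rejected)} rows flagged for cleanup")
```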

Secondly, when you hear (or smell) the GIGO phrase, I don't mean to shock anyone, but sometimes it is uttered by people who would rather not use, see or share the output for reasons other than the data quality. Not always, but you should at least be aware that occasionally it is a motivation for raising it. I'd never immediately assume that was the intent, but it's always worth drilling into the topic to see if you can tease out more about why it was said and genuinely understand the objection. (Side note: that's always good advice when it comes to objections - really try to understand what the objection is so you can address it effectively.)

Let's get back to the topic in hand: the single most important aspect of fixing data quality. It's not an earth-shattering revelation, I'm afraid, but it is one that is universally true. It can be summed up by the following:

If you want to improve data quality, you have to have the intent and reason to do so and act with a focused plan.

There you go: data quality doesn't improve without a reason to improve it or a plan to achieve it.

Why is that so important? It's because of how people behave.

Here's a summary of the typical progression which can occur when dealing with data quality.

  1. A decision needs to be made, data is reviewed and someone brings up the GIGO concept.
  2. All agree on how important the data quality is, so it's proposed that the decision should wait until after a thorough review of the data.
  3. Separation now exists between the data and the decision, where the former can amble off leisurely with a period of time for review and analysis.
  4. In a few months, which may extend out to five or six as other imperatives take precedence, you revisit the topic, and nothing much has changed. Someone did analyse the data, but not in the context of the decision or topic you originally wanted to address, so their analysis summary doesn't move you forward much.
  5. At this point, you're no better off than you were, but time has passed, opportunities potentially have been missed, risks and rewards have increased or decreased, and the passive culture of resistance to change, which is always there in the corner of the room, raises itself briefly from a light snooze and smiles slowly, before settling down again with a greater sense of comfort.

Here's how the process should work:

  1. A decision needs to be made, data is reviewed and someone brings up the GIGO concept.
  2. All agree on how important the data quality is, and the data topic is immediately split into the issues relevant to the decision. Confidence is expressed in some areas, gaps are identified in others, and specific actions are taken to review the integrity, freshness, completeness and accuracy of the data that remains 'on topic' to the decision you want to take (a minimal sketch of such a check follows this list).
  3. Someone smart asks: "Would any of these data quality factors change/stop our intended decision or would they effectively provide more surety?"
  4. A decision is then reached, with an explicit plan for the data quality to be assessed and improved in tandem with, and for, the specific topic over a two-week sprint.
  5. When that review period ends, the decision topic is still firmly in the driving seat, not the data. Indeed, because you had the decision topic uppermost in mind when reviewing the data, you don't only get a review; you also build a culture that asks: where can I get different data, what would actually help this decision, and if we can't get this specific data, what is an acceptable proxy for it?
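
Here's a minimal sketch of what the 'on topic' review in step 2 might look like in practice. The record fields, thresholds and dimensions measured (completeness and freshness here) are illustrative assumptions, not a prescribed standard; the point is to assess only the quality dimensions that matter to the decision at hand.

```python
# A minimal sketch of assessing only the data quality dimensions that matter
# to a specific decision. Field names and thresholds are invented assumptions.
from datetime import date

records = [
    {"app": "Billing",  "annual_cost": 250_000, "last_reviewed": date(2024, 1, 10)},
    {"app": "CRM",      "annual_cost": None,    "last_reviewed": date(2022, 6, 1)},
    {"app": "Intranet", "annual_cost": 40_000,  "last_reviewed": date(2023, 11, 5)},
]

decision_fields = ["annual_cost"]   # only the fields this decision depends on
freshness_limit_days = 365          # how stale is too stale for this decision

def completeness(rows, fields):
    """Share of rows where every decision-relevant field is populated."""
    ok = sum(all(r[f] is not None for f in fields) for r in rows)
    return ok / len(rows)

def freshness(rows, as_of):
    """Share of rows reviewed within the acceptable window."""
    ok = sum((as_of - r["last_reviewed"]).days <= freshness_limit_days for r in rows)
    return ok / len(rows)

as_of = date(2024, 6, 1)
print(f"completeness: {completeness(records, decision_fields):.0%}")
print(f"freshness:    {freshness(records, as_of):.0%}")
# The output feeds the question in step 3: would these gaps change or stop
# the intended decision, or would closing them merely add surety?
```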

You can stop reading there if you want, or you can carry on to find out more context and experience.

Years ago, I took part in a worldwide meeting where various large organisations talked about the decisions they were supporting across their technology offerings using a data-driven approach that combined financial, operational, consumption and utilization, and service-style data. One of the presenters, from a large global networking company, talked about an experiment they ran on data quality: one division was given the freedom to do whatever it wanted, while a separate division was given an approach, data guidance and tooling. Both divisions were tasked with putting together "Total Cost of Ownership" data, which involved combining data with the appropriate context and nuance. The division with the freedom to do what they wanted didn't achieve anything. The team with the guidance, tooling and intent made their data defensible within three months and were already taking decisions and driving value from it.

Data quality is much like life. You can either let it happen to you, or you can drive it with a goal. But the goal and the approach have to be relevant to each other. It's no good gathering research and data that will help you get into crypto investments and then applying it to your aim of learning a foreign language.

Having worked with several hundred organisations over the past 20 years, with data spanning their financials, assets, cloud, people, applications, services and products, I'm yet to meet one that starts its journey with me happy or confident in the quality of its CMDB, its Fixed Asset Register or any other data source that describes what it does, why, and how much it costs. I'm also yet to meet one that thinks it clearly understands what all of its Cloud investments do and provide, let alone one that feels it has a good handle on how its Technology offerings support its Business goals and outcomes at the start of the journey. I'm pleased with that, as it's a key driver in why they want help.

Sometimes a purpose comes along which will drive an improvement. In the earlier days of my career, we focused our intent on understanding which elements of performance in our service supply chains were failing or performing slowly, so that we could meet the requirements and expectations of our businesses more effectively. This drove related questions such as: what is impacted if we make a change, or if something falls over?

That drove a certain amount of improvement. I worked with innovative software companies who built ways of understanding end-to-end transactions through the IT supply chain. I worked with smart people, and we worked out ways to discover and collect data and build a topology of a service so that we could answer: where does a transaction start; which platforms, applications, databases, assets and datacenters does it run through; how long does it take; what is our capacity; when does it slow down; what happens if a piece of the puzzle falls over? Our goal was really to help companies understand and improve the quality and performance of their offerings and meet their SLAs for availability and performance, and it felt like important work for mission-critical offerings that helped IT support the business more effectively.
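
As a toy illustration of that kind of service topology, here's a minimal sketch under invented component names: a directed dependency graph walked downstream to answer "what is impacted if this piece falls over?".

```python
# A toy sketch of a service topology: a directed graph of dependencies,
# traversed to answer "what is impacted if X falls over?". The components
# and edges are invented for illustration only.
from collections import defaultdict, deque

# An edge (A, B) means "B depends on A", so failures flow along the arrows.
depends_on_me = defaultdict(list)
edges = [
    ("storage-array", "order-db"),
    ("order-db", "checkout-app"),
    ("network-core", "checkout-app"),
    ("checkout-app", "web-storefront"),
]
for upstream, downstream in edges:
    depends_on_me[upstream].append(downstream)

def impacted_by(component):
    """Breadth-first walk downstream from a failed component."""
    seen, queue = set(), deque([component])
    while queue:
        current = queue.popleft()
        for dependent in depends_on_me[current]:
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen

print(impacted_by("order-db"))   # {'checkout-app', 'web-storefront'}
```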

Then the economy became more digital, and the data and the need to understand how technology underpins it became even more critical. Technology was now embedded more firmly in direct revenue streams - it almost always was as soon as it existed - but this was a new level of tech and business working tightly together. Some of that early work in this space involved combining data on revenues through e-commerce sites with tech performance data. A simple example would be to analyse, in near real time, the monetary value in the 'online shopping baskets' as people selected goods and services and compare it to the monetary value at the committed payment/transaction stage. We would then correlate that with the tech performance data: network throughput, query response times and status, workload contention and utilization from the server and storage architectures. When the tech broke or slowed down, you could see the impact on revenue as online shopping baskets were abandoned and sessions closed. Better still, you could identify improvements that had an impact on revenue in terms of capacity or speed of response. Many of my customers at this time didn't take it to the level of combining financial or revenue data with the tech data; for them it was enough to have a really robust way of looking at the performance of their tech assets in the context of business services.
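
A highly simplified sketch of that correlation might look like the following, with made-up numbers: per-minute basket value at selection versus at committed payment, alongside an average response time for the same window.

```python
# A highly simplified sketch of correlating revenue signals with a tech
# performance signal. All figures are invented for illustration.

windows = [
    # (minute, basket_value_selected, basket_value_committed, avg_response_ms)
    ("12:00", 18_500, 16_900, 220),
    ("12:01", 19_200, 17_400, 240),
    ("12:02", 21_000, 6_300, 1_850),   # slowdown: most basket value abandoned
    ("12:03", 17_800, 16_200, 260),
]

for minute, selected, committed, response_ms in windows:
    abandonment = 1 - committed / selected
    flag = "  <-- investigate tech performance" if response_ms > 1_000 else ""
    print(f"{minute}: abandoned {abandonment:.0%} of basket value, "
          f"avg response {response_ms} ms{flag}")
```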

Some of you may be surprised that this work was taking place in the early-to-mid 2000s, well before I joined Apptio in 2010. But it also helps explain why I joined Apptio: it was such a natural progression for technology management to mature into combining business, financial, operational and performance data to drive smarter decisions.

That need has never gone away - it has just become more complex and, I'd suggest, more urgent and more relevant. The two biggest trends I've seen in the last 5 or 10 years have been the adoption of Cloud and the transition to product strategies, with the adoption of Agile methodologies to support development and product lifecycles. There's never been a bad time to know what a technology service costs and why, but the agility and elasticity of Cloud services, with the freedom to change capacity and cost quickly, plus the more business-objective-aligned Product/Agile strategies, have added a new nuance to these questions of cost. Increasingly, cost wants to be framed against revenue. All of my major work this year with clients has a profitability angle to it, some more explicit than others, but always there.

Sometimes it's amusing to look over a long period in your career and see the same themes again and again. Sometimes it's good validation that topics you felt were important are reaching a stage of maturity and acceptance. For other topics, seeing them again and again is less amusing and more frustrating. Data quality issues fall more into the frustration camp because the lessons learned and experiences of having worked with data for so long don't always find a receptive audience.

Let's finish by summarizing some of those lessons:

  • Data quality doesn't improve without a reason for doing so or a plan to achieve it.
  • The intent and act of using data will drive improvements to it.
  • Very few people in any organisation have a complete understanding of the data which will be available, but many people in many organisations claim that data isn't available.
  • Data is often treated like an alien, separate entity, as if it has nothing to do with people.
  • No one achieves anything of note with data without first knowing what questions to ask of it and what outcomes they want to achieve with it.
  • Perfect can indeed be the enemy of good.
  • Data is not a finite resource which came into being independently. It can be created.
  • Data intended for one purpose can often be used for other purposes too rather than sitting in an isolated silo.
  • Creating a culture where data is more widely shared doesn't mean giving away the secret recipe from the Colonel or your competitive advantages; transparency in some areas leads to better decision making, better profitability, reduced risk and other benefits.
  • The best question to ask when people don't want to share data is: "Why not?"

As I said around 5 or 6 years ago, 50% of the people who raise data quality concerns have genuine reasons for doing so. 30% don't agree with the data or don't want others to see it. 15% of people raising concerns think that owning data and keeping it to themselves gives them security or power and 45% just don't understand data or statistical analysis!

Thanks for reading; feel free to share your own observations on data quality topics in the comments or by getting in touch.
