Data rocks
Data gives you the answers, but to harness its power you need to learn to ask the right questions
Data is one of the most important driving forces of today's world. Control over the flow of data has elevated technology giants such as Alphabet (Google's parent company), Amazon, Apple, Facebook, and Microsoft to the top: these companies rank among the 10 most valuable firms in the world. Artificial intelligence (AI), machine learning (ML), the Internet of Things (IoT) - these terms dominate the headlines of the world's most important news outlets, almost always described as our near future. Data-driven decision-making, fact-based decision-making, and metrics, in turn, have become regular features of company reports presenting strategic plans. Data rocks.
"The amount of data generated in the next three years will be greater than in the last 30 years, and in the next five years, the world will generate over three times more data than in the previous five years." - IDC's Global DataSphere Forecast. Of this data, 29% relates to productivity. Analytics of productivity data was expected to bring about a boom in organizational productivity and ensure its incredible growth. However, for many companies, this boom never happened. What went wrong? Why, even though significant technological advancements intended to support productivity development, does productivity not increase? Why, even though the provided analyses and forecasts, which should enable managers to make quick, precise, and good decisions, are these decisions often wrong?
One of the main reasons for this situation is the quality of data and (paradoxically) its quantity. Digital technologies multiply data, and an excess of data is simply harmful. A second consequence of this multitude is poor data quality: the more data there is, the more difficult it is to check its correctness, reliability, and timeliness. Handling poor-quality data also generates enormous costs. Harvard Business Review calculated (2016) that inaccurate data cost the US economy $3 trillion annually, while Gartner (2022) estimated that poor-quality data costs companies an average of $12.9 million annually.
What makes data "good" or "bad"? Good data are data suitable for their intended uses in operations, decision-making, planning, and learning. Bad data include duplicates, conflicting data, incomplete data, incorrect data, and unsynchronized data. Considering the amount of data we would have to verify to eliminate the "bad" data, the huge costs cited above become evident.
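To make these categories a little more concrete, here is a minimal sketch (my own illustration, not part of the article) of how two of them - duplicates and incomplete records - might be flagged in a batch of records held as dictionaries. The field names ("id", "email", "created_at") are assumptions chosen only for the example.

```python
# A minimal sketch of checking a batch of records for two kinds of "bad" data:
# duplicates and incomplete records. Field names are illustrative assumptions.
from collections import Counter

REQUIRED_FIELDS = ["id", "email", "created_at"]  # assumed critical attributes

def find_bad_records(records):
    """Return indices of records that are duplicated or incomplete."""
    bad = set()

    # Duplicates: the same "id" appearing more than once.
    id_counts = Counter(r.get("id") for r in records)
    for i, r in enumerate(records):
        if id_counts[r.get("id")] > 1:
            bad.add(i)

        # Incomplete: any required field missing or empty.
        if any(not r.get(field) for field in REQUIRED_FIELDS):
            bad.add(i)

    return sorted(bad)

records = [
    {"id": 1, "email": "a@example.com", "created_at": "2024-01-02"},
    {"id": 1, "email": "a@example.com", "created_at": "2024-01-02"},  # duplicate
    {"id": 2, "email": "", "created_at": "2024-01-03"},               # incomplete
]
print(find_bad_records(records))  # -> [0, 1, 2]
```

Even a check this simple hints at the scale of the problem: every additional rule, field, and data source multiplies the verification work.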
Organizations are often aware of the poor quality of the data they hold, and they feel the effects of using such data: wasted time, weak decision-making processes, frustrated customers, and difficulties in implementing any strategy. Yet organizations find it difficult to eliminate bad data and improve its quality. It is challenging to introduce and consistently implement process improvements related to data handling. This is influenced not only by the amount of data but also by the number of places where it is stored, often without mutual awareness of its existence. The lack of a structured strategy for working with and storing data, and the lack or shortage of individuals willing to take responsibility for such a strategy, are also significant factors. As a result, the mission of repairing data, even in a medium-sized organization, looks long and arduous from the very start. It is unclear where to begin, how to define the causes of bad data, or how to identify its sources. Taking up this challenge, however, seems crucial for the proper functioning of the organization, its correct (predictable) growth, and the elimination of risks encountered during that growth.
We begin the data rescue process with identification. An interesting method for verifying the quality of data created in the area for which a particular manager is responsible is the "Friday Afternoon Measurement" (FAM) method by Thomas C. Redman. The method is relatively simple: managers collect 10-15 critical data attributes for the last 100 units of work performed by their departments - essentially 100 data records. Together with their teams, they examine each record and mark obvious errors. Then they count the error-free records. The result is a number from 0 to 100, representing the percentage of correctly created data - the Data Quality (DQ) Score. The measurement can then be repeated in successive areas of the organization until it covers the whole. The results of testing this method on 75 managers over two years were not optimistic: on average, 47% of newly created data records contained at least one critical error (i.e., one that affects the work). Only 3% of DQ Scores could be rated "acceptable" even by the most lenient quality standard. Moreover, the results did not show any particular industry to be better or worse off than the others. The conclusion is therefore alarming - no sector of the economy, no industry, and no company is immune to the effects of low data quality.
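The arithmetic behind the DQ Score is deliberately trivial, which is part of the method's appeal. The sketch below (my own illustration, under the assumption that each reviewed record has already been marked with the critical errors found in it) shows that calculation; the record structure and error labels are hypothetical, not part of Redman's method.

```python
# A minimal sketch of the Friday Afternoon Measurement arithmetic.
# Assumes the team has already reviewed each of the last (up to) 100 records
# and listed the critical errors it found in each one.

def dq_score(reviewed_records):
    """DQ Score = number of error-free records among the (up to) 100 reviewed."""
    sample = reviewed_records[:100]
    error_free = sum(1 for record in sample if not record["errors"])
    return error_free  # with a full sample of 100, this reads directly as a percentage

reviewed = [
    {"unit_of_work": "order-001", "errors": []},
    {"unit_of_work": "order-002", "errors": ["wrong shipping address"]},
    {"unit_of_work": "order-003", "errors": []},
]
print(dq_score(reviewed))  # -> 2 (two error-free records in this small example)
```

The value of the exercise lies less in the number itself than in the team sitting together over its own 100 most recent records and seeing the errors first-hand.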
The FAM method is one possible solution, and its additional value is that it suggests an approach to the data repair process itself. Results will vary across areas of the organization, as will the causes and scale of the bad data. FAM shows that it is pointless to try to repair all the data in the organization; that would always resemble the Augean stables, and only a Herculean effort could clean them. Instead, FAM suggests focusing on data from the most recent, short time interval: verify the quality of this data, analyze the reasons for the errors, and eliminate their sources. By repeating the process over the next defined interval, we check whether the sources of bad data have indeed been eliminated and whether new ones have appeared. We categorize and describe the sources of both good and bad data, together with the assessment criteria we have adopted, so that we can check data quality in the future and compare the actions we have taken (to fix the data) with their results. And so on, step by step, area by area. Identifying the amount and location of poor-quality data in the organization is meant to ensure that the material we work with is of sufficient quality, i.e., true and reliable.
However, what if we cannot isolate correct data (for example, because the number of exceptions makes any algorithmic approach meaningless)?
Data are answers. We can and should work with data by asking the right questions. We collect data (even if, initially, probably by hand) not in order to possess it and know its values. We collect it to a very limited extent - as the answer to one specific question. So we do not ask what a specific number is (e.g., the number of website users), but what that result tells us (what that number of users means for the organization). We do not care how many records describe a given situation; what matters is what follows from the fact that the number is what it is.
Another useful tool for working with data is the experiment. An experiment operates within a strictly defined time frame and collects only the data needed for its purpose. In defining the experiment, we establish the question (hypothesis) we would like answered. We decide when and how we will find out whether the answer we obtain is what we expected or something entirely different. We use that answer to formulate the next hypothesis for the next experiment. We can also repeat the experiment, changing certain values of the hypothesis to test completely different assumptions.
With such a limited process, when we obtain a strictly defined amount of data that precisely answers the question asked (whether it confirms or refutes the hypothesis is secondary), we can consider scaling the experiment.
But even when scaling the experiment, we do not multiply additional questions. We check whether the same condition holds in a different situation or for a different group. Experiments require the greatest possible limitation of variables, preferring a one-change-at-a-time approach, one piece of data at a time. One good answer at a time.
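As an illustration of this discipline, here is a minimal sketch (my own, not from the article) of how such a single-question experiment could be recorded: one hypothesis, one changed variable, one metric, and a fixed time window. All names and values are assumptions chosen for the example.

```python
# A minimal sketch of recording a one-change-at-a-time experiment:
# one hypothesis, one varied variable, one metric, a fixed time window.
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Experiment:
    hypothesis: str          # the single question we want answered
    variable_changed: str    # the one thing we vary
    metric: str              # the one measurement that answers the question
    start: date
    end: date
    expected: float
    observed: Optional[float] = None

    def answer(self) -> str:
        """Compare the observed metric to the expectation once the window closes."""
        if self.observed is None:
            return "no answer yet"
        return "hypothesis supported" if self.observed >= self.expected else "hypothesis rejected"

exp = Experiment(
    hypothesis="Shortening the signup form raises completed registrations",
    variable_changed="signup form length",
    metric="completed registrations per week",
    start=date(2024, 3, 1),
    end=date(2024, 3, 8),
    expected=120.0,
)
exp.observed = 131.0
print(exp.answer())  # -> "hypothesis supported"
```

Scaling then means rerunning the same definition for another group or area, not adding new questions to it.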
When the scaled hypothesis also proves true for a larger or additional area of our organization, there is a high probability that the consequences will scale too: if certain actions improve data quality in one area, the same actions will improve data quality throughout the organization.
It makes no sense to collect data for its own sake. Even more so, it makes no sense to repair data quality if the data does not serve us in making decisions. Making decisions is closely related to the quantity and quality of the data we have. However, the quality of a decision does not necessarily follow from the quality of the data used to make it. What matters is whether we have asked the data the right questions. Even good data answering the wrong question will lead to a wrong decision - and then data quality cannot be blamed.
A minimalist, experimental approach to working with data has a further benefit: by working in a limited area, we reduce the cost of obtaining answers and minimize the risk of making a mistake. Our data become better and more precise, and by using them we make better decisions, without necessarily investing in vast libraries of historical data (of questionable quality). We also save time and avoid many frustrations and misunderstandings. Yes, data rules, and with the right approach we can govern it.