How much data do I need?
Matthew Kroll
We Help Businesses Achieve Operational Excellence through Process Improvement, Employee Engagement & Continued Improvement for Sustained Growth | Industrial Engineer | Fractional Certified Master Black Belt Six Sigma
Happy New Year! We are three years into the third decade of the new millennium. Is it still new? I hope you have great plans for the year. I hope you are ready to adapt those plans as the year unfolds in ways you cannot predict. And I hope you stay calm and collected as the year’s triumphs and tragedies unfold. Have a great 2023!
One of the most frequent questions asked of a Black Belt is, “How much data do I need?” and the most frequent answer is, “It depends.” We all know that biases skew our perspective of the truth. We know that small sample sizes, localized data sets, outliers, and preconceived notions all add to our bias. This is true both in business and in our personal lives. In this article, let’s stay out of the discussion of personal bias and focus on the more mechanical business environment. I will share my experience of how best to manage this challenge in data collection.
My first bit of advice when it comes to collecting data is don’t overthink it; just go collect the data. In most of our day-to-day business and manufacturing environments, the cost and risk of data collection are low enough that we should just go do it. Also, I know for certain that you will need to collect data again, so think of the first run as a pilot of your data collection process. The value of the pilot is that what you learn from it helps you answer the question of how much data you need to collect.
The first question is about the criticality or the tolerance of what you are trying to understand. If it is highly critical, then it follows that you want more information. If you are dealing with very tight tolerances, then you want more data so that you can detect small degrees of variation.
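To put rough numbers on the tolerance point, here is a minimal sketch using the textbook normal approximation for a two-sided, one-sample test. The function name `n_to_detect_shift`, the 5% significance level, the 80% power, and the standard deviation of 1.0 are all assumptions chosen for illustration, not values from any particular process.

```python
import math
from statistics import NormalDist


def n_to_detect_shift(sigma, delta, alpha=0.05, power=0.80):
    """Approximate sample size needed to detect a shift of `delta` in a mean.

    Uses the normal-approximation formula n = ((z_alpha/2 + z_beta) * sigma / delta)^2.
    sigma, alpha, and power are illustrative assumptions supplied by the caller.
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = z.inv_cdf(power)           # quantile for the desired power
    return math.ceil(((z_alpha + z_beta) * sigma / delta) ** 2)


# The smaller the shift you need to see, the more data it takes.
for delta in (1.0, 0.5, 0.1):
    print(f"shift of {delta} sigma -> about {n_to_detect_shift(1.0, delta)} points")
```

Under these assumptions, a shift of half a standard deviation takes about 32 data points, while a shift of one tenth of a standard deviation takes about 785. The required sample grows with the square of how tight the tolerance is.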
Next, you should think about the type of data that you are able to collect. A simple example is delivering something “on time.” You can measure that as true or false, or as the number of days, hours, or minutes late. True or false is a clumsy measure. It is discrete: it can only be one or the other and nothing in between. It hides useful information about the degree of lateness. It is not the type of measure that you want for critical processes, and it requires much more data to understand the process. Alternatively, a time-based measurement can take any value on a continuum, so each data point carries much more information. This is why you need fewer data points when you measure a process on a continuum than when you measure it at discrete levels.
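The gap between discrete and continuous data can be sketched with the standard margin-of-error formulas. Everything numeric here is an assumption for illustration: a 95% confidence level, a worst-case on-time rate of 50%, a lateness standard deviation of 2 hours, and target margins of 5 percentage points and half an hour. The two targets are not strictly equivalent, so treat the comparison as indicative only.

```python
import math
from statistics import NormalDist


def n_for_proportion(margin, p=0.5, confidence=0.95):
    """Sample size to estimate a pass/fail rate within +/- `margin`.

    p=0.5 is the worst case (largest required sample); it is an assumption.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)


def n_for_mean(sigma, margin, confidence=0.95):
    """Sample size to estimate a mean within +/- `margin`, given an assumed sigma."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil((z * sigma / margin) ** 2)


# Discrete: on-time yes/no, estimated to within 5 percentage points.
print("on-time rate:", n_for_proportion(margin=0.05))
# Continuous: hours late, estimated to within half an hour (sigma assumed to be 2 hours).
print("mean lateness:", n_for_mean(sigma=2.0, margin=0.5))
```

With these assumed inputs the yes/no measure needs about 385 deliveries, while the hours-late measure needs about 62.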
The next thing to consider is the amount of variation that naturally occurs in the process. Think of it this way: a small amount of data collected on a process that varies wildly from day to day or hour to hour is not going to be representative of that process over longer periods of time. You are going to need quite a lot of data to characterize a highly variable or unstable process. Think about lines at the airport during a holiday weekend. Now think about the line at Starbucks on a typical weekday. Holiday weekends are wildly variable. If you are trying to collect data to understand the staffing needed at O’Hare airport during the week of Thanksgiving, you need to look at many samples over multiple years. The line at Starbucks on a weekday is relatively predictable. You could pick just about any weekday, collect data on the Starbucks drive-thru line at a consistent time, and come away with a good idea of throughput. You don’t need much data.
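A rough way to see the effect of variation is to hold the precision you want constant and change only the spread of the process. In this sketch the wait-time standard deviations (1 minute for the steady line, 10 minutes for the erratic one) and the half-minute target are made-up numbers, and the formula assumes independent samples from a stable process, which a holiday airport line may well violate.

```python
import math
from statistics import NormalDist


def samples_for_margin(sigma, margin, confidence=0.95):
    """Samples needed to estimate a mean within +/- `margin` for a given spread."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil((z * sigma / margin) ** 2)


# Hypothetical standard deviations of wait time, in minutes.
processes = {"steady weekday line": 1.0, "holiday airport line": 10.0}

for name, sigma in processes.items():
    n = samples_for_margin(sigma, margin=0.5)
    print(f"{name}: sigma = {sigma} min -> about {n} observations")
```

Ten times the spread calls for roughly a hundred times the data (about 16 versus about 1,537 observations here), because the required sample size scales with the square of the standard deviation.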
There is one other question to consider. Sometimes you are looking for a specific problem to occur. The process may be very consistent, and you are looking for that one-in-a-million cause that creates a big problem. In this case you need a large sample size or frequent sampling over a long period of time (which is still a large sample). The exception is when you can create a test environment that reproduces the circumstances behind that one-in-a-million event. Those of you who are familiar with designed experiments know that they are a great way of searching for causes without having to take much data. A designed experiment has its own set of questions and considerations that I won’t go into here. What you need to be aware of is that a designed experiment uses a very small amount of data to produce a large amount of information. In cases where you are looking for something very infrequent, design of experiments (DOE) is a good way to avoid having to collect a large sample.
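To give a sense of what "large" means when you simply wait for a rare event, here is a small sketch. It assumes each observation is independent with the same chance `p` of showing the problem, and asks how many observations are needed to have a given chance of seeing it at least once. The function name and the 95% target are assumptions for illustration.

```python
import math


def observations_to_see_event(p, confidence=0.95):
    """Observations needed so that P(at least one occurrence) >= confidence.

    Solves 1 - (1 - p)^n >= confidence for n, assuming independent
    observations that each have probability p of showing the event.
    """
    # log1p(-p) computes log(1 - p) accurately even for very small p.
    return math.ceil(math.log(1 - confidence) / math.log1p(-p))


for p in (1e-2, 1e-4, 1e-6):
    n = observations_to_see_event(p)
    print(f"event rate {p:g}: about {n:,} observations")
```

For a true one-in-a-million event this comes out to roughly three million observations just to be 95% sure of catching it once, which is why reproducing the event in a controlled test is so attractive.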
Everything that I’ve explained to this point is an important consideration. Of course, when people ask, “How much data do I need?” they really just want a simple and exact answer. Black Belts are always reluctant to provide a specific number without knowing the details. I will go out on a limb and tell you to start with 30 data points. This is the typical rule-of-thumb recommendation for a sample size. Once you have done your initial data collection and learned a bit, think about what you want to know. Are you looking for hard-to-find occurrences? Are you gauging the ability to maintain tight tolerances? Are you using discrete data? If the answer to any of these is yes, you are probably going to need thousands of data points. Most statistical software packages provide sample size calculators. They use the considerations that I have discussed in this article to estimate the number of data points to collect. This can be useful, but keep in mind that it's just a formula. Ultimately, your understanding of the process provides the inputs to that formula. The quality of your inputs determines the validity of the number of data points it tells you to collect. Good luck, and get out there and collect some data!
International Executive Coach & Author | Founder of ScaleYOU | CEO & Partner at LeanMail | Expert in Productivity and Leadership Development
1 year ago: Another excellent article, Matt. I would like to recommend the book "How to Measure Anything" by Douglas Hubbard. There are many parallels in your thinking. https://www.amazon.com/How-Measure-Anything-Intangibles-Business/dp/1118539273
We grow startups and mid-size businesses into massive brands using video. Exceptional video production | Digital marketing | Advertising and strategy | Consultative video marketing to drive the results you want!
1 year ago: Looking forward to seeing more content on this, Matthew Kroll, especially as, with one of our clients, we've dubbed 2023 'the year of data'.