Six Sigma Pizza – Pie 15
So far…
We are in the Measure phase, the second phase of DMAIC. We saw the need for statistics in problem-solving, and how the location of data, called central tendency, helps us understand the process or event from which we collect data. We discussed the Mean, Mode and Median in the previous chapter.
I want to reiterate that understanding and applying statistical tools in day-to-day decision making is not only simple; it becomes indispensable once we get used to it.
A recap of the story has become necessary now that we are deep into data and statistics. We are on a journey to improve my friend's pizza business using Six Sigma. We understood and defined his business problems as projects. We organised a two-day workshop on data analysis and statistics as part of the Measure phase training. We saw the central tendency in the previous chapter. In our story, we had just taken our lunch break on day one.
Now
"What do we understand by calculating the average weight of food each person is taking for lunch?" Anand started the post-lunch session with his hesitant question.
"Why do you measure the weight of the food on the plate?" I asked him.
"No, no. I asked to understand how the data collection and the average would help us".
"This is what happens when we find out a new tool. We try to use it or test it upon all the circumstances. All we need here is to understand how statistics and data analysis would help us and try to apply it to our project."
"We see in many companies – they have hundreds of headers of data; data is captured online, machine to machine, abnormalities and everything. However, at last, the purpose of collecting the data is forgotten.
Data analysis without a purpose is as good as sailing on a ship without knowing the destination.
I believe the data we collect shall lead to an understanding of the problem we consider / a focus area and then to wisdom."
I showed a slide on Data Analysis on the projector.
Convert Data into Wisdom - With which we can decide and act.
- Data: Collection of raw information or facts used for computing, reasoning, or measuring.
- Insights: This is the outcome of collecting and organising data, establishing the relationships, trends and patterns. These findings will provide an insight into context and meaning.
- Wisdom: We understand the behaviour and utilise the past and future patterns for the benefit of the organisation, society and humanity.
I could see a deep silence and a more profound realisation among the team members.
Dispersion
"Now, let us proceed with our descriptive statistics. As per our day's plan, we need to complete this part of statistics today" I said.
"Sir, then will you teach inferential statistics the whole day tomorrow?" Abhishek.
"There is something more than that for tomorrow" I continued to get their attention back to the subject of the hour.
"In our previous sessions, you understood the Central Tendency with Mean, Median and Mode. Am I right?"
"Yes, Sir!" they answered in high energy chorus.
"Then let us do a small exercise!" I told them while starting to display the below table on the projector. Imagine I collected the data of temperature of pizza served on two days - each day data in separate columns.
tone.
“Yes, sir. He had grown up a lot,” said Ganesh, and that was their cue for a burst of laughter.
They quickly got into calculating the Mean, Median and Mode.
Within a few minutes, they started announcing their results one by one, “Sir, Mean, Median and Mode all are 62 for both the data sets”.
“Sir, Mean is 62 for both, the median is 58 for day 1, 62.5 for day 2, and Mode is 62 in both cases”, answered Kiran.
“No, Kiran. While calculating the median, we have to arrange the data set in an ascending or descending order.”
He took a few moments and came back. “Sorry, sir, I forgot. Yes, Mean, Mode and Median are 62 for both the data sets”.
“Good. Since all three measures of central tendency are the same for both data sets, shall I call them identical, meaning they behave similarly?”
Anand came back quickly, “No, I can see that the data are different on the two days, even though the central tendencies are the same”. The others were silent.
“True. When we plot the data points on a graph called a frequency distribution plot, we get the outcome below,” I said, showing the pictures.
"Now, I can see the difference. The first-day data is spread from 58 to 67 whereas the day 2 data are confined between 61 and 63", Anand acknowledged.
"Yes, sir! After looking at the picture, we understand that both data sets are completely different. Even in our lives, we may perceive two leaders as equal under certain circumstances; however, their true nature comes out at challenging times and differentiate a great leader from a mediocre leader. It is the metric that differentiates"; Ganesh shared his Gyan.
"Guru Ganesh Ji! My salutes to you!" I bowed.
"I mean to say; the central tendency alone is not sufficient to explain the behaviour of the subject we study. We will see three properties of any data under review. They are
1. Central Tendency
2. Dispersion and
3. The shape of the Data.
Let us now focus on dispersion.
Dispersion is a measure of the precision of our data set. That is, we measure how close our data points are to each other. This is also called the data spread.
Dispersion, again, is measured in three ways, namely:
1. Range
2. Variance and
3. Standard Deviation.
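The exercise above can be sketched in a few lines of Python. The temperature readings here are hypothetical (the workshop table is not reproduced in the text); they are chosen only so that both days share a mean, median and mode of 62 while day 1 spreads from 58 to 67 and day 2 stays between 61 and 63. The helper name `central_tendency` is also just an illustration.

```python
import statistics

# Hypothetical pizza temperatures, ten readings per day.
# These are NOT the workshop data; they only mimic its behaviour.
day1 = [62, 65, 58, 67, 62, 60, 59, 62, 64, 61]
day2 = [62, 61, 63, 62, 62, 61, 63, 62, 63, 61]


def central_tendency(data):
    """Return (mean, median, mode) for a list of numbers."""
    return (
        statistics.mean(data),
        statistics.median(data),  # sorts the data internally
        statistics.mode(data),
    )


for name, data in (("Day 1", day1), ("Day 2", day2)):
    mean, median, mode = central_tendency(data)
    print(
        f"{name}: mean={mean}, median={median}, mode={mode}, "
        f"min={min(data)}, max={max(data)}"
    )
```

Both days report the same three measures of central tendency, yet the minimum and maximum already hint that the data behave differently. That gap is what the measures of dispersion are meant to capture.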
Range
"Sir, I remember what range is," said Kiran.
"Wow, Kiran! Thanks for helping me to break the monotonous mode" I paused to look at him and said, "Yes, you may proceed".
"Sir, we generally use range in our day to day conversations. For example, the voltage range is from 180 to 230 volts or the price range of Rs. 20,000 to 50,000, etc.”
"Excellent Kiran" I appreciated him and continued.
"As we understand from the word, the range is the simplest measure of spread. This is the difference between the largest value and the smallest value in a data set", and I wrote
Range of Data set = Max value – Min value
"So, theoretically, the range of a data set is a single value. Am I right?" asked Anand.
"Yes, Anand, you are right. The range is always a single value. That's how we use to calculate in statistics. A bit different from our general usage" I said.
"Going by the definition, the voltage range will be 50 volts, and the price range is Rs. 30,000".
"You are right" affirmed Anand.
"But we need to be cautious while using this indicator to understand the behaviour of our process. The range is affected by the extreme values in a data set; The range of data having even one extreme value will have larger value whereas most of the other data points would have been closer to each other" I said.
I displayed the sample data from the computer and continued, "For example, consider the data sets."
"Sir, please give us a moment. We will calculate the range. It sounds easier for me" said Vineet. He started making some calculations in the air without waiting for me.
"Yes, Sir. It is simple. Lets workout!" Kiran requested.
"Sir, range of data set 1 is 4, that is 34-30 and data set 2 is 12, 42-30. Am I right?" asked Vineet.
"Yes, Vineet. You are!" I lauded and waited for others to compute. I continued, "Range in case of data set 1 is closely representing the behaviour of the data set; but in case of data set 2, the range 12 is not going well with all the data points, as it is stretched by one extreme value (42); it is not representing the behaviour of its data set as close as that of data set 1 and its range".
"If you are comfortable, shall we move to the next indicator of dispersion, that is Variance?" I enquired.
Variance
"The range will get distorted with even one extreme value. So, what can we do now?" I asked them.
"Sir, you already told us what to do. Now we are going to use Variance. Right Sir?" Abhishek.
"Right, I understand Abhishek, we are due for a coffee break". We laughed together, amid the serious session.
I continued, "We are in search of another metric that can represent the behaviour of a data set more precisely. Our basic statistical knowledge says that the average spread of all data points from the centre value would serve the purpose. We have just seen the central location of the data. Are you with me?"
"Let us experiment!" said Anand.
"We will move ahead to introduce a central (mean) line and calculate the average spread of all data points from their central line – we call it as average spread/dispersion from the centre. Consider our earlier data set.
There are data points from X1, X2, X3, up to X10. We call the average of all data points as Mean (the central line at 62).
The distance of each data point from the mean value (X1-Mean) is d1, d2, d3 ... d10 respectively.
d1 will be 0 (62-62), d2 will be 65-62=3, whereas d3 will be 58-62= -4 because d3 is lesser than the mean. When we calculate the average of all distances, it will be mathematically zero or close to zero, as d’s on left and d’s on the right will cancel out each other when we add positive and negative values.
What to do? Yes! We can square up the d’s so that every d^2 becomes positive.
In our case,
12 is the average of the squared deviations from the centre; this term is called the variance and is denoted by σ^2 (sigma-squared).
"Wait! We intended to calculate the average dispersion of each point from the centre line of data set, right? However, what we have now is the average of squared dispersions. Here we go; we will take a square root for the variance.
We've got a useful metric. We wanted to calculate the average dispersion of each data point from the centre line to understand the overall dispersion properties of data. However, we cannot call the term σ as average; because, in the formula, we have taken the square root to the denominator' n' which is not a squared term. So, we need to give another name; Statisticians call σ as the standard deviation.
Standard deviation is useful to understand the spread of data points; and also, to compare the spread of multiple data sets. Higher the standard deviation higher will be the spread. Let us go back to our example to understand standard deviation".
I assigned them calculation of standard deviation for data set 2 as homework and proceeded with a coffee break and gathered again.
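The steps of this derivation can be written compactly as follows. This is a sketch in standard notation, not a copy of what was on the board: the symbol \bar{X} for the mean and the index i are assumptions, while d, n and σ follow the discussion, with n as the denominator.

```latex
\begin{aligned}
d_i &= X_i - \bar{X}, \qquad i = 1, \dots, n \\
\sum_{i=1}^{n} d_i &= 0 \quad \text{(positive and negative distances cancel)} \\
\sigma^2 &= \frac{1}{n} \sum_{i=1}^{n} d_i^{2} \quad \text{(variance)} \\
\sigma &= \sqrt{\frac{1}{n} \sum_{i=1}^{n} d_i^{2}} \quad \text{(standard deviation)}
\end{aligned}
```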
Mean & Standard Deviation
"Now, we can explain the behaviour of both data sets with central tendency complemented by Standard Deviation; simply with mean and standard deviation. The higher standard deviation more will be the spread of data".
Note: As per statistics, the denominator in the formulae is (n-1) instead of n. Here, n-1 is called degrees of freedom.
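The calculation can be followed step by step in Python. The readings are hypothetical, not the workshop table, so the variance will not come out as the 12 quoted earlier; the variable and function names are illustrative. The sketch computes both versions of the formula, dividing by n and by n-1.

```python
import math

# Hypothetical pizza temperatures (not the workshop data).
day1 = [62, 65, 58, 67, 62, 60, 59, 62, 64, 61]
day2 = [62, 61, 63, 62, 62, 61, 63, 62, 63, 61]


def dispersion(data):
    """Return (sum of deviations, variance with n, variance with n-1)."""
    n = len(data)
    mean = sum(data) / n
    deviations = [x - mean for x in data]       # d1, d2, ... dn
    squared = [d ** 2 for d in deviations]      # every d^2 is non-negative
    return sum(deviations), sum(squared) / n, sum(squared) / (n - 1)


for name, data in (("Day 1", day1), ("Day 2", day2)):
    total_dev, var_n, var_n1 = dispersion(data)
    print(f"{name}: sum of deviations = {total_dev}")
    print(f"  variance (n)   = {var_n:.3f}, std dev = {math.sqrt(var_n):.3f}")
    print(f"  variance (n-1) = {var_n1:.3f}, std dev = {math.sqrt(var_n1):.3f}")
```

The deviations add up to zero for both days, which is why they are squared first. Day 1 ends up with a clearly larger standard deviation than day 2, even though their means are identical. Python's `statistics.pvariance` and `statistics.variance` implement the n and n-1 versions respectively.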
Quartiles and Inter Quartile Range
In addition to Mean, Median and Mode, Quartiles will also provide us with information on the location of data.
"Sir, one more tool?" asked Abhishek.
"This is the metric that explains both the central tendency and dispersion. As we saw this is as simple as other metrics", I explained.
"Imagine, four of your neighbourly children come to you with one chocolate bar and request you to split their shares. You have to cut it into four parts – importantly, four equal parts!
I drew a rectangular chocolate bar on the board and invited them to divide by offering the pen to them.
Vineet came up again, "Sir, let me try".
"First I will measure the length, here the height - of course. Then will divide the length by four and mark on the bar. He measured the chocolate bar and marked the cuts.
"Wonderful Vineet!"
"First, you cut the bar splitting the bottom 25%. Right?
This line is called first quartile (Q1) or 25th percentile, meaning, 25% of the chocolate bar is below this line.
Then the second line cuts the chocolate bar into two halves. This is our median and also called as Second Quartile (Q2) or 50th percentile.
The third line is called third quartile (Q3) or 75th percentile, meaning 75% of the chocolate bar is on one side (left) of the line.
I marked the percentiles and quartiles on the board.
The Inter Quartile Range (IQR)
"The Inter Quartile Range (IQR) is the range between Q1 and Q3". I wrote on the board,
IQR = Q3 – Q1
"What is the advantage of using IQR?"
Without waiting for their answers, I continued, "Well, IQR gives us a key indication: where the middle 50% of the data points lie".
I told them, "Now the trick: try to comprehend this discussion by replacing the chocolate bar with data".
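Replacing the chocolate bar with data might look like the sketch below. The readings are hypothetical, and the choice of `method="inclusive"` is an assumption: different tools use slightly different conventions for placing the quartile cuts, so their Q1 and Q3 values can differ a little for small data sets.

```python
import statistics

# Hypothetical pizza temperatures (not the workshop data).
data = [62, 65, 58, 67, 62, 60, 59, 62, 64, 61]

# quantiles() with n=4 returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # IQR = Q3 - Q1

print(f"Q1 (25th percentile) = {q1}")
print(f"Q2 (median)          = {q2}")
print(f"Q3 (75th percentile) = {q3}")
print(f"IQR                  = {iqr}")
```

Q2 matches the median computed directly, while the distance between Q1 and Q3 describes how widely the middle of the data is spread.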
I asked them, "Now, before we call it a day, it's question time! This indicator is something special among all the others we have been discussing. Can anyone explain why?"
Very few attempted to answer, but no one came close. Then I offered them a clue. "Think about why I chose to discuss this tool after completing both central tendency and dispersion!"
"Kannan, I think I got it. Correct me if I am wrong!" said Ben.
"Sure Ben. When you are coming up, it will be right! Please go ahead."
"This tool explains the central tendency as well as the dispersion of data. The Q2 tells us the central tendency and the distance at which Q1 and Q3 are placed will tell us about the spread".
"That's excellent, Ben! I raised my hands and clapped signalling the participants to follow in applauding him.
"We will connect this indicator when we discuss the graphical data analysis using Box-Plot, tomorrow."
I demonstrated how all the above indicators can be calculated with a few clicks using MS Excel and Minitab, and assigned the same as homework for the evening.
The Excel output for the descriptive statistics of the two data sets is:
Next
Chapters 14 and 15 are dedicated to the basic indicators of data, the central tendency and dispersion, which together come under descriptive statistics. We have not touched on Standard Error, Kurtosis and Skewness.
In the next chapters, we will look into the Inferential Statistics and then move on to an understanding of different types of data, data visualisation, screening and stratification of data.