A Step Before Data Science
Dattatreya Hullur
Director & Principal Data Scientist BE, MBA, MS, PGD Data Science, Azure AI Engineer, Azure Associate Data Scientist
Data science projects start with data already in hand, or after large amounts of data have been acquired from the data sources. The moment the data is in place, data scientists or analysts start preprocessing it. But before we acquire data, it is important to understand the business processes producing it. So, why is it so important to know the data sources?
As most of you know, the quality of the output is decided by the quality of the input, so it becomes crucial to understand the kind of data used as input; otherwise it can become challenging to generalize the model at later stages, and a lot of regularization techniques will have to be employed. Input data from such sources will often have high variability, with many outliers and noise. The processes producing it are naturally unstable and not good for modelling.
Stability of Business Processes
In probability theory, a stable process is a type of stochastic process whose associated probability distributions are stable distributions.
Typically, a process is said to be stable when all the response parameters that we use to measure the process have a constant mean, a constant variance, and a constant distribution over time. This is also called controlled variation.
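A quick way to get a feel for this, assuming the data is already in a table with a timestamp and a measured response, is to compare summary statistics across consecutive time windows. The sketch below uses synthetic data, and the column names are hypothetical; substitute your own measurements.

```python
# A minimal sketch (synthetic data) of checking whether a response parameter
# keeps a roughly constant mean and variance over time.
# The column names "timestamp" and "cycle_time" are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "timestamp": pd.date_range("2023-01-01", periods=500, freq="D"),
    "cycle_time": rng.normal(loc=10.0, scale=1.5, size=500),
})

# Split the history into consecutive windows and compare summary statistics;
# large drifts in the window means or variances hint at an unstable process.
windows = np.array_split(df["cycle_time"].to_numpy(), 10)
for i, w in enumerate(windows):
    print(f"window {i}: mean={w.mean():.2f}, std={w.std(ddof=1):.2f}")
```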
There are two factors that can make a process unstable:
1. Inherent noise in the process
2. Process changes (e.g. automation, introduction of tools, changes in method, etc.)
We can’t do much about the first factor, as it is inherent in the process. Process changes, however, need further investigation. Imagine using an already trained machine learning model on data from a changed process: this will lead to poor predictions, and the model’s worthiness will always be a question mark, as the sketch below illustrates. Hence, it is important for a data scientist to understand the changes in these business processes and tweak the models accordingly.
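The effect is easy to reproduce on synthetic data. The sketch below is not from the article; it trains a simple regression on data from the old process and then scores it on data generated after a hypothetical process change, where the error is much larger.

```python
# A hedged illustration (synthetic data) of how a model trained before a
# process change can degrade once the process shifts.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)

# Old process: y depends on x with one relationship.
x_old = rng.uniform(0, 10, size=(1000, 1))
y_old = 2.0 * x_old.ravel() + rng.normal(0, 1, 1000)

# Changed process (e.g., after automation): the relationship shifts.
x_new = rng.uniform(0, 10, size=(1000, 1))
y_new = 3.5 * x_new.ravel() - 4.0 + rng.normal(0, 1, 1000)

model = LinearRegression().fit(x_old, y_old)
print("MAE on old process:", mean_absolute_error(y_old, model.predict(x_old)))
print("MAE on changed process:", mean_absolute_error(y_new, model.predict(x_new)))
```

In practice, the same comparison on recent held-out data is a cheap check for whether a deployed model has outlived the process it was trained on.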
How do we understand changes in the process?
‘Control charts’ are the best way to understand the changes in the processes. The control chart is a graph used to study how a process changes over time. Data is plotted in time order, and a control chart always has a central line for the average, an upper line for the upper control limit, and a lower line for the lower control limit. These lines are determined from historical data. By comparing current data to these lines, you can draw conclusions about whether the process variation is consistent (in control) or is unpredictable (out of control, affected by special causes of variation). This versatile data collection and analysis tool can be used by a variety of industries and is considered one of the seven basic quality tools.
For variable data, control charts are used in pairs. The top chart monitors the average, or the centering of the distribution of data from the process. The bottom chart monitors the range, or the width of the distribution. For example, if your data were shots in target practice, the average is where the shots are clustering, and the range is how tightly they are clustered. Control charts for attribute data are used singly.
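A minimal sketch of such a paired chart (an X-bar chart over an R chart) follows. The data is synthetic, and the chart uses the standard control-chart constants for subgroups of size 5; substitute your own measurements and subgroup size.

```python
# X-bar / R chart pair with center lines and control limits derived from
# historical subgroups, using standard constants for subgroup size n = 5.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
subgroups = rng.normal(loc=50, scale=2, size=(30, 5))   # 30 subgroups of 5 readings

xbar = subgroups.mean(axis=1)                           # subgroup averages (top chart)
r = subgroups.max(axis=1) - subgroups.min(axis=1)       # subgroup ranges (bottom chart)

xbar_cl, r_cl = xbar.mean(), r.mean()
A2, D3, D4 = 0.577, 0.0, 2.114                          # constants for n = 5

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(8, 6))
ax1.plot(xbar, marker="o")
for y in (xbar_cl, xbar_cl + A2 * r_cl, xbar_cl - A2 * r_cl):
    ax1.axhline(y, linestyle="--")                      # center line and X-bar limits
ax1.set_ylabel("X-bar")

ax2.plot(r, marker="o")
for y in (r_cl, D4 * r_cl, D3 * r_cl):
    ax2.axhline(y, linestyle="--")                      # center line and R limits
ax2.set_ylabel("Range")
ax2.set_xlabel("Subgroup")
plt.tight_layout()
plt.show()
```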
Shewhart’s 8 rules: criteria for checking instability
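As an illustration, the sketch below implements two of the commonly cited run rules: a point outside the three-sigma limits, and a long run of consecutive points on one side of the center line. The exact thresholds and rule counts vary between rule sets, so treat this as an assumption-laden example rather than a full implementation of all eight rules.

```python
# Two simple instability checks against a known center line and sigma.
import numpy as np

def beyond_limits(values, center, sigma):
    """Indices of points outside the +/- 3 sigma control limits."""
    values = np.asarray(values, dtype=float)
    return np.where(np.abs(values - center) > 3 * sigma)[0]

def long_run_one_side(values, center, run_length=8):
    """Indices that complete a run of `run_length` points on one side of the center."""
    values = np.asarray(values, dtype=float)
    signs = np.sign(values - center)
    hits, run = [], 0
    for i in range(1, len(signs)):
        run = run + 1 if signs[i] == signs[i - 1] and signs[i] != 0 else 0
        if run + 1 >= run_length:
            hits.append(i)
    return hits

# Synthetic series: stable points, one extreme value, then a sustained shift upward.
data = np.r_[np.random.default_rng(2).normal(10, 1, 40), [15.0], 10.6 * np.ones(10)]
print(beyond_limits(data, center=10, sigma=1))
print(long_run_one_side(data, center=10))
```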
When to Use a Control Chart?
· When controlling ongoing processes - to find and correct problems as they occur
· When predicting the expected range of outcomes from a process
· When determining whether a process is stable (in statistical control)
· When analyzing patterns of process variation from special causes (non-routine events) or common causes (built into the process)
In machine learning, Exploratory Data Analysis (EDA) is an important step. Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data and possibly formulate hypotheses that could lead to new data collection and experiments.
Typically, we check for outliers during EDA and make a decision to keep or remove them. Most of the time, we just remove them before understanding the underlying cause. So, we need to differentiate between the causes of outliers. If the outliers are due to a process change, then it is better to collect new data and use it for modelling.
The flowchart below depicts this process in detail -
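As a rough illustration of one branch of that decision (this is not the article’s flowchart), the sketch below compares the data before and after a suspected process change with a two-sample test; the change date, column names, and data are all hypothetical.

```python
# Before dropping "outliers", check whether the pre- and post-change segments
# look like different processes; if so, model on the new data instead.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "timestamp": pd.date_range("2023-01-01", periods=400, freq="D"),
    "value": np.r_[rng.normal(10, 1, 300), rng.normal(13, 1, 100)],
})

change_date = pd.Timestamp("2023-10-28")   # e.g., when a new tool was rolled out
before = df.loc[df["timestamp"] < change_date, "value"]
after = df.loc[df["timestamp"] >= change_date, "value"]

# A very small p-value suggests the apparent outliers are really data from a
# changed process rather than random noise.
stat, p = stats.ttest_ind(before, after, equal_var=False)
print(f"t = {stat:.2f}, p = {p:.3g}")
```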
Conclusion: Understanding the stability of the data sources will enable us to generalize the model well. Overfitting and outliers can be treated early in the data science process, and through this we can also see a tremendous reduction in the effort spent on regularization.
Disclaimer: The views, thoughts, and opinions expressed in the text belong solely to the author and not necessarily to the author’s employer, organization, committee, or other group or individual.