A non-technical guide to Data Science
Snigdha Kakkar
For beginners, the terms data science and machine learning can seem daunting. While most articles on data science and machine learning focus on their applications in diverse industries and are written primarily for developers and practitioners, this article aims to explain the core concepts of data science and machine learning for a non-technical reader who is interested in how businesses derive data-driven insights.
Data Science is basically a set of fundamental principles that guide the extraction of knowledge from data.
Now, your data source could be a file, a database, a website, forms or surveys, an API, or a device. Whenever we extract data from an API, an underlying request-response process takes place, as shown in the figure below: the client sends a request with certain additional information or conditions to the API, and the server responds.
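To make this concrete, here is a minimal sketch of such a request-response exchange in Python using the requests library; the endpoint URL and the query parameters are hypothetical.

```python
# A minimal sketch of a client-server API request using the "requests" library;
# the endpoint URL and its parameters are hypothetical.
import requests

# The client sends a request with additional information / conditions as query parameters
response = requests.get(
    "https://api.example.com/v1/sales",      # hypothetical API endpoint
    params={"region": "EU", "year": 2023},   # extra conditions sent to the API
    timeout=10,
)

# The server responds, typically with a status code and JSON data
if response.status_code == 200:
    data = response.json()
    print(f"Received {len(data)} records")
else:
    print(f"Request failed with status {response.status_code}")
```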
By common estimates, roughly 20% of a data scientist's time is spent gathering and collecting data sets, and around another 60% goes into cleaning and organizing that data. Exploring and processing data is therefore a critical stage in drawing meaningful insights.
This particular stage of exploring and processing data is further broken down into four stages -
Now, Exploratory Data Analysis is typically done in five ways (a short pandas sketch follows this list):
- Basic structure (Number of rows or observations, Number of columns / features, Column data types, Exploring the head / tail of data)
- Summary statistics (Numerical: Centrality measures, such as mean, median, mode of data sets; Dispersion measures, such as variance, standard deviation, range, percentiles. Categorical: Total count, Unique count, Category count and proportions, Per category statistics)
- Distributions (Univariate distributions: Histogram, Kernel Density Estimation plot; Bivariate distributions: Scatter plot)
- Grouping and aggregation of data using certain common conditions
- Using crosstabs or pivots to classify data related to multiple features
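As a rough illustration of the five methods above, here is a minimal pandas sketch; the file name sales.csv and its column names are hypothetical.

```python
# A minimal sketch of the five EDA steps using pandas;
# "sales.csv" and its columns ("category", "revenue", "price", "units", "region") are hypothetical.
import pandas as pd

df = pd.read_csv("sales.csv")

# 1. Basic structure
print(df.shape)        # number of rows and columns
print(df.dtypes)       # column data types
print(df.head())       # first few observations

# 2. Summary statistics
print(df.describe())                    # mean, std, percentiles for numerical columns
print(df["category"].value_counts())   # counts per category for a categorical column

# 3. Distributions (requires matplotlib for plotting)
df["revenue"].plot(kind="hist")                 # univariate: histogram
df.plot(kind="scatter", x="price", y="units")   # bivariate: scatter plot

# 4. Grouping and aggregation
print(df.groupby("region")["revenue"].mean())

# 5. Crosstabs / pivots across multiple features
print(pd.crosstab(df["region"], df["category"]))
```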
In my next article, I will dig deeper into each of these Exploratory Data Analysis methods with examples.
At a high level, exploratory data analysis sets the stage for further data cleansing and munging. Under data munging (or wrangling), we treat missing values in certain columns and also work with outliers. The reasons for such missing entries include an erroneous data entry process, non-availability from the source, or equipment error. Because missing items and outliers can skew our results, we either delete them or impute them with adequate measures (mean imputation, median imputation, mode imputation, forward/backward fill, or a predictive model) before further analysis. Outliers are quite easy to detect using a histogram, a box plot, or a scatter plot. Outliers are treated in the following ways (a short pandas sketch follows this list):
- Removal
- Transformation
- Binning
- Imputation
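Here is a minimal pandas sketch of how missing values and outliers might be treated; the file customers.csv, the columns age and income, and the IQR rule used to flag outliers are all illustrative assumptions.

```python
# A minimal sketch of missing-value imputation and outlier treatment in pandas;
# "customers.csv" and the columns "age" and "income" are hypothetical.
import pandas as pd

df = pd.read_csv("customers.csv")

# Missing-value imputation: fill gaps with a central measure or a neighbouring value
df["age"] = df["age"].fillna(df["age"].median())   # median imputation
df["income"] = df["income"].ffill()                # forward fill

# Outlier detection with a simple IQR rule (the same points would stand out on a box plot)
q1, q3 = df["income"].quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)

# Treatment option 1: removal
df_removed = df[df["income"] <= upper]

# Treatment option 2: capping (imputation at the boundary value)
df["income_capped"] = df["income"].clip(upper=upper)
```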
Next, we move towards Feature Engineering. It is the process of transforming raw data into more representative features in order to build better predictive models. Some data scientists say that feature engineering is an art; it requires domain as well as technical knowledge. The three key ways feature engineering is done are transformation of features, creation of features, and selection of features. Data scientists often use this stage to convert all the categorical features to numerical ones, since many predictive models in machine learning do not work with categorical features directly. For this purpose, we use categorical feature encoding. There are several methods for such encoding. One of the easiest is binary encoding, wherein a category such as Gender is broken down into Is_Male and Is_Female columns populated with binary values of 0 or 1. A second way is label encoding, wherein we encode categorical levels such as Low, Medium and High as 1, 2 and 3 respectively. If you use Python for building such predictive models, a convenient way to create such features is the One-Hot Encoding method.
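To illustrate, here is a minimal pandas sketch of label encoding and one-hot encoding; the example DataFrame and its columns are made up for demonstration.

```python
# A minimal sketch of categorical feature encoding with pandas;
# the example DataFrame and its columns are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "priority": ["Low", "High", "Medium", "Low"],
})

# Label encoding: map ordered categories to integers
df["priority_encoded"] = df["priority"].map({"Low": 1, "Medium": 2, "High": 3})

# One-hot encoding: one binary (0/1) column per category, e.g. is_Male / is_Female
df = pd.get_dummies(df, columns=["gender"], prefix="is")

print(df)
```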
Finally, the stage is set for building predictive models and advanced visualizations. One can start by building baseline models and then improve them through fine-tuning.
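As an example of what a baseline model might look like, here is a minimal scikit-learn sketch; the file prepared_data.csv and the target column churned are hypothetical placeholders for the output of the earlier stages.

```python
# A minimal sketch of a baseline predictive model with scikit-learn;
# "prepared_data.csv" and the "churned" target column are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df = pd.read_csv("prepared_data.csv")   # assumed output of the earlier stages
X = df.drop(columns=["churned"])        # features
y = df["churned"]                       # binary target

# Hold out part of the data to evaluate the model on unseen examples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# A simple baseline classifier; later iterations fine-tune or replace it
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

baseline_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```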
My next article will delve deeper into each of these stages, as there is so much to discover in data science and a lot more to experience. I hope you enjoy the journey, no matter when you start!