FEATURE ENGINEERING FOR MACHINE LEARNING.
Eric Odongo
Bio-Statistician/Clinical Programmer||Co-Founder@EPM-Square Analytics||Open Source Evangelist
Feature engineering is one of the key processes undertaken by almost every machine learning practitioner. In this article I intend to explain what it means and give a general overview of the specific techniques used within the machine learning community.
Feature engineering can be described as the process of fine-tuning the variables within a data set and then scientifically selecting those variables that give optimal performance when included in a machine learning algorithm. There are two steps involved in feature engineering, i.e. variable transformation/cleaning and variable selection.
The first step ensures that you have a tidy data set useful for further analysis: tidy in the sense that each row identifies a unique observation and each column identifies a unique variable/feature. The second step is mainly used for modeling purposes, where you don't want to include all the variables in your model.
DATA TRANSFORMATION
Imputation for missing data - missing data can adversely affect the performance of your model, so it is prudent to find a way of dealing with it. There are two possible solutions to this menace: deletion and imputation. Which method to use largely depends on the data that you have, in terms of its volume and where you want to deploy your models. Personally, I don't recommend deletion unless you are very sure. With imputation, taking the median value is plausible in most cases.
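As a minimal sketch, assuming your data sits in a pandas DataFrame with a hypothetical numeric column called income, median imputation looks like this:

```python
import pandas as pd

# Hypothetical data set with a missing value in the numeric column "income"
df = pd.DataFrame({"income": [42000, 55000, None, 61000, 48000]})

# Impute missing values with the column median (robust to skew and extreme values)
df["income"] = df["income"].fillna(df["income"].median())
print(df)
```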
Outlier problem - outliers should be handled because they affect the performance of machine learning models. Using domain knowledge you can come up with a rule to detect outliers in your data set. For example, if you are dealing with age, then any negative value or anything beyond 120 years should be treated as an outlier.
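A minimal sketch of such a domain rule, assuming the ages live in a pandas Series:

```python
import pandas as pd

# Hypothetical ages, including values that violate the domain rule
ages = pd.Series([25, 40, -3, 130, 67])

# Flag anything negative or beyond 120 years as an outlier
outlier_mask = (ages < 0) | (ages > 120)
print(ages[outlier_mask])   # the flagged values
print(ages[~outlier_mask])  # the retained values
```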
Binning - this can be applied to both categorical and numerical data. It involves grouping your data into useful segments, which makes your models more robust and helps to avoid overfitting.
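A short sketch of binning a numerical age column into labelled groups (the bin edges and labels below are just illustrative choices):

```python
import pandas as pd

ages = pd.Series([5, 17, 25, 43, 68, 90])

# Group the numerical feature into coarser, labelled segments
age_group = pd.cut(ages,
                   bins=[0, 18, 35, 60, 120],
                   labels=["child", "young", "adult", "senior"])
print(age_group)
```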
Log transformation - useful for numerical columns. It helps to normalize highly skewed data and also decreases the effect of outliers.
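For example, assuming a skewed, made-up income column, the log(1 + x) transform can be applied as follows:

```python
import numpy as np
import pandas as pd

# Highly skewed hypothetical values with one very large observation
income = pd.Series([20_000, 35_000, 50_000, 1_200_000])

# log1p computes log(1 + x): it handles zeros safely and compresses the long right tail
income_log = np.log1p(income)
print(income_log)
```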
Grouping operations - useful when applying aggregate functions. It's easier for numerical features, where you can use functions such as the mean or sum. The situation is a bit more complex when dealing with categorical data (the interested reader should explore this further); see the sketch below for one common choice.
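A small sketch, assuming a hypothetical transactions table aggregated per customer; the categorical column is summarized by its most frequent level, which is only one of several possible strategies:

```python
import pandas as pd

# Hypothetical transactions: one row per purchase, to be aggregated per customer
df = pd.DataFrame({
    "customer": ["a", "a", "b", "b", "b"],
    "amount":   [10.0, 25.0, 5.0, 7.5, 30.0],
    "channel":  ["web", "shop", "web", "web", "shop"],
})

# Numerical feature: aggregate with sum/mean; categorical feature: take the most frequent level
features = df.groupby("customer").agg(
    total_amount=("amount", "sum"),
    mean_amount=("amount", "mean"),
    top_channel=("channel", lambda s: s.mode().iloc[0]),
)
print(features)
```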
Scaling - this is useful when dealing with numerical features. It involves normalizing your columns so that all observations fall within a comparable range. Standardization is the most popular approach, where we use the Z-score to rescale each column to zero mean and unit variance. This step is required for unsupervised machine learning algorithms such as k-means, where a distance metric is used.
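A minimal sketch of standardization using scikit-learn's StandardScaler on two made-up numeric features with very different scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical numeric features on very different scales
X = np.array([[1.0, 20_000.0],
              [2.0, 35_000.0],
              [3.0, 50_000.0]])

# Standardization: subtract the column mean and divide by the column standard deviation
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)  # each column now has mean 0 and unit variance
```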
Feature split - this is applied to character (string) data, depending on the requirements of your analysis. Technically it doesn't affect your model performance in any way.
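For example, assuming a hypothetical full-name column, it can be split into separate first and last name features:

```python
import pandas as pd

# Hypothetical full-name column split into first/last name features
df = pd.DataFrame({"name": ["Ada Lovelace", "Alan Turing"]})
df[["first_name", "last_name"]] = df["name"].str.split(" ", n=1, expand=True)
print(df)
```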
VARIABLE SELECTION
The modeling process often requires one to use a subset of the available variables. If you have only a few variables you can as well include them all in your model. However, in most cases you will be faced with an array of variables, and you will be forced to make a selection. The question is: how do you tell the best candidates (variables) for your model?
I propose performing correlation analysis to find highly correlated variables and eliminating the redundant ones. More complex methods include cross-validation, shrinkage methods (ridge regression and the lasso) and dimension reduction techniques like principal components. These methods are well discussed in the textbook by James, Witten, Hastie and Tibshirani (An Introduction to Statistical Learning with Applications in R).
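A simple sketch of correlation-based elimination on a made-up feature matrix, dropping one column from every pair whose absolute correlation exceeds an arbitrary threshold of 0.9:

```python
import numpy as np
import pandas as pd

# Hypothetical feature matrix with two nearly redundant columns (x1 and x2)
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 * 0.98 + rng.normal(scale=0.05, size=200),  # almost a copy of x1
    "x3": rng.normal(size=200),
})

# Keep only the upper triangle of the absolute correlation matrix,
# then drop one feature from every highly correlated pair
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print("dropping:", to_drop)
df_reduced = df.drop(columns=to_drop)
```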
[That marks the end of my short article on feature engineering. I hope it will motivate you to explore further. Thank you for reading.]