FEATURE ENGINEERING FOR MACHINE LEARNING.
Eric Odongo
Bio-Statistician/Clinical Programmer||Co-Founder@EPM-Square Analytics||Open Source Evangelist
Feature engineering is one of the key processes undertaken by almost every machine learning practitioner. In this article I intend to explain what it means and give a general overview of the specific techniques used within the machine learning community.
Feature engineering can be described as the process of fine-tuning the variables within a data set and then scientifically selecting those variables that give optimal performance when included in a machine learning algorithm. There are two steps involved in feature engineering, i.e. variable transformation/cleaning and variable selection.
The first step ensures that you have a tidy data set useful for further analysis: tidy in the sense that each row identifies a unique observation and each column identifies a unique variable/feature. The second step is mainly used for modeling purposes, where you don't want to include all the variables in your model.
DATA TRANSFORMATION
Imputation for missing data - missing data can adversely affect the performance of your model, so it is prudent to find a way of dealing with it. There are two possible solutions to this menace: deletion and imputation. Which method to use largely depends on the data that you have, in terms of its volume and where you want to deploy your models. Personally, I don't recommend deletion unless you are very sure. With imputation, taking the median value is plausible in most cases.
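As a minimal sketch, assuming your data sits in a pandas DataFrame with a hypothetical numeric column called income, median imputation looks like this:

```python
import pandas as pd

# Hypothetical data set with a missing value in the numeric column "income"
df = pd.DataFrame({"income": [42000, 55000, None, 61000, 48000]})

# Impute missing values with the column median (robust to skew and extreme values)
df["income"] = df["income"].fillna(df["income"].median())
print(df)
```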
Outlier problem - outliers should be handled because they affect the performance of machine learning models. Using domain knowledge you can come up with a rule to detect outliers in your data set. For example, if you are dealing with age, then any negative value or anything beyond 120 years should be treated as an outlier.
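A minimal sketch of such a domain rule, assuming the ages live in a pandas Series:

```python
import pandas as pd

# Hypothetical ages, including values that violate the domain rule
ages = pd.Series([25, 40, -3, 130, 67])

# Flag anything negative or beyond 120 years as an outlier
outlier_mask = (ages < 0) | (ages > 120)
print(ages[outlier_mask])   # the flagged values
print(ages[~outlier_mask])  # the retained values
```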
Binning - this can be applied to both categorical and numerical data. It involves grouping your data into useful segments, which makes your models more robust and helps to avoid overfitting.
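A short sketch of binning a numerical age column into labelled groups (the bin edges and labels below are just illustrative choices):

```python
import pandas as pd

ages = pd.Series([5, 17, 25, 43, 68, 90])

# Group the numerical feature into coarser, labelled segments
age_group = pd.cut(ages,
                   bins=[0, 18, 35, 60, 120],
                   labels=["child", "young", "adult", "senior"])
print(age_group)
```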
Log transformation - useful for numerical columns. It helps to normalize highly skewed data and also decreases the effect of outliers.
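For example, assuming a skewed, made-up income column, the log(1 + x) transform can be applied as follows:

```python
import numpy as np
import pandas as pd

# Highly skewed hypothetical values with one very large observation
income = pd.Series([20_000, 35_000, 50_000, 1_200_000])

# log1p computes log(1 + x): it handles zeros safely and compresses the long right tail
income_log = np.log1p(income)
print(income_log)
```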
Grouping operations - useful when applying aggregate functions. It's easier for numerical features, where you can use functions such as the mean or sum. The situation is a bit more complex when dealing with categorical data (the interested reader should explore this further); see the sketch below for one common choice.
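A small sketch, assuming a hypothetical transactions table aggregated per customer; the categorical column is summarized by its most frequent level, which is only one of several possible strategies:

```python
import pandas as pd

# Hypothetical transactions: one row per purchase, to be aggregated per customer
df = pd.DataFrame({
    "customer": ["a", "a", "b", "b", "b"],
    "amount":   [10.0, 25.0, 5.0, 7.5, 30.0],
    "channel":  ["web", "shop", "web", "web", "shop"],
})

# Numerical feature: aggregate with sum/mean; categorical feature: take the most frequent level
features = df.groupby("customer").agg(
    total_amount=("amount", "sum"),
    mean_amount=("amount", "mean"),
    top_channel=("channel", lambda s: s.mode().iloc[0]),
)
print(features)
```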
Scaling - this is useful when dealing with numerical features. It involves normalizing your columns so that all observations fall within a comparable range. Standardization is the most popular approach, where we use the Z-score to rescale each column to zero mean and unit variance. This step is required for unsupervised machine learning algorithms such as k-means, where a distance metric is used.
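A minimal sketch of standardization using scikit-learn's StandardScaler on two made-up numeric features with very different scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical numeric features on very different scales
X = np.array([[1.0, 20_000.0],
              [2.0, 35_000.0],
              [3.0, 50_000.0]])

# Standardization: subtract the column mean and divide by the column standard deviation
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)  # each column now has mean 0 and unit variance
```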
Feature split - this is applied to character (string) data, depending on the requirements of your analysis. Technically it doesn't affect your model performance in any way.
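For example, assuming a hypothetical full-name column, it can be split into separate first and last name features:

```python
import pandas as pd

# Hypothetical full-name column split into first/last name features
df = pd.DataFrame({"name": ["Ada Lovelace", "Alan Turing"]})
df[["first_name", "last_name"]] = df["name"].str.split(" ", n=1, expand=True)
print(df)
```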
VARIABLE SELECTION
The modeling process often requires one to use a subset of the available variables. If you have only a few variables you can as well include them all in your model. However, in most cases you will be faced with an array of variables, and you will be forced to make a selection. The question is: how do you tell the best candidates (variables) for your model?
I propose performing correlation analysis to find highly correlated variables and eliminating the redundant ones. More complex methods include cross-validation, shrinkage methods (ridge regression and the lasso) and dimension reduction techniques like principal components. These methods are well discussed in the textbook by James, Witten, Hastie and Tibshirani (An Introduction to Statistical Learning with Applications in R).
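A simple sketch of correlation-based elimination on a made-up feature matrix, dropping one column from every pair whose absolute correlation exceeds an arbitrary threshold of 0.9:

```python
import numpy as np
import pandas as pd

# Hypothetical feature matrix with two nearly redundant columns (x1 and x2)
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 * 0.98 + rng.normal(scale=0.05, size=200),  # almost a copy of x1
    "x3": rng.normal(size=200),
})

# Keep only the upper triangle of the absolute correlation matrix,
# then drop one feature from every highly correlated pair
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print("dropping:", to_drop)
df_reduced = df.drop(columns=to_drop)
```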
[That marks the end of my short article on feature engineering. I hope it will motivate you to explore further. Thank you for reading.]