A Comprehensive Guide to Feature Engineering for Machine Learning in Python

These articles are part of my learning journey through my graduate program at the University of Michigan, DataCamp, Coursera, LinkedIn, and other platforms. You can find my similar articles and more stories on my Medium profile. I am also available on Kaggle and GitHub. Thank you for your motivation, support, and valuable feedback.


This article includes:

Introduction

Section 1: Dealing with messy data

Section 2: Text data processing

Section 3: Conforming to statistical assumptions

Section 4: Feature selection and extraction

Section 5: Python code samples

Conclusion

Quick references


Introduction

In feature engineering, relevant features are selected, transformed, and extracted from raw data to improve the performance of machine learning models. We will discuss the basics of feature engineering in this article as well as how to apply it to real-world datasets in Python.

Section 1: Dealing with messy data

Our raw data often contains outliers, is messy, or is incomplete. The data must be preprocessed and cleaned before feature engineering can begin. This section explains how to handle missing values, outliers, and other common data cleaning techniques in Python.

This section covers some common Python data cleaning techniques that can aid in the cleanup of messy data.

Handling Missing Values:

There are many reasons for missing data in datasets, including errors in data entry, technical difficulties, or simply because the data was never collected. The most common way to handle missing values is to impute them with the mean, median, or mode of the values that are available. Another option is to fill in missing values by interpolation from neighboring data points.

Pandas provides several functions for handling missing values in Python. As an example, you can use the fillna() method to replace missing values with a specific value or the interpolate() method to interpolate missing values based on neighboring values.
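For illustration, a minimal sketch of both approaches on a hypothetical age column (the column name and values are invented):

import pandas as pd
import numpy as np

# Hypothetical DataFrame with a missing value
df = pd.DataFrame({'age': [25, np.nan, 32, 40]})

# Impute the missing value with the column mean
df['age_mean'] = df['age'].fillna(df['age'].mean())

# Or interpolate linearly from the neighboring rows
df['age_interp'] = df['age'].interpolate()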

Detection and removal of outliers:

An outlier is a data point that differs significantly from the majority of the data points in a dataset. It is possible for outliers to result from measurement errors or to indicate interesting phenomena, but they can also negatively affect the performance of machine learning models.

The z-score or modified z-score method is one of the most common methods for detecting outliers. The z-score is a statistical measure that indicates how many standard deviations a data point lies from the mean of a dataset. The modified z-score is a more robust variant that uses the median and the median absolute deviation instead of the mean and standard deviation, so the measure itself is less distorted by extreme values.

To detect and remove outliers from your dataset, you can use Scikit-learn in Python. The EllipticEnvelope class, for example, can be used to fit an elliptic envelope to a dataset and identify outliers.
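A short sketch of both ideas on synthetic two-dimensional data: the z-score rule flags points more than three standard deviations from the mean, and EllipticEnvelope flags points outside a robustly fitted Gaussian envelope. The data and the contamination value are assumptions chosen for illustration.

import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),   # inliers
               np.array([[8.0, 8.0]])])           # one obvious outlier

# z-score rule: flag points more than 3 standard deviations from the mean
z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
z_outliers = (z > 3).any(axis=1)

# EllipticEnvelope: fit a robust Gaussian envelope; -1 marks outliers
detector = EllipticEnvelope(contamination=0.01, random_state=0)
labels = detector.fit_predict(X)
X_clean = X[labels == 1]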

Data Transformation:

The data we work with may not always be in a format suitable for machine learning models. There may be a need to convert categorical data to numerical data, or skewed data may need to be transformed in order to conform to statistical assumptions.

It is common for data transformations to use feature scaling, such as standardization or normalization, to scale the features to a similar range. A logarithmic transformation can also be used to reduce the effect of outliers.

To transform your data in Python, you can use Scikit-learn. You can, for instance, use the StandardScaler class to standardize your features, or the PowerTransformer class to apply a power transformation (Box-Cox or Yeo-Johnson) that reduces skew.
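A minimal sketch on a synthetic right-skewed feature (the data is made up for illustration):

import numpy as np
from sklearn.preprocessing import StandardScaler, PowerTransformer

rng = np.random.RandomState(0)
X = rng.exponential(scale=2.0, size=(200, 1))   # right-skewed synthetic feature

X_std = StandardScaler().fit_transform(X)    # zero mean, unit variance
X_pow = PowerTransformer().fit_transform(X)  # Yeo-Johnson power transform reduces skew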

Conclusion:

A crucial step in the machine learning pipeline is the handling of messy data. In this section, we reviewed some common data cleaning techniques in Python, including handling missing values, detecting and removing outliers, and transforming data. In order to prepare your data for feature engineering and machine learning modeling, you can use these techniques to preprocess your data.

Section 2: Text data processing

In many applications, such as natural language processing, sentiment analysis, and text classification, text data is a common type of data. Text data, however, requires special processing before it can be used for machine learning. We will cover how to preprocess text data using Python libraries such as NLTK and spaCy in this section.

Tokenization:

Tokenization involves breaking down text into smaller units, such as words or phrases, which are referred to as tokens. By tokenizing the text, you reduce its complexity and make it easier to analyze.

Python’s NLTK library offers a number of functions for tokenization, such as the word_tokenize() function, which splits text into words, and the sent_tokenize() function, which splits text into sentences.
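A small example with an invented sentence; the download call fetches the punkt tokenizer models on first use:

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt', quiet=True)  # one-time download of the tokenizer models

text = "Feature engineering matters. Tokenization is the first step."
print(sent_tokenize(text))
# ['Feature engineering matters.', 'Tokenization is the first step.']
print(word_tokenize(text))
# ['Feature', 'engineering', 'matters', '.', 'Tokenization', 'is', 'the', 'first', 'step', '.']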

Stemming and Lemmatization:

In stemming and lemmatization, the inflectional and derivational forms of words are reduced to their base or root form. Using this technique can reduce the dimensionality of the text data and improve the performance of machine learning algorithms.

In Python, the NLTK library provides several functions for stemming and lemmatization, such as PorterStemmer and SnowballStemmer for stemming, and WordNetLemmatizer for lemmatization.
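A brief comparison of the two on a few example words; the download call fetches the WordNet data that the lemmatizer needs:

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet', quiet=True)  # lexical database used by the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("running"))                    # 'run'
print(stemmer.stem("studies"))                    # 'studi'  (crude suffix stripping)
print(lemmatizer.lemmatize("studies"))            # 'study'  (dictionary-based)
print(lemmatizer.lemmatize("running", pos="v"))   # 'run'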

Stop Word Removal:

Stop words are words that do not have much meaning and can be removed from a text without affecting the overall meaning. The removal of stop words can help reduce the dimensionality of text data and improve the performance of machine learning models.

Python’s NLTK library provides a list of stop words that can be used to remove stop words from text.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt', quiet=True)      # tokenizer models
nltk.download('stopwords', quiet=True)  # stop word lists

example_sent = """This is a sample sentence,
    showing off the stop words filtration."""

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]

print(word_tokens)
print(filtered_sentence)

Part-of-Speech Tagging:

The process of tagging words based on their parts of speech is known as part-of-speech tagging. In machine learning models, part-of-speech tagging can assist in identifying the syntactic structure of the text.

Python’s NLTK library provides several functions for part-of-speech tagging, such as the pos_tag() function, which assigns part-of-speech tags to each word.

import nltk
from nltk import word_tokenize, pos_tag

nltk.download('punkt', quiet=True)                        # tokenizer models
nltk.download('averaged_perceptron_tagger', quiet=True)   # POS tagger model

print(pos_tag(word_tokenize("I'm learning NLP")))
# Output:
# [('I', 'PRP'), ("'m", 'VBP'), ('learning', 'VBG'), ('NLP', 'NNP')]

Named Entity Recognition:

Identifying and classifying named entities in text, such as people, organizations, and locations, is known as named entity recognition. In machine learning models, named entity recognition can be used to extract important information from text.

The spaCy library in Python provides strong support for named entity recognition: after processing a text, the ents attribute of the Doc object contains the named entities found in the text, each with a predicted label.
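A minimal sketch, assuming the small English model en_core_web_sm has been installed (python -m spacy download en_core_web_sm); the sample sentence and the labels shown in the comments are illustrative:

import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple is opening a new office in London next year.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# Typical output:
# Apple ORG
# London GPE
# next year DATE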

Conclusion:

It is necessary to treat text data in a special manner before we can use it for machine learning. Using Python libraries such as NLTK and spaCy, we covered some common text preprocessing techniques, such as tokenization, stemming, lemmatization, stop words removal, and part-of-speech tagging. These techniques can assist you in preprocessing your text data and preparing it for feature engineering and machine learning.


Section 3: Conforming to statistical assumptions

Machine learning models often make assumptions about the distribution of the data. Linear regression, for example, assumes that the residuals are normally distributed, while models such as logistic regression and linear SVMs work best when the classes are approximately linearly separable in the feature space. Using Python libraries such as Scikit-learn, we will demonstrate how to transform the data so that it conforms to these statistical assumptions.

Normality Transformation:

A normality transformation involves transforming the data so that it conforms to a normal distribution. Machine learning models that assume a normal distribution, such as linear regression, can benefit from normality transformations.

It is possible to perform normality transformations in Python using Scikit-learn. As an example, the PowerTransformer class can be used to perform a power transformation, which will transform the data into a normal distribution.
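A small sketch on a synthetic, heavily right-skewed feature, comparing skewness before and after the transformation (PowerTransformer uses the Yeo-Johnson method by default):

import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 1))  # heavily right-skewed

X_norm = PowerTransformer(method='yeo-johnson').fit_transform(X)

print('skew before:', skew(X.ravel()))
print('skew after: ', skew(X_norm.ravel()))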

Logarithmic Transformation:

A logarithmic transformation involves taking the logarithm of the data. A logarithmic transformation can reduce the effect of outliers and make the data more symmetric, which can improve the performance of a machine learning model.

To perform a logarithmic transformation in Python, you can take the logarithm directly with NumPy, or wrap it in Scikit-learn's FunctionTransformer so that it fits into a pipeline. (Note that PowerTransformer does not offer a 'log' method; it implements the Box-Cox and Yeo-Johnson families, of which the logarithm is a limiting case.)
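A minimal sketch using log1p (the logarithm of one plus x, which is safe for zeros), wrapped in FunctionTransformer so it can be used inside a Scikit-learn pipeline; the data is synthetic:

import numpy as np
from sklearn.preprocessing import FunctionTransformer

rng = np.random.RandomState(0)
X = rng.exponential(scale=3.0, size=(200, 1))  # non-negative, right-skewed

# log1p = log(1 + x); expm1 is the exact inverse
log_transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1)
X_log = log_transformer.fit_transform(X)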

Box-Cox Transformation:

Box-Cox transformations are a group of power transformations that can be used to transform data in order to conform to a normal distribution. Machine learning models that assume a normal distribution, such as linear regression, can be improved with Box-Cox transformations.

Scikit-learn can be used to perform Box-Cox transformations in Python. The PowerTransformer class can be used to perform Box-Cox transformations, for example, by passing method=’box-cox’ as a parameter.
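A short sketch on synthetic, strictly positive data (Box-Cox requires positive values):

import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
X = rng.lognormal(size=(300, 1))  # strictly positive, right-skewed

boxcox = PowerTransformer(method='box-cox')
X_bc = boxcox.fit_transform(X)

print('estimated lambda:', boxcox.lambdas_)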

Scaling:

Scaling refers to the process of adjusting the data so that it falls within a similar range. Machine learning models such as k-nearest neighbors and support vector machines can benefit from scaling.

Scaling can be performed with Scikit-learn in Python. You can use the StandardScaler class to standardize the data or the MinMaxScaler class to scale the data between 0 and 1.
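A tiny example contrasting the two scalers on a made-up two-column array:

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])

X_std = StandardScaler().fit_transform(X)  # zero mean, unit variance per column
X_mm = MinMaxScaler().fit_transform(X)     # each column rescaled to [0, 1]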

Conclusion:

As part of the machine learning pipeline, it is important to adhere to statistical assumptions. Using Python libraries such as Scikit-learn, we explored some common techniques for transforming the data to conform to statistical assumptions, such as normality transformation, logarithmic transformation, Box-Cox transformation, and scaling. Utilizing these techniques, you can transform your data in order to improve the performance of machine learning models that make assumptions about the underlying distribution of the data.


Section 4: Feature Selection and Extraction

Once the data has been preprocessed, we can begin selecting and extracting relevant features. The purpose of this section is to introduce the most popular techniques for selecting and extracting features using Python libraries, such as Scikit-Learn.

Correlation-Based Feature Selection:

The correlation-based feature selection method selects features according to their correlation with the target variable. Feature selection based on correlation can reduce the dimensionality of data and improve the performance of machine learning algorithms.

To perform correlation-based feature selection in Python, you can use Scikit-learn. For regression problems, the SelectKBest class with score_func=f_regression (an F-test based on the correlation between each feature and the target) selects the k features most correlated with the target variable.
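A minimal sketch on a synthetic regression dataset, keeping the five highest-scoring features:

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: 20 features, only 5 of which are informative
X, y = make_regression(n_samples=200, n_features=20, n_informative=5, random_state=0)

selector = SelectKBest(score_func=f_regression, k=5)
X_selected = selector.fit_transform(X, y)

print('selected feature indices:', selector.get_support(indices=True))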

Mutual Information-Based Feature Selection:

Feature selection based on mutual information selects the features that share the most mutual information, a measure of statistical dependency, with the target variable. Mutual information-based feature selection can improve the performance of machine learning models by identifying features related to the target variable, including nonlinear relationships that correlation would miss.

Using Scikit-learn in Python, you can select features based on mutual information. The SelectKBest class can be used with the parameter score_func=mutual_info_classif to select the k features with the highest mutual information with the target variable, for example.
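A short sketch on the Iris dataset, keeping the two features with the highest mutual information scores:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_selected = selector.fit_transform(X, y)

print('mutual information scores:', selector.scores_)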

Principal Component Analysis (PCA):

Principal component analysis involves transforming the data into a lower-dimensional space while retaining as much variance as possible. PCA can reduce the dimensionality of data and improve the performance of machine learning algorithms.

PCA can be performed in Python using Scikit-learn. As an example, you can use the PCA class to perform PCA and select the number of principal components based on the amount of variance you wish to retain.
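A minimal sketch on the Iris dataset, letting PCA choose how many components are needed to retain 95% of the variance:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# A float n_components keeps enough components to explain that fraction of variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X)

print('components kept:', pca.n_components_)
print('variance explained:', pca.explained_variance_ratio_)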

Singular Value Decomposition (SVD):

Singular value decomposition factorizes the data into a set of orthogonal components. SVD can be used to reduce the dimensionality of the data and thereby improve the performance of machine learning models, and its truncated form works directly on sparse matrices such as TF-IDF features.

To perform SVD in Python, you can use Scikit-learn. It is possible to perform SVD using the TruncatedSVD class and select the number of components based on the amount of variance you wish to retain.
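A short sketch reducing a sparse TF-IDF matrix to two components; the toy documents are invented for illustration:

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["feature engineering for text",
        "dimensionality reduction with svd",
        "svd works well on sparse text features"]

tfidf = TfidfVectorizer().fit_transform(docs)  # sparse TF-IDF matrix

svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf)

print('variance explained:', svd.explained_variance_ratio_.sum())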

Feature Extraction from Text Data:

Extracting features from text data requires special consideration. In addition to the preprocessing techniques discussed in Section 2, there are several techniques for converting text into numeric features, such as bag-of-words and term frequency-inverse document frequency (TF-IDF).

To extract features from text data, you can use Scikit-learn in Python. Text data can be converted into a bag-of-words representation using the CountVectorizer class, or into a TF-IDF representation using the TfidfVectorizer class.
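A tiny example on two made-up documents, showing the learned vocabulary and the resulting count matrix:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog ate my homework"]

bow = CountVectorizer()
X_bow = bow.fit_transform(docs)        # raw token counts
print(bow.get_feature_names_out())
print(X_bow.toarray())

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)    # counts reweighted by inverse document frequency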

Conclusion:

An important step in the machine learning pipeline is the selection and extraction of features. Using Python libraries such as Scikit-learn, we examined some popular feature selection and extraction techniques, including correlation-based feature selection, mutual information-based feature selection, PCA, SVD, and feature extraction from text. Using these techniques, you can select and extract relevant features from your data and improve machine learning performance.


Section 5: Python Code Sample

Using Python libraries such as Pandas, NLTK, and Scikit-learn, we demonstrate some of the feature engineering techniques covered in this article. The code sample performs data cleaning, text data processing, and feature selection and extraction on a hypothetical dataset loaded from a CSV file (data.csv) that contains a text column.


# Import necessary libraries
import pandas as pd
import numpy as np
import re
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# NLTK English stop word list (used for the stopword-count feature below)
stop_words = set(stopwords.words('english'))

# Load data
data = pd.read_csv('data.csv')

# Clean text data
data['text'] = data['text'].str.lower() # Convert text to lowercase
data['text'] = data['text'].apply(lambda x: re.sub(r'[^\w\s]', '', x)) # Remove punctuations
data['text'] = data['text'].apply(lambda x: re.sub(r'\d+', '', x)) # Remove digits

# Create new features
data['word_count'] = data['text'].apply(lambda x: len(str(x).split(" "))) # Count number of words
data['char_count'] = data['text'].apply(lambda x: len(str(x))) # Count number of characters
data['stopwords'] = data['text'].apply(lambda x: len([word for word in str(x).lower().split() if word in stop_words])) # Count number of stopwords
data['numerics'] = data['text'].apply(lambda x: len([num for num in str(x) if num.isdigit()])) # Count number of numerics
data['upper'] = data['text'].apply(lambda x: len([word for word in str(x).split() if word.isupper()])) # Count number of uppercase words

# Create bag of words features
count_vectorizer = CountVectorizer(stop_words='english')
count_features = count_vectorizer.fit_transform(data['text'])
count_features = pd.DataFrame(count_features.toarray(), columns=count_vectorizer.get_feature_names_out())

# Create TF-IDF features
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_features = tfidf_vectorizer.fit_transform(data['text'])
tfidf_features = pd.DataFrame(tfidf_features.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Standardize numeric features
numeric_features = ['word_count', 'char_count', 'stopwords', 'numerics', 'upper']
scaler = StandardScaler()
scaled_features = scaler.fit_transform(data[numeric_features])
scaled_features = pd.DataFrame(scaled_features, columns=numeric_features)

# Concatenate all features
all_features = pd.concat([count_features, tfidf_features, scaled_features], axis=1)

# Use feature selection techniques (e.g., PCA, Lasso, etc.) to select the most important features
# Use PCA for feature selection
from sklearn.decomposition import PCA

# Set number of components
n_components = 10

# Create PCA object and fit to data
pca = PCA(n_components=n_components)
pca.fit(all_features)

# Transform data
pca_features = pca.transform(all_features)

# View variance explained by each component
print('Variance explained by each component:', pca.explained_variance_ratio_)

# View cumulative variance explained
print('Cumulative variance explained:', np.cumsum(pca.explained_variance_ratio_))

# Select the most important features
selected_features = pca_features[:, :n_components]

Conclusion

Feature engineering turns raw, messy data into informative inputs for machine learning models. In this article, we covered cleaning messy data, preprocessing text, transforming data to meet statistical assumptions, selecting and extracting features, and a Python code sample that ties these steps together. Applied together, these techniques can substantially improve the performance of machine learning models.

Quick references
