Quantico offers a wide range of operations that assist in the data science & analytics process. One of its major components is feature engineering, and that's what I'll be covering in this article. I classify feature engineering as separate from data wrangling, which I covered in a previous article. Some functions from the unsupervised learning set of tasks, such as word2vec, anomaly detection, and dimensionality reduction, can also be used for feature engineering, but I'll cover those in another article.

Feature engineering is one of those data wrangling subsets that I believe is much easier to manage in a programming language, such as R or Python, especially while iterating within a project. New features can be explored at a faster pace, some operations are simply easier to execute outside of SQL, and features can be created with more flexibility, made more dynamic, and reused more easily by storing functions that are common across projects.

With Quantico, feature engineering relies on my R package on GitHub called Rodeo (R Optimized Data Engineering Operations), which uses the R data.table package under the hood. A user has a variety of methods available that I categorize into four buckets (listed below). The list of methods is nearly comprehensive for most projects. If you would like to see additional methods made available, please make requests in the issues section of the GitHub repo: https://github.com/AdrianAntico/Rodeo/issues

Lastly, after you run your feature engineering operations, feel free to jump to the Code Print output tab to retrieve the code that was used to run them. The code will showcase the Rodeo functions utilized. If you're curious about what goes on under the hood, feel free to browse the source code on GitHub.

Top Level Feature Engineering Categories

Windowing
Numeric
Categorical
Calendar

Windowing Functions

The windowing functions include three types:

Lags and Rolling Statistics for Numeric Variables
Lags and Rolling Mode for Categorical Variables
Differencing of Numeric, Categorical, and Date Variables

Lags and Rolling Statistics for Numeric Variables

The lags and rolling stats for numeric variables are some of the most important feature engineering methods available as they typically lead to big model improvements, unless you're working with strictly cross-sectional data, which typically isn't the case in business settings.

The methods allow a user to generate various lags and rolling stats, optionally by grouping variables, and they include:

Lags
Rolling Mean
Rolling Standard Deviation
Rolling Skewness
Rolling Kurtosis
Rolling Percentiles

There are generally two categories that are associated with these types of variables:

Autoregressive (and moving average)
Distributed Lag (and moving average)

The autoregressive version simply means that the lags and rolling stats are generated from the target variable of interest, which is common in forecasting problems but not limited to them. The distributed lag version refers to lags and rolling stats generated from independent variables.

Examples:

Consider a model to predict whether a student will drop out of school, or whether an employee will still be working at a company a year from now. For the school example, GPA is a natural choice of variable, and it is essentially a moving average that covers the entire student history. Alternatively, you could generate a quasi-GPA that only covers the previous semester. For the employee retention model, you can look at previous employee reviews, rolling averages of sick days, rolling averages of late arrivals, etc.
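As a rough illustration of the quasi-GPA idea, here is a minimal sketch in plain Python of a lag and a rolling mean computed per group. It is not Rodeo's implementation: the function name lag_and_rolling_mean, the column names, and the choice to build both features from earlier rows only (so the current value never feeds its own features) are assumptions made for this example.

```python
from collections import defaultdict

def lag_and_rolling_mean(rows, group_key, value_key, lag=1, window=3):
    """Return copies of rows with a lag and a rolling mean of prior values.

    Rows are assumed to be sorted by time within each group. Positions
    without enough history get None.
    """
    history = defaultdict(list)  # values seen so far, per group
    out = []
    for row in rows:
        past = history[row[group_key]]
        new_row = dict(row)
        # Value observed `lag` steps earlier within the same group
        new_row[f"{value_key}_lag{lag}"] = past[-lag] if len(past) >= lag else None
        # Mean of the last `window` earlier values within the same group
        recent = past[-window:]
        new_row[f"{value_key}_rollmean{window}"] = (
            sum(recent) / window if len(recent) == window else None
        )
        past.append(row[value_key])
        out.append(new_row)
    return out

rows = [
    {"student": "a", "term": 1, "gpa": 3.0},
    {"student": "a", "term": 2, "gpa": 3.4},
    {"student": "b", "term": 1, "gpa": 2.0},
    {"student": "a", "term": 3, "gpa": 3.8},
    {"student": "b", "term": 2, "gpa": 2.6},
]
features = lag_and_rolling_mean(rows, "student", "gpa", lag=1, window=2)
```

Because the history is tracked per group, student "a" and student "b" never see each other's values, which is the point of generating these features by grouping variables.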

Lags and Rolling Mode for Categorical Variables

Lags and rolling stats for categorical variables are much less common but can be a great set of features to create if possible, for the same reasons mentioned above.
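A rolling mode can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not Rodeo's code: the function name rolling_mode is hypothetical, the window covers the previous values only, and ties go to the value seen first in the window.

```python
from collections import Counter

def rolling_mode(values, window):
    """Most frequent value among the previous `window` entries.

    The current entry is excluded. Positions without a full window of
    history get None. Ties go to the value encountered first in the window.
    """
    out = []
    for i in range(len(values)):
        if i < window:
            out.append(None)
            continue
        counts = Counter(values[i - window:i])
        out.append(counts.most_common(1)[0][0])
    return out

plans = ["basic", "basic", "pro", "pro", "pro", "basic"]
modes = rolling_mode(plans, window=3)
```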

Differencing of Numeric, Categorical, and Date Variables

Differencing is another great way to extract information about the past and use it for prediction purposes.

Suppose you're predicting the time until default on a loan and you have access to people's bank account information. Differencing can be extremely useful in that you can look at their current balance vs. their balance a month ago, or two months ago, etc. You can also look at their previous month's balance vs. the month before that, and the second month before that, etc. That is an example of differencing with a numeric variable.

Next, you could look at the days between payments on their debt. If someone typically makes payments weekly and then switches to monthly, that could mean they're having trouble with money, thus increasing the possibility of default. That is an example of differencing with a date variable.

Lastly, let's say you have a credit rating, but not the actual number. Suppose further that the ratings are A, B, and C. With categorical differencing, you can create difference variables that show any transitions from one category to another, which could be either indicative of default or the opposite, depending on which state the person came from and which state they moved to.
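The three kinds of differencing described above can be sketched in plain Python as follows. The function names and the "A to B" label format for categorical transitions are assumptions chosen for this example, not the output format Rodeo produces.

```python
from datetime import date

def numeric_diff(values, lag=1):
    """Current value minus the value `lag` steps earlier."""
    return [None if i < lag else values[i] - values[i - lag]
            for i in range(len(values))]

def date_diff_days(dates, lag=1):
    """Days elapsed since the date `lag` steps earlier."""
    return [None if i < lag else (dates[i] - dates[i - lag]).days
            for i in range(len(dates))]

def categorical_diff(values, lag=1):
    """Label each step with its transition, for example 'A to B'."""
    return [None if i < lag else f"{values[i - lag]} to {values[i]}"
            for i in range(len(values))]

balances = [1200, 900, 950, 400]
payments = [date(2024, 1, 1), date(2024, 1, 8), date(2024, 1, 15), date(2024, 2, 14)]
ratings = ["A", "A", "B", "C"]

balance_change = numeric_diff(balances)
days_between = date_diff_days(payments)
rating_moves = categorical_diff(ratings)
```

In this toy data the gap between payments jumps from 7 days to 30 days, which is the weekly-to-monthly shift described in the loan example.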

This wraps up the Windowing functions.

Numeric Functions

The numeric functions include four types:

Percent Rank Function
Standardize Function
Transform Function
Interaction Function

Percent Rank Function

Create a new variable that is a percent rank of a numeric column. The values lie within [0,1] as they are percentile values. This transformation converts your variable to an approximately uniform distribution. It is also the transformation used to convert a scatterplot into a copula plot.
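There are several percent rank conventions, so the sketch below picks one and says so: (rank - 1) / (n - 1) with tied values sharing the lowest rank, which is the convention SQL's PERCENT_RANK uses. Whether Rodeo uses this exact formula is not stated in this article, so treat the function percent_rank here as an illustrative assumption.

```python
def percent_rank(values):
    """(rank - 1) / (n - 1), where tied values share the lowest rank."""
    n = len(values)
    if n < 2:
        return [0.0] * n
    first_pos = {}
    for pos, v in enumerate(sorted(values)):
        # Zero-based position of the first occurrence equals rank - 1
        first_pos.setdefault(v, pos)
    return [first_pos[v] / (n - 1) for v in values]

scores = [10, 40, 20, 40, 30]
ranks = percent_rank(scores)
```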

Standardize Function

The standardize function offers centering and scaling, and both can be applied by group variables. Centering subtracts the mean (or group level mean) from the actual value. Scaling divides that value by the standard deviation (or group level standard deviation).
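A minimal sketch of group-level centering and scaling, using only the Python standard library. The function name standardize_by_group and the use of the sample standard deviation are assumptions for this example. Each group needs at least two distinct values, otherwise the standard deviation is undefined or zero.

```python
from collections import defaultdict
from statistics import mean, stdev

def standardize_by_group(values, groups, center=True, scale=True):
    """Subtract the group mean and/or divide by the group sample std dev."""
    by_group = defaultdict(list)
    for v, g in zip(values, groups):
        by_group[g].append(v)
    # Per-group (mean, standard deviation)
    stats = {g: (mean(vs), stdev(vs)) for g, vs in by_group.items()}
    out = []
    for v, g in zip(values, groups):
        mu, sd = stats[g]
        if center:
            v = v - mu
        if scale:
            v = v / sd
        out.append(v)
    return out

sales = [10.0, 20.0, 30.0, 100.0, 200.0, 300.0]
store = ["a", "a", "a", "b", "b", "b"]
z = standardize_by_group(sales, store)
```

After standardizing by store, both stores end up on the same scale even though store "b" sells ten times as much.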

Transform Function

The transformation function has several transform methods, including:

Square Root
Asinh
Log
Log + a
Asin
Logit
BoxCox
YeoJohnson
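A few of these transforms are simple enough to write out directly. The sketch below covers asinh, log + a, logit, and Box-Cox with a fixed lambda. The function names are illustrative, and estimating the Box-Cox lambda from the data (which a full implementation would do) is deliberately left out.

```python
import math

def asinh_transform(x):
    """Inverse hyperbolic sine; defined for zero and negative values."""
    return math.asinh(x)

def log_plus_a(x, a=1.0):
    """log(x + a); the shift `a` allows zeros when a > 0."""
    return math.log(x + a)

def logit(p):
    """log(p / (1 - p)) for a proportion strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def box_cox(x, lam):
    """Box-Cox transform of a positive value for a given lambda."""
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam

examples = {
    "asinh(0)": asinh_transform(0.0),
    "log(0 + 1)": log_plus_a(0.0),
    "logit(0.5)": logit(0.5),
    "box_cox(4, 0.5)": box_cox(4.0, 0.5),
}
```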

Interaction Function

The interaction function multiplies numeric variables together to create interactions. You can choose to standardize the variables first to prevent numeric overflow.
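A pairwise interaction with optional standardization can be sketched like this. The names interaction and standardize_first are hypothetical, and the sample standard deviation is an assumed choice.

```python
from statistics import mean, stdev

def standardize(values):
    """Center on the mean and scale by the sample standard deviation."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]

def interaction(x, y, standardize_first=False):
    """Elementwise product of two numeric columns."""
    if standardize_first:
        x, y = standardize(x), standardize(y)
    return [a * b for a, b in zip(x, y)]

price = [1.0, 2.0, 3.0]
units = [10.0, 20.0, 30.0]
raw = interaction(price, units)
scaled = interaction(price, units, standardize_first=True)
```

Standardizing first keeps the product on a small scale regardless of the units of the inputs, which matters more as you multiply more or larger variables together.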

Categorical Functions

The categorical functions include two types:

Categorical Encoding
Partial Dummies

Categorical Encoding

Categorical encoding will convert a categorical variable to a numeric one via one of many methods, which include:

Credibility (James Stein)
Target Encoding
Weight of Evidence
Poly Encoding
Backward Difference
Helmert
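To show the general shape of these encoders, here is a minimal sketch of target encoding, where each level is replaced by the mean of the target for that level. The optional smoothing term, which shrinks small groups toward the global mean, is a common variant and an assumption of this example. It is not the formula Rodeo uses for its credibility or target encoding methods.

```python
from collections import defaultdict

def target_encode(categories, target, smoothing=0.0):
    """Replace each level with a (possibly shrunken) mean of the target.

    encoded = (level_sum + smoothing * global_mean) / (level_count + smoothing)
    """
    global_mean = sum(target) / len(target)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(categories, target):
        sums[c] += y
        counts[c] += 1
    encoding = {
        c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
        for c in counts
    }
    return [encoding[c] for c in categories], encoding

region = ["east", "east", "west", "west", "west", "north"]
churned = [1, 0, 1, 1, 1, 0]

encoded, mapping = target_encode(region, churned)
_, smoothed = target_encode(region, churned, smoothing=2.0)
```

With smoothing, the single "north" observation no longer maps to exactly 0 but to a value pulled toward the overall churn rate. In practice the mapping should be learned on training data only and then applied to validation and test data, to avoid leaking the target.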

Partial Dummies

The partial dummies function can generate a full set of dummy variables, but where it differs from pretty much every other method is that you can specify that only some of the group levels be converted to dummy variables while the rest are simply left out. I sometimes do this in conjunction with categorical encoding to try to push a little more importance to some group levels.
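The idea is small enough to sketch directly. The function name partial_dummies and the is_ column prefix below are assumptions for illustration.

```python
def partial_dummies(values, levels):
    """One 0/1 column per requested level; all other levels get no column."""
    return {
        f"is_{level}": [1 if v == level else 0 for v in values]
        for level in levels
    }

channel = ["web", "store", "phone", "web", "partner"]
dummies = partial_dummies(channel, levels=["web", "store"])
```

Rows with "phone" or "partner" are simply zero in every generated column.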

Calendar Functions

The calendar functions are really useful to extract information out of your date variables and there are two types included:

Calendar Variables
Holiday Variables

Calendar Variables

The calendar variables function converts a date variable into numeric variables to indicate a time unit for the following:

Second
Minute
Hour
Day of Week
Day of Month
Day of Year
Week of Month
Week of Year
Month of Year
Quarter of Year
Year
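The list above maps closely onto Python's datetime module. In the sketch below, the function name calendar_features and the specific conventions are assumptions for this example and may differ from Rodeo's: day of week runs from 1 (Monday) to 7 (Sunday), week of year is the ISO 8601 week number, and week of month counts blocks of seven days from the first of the month.

```python
from datetime import datetime

def calendar_features(ts):
    """Numeric calendar features for a single timestamp."""
    return {
        "second": ts.second,
        "minute": ts.minute,
        "hour": ts.hour,
        "day_of_week": ts.isoweekday(),        # 1 = Monday ... 7 = Sunday
        "day_of_month": ts.day,
        "day_of_year": ts.timetuple().tm_yday,
        "week_of_month": (ts.day - 1) // 7 + 1,
        "week_of_year": ts.isocalendar()[1],   # ISO 8601 week number
        "month_of_year": ts.month,
        "quarter_of_year": (ts.month - 1) // 3 + 1,
        "year": ts.year,
    }

features = calendar_features(datetime(2024, 3, 26, 14, 30, 5))
```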

Holiday Variables

Holiday variables are currently included for:

US Public Holidays
Easter Holidays
Christmas Group Holidays
Other Ecclesiastical Feasts
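Holiday features usually start from a lookup of holiday dates. The sketch below builds a tiny one: fixed-date holidays plus Easter, computed with the widely used anonymous Gregorian (Meeus/Jones/Butcher) algorithm. The handful of holidays chosen and the function names are assumptions for illustration. Rodeo's own holiday lists are far more complete.

```python
from datetime import date, timedelta

def easter_sunday(year):
    """Gregorian Easter Sunday via the anonymous (Meeus/Jones/Butcher) algorithm."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)

def holiday_dates(year):
    """A small, illustrative holiday calendar for one year."""
    easter = easter_sunday(year)
    return {
        easter - timedelta(days=2): "Good Friday",
        easter: "Easter Sunday",
        date(year, 7, 4): "Independence Day",
        date(year, 12, 25): "Christmas Day",
    }

def holiday_flag(d):
    """1 if the date is in the illustrative calendar, else 0."""
    return 1 if d in holiday_dates(d.year) else 0

flags = [holiday_flag(d) for d in (date(2024, 3, 28), date(2024, 3, 29), date(2024, 12, 25))]
```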

Quantico: Feature Engineering

Adrian Antico

Data Scientist and Open Source Contributor
