Quantico: Feature Engineering
Quantico offers a wide range of operations that assist in the data science and analytics process. One major component is feature engineering, which is what I'll cover in this article. I treat feature engineering as separate from data wrangling, which I covered in a previous article. Some functions from the unsupervised learning set of tasks, such as word2vec, anomaly detection, and dimensionality reduction, can also be used for feature engineering, but I'll cover those in another article.
Feature engineering is one of those data wrangling subsets that I believe is much easier to manage in a programming language, such as R or Python, especially while iterating within a project. New features can be explored at a faster pace and created with more flexibility, some operations are simply easier to execute outside of SQL, and features can be more dynamic and more easily reused by storing functions that are common across projects.
With Quantico, feature engineering relies on my R package on GitHub called Rodeo (R Optimized Data Engineering Operations), which uses the R data.table package under the hood. A user has a variety of methods available to them that I categorize into four buckets (listed below). The methods list is nearly comprehensive for most projects. If you would like to see additional methods made available, please make requests in the issues section of the GitHub repo: https://github.com/AdrianAntico/Rodeo/issues
Lastly, after you run your feature engineering operations, feel free to jump to the Code Print output tab to retrieve the code that was used to run the operations. It will showcase the Rodeo functions utilized. If you're curious about what goes on under the hood, feel free to browse the source code on GitHub.
Top Level Feature Engineering Categories
The four categories are windowing functions, numeric functions, categorical functions, and calendar functions.
Windowing Functions
The windowing functions include three types:
Lags and Rolling Statistics for Numeric Variables
The lags and rolling stats for numeric variables are some of the most important feature engineering methods available as they typically lead to big model improvements, unless you're working with strictly cross-sectional data, which typically isn't the case in business settings.
These methods allow a user to generate various lags and rolling stats, by grouping variables if requested, and they include:
These variables generally fall into two categories: autoregressive and distributed lag.
The autoregressive version simply means that the lags and rolling stats are generated from the target variable of interest, which is common in forecasting problems but not limited to them. The distributed lag version refers to lags and rolling stats generated from independent variables.
Examples:
Consider a model to predict whether a student will drop out of school, or whether an employee will still be working at a company a year from now. For the school example, GPA is a natural choice of variable, and it is essentially a moving average that covers the entire student history. Alternatively, you could generate a quasi-GPA that only covers the previous semester. For the employee retention model, you can look at previous employee reviews, rolling averages of sick days, rolling averages of late arrivals, etc.
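To make the idea concrete, here is a minimal sketch in plain Python of a lag and a trailing rolling mean computed within groups. It is not Rodeo's API: the function name `add_lag_and_rolling_mean`, its parameters, and the sample employee data are all hypothetical, and I'm assuming the rows are already sorted by time within each group.

```python
from collections import defaultdict


def add_lag_and_rolling_mean(rows, group_key, value_key, lag=1, window=2):
    """Add a lag and a trailing rolling mean of value_key within each group.

    Assumes rows are sorted by time within each group and lag >= 1.
    The rolling mean uses prior rows only, so a row's own value never
    leaks into its own feature.
    """
    history = defaultdict(list)  # group -> values seen so far
    out = []
    for row in rows:
        past = history[row[group_key]]
        new_row = dict(row)
        # Lag: the value observed `lag` periods ago, if it exists.
        new_row[f"{value_key}_lag{lag}"] = past[-lag] if len(past) >= lag else None
        # Rolling mean: average of the previous `window` values, once available.
        recent = past[-window:]
        new_row[f"{value_key}_roll_mean{window}"] = (
            sum(recent) / window if len(recent) == window else None
        )
        out.append(new_row)
        past.append(row[value_key])
    return out


# Hypothetical data: monthly sick days per employee.
rows = [
    {"employee": "a", "month": 1, "sick_days": 2},
    {"employee": "a", "month": 2, "sick_days": 0},
    {"employee": "a", "month": 3, "sick_days": 4},
    {"employee": "b", "month": 1, "sick_days": 1},
    {"employee": "b", "month": 2, "sick_days": 3},
]
features = add_lag_and_rolling_mean(rows, "employee", "sick_days")
```

If `value_key` is the target variable, these are autoregressive features. If it is an independent variable, as sick days would be in a retention model, they are distributed lag features.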
Lags and Rolling Mode for Categorical Variables
Lags and rolling stats for categorical variables are much less common but can be a great set of features to create if possible, for the same reasons mentioned above.
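A rolling mode can be sketched in a few lines of plain Python. This is only an illustration of the concept, not Rodeo's implementation. The function name `rolling_mode`, the choice to look only at prior values, and the review labels are my assumptions.

```python
from collections import Counter


def rolling_mode(values, window=3):
    """Most frequent category among the previous `window` values.

    Returns None until a full window of prior values exists. Ties go to
    the category that appeared first within the window.
    """
    out = []
    for i in range(len(values)):
        recent = values[max(0, i - window):i]  # prior values only
        if len(recent) < window:
            out.append(None)
        else:
            out.append(Counter(recent).most_common(1)[0][0])
    return out


# Hypothetical sequence of employee review outcomes, oldest first.
reviews = ["good", "good", "poor", "poor", "poor", "good"]
modes = rolling_mode(reviews, window=3)
```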
Differencing of Numeric, Categorical, and Date Variables
Differencing is another great way to extract information about the past and use it for prediction purposes.
Suppose you're predicting the time until default on a loan and you have access to people's bank account information. Differencing can be extremely useful in that you can look at their current balance vs. their balance a month ago, or two months ago, etc. You can also look at their previous month's balance vs. the month before that, and the second month before that, etc. That is an example of differencing with a numeric variable.
Next, you could look at the days between payments on their debt. If someone typically makes payments weekly and then switches to monthly, that could mean that they're having trouble with money, which increases the possibility of default. That is an example of differencing with a date variable.
Lastly, let's say you have a credit rating, but not the actual number. Suppose further that the ratings are A, B, and C. With categorical differencing, you can create difference variables that show any transitions from one category to another, which could be indicative of default or the opposite, depending on which state the person came from and which state they moved to.
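The three kinds of differencing in the loan example can be sketched in plain Python as follows. These helpers are hypothetical and are not Rodeo functions. The sample balances, payment dates, and ratings are made up, and representing a categorical difference as a "from->to" label is my assumption about one reasonable encoding.

```python
from datetime import date


def diff_numeric(values, lag=1):
    """Current value minus the value `lag` periods ago."""
    return [None if i < lag else values[i] - values[i - lag]
            for i in range(len(values))]


def diff_days(dates, lag=1):
    """Days elapsed between each date and the date `lag` periods ago."""
    return [None if i < lag else (dates[i] - dates[i - lag]).days
            for i in range(len(dates))]


def diff_categorical(values, lag=1):
    """Label the transition from the category `lag` periods ago to the current one."""
    return [None if i < lag else f"{values[i - lag]}->{values[i]}"
            for i in range(len(values))]


# Hypothetical borrower history, oldest first.
balances = [1200, 900, 950, 400]
payment_dates = [date(2024, 1, 1), date(2024, 1, 8),
                 date(2024, 1, 15), date(2024, 2, 14)]
ratings = ["A", "A", "B", "C"]

balance_change_1m = diff_numeric(balances, lag=1)
balance_change_2m = diff_numeric(balances, lag=2)
days_between_payments = diff_days(payment_dates)
rating_transitions = diff_categorical(ratings)
```

In this made-up history the gap between payments jumps from 7 days to 30 and the rating moves from A to B to C, which are exactly the kinds of signals described above.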
This wraps up the Windowing functions.
Numeric Functions
The numeric functions include four types:
Percent Rank Function
Creates a new variable that is the percent rank of a numeric column. Because the values are percentiles, they lie within [0,1], and the transformed variable is approximately uniformly distributed. This is the transformation used to convert a scatterplot into a copula plot.
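Here is a minimal Python sketch of a percent rank, assuming the "fraction of observations less than or equal to the value" convention. Other conventions exist (for example, ones that start at 0), and I'm not claiming this is the one Rodeo uses. The function name `percent_rank` is hypothetical.

```python
from bisect import bisect_right


def percent_rank(values):
    """Fraction of observations less than or equal to each value.

    Under this convention results lie in (0, 1], and tied values share
    the same rank.
    """
    n = len(values)
    sorted_vals = sorted(values)
    # bisect_right counts how many sorted values are <= v.
    return [bisect_right(sorted_vals, v) / n for v in values]


ranks = percent_rank([10, 30, 20, 40])
tied_ranks = percent_rank([5, 5, 10, 20])
```

Applying this to both the x and y variables of a scatterplot and plotting the results against each other gives the copula-style view mentioned above.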
Standardize Function
The standardize function offers centering and scaling, and both can be done by group variables. Centering subtracts the mean (or group-level mean) from the actual value. Scaling divides that value by the standard deviation (or group-level standard deviation).
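Centering and scaling by group can be sketched like this in plain Python. The function name `standardize`, its arguments, and the sample data are hypothetical. I'm also assuming the sample standard deviation, at least two rows per group, and a nonzero spread within each group.

```python
from collections import defaultdict
from statistics import mean, stdev


def standardize(rows, value_key, group_key=None, center=True, scale=True):
    """Center and/or scale value_key, optionally within groups.

    Assumes every group has at least two rows and a nonzero standard
    deviation. Uses the sample standard deviation.
    """
    def group_of(row):
        return row[group_key] if group_key else None

    # Collect the values per group (a single group when group_key is None).
    groups = defaultdict(list)
    for row in rows:
        groups[group_of(row)].append(row[value_key])
    stats = {g: (mean(v), stdev(v)) for g, v in groups.items()}

    out = []
    for row in rows:
        mu, sd = stats[group_of(row)]
        x = row[value_key]
        if center:
            x = x - mu  # subtract the (group-level) mean
        if scale:
            x = x / sd  # divide by the (group-level) standard deviation
        out.append(x)
    return out


# Hypothetical store sales data.
rows = [
    {"store": "s1", "sales": 10},
    {"store": "s1", "sales": 20},
    {"store": "s1", "sales": 30},
    {"store": "s2", "sales": 100},
    {"store": "s2", "sales": 300},
]
z_by_store = standardize(rows, "sales", group_key="store")
centered_only = standardize(rows, "sales", group_key="store", scale=False)
```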
Transform Function
The transformation function has several transform methods, including:
Interaction Function
The interaction function will multiply numeric variables together to create interactions. You can specify to have the variables standardized first to prevent numeric overflow.
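As a rough Python sketch of the idea, the snippet below builds all pairwise products of a set of numeric columns, with an option to z-score each column first. The names `zscore` and `interactions`, the "a_x_b" naming of the output columns, and the sample data are my assumptions and are not taken from Rodeo.

```python
from itertools import combinations
from statistics import mean, stdev


def zscore(values):
    """Center on the mean and scale by the sample standard deviation."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]


def interactions(columns, standardize_first=False):
    """Pairwise products of numeric columns.

    columns maps a column name to a list of numbers. Standardizing first
    keeps the products on a comparable, modest scale.
    """
    if standardize_first:
        columns = {name: zscore(vals) for name, vals in columns.items()}
    return {
        f"{a}_x_{b}": [x * y for x, y in zip(columns[a], columns[b])]
        for a, b in combinations(columns, 2)
    }


# Hypothetical columns.
cols = {"price": [1.0, 2.0, 3.0], "qty": [10.0, 20.0, 30.0]}
raw = interactions(cols)
scaled = interactions(cols, standardize_first=True)
```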
Categorical Functions
The categorical functions include two types:
Categorical Encoding
Categorical encoding will convert a categorical variable to a numeric one via one of many methods, which include:
Partial Dummies
The partial dummies function can generate a full set of dummy variables, but where it differs from pretty much every other method is that you can specify that only some of the group levels be converted to dummy variables while the rest are simply left out. I sometimes do this in conjunction with categorical encoding to try to push a little more importance onto some group levels.
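The concept is simple enough to sketch in a few lines of Python. This is an illustration only. The function name `partial_dummies`, the "is_" column prefix, and the region data are hypothetical.

```python
def partial_dummies(values, levels):
    """Create 0/1 indicator columns for the requested levels only.

    Levels that are not requested get no column, so rows with those
    levels are 0 in every indicator.
    """
    return {
        f"is_{level}": [1 if v == level else 0 for v in values]
        for level in levels
    }


# Hypothetical categorical column with four levels; only two get dummies.
regions = ["east", "west", "north", "east", "south"]
dummies = partial_dummies(regions, levels=["east", "west"])
```

Passing every distinct level in `levels` would reproduce a full set of dummy variables.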
Calendar Functions
The calendar functions are really useful to extract information out of your date variables and there are two types included:
Calendar Variables
The calendar variables function converts a date variable into numeric variables to indicate a time unit for the following:
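As a rough illustration of turning a date into numeric time-unit variables, here is a plain Python sketch using the standard library. The specific units shown (weekday, ISO week, month, quarter, year, day of year) and the function name `calendar_variables` are my assumptions for the example and are not necessarily the set Quantico produces.

```python
from datetime import date


def calendar_variables(d):
    """Numeric calendar features for a single date."""
    return {
        "weekday": d.weekday() + 1,        # 1 = Monday ... 7 = Sunday
        "iso_week": d.isocalendar()[1],    # ISO 8601 week number
        "month": d.month,
        "quarter": (d.month - 1) // 3 + 1,
        "year": d.year,
        "day_of_year": d.timetuple().tm_yday,
    }


features = calendar_variables(date(2024, 2, 14))
```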
Holiday Variables
Holiday variables are currently included for: