Feature Engineering techniques in Python
Anis Ayari
I build AI agents for your use case. Founder of DeeplayerAI / AI content creator / Head of AI / AI Speaker
Feature engineering is a crucial part of every Machine Learning project. In this article we will walk through some techniques to handle this task. Please do not hesitate to comment with new ideas; I will try to keep this article as up to date as possible.
Merge Train and Test
When performing feature engineering, in order to build a general model, it is always recommended to work on the whole DataFrame: if you have two files (train and test), just merge them.
df = pd.concat([train[col], test[col]], axis=0)
# The label column will be set as NULL for test rows
# FEATURE ENGINEERING HERE
train[col] = df[:len(train)]
test[col] = df[len(train):]
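As a minimal, runnable sketch of this round trip (the `price` column and its values are made up for illustration):

```python
import pandas as pd

# Toy train/test frames sharing one column (hypothetical names and values)
train = pd.DataFrame({"price": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"price": [4.0, 5.0]})

col = "price"
df = pd.concat([train[col], test[col]], axis=0)

# FEATURE ENGINEERING HERE (placeholder: simple scaling)
df = df * 2

# Split back by position: the first len(train) rows belong to train
train[col] = df[:len(train)].values
test[col] = df[len(train):].values
```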
Memory reduction
Sometimes the type encoding of a column is not the best choice, for example encoding as int32 a column containing only values from 0 to 10. A popular helper function reduces memory usage by converting each column to the smallest suitable type.
def reduce_mem_usage(df):
    """Iterate through all the columns of a dataframe and
    modify the data type to reduce memory usage."""
    start_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')
    end_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
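As an aside, pandas also ships a built-in downcast option that covers the integer part of this helper; a minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"small_ints": np.arange(10, dtype=np.int64)})

# downcast="integer" picks the smallest integer dtype that fits the values
df["small_ints"] = pd.to_numeric(df["small_ints"], downcast="integer")

print(df["small_ints"].dtype)
```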
Remove Outlier Values
A common way to remove outliers is to use the Z-score.
If you want to remove each row where at least one column contains an outlier (defined with the Z-score), you can use the following code:
from scipy import stats
df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]
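A toy demonstration with made-up data, where one extreme value is filtered out:

```python
import numpy as np
import pandas as pd
from scipy import stats

# 19 typical values and one extreme value (made-up data)
df = pd.DataFrame({"x": [2.0] * 19 + [100.0]})

# Keep only rows where every column's Z-score is below 3
clean = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]
```

Note that with very few rows, the outlier itself inflates the standard deviation, so the Z-score cut only works well when the sample is large enough.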
NAN trick
Some tree-based algorithms can handle NaN values, but they will add a split between NaN and non-NaN values, which can sometimes make no sense. A common trick is simply to fill all NaN values with a value lower than the lowest value in the column considered (for example -9999).
df[col].fillna(-9999, inplace=True)
Categorical Features
You can treat categorical features with label encoding to handle them as numeric. You can also decide to treat them as a category dtype. I recommend trying both and keeping whichever improves your Cross-Validation score, using this line of code (after label encoding):
df[col] = df[col].astype('category')
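Both options can be sketched together on made-up data (here using `pd.factorize` as one simple way to label encode):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red", "green"]})

# Label encoding: each category becomes an integer code, in order of appearance
df["color_label"], uniques = pd.factorize(df["color"])

# Category dtype: keeps the original values but stores them efficiently
df["color_cat"] = df["color"].astype("category")
```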
Combining / Splitting
Sometimes a string variable contains multiple pieces of information in one value, for example FRANCE_Paris. You will need to split it with a regex or a split method, for example:
new = df["localisation"].str.split("_", n=1, expand=True)
df['country'] = new[0]
df['city'] = new[1]
Conversely, two (string or numeric) columns can be combined into one. For example a column with a French department code (75, for Paris) and a district code (001) can become a zip code: 75001.
df['zipcode'] = df['department_code'].astype(str) + df['district_code'].astype(str)
Linear combinations
A common feature engineering step is to apply simple mathematical operations to create new features. For example, if we have the width and the height of a rectangle, we can calculate the area.
df['area'] = df['width'] * df['height']
Count column
Creating a column from the popular value_counts method is a powerful technique for tree-based algorithms, as it tells the model whether a value is rare or common.
counts = df[col].value_counts().to_dict()
df[col + '_counts'] = df[col].map(counts)
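A small runnable example of this count encoding, with a made-up brand column:

```python
import pandas as pd

# Made-up data: how often does each brand appear?
df = pd.DataFrame({"brand": ["apple", "apple", "nokia", "apple"]})

counts = df["brand"].value_counts().to_dict()
df["brand_counts"] = df["brand"].map(counts)
```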
Deal with Date
Dealing with dates and parsing each element of a date is crucial for analyzing events.
First, we need to convert our date column (often read as a string column by pandas). One of the most important skills is knowing how to use the format parameter. I strongly recommend saving this site as a bookmark! :)
For example, to convert a date column with the following format, 30 Sep 2019, we will use this piece of code:
df['date'] = pd.to_datetime(df['date'], format='%d %b %Y')
Once the column is converted to datetime, we may need to extract the date components into new columns:
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
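Putting the two steps together on a couple of made-up dates:

```python
import pandas as pd

df = pd.DataFrame({"date": ["30 Sep 2019", "01 Jan 2020"]})
df["date"] = pd.to_datetime(df["date"], format="%d %b %Y")

# The .dt accessor exposes the datetime components of the column
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
```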
Aggregations / Group Statistics
Continuing the idea of detecting rare and common values, which is really important for Machine Learning predictions, we can check whether a value is rare or common within a subgroup using a group statistic. For example, here we would like to know which smartphone brand's users make the longest calls, by calculating the mean call duration of each subgroup.
temp = (df.groupby('smartphone_brand')['call_duration']
          .agg(['mean'])
          .rename({'mean': 'call_duration_mean'}, axis=1))
df = pd.merge(df, temp, on='smartphone_brand', how='left')
With this method, a ML algorithm will be able to tell which calls have an uncommon call_duration value for their smartphone brand.
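A runnable sketch with made-up call records (using `reset_index()` so the group key is an ordinary column before merging):

```python
import pandas as pd

# Made-up call records
df = pd.DataFrame({
    "smartphone_brand": ["a", "a", "b"],
    "call_duration": [10.0, 20.0, 30.0],
})

temp = (df.groupby("smartphone_brand")["call_duration"]
          .agg(["mean"])
          .rename({"mean": "call_duration_mean"}, axis=1)
          .reset_index())
df = pd.merge(df, temp, on="smartphone_brand", how="left")
```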
Normalize / Standardize
Normalization can sometimes be really useful.
In order to achieve a normalization of a column against itself:
df[col] = ( df[col]-df[col].mean() ) / df[col].std()
Or you can normalize one column against another. For example, if you create a Group Statistic (described above) indicating the mean call_duration for each week, you can then remove the time dependence by:
df['call_duration_remove_time'] = df['call_duration'] - df['call_duration_week_mean']
The new variable call_duration_remove_time no longer increases as we advance in time, because we have normalized it against the effects of time.
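A compact way to build the weekly mean and subtract it, sketched on made-up data (`transform` broadcasts the group mean back to the original row order):

```python
import pandas as pd

# Made-up weekly call data
df = pd.DataFrame({
    "week": [1, 1, 2, 2],
    "call_duration": [10.0, 20.0, 30.0, 40.0],
})

# transform keeps the original row order while broadcasting the group mean
df["call_duration_week_mean"] = df.groupby("week")["call_duration"].transform("mean")
df["call_duration_remove_time"] = df["call_duration"] - df["call_duration_week_mean"]
```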
Ultimate feature engineering tip
Each column adds computation time to your preprocessing and to your model training. I strongly recommend testing a new feature and checking whether it improves (or not…) your evaluation metrics. If it does not, you should just remove the created / modified feature.
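One way to run such a test, sketched with scikit-learn on synthetic data (the candidate feature here is an arbitrary illustration, not a recommendation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)
base_score = cross_val_score(model, X, y, cv=3).mean()

# Candidate feature: product of the first two columns (illustrative only)
X_new = np.column_stack([X, X[:, 0] * X[:, 1]])
new_score = cross_val_score(model, X_new, y, cv=3).mean()

# Keep the candidate feature only if it beats the baseline CV score
keep_feature = new_score > base_score
```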