Feature Engineering techniques in Python
Anis Ayari
I build AI agents for your use case. Founder of DeeplayerAI / AI content creator / Head of AI / AI Speaker
Feature engineering is a crucial part of every Machine Learning project. In this article we will walk through some techniques to handle this task. Please do not hesitate to comment with new ideas; I will try to keep this article as up to date as possible.
Merge Train and Test
When performing feature engineering, in order to build a general model, it is always recommended to work on the whole DataFrame: if you have two files (train and test), just merge them.
df = pd.concat([train[col], test[col]], axis=0)
# The label column will be set as NULL for test rows
# FEATURE ENGINEERING HERE
train[col] = df[:len(train)]
test[col] = df[len(train):]
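As a minimal, runnable sketch of this round trip (the `price` column and its values are made up for illustration):

```python
import pandas as pd

# Toy train/test frames sharing one column (hypothetical names and values)
train = pd.DataFrame({"price": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"price": [4.0, 5.0]})

col = "price"
df = pd.concat([train[col], test[col]], axis=0)

# FEATURE ENGINEERING HERE (placeholder: simple scaling)
df = df * 2

# Split back by position: the first len(train) rows belong to train
train[col] = df[:len(train)].values
test[col] = df[len(train):].values
```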
Memory reduction
Sometimes the type encoding of a column is not the best choice, for example encoding as int32 a column containing only values from 0 to 10. A popular helper function reduces memory usage by converting each column to the smallest suitable type.
def reduce_mem_usage(df):
    """Iterate through all the columns of a dataframe and
    modify the data type to reduce memory usage."""
    start_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')
    end_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
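As an aside, pandas also ships a built-in downcast option that covers the integer part of this helper; a minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"small_ints": np.arange(10, dtype=np.int64)})

# downcast="integer" picks the smallest integer dtype that fits the values
df["small_ints"] = pd.to_numeric(df["small_ints"], downcast="integer")

print(df["small_ints"].dtype)
```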
Remove Outlier Values
A common way to remove outliers is to use the Z-score.
If you want to remove each row where at least one column contains an outlier (defined with the Z-score), you can use the following code:
from scipy import stats
df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]
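A toy demonstration with made-up data, where one extreme value is filtered out:

```python
import numpy as np
import pandas as pd
from scipy import stats

# 19 typical values and one extreme value (made-up data)
df = pd.DataFrame({"x": [2.0] * 19 + [100.0]})

# Keep only rows where every column's Z-score is below 3
clean = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]
```

Note that with very few rows, the outlier itself inflates the standard deviation, so the Z-score cut only works well when the sample is large enough.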
NAN trick
Some tree-based algorithms can handle NaN values, but they will add a split between NaN and non-NaN values, which can sometimes make no sense. A common trick is simply to fill all NaN values with a value lower than the lowest value in the column considered (for example -9999).
df[col].fillna(-9999, inplace=True)
Categorical Features
You can treat categorical features with label encoding to handle them as numeric. You can also decide to treat them as a category dtype. I recommend trying both and keeping whichever improves your Cross-Validation score, using this line of code (after label encoding):
df[col] = df[col].astype('category')
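Both options can be sketched together on made-up data (here using `pd.factorize` as one simple way to label encode):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red", "green"]})

# Label encoding: each category becomes an integer code, in order of appearance
df["color_label"], uniques = pd.factorize(df["color"])

# Category dtype: keeps the original values but stores them efficiently
df["color_cat"] = df["color"].astype("category")
```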
Combining / Splitting
Sometimes a string variable contains multiple pieces of information in one value, for example FRANCE_Paris. You will need to split it with a regex or a split method, for example:
new = df["localisation"].str.split("_", n=1, expand=True)
df['country'] = new[0]
df['city'] = new[1]
Conversely, two (string or numeric) columns can be combined into one. For example a column with a French department code (75, for Paris) and a district code (001) can become a zip code: 75001.
df['zipcode'] = df['department_code'].astype(str) + df['district_code'].astype(str)
Linear combinations
A common feature engineering step is to apply simple mathematical operations to create new features. For example, if we have the width and the height of a rectangle, we can calculate the area.
df['area'] = df['width'] * df['height']
Count column
Creating a column from the popular value_counts method is a powerful technique for tree-based algorithms, as it tells the model whether a value is rare or common.
counts = df[col].value_counts().to_dict()
df[col + '_counts'] = df[col].map(counts)
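A small runnable example of this count encoding, with a made-up brand column:

```python
import pandas as pd

# Made-up data: how often does each brand appear?
df = pd.DataFrame({"brand": ["apple", "apple", "nokia", "apple"]})

counts = df["brand"].value_counts().to_dict()
df["brand_counts"] = df["brand"].map(counts)
```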
Deal with Date
Dealing with dates and parsing each element of a date is crucial for analyzing events.
First, we need to convert our date column (often read as a string column by pandas). One of the most important skills is knowing how to use the format parameter. I strongly recommend saving this site as a bookmark! :)
For example, to convert a date column with the following format, 30 Sep 2019, we will use this piece of code:
df['date'] = pd.to_datetime(df['date'], format='%d %b %Y')
Once the column is converted to datetime, we may need to extract the date components into new columns:
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
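Putting the two steps together on a couple of made-up dates:

```python
import pandas as pd

df = pd.DataFrame({"date": ["30 Sep 2019", "01 Jan 2020"]})
df["date"] = pd.to_datetime(df["date"], format="%d %b %Y")

# The .dt accessor exposes the datetime components of the column
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
```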
Aggregations / Group Statistics
Continuing the idea of detecting rare and common values, which is really important for Machine Learning predictions, we can check whether a value is rare or common within a subgroup using a group statistic. For example, here we would like to know which smartphone brand's users make the longest calls, by calculating the mean call duration of each subgroup.
temp = (df.groupby('smartphone_brand')['call_duration']
          .agg(['mean'])
          .rename({'mean': 'call_duration_mean'}, axis=1))
df = pd.merge(df, temp, on='smartphone_brand', how='left')
With this method, a ML algorithm will be able to tell which calls have an uncommon call_duration value for their smartphone brand.
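A runnable sketch with made-up call records (using `reset_index()` so the group key is an ordinary column before merging):

```python
import pandas as pd

# Made-up call records
df = pd.DataFrame({
    "smartphone_brand": ["a", "a", "b"],
    "call_duration": [10.0, 20.0, 30.0],
})

temp = (df.groupby("smartphone_brand")["call_duration"]
          .agg(["mean"])
          .rename({"mean": "call_duration_mean"}, axis=1)
          .reset_index())
df = pd.merge(df, temp, on="smartphone_brand", how="left")
```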
Normalize / Standardize
Normalization can sometimes be really useful.
In order to achieve a normalization of a column against itself:
df[col] = ( df[col]-df[col].mean() ) / df[col].std()
Or you can normalize one column against another. For example, if you create a Group Statistic (described above) indicating the mean call_duration for each week, you can then remove the time dependence by:
df['call_duration_remove_time'] = df['call_duration'] - df['call_duration_week_mean']
The new variable call_duration_remove_time no longer increases as we advance in time, because we have normalized it against the effects of time.
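A compact way to build the weekly mean and subtract it, sketched on made-up data (`transform` broadcasts the group mean back to the original row order):

```python
import pandas as pd

# Made-up weekly call data
df = pd.DataFrame({
    "week": [1, 1, 2, 2],
    "call_duration": [10.0, 20.0, 30.0, 40.0],
})

# transform keeps the original row order while broadcasting the group mean
df["call_duration_week_mean"] = df.groupby("week")["call_duration"].transform("mean")
df["call_duration_remove_time"] = df["call_duration"] - df["call_duration_week_mean"]
```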
Ultimate feature engineering tip
Each column adds computation time to your preprocessing and to your model training. I strongly recommend testing a new feature and checking whether it improves (or not…) your evaluation metrics. If it does not, you should just remove the created / modified feature.
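One way to run such a test, sketched with scikit-learn on synthetic data (the candidate feature here is an arbitrary illustration, not a recommendation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)
base_score = cross_val_score(model, X, y, cv=3).mean()

# Candidate feature: product of the first two columns (illustrative only)
X_new = np.column_stack([X, X[:, 0] * X[:, 1]])
new_score = cross_val_score(model, X_new, y, cv=3).mean()

# Keep the candidate feature only if it beats the baseline CV score
keep_feature = new_score > base_score
```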