Pre-processing data in Python for Machine Learning

Exploring categorical features

The Gapminder dataset contains a categorical 'Region' feature, which we dropped in previous exercises because we did not have the tools to deal with it. Now that we do, we have added it back in!

Dealing with categorical features

● Scikit-learn will not accept categorical features by default

● Need to encode categorical features numerically

● Convert to ‘dummy variables’

● 0: Observation was NOT that category

● 1: Observation was that category

Dealing with categorical features in Python

● scikit-learn: OneHotEncoder()

● pandas: get_dummies()
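
As a minimal sketch of the idea (using a small made-up DataFrame, not the Gapminder data), pd.get_dummies() turns a categorical column into one 0/1 indicator column per category, and scikit-learn's OneHotEncoder() produces the equivalent numeric array:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical column
toy = pd.DataFrame({'Region': ['Europe', 'Africa', 'Europe', 'Asia']})

# pandas: one 0/1 indicator column per category
print(pd.get_dummies(toy, dtype=int))

# scikit-learn: the same encoding, returned as a numeric array
# (OneHotEncoder accepts string categories directly in scikit-learn 0.20+)
enc = OneHotEncoder()
print(enc.fit_transform(toy[['Region']]).toarray())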

Before encoding it, we can explore this feature. Boxplots are particularly useful for visualizing categorical features such as this.

  • Import pandas as pd.
  • Read the CSV file 'gapminder.csv' into a DataFrame called df.
  • Use pandas to create a boxplot showing the variation of life expectancy ('life') by region ('Region'). To do so, pass the two column names to df.boxplot() (in that order).

import pandas as pd
import matplotlib.pyplot as plt

# Read 'gapminder.csv' into a DataFrame: df
df = pd.read_csv('c:/scripts/17-supervised-learning-with-scikit-learn/data/gm_2008_region.csv')

# Inspect the column names
df.columns

# Create a boxplot of life expectancy per region
df.boxplot('life', 'Region', rot=60)

# Show the plot
plt.show()

Out[1072]: Index(['population', 'fertility', 'HIV', 'CO2', 'BMI_male', 'GDP',
       'BMI_female', 'life', 'child_mortality', 'Region'],
      dtype='object')


Exploratory data analysis should always be the precursor to model building.





Creating dummy variables

Scikit-learn does not accept non-numerical features. The 'Region' feature contains useful information for predicting life expectancy; for example, Sub-Saharan Africa has a lower life expectancy than Europe and Central Asia. Therefore, if you are trying to predict life expectancy, it is preferable to retain the 'Region' feature. To do this, you need to binarize it by creating dummy variables, which is what you will do in this exercise.

  • Use the pandas get_dummies() function to create dummy variables from the df DataFrame. Store the result as df_region.
  • Print the columns of df_region. This has been done for you.
  • Use the get_dummies() function again, this time specifying drop_first=True to drop the unneeded dummy variable (in this case, 'Region_America').
  • Print the new columns of df_region and note how one column has been dropped!
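
A sketch of these steps, assuming the Gapminder DataFrame df from above is already loaded:

# Create dummy variables from the categorical 'Region' feature: df_region
df_region = pd.get_dummies(df)

# Print the columns of df_region
print(df_region.columns)

# Create dummy variables again, dropping the first category ('Region_America') to avoid redundancy
df_region = pd.get_dummies(df, drop_first=True)

# Print the new columns of df_region
print(df_region.columns)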

Now that you have created the dummy variables, you can use the 'Region' feature to predict life expectancy!

Regression with categorical features

Having created the dummy variables from the 'Region' feature, we can build regression models. Here, we will use ridge regression to perform 5-fold cross-validation.

  • The feature array X and target variable array y have been pre-loaded.
  • Import Ridge from sklearn.linear_model and cross_val_score from sklearn.model_selection.
  • Instantiate a ridge regressor called ridge with alpha=0.5 and normalize=True.
  • Perform 5-fold cross-validation on X and y using the cross_val_score() function.
  • Print the cross-validated scores.
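
A sketch of the cross-validation described above. Note that normalize=True follows the course-era scikit-learn API; it was removed in scikit-learn 1.2+, where you would instead scale the features inside a pipeline.

# Import the necessary modules
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Instantiate a ridge regressor: ridge
ridge = Ridge(alpha=0.5, normalize=True)  # normalize= was removed in scikit-learn 1.2+

# Perform 5-fold cross-validation on X and y: ridge_cv
ridge_cv = cross_val_score(ridge, X, y, cv=5)

# Print the cross-validated scores
print(ridge_cv)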

[ 0.86808336 0.80623545 0.84004203 0.7754344  0.87503712]

We now know how to build models using data that includes categorical features.

Dropping missing data

"0" in triceps, insulin and bmi columns are not acceptable or realistic values. These values need to be imputed with mean values. Otherwise, we may loose large number of data for the dataset, if we use dropna() instead of replace "0" with mean values.

Dropping missing data

In [12]: df = df.dropna()

In [13]: df.shape

Out[13]: (393, 9)

We would lose roughly half of the records, which is not acceptable.

Imputing missing data

● Making an educated guess about the missing values

● Example: Using the mean of the non-missing entries

In [1]: from sklearn.preprocessing import Imputer

In [2]: imp = Imputer(missing_values='NaN', strategy='mean', axis=0)

In [3]: imp.fit(X)

In [4]: X = imp.transform(X)
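
Note that Imputer was removed in scikit-learn 0.22; the equivalent in current releases is SimpleImputer. A sketch of the same mean imputation with the newer API, assuming a feature array X:

import numpy as np
from sklearn.impute import SimpleImputer

# SimpleImputer replaces Imputer in scikit-learn 0.22+ and always imputes column-wise
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
X = imp.fit_transform(X)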

Imputing within a pipeline

The voting dataset contains many missing values. We need to take care of these!

The unprocessed dataset has been loaded into a DataFrame df. Explore it in the IPython Shell with the .head() method. You will see that there are certain data points labeled with a '?'. These denote missing values. Different datasets encode missing values in different ways. Sometimes it may be a '9999', other times a 0; real-world data can be very messy! If you're lucky, the missing values will already be encoded as NaN. We use NaN because it is an efficient and simplified way of internally representing missing data, and it lets us take advantage of pandas methods such as .dropna() and .fillna(), as well as scikit-learn's imputation transformer, Imputer().

  • In this exercise, we will convert the '?'s to NaNs, and then drop the rows that contain them from the DataFrame.
  • Explore the DataFrame df in the IPython Shell. Notice how the missing value is represented.
  • Convert all '?' data points to np.nan.
  • Count the total number of NaNs using the .isnull() and .sum() methods. This has been done for you.
  • Drop the rows with missing values from df using .dropna().
  • See how many rows were lost by dropping the missing values.
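
A sketch of those steps, assuming the voting-records DataFrame df is already loaded:

import numpy as np

# Convert '?' to NaN
df[df == '?'] = np.nan

# Print the number of NaNs per column
print(df.isnull().sum())

# Print the shape of the original DataFrame
print("Shape of Original DataFrame: {}".format(df.shape))

# Drop missing values and print the shape of the new DataFrame
df = df.dropna()
print("Shape of DataFrame After Dropping All Rows with Missing Values: {}".format(df.shape))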

When many values in your dataset are missing, if you drop them, you may end up throwing away valuable information along with the missing data. It's better instead to develop an imputation strategy. This is where domain knowledge is useful, but in the absence of it, you can impute missing values with the mean or the median of the row or column that the missing value is in.

Imputing missing data in a ML Pipeline I

As you've come to appreciate, there are many steps to building a model, from creating training and test sets, to fitting a classifier or regressor, to tuning its parameters, to evaluating its performance on new data. Imputation can be seen as the first step of this machine learning process, the entirety of which can be viewed within the context of a pipeline. Scikit-learn provides a pipeline constructor that allows you to piece together these steps into one process and thereby simplify your workflow.

We will now practice setting up a pipeline with two steps: the imputation step, followed by the instantiation of a classifier. We have seen three classifiers in this course so far: k-NN, logistic regression, and the decision tree. We will now be introduced to a fourth one: the Support Vector Machine, or SVM. We will not go into the details of how it works under the hood; it behaves exactly as you would expect of the scikit-learn estimators you have worked with previously, with the same .fit() and .predict() methods as before.

  • Import Imputer from sklearn.preprocessing and SVC from sklearn.svm. SVC stands for Support Vector Classification, which is a type of SVM.
  • Set up the imputation transformer to impute missing data (represented as 'NaN') with the 'most_frequent' value in the column (axis=0).
  • Instantiate a SVC classifier. Store the result in clf.
  • Create the steps of the pipeline by creating a list of tuples:
  • The first tuple should consist of the imputation step, using imp.
  • The second should consist of the classifier.
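
A sketch of these steps. The Imputer call follows the course-era API; with the newer SimpleImputer you would drop the axis argument and pass np.nan instead of 'NaN'.

# Import the Imputer transformer and SVC
from sklearn.preprocessing import Imputer
from sklearn.svm import SVC

# Set up the imputation transformer: imp
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)

# Instantiate the SVC classifier: clf
clf = SVC()

# Set up the pipeline as a list of (name, transformer/estimator) tuples: steps
steps = [('imputation', imp),
         ('SVM', clf)]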

Having set up the pipeline steps, you can now use it for classification.

Imputing missing data in a ML Pipeline II

Having set up the steps of the pipeline, we will now use it on the voting dataset to classify a Congressman's party affiliation. What makes pipelines so incredibly useful is the simple interface that they provide. We can use the .fit() and .predict() methods on pipelines just as we did with our classifiers and regressors!

We will generate a classification report of our predictions. The feature array X and target variable array y have been pre-loaded. Additionally, train_test_split and classification_report have been imported from sklearn.model_selection and sklearn.metrics respectively.

  • Import the following modules:
  • Imputer from sklearn.preprocessing and Pipeline from sklearn.pipeline.
  • SVC from sklearn.svm.
  • Create the pipeline using Pipeline() and steps.
  • Create training & test sets. Use 30% of the data for testing & a random state of 42.
  • Fit the pipeline to the training set and predict the labels of the test set.
  • Compute the classification report.
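
A sketch that puts the pieces together (train_test_split and classification_report are assumed to be pre-loaded, as stated above):

# Import the necessary modules
from sklearn.preprocessing import Imputer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Set up the pipeline steps: steps
steps = [('imputation', Imputer(missing_values='NaN', strategy='most_frequent', axis=0)),
         ('SVM', SVC())]

# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the pipeline to the training set
pipeline.fit(X_train, y_train)

# Predict the labels of the test set
y_pred = pipeline.predict(X_test)

# Compute the classification report
print(classification_report(y_test, y_pred))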

The pipeline has performed imputation as well as classification!

Centering and scaling

Why scale your data? Check the range of each feature, from its minimum to its maximum: features with very large or very small values can have an outsized influence on your model.

● Many models use some form of distance to inform them

● Features on larger scales can unduly influence the model

● Example: k-NN uses distance explicitly when making predictions

● We want features to be on a similar scale

● Normalizing (or scaling and centering)

Ways to normalize your data

● Standardization: Subtract the mean and divide by the standard deviation

● All features are centered around zero and have variance one

● Can also subtract the minimum and divide by the range

● Minimum zero and maximum one

● Can also normalize so the data ranges from -1 to +1

● See scikit-learn docs for further details

Scaling in scikit-learn


The performance of a model can improve significantly when the features are scaled. Note that this is not always the case: in the Congressional voting records dataset, for example, all of the features are binary, so scaling has minimal impact.

You will now explore scaling for yourself on a new dataset: White Wine Quality! Previously, we used the Red Wine Quality dataset. The 'quality' feature of the wine was used to create a binary target variable: if 'quality' is less than 5, the target variable is 1; otherwise, it is 0.

The DataFrame has been pre-loaded as df, along with the feature and target variable arrays X and y. Explore it in the IPython Shell. Notice how some features have different units of measurement and very different scales: 'density', for instance, only takes values between 0 and 1, while 'total sulfur dioxide' has a maximum value of 289. As a result, it may be worth scaling the features here. Let's scale the features and compare the mean and standard deviation of the unscaled features with those of the scaled features.

  • Import scale from sklearn.preprocessing.
  • Scale the features X using scale().
  • Print the mean and standard deviation of the unscaled features X, and then the scaled features X_scaled. Use the numpy functions np.mean() and np.std() to compute the mean and standard deviations.
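
A sketch of this comparison, assuming the white-wine feature array X mentioned above:

# Import scale
from sklearn.preprocessing import scale
import numpy as np

# Scale the features: X_scaled
X_scaled = scale(X)

# Print the mean and standard deviation of the unscaled features
print("Mean of Unscaled Features: {}".format(np.mean(X)))
print("Standard Deviation of Unscaled Features: {}".format(np.std(X)))

# Print the mean and standard deviation of the scaled features
print("Mean of Scaled Features: {}".format(np.mean(X_scaled)))
print("Standard Deviation of Scaled Features: {}".format(np.std(X_scaled)))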

Mean of Unscaled Features: 18.432687072460002

Standard Deviation of Unscaled Features: 41.54494764094571

Mean of Scaled Features: 2.7314972981668206e-15

Standard Deviation of Scaled Features: 0.9999999999999999

Notice the difference in the mean and standard deviation of the scaled features compared to the unscaled features.

Centering and scaling in a pipeline

With regard to whether or not scaling is effective, the proof is in the pudding! See for yourself whether or not scaling the features of the White Wine Quality dataset has any impact on its performance. We will use a k-NN classifier as part of a pipeline that includes scaling, and for the purposes of comparison, a k-NN classifier trained on the unscaled data has been provided.

The feature array and target variable array have been pre-loaded as X and y. Additionally, KNeighborsClassifier and train_test_split have been imported from sklearn.neighbors and sklearn.model_selection, respectively.

  • Import the following modules:
  • StandardScaler from sklearn.preprocessing.
  • Pipeline from sklearn.pipeline.
  • Complete the steps of the pipeline with StandardScaler() for 'scaler' and KNeighborsClassifier() for 'knn'.
  • Create the pipeline using Pipeline() and steps.
  • Create training and test sets, with 30% used for testing. Use a random state of 42.
  • Fit the pipeline to the training set.
  • Compute the accuracy scores of the scaled and unscaled models by using the .score() method inside the provided print() functions.
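
A sketch of this comparison (KNeighborsClassifier and train_test_split are assumed to be pre-loaded, as stated above; the k-NN classifier trained on unscaled data is recreated here as knn_unscaled for completeness):

# Import the necessary modules
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Set up the pipeline steps: steps
steps = [('scaler', StandardScaler()),
         ('knn', KNeighborsClassifier())]

# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the pipeline to the training set: knn_scaled
knn_scaled = pipeline.fit(X_train, y_train)

# Fit a k-NN classifier to the unscaled data for comparison
knn_unscaled = KNeighborsClassifier().fit(X_train, y_train)

# Compute and print the accuracy with and without scaling
print('Accuracy with Scaling: {}'.format(knn_scaled.score(X_test, y_test)))
print('Accuracy without Scaling: {}'.format(knn_unscaled.score(X_test, y_test)))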

It looks like scaling has significantly improved model performance!

Pipeline for classification

It is now time to piece everything together into a pipeline for classification! We need to build a pipeline that includes scaling and hyperparameter tuning to classify wine quality.

We will be using the SVM classifier. The hyperparameters we will tune are C and gamma. C controls the regularization strength; it is analogous to the C we tuned for logistic regression previously (https://www.dhirubhai.net/pulse/building-logistic-regression-model-roc-curve-abu/), while gamma controls the kernel coefficient.

The following modules have been pre-loaded: Pipeline, svm, train_test_split, GridSearchCV, classification_report, accuracy_score. The feature and target variable arrays X and y have also been pre-loaded.

  • Setup the pipeline with the following steps:
  • Scaling, called 'scaler' with StandardScaler().
  • Classification, called 'SVM' with SVC().
  • Specify the hyperparameter space using the following notation: 'step_name__parameter_name'. Here, the step_name is SVM, and the parameter_names are C and gamma.
  • Create training & test sets, 20% of the data used for the test set. Use a random state of 21.
  • Instantiate GridSearchCV with the pipeline and hyperparameter space and fit it to the training set. Use 3-fold cross-validation (This is the default, so you don't have to specify it).
  • Predict the labels of the test set and compute the metrics.
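
A sketch of the full classification pipeline. The grid values for C and gamma are illustrative choices, not prescribed by the text, and cv=3 is written out explicitly even though it was the default at the time of the course (newer scikit-learn defaults to 5-fold):

# Import what is not already pre-loaded (Pipeline, GridSearchCV, train_test_split,
# classification_report and accuracy_score are assumed pre-loaded, as stated above)
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Set up the pipeline
steps = [('scaler', StandardScaler()),
         ('SVM', SVC())]
pipeline = Pipeline(steps)

# Specify the hyperparameter space (illustrative values)
parameters = {'SVM__C': [1, 10, 100],
              'SVM__gamma': [0.1, 0.01]}

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)

# Instantiate the GridSearchCV object: cv
cv = GridSearchCV(pipeline, parameters, cv=3)

# Fit to the training set
cv.fit(X_train, y_train)

# Predict the labels of the test set: y_pred
y_pred = cv.predict(X_test)

# Compute and print the metrics
print("Accuracy: {}".format(cv.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print("Tuned Model Parameters: {}".format(cv.best_params_))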

Pipeline for regression

We will return to the Gapminder dataset. Guess what? Even this dataset has missing values, which were dealt with for you in earlier exercises! Now you have all the tools to take care of them: build a pipeline that imputes the missing data, scales the features, and fits an ElasticNet to the Gapminder data. You will then tune the l1_ratio of your ElasticNet using GridSearchCV.

All the necessary modules have been imported, and the feature and target variable arrays have been pre-loaded as X and y.

  • Set up a pipeline with the following steps:
  • 'imputation', which uses the Imputer() transformer and the 'mean' strategy to impute missing data ('NaN') using the mean of the column.
  • 'scaler', which scales the features using StandardScaler().
  • 'elasticnet', which instantiates an ElasticNet regressor.
  • Specify the hyperparameter space for the l1_ratio using the following notation: 'step_name__parameter_name'. Here, the step_name is elasticnet, and the parameter_name is l1_ratio.
  • Create training & test sets, 40% of the data used for the test set. Use a random state of 42.
  • Instantiate GridSearchCV with the pipeline and hyperparameter space. Use 3-fold cross-validation (This is the default, so you don't have to specify it).
  • Fit the GridSearchCV object to the training set.
  • Compute R2 and the best parameters.
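
A sketch of the regression pipeline, assuming all the necessary modules (Imputer, StandardScaler, ElasticNet, Pipeline, GridSearchCV, train_test_split, numpy as np) are imported as stated above. The Imputer call follows the course-era API (SimpleImputer in 0.22+), and the grid of 30 l1_ratio values is an illustrative choice:

# Set up the pipeline steps: steps
steps = [('imputation', Imputer(missing_values='NaN', strategy='mean', axis=0)),
         ('scaler', StandardScaler()),
         ('elasticnet', ElasticNet())]

# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Specify the hyperparameter space
parameters = {'elasticnet__l1_ratio': np.linspace(0, 1, 30)}

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Create the GridSearchCV object: gm_cv (cv=3 was the default at the time of the course)
gm_cv = GridSearchCV(pipeline, parameters, cv=3)

# Fit to the training set
gm_cv.fit(X_train, y_train)

# Compute and print R squared and the best parameters
r2 = gm_cv.score(X_test, y_test)
print("Tuned ElasticNet l1 ratio: {}".format(gm_cv.best_params_))
print("Tuned ElasticNet R squared: {}".format(r2))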

Tuned ElasticNet l1 ratio: {'elasticnet__l1_ratio': 1.0}

Tuned ElasticNet R squared: 0.8862016570888217

We have now mastered the fundamentals of supervised learning with scikit-learn!

● Using machine learning techniques to build predictive models

● For both regression and classification problems

● With real-world data

● Underfitting and overfitting

● Train-test split

● Cross-validation

● Grid search

● Regularization, lasso and ridge regression

● Data pre-processing
