Pipe(line) dreams Part II: Creating a single preprocessing pipeline in Python
In part one of this trilogy of preprocessing posts, I gave an overview of pipeline preprocessing. In this part I focus on preprocessing numeric data, the code for which can be found on my GitHub. However, before anything can be preprocessed, we must acquire a "thing"--the data. Here I am importing a CSV file directly from the web. I am importing it as a pandas dataframe, which is why I am also importing the pandas library and abbreviating its name to "pd". Pandas is a Python library that has many useful functions and lets you structure your data in a way that feels familiar from other environments you may have worked in. In addition to importing the CSV, I am instructing the function not to read column names from the file (I didn't like them)--to treat it as having no header--and to use the names I am supplying as the column names.
import pandas as pd
df1 = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data",
                  header=None,
                  names=['ID', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'y'],
                  usecols=['ID', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'y'])
These data, as the website describes, are features of biopsied cells, while the target is whether the cell was malignant or not. All features are numeric, and on a 1 to 10 scale.
I always like to look at the data a little when I begin my analyses. The following code will print the first 5 rows of the data; if I put a number inside the parentheses, it will show me that many rows instead:
print(df1.head())
#Checking variable types:
print(df1.dtypes)
#Examining some descriptive stats for the numeric features:
print(df1.describe())
As you will recall, the target--y--is categorical. However, even though the describe function is meant for numeric columns, we see that it is describing the target. Looking at the type of variable Python thinks this is clarifies why this is occurring: Python has read this column in as an integer. This lets us know that we should declare it a categorical variable somewhere down the line. Looking at the output for y, however, we can see that the classes are likely quite imbalanced. That is not something I will deal with in this article or pipeline, but it is important to keep in mind when building your final model. These outputs also alert us to the fact that x6 is not being read in as an integer, and prod us to examine it further.
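A quick, purely exploratory way to check that balance is value_counts; the astype call is shown only as one possible way to declare the target categorical later, and is left commented out so it does not alter anything downstream:
print(df1['y'].value_counts())  # in this dataset the target is coded 2 (benign) and 4 (malignant)
# One option for declaring the target categorical when the time comes (not run here):
# df1['y'] = df1['y'].astype('category')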
At this point I want to split my data into my training and test sets. For #machinelearning algorithms we want to build the model using part of our data--likely the majority of it--and then test our model on the remaining part. This is necessary (and some even partition their data into three sections!) because otherwise we would have no way of knowing how our model will perform once we release it out into the wild.
Understanding why we need a test set helps explain why we should split the data this early in the process, and why we preprocess only the training set--fitting on it--and then transform both it and the test set using the fit learned from the training set. If we take the test set's characteristics into account when cleaning the training data, we leak information into the model and artificially inflate how well it appears to predict the test set.
For example, imagine this extreme case: we have a single numeric feature with the values [100, 101, 100, 99, 97, 21, 43, 89, 1, 110], and we randomly split it into two parts--70% and 30%--ending up with a training set of [100, 101, 100, 99, 97, 43, 110] with a mean of ~92.86, and a test set of [21, 89, 1] with a mean of 37. Now imagine there were NAs that we needed to fill in. If we imputed them with a mean calculated from all the values (i.e., before splitting), our training and test sets would look very different than if we imputed both sets using only the training set's mean (i.e., after splitting).
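To make the arithmetic concrete, here is a tiny sketch of that toy example (illustrative only; the arrays are the made-up numbers from the paragraph above and nothing here feeds the pipeline):
import numpy as np

train = np.array([100, 101, 100, 99, 97, 43, 110], dtype=float)
test = np.array([21, 89, 1], dtype=float)
full = np.concatenate([train, test])

print(round(train.mean(), 2))  # ~92.86 -- the imputation value if we fit on the training set only
print(round(full.mean(), 2))   # 76.1   -- the imputation value if we (wrongly) used all ten values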
So, on to splitting the data. We will use scikit-learn's train_test_split, which will randomly split the set for us, and we will specify that we want our training set to be 80% of the data and the test set to be 20%:
from sklearn.model_selection import train_test_split
df1_train_full, test_full = train_test_split(df1, test_size=0.2)
To manipulate the data you can use pandas. I used R before Python, so functions like .loc and .iloc do not come naturally to me, and I personally do not find them very readable. A popular data-wrangling package in #R is #dplyr, and it turns out that some kind souls (to whom I am ever grateful) have created a library called #dfply (which actually uses pandas under the hood) that lets you use functions almost identical to dplyr in Python. If you love .loc-ing and .iloc-ing you may hate dfply, but it may be worth checking out either way.
In dfply we pipe (hey--more pipes!), and therefore start by declaring the data we want to work on, then indicate that we want to move on with two arrows >>, and list the next manipulation we would like to perform. Before naming the feature you are interested in manipulating, preface it with "X.". Here I am using dfply to see whether IDs appear more than once, and I find that I have 39 duplicates:
from dfply import *
dup_id = (df1_train_full >>
summarize(distinct_id_sum = n_distinct(X.ID), id_sum = n(X.ID)) )
print(dup_id)
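For anyone who prefers to stay in plain pandas, a rough equivalent of the dfply snippet above might look like this (a sketch only; dup_id_pd is a throwaway name and is not used again):
dup_id_pd = pd.DataFrame({'distinct_id_sum': [df1_train_full['ID'].nunique()],
                          'id_sum': [len(df1_train_full)]})
print(dup_id_pd)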
I next use dfply to remove the duplicates: it is unclear why they exist, there are not many of them, and I do not know whether those rows can be trusted:
df1_train_full = (df1_train_full >> distinct(X.ID))
#I am additionally going to remove the target so that I can preprocess the features:
df1_train = df1_train_full >> drop(X.y)
I begin with preprocessing the feature x6, which was listed as dtype object above; the other features were all appropriately listed as integers. I start by selecting only x6 using dfply, and then listing all of its distinct values:
print(df1_train >> select(X.x6) >> distinct(X.x6))
We see that x6 has a "?" in it, and since we only asked for the distinct values, it is possible that there is more than one question mark in there.
In retrospect, it seems more informative to get a count of how many question marks there are, to give me an idea of whether I should drop the missing data or whether I would be better off imputing it. I can get a count with the value_counts function:
print(df1_train["x6"].value_counts())
There are 12 question marks. Seeing that this is well under 10% of the rows, I would be comfortable dropping those rows, but imputing the missing data would be fine as well. Of course, if you can afford to do so computationally and time-wise, it likely pays to run the model twice--once with the rows dropped, and once with the values imputed. For the purpose of this exercise I will impute the missing values.
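For reference, if you did want to take the dropping route instead, a minimal sketch might be the following (df1_train_dropped is a throwaway name; the rest of the article keeps every row and imputes):
# Drop the rows where x6 is a question mark -- shown only for comparison
df1_train_dropped = df1_train[df1_train['x6'] != '?']
print(df1_train_dropped.shape)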
First, I need to convert the feature to a numeric one. I do this by asking for the entire dataframe to be converted to numeric, which also makes it easy to specify that I would like all the data types to be float32, as it is less computationally expensive than float64. One nice thing about the following line of code (at least in situations like these) is that it will force any non-numeric values to become NAs:
import numpy as np
df1_train = df1_train.apply(pd.to_numeric, errors = 'coerce').astype(np.float32)
If possible, I like to take a quick peek at my data again and make sure that everything is as I expect it to be:
print(df1_train.head())
print(df1_train.dtypes)
print(df1_train.describe())
We can see that every column has a count of 516 except x6, indicating that it is likely the only feature with missing data. There are a few ways to examine missing data more closely. One is the isnull function:
df1_train.isnull().sum()
The other two methods that I will cover here are both visual. One method creates a visualization of all the columns in the dataframe, and marks the rows with the missing values with white lines. If a column is completely dark, it means that there are no missing data in it:
#The next line of code is only needed if you are utilizing a Jupyter Notebook
%matplotlib inline
import missingno as msno
msno.matrix(df1_train)
The second method creates both a table that lists the percent of each variable that is missing, and a bar plot that displays similar information. We start by creating two columns that hold the total number of missing values and the percentage of each column that is missing, and concatenate them into a dataframe. We then create and customize a bar plot from these columns, and the last line of code prints the first five rows of the dataframe, ordered by percent missing:
import matplotlib.pyplot as plt
import seaborn as sns
total = df1_train.isnull().sum().sort_values(ascending=False)
percent = (df1_train.isnull().sum()/df1_train.isnull().count() * 100).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
f, ax = plt.subplots(figsize=(15, 6))
plt.xticks(rotation='90')
sns.barplot(x=missing_data.index, y=missing_data['Percent'])
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)
missing_data.head()
We next want to deal with the missing data, since we likely need it gone or imputed before we can mutate the data further. As I mentioned above, I will impute the missing values. This is the first stage in cleaning the data, and therefore the first step in the pipeline. I have already removed the target, but I need to remove the ID variable as well, since I do not intend to use it as a feature:
df_train_x = df1_train >> drop(X.ID)
I will use scikit-learn's Pipeline, and its SimpleImputer for imputation. If I intended to build a pipeline with only an imputer, it would look like this:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
numeric_pipe1 = Pipeline([('imputer', SimpleImputer(strategy="mean"))])
However, as there will be more steps in our pipeline, we will continue to add them inside the brackets. The next step, following imputation, is engineering the features we are interested in entering into our model. Let us take a minute to examine the histograms of the features:
df1_train.hist(bins=10, figsize=(20, 15))
plt.show()
For some of these features, specifically the ones with the bulk of the data in a single bin, I am tempted to perform discretization (i.e., converting a continuous variable to a categorical one), since it seems almost meaningless to refer to anything outside that bin as anything other than "not in the majority bin". Sometimes discretization makes sense and can help; other times it simply throws away information. As always, when possible it likely pays to run the model both with and without discretization. The worst offender here is x9, so I will dichotomize it: if a value is smaller than two I will set it to zero, otherwise I will set it to one. I do this by building a custom transformer that I can plug into my pipeline.
While I am building a custom transformer to engineer features, I will also engineer a feature that is a ratio of two existing features. With this dataset a ratio does not really make sense, as far as I can tell--it would be essentially arbitrary--but I would like to create one for practice. I will create a ratio of x3 to x7. We build a class to use as a transformer:
from sklearn.base import BaseEstimator, TransformerMixin
x3_ix, x7_ix, x9_ix = 2, 6, 8 #These numbers represent the column numbers. Remember that in Python we begin counting the first column as zero
class CombinedAttributesAdder1(BaseEstimator, TransformerMixin):
    def __init__(self, add_x3_to_x7=True):
        self.add_x3_to_x7 = add_x3_to_x7
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        lambda_x9 = lambda x: 0 if x < 2 else 1  # using .apply and then a lambda doesn't work with np arrays, only pandas dataframes
        vlambda = np.vectorize(lambda_x9)  # this is a good substitute for the .apply
        x9_dich = vlambda(X[:, x9_ix])
        if self.add_x3_to_x7:
            x3_to_x7 = X[:, x3_ix] / X[:, x7_ix]
            return np.c_[X, x9_dich, x3_to_x7]
        else:
            return np.c_[X, x9_dich]
attr_adder1 = CombinedAttributesAdder1(add_x3_to_x7=False)
df_extra_attribs1 = attr_adder1.transform(df_train_x.values)
For the next part of my pipeline I am going to scale my features:
from sklearn.preprocessing import StandardScaler
If we want to add the second and third pipeline steps to our first step, we end up with the following code:
numeric_pipe = Pipeline([('imputer', SimpleImputer(strategy="mean")),
('attribs_adder1', CombinedAttributesAdder1()),
('std_scaler', StandardScaler())])
Since this is not the end of our practice pipeline, let us keep going. Although these data have fairly small dimensionality, and we likely will not gain much from running principal components analysis (PCA), let us do so anyway for funsies.
Since we want to run the PCA and would like to determine how many components to keep by using a scree plot, we first need to run our data through the current pipeline by fitting it to our data and then transforming them. We will create a dataframe out of the result, and examine the first five rows to ensure that all is as it should be:
numeric_pipe = Pipeline([('imputer', SimpleImputer(strategy="mean")),
('attribs_adder1', CombinedAttributesAdder1()),
('std_scaler', StandardScaler())])
train_df_tr = numeric_pipe.fit_transform(df_train_x)
train_df_tr_pd = pd.DataFrame(train_df_tr)
train_df_tr_pd.head()
A scree plot helps determine the number of components to keep by displaying the variance explained by the principal components; here we plot the cumulative explained variance ratio as a function of the number of components. The more components we keep, the less information we throw away, but the larger the dimensionality of our data. The point at which the curve begins to level off is roughly the number of components we want to retain:
from sklearn.decomposition import PCA
pca = PCA().fit(train_df_tr_pd)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.show()
The curve seems to level off most strongly around two, and then again around seven. Going from the eleven columns that enter the PCA (the nine original features plus the two engineered ones) down to seven is not a huge saving, but we will reduce the dimensionality further shortly using K-means. This leaves us with the following pipeline:
numeric_pipe = Pipeline([('imputer', SimpleImputer(strategy="mean")),
('attribs_adder1', CombinedAttributesAdder1()),
('std_scaler', StandardScaler()),
('pca', PCA(n_components = 7))])
In this example, PCA is not the final step in our pipeline. However, if you do not want to run it as a step in the pipeline, you would handle the PCA as follows:
from sklearn.decomposition import PCA
pca = PCA(n_components=7)
principalComponents = pca.fit_transform(train_df_tr_pd)
principalDf = pd.DataFrame(data = principalComponents, columns = ['pc1', 'pc2', 'pc3', 'pc4', 'pc5', 'pc6', 'pc7'])
#Resetting the index to avoid issues further down with concat
principalDf = principalDf.reset_index(drop=True)
print(principalDf.head())
However, you will likely want to save your components in a new dataframe together with your target, concatenating along axis = 1 (the columns' axis). We create final_df, the final dataframe holding the principal components (PCs). Until now we did not create a dataframe with just the target because we did not need it for the pipeline; obviously, if you want to concatenate the target you need to create such a dataframe first:
target_dataframe_name = df1_train_full >> select(X.y)
target_dataframe_name = pd.DataFrame(target_dataframe_name)
target_dataframe_name = target_dataframe_name.reset_index(drop=True)
#Specifying axis = 1 so that it adds the dataframes horizontally
final_df = pd.concat([principalDf, target_dataframe_name], axis=1)
print(final_df.head())
Sometimes, following dimensionality reduction with PCA, we want to reduce our features even further through clustering. For example, suppose you are trying to figure out which problems reported to customer service are easy to solve and which are much harder. You might want to know whether different problems cluster together, and whether those clusters predict how many repeated calls (if any) had to be made about the same issue (assuming that repeated calls indicate a complicated issue, since it clearly was not solved the first time the customer called). In that example you might first run a PCA on the problem-related data whose dimensionality you want to reduce. Then, after saving the PCs in place of the original features, you might run a k-means algorithm on them. One approach would then be to plot boxplots of the number of calls needed to solve an issue for each group, to see whether the groups differ. If they do differ--say group A takes far more calls to solve--then the next time someone calls with an issue that falls into that group you might want to allocate more resources to that person, especially if those types of problems are also predictors of churn.
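As a rough sketch of what the last step of that hypothetical workflow could look like--every name and value below is invented purely for illustration and has nothing to do with this article's dataset:
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up data: 'GROUP' stands in for a k-means cluster label, 'num_calls' for repeat-call counts
rng = np.random.default_rng(0)
calls_df = pd.DataFrame({'GROUP': rng.integers(0, 2, size=100),
                         'num_calls': rng.integers(1, 6, size=100)})

sns.boxplot(x='GROUP', y='num_calls', data=calls_df)
plt.title('Repeat calls per issue, by cluster (hypothetical data)')
plt.show()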
We begin by PCAing our data through the pipeline, so that we can use it to determine how many clusters we would like our final k-means algorithm to have:
numeric_pipe = Pipeline([('imputer', SimpleImputer(strategy="mean")),
('attribs_adder1', CombinedAttributesAdder1()),
('std_scaler', StandardScaler()),
('pca', PCA(n_components = 7))])
train_df_tr = numeric_pipe.fit_transform(df_train_x)
clust_data = pd.DataFrame(train_df_tr)
Similar to determining the number of PCs in PCA, we visually assess how many clusters we want our algorithm to employ:
from sklearn.cluster import KMeans
import sklearn.metrics
from scipy.spatial.distance import cdist
clusters=range(1,10)
meanDistortions=[]
for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(clust_data)
    prediction = model.predict(clust_data)
    meanDistortions.append(sum(np.min(cdist(clust_data, model.cluster_centers_, 'euclidean'), axis=1)) / clust_data.shape[0])

plt.plot(clusters, meanDistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method')
The "elbow" in this graph is pretty clearly at two, and so that is the number of clusters that I will choose:
final_model1 = KMeans(2)
final_model1.fit(clust_data)
prediction1 = final_model1.predict(clust_data)
After running the clustering we want to join a column with the cluster assignment to the dataframe containing all the data, for exploratory data analysis and possibly further manipulation and model building. We do not, however, need the original ID feature, which is why we drop it before adding a column called GROUP, which holds the cluster each row was assigned to:
final_df_train1 = df1_train_full >> drop(X.ID)
final_df_train1["GROUP"] = prediction1
print("Groups Assigned : \n")
print(final_df_train1.head(10))
If we run the pipeline now, as is, the output will be two columns containing each point's distance to each of the k (here, two) cluster centers:
numeric_pipe = Pipeline([('imputer', SimpleImputer(strategy="mean")),
('attribs_adder1', CombinedAttributesAdder1()),
('std_scaler', StandardScaler()),
('pca', PCA(n_components = 7)),
('kmean', KMeans(n_clusters = 2))])
train_df_tr = numeric_pipe.fit_transform(df_train_x)
train_df_tr_pd = pd.DataFrame(train_df_tr)
train_df_tr_pd.head()
Usually, this is not the output we are after when we run a k-means algorithm. That being said, depending on your needs, and especially if this is not the final step in your pipeline, this type of output--each point's distance from each cluster center--can be useful. Retaining those distances means that more of the original information is available later for building a possibly better model, much like the way we keep all the PCs rather than collapsing them into a single column.
However, in situations in which we would like a neat, single column output, we can create an additional custom transformer for the pipeline that will do just that:
k1_ix, k2_ix = 0, 1
class CombinedAttributesAdder2(BaseEstimator, TransformerMixin):
    def __init__(self, add_create_groups=True):
        self.add_create_groups = add_create_groups
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        group = X[:, k1_ix] - X[:, k2_ix]
        lambda_group = lambda x: 0 if x > 0 else 1
        vlambda = np.vectorize(lambda_group)
        if self.add_create_groups:
            group = vlambda(group)
            return np.c_[group]
        else:
            return np.c_[X]
attr_adder2 = CombinedAttributesAdder2(add_create_groups=False)
This leads us to our final pipeline, whose output is a single column of group assignments that we can now use as a predictor in a different model:
numeric_pipe = Pipeline([('imputer', SimpleImputer(strategy="mean")),
('attribs_adder1', CombinedAttributesAdder1()),
('std_scaler', StandardScaler()),
('pca', PCA(n_components = 7)),
('kmean', KMeans(n_clusters = 2)),
('attribs_adder2', CombinedAttributesAdder2())])
train_df_tr = numeric_pipe.fit_transform(df_train_x)
train_df_tr_pd = pd.DataFrame(train_df_tr)
train_df_tr_pd.head(10)
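Finally, recall the point from the beginning of the article: the test set should only ever be transformed with what was fit on the training set. A minimal sketch of what that might look like with this pipeline, assuming the test_full split from earlier is still in scope and mirroring the same ID/target dropping and numeric coercion we applied to the training data:
# Apply the already-fitted pipeline to the held-out test set. We call transform, not
# fit_transform, so the imputation means, scaling parameters, PCA loadings, and k-means
# centroids all come from the training data alone.
test_x = test_full >> drop(X.ID, X.y)
test_x = test_x.apply(pd.to_numeric, errors='coerce').astype(np.float32)
test_tr = numeric_pipe.transform(test_x)
print(pd.DataFrame(test_tr).head())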
This pipeline only dealt with numeric data. Part III of this article trilogy will cover how to create both numeric and categorical pipelines, combine them, and chain a final model at the end. If you enjoyed the article, found it helpful, or think someone else might, please like it or share it on your social media outlet of choice. Writing these articles is fun for me, but it is even more enjoyable if I know that others like them as well.