Introduction to machine learning models
You have now made it to the section on machine learning (ML). ML and the branch of computer science in which it resides, artificial intelligence (AI), are so central to data science that ML/AI and data science are synonymous in the minds of many people. However, the preceding sections have hopefully demonstrated that there are a lot of other facets to the discipline of data science apart from the prediction and classification tasks that supply so much value to the world. (Remember, at least 80 percent of the effort in most data-science projects will be composed of cleaning and manipulating the data to prepare it for analysis.)
That said, ML is fun! In this section, and the next one on data science in the cloud, you will get to play around with some of the “magic” of data science and start to put into practice the tools you have spent the last five sections learning. Let's get started!
A quick aside: types of ML
As you get deeper into data science, it might seem like there are a bewildering array of ML algorithms out there. However many you encounter, it can be handy to remember that most ML algorithms fall into three broad categories:
- Predictive algorithms: These analyze current and historical facts to make predictions about unknown events, such as the future or customers’ choices.
- Classification algorithms: These teach a program from a body of data, and the program then uses that learning to classify new observations.
- Time-series forecasting algorithms: While it can be argued that these algorithms are a subset of predictive algorithms, their techniques are specialized enough that they function in many ways like a separate category. Time-series forecasting is beyond the scope of this course; we have more than enough to work with here focusing on prediction and classification.
Prediction: linear regression
Learning goal: By the end of this subsection, you should be comfortable fitting linear regression models, and you should have some familiarity with interpreting their output.
Arguably the simplest form of machine learning is to draw a line connecting two points and make predictions about where that trend might lead.
But what if you have more than two points—and those points don't line up neatly? What if you have points in more than two dimensions? This is where linear regression comes in.
Formally, linear regression is used to predict a quantitative response (the values on a Y axis) that is dependent on one or more predictors (values on one or more axes that are orthogonal to Y, commonly just thought of collectively as X). The working assumption is that the relationship between predictors and response is more or less linear. The goal of linear regression is to fit a straight line in the best possible way to minimize the deviation between our observed responses in the dataset and the responses predicted by our line, the linear approximation. (The most common means of assessing this error is called the least squares method; it consists of minimizing the number you get when you square the difference between your predicted value and the actual value and add up all of those squared differences for your entire dataset.)
Statistically, we can represent this relationship between response and predictors as:
$$Y = \beta_0 + \beta_1X + \epsilon$$
Remember high school geometry?
$\beta_0$ is the intercept of our line and $\beta_1$ is its slope. We commonly refer to $\beta_0$ and $\beta_1$ as coefficients and to $\epsilon$ as the error term, which represents the margin of error in the model.
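Putting these pieces together, the least squares method described above chooses the coefficients that minimize the sum of squared differences between the observed responses and the responses predicted by the line:

$$\min_{\beta_0,\,\beta_1} \sum_{i=1}^{n} \left(y_i - (\beta_0 + \beta_1 x_i)\right)^2$$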
Let's try this in practice with actual data. (Note: no graph paper will be harmed in the course of these predictions.)
Data exploration
We'll begin by importing our usual libraries and using our %matplotlib inline magic command:
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
/home/nbuser/anaconda3_420/lib/python3.5/site-packages/matplotlib/font_manager.py:281: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
'Matplotlib is building the font cache using fc-list. '
And now for our data. In this case, we’ll use a newer housing dataset than the Boston Housing Dataset we used in the last section (with this one storing data on individual houses across the United States).
In [2]:
df = pd.read_csv('./Data/Housing_Dataset_Sample.csv')
df.head()
Out[2]:
| | Avg. Area Income | Avg. Area House Age | Avg. Area Number of Rooms | Avg. Area Number of Bedrooms | Area Population | Price | Address |
|---|---|---|---|---|---|---|---|
| 0 | 79545.458574 | 5.682861 | 7.009188 | 4.09 | 23086.800503 | 1.059034e+06 | 208 Michael Ferry Apt. 674\nLaurabury, NE 3701... |
| 1 | 79248.642455 | 6.002900 | 6.730821 | 3.09 | 40173.072174 | 1.505891e+06 | 188 Johnson Views Suite 079\nLake Kathleen, CA... |
| 2 | 61287.067179 | 5.865890 | 8.512727 | 5.13 | 36882.159400 | 1.058988e+06 | 9127 Elizabeth Stravenue\nDanieltown, WI 06482... |
| 3 | 63345.240046 | 7.188236 | 5.586729 | 3.26 | 34310.242831 | 1.260617e+06 | USS Barnett\nFPO AP 44820 |
| 4 | 59982.197226 | 5.040555 | 7.839388 | 4.23 | 26354.109472 | 6.309435e+05 | USNS Raymond\nFPO AE 09386 |
Exercise:
In [3]:
# Do you remember the DataFrame method for looking at overall information
# about a DataFrame, such as number of columns and rows? Try it here.
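If you get stuck, one possible answer (just a sketch; there is more than one way to do this) is the DataFrame's info method:

# info() reports the number of rows and columns, each column's dtype, and memory usage.
df.info()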
Let's also use the describe method to look at some of the vital statistics about the columns. Note that in cases like this, in which some of the column names are long, it can be helpful to view the transposition of the summary, like so:
In [4]:
df.describe().T
Out[4]:
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Avg. Area Income | 5000.0 | 6.858311e+04 | 10657.991214 | 17796.631190 | 61480.562388 | 6.880429e+04 | 7.578334e+04 | 1.077017e+05 |
| Avg. Area House Age | 5000.0 | 5.977222e+00 | 0.991456 | 2.644304 | 5.322283 | 5.970429e+00 | 6.650808e+00 | 9.519088e+00 |
| Avg. Area Number of Rooms | 5000.0 | 6.987792e+00 | 1.005833 | 3.236194 | 6.299250 | 7.002902e+00 | 7.665871e+00 | 1.075959e+01 |
| Avg. Area Number of Bedrooms | 5000.0 | 3.981330e+00 | 1.234137 | 2.000000 | 3.140000 | 4.050000e+00 | 4.490000e+00 | 6.500000e+00 |
| Area Population | 5000.0 | 3.616352e+04 | 9925.650114 | 172.610686 | 29403.928702 | 3.619941e+04 | 4.286129e+04 | 6.962171e+04 |
| Price | 5000.0 | 1.232073e+06 | 353117.626581 | 15938.657923 | 997577.135049 | 1.232669e+06 | 1.471210e+06 | 2.469066e+06 |
Let's look at the data in the Price column. (You can disregard the deprecation warning if it appears.)
In [5]:
sns.distplot(df['Price'])
/home/nbuser/anaconda3_420/lib/python3.5/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f677b7d0f60>
As we would hope with this much data, our prices form a nice bell-shaped, normally distributed curve.
Now, let's look at a simple relationship like that between house prices and the average income in a geographic area:
In [6]:
sns.jointplot(df['Avg. Area Income'],df['Price'])
/home/nbuser/anaconda3_420/lib/python3.5/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
Out[6]:
<seaborn.axisgrid.JointGrid at 0x7f677b3bb8d0>
As we would expect, there is an intuitive, linear relationship between them. Also good: the jointplot shows that the data in both columns is normally distributed, so we don't have to worry about somehow transforming the data for meaningful analysis.
Let's take a quick look at all of the columns:
In [7]:
sns.pairplot(df)
Out[7]:
<seaborn.axisgrid.PairGrid at 0x7f67801a09e8>
Some observations:
- Not all of the combinations of columns provide strong linear relationships; some just look like blobs. That's nothing to worry about for our analysis.
- See the visualizations that look like lanes rather than organic groups? That is the result of the average number of bedrooms being measured in discrete values rather than continuous ones (no one has 0.3 bedrooms in their house). The number of bedrooms is also the one column whose data is not really normally distributed, though some of this might be distortion caused by the default bin size of the pairplot histogram functionality.
It is now time to make a prediction.
Fitting the model
Let's make a prediction. Let's feed everything into a linear model (average area income, average area house age, average area number of rooms, average area number of bedrooms, and area population) and see how well knowing those factors can help us predict the price of a home.
To do this, we will make our first five columns the X (our predictors) and the Price column the Y (our response):
In [8]:
X = df.iloc[:,:5]
y = df['Price']
Now, we could use all of our data to create our model. However, all that would get us is a model that is good at predicting itself. Not only would that leave us with no objective way to measure how good the model is, it would also likely lead to a model that was less accurate when used on new data. Such a model is termed overfitted.
To avoid this, data scientists divide their datasets for ML into training data (the data used to fit the model) and test data (data used to evaluate how accurate the model is). Fortunately, scikit-learn provides a function that enables us to easily divide up our data between training and test sets: train_test_split. In this case, we will use 70 percent of our data for training and reserve 30 percent of it for testing. (Note that you will also supply a fourth parameter to the function: random_state; train_test_split randomly divides up our data between test and training, so this number provides an explicit seed for the random-number generator so that you will get the same result each time you run this code snippet.)
In [9]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=54)
All that is left now is to import our linear regression algorithm and fit our model based on our training data:
In [10]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
In [11]:
reg.fit(X_train,y_train)
Out[11]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
Evaluating the model
Now, a moment of truth: let's see how our model does making predictions based on the test data:
In [12]:
predictions = reg.predict(X_test)
In [13]:
predictions
Out[13]:
array([ 614607.96220733, 1849444.80372637, 1118945.0888425 , ...,
834789.0342857 , 1787928.10906922, 1455422.23696486])
Our predictions are just an array of numbers: these are the house prices predicted by our model. One for every row in our test dataset.
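To get a more concrete feel for these numbers, we can line up a few predictions against the actual prices from the test set (a quick sanity check that is not part of the original exercise):

# Pair each predicted price with the actual price for the same test-set row.
comparison = pd.DataFrame({'Actual': y_test.values, 'Predicted': predictions})
print(comparison.head())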
Remember how we mentioned that linear models have the mathematical form $Y = \beta_0 + \beta_1X + \epsilon$? Let's look at the actual equation:
In [14]:
print(reg.intercept_,reg.coef_)
-2646401.726324682 [2.15873958e+01 1.65828187e+05 1.21323502e+05 2.79025671e+03
1.51667244e+01]
In algebraic terms, here is our model, with the intercept and coefficients rounded from the output above:

$$Y = -2646401.73 + 21.59X_1 + 165828.19X_2 + 121323.50X_3 + 2790.26X_4 + 15.17X_5$$
where:
- $Y$ = Price
- $X_1$ = Average area income
- $X_2$ = Average area house age
- $X_3$ = Average area number of rooms
- $X_4$ = Average area number of bedrooms
- $X_5$ = Area population
So, just how good is our model? There are many ways to measure the accuracy of ML models. Linear models have a good one: the $R^2$ score (also known as the coefficient of determination). A high $R^2$, close to 1, indicates better prediction with less error.
In [15]:
#Explained variation. A high R2 close to 1 indicates better prediction with less error.
from sklearn.metrics import r2_score
r2_score(y_test,predictions)
Out[15]:
0.921660486570713
The $R^2$ score also indicates how much explanatory power a linear model has. In the case of our model, the five predictors we used explain a little more than 92 percent of the variation in house prices in this dataset.
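As a shortcut, the fitted estimator can report the same metric directly: scikit-learn regression models have a score method that returns $R^2$ for whatever data you pass in.

# Equivalent to r2_score(y_test, predictions) above.
print(reg.score(X_test, y_test))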
We can also plot our errors to get a visual sense of how wrong our predictions were:
In [16]:
#plot errors
sns.distplot([y_test-predictions])
/home/nbuser/anaconda3_420/lib/python3.5/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f677968bb00>
Do you notice the numbers on the left axis? Whereas a histogram shows the number of things that fall into discrete numeric buckets, a kernel density estimation (KDE, and the histogram that accompanies it in the Seaborn distplot) normalizes those numbers to show what proportion of results lands in each bucket. Essentially, these are all decimal numbers less than 1.0 because the area under the KDE has to add up to 1.
Maybe more gratifying, we can plot the predictions from our model:
In [17]:
# Plot outputs
plt.scatter(y_test,predictions, color='blue')
Out[17]:
<matplotlib.collections.PathCollection at 0x7f677935df98>
The linear nature of our predicted prices is clear enough, but because all of the dots are solid it's hard to see the areas of concentration. Can you think of a way to refine this visualization to make it clearer?
Exercise:
In [18]:
# Hint: Remember to try the plt.scatter parameter alpha.
# It takes values between 0 and 1.
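One possible refinement (a sketch of the exercise answer; your exact alpha value may differ):

# Semi-transparent points make the dense clusters of predictions easier to see.
plt.scatter(y_test, predictions, color='blue', alpha=0.1)
plt.xlabel('Actual price')
plt.ylabel('Predicted price')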
Takeaway: In this subsection, you performed prediction using linear regression by exploring your data, then fitting your model, and finally evaluating your model’s performance.
Classification: logistic regression
Learning goal: By the end of this subsection, you should know how logistic regression differs from linear regression, be comfortable fitting logistic regression models, and have some familiarity with interpreting their output.
We'll now pivot to discussing classification. If our simple analogy of predictive analytics was drawing a line through points and extrapolating from that, then classification can be described in its simplest form as drawing lines around groups of points.
While linear regression is used to predict quantitative responses, such as what someone's score on an exam might be, logistic regression is used for classification problems, such as predicting someone passing or failing an exam.
Formally, logistic regression predicts the categorical response (Y) based on predictors (Xs). Logistic regression goes by several names, and it is also known in the scholarly literature as logit regression, maximum-entropy classification (MaxEnt), and the log-linear classifier. In this algorithm, the probabilities describing the possible outcomes of a single trial are modeled using a sigmoid (S-curve) function. Sigmoid functions take any value and transform it to be between 0 and 1, which can be used as a probability for a class to be predicted, with the goal of predictors mapping to 1 when something belongs in the class and 0 when they do not.
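Concretely, the sigmoid function maps any real-valued input $z$ (here, the linear combination of the predictors) to a value between 0 and 1:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

The model then assigns an observation to class 1 when this probability crosses a threshold (0.5 by default in scikit-learn's predict method).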
To show this in action, let's do something a little different and try a historical dataset: the fates of the passengers of the RMS Titanic, which is a popular dataset for classification problems in machine learning. In this case, the class we want to predict is whether a passenger survived the doomed liner's sinking.
The dataset has 12 variables:
- PassengerId
- Survived: 0 = No, 1 = Yes
- Pclass: Ticket class; 1 = 1st, 2 = 2nd, 3 = 3rd
- Name: Passenger name
- Sex
- Age
- SibSp: Number of siblings or spouses aboard the Titanic
- Parch: Number of parents or children aboard the Titanic
- Ticket: Passenger ticket number
- Fare: Passenger fare
- Cabin: Cabin number
- Embarked: Port of embarkation; C = Cherbourg, Q = Queenstown, S = Southampton
In [19]:
df = pd.read_csv('./Data/train_data_titanic.csv')
df.head()
Out[19]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
In [20]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
One reason that the Titanic data set is a popular classification set is that it provides opportunities to prepare data for analysis. To prepare this dataset for analysis, we need to perform a number of tasks:
- Remove extraneous variables
- Check for multicollinearity
- Handle missing values
We will touch on each of these steps in turn.
Remove extraneous variables
The names of individual passengers and their ticket numbers will clearly do nothing to help our model, so we can drop those columns to simplify matters.
In [21]:
df.drop(['Name','Ticket'],axis=1,inplace=True)
There are additional variables that will not add classifying power to our model, but to find them we will need to look for correlation between variables.
Check for multicollinearity
If one or more of our predictors can themselves be predicted from other predictors, it can produce a state of multicollinearity in our model. When items are too closely related, it can make it difficult to determine the true predictors of a condition. Multicollinearity is a challenge because it can skew the results of regression models (both linear and logistic) and reduce the predictive or classifying power of a model.
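A quick numeric check (a sketch that previews the fuller correlation matrix we compute at the end of this subsection) is to look at pairwise correlations among the numeric columns and watch for values near +1 or -1:

# Pairwise correlations among the numeric columns; values near +/-1 suggest redundant predictors.
print(df[['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']].corr())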
To help combat this problem, we can start to look for some initial patterns. For example, do any correlations between Survived and Fare jump out?
In [22]:
sns.pairplot(df[['Survived','Fare']], dropna=True)
Out[22]:
<seaborn.axisgrid.PairGrid at 0x7f6778de5d68>
Exercise:
In [23]:
# Try running sns.pairplot twice more on some other combinations of columns
# and see if any patterns emerge.
We can also use groupby to look for patterns. Consider the mean values for the various variables when we group by Survived:
In [24]:
df.groupby('Survived').mean()
Out[24]:
| Survived | PassengerId | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| 0 | 447.016393 | 2.531876 | 30.626179 | 0.553734 | 0.329690 | 22.117887 |
| 1 | 444.368421 | 1.950292 | 28.343690 | 0.473684 | 0.464912 | 48.395408 |
In [25]:
df.groupby('Age').mean()
Out[25]:
| Age | PassengerId | Survived | Pclass | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| 0.42 | 804.000000 | 1.000000 | 3.000000 | 0.000000 | 1.000000 | 8.516700 |
| 0.67 | 756.000000 | 1.000000 | 2.000000 | 1.000000 | 1.000000 | 14.500000 |
| 0.75 | 557.500000 | 1.000000 | 3.000000 | 2.000000 | 1.000000 | 19.258300 |
| 0.83 | 455.500000 | 1.000000 | 2.000000 | 0.500000 | 1.500000 | 23.875000 |
| 0.92 | 306.000000 | 1.000000 | 1.000000 | 1.000000 | 2.000000 | 151.550000 |
| 1.00 | 415.428571 | 0.714286 | 2.714286 | 1.857143 | 1.571429 | 30.005957 |
| 2.00 | 346.900000 | 0.300000 | 2.600000 | 2.100000 | 1.300000 | 37.536250 |
| 3.00 | 272.000000 | 0.833333 | 2.500000 | 1.833333 | 1.333333 | 25.781950 |
| 4.00 | 466.100000 | 0.700000 | 2.600000 | 1.600000 | 1.400000 | 29.543330 |
| 5.00 | 380.000000 | 1.000000 | 2.750000 | 1.750000 | 1.250000 | 22.717700 |
| 6.00 | 762.333333 | 0.666667 | 2.666667 | 1.333333 | 1.333333 | 25.583333 |
| 7.00 | 288.666667 | 0.333333 | 2.666667 | 2.666667 | 1.333333 | 31.687500 |
| 8.00 | 400.250000 | 0.500000 | 2.500000 | 2.000000 | 1.250000 | 28.300000 |
| 9.00 | 437.250000 | 0.250000 | 3.000000 | 2.500000 | 1.750000 | 27.938538 |
| 10.00 | 620.000000 | 0.000000 | 3.000000 | 1.500000 | 2.000000 | 26.025000 |
| 11.00 | 534.500000 | 0.250000 | 2.500000 | 2.500000 | 1.500000 | 54.240625 |
| 12.00 | 126.000000 | 1.000000 | 3.000000 | 1.000000 | 0.000000 | 11.241700 |
| 13.00 | 614.000000 | 1.000000 | 2.500000 | 0.000000 | 0.500000 | 13.364600 |
| 14.00 | 312.000000 | 0.500000 | 2.500000 | 2.000000 | 0.833333 | 42.625700 |
| 14.50 | 112.000000 | 0.000000 | 3.000000 | 1.000000 | 0.000000 | 14.454200 |
| 15.00 | 554.600000 | 0.800000 | 2.600000 | 0.400000 | 0.400000 | 49.655020 |
| 16.00 | 422.294118 | 0.352941 | 2.529412 | 0.764706 | 0.529412 | 25.745100 |
| 17.00 | 423.000000 | 0.461538 | 2.384615 | 0.615385 | 0.384615 | 28.389423 |
| 18.00 | 516.269231 | 0.346154 | 2.461538 | 0.384615 | 0.423077 | 38.063462 |
| 19.00 | 389.400000 | 0.360000 | 2.360000 | 0.320000 | 0.200000 | 27.869496 |
| 20.00 | 493.066667 | 0.200000 | 3.000000 | 0.200000 | 0.066667 | 8.624173 |
| 20.50 | 228.000000 | 0.000000 | 3.000000 | 0.000000 | 0.000000 | 7.250000 |
| 21.00 | 390.208333 | 0.208333 | 2.583333 | 0.333333 | 0.208333 | 31.565621 |
| 22.00 | 365.740741 | 0.407407 | 2.555556 | 0.148148 | 0.222222 | 25.504781 |
| 23.00 | 510.266667 | 0.333333 | 2.133333 | 0.400000 | 0.266667 | 37.994720 |
| ... | ... | ... | ... | ... | ... | ... |
| 44.00 | 437.111111 | 0.333333 | 2.111111 | 0.444444 | 0.222222 | 29.758333 |
| 45.00 | 367.500000 | 0.416667 | 2.000000 | 0.333333 | 0.583333 | 36.818408 |
| 45.50 | 268.000000 | 0.000000 | 2.000000 | 0.000000 | 0.000000 | 17.862500 |
| 46.00 | 427.000000 | 0.000000 | 1.333333 | 0.333333 | 0.000000 | 55.458333 |
| 47.00 | 534.666667 | 0.111111 | 1.777778 | 0.222222 | 0.111111 | 27.601389 |
| 48.00 | 663.111111 | 0.666667 | 1.666667 | 0.555556 | 0.555556 | 37.893067 |
| 49.00 | 533.500000 | 0.666667 | 1.333333 | 0.666667 | 0.166667 | 59.929183 |
| 50.00 | 457.200000 | 0.500000 | 1.600000 | 0.400000 | 0.200000 | 64.025830 |
| 51.00 | 456.142857 | 0.285714 | 2.000000 | 0.142857 | 0.142857 | 28.752386 |
| 52.00 | 589.500000 | 0.500000 | 1.333333 | 0.500000 | 0.333333 | 51.402783 |
| 53.00 | 572.000000 | 1.000000 | 1.000000 | 2.000000 | 0.000000 | 51.479200 |
| 54.00 | 383.625000 | 0.375000 | 1.500000 | 0.500000 | 0.500000 | 44.477087 |
| 55.00 | 254.500000 | 0.500000 | 1.500000 | 0.000000 | 0.000000 | 23.250000 |
| 55.50 | 153.000000 | 0.000000 | 3.000000 | 0.000000 | 0.000000 | 8.050000 |
| 56.00 | 542.750000 | 0.500000 | 1.000000 | 0.000000 | 0.250000 | 43.976025 |
| 57.00 | 700.000000 | 0.000000 | 2.000000 | 0.000000 | 0.000000 | 11.425000 |
| 58.00 | 325.000000 | 0.600000 | 1.000000 | 0.000000 | 0.600000 | 93.901660 |
| 59.00 | 164.000000 | 0.000000 | 2.500000 | 0.000000 | 0.000000 | 10.375000 |
| 60.00 | 583.750000 | 0.500000 | 1.250000 | 0.750000 | 0.500000 | 55.000000 |
| 61.00 | 374.666667 | 0.000000 | 1.666667 | 0.000000 | 0.000000 | 24.019433 |
| 62.00 | 552.500000 | 0.500000 | 1.250000 | 0.000000 | 0.000000 | 35.900000 |
| 63.00 | 380.000000 | 1.000000 | 2.000000 | 0.500000 | 0.000000 | 43.772900 |
| 64.00 | 492.500000 | 0.000000 | 1.000000 | 0.500000 | 2.000000 | 144.500000 |
| 65.00 | 264.333333 | 0.000000 | 1.666667 | 0.000000 | 0.333333 | 32.093067 |
| 66.00 | 34.000000 | 0.000000 | 2.000000 | 0.000000 | 0.000000 | 10.500000 |
| 70.00 | 709.500000 | 0.000000 | 1.500000 | 0.500000 | 0.500000 | 40.750000 |
| 70.50 | 117.000000 | 0.000000 | 3.000000 | 0.000000 | 0.000000 | 7.750000 |
| 71.00 | 295.500000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 42.079200 |
| 74.00 | 852.000000 | 0.000000 | 3.000000 | 0.000000 | 0.000000 | 7.775000 |
| 80.00 | 631.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 30.000000 |
88 rows × 6 columns
Survivors appear to be slightly younger on average with higher-cost fare.
In [26]:
df.head()
Out[26]:
| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | NaN | S |
Value counts can also help us get a sense of the data before us, such as numbers for siblings and spouses on the Titanic, in addition to the sex split of passengers:
In [27]:
df['SibSp'].value_counts()
Out[27]:
0 608
1 209
2 28
4 18
3 16
8 7
5 5
Name: SibSp, dtype: int64
In [28]:
df['Parch'].value_counts()
Out[28]:
0 678
1 118
2 80
5 5
3 5
4 4
6 1
Name: Parch, dtype: int64
In [29]:
df['Sex'].value_counts()
Out[29]:
male 577
female 314
Name: Sex, dtype: int64
Handle missing values
We now need to address missing values. First, let’s look to see which columns have more than half of their values missing:
In [30]:
#missing
df.isnull().sum()>(len(df)/2)
Out[30]:
PassengerId False
Survived False
Pclass False
Sex False
Age False
SibSp False
Parch False
Fare False
Cabin True
Embarked False
dtype: bool
Let's break down the code in the call above just a bit. df.isnull().sum() tells pandas to take the sum of all of the missing values for each column. len(df)/2 is just another way of expressing half the number of rows in the DataFrame. Taken together with the >, this line of code looks for any column with more than half of its entries missing, and there is one: Cabin.
We could try to do something about those missing values. However, if any pattern does emerge in the data that involves Cabin, it will be highly cross-correlated with both Pclass and Fare (as higher-fare, better-class accommodations were grouped together on the Titanic). Given that too much cross-correlation can be detrimental to a model, it is probably just better for us to drop Cabin from our DataFrame:
In [31]:
df.drop('Cabin',axis=1,inplace=True)
Let's now run info to see if there are columns with just a few null values.
In [32]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Fare 891 non-null float64
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(2)
memory usage: 62.7+ KB
One note on the data: given that 1,503 people died in the Titanic tragedy (and that we know some passengers in this dataset survived), this dataset clearly does not include every passenger on the ship, and it includes none of the crew. Also remember that Survived covers both outcomes: those who survived (1) and those who perished (0).
Back to missing values. Age is missing several values, as is Embarked. Let's see how many values are missing from Age:
In [33]:
df['Age'].isnull().value_counts()
Out[33]:
False 714
True 177
Name: Age, dtype: int64
As we saw above, Age isn't really correlated with Fare, so it is a variable that we want to eventually use in our model. That means we need to do something with those missing values. But before we decide on a strategy, we should check to see whether the median age is the same for both sexes.
In [34]:
df.groupby('Sex')['Age'].median().plot(kind='bar')
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6778cb5160>
The median ages are different for men and women sailing on the Titanic, which means that we should handle the missing values accordingly. A sound strategy is to replace the missing ages for passengers with the median age for the passengers' sexes.
In [35]:
df['Age'] = df.groupby('Sex')['Age'].apply(lambda x: x.fillna(x.median()))
Any other missing values?
In [36]:
df.isnull().sum()
Out[36]:
PassengerId 0
Survived 0
Pclass 0
Sex 0
Age 0
SibSp 0
Parch 0
Fare 0
Embarked 2
dtype: int64
We are missing two values for Embarked. Check to see how that variable breaks down:
In [37]:
df['Embarked'].value_counts()
Out[37]:
S 644
C 168
Q 77
Name: Embarked, dtype: int64
We can look to see which port of embarkation was most common and use that value as our default.
In [38]:
df['Embarked'].fillna(df['Embarked'].value_counts().idxmax(), inplace=True)
df['Embarked'].value_counts()
Out[38]:
S 646
C 168
Q 77
Name: Embarked, dtype: int64
We can see that Southampton was the most common port of embarkation, so, because we were only missing two values, it makes a sensible default. One more preparation step: our model needs numeric predictors, so we convert the categorical columns Sex and Embarked into binary indicator (dummy) columns, dropping the first level of each to avoid introducing redundant, perfectly correlated columns:
In [39]:
df = pd.get_dummies(data=df, columns=['Sex', 'Embarked'],drop_first=True)
df.head()
Out[39]:
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | Sex_male | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 22.0 | 1 | 0 | 7.2500 | 1 | 0 | 1 |
| 1 | 2 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 | 0 | 0 |
| 2 | 3 | 1 | 3 | 26.0 | 0 | 0 | 7.9250 | 0 | 0 | 1 |
| 3 | 4 | 1 | 1 | 35.0 | 1 | 0 | 53.1000 | 0 | 0 | 1 |
| 4 | 5 | 0 | 3 | 35.0 | 0 | 0 | 8.0500 | 1 | 0 | 1 |
Let's do a final look at the correlation matrix to see if there is anything else we should remove.
In [40]:
df.corr()
Out[40]:
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | Sex_male | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|
| PassengerId | 1.000000 | -0.005007 | -0.035144 | 0.035734 | -0.057527 | -0.001652 | 0.012658 | 0.042939 | -0.033606 | 0.022204 |
| Survived | -0.005007 | 1.000000 | -0.338481 | -0.073296 | -0.035322 | 0.081629 | 0.257307 | -0.543351 | 0.003650 | -0.149683 |
| Pclass | -0.035144 | -0.338481 | 1.000000 | -0.338056 | 0.083081 | 0.018443 | -0.549500 | 0.131900 | 0.221009 | 0.074053 |
| Age | 0.035734 | -0.073296 | -0.338056 | 1.000000 | -0.236376 | -0.176038 | 0.094161 | 0.095256 | -0.032994 | -0.005855 |
| SibSp | -0.057527 | -0.035322 | 0.083081 | -0.236376 | 1.000000 | 0.414838 | 0.159651 | -0.114631 | -0.026354 | 0.068734 |
| Parch | -0.001652 | 0.081629 | 0.018443 | -0.176038 | 0.414838 | 1.000000 | 0.216225 | -0.245489 | -0.081228 | 0.060814 |
| Fare | 0.012658 | 0.257307 | -0.549500 | 0.094161 | 0.159651 | 0.216225 | 1.000000 | -0.182333 | -0.117216 | -0.162184 |
| Sex_male | 0.042939 | -0.543351 | 0.131900 | 0.095256 | -0.114631 | -0.245489 | -0.182333 | 1.000000 | -0.074115 | 0.119224 |
| Embarked_Q | -0.033606 | 0.003650 | 0.221009 | -0.032994 | -0.026354 | -0.081228 | -0.117216 | -0.074115 | 1.000000 | -0.499421 |
| Embarked_S | 0.022204 | -0.149683 | 0.074053 | -0.005855 | 0.068734 | 0.060814 | -0.162184 | 0.119224 | -0.499421 | 1.000000 |
Pclass and Fare have a fair amount of correlation (about -0.55), so we can probably get rid of one of them. In addition, we need to remove Survived from our X DataFrame because it will be our response, y:
In [41]:
X = df.drop(['Survived', 'PassengerId','Pclass'],axis=1)
y = df['Survived']
Exercise:
We now need to split the training and test data, which you will do as an exercise:
In [42]:
from sklearn.model_selection import train_test_split
# Look up in the portion above on linear regression and use train_test_split here.
# Set test_size = 0.3 and random_state = 67 to get the same results as below when
# you run through the rest of the code example below.
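If you want to check your work, a split consistent with the parameters described above would look like this (one possible answer):

# Same 70/30 split and random_state used in the rest of this example.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=67)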
Now you will import and fit the logistic regression model:
In [43]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()