Introduction to machine learning models

You have now made it to the section on machine learning (ML). ML and the branch of computer science in which it resides, artificial intelligence (AI), are so central to data science that ML/AI and data science are synonymous in the minds of many people. However, the preceding sections have hopefully demonstrated that there are a lot of other facets to the discipline of data science apart from the prediction and classification tasks that supply so much value to the world. (Remember, at least 80 percent of the effort in most data-science projects will be composed of cleaning and manipulating the data to prepare it for analysis.)

That said, ML is fun! In this section, and the next one on data science in the cloud, you will get to play around with some of the “magic” of data science and start to put into practice the tools you have spent the last five sections learning. Let's get started!

A quick aside: types of ML

As you get deeper into data science, it might seem like there is a bewildering array of ML algorithms out there. However many you encounter, it can be handy to remember that most of them fall into three broad categories:

  • Predictive algorithms: These analyze current and historical facts to make predictions about unknown events, such as the future or customers’ choices.
  • Classification algorithms: These teach a program from a body of data, and the program then uses that learning to classify new observations.
  • Time-series forecasting algorithms: While it can be argued that these algorithms are a subset of predictive algorithms, their techniques are specialized enough that they function in many ways like a separate category. Time-series forecasting is beyond the scope of this course, but we have more than enough to work on here with prediction and classification.

Prediction: linear regression

Learning goal: By the end of this subsection, you should be comfortable fitting linear regression models, and you should have some familiarity with interpreting their output.

Arguably the simplest form of machine learning is to draw a line connecting two points and make predictions about where that trend might lead.

But what if you have more than two points—and those points don't line up neatly? What if you have points in more than two dimensions? This is where linear regression comes in.

Formally, linear regression is used to predict a quantitative response (the values on a Y axis) that is dependent on one or more predictors (values on one or more axes that are orthogonal to Y, commonly thought of collectively as X). The working assumption is that the relationship between predictors and response is more or less linear. The goal of linear regression is to fit a straight line that minimizes the deviation between our observed responses in the dataset and the responses predicted by that line, the linear approximation. (The most common way of measuring this error is the least squares method: square the difference between each predicted value and its actual value, add up those squared differences across the entire dataset, and prefer the line that makes the total as small as possible.)
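
To make that least-squares idea concrete, here is a minimal, self-contained sketch (using a handful of made-up points and a hand-picked candidate line, not the housing data we load below) of the quantity that linear regression minimizes:

import numpy as np

# Five made-up observations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# A candidate line y_hat = b0 + b1 * x, with intercept and slope picked by hand.
b0, b1 = 0.2, 1.9
y_hat = b0 + b1 * x

# The sum of squared errors; least squares chooses b0 and b1 to make this as small as possible.
print(np.sum((y - y_hat) ** 2))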


Statistically, we can represent this relationship between response and predictors as:

$$Y = \beta_0 + \beta_1X + \epsilon$$

Remember high school geometry? $\beta_0$ is the intercept of our line and $\beta_1$ is its slope. We commonly refer to $\beta_0$ and $\beta_1$ as coefficients and to $\epsilon$ as the error term, which represents the margin of error in the model.

Let's try this in practice with actual data. (Note: no graph paper will be harmed in the course of these predictions.)

Data exploration

We'll begin by importing our usual libraries and using our %matplotlib inline magic command:

In [1]:

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

%matplotlib inline

import seaborn as sns


/home/nbuser/anaconda3_420/lib/python3.5/site-packages/matplotlib/font_manager.py:281: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.

  'Matplotlib is building the font cache using fc-list. '

 

And now for our data. In this case, we’ll use a newer housing dataset than the Boston Housing Dataset we used in the last section (with this one storing data on individual houses across the United States).

In [2]:

df = pd.read_csv('./Data/Housing_Dataset_Sample.csv')

df.head()


Out[2]:

| | Avg. Area Income | Avg. Area House Age | Avg. Area Number of Rooms | Avg. Area Number of Bedrooms | Area Population | Price | Address |
|---|---|---|---|---|---|---|---|
| 0 | 79545.458574 | 5.682861 | 7.009188 | 4.09 | 23086.800503 | 1.059034e+06 | 208 Michael Ferry Apt. 674\nLaurabury, NE 3701... |
| 1 | 79248.642455 | 6.002900 | 6.730821 | 3.09 | 40173.072174 | 1.505891e+06 | 188 Johnson Views Suite 079\nLake Kathleen, CA... |
| 2 | 61287.067179 | 5.865890 | 8.512727 | 5.13 | 36882.159400 | 1.058988e+06 | 9127 Elizabeth Stravenue\nDanieltown, WI 06482... |
| 3 | 63345.240046 | 7.188236 | 5.586729 | 3.26 | 34310.242831 | 1.260617e+06 | USS Barnett\nFPO AP 44820 |
| 4 | 59982.197226 | 5.040555 | 7.839388 | 4.23 | 26354.109472 | 6.309435e+05 | USNS Raymond\nFPO AE 09386 |

Exercise:

In [3]:

# Do you remember the DataFrame method for looking at overall information

# about a DataFrame, such as number of columns and rows? Try it here.


Let's also use the describe method to look at some of the vital statistics about the columns. Note that in cases like this, in which some of the column names are long, it can be helpful to view the transposition of the summary, like so:

In [4]:

df.describe().T


Out[4]:

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Avg. Area Income | 5000.0 | 6.858311e+04 | 10657.991214 | 17796.631190 | 61480.562388 | 6.880429e+04 | 7.578334e+04 | 1.077017e+05 |
| Avg. Area House Age | 5000.0 | 5.977222e+00 | 0.991456 | 2.644304 | 5.322283 | 5.970429e+00 | 6.650808e+00 | 9.519088e+00 |
| Avg. Area Number of Rooms | 5000.0 | 6.987792e+00 | 1.005833 | 3.236194 | 6.299250 | 7.002902e+00 | 7.665871e+00 | 1.075959e+01 |
| Avg. Area Number of Bedrooms | 5000.0 | 3.981330e+00 | 1.234137 | 2.000000 | 3.140000 | 4.050000e+00 | 4.490000e+00 | 6.500000e+00 |
| Area Population | 5000.0 | 3.616352e+04 | 9925.650114 | 172.610686 | 29403.928702 | 3.619941e+04 | 4.286129e+04 | 6.962171e+04 |
| Price | 5000.0 | 1.232073e+06 | 353117.626581 | 15938.657923 | 997577.135049 | 1.232669e+06 | 1.471210e+06 | 2.469066e+06 |

Let's look at the data in the Price column. (You can disregard the deprecation warning if it appears.)

In [5]:

sns.distplot(df['Price'])


/home/nbuser/anaconda3_420/lib/python3.5/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.

  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval

 

Out[5]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f677b7d0f60>


As we would hope with this much data, our prices form a nice bell-shaped, normally distributed curve.

Now, let's look at a simple relationship like that between house prices and the average income in a geographic area:

In [6]:

sns.jointplot(df['Avg. Area Income'],df['Price'])


/home/nbuser/anaconda3_420/lib/python3.5/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.

  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval

 

Out[6]:

<seaborn.axisgrid.JointGrid at 0x7f677b3bb8d0>


As we would expect, there is an intuitive, linear relationship between them. Also good: the jointplot shows that the data in both columns is normally distributed, so we don't have to worry about somehow transforming the data for meaningful analysis.
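
If you want a quick numeric check to go along with the plots, pandas can report the skewness of each column; values near zero suggest a roughly symmetric, normal-looking distribution. This is just an optional sketch that reuses the df loaded above:

# Skewness near 0 suggests a roughly symmetric (normal-looking) distribution;
# strongly positive or negative values hint that a transformation might help.
print(df['Price'].skew())
print(df['Avg. Area Income'].skew())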

Let's take a quick look at all of the columns:

In [7]:

sns.pairplot(df)


Out[7]:

<seaborn.axisgrid.PairGrid at 0x7f67801a09e8>


Some observations:


  1. Not all of the combinations of columns provide strong linear relationships; some just look like blobs. That's nothing to worry about for our analysis.
  2. See the visualizations that look like lanes rather than organic groups? That is the result of the average number of bedrooms being recorded in discrete steps rather than continuous values (no one has 0.3 bedrooms in their house). Bedrooms is also the one column whose data is not really normally distributed, though some of this might be distortion caused by the default bin size of the pairplot histogram functionality.

It is now time to make a prediction. 

Fitting the model

Let's make a prediction. Let's feed everything into a linear model (average area income, average area house age, average area number of rooms, average area number of bedrooms, and area population) and see how well knowing those factors can help us predict the price of a home. 

To do this, we will make our first five columns the X (our predictors) and the Price column the Y (our response):

In [8]:

X = df.iloc[:,:5]

y = df['Price']


Now, we could use all of our data to create our model. However, all that would get us is a model that is good at predicting itself. Not only would that leave us with no objective way to measure how good the model is, it would also likely lead to a model that was less accurate when used on new data. Such a model is termed overfitted.

To avoid this, data scientists divide their datasets for ML into training data (the data used to fit the model) and test data (data used to evaluate how accurate the model is). Fortunately, scikit-learn provides a function that enables us to easily divide up our data between training and test sets: train_test_split. In this case, we will use 70 percent of our data for training and reserve 30 percent of it for testing. (Note that you will also supply a fourth parameter to the function: random_state; train_test_split randomly divides up our data between test and training, so this number provides an explicit seed for the random-number generator so that you will get the same result each time you run this code snippet.)

In [9]:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=54)


All that is left now is to import our linear regression algorithm and fit our model based on our training data:

In [10]:

from sklearn.linear_model import LinearRegression

reg = LinearRegression()


In [11]:

reg.fit(X_train,y_train)


Out[11]:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Evaluating the model

Now, a moment of truth: let's see how our model does making predictions based on the test data:

In [12]:

predictions = reg.predict(X_test)


In [13]:

predictions


Out[13]:

array([ 614607.96220733, 1849444.80372637, 1118945.0888425 , ...,

        834789.0342857 , 1787928.10906922, 1455422.23696486])

Our predictions are just an array of numbers: these are the house prices predicted by our model, one for every row in our test dataset.
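
If you would like to eyeball the predictions against the actual sale prices, a quick optional sketch (reusing the y_test and predictions variables from above) is to put them side by side in a DataFrame:

import pandas as pd

# Line up actual and predicted prices for the first few test-set rows.
comparison = pd.DataFrame({'actual': y_test.values, 'predicted': predictions})
print(comparison.head())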

Remember how we mentioned that linear models have the mathematical form $Y = \beta_0 + \beta_1X + \epsilon$? Let's look at the actual equation:

In [14]:

print(reg.intercept_,reg.coef_)


-2646401.726324682 [2.15873958e+01 1.65828187e+05 1.21323502e+05 2.79025671e+03

 1.51667244e+01]

 

In algebraic terms, here is our model:

$$Y \approx -2{,}646{,}402 + 21.59X_1 + 165{,}828X_2 + 121{,}324X_3 + 2{,}790X_4 + 15.17X_5$$

where:

  • $Y$ = Price
  • $X_1$ = Average area income
  • $X_2$ = Average area house age
  • $X_3$ = Average area number of rooms
  • $X_4$ = Average area number of bedrooms
  • $X_5$ = Area population

So, just how good is our model? There are many ways to measure the accuracy of ML models. Linear models have a good one: the $R^2$ score (also known as the coefficient of determination). A high $R^2$, close to 1, indicates better prediction with less error.

In [15]:

#Explained variation. A high R2 close to 1 indicates better prediction with less error.

from sklearn.metrics import r2_score

r2_score(y_test,predictions)


Out[15]:

0.921660486570713

The $R^2$ score also indicates how much explanatory power a linear model has. In the case of our model, the five predictors we used explain a little more than 92 percent of the variation in house prices in this dataset.

We can also plot our errors to get a visual sense of how wrong our predictions were:

In [16]:

#plot errors

sns.distplot([y_test-predictions])


/home/nbuser/anaconda3_420/lib/python3.5/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.

  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval

 

Out[16]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f677968bb00>


Do you notice the numbers on the left axis? Whereas a histogram shows the number of observations that fall into discrete numeric buckets, a kernel density estimation (KDE, along with the histogram that accompanies it in the Seaborn distplot) normalizes those counts to show what proportion of results lands in each bucket. Essentially, the values are all decimals less than 1.0 because the area under the KDE curve has to add up to 1.
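
If you want to convince yourself of that normalization, here is a small, self-contained sketch (using randomly generated errors rather than our model's actual residuals) showing that the area under a density-normalized histogram adds up to 1:

import numpy as np

rng = np.random.RandomState(0)
fake_errors = rng.normal(loc=0, scale=100000, size=1500)

# density=True rescales bar heights so that height * bin width sums to 1.
heights, bin_edges = np.histogram(fake_errors, bins=30, density=True)
print((heights * np.diff(bin_edges)).sum())  # ~1.0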

Maybe more gratifying, we can plot the predictions from our model:

In [17]:

# Plot outputs

plt.scatter(y_test,predictions, color='blue')


Out[17]:

<matplotlib.collections.PathCollection at 0x7f677935df98>


The linear nature of our predicted prices is clear enough, but because all of the dots are solid it's hard to see the areas of concentration. Can you think of a way to refine this visualization to make it clearer?

Exercise:

In [18]:

# Hint: Remember to try the plt.scatter parameter alpha=.

# It takes values between 0 and 1.


Takeaway: In this subsection, you performed prediction using linear regression by exploring your data, then fitting your model, and finally evaluating your model’s performance.

Classification: logistic regression

Learning goal: By the end of this subsection, you should know how logistic regression differs from linear regression, be comfortable fitting logistic regression models, and have some familiarity with interpreting their output.

We'll now pivot to discussing classification. If our simple analogy of predictive analytics was drawing a line through points and extrapolating from that, then classification can be described in its simplest form as drawing lines around groups of points.

While linear regression is used to predict quantitative responses, such as what someone's score on an exam might be, logistic regression is used for classification problems, such as predicting someone passing or failing an exam.

Formally, logistic regression predicts the categorical response (Y) based on predictors (Xs). Logistic regression goes by several names, and it is also known in the scholarly literature as logit regression, maximum-entropy classification (MaxEnt), and the log-linear classifier. In this algorithm, the probabilities describing the possible outcomes of a single trial are modeled using a sigmoid (S-curve) function. Sigmoid functions take any value and transform it to be between 0 and 1, which can be used as a probability for a class to be predicted, with the goal of predictors mapping to 1 when something belongs in the class and 0 when they do not.
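
To make the S-curve concrete, here is a small sketch (not part of the original notebook) of the sigmoid function that logistic regression applies to its weighted sum of predictors:

import numpy as np

def sigmoid(z):
    """Map any real-valued input to the open interval (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Large negative inputs map toward 0, large positive inputs toward 1, and 0 maps to 0.5,
# which is why the output can be read as the probability of belonging to the class.
print(sigmoid(np.array([-6.0, -2.0, 0.0, 2.0, 6.0])))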


To show this in action, let's do something a little different and try a historical dataset: the fates of the passengers of the RMS Titanic, which is a popular dataset for classification problems in machine learning. In this case, the class we want to predict is whether a passenger survived the doomed liner's sinking.

The dataset has 12 variables:

  • PassengerId
  • Survived: 0 = No, 1 = Yes
  • Pclass: Ticket class; 1 = 1st, 2 = 2nd, 3 = 3rd
  • Name: Passenger name
  • Sex
  • Age
  • SibSp: Number of siblings or spouses aboard the Titanic
  • Parch: Number of parents or children aboard the Titanic
  • Ticket: Passenger ticket number 
  • Fare: Passenger fare 
  • Cabin: Cabin number 
  • Embarked: Port of embarkation; C = Cherbourg, Q = Queenstown, S = Southampton

In [19]:

df = pd.read_csv('./Data/train_data_titanic.csv')

df.head()


Out[19]:

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |

In [20]:

df.info()


<class 'pandas.core.frame.DataFrame'>

RangeIndex: 891 entries, 0 to 890

Data columns (total 12 columns):

PassengerId    891 non-null int64

Survived       891 non-null int64

Pclass         891 non-null int64

Name           891 non-null object

Sex            891 non-null object

Age            714 non-null float64

SibSp          891 non-null int64

Parch          891 non-null int64

Ticket         891 non-null object

Fare           891 non-null float64

Cabin          204 non-null object

Embarked       889 non-null object

dtypes: float64(2), int64(5), object(5)

memory usage: 83.6+ KB

 

One reason that the Titanic data set is a popular classification set is that it provides opportunities to prepare data for analysis. To prepare this dataset for analysis, we need to perform a number of tasks:


  • Remove extraneous variables
  • Check for multicollinearity 
  • Handle missing values

We will touch on each of these steps in turn.

Remove extraneous variables

The names of individual passengers and their ticket numbers will clearly do nothing to help our model, so we can drop those columns to simplify matters.

In [21]:

df.drop(['Name','Ticket'],axis=1,inplace=True)


There are additional variables that will not add classifying power to our model, but to find them we will need to look for correlation between variables.

Check for multicollinearity

If one or more of our predictors can themselves be predicted from other predictors, it can produce a state of multicollinearity in our model. When items are too closely related, it can make it difficult to determine the true predictors of a condition. Multicollinearity is a challenge because it can skew the results of regression models (both linear and logistic) and reduce the predictive or classifying power of a model.
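
Beyond eyeballing plots, one common (optional) diagnostic is the variance inflation factor (VIF), which measures how well each predictor can be predicted from the other predictors; values well above roughly 5 to 10 are usually read as a warning sign. Here is a minimal sketch, assuming you have a DataFrame of purely numeric predictor columns (called X_numeric here, a hypothetical name) and that the statsmodels package is installed:

from statsmodels.stats.outliers_influence import variance_inflation_factor

# X_numeric is assumed to be a pandas DataFrame containing only numeric predictor columns.
for i, column in enumerate(X_numeric.columns):
    print(column, variance_inflation_factor(X_numeric.values, i))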

To help combat this problem, we can start to look for some initial patterns. For example, do any correlations between Survived and Fare jump out?

In [22]:

sns.pairplot(df[['Survived','Fare']], dropna=True)


Out[22]:

<seaborn.axisgrid.PairGrid at 0x7f6778de5d68>


Exercise:

In [23]:

# Try running sns.pairplot twice more on some other combinations of columns

# and see if any patterns emerge.


We can also use groupby to look for patterns. Consider the mean values for the various variables when we group by Survived:

In [24]:

df.groupby('Survived').mean()


Out[24]:

| Survived | PassengerId | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| 0 | 447.016393 | 2.531876 | 30.626179 | 0.553734 | 0.329690 | 22.117887 |
| 1 | 444.368421 | 1.950292 | 28.343690 | 0.473684 | 0.464912 | 48.395408 |

In [25]:

df.groupby('Age').mean()


Out[25]:

| Age | PassengerId | Survived | Pclass | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| 0.42 | 804.000000 | 1.000000 | 3.000000 | 0.000000 | 1.000000 | 8.516700 |
| 0.67 | 756.000000 | 1.000000 | 2.000000 | 1.000000 | 1.000000 | 14.500000 |
| 0.75 | 557.500000 | 1.000000 | 3.000000 | 2.000000 | 1.000000 | 19.258300 |
| 0.83 | 455.500000 | 1.000000 | 2.000000 | 0.500000 | 1.500000 | 23.875000 |
| 0.92 | 306.000000 | 1.000000 | 1.000000 | 1.000000 | 2.000000 | 151.550000 |
| 1.00 | 415.428571 | 0.714286 | 2.714286 | 1.857143 | 1.571429 | 30.005957 |
| 2.00 | 346.900000 | 0.300000 | 2.600000 | 2.100000 | 1.300000 | 37.536250 |
| 3.00 | 272.000000 | 0.833333 | 2.500000 | 1.833333 | 1.333333 | 25.781950 |
| 4.00 | 466.100000 | 0.700000 | 2.600000 | 1.600000 | 1.400000 | 29.543330 |
| 5.00 | 380.000000 | 1.000000 | 2.750000 | 1.750000 | 1.250000 | 22.717700 |
| 6.00 | 762.333333 | 0.666667 | 2.666667 | 1.333333 | 1.333333 | 25.583333 |
| 7.00 | 288.666667 | 0.333333 | 2.666667 | 2.666667 | 1.333333 | 31.687500 |
| 8.00 | 400.250000 | 0.500000 | 2.500000 | 2.000000 | 1.250000 | 28.300000 |
| 9.00 | 437.250000 | 0.250000 | 3.000000 | 2.500000 | 1.750000 | 27.938538 |
| 10.00 | 620.000000 | 0.000000 | 3.000000 | 1.500000 | 2.000000 | 26.025000 |
| 11.00 | 534.500000 | 0.250000 | 2.500000 | 2.500000 | 1.500000 | 54.240625 |
| 12.00 | 126.000000 | 1.000000 | 3.000000 | 1.000000 | 0.000000 | 11.241700 |
| 13.00 | 614.000000 | 1.000000 | 2.500000 | 0.000000 | 0.500000 | 13.364600 |
| 14.00 | 312.000000 | 0.500000 | 2.500000 | 2.000000 | 0.833333 | 42.625700 |
| 14.50 | 112.000000 | 0.000000 | 3.000000 | 1.000000 | 0.000000 | 14.454200 |
| 15.00 | 554.600000 | 0.800000 | 2.600000 | 0.400000 | 0.400000 | 49.655020 |
| 16.00 | 422.294118 | 0.352941 | 2.529412 | 0.764706 | 0.529412 | 25.745100 |
| 17.00 | 423.000000 | 0.461538 | 2.384615 | 0.615385 | 0.384615 | 28.389423 |
| 18.00 | 516.269231 | 0.346154 | 2.461538 | 0.384615 | 0.423077 | 38.063462 |
| 19.00 | 389.400000 | 0.360000 | 2.360000 | 0.320000 | 0.200000 | 27.869496 |
| 20.00 | 493.066667 | 0.200000 | 3.000000 | 0.200000 | 0.066667 | 8.624173 |
| 20.50 | 228.000000 | 0.000000 | 3.000000 | 0.000000 | 0.000000 | 7.250000 |
| 21.00 | 390.208333 | 0.208333 | 2.583333 | 0.333333 | 0.208333 | 31.565621 |
| 22.00 | 365.740741 | 0.407407 | 2.555556 | 0.148148 | 0.222222 | 25.504781 |
| 23.00 | 510.266667 | 0.333333 | 2.133333 | 0.400000 | 0.266667 | 37.994720 |
| ... | ... | ... | ... | ... | ... | ... |
| 44.00 | 437.111111 | 0.333333 | 2.111111 | 0.444444 | 0.222222 | 29.758333 |
| 45.00 | 367.500000 | 0.416667 | 2.000000 | 0.333333 | 0.583333 | 36.818408 |
| 45.50 | 268.000000 | 0.000000 | 2.000000 | 0.000000 | 0.000000 | 17.862500 |
| 46.00 | 427.000000 | 0.000000 | 1.333333 | 0.333333 | 0.000000 | 55.458333 |
| 47.00 | 534.666667 | 0.111111 | 1.777778 | 0.222222 | 0.111111 | 27.601389 |
| 48.00 | 663.111111 | 0.666667 | 1.666667 | 0.555556 | 0.555556 | 37.893067 |
| 49.00 | 533.500000 | 0.666667 | 1.333333 | 0.666667 | 0.166667 | 59.929183 |
| 50.00 | 457.200000 | 0.500000 | 1.600000 | 0.400000 | 0.200000 | 64.025830 |
| 51.00 | 456.142857 | 0.285714 | 2.000000 | 0.142857 | 0.142857 | 28.752386 |
| 52.00 | 589.500000 | 0.500000 | 1.333333 | 0.500000 | 0.333333 | 51.402783 |
| 53.00 | 572.000000 | 1.000000 | 1.000000 | 2.000000 | 0.000000 | 51.479200 |
| 54.00 | 383.625000 | 0.375000 | 1.500000 | 0.500000 | 0.500000 | 44.477087 |
| 55.00 | 254.500000 | 0.500000 | 1.500000 | 0.000000 | 0.000000 | 23.250000 |
| 55.50 | 153.000000 | 0.000000 | 3.000000 | 0.000000 | 0.000000 | 8.050000 |
| 56.00 | 542.750000 | 0.500000 | 1.000000 | 0.000000 | 0.250000 | 43.976025 |
| 57.00 | 700.000000 | 0.000000 | 2.000000 | 0.000000 | 0.000000 | 11.425000 |
| 58.00 | 325.000000 | 0.600000 | 1.000000 | 0.000000 | 0.600000 | 93.901660 |
| 59.00 | 164.000000 | 0.000000 | 2.500000 | 0.000000 | 0.000000 | 10.375000 |
| 60.00 | 583.750000 | 0.500000 | 1.250000 | 0.750000 | 0.500000 | 55.000000 |
| 61.00 | 374.666667 | 0.000000 | 1.666667 | 0.000000 | 0.000000 | 24.019433 |
| 62.00 | 552.500000 | 0.500000 | 1.250000 | 0.000000 | 0.000000 | 35.900000 |
| 63.00 | 380.000000 | 1.000000 | 2.000000 | 0.500000 | 0.000000 | 43.772900 |
| 64.00 | 492.500000 | 0.000000 | 1.000000 | 0.500000 | 2.000000 | 144.500000 |
| 65.00 | 264.333333 | 0.000000 | 1.666667 | 0.000000 | 0.333333 | 32.093067 |
| 66.00 | 34.000000 | 0.000000 | 2.000000 | 0.000000 | 0.000000 | 10.500000 |
| 70.00 | 709.500000 | 0.000000 | 1.500000 | 0.500000 | 0.500000 | 40.750000 |
| 70.50 | 117.000000 | 0.000000 | 3.000000 | 0.000000 | 0.000000 | 7.750000 |
| 71.00 | 295.500000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 42.079200 |
| 74.00 | 852.000000 | 0.000000 | 3.000000 | 0.000000 | 0.000000 | 7.775000 |
| 80.00 | 631.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 30.000000 |

88 rows × 6 columns

Survivors appear to have been slightly younger on average and to have paid higher fares.

In [26]:

df.head()


Out[26]:

| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | NaN | S |

Value counts can also help us get a sense of the data before us, such as numbers for siblings and spouses on the Titanic, in addition to the sex split of passengers:

In [27]:

df['SibSp'].value_counts()


Out[27]:

0    608

1    209

2     28

4     18

3     16

8      7

5      5

Name: SibSp, dtype: int64

In [28]:

df['Parch'].value_counts()


Out[28]:

0    678

1    118

2     80

5      5

3      5

4      4

6      1

Name: Parch, dtype: int64

In [29]:

df['Sex'].value_counts()


Out[29]:

male      577

female    314

Name: Sex, dtype: int64

Handle missing values

We now need to address missing values. First, let’s look to see which columns have more than half of their values missing:

In [30]:

#missing

df.isnull().sum()>(len(df)/2)


Out[30]:

PassengerId    False

Survived       False

Pclass         False

Sex            False

Age            False

SibSp          False

Parch          False

Fare           False

Cabin           True

Embarked       False

dtype: bool

Let's break down that line of code a bit. df.isnull().sum() tells pandas to count the missing values in each column. len(df)/2 is just another way of expressing half the number of rows in the DataFrame. Taken together with the >, this line of code flags any column with more than half of its entries missing, and there is one: Cabin.

We could try to do something about those missing values. However, if any pattern does emerge in the data that involves Cabin, it will be highly cross-correlated with both Pclass and Fare (as higher-fare, better-class accommodations were grouped together on the Titanic). Given that too much cross-correlation can be detrimental to a model, it is probably just better for us to drop Cabin from our DataFrame:

In [31]:

df.drop('Cabin',axis=1,inplace=True)


Let's now run info to see if there are columns with just a few null values.

In [32]:

df.info()


<class 'pandas.core.frame.DataFrame'>

RangeIndex: 891 entries, 0 to 890

Data columns (total 9 columns):

PassengerId    891 non-null int64

Survived       891 non-null int64

Pclass         891 non-null int64

Sex            891 non-null object

Age            714 non-null float64

SibSp          891 non-null int64

Parch          891 non-null int64

Fare           891 non-null float64

Embarked       889 non-null object

dtypes: float64(2), int64(5), object(2)

memory usage: 62.7+ KB

 

One note on the data: given that 1,503 people died in the Titanic tragedy (and that we know some survived), this dataset of 891 rows clearly does not include every passenger on the ship, and it includes none of the crew. Also remember that Survived covers both outcomes: those who survived (1) and those who perished (0).

Back to missing values. Age is missing several values, as is Embarked. Let's see how many values are missing from Age:

In [33]:

df['Age'].isnull().value_counts()


Out[33]:

False    714

True     177

Name: Age, dtype: int64

As we saw above, Age isn't really correlated with Fare, so it is a variable that we want to eventually use in our model. That means that we need to do something with those missing values. But before we decide on a strategy, we should check to see whether the median age is the same for both sexes.

In [34]:

df.groupby('Sex')['Age'].median().plot(kind='bar')


Out[34]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f6778cb5160>


The median ages are different for men and women sailing on the Titanic, which means that we should handle the missing values accordingly. A sound strategy is to replace the missing ages for passengers with the median age for the passengers' sexes.

In [35]:

df['Age'] = df.groupby('Sex')['Age'].apply(lambda x: x.fillna(x.median()))


Any other missing values?

In [36]:

df.isnull().sum()


Out[36]:

PassengerId    0

Survived       0

Pclass         0

Sex            0

Age            0

SibSp          0

Parch          0

Fare           0

Embarked       2

dtype: int64

We are missing two values for Embarked. Check to see how that variable breaks down:

In [37]:

df['Embarked'].value_counts()


Out[37]:

S    644

C    168

Q     77

Name: Embarked, dtype: int64

We can look to see which port of embarkation was most common and use that value as our default:

In [38]:

df['Embarked'].fillna(df['Embarked'].value_counts().idxmax(), inplace=True)

df['Embarked'].value_counts()


Out[38]:

S    646

C    168

Q     77

Name: Embarked, dtype: int64

We can see that Southampton was the most common port of embarkation; because we were only missing two values, using it as the default is reasonable. As a final preparation step, we convert the remaining categorical columns, Sex and Embarked, into numeric dummy variables (dropping the first category of each to avoid adding redundant, collinear columns):

In [39]:

df = pd.get_dummies(data=df, columns=['Sex', 'Embarked'],drop_first=True)

df.head()


Out[39]:

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | Sex_male | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 22.0 | 1 | 0 | 7.2500 | 1 | 0 | 1 |
| 1 | 2 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 | 0 | 0 |
| 2 | 3 | 1 | 3 | 26.0 | 0 | 0 | 7.9250 | 0 | 0 | 1 |
| 3 | 4 | 1 | 1 | 35.0 | 1 | 0 | 53.1000 | 0 | 0 | 1 |
| 4 | 5 | 0 | 3 | 35.0 | 0 | 0 | 8.0500 | 1 | 0 | 1 |

Let's do a final look at the correlation matrix to see if there is anything else we should remove.

In [40]:

df.corr()


Out[40]:

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | Sex_male | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|
| PassengerId | 1.000000 | -0.005007 | -0.035144 | 0.035734 | -0.057527 | -0.001652 | 0.012658 | 0.042939 | -0.033606 | 0.022204 |
| Survived | -0.005007 | 1.000000 | -0.338481 | -0.073296 | -0.035322 | 0.081629 | 0.257307 | -0.543351 | 0.003650 | -0.149683 |
| Pclass | -0.035144 | -0.338481 | 1.000000 | -0.338056 | 0.083081 | 0.018443 | -0.549500 | 0.131900 | 0.221009 | 0.074053 |
| Age | 0.035734 | -0.073296 | -0.338056 | 1.000000 | -0.236376 | -0.176038 | 0.094161 | 0.095256 | -0.032994 | -0.005855 |
| SibSp | -0.057527 | -0.035322 | 0.083081 | -0.236376 | 1.000000 | 0.414838 | 0.159651 | -0.114631 | -0.026354 | 0.068734 |
| Parch | -0.001652 | 0.081629 | 0.018443 | -0.176038 | 0.414838 | 1.000000 | 0.216225 | -0.245489 | -0.081228 | 0.060814 |
| Fare | 0.012658 | 0.257307 | -0.549500 | 0.094161 | 0.159651 | 0.216225 | 1.000000 | -0.182333 | -0.117216 | -0.162184 |
| Sex_male | 0.042939 | -0.543351 | 0.131900 | 0.095256 | -0.114631 | -0.245489 | -0.182333 | 1.000000 | -0.074115 | 0.119224 |
| Embarked_Q | -0.033606 | 0.003650 | 0.221009 | -0.032994 | -0.026354 | -0.081228 | -0.117216 | -0.074115 | 1.000000 | -0.499421 |
| Embarked_S | 0.022204 | -0.149683 | 0.074053 | -0.005855 | 0.068734 | 0.060814 | -0.162184 | 0.119224 | -0.499421 | 1.000000 |

Pclass and Fare have a fair amount of correlation (about -0.55), so we can probably get rid of one of them. In addition, we need to remove Survived from our predictor DataFrame X because it will be our response, y:

In [41]:

X = df.drop(['Survived', 'PassengerId','Pclass'],axis=1)

y = df['Survived']


Exercise:

We now need to split the training and test data, which you will do as an exercise:

In [42]:

from sklearn.model_selection import train_test_split

# Look up in the portion above on linear regression and use train_test_split here.

# Set test_size = 0.3 and random_state = 67 to get the same results as below when

# you run through the rest of the code example below.


Now you will import and fit the logistic regression model:

In [43]:

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
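
As a next step (a sketch rather than part of the original notebook), you can fit the classifier on the training data and check its accuracy on the held-out test data. This assumes you completed the train_test_split exercise above so that X_train, X_test, y_train, and y_test exist:

from sklearn.metrics import accuracy_score

# Fit on the training data, predict on the test data, and report simple accuracy.
lr.fit(X_train, y_train)
survival_predictions = lr.predict(X_test)
print(accuracy_score(y_test, survival_predictions))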

