A Step-by-Step Approach to Classifying Whether a Person Has Cancer Using ML/AI Algorithms

A step-by-step approach to machine learning classification in Python using Random Forest, PCA, and hyperparameter tuning, with complete Python code.

As data scientists, we have many options to choose from to create a classification model. One of the most popular and robust methods is using Random Forests. We can perform Hyperparameter Tuning on Random Forests to try to optimize the model’s performance.

It is also common practice to try Principal Component Analysis (PCA) before fitting our data to a model.

PCA can make each "feature" a little harder to interpret when we analyze the "feature importances" of our Random Forest model. However, PCA performs dimensionality reduction, which reduces the number of features the Random Forest has to process, so PCA might speed up the training of your model. Note that computational cost is one of the biggest drawbacks of Random Forests (training can take a long time), and PCA becomes especially valuable when you are working with hundreds or even thousands of predicting features. In short, if having the best-performing model matters most and interpreting feature importance can be sacrificed, PCA is worth trying.
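
To make the speed trade-off concrete, here is a minimal timing sketch. It is not part of the original walk-through; it assumes the same Scikit-learn breast cancer dataset used below, default model settings, and an arbitrary choice of 10 components:

import time
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Time a fit on all 30 original features
start = time.perf_counter()
RandomForestClassifier(random_state=0).fit(X_scaled, y)
t_full = time.perf_counter() - start

# Time a fit on 10 PCA components (an assumed cutoff, justified later in the article)
X_pca = PCA(n_components=10).fit_transform(X_scaled)
start = time.perf_counter()
RandomForestClassifier(random_state=0).fit(X_pca, y)
t_pca = time.perf_counter() - start

print(f'30 features: {t_full:.3f}s, 10 PCA components: {t_pca:.3f}s')
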
Now, let’s get started with our example. We will be working with the Scikit-learn “Breast Cancer” dataset. We will create two models and compare their performance to each other:

1. Random Forest        
2. Random Forest with PCA-Reduced Dimensionality
1. Import Data        
First, we load the data and create a dataframe. Since this is a pre-cleaned “toy” dataset from Scikit-learn, we are good to proceed straight to the modeling process.

As a best practice, we should always do the following:

Use df.head() to take a glance at the new dataframe and make sure it looks as intended.
Use df.info() to get a sense of the data types and counts in each column; convert data types as needed.
Use df.isna().sum() to make sure there are no NaN values; impute values or drop rows as needed.
Use df.describe() to get a sense of the minimum, maximum, mean, median, standard deviation, and interquartile range of each column.
The column named “cancer” is the target variable that we want to predict. Note that in this Scikit-learn dataset the label “0” means “malignant” (cancer present) and “1” means “benign” (no cancer).
import pandas as pd
from sklearn.datasets import load_breast_cancer

# The 30 feature names for the Scikit-learn breast cancer dataset
columns = ['mean radius', 'mean texture', 'mean perimeter', 'mean area',
           'mean smoothness', 'mean compactness', 'mean concavity',
           'mean concave points', 'mean symmetry', 'mean fractal dimension',
           'radius error', 'texture error', 'perimeter error', 'area error',
           'smoothness error', 'compactness error', 'concavity error',
           'concave points error', 'symmetry error', 'fractal dimension error',
           'worst radius', 'worst texture', 'worst perimeter', 'worst area',
           'worst smoothness', 'worst compactness', 'worst concavity',
           'worst concave points', 'worst symmetry', 'worst fractal dimension']

dataset = load_breast_cancer()
data = pd.DataFrame(dataset['data'], columns=columns)
data['cancer'] = dataset['target']  # 0 = malignant, 1 = benign
display(data.head())

(Output: the first five rows of the dataframe, 5 rows × 31 columns: the 30 numeric features plus the “cancer” target.)

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         569 non-null    float64
 15  compactness error        569 non-null    float64
 16  concavity error          569 non-null    float64
 17  concave points error     569 non-null    float64
 18  symmetry error           569 non-null    float64
 19  fractal dimension error  569 non-null    float64
 20  worst radius             569 non-null    float64
 21  worst texture            569 non-null    float64
 22  worst perimeter          569 non-null    float64
 23  worst area               569 non-null    float64
 24  worst smoothness         569 non-null    float64
 25  worst compactness        569 non-null    float64
 26  worst concavity          569 non-null    float64
 27  worst concave points     569 non-null    float64
 28  worst symmetry           569 non-null    float64
 29  worst fractal dimension  569 non-null    float64
 30  cancer                   569 non-null    int32  
dtypes: float64(30), int32(1)
memory usage: 135.7 KB
        
display(data.isna().sum())        
mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
cancer                     0
dtype: int64        
display(data.describe())        

(Output: data.describe() summary statistics, 8 rows × 31 columns: count, mean, std, min, 25%, 50%, 75%, and max for each column.)


Above is a portion of the breast cancer dataframe. Each row holds observations about one patient. The final column, “cancer”, is the target variable that we are trying to predict; as noted above, 0 means malignant (cancer present) and 1 means benign (no cancer).

2. Train-Test Split
Here, we split our data using the Scikit-learn train_test_split function. We want to give the model as much data as possible to train on, while still keeping enough data for the model to test itself on. In general, the more rows the dataset has, the more of it we can give to the training set: with millions of rows, a 90% train / 10% test split would be reasonable. Our dataset, however, has only 569 rows, which is not much to train or test on, so we split the data 50% train / 50% test to be fair to both sets. We set stratify=y to ensure that both the train and test sets have the same proportion of 0s and 1s as the original dataset.

First, we import the relevant function and perform the split:

from sklearn.model_selection import train_test_split

X = data.drop('cancer', axis=1)  # the 30 predicting features
y = data['cancer']               # the target labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.50, random_state=2020, stratify=y)
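
To confirm that stratification preserved the class balance (a quick sanity check, not in the original article), compare the proportion of 1s in the full dataset, the train set, and the test set; the three numbers should be nearly identical:

# Proportion of label 1 in each split
print(y.mean(), y_train.mean(), y_test.mean())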

3. Scale Data
Before modeling, we need to “center” and “standardize” our data by scaling. We scale to control for the fact that different variables are measured on different scales, so that each predictor gets a “fair fight” against the others in deciding importance. We also convert y_train from a Pandas Series into a NumPy array so the model will accept the target training data later on.
import numpy as np
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)  # fit the scaler on the training set only
X_test_scaled = ss.transform(X_test)        # reuse the training-set scaling on the test set
y_train = np.array(y_train)                 # convert the Series to a NumPy array
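
To verify the scaling (a sketch, not in the original article), check that every column of the scaled training set now has mean ≈ 0 and standard deviation ≈ 1; the test set will be close but not exact, because the scaler was fit on the training data only:

print(X_train_scaled.mean(axis=0).round(2))  # approximately 0 for every feature
print(X_train_scaled.std(axis=0).round(2))   # approximately 1 for every feature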

4. Fit To “Baseline” Random Forest Model
Now we create a “baseline” Random Forest model. This model uses all of the predicting features and the default settings defined in the Scikit-learn Random Forest Classifier documentation. First, we instantiate the model and fit the scaled data to it, then we measure its accuracy on the training data.
from sklearn.ensemble import RandomForestClassifier        
from sklearn.metrics import recall_score        
rfc = RandomForestClassifier()        
rfc.fit(X_train_scaled, y_train)        
display(rfc.score(X_train_scaled, y_train))        
1.0        
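
A training accuracy of 1.0 is expected here, since a Random Forest with default settings can essentially memorize a small training set, so the held-out test set gives a more honest measure. A quick check (not part of the original walk-through) using the recall_score already imported above:

preds = rfc.predict(X_test_scaled)
print('Test accuracy:', rfc.score(X_test_scaled, y_test))
print('Test recall (label 1):', recall_score(y_test, preds))
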
If we are curious to see which features are most important to the Random Forest model for predicting breast cancer, we can visualize and quantify the importances through the feature_importances_ attribute:
import seaborn as sns
import matplotlib.pyplot as plt

# Instantiate a fresh RandomForestClassifier and fit it on the scaled training data
rfc_1 = RandomForestClassifier()
rfc_1.fit(X_train_scaled, y_train)

# Map each of the 30 feature names to its Gini importance
feats = {}
for feature, importance in zip(X.columns, rfc_1.feature_importances_):
    feats[feature] = importance
importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-Importance'})        
importances = importances.sort_values(by='Gini-Importance', ascending=False)        
importances = importances.reset_index()        
importances = importances.rename(columns={'index': 'Features'})        
sns.set(style="whitegrid", color_codes=True, font_scale=1.7)
fig, ax = plt.subplots()
fig.set_size_inches(35, 20)
sns.barplot(x=importances['Gini-Importance'], y=importances['Features'], data=importances, color='green')
plt.xlabel('Importance', fontsize=30, weight='bold')
plt.ylabel('Features', fontsize=30, weight='bold')
plt.title('Feature Importance', fontsize=30, weight='bold')
plt.show()
display(importances)

Features                   Gini-Importance
worst area                 0.160013
worst concave points       0.134852
worst perimeter            0.108445
worst radius               0.092098
mean concave points        0.087374
mean area                  0.086158
mean radius                0.062147
worst concavity            0.053510
area error                 0.041403
mean perimeter             0.031531
mean concavity             0.025719
worst texture              0.011973
worst compactness          0.011629
mean compactness           0.010870
worst smoothness           0.009485
radius error               0.008631
perimeter error            0.008184
worst symmetry             0.007517
concavity error            0.006875
mean texture               0.006216
worst fractal dimension    0.006004
fractal dimension error    0.004676
mean smoothness            0.003955
mean fractal dimension     0.003776
smoothness error           0.003534
compactness error          0.003359
mean symmetry              0.003065
symmetry error             0.003007
concave points error       0.002474
texture error              0.001522

5. PCA (Principal Component Analysis)
Now, how could we improve our baseline model? Using dimensionality reduction, we can approximate the original dataset with fewer variables while reducing the computational power needed to run the model. Using PCA, we can study the cumulative explained variance ratio of the components to understand which of them explain the most variance in the data. We instantiate the PCA function and set the number of components (features) to consider. We’ll set it to 30 to see the explained variance of all the generated components before deciding where to make the cut. Then we fit our scaled X_train data to the PCA function.
import matplotlib.pyplot as plt        
import seaborn as sns        
from sklearn.decomposition import PCA        
pca_test = PCA(n_components=30)        
pca_test.fit(X_train_scaled)        
sns.set(style='whitegrid')        
plt.plot(np.cumsum(pca_test.explained_variance_ratio_))        
plt.xlabel('number of components')        
plt.ylabel('cumulative explained variance')        
plt.axvline(linewidth=4, color='g', linestyle = '--', x=10, ymin=0, ymax=1)        
plt.show()
evr = pca_test.explained_variance_ratio_        
cvr = np.cumsum(pca_test.explained_variance_ratio_)        
pca_df = pd.DataFrame()        
pca_df['Cumulative Variance Ratio'] = cvr        
pca_df['Explained Variance Ratio'] = evr        
display(pca_df.head(10))        

   Cumulative Variance Ratio   Explained Variance Ratio
0  0.448362                    0.448362
1  0.625759                    0.177397
2  0.724960                    0.099201
3  0.792890                    0.067930
4  0.848247                    0.055357
5  0.888681                    0.040435
6  0.911139                    0.022457
7  0.928491                    0.017353
8  0.942257                    0.013766
9  0.954676                    0.012419

This dataframe shows the cumulative variance ratio (how much of the data’s total variance is explained so far) and the explained variance ratio (how much each PCA component individually explains). Looking at the dataframe above, when we use PCA to reduce our 30 predicting variables down to 10 components, we can still explain over 95% of the variance. The other 20 components explain less than 5% of the variance, so we can cut them. Using this logic, we will use PCA to reduce the number of components from 30 to 10 for X_train and X_test, assigning these recreated, “reduced dimension” datasets to “X_train_scaled_pca” and “X_test_scaled_pca” (a quick programmatic check of this cutoff follows).
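
As a quick check (a sketch, not from the original article), the cutoff can also be computed programmatically from the cumulative ratios calculated above:

# Smallest number of components whose cumulative explained variance reaches 95%
n_components = int(np.argmax(cvr >= 0.95)) + 1
print(n_components)  # 10 for this data
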
pca = PCA(n_components=10)        
pca.fit(X_train_scaled)        
X_train_scaled_pca = pca.transform(X_train_scaled)        
X_test_scaled_pca = pca.transform(X_test_scaled)        
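
Because PCA is lossy, it is worth checking how well 10 components approximate the original 30 scaled features. A minimal sketch (not part of the original article) using PCA’s inverse_transform:

# Map the 10 components back to the original 30-dimensional feature space
X_train_reconstructed = pca.inverse_transform(X_train_scaled_pca)
# Mean squared error between the scaled data and its PCA approximation
print(np.mean((X_train_scaled - X_train_reconstructed) ** 2))
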
Each component is a linear combination of the original variables with corresponding “weights”. We can see these “weights” for each PCA component by creating a dataframe.        
pca_dims = []        
for x in range(0, len(pca_df)):        
    pca_dims.append('PCA Component {}'.format(x))        
pca_test_df = pd.DataFrame(pca_test.components_, columns=columns, index=pca_dims)        
pca_test_df.head(10).T        

(Rows are the original features; columns are PCA Components 0 through 9.)

mean radius               0.229757  -0.210916  -0.030019   0.046463   0.033286   0.015126  -0.069528  -0.188402  -0.123401  -0.073688
mean texture              0.101159  -0.069574   0.058763  -0.596565   0.003598  -0.043046   0.063547   0.073671   0.162137  -0.154302
mean perimeter            0.236449  -0.193037  -0.028313   0.047683   0.031936   0.007210  -0.063591  -0.188007  -0.121388  -0.064397
mean area                 0.230530  -0.212977   0.013286   0.065251  -0.004611   0.000570  -0.006363  -0.095945  -0.136370  -0.082562
mean smoothness           0.137090   0.205970  -0.031698   0.117614  -0.453585  -0.079954  -0.162701  -0.178166   0.124023   0.384612
mean compactness          0.237018   0.166762  -0.048961   0.023697  -0.001176  -0.064327   0.040522  -0.219044   0.004426   0.042943
mean concavity            0.258283   0.046613   0.015929   0.017595   0.080184  -0.038802  -0.094767   0.066544  -0.072309   0.109753
mean concave points       0.261131  -0.027002  -0.012987   0.044803  -0.054359  -0.032057  -0.145943  -0.175107  -0.062065   0.034035
mean symmetry             0.142589   0.199079  -0.025816   0.011965  -0.171304   0.425826  -0.052532  -0.192594   0.573150  -0.366492
mean fractal dimension    0.048243   0.374409   0.019466   0.052554  -0.125120  -0.206347   0.276034  -0.103717   0.037225   0.121591
radius error              0.198241  -0.156043   0.242641   0.097740  -0.169394   0.005944   0.253895   0.196675   0.189494   0.062926
texture error             0.014900   0.029639   0.381296  -0.394502  -0.180359   0.016760  -0.112127  -0.290333  -0.135634   0.301144
perimeter error           0.201851  -0.146351   0.247544   0.098916  -0.146092  -0.021203   0.242301   0.204722   0.171634   0.106013
area error                0.194945  -0.178766   0.202098   0.119607  -0.145411  -0.004738   0.301414   0.274772   0.134046   0.044372
smoothness error          0.014347   0.191923   0.365874   0.038990  -0.284123  -0.089260  -0.247373   0.244809  -0.322680  -0.580268
compactness error         0.171912   0.223294   0.179930  -0.014723   0.301831  -0.004327   0.060756  -0.078446  -0.039296  -0.158751
concavity error           0.157414   0.179085   0.175692   0.014822   0.371139  -0.016035  -0.171668   0.387779   0.068313   0.180893
concave points error      0.177422   0.110437   0.208714   0.059477   0.210365  -0.096004  -0.450919  -0.016357   0.360001   0.040743
symmetry error            0.043627   0.131212   0.322861   0.031223   0.024503   0.600697   0.062640  -0.033279  -0.380295   0.229579
fractal dimension error   0.107563   0.267545   0.232915   0.036954   0.204746  -0.162705   0.293586  -0.288475  -0.072128  -0.188742
worst radius              0.234702  -0.198216  -0.084894   0.012091  -0.009573   0.001169   0.000195  -0.085686  -0.084838  -0.077837
worst texture             0.097766  -0.046985  -0.039242  -0.635329  -0.069450  -0.031456   0.035084   0.080786   0.026349   0.017349
worst perimeter           0.241662  -0.179552  -0.082561   0.013082  -0.002634  -0.010149   0.000019  -0.083420  -0.085520  -0.062077
worst area                0.231446  -0.201948  -0.044320   0.028217  -0.040899  -0.019286   0.064157  -0.004252  -0.082277  -0.100282
worst smoothness          0.115744   0.219736  -0.197580  -0.004231  -0.429084  -0.125672  -0.188421   0.268932  -0.187861  -0.036686
worst compactness         0.211018   0.163505  -0.219081  -0.073312   0.101510  -0.009020   0.125119   0.017447  -0.082049  -0.007973
worst concavity           0.228496   0.108625  -0.152782  -0.054237   0.184507   0.000937  -0.082448   0.305685  -0.068861   0.170074
worst concave points      0.248232   0.013176  -0.171477  -0.004987   0.033230  -0.041166  -0.207445   0.011237  -0.002608   0.032291
worst symmetry            0.117385   0.153712  -0.285296  -0.055914  -0.038116   0.557743   0.073833   0.151422  -0.057087  -0.028769
worst fractal dimension   0.124467   0.302658  -0.213438  -0.049523   0.006512  -0.158821   0.351446   0.021301  -0.085360  -0.052610

6. Fit To “Baseline” Random Forest Model After PCA
Now we can fit our X_train_scaled_pca and y_train data to another “baseline” Random Forest model, to see whether we get any improvement in the model’s predictions.

rfc = RandomForestClassifier()        
rfc.fit(X_train_scaled_pca, y_train)        
display(rfc.score(X_train_scaled_pca, y_train))        
1.0        
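
Training accuracy alone will not show whether PCA helped, so the held-out test set is the fair comparison. Below is a minimal sketch (not from the original article) that scores this PCA model on the test data, to be set against the earlier baseline’s test accuracy and recall:

preds_pca = rfc.predict(X_test_scaled_pca)
print('Test accuracy (10 PCA components):', rfc.score(X_test_scaled_pca, y_test))
print('Test recall (10 PCA components):', recall_score(y_test, preds_pca))
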
Target labels: in this Scikit-learn dataset, 0 means malignant (cancer present) and 1 means benign (no cancer).
