A Step-by-Step Approach to Classifying Whether a Person Has Cancer Using ML/AI Algorithms
Yash Raj Singh
M.Tech|MS-LJMU-UK|PGD-IIITB|MLAI|GEN-AI|AI-Architect|EDA|Technology, Analytics, Consulting services-BSFI-CMT-HRTTEM|Python|Predictive|Prescriptive/Descriptive|Diagnostic|Problem-Solving|PCTT|CFC|PE-Chat-GPT
A step-by-step approach to machine learning classification in Python using Random Forest, PCA, and hyperparameter tuning, with complete Python code
As data scientists, we have many options for building a classification model. One of the most popular and robust methods is the Random Forest, and we can apply hyperparameter tuning to optimize its performance.
It is also common practice to try Principal Component Analysis (PCA) before fitting our data to a model.
PCA makes each “feature” a little harder to interpret when we analyze the “feature importances” of our Random Forest model.
However, PCA performs dimensionality reduction, which reduces the number of features the Random Forest has to process, so it can speed up training.
Note that computational cost is one of the biggest drawbacks of Random Forests (the model can take a long time to run), and PCA becomes especially important when you are working with hundreds or even thousands of predicting features.
So if the most important thing is simply to have the best-performing model, and interpreting feature importance can be sacrificed, then PCA may be worth trying.
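To make this trade-off concrete, here is a minimal, self-contained sketch (an addition, not part of the article’s pipeline) that times a Random Forest on the raw features versus on PCA-reduced features. On a dataset this small the gap is modest, but it grows with the number of features:
import time
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Time a forest on all 30 raw features
start = time.perf_counter()
RandomForestClassifier(random_state=0).fit(X, y)
print('30 raw features: {:.3f}s'.format(time.perf_counter() - start))

# Time a forest on 10 PCA components of the same data
X_reduced = PCA(n_components=10).fit_transform(X)
start = time.perf_counter()
RandomForestClassifier(random_state=0).fit(X_reduced, y)
print('10 PCA components: {:.3f}s'.format(time.perf_counter() - start))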
Now, let’s get started with our example. We will be working with the Scikit-learn “Breast Cancer” dataset. We will create two models and compare their performance to each other:
1. Random Forest
2. Random Forest with PCA-reduced dimensionality
1. Import Data
First, we load the data and create a dataframe.
Since this is a pre-cleaned “toy” dataset from Scikit-learn, we are good to proceed with the modeling process.
As a best practice, we should always do the following:
Use df.head() to take a glance at the new dataframe and make sure it looks as intended.
Use df.info() to get a sense of the data types and counts in each column; convert data types as needed.
Use df.isna().sum() to make sure there are no NaN values; impute values or drop rows as needed.
Use df.describe() to get a sense of the minimum, maximum, mean, median, standard deviation, and interquartile range of each column.
The column named “cancer” is the target variable that we want to predict using our model.
LABELS: “0” means “malignant” (cancer); “1” means “benign” (no cancer).
import pandas as pd
from sklearn.datasets import load_breast_cancer
columns = ['mean radius', 'mean texture', 'mean perimeter', 'mean area',
'mean smoothness', 'mean compactness', 'mean concavity',
'mean concave points', 'mean symmetry', 'mean fractal dimension',
'radius error', 'texture error', 'perimeter error', 'area error',
'smoothness error', 'compactness error', 'concavity error',
'concave points error', 'symmetry error', 'fractal dimension error',
'worst radius', 'worst texture', 'worst perimeter', 'worst area',
'worst smoothness', 'worst compactness', 'worst concavity',
'worst concave points', 'worst symmetry', 'worst fractal dimension']
dataset = load_breast_cancer()
data = pd.DataFrame(dataset['data'], columns=columns)
data['cancer'] = dataset['target']
display(data.head())
   mean radius  mean texture  mean perimeter  mean area  ...  worst symmetry  worst fractal dimension  cancer
0        17.99         10.38          122.80     1001.0  ...          0.4601                  0.11890       0
1        20.57         17.77          132.90     1326.0  ...          0.2750                  0.08902       0
2        19.69         21.25          130.00     1203.0  ...          0.3613                  0.08758       0
3        11.42         20.38           77.58      386.1  ...          0.6638                  0.17300       0
4        20.29         14.34          135.10     1297.0  ...          0.2364                  0.07678       0
5 rows × 31 columns
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 mean radius 569 non-null float64
1 mean texture 569 non-null float64
2 mean perimeter 569 non-null float64
3 mean area 569 non-null float64
4 mean smoothness 569 non-null float64
5 mean compactness 569 non-null float64
6 mean concavity 569 non-null float64
7 mean concave points 569 non-null float64
8 mean symmetry 569 non-null float64
9 mean fractal dimension 569 non-null float64
10 radius error 569 non-null float64
11 texture error 569 non-null float64
12 perimeter error 569 non-null float64
13 area error 569 non-null float64
14 smoothness error 569 non-null float64
15 compactness error 569 non-null float64
16 concavity error 569 non-null float64
17 concave points error 569 non-null float64
18 symmetry error 569 non-null float64
19 fractal dimension error 569 non-null float64
20 worst radius 569 non-null float64
21 worst texture 569 non-null float64
22 worst perimeter 569 non-null float64
23 worst area 569 non-null float64
24 worst smoothness 569 non-null float64
25 worst compactness 569 non-null float64
26 worst concavity 569 non-null float64
27 worst concave points 569 non-null float64
28 worst symmetry 569 non-null float64
29 worst fractal dimension 569 non-null float64
30 cancer 569 non-null int32
dtypes: float64(30), int32(1)
memory usage: 135.7 KB
display(data.isna().sum())
mean radius 0
mean texture 0
mean perimeter 0
mean area 0
mean smoothness 0
mean compactness 0
mean concavity 0
mean concave points 0
mean symmetry 0
mean fractal dimension 0
radius error 0
texture error 0
perimeter error 0
area error 0
smoothness error 0
compactness error 0
concavity error 0
concave points error 0
symmetry error 0
fractal dimension error 0
worst radius 0
worst texture 0
worst perimeter 0
worst area 0
worst smoothness 0
worst compactness 0
worst concavity 0
worst concave points 0
worst symmetry 0
worst fractal dimension 0
cancer 0
dtype: int64
display(data.describe())
       mean radius  mean texture  mean perimeter    mean area  ...  worst fractal dimension      cancer
count   569.000000    569.000000      569.000000   569.000000  ...               569.000000  569.000000
mean     14.127292     19.289649       91.969033   654.889104  ...                 0.083946    0.627417
std       3.524049      4.301036       24.298981   351.914129  ...                 0.018061    0.483918
min       6.981000      9.710000       43.790000   143.500000  ...                 0.055040    0.000000
25%      11.700000     16.170000       75.170000   420.300000  ...                 0.071460    0.000000
50%      13.370000     18.840000       86.240000   551.100000  ...                 0.080040    1.000000
75%      15.780000     21.800000      104.100000   782.700000  ...                 0.092080    1.000000
max      28.110000     39.280000      188.500000  2501.000000  ...                 0.207500    1.000000
8 rows × 31 columns
Above is a portion of the breast cancer dataframe. Each row contains observations about a patient. The final column, “cancer”, is the target variable that we are trying to predict: 0 means “malignant” (cancer) and 1 means “benign” (no cancer).
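To see the class balance behind these labels, here is a quick check (a small addition to the original walkthrough, using the dataframe built above):
# How many rows fall into each class of the target column;
# the dataset holds 357 benign (1) and 212 malignant (0) cases
display(data['cancer'].value_counts())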
2. Train-Test Split
Here, we split our data using the Scikit-learn “train_test_split” function.
We want to give the model as much data as possible to train on, while keeping enough data aside for the model to be tested on.
In general, the more rows a dataset has, the larger the share we can give to the training set: if we had millions of rows, a 90% train / 10% test split would work well.
However, our dataset has only 569 rows, which is not very large for either training or testing.
So, to be fair to both the train set and the test set, we split the data into 50% train and 50% test.
We set stratify=y to ensure that both the train and test sets have the same proportion of 0s and 1s as the original dataset.
Let’s import the relevant library and perform the split:
from sklearn.model_selection import train_test_split
X = data.drop('cancer', axis=1)
y = data['cancer']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state = 2020, stratify=y)
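Because we set stratify=y, the class balance should be nearly identical in the full, train, and test targets. A quick sanity check (a small addition):
# Proportion of 1s (benign) in each target; all three should be ~0.63
print(y.mean(), y_train.mean(), y_test.mean())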
3. Scale Data
Before modeling, we need to “center” and “standardize” our data by scaling it.
We scale to control for the fact that different variables are measured on different scales, so that each predictor gets a “fair fight” when the model decides importance.
We also convert “y_train” from a Pandas “Series” object into a NumPy array for the model to accept the target training data later on.
import numpy as np
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)  # fit the scaler on the training data only
X_test_scaled = ss.transform(X_test)  # apply the same scaling to the test data (avoids leakage)
y_train = np.array(y_train)
4. Fit a “Baseline” Random Forest Model
Now we create a “baseline” Random Forest model.
This model uses all of the predicting features and the default settings defined in the Scikit-learn RandomForestClassifier documentation.
First, we instantiate the model and fit the scaled data to it. We can then measure the accuracy of the model on our training data.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
rfc = RandomForestClassifier()
rfc.fit(X_train_scaled, y_train)
display(rfc.score(X_train_scaled, y_train))
1.0
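A perfect 1.0 on the training data is expected for a Random Forest, since each tree can essentially memorize its bootstrap sample, so this number says little about generalization. As a small addition, it is worth also scoring the held-out test set:
# Accuracy on the held-out test set; the exact value will vary from run to run
display(rfc.score(X_test_scaled, y_test))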
If we are curious to see which features are most important to the Random Forest model for predicting breast cancer,
we can visualize and quantify the importances through the model’s “feature_importances_” attribute:
from sklearn.ensemble import RandomForestClassifier

# Instantiate the RandomForestClassifier
rfc_1 = RandomForestClassifier()

# Fit the model with the scaled training data; the model must be fitted
# before feature_importances_ can be accessed (otherwise a NotFittedError is raised)
rfc_1.fit(X_train_scaled, y_train)
import seaborn as sns
import matplotlib.pyplot as plt

# Map each of the 30 feature names to its Gini importance
feats = {}
for feature, importance in zip(columns, rfc_1.feature_importances_):
    feats[feature] = importance

importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-Importance'})
importances = importances.sort_values(by='Gini-Importance', ascending=False)
importances = importances.reset_index()
importances = importances.rename(columns={'index': 'Features'})

sns.set(style="whitegrid", color_codes=True, font_scale=1.7)
fig, ax = plt.subplots()
fig.set_size_inches(35, 20)
sns.barplot(x='Gini-Importance', y='Features', data=importances, color='green')
plt.xlabel('Importance', fontsize=30, weight='bold')
plt.ylabel('Features', fontsize=30, weight='bold')
plt.title('Feature Importance', fontsize=30, weight='bold')
plt.show()
display(importances)
                   Features  Gini-Importance
0                worst area         0.160013
1      worst concave points         0.134852
2           worst perimeter         0.108445
3              worst radius         0.092098
4       mean concave points         0.087374
5                 mean area         0.086158
6               mean radius         0.062147
7           worst concavity         0.053510
8                area error         0.041403
9            mean perimeter         0.031531
10           mean concavity         0.025719
11            worst texture         0.011973
12        worst compactness         0.011629
13         mean compactness         0.010870
14         worst smoothness         0.009485
15             radius error         0.008631
16          perimeter error         0.008184
17           worst symmetry         0.007517
18          concavity error         0.006875
19             mean texture         0.006216
20  worst fractal dimension         0.006004
21  fractal dimension error         0.004676
22          mean smoothness         0.003955
23   mean fractal dimension         0.003776
24         smoothness error         0.003534
25        compactness error         0.003359
26            mean symmetry         0.003065
27           symmetry error         0.003007
28     concave points error         0.002474
29            texture error         0.001522
5. PCA (Principal Component Analysis)
Now, how could we improve our baseline model?
Using dimensionality reduction, we can approximate the original dataset with fewer variables while reducing the computational power needed to run our model.
Using PCA, we can study the cumulative explained variance ratio of the components to understand how much of the variance in the data each one explains.
We instantiate the PCA function and set the number of components (features) that we want to consider.
We’ll set it to “30” to see the explained variance of all the generated components before deciding where to make the cut.
Then we “fit” our scaled X_train data to the PCA function.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
pca_test = PCA(n_components=30)
pca_test.fit(X_train_scaled)
sns.set(style='whitegrid')
plt.plot(np.cumsum(pca_test.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.axvline(linewidth=4, color='g', linestyle = '--', x=10, ymin=0, ymax=1)
plt.show()
evr = pca_test.explained_variance_ratio_
cvr = np.cumsum(pca_test.explained_variance_ratio_)
pca_df = pd.DataFrame()
pca_df['Cumulative Variance Ratio'] = cvr
pca_df['Explained Variance Ratio'] = evr
display(pca_df.head(10))
   Cumulative Variance Ratio  Explained Variance Ratio
0                   0.448362                  0.448362
1                   0.625759                  0.177397
2                   0.724960                  0.099201
3                   0.792890                  0.067930
4                   0.848247                  0.055357
5                   0.888681                  0.040435
6                   0.911139                  0.022457
7                   0.928491                  0.017353
8                   0.942257                  0.013766
9                   0.954676                  0.012419
This dataframe shows the Cumulative Variance Ratio (how much of the data’s total variance is explained so far) and the Explained Variance Ratio (how much of the total variance each individual PCA component explains).
Looking at the dataframe above, when we use PCA to reduce our 30 predicting variables down to 10 components, we can still explain over 95% of the variance. The other 20 components explain less than 5% of the variance, so we can cut them. Using this logic, we will use PCA to reduce the number of components from 30 to 10 for X_train and X_test. We will assign these recreated, “reduced dimension” datasets to “X_train_scaled_pca” and “X_test_scaled_pca”.
# Reduce the 30 scaled features to 10 principal components
pca = PCA(n_components=10)
pca.fit(X_train_scaled)
X_train_scaled_pca = pca.transform(X_train_scaled)
X_test_scaled_pca = pca.transform(X_test_scaled)
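As a quick check (a small addition), we can confirm that the 10 retained components still explain roughly 95% of the variance, matching the dataframe above:
# Total variance explained by the 10 retained components (~0.95)
print(pca.explained_variance_ratio_.sum())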
Each component is a linear combination of the original variables with corresponding “weights”. We can see these “weights” for each PCA component by creating a dataframe.
pca_dims = []
for x in range(0, len(pca_df)):
    pca_dims.append('PCA Component {}'.format(x))
pca_test_df = pd.DataFrame(pca_test.components_, columns=columns, index=pca_dims)
pca_test_df.head(10).T
Columns are PCA Component 0 through PCA Component 9, left to right:
mean radius               0.229757  -0.210916  -0.030019   0.046463   0.033286   0.015126  -0.069528  -0.188402  -0.123401  -0.073688
mean texture              0.101159  -0.069574   0.058763  -0.596565   0.003598  -0.043046   0.063547   0.073671   0.162137  -0.154302
mean perimeter            0.236449  -0.193037  -0.028313   0.047683   0.031936   0.007210  -0.063591  -0.188007  -0.121388  -0.064397
mean area                 0.230530  -0.212977   0.013286   0.065251  -0.004611   0.000570  -0.006363  -0.095945  -0.136370  -0.082562
mean smoothness           0.137090   0.205970  -0.031698   0.117614  -0.453585  -0.079954  -0.162701  -0.178166   0.124023   0.384612
mean compactness          0.237018   0.166762  -0.048961   0.023697  -0.001176  -0.064327   0.040522  -0.219044   0.004426   0.042943
mean concavity            0.258283   0.046613   0.015929   0.017595   0.080184  -0.038802  -0.094767   0.066544  -0.072309   0.109753
mean concave points       0.261131  -0.027002  -0.012987   0.044803  -0.054359  -0.032057  -0.145943  -0.175107  -0.062065   0.034035
mean symmetry             0.142589   0.199079  -0.025816   0.011965  -0.171304   0.425826  -0.052532  -0.192594   0.573150  -0.366492
mean fractal dimension    0.048243   0.374409   0.019466   0.052554  -0.125120  -0.206347   0.276034  -0.103717   0.037225   0.121591
radius error              0.198241  -0.156043   0.242641   0.097740  -0.169394   0.005944   0.253895   0.196675   0.189494   0.062926
texture error             0.014900   0.029639   0.381296  -0.394502  -0.180359   0.016760  -0.112127  -0.290333  -0.135634   0.301144
perimeter error           0.201851  -0.146351   0.247544   0.098916  -0.146092  -0.021203   0.242301   0.204722   0.171634   0.106013
area error                0.194945  -0.178766   0.202098   0.119607  -0.145411  -0.004738   0.301414   0.274772   0.134046   0.044372
smoothness error          0.014347   0.191923   0.365874   0.038990  -0.284123  -0.089260  -0.247373   0.244809  -0.322680  -0.580268
compactness error         0.171912   0.223294   0.179930  -0.014723   0.301831  -0.004327   0.060756  -0.078446  -0.039296  -0.158751
concavity error           0.157414   0.179085   0.175692   0.014822   0.371139  -0.016035  -0.171668   0.387779   0.068313   0.180893
concave points error      0.177422   0.110437   0.208714   0.059477   0.210365  -0.096004  -0.450919  -0.016357   0.360001   0.040743
symmetry error            0.043627   0.131212   0.322861   0.031223   0.024503   0.600697   0.062640  -0.033279  -0.380295   0.229579
fractal dimension error   0.107563   0.267545   0.232915   0.036954   0.204746  -0.162705   0.293586  -0.288475  -0.072128  -0.188742
worst radius              0.234702  -0.198216  -0.084894   0.012091  -0.009573   0.001169   0.000195  -0.085686  -0.084838  -0.077837
worst texture             0.097766  -0.046985  -0.039242  -0.635329  -0.069450  -0.031456   0.035084   0.080786   0.026349   0.017349
worst perimeter           0.241662  -0.179552  -0.082561   0.013082  -0.002634  -0.010149   0.000019  -0.083420  -0.085520  -0.062077
worst area                0.231446  -0.201948  -0.044320   0.028217  -0.040899  -0.019286   0.064157  -0.004252  -0.082277  -0.100282
worst smoothness          0.115744   0.219736  -0.197580  -0.004231  -0.429084  -0.125672  -0.188421   0.268932  -0.187861  -0.036686
worst compactness         0.211018   0.163505  -0.219081  -0.073312   0.101510  -0.009020   0.125119   0.017447  -0.082049  -0.007973
worst concavity           0.228496   0.108625  -0.152782  -0.054237   0.184507   0.000937  -0.082448   0.305685  -0.068861   0.170074
worst concave points      0.248232   0.013176  -0.171477  -0.004987   0.033230  -0.041166  -0.207445   0.011237  -0.002608   0.032291
worst symmetry            0.117385   0.153712  -0.285296  -0.055914  -0.038116   0.557743   0.073833   0.151422  -0.057087  -0.028769
worst fractal dimension   0.124467   0.302658  -0.213438  -0.049523   0.006512  -0.158821   0.351446   0.021301  -0.085360  -0.052610
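To make the “weights” idea concrete, here is a small verification sketch (an addition): the first PCA value of a training row is just the dot product of that centered row with the first component’s weight vector, and each weight vector has unit length.
# Recompute the first PCA value of the first training row by hand: subtract the
# (near-zero, since the data is standardized) feature means, then take the dot
# product with the first component's weight vector
centered_row = X_train_scaled[0] - pca.mean_
print(np.allclose(X_train_scaled_pca[0, 0], centered_row @ pca.components_[0]))  # True

# Each component's weight vector has unit length
print(np.linalg.norm(pca.components_[0]))  # ~1.0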
# Fit a new Random Forest on the PCA-reduced training data
rfc = RandomForestClassifier()
rfc.fit(X_train_scaled_pca, y_train)
display(rfc.score(X_train_scaled_pca, y_train))
1.0
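Training accuracy is again a perfect 1.0, which is expected and uninformative on its own. To actually compare the two models as promised at the start, we can score both on the held-out test set. Here is a short sketch (an addition; the names rfc_full and rfc_pca are ours, and we refit a forest on the original scaled features since rfc was just re-fit on the PCA data):
# Forest on all 30 scaled features vs. forest on the 10 PCA components,
# both evaluated on the held-out test set
rfc_full = RandomForestClassifier(random_state=0).fit(X_train_scaled, y_train)
rfc_pca = RandomForestClassifier(random_state=0).fit(X_train_scaled_pca, y_train)
print('Test accuracy, 30 original features:', rfc_full.score(X_test_scaled, y_test))
print('Test accuracy, 10 PCA components:', rfc_pca.score(X_test_scaled_pca, y_test))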
TARGET LABELS: “0” means “malignant” (cancer); “1” means “benign” (no cancer).