Breast Cancer Detection Using Machine Learning Classifier
Breast cancer is one of the most common cancers in women worldwide: roughly 12% of women are affected by it, and the number of cases is still rising. If it is not identified at an early stage, it can be fatal.
How a breast cancer’s stage is determined
The first step is to determine the stage of the breast cancer, that is, whether it is limited to one area in the breast, or has spread to healthy tissue inside the breast or to other parts of the body. Doctors usually begin staging during the surgery to remove the cancer, when they examine one or more of the underarm lymph nodes, which is where breast cancer tends to travel first. A doctor also may order additional blood tests or imaging tests if there is reason to believe the cancer might have spread beyond the breast.
The breast cancer staging system, called the TNM system, is overseen by the American Joint Committee on Cancer. The AJCC is a group of cancer experts who oversee how cancer is classified and communicated. This is to ensure that all doctors and treatment facilities are describing cancer in a uniform way so that the treatment results of all people can be compared and understood.
In the past, stage number was calculated based on just three clinical characteristics, T, N, and M:
- the size of the cancer tumor and whether or not it has grown into nearby tissue (T)
- whether cancer is in the lymph nodes (N)
- whether the cancer has spread to other parts of the body beyond the breast (M)
Numbers or letters after T, N, and M give more details about each characteristic.
You also may see or hear certain words used to describe the stage of the breast cancer:
- Local: The cancer is confined within the breast.
- Regional: The lymph nodes, primarily those in the armpit, are involved.
- Distant: The cancer is found in other parts of the body as well.
Sometimes doctors use the term “locally advanced” or “regionally advanced” to refer to large tumors that involve the breast skin, underlying chest structures, changes to the breast's shape, and lymph node enlargement that is visible or that your doctor can feel during an exam.
- Stage 0: Stage 0 is used to describe non-invasive breast cancers. In stage 0, there is no evidence of cancer cells or non-cancerous abnormal cells breaking out of the part of the breast in which they started, or getting through to or invading neighboring normal tissue.
- Stage 1: Stage I describes invasive breast cancer (cancer cells are breaking through to or invading normal surrounding breast tissue). Stage I is divided into subcategories known as IA and IB.
- Stage 2: Stage II is divided into subcategories known as IIA and IIB. In general, stage IIA describes invasive breast cancer in which:
> no tumor can be found in the breast, but cancer (larger than 2 millimeters [mm]) is found in 1 to 3 axillary lymph nodes (the lymph nodes under the arm) or in the lymph nodes near the breastbone (found during a sentinel node biopsy), or
> the tumor measures 2 centimeters (cm) or smaller and has spread to the axillary lymph nodes, or
> the tumor is larger than 2 cm but not larger than 5 cm and has not spread to the axillary lymph nodes.
Still, if the cancer tumor measures between 2 and 5 cm and:
> has not spread to the lymph nodes or parts of the body away from the breast, and
> is HER2-negative,
it will likely be classified as stage I. Similarly, if the cancer tumor measures between 2 and 5 cm and:
> has not spread to the lymph nodes,
> is HER2-negative,
> is estrogen-receptor-positive,
> is progesterone-receptor-negative,
> has an Oncotype DX Recurrence Score of 9,
it will likely be classified as stage IA.
In general, stage IIB describes invasive breast cancer in which:
> the tumor is larger than 2 cm but no larger than 5 cm; small groups of breast cancer cells (larger than 0.2 mm but not larger than 2 mm) are found in the lymph nodes, or
> the tumor is larger than 2 cm but no larger than 5 cm; cancer has spread to 1 to 3 axillary lymph nodes or to lymph nodes near the breastbone (found during a sentinel node biopsy), or
> the tumor is larger than 5 cm but has not spread to the axillary lymph nodes.
- Stage 3: Stage III is divided into subcategories known as IIIA, IIIB, and IIIC.
In general, stage IIIA describes invasive breast cancer in which either:
> no tumor is found in the breast or the tumor may be any size; cancer is found in 4 to 9 axillary lymph nodes or in the lymph nodes near the breastbone (found during imaging tests or a physical exam), or
> the tumor is larger than 5 centimeters (cm); small groups of breast cancer cells (larger than 0.2 millimeter [mm] but not larger than 2 mm) are found in the lymph nodes, or
> the tumor is larger than 5 cm; cancer has spread to 1 to 3 axillary lymph nodes or to the lymph nodes near the breastbone (found during a sentinel lymph node biopsy).
In general, stage IIIB describes invasive breast cancer in which:
> the tumor may be any size and has spread to the chest wall and/or skin of the breast and caused swelling or an ulcer, and
> may have spread to up to 9 axillary lymph nodes, or
> may have spread to lymph nodes near the breastbone.
In general, stage IIIC describes invasive breast cancer in which:
> there may be no sign of cancer in the breast or, if there is a tumor, it may be any size and may have spread to the chest wall and/or the skin of the breast, and
> the cancer has spread to 10 or more axillary lymph nodes, or
> the cancer has spread to lymph nodes above or below the collarbone, or
> the cancer has spread to axillary lymph nodes or to lymph nodes near the breastbone.
- Stage 4: Stage IV describes invasive breast cancer that has spread beyond the breast and nearby lymph nodes to other organs of the body, such as the lungs, distant lymph nodes, skin, bones, liver, or brain. You may hear the words “advanced” and “metastatic” used to describe stage IV breast cancer. Cancer may be stage IV at first diagnosis, called “de novo” by doctors, or it can be a recurrence of a previous breast cancer that has spread to other parts of the body.
Doctors cannot manually screen each and every potential breast cancer patient. This is where the Machine Learning Engineer / Data Scientist comes into the picture, bringing mathematical knowledge and computational power to support diagnosis. So let's start.
Follow the "Breast Cancer Detection Using Machine Learning Classifier End to End Project" step by step to get 3 bonuses: 1. the raw dataset, 2. a ready-to-use cleaned dataset for the ML project, and 3. the full project as a Jupyter Notebook file.
Goal of the ML project
We have extracted cell features of breast cancer patients and healthy people. As a Machine Learning Engineer / Data Scientist, the task is to create an ML model that classifies tumors as malignant or benign. To complete this ML project we use supervised machine learning classification algorithms.
Import essential libraries
import pandas as pd               # for data manipulation and analysis
import numpy as np                # for numeric calculation
import matplotlib.pyplot as plt   # for data visualization
import seaborn as sns             # for data visualization
Load breast cancer dataset & explore
We load the breast cancer data using scikit-learn's load_breast_cancer function.
# Load breast cancer dataset
from sklearn.datasets import load_breast_cancer
cancer_dataset = load_breast_cancer()
type(cancer_dataset)
Output >>> sklearn.utils.Bunch
scikit-learn stores the data in a Bunch object, which behaves like a dictionary.
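Because a Bunch behaves like a dictionary, every entry can be read either with key lookup or with attribute access; a quick check:

# dictionary-style and attribute-style access return the same array
print(np.array_equal(cancer_dataset['data'], cancer_dataset.data))  # True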
# keys in dataset
cancer_dataset.keys()
Output >>> dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

# features of each cell in numeric format
cancer_dataset['data']
Output >>>
array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01, 1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01, 8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01, 8.758e-02],
       ...,
       [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01, 7.820e-02],
       [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01, 1.240e-01],
       [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01, 7.039e-02]])
These numeric values are extracted features of each cell.
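We can also confirm the dimensions of this feature matrix, which (per the dataset description below) holds 569 samples with 30 features each:

# shape of the feature matrix: (samples, features)
print(cancer_dataset['data'].shape)
Output >>> (569, 30)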
# malignant or benign value
cancer_dataset['target']
The target stores the values of malignant or benign tumors.
# target value names: malignant or benign tumor
cancer_dataset['target_names']
Output >>> array(['malignant', 'benign'], dtype='<U9')
- 0 means malignant tumor
- 1 means benign tumor
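We can count how many samples fall into each class; the dataset description below lists 212 malignant and 357 benign cases, so a quick tally should match:

# count samples per class (index 0 = malignant, index 1 = benign)
print(np.bincount(cancer_dataset['target']))
Output >>> [212 357]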
cancer_dataset['DESCR'] stores the description of the breast cancer dataset.
# description of data
print(cancer_dataset['DESCR'])
Output >>>
.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features. For instance, field 3 is Mean Radius,
        field 13 is Radius SE, field 23 is Worst Radius.

        - class:
                - WDBC-Malignant
                - WDBC-Benign

    :Summary Statistics:

    ===================================== ====== ======
                                           Min    Max
    ===================================== ====== ======
    radius (mean):                        6.981  28.11
    texture (mean):                       9.71   39.28
    perimeter (mean):                     43.79  188.5
    area (mean):                          143.5  2501.0
    smoothness (mean):                    0.053  0.163
    compactness (mean):                   0.019  0.345
    concavity (mean):                     0.0    0.427
    concave points (mean):                0.0    0.201
    symmetry (mean):                      0.106  0.304
    fractal dimension (mean):             0.05   0.097
    radius (standard error):              0.112  2.873
    texture (standard error):             0.36   4.885
    perimeter (standard error):           0.757  21.98
    area (standard error):                6.802  542.2
    smoothness (standard error):          0.002  0.031
    compactness (standard error):         0.002  0.135
    concavity (standard error):           0.0    0.396
    concave points (standard error):      0.0    0.053
    symmetry (standard error):            0.008  0.079
    fractal dimension (standard error):   0.001  0.03
    radius (worst):                       7.93   36.04
    texture (worst):                      12.02  49.54
    perimeter (worst):                    50.41  251.2
    area (worst):                         185.2  4254.0
    smoothness (worst):                   0.071  0.223
    compactness (worst):                  0.027  1.058
    concavity (worst):                    0.0    1.252
    concave points (worst):               0.0    0.291
    symmetry (worst):                     0.156  0.664
    fractal dimension (worst):            0.055  0.208
    ===================================== ====== ======

    :Missing Attribute Values: None

    :Class Distribution: 212 - Malignant, 357 - Benign

    :Creator: Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian

    :Donor: Nick Street

    :Date: November, 1995

This is a copy of UCI ML Breast Cancer Wisconsin (Diagnostic) datasets.
https://goo.gl/U2Uwz2

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

Separating plane described above was obtained using Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree Construction Via Linear Programming." Proceedings of the 4th Midwest Artificial Intelligence and Cognitive Science Society, pp. 97-101, 1992], a classification method which uses linear programming to construct a decision tree. Relevant features were selected using an exhaustive search in the space of 1-4 features and 1-3 separating planes.

The actual linear program used to obtain the separating plane in the 3-dimensional space is that described in: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server:
ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/

.. topic:: References

   - W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on Electronic Imaging: Science and Technology, volume 1905, pages 861-870, San Jose, CA, 1993.
   - O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and prognosis via linear programming. Operations Research, 43(4), pages 570-577, July-August 1995.
   - W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994) 163-171.
The feature names of the dataset:
# name of features
print(cancer_dataset['feature_names'])
Output >>>
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
load_breast_cancer() does not download anything; the breast_cancer.csv file ships with scikit-learn, and the 'filename' key shows its location on disk.
# location/path of data file
print(cancer_dataset['filename'])
Output >>> C:\ProgramData\Anaconda3\lib\site-packages\sklearn\datasets\data\breast_cancer.csv
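As an aside, newer scikit-learn versions (0.23 and later) can return the data as a pandas DataFrame directly via the as_frame argument; a short sketch, assuming such a version is installed:

# newer scikit-learn can build the DataFrame for us
cancer_bunch = load_breast_cancer(as_frame = True)
print(cancer_bunch.frame.shape)  # (569, 31): 30 features plus the 'target' column

Below, we construct the same DataFrame manually, which makes the structure explicit.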
Create DataFrame
Now we create a DataFrame by concatenating 'data' and 'target' and assigning the feature names plus 'target' as column names.
# create dataframe
cancer_df = pd.DataFrame(np.c_[cancer_dataset['data'], cancer_dataset['target']],
                         columns = np.append(cancer_dataset['feature_names'], ['target']))
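If you want the combined data as a CSV file, you can export the DataFrame yourself; a small sketch (the file name here is just an example):

# save the DataFrame to disk as a CSV file (file name is arbitrary)
cancer_df.to_csv('breast_cancer_dataframe.csv', index = False)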
Head of cancer DataFrame
# Head of cancer DataFrame
cancer_df.head(6)
The tail of cancer DataFrame
# Tail of cancer DataFrame
cancer_df.tail(6)
Getting information about the cancer DataFrame using the .info() method.
# Information of cancer DataFrame
cancer_df.info()
Output >>>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
mean radius                569 non-null float64
mean texture               569 non-null float64
mean perimeter             569 non-null float64
mean area                  569 non-null float64
mean smoothness            569 non-null float64
mean compactness           569 non-null float64
mean concavity             569 non-null float64
mean concave points        569 non-null float64
mean symmetry              569 non-null float64
mean fractal dimension     569 non-null float64
radius error               569 non-null float64
texture error              569 non-null float64
perimeter error            569 non-null float64
area error                 569 non-null float64
smoothness error           569 non-null float64
compactness error          569 non-null float64
concavity error            569 non-null float64
concave points error       569 non-null float64
symmetry error             569 non-null float64
fractal dimension error    569 non-null float64
worst radius               569 non-null float64
worst texture              569 non-null float64
worst perimeter            569 non-null float64
worst area                 569 non-null float64
worst smoothness           569 non-null float64
worst compactness          569 non-null float64
worst concavity            569 non-null float64
worst concave points       569 non-null float64
worst symmetry             569 non-null float64
worst fractal dimension    569 non-null float64
target                     569 non-null float64
dtypes: float64(31)
memory usage: 137.9 KB
We have information on 569 patients across 31 columns with no missing values. All columns are of type float64, and the DataFrame uses 137.9 KB of memory.
Numerical distribution of the data: .describe() gives the mean, standard deviation, min, max, and 25%, 50%, and 75% quantile values of each feature.
# Numerical distribution of data
cancer_df.describe()
We have a clean and well-formatted DataFrame, so it is ready to visualize.
Data Visualization
Pair plot of breast cancer data
A pair plot shows pairwise scatter plots of the numeric features, so we can see how the two target classes are distributed.
# Pair plot of cancer DataFrame
sns.pairplot(cancer_df, hue = 'target')
Pair plot of sample feature of DataFrame
# pair plot of sample features
sns.pairplot(cancer_df, hue = 'target',
             vars = ['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness'])
The pair plot shows the malignant and benign tumor data distributed in two classes, which are easy to tell apart in most feature pairs.
Count plot
A count plot shows the total number of malignant and benign tumor patients.
# Count the target class
sns.countplot(x = cancer_df['target'])
In the count plot of 'mean radius' below, almost every value occurs only once or twice, since the feature is continuous.
# count plot of feature mean radius
plt.figure(figsize = (20,8))
sns.countplot(x = cancer_df['mean radius'])
Heatmap of breast cancer DataFrame
In the heatmap below we can see the range of each feature's values. The values of 'mean area' and 'worst area' are much larger than the others, while 'mean perimeter', 'area error', and 'worst perimeter' are slightly smaller but still larger than the remaining features.
# heatmap of DataFrame
plt.figure(figsize=(16,9))
sns.heatmap(cancer_df)
Heatmap of a correlation matrix
To see the correlation between each feature and the target, we visualize a heatmap of the correlation matrix.
# Heatmap of Correlation matrix of breast cancer DataFrame
plt.figure(figsize=(20,20))
sns.heatmap(cancer_df.corr(), annot = True, cmap = 'coolwarm', linewidths = 2)
Correlation barplot
We take the correlation of each feature with the target and visualize it as a barplot.
# create second DataFrame by dropping target
cancer_df2 = cancer_df.drop(['target'], axis = 1)
print("The shape of 'cancer_df2' is : ", cancer_df2.shape)
Output >>> The shape of 'cancer_df2' is : (569, 30)
# visualize correlation barplot
plt.figure(figsize = (16,5))
ax = sns.barplot(x = cancer_df2.corrwith(cancer_df.target).index,
                 y = cancer_df2.corrwith(cancer_df.target))
ax.tick_params(labelrotation = 90)
In the correlation barplot, only the feature 'smoothness error' shows a clearly positive correlation with the target. 'Mean fractal dimension', 'texture error', and 'symmetry error' are weakly positively correlated, and the remaining features are strongly negatively correlated with the target.
Data Preprocessing
Split DataFrame in train and test
# input variable
X = cancer_df.drop(['target'], axis = 1)
X.head(6)
# output variable
y = cancer_df['target']
y.head(6)
Output >>>
0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
5    0.0
Name: target, dtype: float64
Train-Test Split
# split dataset into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state= 5)
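Because the classes are somewhat imbalanced (212 malignant vs. 357 benign), an optional refinement is to stratify the split so both sets keep the same class ratio. Note that the accuracies reported below were obtained with the unstratified split above; this is just a sketch:

# optional: stratified split keeps the malignant/benign ratio in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,
                                                    random_state = 5, stratify = y)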
Feature Scaling
Feature scaling converts data with different units and magnitudes onto one common scale.
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)
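StandardScaler standardizes each column as z = (x - mean) / std, so the scaled training features should have mean of roughly 0 and standard deviation of roughly 1; a quick sanity check:

# each scaled training column should have mean ~0 and std ~1
print(X_train_sc.mean(axis = 0).round(6))
print(X_train_sc.std(axis = 0).round(6))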
Breast Cancer Detection Machine Learning Model Building
We now have clean data to build the ML model, but we still have to find which machine learning algorithm is best for our data. The output is categorical, so we will use supervised classification algorithms.
To build the best model, we have to train and test the dataset with multiple Machine Learning algorithms then we can find the best ML model. So let’s try.
First, we need to import the required packages.
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
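Since every model below follows the same fit-predict-score pattern, the comparison can also be written as a loop; a minimal sketch (the model list here is illustrative, not the full set trained below):

# illustrative helper: score a few models on raw and scaled data
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

models = {
    'SVC': SVC(),
    'Logistic Regression': LogisticRegression(max_iter = 5000),
    'KNN': KNeighborsClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    acc_raw = accuracy_score(y_test, model.predict(X_test))
    model.fit(X_train_sc, y_train)   # refit on scaled data (fit resets the model)
    acc_sc = accuracy_score(y_test, model.predict(X_test_sc))
    print(f"{name}: raw accuracy = {acc_raw:.4f}, scaled accuracy = {acc_sc:.4f}")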
Support Vector Classifier
# Support vector classifier
from sklearn.svm import SVC
svc_classifier = SVC()
svc_classifier.fit(X_train, y_train)
y_pred_svc = svc_classifier.predict(X_test)
accuracy_score(y_test, y_pred_svc)
Output >>> 0.5789473684210527
# Train with Standard scaled Data
svc_classifier2 = SVC()
svc_classifier2.fit(X_train_sc, y_train)
y_pred_svc_sc = svc_classifier2.predict(X_test_sc)
accuracy_score(y_test, y_pred_svc_sc)
Output >>> 0.9649122807017544
Logistic Regression
# Logistic Regression
from sklearn.linear_model import LogisticRegression
# note: newer scikit-learn versions need solver='liblinear' (or 'saga') for the L1 penalty
lr_classifier = LogisticRegression(random_state = 51, penalty = 'l1', solver = 'liblinear')
lr_classifier.fit(X_train, y_train)
y_pred_lr = lr_classifier.predict(X_test)
accuracy_score(y_test, y_pred_lr)
Output >>> 0.9736842105263158
# Train with Standard scaled Data
lr_classifier2 = LogisticRegression(random_state = 51, penalty = 'l1', solver = 'liblinear')
lr_classifier2.fit(X_train_sc, y_train)
y_pred_lr_sc = lr_classifier2.predict(X_test_sc)  # fixed: predict with the model trained on scaled data
accuracy_score(y_test, y_pred_lr_sc)
Output >>> 0.5526315789473685
K – Nearest Neighbor Classifier
# K-Nearest Neighbor Classifier
from sklearn.neighbors import KNeighborsClassifier
knn_classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
knn_classifier.fit(X_train, y_train)
y_pred_knn = knn_classifier.predict(X_test)
accuracy_score(y_test, y_pred_knn)
Output >>> 0.9385964912280702
# Train with Standard scaled Data
knn_classifier2 = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
knn_classifier2.fit(X_train_sc, y_train)
y_pred_knn_sc = knn_classifier2.predict(X_test_sc)  # fixed: predict with the model trained on scaled data
accuracy_score(y_test, y_pred_knn_sc)
Output >>> 0.5789473684210527
Naive Bayes Classifier
# Naive Bayes Classifier
from sklearn.naive_bayes import GaussianNB
nb_classifier = GaussianNB()
nb_classifier.fit(X_train, y_train)
y_pred_nb = nb_classifier.predict(X_test)
accuracy_score(y_test, y_pred_nb)
Output >>> 0.9473684210526315
# Train with Standard scaled Data
nb_classifier2 = GaussianNB()
nb_classifier2.fit(X_train_sc, y_train)
y_pred_nb_sc = nb_classifier2.predict(X_test_sc)
accuracy_score(y_test, y_pred_nb_sc)
Output >>> 0.9385964912280702
Decision Tree Classifier
# Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier
dt_classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 51)
dt_classifier.fit(X_train, y_train)
y_pred_dt = dt_classifier.predict(X_test)
accuracy_score(y_test, y_pred_dt)
Output >>> 0.9473684210526315
# Train with Standard scaled Data
dt_classifier2 = DecisionTreeClassifier(criterion = 'entropy', random_state = 51)
dt_classifier2.fit(X_train_sc, y_train)
y_pred_dt_sc = dt_classifier2.predict(X_test_sc)  # fixed: predict with the model trained on scaled data
accuracy_score(y_test, y_pred_dt_sc)
Output >>> 0.7543859649122807
Random Forest Classifier
# Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
rf_classifier = RandomForestClassifier(n_estimators = 20, criterion = 'entropy', random_state = 51)
rf_classifier.fit(X_train, y_train)
y_pred_rf = rf_classifier.predict(X_test)
accuracy_score(y_test, y_pred_rf)
Output >>> 0.9736842105263158
# Train with Standard scaled Data
rf_classifier2 = RandomForestClassifier(n_estimators = 20, criterion = 'entropy', random_state = 51)
rf_classifier2.fit(X_train_sc, y_train)
y_pred_rf_sc = rf_classifier2.predict(X_test_sc)  # fixed: predict with the model trained on scaled data
accuracy_score(y_test, y_pred_rf_sc)
Output >>> 0.7543859649122807
Adaboost Classifier
# Adaboost Classifier
from sklearn.ensemble import AdaBoostClassifier
adb_classifier = AdaBoostClassifier(DecisionTreeClassifier(criterion = 'entropy', random_state = 200),
                                    n_estimators = 2000,
                                    learning_rate = 0.1,
                                    algorithm = 'SAMME.R',
                                    random_state = 1)
adb_classifier.fit(X_train, y_train)
y_pred_adb = adb_classifier.predict(X_test)
accuracy_score(y_test, y_pred_adb)
Output >>> 0.9473684210526315
# Train with Standard scaled Data
adb_classifier2 = AdaBoostClassifier(DecisionTreeClassifier(criterion = 'entropy', random_state = 200),
                                     n_estimators = 2000,
                                     learning_rate = 0.1,
                                     algorithm = 'SAMME.R',
                                     random_state = 1)
adb_classifier2.fit(X_train_sc, y_train)
y_pred_adb_sc = adb_classifier2.predict(X_test_sc)
accuracy_score(y_test, y_pred_adb_sc)
Output >>> 0.9473684210526315
XGBoost Classifier
# XGBoost Classifier
from xgboost import XGBClassifier
xgb_classifier = XGBClassifier()
xgb_classifier.fit(X_train, y_train)
y_pred_xgb = xgb_classifier.predict(X_test)
accuracy_score(y_test, y_pred_xgb)
Output >>> 0.9824561403508771
# Train with Standard scaled Data
xgb_classifier2 = XGBClassifier()
xgb_classifier2.fit(X_train_sc, y_train)
y_pred_xgb_sc = xgb_classifier2.predict(X_test_sc)
accuracy_score(y_test, y_pred_xgb_sc)
Output >>> 0.9824561403508771
XGBoost Parameter Tuning with Randomized Search
# XGBoost classifier's most important parameters
params = {
    "learning_rate"    : [0.05, 0.10, 0.15, 0.20, 0.25, 0.30],
    "max_depth"        : [3, 4, 5, 6, 8, 10, 12, 15],
    "min_child_weight" : [1, 3, 5, 7],
    "gamma"            : [0.0, 0.1, 0.2, 0.3, 0.4],
    "colsample_bytree" : [0.3, 0.4, 0.5, 0.7]
}
# Randomized Search
from sklearn.model_selection import RandomizedSearchCV
random_search = RandomizedSearchCV(xgb_classifier, param_distributions = params,
                                   scoring = 'roc_auc', n_jobs = -1, verbose = 3)
random_search.fit(X_train, y_train)
Output >>>
RandomizedSearchCV(cv='warn', error_score='raise-deprecating',
                   estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                           colsample_bylevel=1, colsample_bynode=1,
                                           colsample_bytree=1, gamma=0,
                                           learning_rate=0.1, max_delta_step=0,
                                           max_depth=3, min_child_weight=1,
                                           missing=None, n_estimators=100,
                                           n_jobs=1, nthread=None,
                                           objective='binary:logistic',
                                           random_state=0, reg_alpha=0,
                                           reg_lambda=1, scale_pos_weight=1,
                                           seed=None, silent=None, subsample=1,
                                           verbosity=1),
                   fit_params=None, iid='warn', n_iter=10, n_jobs=-1,
                   param_distributions={'learning_rate': [0.05, 0.1, 0.15, 0.2, 0.25, 0.3],
                                        'max_depth': [3, 4, 5, 6, 8, 10, 12, 15],
                                        'min_child_weight': [1, 3, 5, 7],
                                        'gamma': [0.0, 0.1, 0.2, 0.3, 0.4],
                                        'colsample_bytree': [0.3, 0.4, 0.5, 0.7]},
                   pre_dispatch='2*n_jobs', random_state=None, refit=True,
                   return_train_score='warn', scoring='roc_auc', verbose=3)
random_search.best_params_
Output >>>
{'min_child_weight': 1,
 'max_depth': 3,
 'learning_rate': 0.3,
 'gamma': 0.4,
 'colsample_bytree': 0.3}
random_search.best_estimator_
Output >>>
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.3, gamma=0.4,
              learning_rate=0.3, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)
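Rather than copying hyperparameter values by hand, the fitted search object can also hand back the tuned model directly for scoring on the held-out test set; a short sketch:

# evaluate the tuned model found by the randomized search
best_xgb = random_search.best_estimator_
print('Test accuracy of tuned XGBoost =', accuracy_score(y_test, best_xgb.predict(X_test)))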
# training XGBoost classifier with tuned parameters
# (note: these values differ slightly from random_search.best_params_ shown above)
xgb_classifier_pt = XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
                                  colsample_bynode=1, colsample_bytree=0.4, gamma=0.2,
                                  learning_rate=0.1, max_delta_step=0, max_depth=15,
                                  min_child_weight=1, missing=None, n_estimators=100,
                                  n_jobs=1, nthread=None, objective='binary:logistic',
                                  random_state=0, reg_alpha=0, reg_lambda=1,
                                  scale_pos_weight=1, seed=None, silent=None,
                                  subsample=1, verbosity=1)
xgb_classifier_pt.fit(X_train, y_train)
y_pred_xgb_pt = xgb_classifier_pt.predict(X_test)
accuracy_score(y_test, y_pred_xgb_pt)
Output >>> 0.9824561403508771
Confusion Matrix
cm = confusion_matrix(y_test, y_pred_xgb_pt)
plt.title('Heatmap of Confusion Matrix', fontsize = 15)
sns.heatmap(cm, annot = True)
plt.show()
Only two of the 114 test samples are misclassified: two malignant tumors are predicted as benign, while every benign tumor is classified correctly (treating benign as the positive class, the type II error rate is 0%).
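Sensitivity and specificity can be read straight off this 2x2 confusion matrix; a short sketch, treating benign (target = 1) as the positive class:

# derive per-class rates from the confusion matrix
# cm rows are the true class (0, 1); columns are the predicted class (0, 1)
tn, fp, fn, tp = cm.ravel()
print('accuracy    =', (tp + tn) / (tp + tn + fp + fn))  # 0.9824
print('sensitivity =', tp / (tp + fn))   # recall of the benign class: 1.00
print('specificity =', tn / (tn + fp))   # recall of the malignant class: ~0.96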
Classification Report of Model
print(classification_report(y_test, y_pred_xgb_pt))
Output >>>
              precision    recall  f1-score   support

         0.0       1.00      0.96      0.98        48
         1.0       0.97      1.00      0.99        66

   micro avg       0.98      0.98      0.98       114
   macro avg       0.99      0.98      0.98       114
weighted avg       0.98      0.98      0.98       114
Cross-validation of the ML model
To check whether the ML model is overfitted, underfitted, or generalizes well, we perform cross-validation.
# Cross validation
from sklearn.model_selection import cross_val_score
cross_validation = cross_val_score(estimator = xgb_classifier_pt, X = X_train_sc, y = y_train, cv = 10)
print("Cross validation accuracy of XGBoost model = ", cross_validation)
print("\nCross validation mean accuracy of XGBoost model = ", cross_validation.mean())
Output >>>
Cross validation accuracy of XGBoost model = [0.9787234 0.97826087 0.97826087 0.97826087 0.93333333 0.91111111 1. 1. 0.97777778 0.88888889]
Cross validation mean accuracy of XGBoost model = 0.9624617124062083
The mean cross-validation accuracy is 96.24%, while the XGBoost model's test accuracy is 98.24%. This suggests the XGBoost model is slightly overfitted, but with more training data it should generalize better.
Save the Machine Learning model
After completing a machine learning project, the ML model needs to be deployed in an application. To deploy the model, we first have to save it. We can save a machine learning model with the pickle or joblib package.
Here we will use pickle; use whichever works better for you (a joblib sketch follows below).
## Pickle
import pickle

# save model
pickle.dump(xgb_classifier_pt, open('breast_cancer_detector.pickle', 'wb'))

# load model
breast_cancer_detector_model = pickle.load(open('breast_cancer_detector.pickle', 'rb'))

# predict the output
y_pred = breast_cancer_detector_model.predict(X_test)

# confusion matrix
print('Confusion matrix of XGBoost model: \n', confusion_matrix(y_test, y_pred), '\n')

# show the accuracy
print('Accuracy of XGBoost model = ', accuracy_score(y_test, y_pred))
Output >>>
Confusion matrix of XGBoost model:
 [[46  2]
 [ 0 66]]
Accuracy of XGBoost model = 0.9824561403508771
Note: When we dump the model, the model file is stored on disk in the project directory, but we can change the location by passing a different path.
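For completeness, here is the same save/load cycle with joblib, which is often preferred for models holding large NumPy arrays; a sketch (the file name here is just an example):

## Joblib (alternative to pickle)
import joblib

# save model
joblib.dump(xgb_classifier_pt, 'breast_cancer_detector.joblib')

# load model and verify it still predicts the same
loaded_model = joblib.load('breast_cancer_detector.joblib')
print('Accuracy of reloaded model = ', accuracy_score(y_test, loaded_model.predict(X_test)))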
We have completed the machine learning project successfully with 98.24% accuracy, which is great for a 'Breast Cancer Detection using Machine Learning' project. Now we are ready to deploy our ML model in a healthcare project.
Conclusion
To get the best accuracy, we trained many supervised classification algorithms, but you can also try out just a few of the most popular ones. After training them all, we found that the Logistic Regression, Random Forest, and XGBoost classifiers gave higher accuracy than the rest, and we chose XGBoost.
As ML Engineers, we should retrain the deployed model periodically to sustain its accuracy. We hope our efforts will help save the lives of breast cancer patients.
Please share your feedback and questions about this ML project, so we can keep improving it.
I hope you enjoyed this end-to-end machine learning project. Thank you! :-)
Contributors:
Muhammad Junaid Iqbal - 4th year BS Computer Science Student at University of Lahore, Lahore
Zahra Iqbal - 4th year MBBS Student at Islamabad Medical & Dental College (IMDC), SZABMU, Islamabad