A Beginner's Guide to Machine Learning: Predicting Breast Cancer with Python

Introduction

In the realm of healthcare, early and accurate diagnosis is crucial for effective treatment. This is particularly true for conditions like breast cancer. In this article, we explore how machine learning, specifically logistic regression, can be employed to aid in the diagnosis of breast cancer, using a real dataset.

The Dataset

We utilize a dataset containing features computed from digitized images of fine needle aspirate (FNA) of breast masses. The dataset includes characteristics like texture, perimeter, and area of the cell nuclei, and labels each instance as malignant (M) or benign (B).

Preparing the Environment

Our analysis begins with importing essential Python libraries:

  • pandas for data handling
  • matplotlib.pyplot and seaborn for data visualization
  • sklearn for machine learning tasks including data splitting, preprocessing, modeling, and evaluation

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler

Data Exploration and Preprocessing

We load the dataset using pandas and perform initial data exploration with data.head(). The dataset includes an 'id' column and an 'Unnamed: 32' column, which we drop as they are not relevant to our analysis. We then map the 'diagnosis' column to binary values, where 'M' (malignant) is 1 and 'B' (benign) is 0.

# Load the dataset and preview the first few rows
file_path = './breast_cancer_data.csv'
data = pd.read_csv(file_path)
data.head()

# Drop the non-informative columns and encode the diagnosis: malignant (M) -> 1, benign (B) -> 0
data = data.drop(['id', 'Unnamed: 32'], axis=1)
data['diagnosis'] = data['diagnosis'].map({'M': 1, 'B': 0})
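Before modeling, it is also worth checking how the two classes are balanced. Below is a small optional check, not part of the original code, that uses only the pandas functionality we have already imported:

# Optional sanity check (assumes `data` from above): how many benign (0) vs. malignant (1) cases?
print(data['diagnosis'].value_counts())
# A moderate imbalance between the two classes is typical for this kind of dataset.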
        

Feature Selection and Data Splitting

We separate the features (X) from the target variable (y, diagnosis). Using train_test_split from sklearn, we divide our data into training and testing sets, ensuring our model can be validated on unseen data.

# Separate the features from the target, then hold out 20% of the data for testing
X = data.drop('diagnosis', axis=1)
y = data['diagnosis']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        

Data Normalization

Using StandardScaler, we standardize our features to have a mean of 0 and a standard deviation of 1. Bringing all features onto a common scale is important for logistic regression to perform optimally.

# Fit the scaler on the training data only, then apply the same transformation to the test data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
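As a quick check (not part of the original code), we can confirm that the scaled training features now have a mean near 0 and a standard deviation near 1:

import numpy as np

# After StandardScaler, each feature column should have mean ~0 and std ~1
print(np.round(X_train.mean(axis=0), 3))
print(np.round(X_train.std(axis=0), 3))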
        

Model Training

We choose Logistic Regression for its effectiveness in binary classification tasks. After training our model on the training set, we use it to make predictions on the test set.

# Train a logistic regression classifier on the standardized training data
model = LogisticRegression()
model.fit(X_train, y_train)
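Because logistic regression is a probabilistic classifier, it can also report how confident it is in each prediction. The following optional snippet, not part of the original analysis, shows the estimated probability of malignancy for the first few test samples:

# Probability estimates: column 0 = P(benign), column 1 = P(malignant)
probabilities = model.predict_proba(X_test)
print(probabilities[:5, 1])  # estimated probability of malignancy for the first 5 test cases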
        

Evaluation

We evaluate our model's performance using metrics like accuracy, confusion matrix, and classification report. These metrics provide insights into the model's ability to correctly classify the instances as malignant or benign.

  • Accuracy gives us the overall correctness of the model.
  • The confusion matrix shows the breakdown of predictions versus actual labels.
  • The classification report provides precision, recall, and F1-score for each class.

# Predict on the test set and compute the evaluation metrics
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)
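Printing these results with something along these lines produces the output shown further below:

# Display the evaluation results
print('Accuracy:', accuracy)
print('Confusion Matrix:\n', conf_matrix)
print('Classification Report:\n', report)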
        

Visualizing the Results

Using matplotlib and seaborn, we create a confusion matrix heatmap. This visualization helps in understanding the true positives, false positives, true negatives, and false negatives in a more intuitive manner.

plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=['Benign', 'Malignant'], yticklabels=['Benign', 'Malignant'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
        
Accuracy: 0.9736842105263158
Confusion Matrix:
 [[70  1]
 [ 2 41]]
Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.99      0.98        71
           1       0.98      0.95      0.96        43

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114
        

Confusion Matrix Heatmap Explanation

A confusion matrix is a table used to evaluate the performance of a classification model. Each cell in the matrix represents the count of true and predicted label combinations. The heatmap here is a visual representation of the confusion matrix with color intensities corresponding to the cell values.

  • Top-Left Cell (True Negative): 70 benign cases were correctly predicted as benign.
  • Top-Right Cell (False Positive): 1 benign case was incorrectly predicted as malignant.
  • Bottom-Left Cell (False Negative): 2 malignant cases were incorrectly predicted as benign.
  • Bottom-Right Cell (True Positive): 41 malignant cases were correctly predicted as malignant.

The labels 'Benign' and 'Malignant' along the axes represent the predicted labels (horizontal axis) and the true labels (vertical axis).
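If you prefer to read these four counts programmatically rather than off the heatmap, the 2x2 matrix can be unpacked directly; this is a small optional addition to the code above:

# Unpack the 2x2 confusion matrix: rows are true labels, columns are predicted labels
tn, fp, fn, tp = conf_matrix.ravel()
print(f'TN={tn}, FP={fp}, FN={fn}, TP={tp}')  # TN=70, FP=1, FN=2, TP=41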

Accuracy Score Explanation

The accuracy score of 0.9736842105263158 (approximately 97.37%) indicates that the logistic regression model correctly predicted the diagnosis for about 97.37% of the cases in the test dataset.
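This figure follows directly from the confusion matrix: it is the number of correct predictions divided by the total number of test samples.

# Accuracy = (correct predictions) / (all test samples)
(70 + 41) / (70 + 1 + 2 + 41)  # = 111 / 114 ≈ 0.9737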

Confusion Matrix Values Explanation

The confusion matrix is a summary of prediction results:

  • First Row (actual benign cases, 71 total): 70 were correctly predicted as benign (true negatives), and 1 was incorrectly predicted as malignant (false positive).
  • Second Row (actual malignant cases, 43 total): 2 were incorrectly predicted as benign (false negatives), and 41 were correctly predicted as malignant (true positives).

Classification Report Explanation

The classification report provides key metrics for evaluating the model's performance for each class:

  • Precision (0): Out of all the predicted benign cases, 97% were actually benign.
  • Recall (0): Out of all the actual benign cases, 99% were predicted as benign.
  • F1-Score (0): The harmonic mean of precision and recall for benign cases is 98%.
  • Precision (1): Out of all the predicted malignant cases, 98% were actually malignant.
  • Recall (1): Out of all the actual malignant cases, 95% were predicted as malignant.
  • F1-Score (1): The harmonic mean of precision and recall for malignant cases is 96%.
  • Accuracy: Same as the previously stated accuracy score.
  • Macro Avg: The average of precision, recall, and F1-score without considering class imbalance.
  • Weighted Avg: The average of precision, recall, and F1-score, weighted by the number of instances in each class.
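Each of these figures can be reproduced by hand from the confusion matrix. For the malignant class (1), for example:

# Hand-checking the malignant (1) row of the classification report from the confusion matrix
precision_1 = 41 / (41 + 1)   # TP / (TP + FP) ≈ 0.976 -> reported as 0.98
recall_1    = 41 / (41 + 2)   # TP / (TP + FN) ≈ 0.953 -> reported as 0.95
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)  # ≈ 0.965 -> reported as 0.96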

The model has demonstrated excellent performance in distinguishing between benign and malignant breast cancer cases, as evidenced by the high accuracy and other metrics.

Conclusion

This case study demonstrates the potential of machine learning in medical diagnostics. Logistic regression, a simple yet powerful algorithm, provides significant insights and aids in making informed decisions in breast cancer diagnosis. As machine learning continues to evolve, its applications in healthcare promise to enhance patient outcomes and revolutionize medical practices.


For a more in-depth explanation of the code and concepts discussed in this article, be sure to subscribe to my YouTube channel. You'll find detailed video tutorials that break down complex Python topics into understandable lessons.

Additionally, if you'd like to get your hands on the source code used in this analysis, visit my GitHub repository. Feel free to star the repository to show your support and keep up to date with my latest projects and contributions to the data science community.

Remember, your engagement and support inspire me to create and share more valuable content. So, subscribe, star, and join me on this exciting journey into the world of data science!

YouTube

GitHub
