A Beginner's Guide to Machine Learning: Predicting Breast Cancer with Python

Mends Albert

Software Engineer | MPhil. Computer Science Student | Solidity | Data Scientist | I post about AI, Data Science, Machine Learning and everything Tech.

Published: January 29, 2024

Introduction

In healthcare, early and accurate diagnosis is crucial for effective treatment, and this is particularly true for conditions like breast cancer. In this article, we explore how machine learning, specifically logistic regression, can help classify breast masses as malignant or benign using a real dataset.

The Dataset

We utilize a dataset containing features computed from digitized images of fine needle aspirate (FNA) of breast masses. The dataset includes characteristics like texture, perimeter, and area of the cell nuclei, and labels each instance as malignant (M) or benign (B).

Preparing the Environment

Our analysis begins with importing essential Python libraries:

pandas for data handling
matplotlib.pyplot and seaborn for data visualization
sklearn for machine learning tasks including data splitting, preprocessing, modeling, and evaluation

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

from sklearn.preprocessing import StandardScaler

Data Exploration and Preprocessing

We load the dataset using pandas and perform initial data exploration with data.head(). The dataset includes an 'id' column and an 'Unnamed: 32' column, which we drop as they are not relevant to our analysis. We then map the 'diagnosis' column to binary values, where 'M' (malignant) is 1 and 'B' (benign) is 0.

file_path = './breast_cancer_data.csv'
data = pd.read_csv(file_path)
data.head()

data = data.drop(['id', 'Unnamed: 32'], axis=1)
data['diagnosis'] = data['diagnosis'].map({'M': 1, 'B': 0})

Feature Selection and Data Splitting

We separate the features (X) from the target variable (y, diagnosis). Using train_test_split from sklearn, we divide our data into training and testing sets, ensuring our model can be validated on unseen data.

X = data.drop('diagnosis', axis=1)
y = data['diagnosis']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
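To make the idea of a held-out test set concrete, here is a minimal pure-Python sketch of an 80/20 split. It is not scikit-learn's implementation, so the rows it selects will differ from those chosen by train_test_split; the function name split_indices and the row count of 569 are assumptions made for illustration. Rounding the test share up gives 114 test rows, which is consistent with the test-set size reported later in this article.

```python
import math
import random


def split_indices(n_samples, test_size=0.2, seed=42):
    """Shuffle row indices and split them into train and test lists."""
    n_test = math.ceil(test_size * n_samples)  # test share is rounded up
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # seeded, so the split is repeatable
    return indices[n_test:], indices[:n_test]


# 569 rows is an assumption about the dataset size
train_idx, test_idx = split_indices(569)
print(len(train_idx), len(test_idx))  # 455 114
```

Fixing the seed plays the same role as random_state=42 above: rerunning the script yields the same split, so results are reproducible.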

Data Normalization

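Using StandardScaler, we standardize each feature to a mean of 0 and a standard deviation of 1. The scaler is fitted on the training set only and then applied to the test set, so no information from the test data leaks into training. Standardization matters here because scikit-learn's LogisticRegression applies regularization by default, which is sensitive to feature scale, and its solver converges more reliably on scaled inputs.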

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
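The arithmetic behind this step can be sketched for a single feature in plain Python. The helper names fit_scaler and transform and the sample values are hypothetical; the sketch uses the population standard deviation, which is what StandardScaler computes.

```python
from statistics import fmean, pstdev


def fit_scaler(values):
    """Learn the mean and population standard deviation of one feature."""
    return fmean(values), pstdev(values)


def transform(values, mean, std):
    """Apply z = (x - mean) / std using previously learned statistics."""
    return [(v - mean) / std for v in values]


train_feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up values
test_feature = [11.0, 5.0]                                 # made-up values

mean, std = fit_scaler(train_feature)              # statistics come from training data only
train_scaled = transform(train_feature, mean, std)
test_scaled = transform(test_feature, mean, std)   # reuse them; do not refit on test data

print(mean, std)     # 5.0 2.0
print(test_scaled)   # [3.0, 0.0]
```

Note that the scaled test values need not have mean 0 or standard deviation 1, because they are measured against the training statistics.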

Model Training

We choose Logistic Regression for its effectiveness in binary classification tasks. After training our model on the training set, we use it to make predictions on the test set.

model = LogisticRegression()
model.fit(X_train, y_train)
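For intuition about what a fitted logistic regression does at prediction time, the sketch below computes a weighted sum of the features, passes it through the sigmoid function, and thresholds the resulting probability at 0.5. The weights, bias, and two-feature inputs are hypothetical values chosen for illustration, not coefficients learned from this dataset.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features, threshold=0.5):
    """Return (probability of class 1, predicted label) for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    return probability, int(probability >= threshold)


# Hypothetical weights for two standardized features, for illustration only
weights, bias = [1.2, 0.8], -0.3
high = predict(weights, bias, [1.5, 0.5])    # above-average feature values
low = predict(weights, bias, [-1.0, -0.5])   # below-average feature values
print(high)
print(low)
```

With these made-up numbers the first sample is labelled 1 (malignant) and the second 0 (benign). Training is the process of finding the weights and bias that make such predictions match the training labels as closely as possible.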

Evaluation

We evaluate our model's performance using metrics like accuracy, confusion matrix, and classification report. These metrics provide insights into the model's ability to correctly classify the instances as malignant or benign.

Accuracy gives us the overall correctness of the model.
The confusion matrix shows the breakdown of predictions versus actual labels.
The classification report provides precision, recall, and F1-score for each class.

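y_pred = model.predict(X_test)  # predicted labels for the held-out test set
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)  # rows: true labels, columns: predictions
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", report)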
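To show what a confusion matrix counts, here is a small hand-rolled version in plain Python. It uses eight made-up labels rather than the article's data, and the function name confusion_counts is hypothetical; the row and column layout follows the same convention as scikit-learn's confusion_matrix (rows are true labels, columns are predictions).

```python
def confusion_counts(y_true, y_pred):
    """Count outcomes in a 2x2 layout: rows are true labels, columns are predictions."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


# Made-up labels for eight cases (1 = malignant, 0 = benign)
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 0, 1, 0]

matrix = confusion_counts(y_true_demo, y_pred_demo)
accuracy_demo = sum(t == p for t, p in zip(y_true_demo, y_pred_demo)) / len(y_true_demo)
print(matrix)         # [[4, 1], [1, 2]]
print(accuracy_demo)  # 0.75
```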

Visualizing the Results

Using matplotlib and seaborn, we create a confusion matrix heatmap. This visualization helps in understanding the true positives, false positives, true negatives, and false negatives in a more intuitive manner.

plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=['Benign', 'Malignant'], yticklabels=['Benign', 'Malignant'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

The evaluation results on the test set:

Accuracy: 0.9736842105263158
Confusion Matrix:
 [[70  1]
 [ 2 41]]
Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.99      0.98        71
           1       0.98      0.95      0.96        43

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114

Confusion Matrix Heatmap Explanation

A confusion matrix is a table used to evaluate the performance of a classification model. Each cell in the matrix represents the count of true and predicted label combinations. The heatmap here is a visual representation of the confusion matrix with color intensities corresponding to the cell values.

Top-Left Cell (True Negative): 70 benign cases were correctly predicted as benign.
Top-Right Cell (False Positive): 1 benign case was incorrectly predicted as malignant.
Bottom-Left Cell (False Negative): 2 malignant cases were incorrectly predicted as benign.
Bottom-Right Cell (True Positive): 41 malignant cases were correctly predicted as malignant.

The tick labels 'Benign' and 'Malignant' appear on both axes: along the horizontal axis they denote the predicted label, and along the vertical axis the true label.

Accuracy Score Explanation

The accuracy score of 0.9736842105263158 (approximately 97.37%) means the logistic regression model predicted the correct diagnosis for 111 of the 114 cases in the test set.
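The accuracy figure follows directly from the confusion matrix counts reported above, as this short worked calculation shows. The variable names are illustrative.

```python
conf_matrix_values = [[70, 1], [2, 41]]  # rows: true label, columns: predicted label

(tn, fp), (fn, tp) = conf_matrix_values
correct = tn + tp                 # 70 + 41 = 111 correct predictions
total = tn + fp + fn + tp         # 114 test cases
accuracy_from_counts = correct / total
print(round(accuracy_from_counts, 4))  # 0.9737
```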

Confusion Matrix Values Explanation

The confusion matrix is a summary of prediction results:

First Row: Represents actual benign cases (71 total). 70 were correctly predicted as benign (true negatives). 1 was incorrectly predicted as malignant (false positive).
Second Row: Represents actual malignant cases (43 total). 2 were incorrectly predicted as benign (false negatives). 41 were correctly predicted as malignant (true positives).

Classification Report Explanation

The classification report provides key metrics for evaluating the model's performance for each class:

Precision (0): Out of all the predicted benign cases, 97% were actually benign.
Recall (0): Out of all the actual benign cases, 99% were predicted as benign.
F1-Score (0): The harmonic mean of precision and recall for benign cases is 98%.
Precision (1): Out of all the predicted malignant cases, 98% were actually malignant.
Recall (1): Out of all the actual malignant cases, 95% were predicted as malignant.
F1-Score (1): The harmonic mean of precision and recall for malignant cases is 96%.
Accuracy: Same as the previously stated accuracy score.
Macro Avg: The unweighted mean of precision, recall, and F1-score across the two classes, so each class counts equally regardless of how many cases it has.
Weighted Avg: The average of precision, recall, and F1-score, weighted by the number of instances in each class.
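The per-class figures in the report can be reproduced from the four confusion matrix counts. The sketch below is a hand calculation in plain Python, not scikit-learn's implementation, and the helper name class_metrics is hypothetical. It shows the F1 averages only; precision and recall are averaged the same way.

```python
def class_metrics(tp, fp, fn):
    """Precision, recall, and F1 for one class from its own counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts from the confusion matrix reported in this article
tn, fp, fn, tp = 70, 1, 2, 41

# For class 0 (benign), "positive" means benign, so the roles of the cells swap
benign = class_metrics(tp=tn, fp=fn, fn=fp)
malignant = class_metrics(tp=tp, fp=fp, fn=fn)
support = {"benign": tn + fp, "malignant": fn + tp}  # 71 and 43 cases

macro_f1 = (benign[2] + malignant[2]) / 2
weighted_f1 = (
    benign[2] * support["benign"] + malignant[2] * support["malignant"]
) / sum(support.values())

print([round(m, 2) for m in benign])              # [0.97, 0.99, 0.98]
print([round(m, 2) for m in malignant])           # [0.98, 0.95, 0.96]
print(round(macro_f1, 2), round(weighted_f1, 2))  # 0.97 0.97
```

In a diagnostic setting the malignant recall (41 of 43) is often the number to watch, since it reflects how many malignant cases the model misses.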

On this 114-case test split, the model distinguishes benign from malignant cases well: it reaches about 97% accuracy, misclassifying 1 of 71 benign cases and 2 of 43 malignant cases.

Conclusion

This case study demonstrates the potential of machine learning in medical diagnostics. Logistic regression, a simple yet powerful algorithm, provides significant insights and aids in making informed decisions in breast cancer diagnosis. As machine learning continues to evolve, its applications in healthcare promise to enhance patient outcomes and revolutionize medical practices.

For a more in-depth explanation of the code and concepts discussed in this article, be sure to subscribe to my YouTube channel. You'll find detailed video tutorials that break down complex Python topics into understandable lessons.

Additionally, if you'd like to get your hands on the source code used in this analysis, visit my GitHub repository. Feel free to star the repository to show your support and keep up to date with my latest projects and contributions to the data science community.

Remember, your engagement and support inspire me to create and share more valuable content. So, subscribe, star, and join me on this exciting journey into the world of data science!

YouTube

GitHub

