A Beginner's Guide to Machine Learning: Predicting Breast Cancer with Python
Mends Albert
Software Engineer | MPhil. Computer Science Student | Solidity | Data Scientist | I post about AI, Data Science, Machine Learning, and everything tech.
Introduction
In healthcare, early and accurate diagnosis is crucial for effective treatment, and this is particularly true for breast cancer. In this article, we explore how machine learning, specifically logistic regression, can help classify breast masses as malignant or benign using a real dataset.
The Dataset
We utilize a dataset containing features computed from digitized images of fine needle aspirate (FNA) of breast masses. The dataset includes characteristics like texture, perimeter, and area of the cell nuclei, and labels each instance as malignant (M) or benign (B).
Preparing the Environment
Our analysis begins with importing essential Python libraries:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler
Data Exploration and Preprocessing
We load the dataset using pandas and take a first look at it with data.head(). The dataset includes an 'id' column and an 'Unnamed: 32' column, which we drop because neither carries diagnostic information. We then map the 'diagnosis' column to binary values, where 'M' (malignant) becomes 1 and 'B' (benign) becomes 0.
file_path = './breast_cancer_data.csv'
data = pd.read_csv(file_path)
data.head()
data = data.drop(['id', 'Unnamed: 32'], axis=1)
data['diagnosis'] = data['diagnosis'].map({'M': 1, 'B': 0})
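As a quick sanity check on this step, the sketch below applies the same drop and map calls to a tiny hand-made DataFrame and confirms that every diagnosis was mapped. The `toy` frame, its `area_mean` column, and all of its values are invented for illustration; they are not rows from the real dataset. Any label other than 'M' or 'B' becomes NaN after `map`, so counting missing values catches unexpected labels early.

```python
import pandas as pd

# Hypothetical stand-in for the real CSV: three rows with a similar column layout.
toy = pd.DataFrame({
    'id': [101, 102, 103],
    'diagnosis': ['M', 'B', 'B'],
    'area_mean': [1001.0, 566.3, 477.1],
    'Unnamed: 32': [None, None, None],
})

# Same preprocessing as in the article: drop the identifier and the unused column.
toy = toy.drop(['id', 'Unnamed: 32'], axis=1)
toy['diagnosis'] = toy['diagnosis'].map({'M': 1, 'B': 0})

# map() turns unknown labels into NaN, so zero missing values means a clean mapping.
unmapped = int(toy['diagnosis'].isna().sum())
class_counts = toy['diagnosis'].value_counts().to_dict()

print(unmapped)
print(class_counts)
```

Running the same two checks on the real `data` frame right after the mapping step is a cheap way to confirm the labels are what the rest of the pipeline expects.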
Feature Selection and Data Splitting
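We separate the features (X) from the target variable (y, the diagnosis). Using train_test_split from sklearn, we divide the data into training and testing sets so the model can be validated on unseen data. Setting test_size=0.2 holds out 20% of the rows (114 cases here), and random_state=42 makes the split reproducible.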
X = data.drop('diagnosis', axis=1)
y = data['diagnosis']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
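The split above samples rows at random, so the share of malignant cases in the test set can differ somewhat from the share in the full data. One optional variation, which the code in this article does not use, is to pass `stratify=y` so that both sets keep the same class proportions. The sketch below shows the effect on synthetic labels; `X_demo` and `y_demo` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only: 100 rows, 30 of them in the positive class.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 30 + [0] * 70)

# stratify=y_demo asks for the same class ratio in the training and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(len(y_tr), len(y_te))            # 80 20
print(int(y_tr.sum()), int(y_te.sum()))  # 24 6
```

Both sets end up with 30% positives, matching the full data.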
Data Normalization
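Using StandardScaler, we standardize each feature to have a mean of 0 and a standard deviation of 1. The scaler is fitted on the training set only and then applied to the test set, so no information from the test data leaks into training. Standardization matters for logistic regression because scikit-learn's implementation is regularized by default, and its solver converges more reliably when features share a similar scale.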
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
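To show what fit_transform and transform do under the hood, here is a minimal NumPy sketch of the same arithmetic. The `train` and `test` arrays are invented values, not taken from the dataset. The key point is that the mean and standard deviation come from the training rows only and are then reused for the test rows.

```python
import numpy as np

# Toy feature matrices (invented values): two features on very different scales.
train = np.array([[10.0, 200.0], [12.0, 400.0], [14.0, 600.0]])
test = np.array([[12.0, 800.0]])

# "fit": learn the per-column mean and standard deviation from the training rows only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), which is what StandardScaler uses

# "transform": apply the training statistics to both sets.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0))  # approximately [0, 0]
print(train_scaled.std(axis=0))   # approximately [1, 1]
print(test_scaled)
```

The scaled test row is not guaranteed to have mean 0 or standard deviation 1, and that is expected: it is measured against the training distribution.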
Model Training
We choose Logistic Regression for its effectiveness in binary classification tasks. After training our model on the training set, we use it to make predictions on the test set.
model = LogisticRegression()
model.fit(X_train, y_train)
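For intuition about what the fitted model computes, the sketch below reproduces the prediction step by hand: a weighted sum of the standardized features passed through the sigmoid function. The weights, intercept, and feature rows here are hypothetical values chosen for illustration; a fitted scikit-learn model stores its learned values in model.coef_ and model.intercept_.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and intercept for two standardized features.
weights = np.array([1.5, -0.5])
intercept = -0.2

# Two example rows of standardized features (invented values).
rows = np.array([[2.0, 0.0], [-1.0, 1.0]])

scores = rows @ weights + intercept        # linear part: w . x + b
probabilities = sigmoid(scores)            # estimated probability of class 1 (malignant)
predictions = (probabilities > 0.5).astype(int)

print(probabilities)
print(predictions)  # [1 0]
```

The first row gets a high probability of class 1 and the second a low one, so they are labeled 1 and 0 respectively.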
Evaluation
We evaluate our model's performance using metrics like accuracy, confusion matrix, and classification report. These metrics provide insights into the model's ability to correctly classify the instances as malignant or benign.
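# Predict labels for the held-out test set, then compute and print the metrics
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)
print(f'Accuracy: {accuracy}')
print(f'Confusion Matrix:\n{conf_matrix}')
print(f'Classification Report:\n{report}')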
Visualizing the Results
Using matplotlib and seaborn, we create a confusion matrix heatmap. This visualization helps in understanding the true positives, false positives, true negatives, and false negatives in a more intuitive manner.
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=['Benign', 'Malignant'], yticklabels=['Benign', 'Malignant'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
Accuracy: 0.9736842105263158
Confusion Matrix:
[[70  1]
 [ 2 41]]
Classification Report:
              precision    recall  f1-score   support
           0       0.97      0.99      0.98        71
           1       0.98      0.95      0.96        43
    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114
Confusion Matrix Heatmap Explanation
A confusion matrix is a table used to evaluate the performance of a classification model. Each cell in the matrix represents the count of true and predicted label combinations. The heatmap here is a visual representation of the confusion matrix with color intensities corresponding to the cell values.
The tick labels 'Benign' and 'Malignant' appear on both axes: the horizontal axis shows the predicted label and the vertical axis shows the true label.
Accuracy Score Explanation
The accuracy score of 0.9736842105263158 (approximately 97.37%) means that the logistic regression model predicted the correct diagnosis for 111 of the 114 cases in the test set.
Confusion Matrix Values Explanation
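The confusion matrix is a summary of prediction results: 70 benign cases were correctly classified as benign (true negatives), 1 benign case was misclassified as malignant (false positive), 2 malignant cases were misclassified as benign (false negatives), and 41 malignant cases were correctly classified as malignant (true positives).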
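The four counts can also be read off the matrix programmatically. The sketch below hard-codes the matrix reported above (rather than recomputing it from the model) and shows how the accuracy follows from it.

```python
import numpy as np

# Confusion matrix reported above: rows are true labels, columns are predicted labels,
# both in the order [benign (0), malignant (1)].
conf_matrix = np.array([[70, 1],
                        [2, 41]])

# ravel() flattens the matrix row by row, giving the four counts in this order.
tn, fp, fn, tp = conf_matrix.ravel()

# Accuracy is the share of all test cases that fall on the diagonal.
accuracy = (tp + tn) / conf_matrix.sum()

print(tn, fp, fn, tp)       # 70 1 2 41
print(round(accuracy, 4))   # 0.9737
```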
Classification Report Explanation
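The classification report provides key metrics for evaluating the model's performance for each class. Precision is the fraction of predictions for a class that were correct (0.97 for benign, 0.98 for malignant). Recall is the fraction of actual cases of a class that the model identified (0.99 for benign, 0.95 for malignant). The F1-score is the harmonic mean of precision and recall (0.98 and 0.96), and support is the number of test cases in each class (71 benign, 43 malignant).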
The model performs well at distinguishing benign from malignant breast masses on this test set, as the high accuracy and per-class metrics show. The lower recall for the malignant class reflects the two malignant cases that were predicted as benign.
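To make the per-class numbers in the report concrete, this sketch recomputes precision, recall, and F1 directly from the confusion matrix counts reported earlier. The helper function `precision_recall_f1` is a name introduced here for illustration; it is not part of scikit-learn.

```python
# Counts taken from the reported confusion matrix, with malignant (1) as the positive class.
tn, fp, fn, tp = 70, 1, 2, 41

def precision_recall_f1(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) for one class from its raw counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Malignant class: the positives are the malignant cases.
malignant = precision_recall_f1(tp, fp, fn)
# Benign class: the roles swap, so correctly predicted benign cases act as "true positives".
benign = precision_recall_f1(tn, fn, fp)

print([round(v, 2) for v in benign])     # [0.97, 0.99, 0.98]
print([round(v, 2) for v in malignant])  # [0.98, 0.95, 0.96]
```

The rounded values match the rows for classes 0 and 1 in the classification report.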
Conclusion
This case study demonstrates the potential of machine learning in medical diagnostics. Logistic regression, a simple yet powerful algorithm, classified about 97% of the held-out cases correctly, which shows how such models can support informed decisions in breast cancer diagnosis. As machine learning continues to evolve, its applications in healthcare promise to enhance patient outcomes and change medical practice.
For a more in-depth explanation of the code and concepts discussed in this article, be sure to subscribe to my YouTube channel. You'll find detailed video tutorials that break down complex Python topics into understandable lessons.
Additionally, if you'd like to get your hands on the source code used in this analysis, visit my GitHub repository. Feel free to star the repository to show your support and keep up to date with my latest projects and contributions to the data science community.