Turning Pain into Progress: One Data Scientist's Back-to-Health Story
Vishal Jain
Strategic growth, tactical execution, exceptional teams – that's my focus | Technical Project Manager | Engineering | Technological Innovation | PMP | Digital Transformation | Data Science | Fullstack | Cloud
I recently injured my back while lifting weights due to improper form. The doctor diagnosed me with strained muscles near my spine. Confined to bed for two days with intense pain, I felt frustrated and yearned to get back on my feet. However, the pain made standing impossible.
In that state, I had a realization. As a data scientist, I could approach my own recovery with the same analytical mindset I use for problems. So, I decided to take charge and identify the factors influencing my healing. With pen in hand, I started making a list while still bedridden.
Determined to take control of my situation, I opened Google Colab and built a hypothetical dataset. While this data might not perfectly mirror real-world cases, it served two purposes: firstly, it provided a welcome distraction from negative thoughts, and secondly, it allowed me to delve into research relevant to my recovery. The dataset included some fundamental parameters that I believed could be influential.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Number of samples
num_samples = 100
# Generate random data
np.random.seed(42)
data = {
    'Patient ID': range(1, num_samples + 1),
    'Age': np.random.randint(20, 70, size=num_samples),
    'Gender': np.random.choice(['Male', 'Female'], size=num_samples),
    'Injury Severity': np.random.randint(1, 6, size=num_samples),
    'Treatment Type': np.random.choice(['Physical Therapy', 'Rest and Medication', 'RICE', 'Surgery'], size=num_samples),
    'Recovery Time (weeks)': np.random.randint(1, 20, size=num_samples)
}
# Create a DataFrame
df = pd.DataFrame(data)
# Display the first few rows of the DataFrame
print(df.head())
In this dataset, 'Injury Severity' is a subjective rating a healthcare provider might assign (1 being mild, 5 being severe), 'Treatment Type' records the initial treatment plan, and 'Recovery Time' is the number of weeks the patient took to recover.
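Before plotting anything, a quick sanity check of the generated columns can help; the summary below is just an illustrative exploration step on the synthetic DataFrame created above, not part of any prescribed workflow.
# Quick sanity check of the synthetic columns
print(df.describe())                        # summary stats for Age, Injury Severity, Recovery Time
print(df['Gender'].value_counts())          # roughly balanced male/female split
print(df['Treatment Type'].value_counts())  # how many patients fall under each treatment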
To get a better feel for the data distributions, I created some visualizations using matplotlib and seaborn.
# Data visualization
plt.figure(figsize=(12, 6))
# Age distribution
plt.subplot(2, 2, 1)
sns.histplot(df['Age'], bins=20, kde=True)
plt.title('Age Distribution')
# Injury Severity
plt.subplot(2, 2, 2)
sns.countplot(data=df, x='Injury Severity', order=sorted(df['Injury Severity'].unique()))  # severity is coded 1-5, so order by the numeric levels
plt.title('Injury Severity Distribution')
# Treatment Type
plt.subplot(2, 2, 3)
sns.countplot(data=df, y='Treatment Type', order=['Physical Therapy', 'Rest and Medication', 'RICE', 'Surgery'])
plt.title('Treatment Type Distribution')
# Recovery Time
plt.subplot(2, 2, 4)
sns.histplot(df['Recovery Time (weeks)'], bins=20, kde=True)
plt.title('Recovery Time Distribution')
plt.tight_layout()
plt.show()
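Beyond the individual distributions, a simple pivot of mean recovery time by treatment type and injury severity can hint at how the factors interact. This is only an exploratory sketch on the synthetic data; since the values are randomly generated, no clinical meaning should be read into them.
# Exploratory pivot: mean recovery time by treatment type and injury severity
pivot = df.pivot_table(
    values='Recovery Time (weeks)',
    index='Treatment Type',
    columns='Injury Severity',
    aggfunc='mean'
)
print(pivot.round(1))
# A heatmap makes the same table easier to scan
plt.figure(figsize=(8, 4))
sns.heatmap(pivot, annot=True, fmt='.1f', cmap='YlOrRd')
plt.title('Mean Recovery Time (weeks) by Treatment and Severity')
plt.tight_layout()
plt.show()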
To understand the relationship between age and recovery time, and to predict recovery time from age, I used a linear regression model. That means first preparing the data and then training the model. Here's how you can do it.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
# Sample data generation
np.random.seed(42)
data = {
    'Age': np.random.normal(50, 15, size=100).astype(int),
    'Recovery Time (weeks)': np.random.gamma(3, 2, size=100).astype(int)
}
df = pd.DataFrame(data)
# Extract features and target variable
X = df[['Age']]
y = df['Recovery Time (weeks)']
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a linear regression model
model = LinearRegression()
# Train the model on the training set
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Calculate and print metrics
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("R-squared:", r2_score(y_test, y_pred))
# Plot the results
plt.figure(figsize=(10, 6))
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Predicted')
plt.xlabel('Age')
plt.ylabel('Recovery Time (weeks)')
plt.title('Linear Regression Prediction')
plt.legend()
plt.show()
This code creates a linear regression model to predict recovery time based on age. It then evaluates the model using mean squared error and R-squared metrics and plots the actual versus predicted values.
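As a quick usage sketch, the fitted model can also be queried for a point prediction; the age of 35 below is just an arbitrary illustrative value.
# Predict recovery time for a hypothetical 35-year-old patient
new_patient = pd.DataFrame({'Age': [35]})
predicted_weeks = model.predict(new_patient)[0]
print(f"Predicted recovery time: {predicted_weeks:.1f} weeks")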
The lower the MSE, the better the model fits the data. A perfect model would have an MSE of 0.
R-squared typically ranges from 0 to 1, where 0 indicates that the model explains none of the variance in the dependent variable and 1 indicates that it explains all of it (on held-out data it can even go negative if the model fits worse than simply predicting the mean). A higher R-squared value indicates a better fit of the model to the data.
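For readers who want to see exactly what these two metrics compute, here is a minimal sketch of their definitions applied directly to the y_test and y_pred arrays from the code above; the results should match the sklearn functions.
# Manual computation of the two metrics from their definitions
residuals = y_test.values - y_pred                      # prediction errors
mse = np.mean(residuals ** 2)                           # mean squared error
ss_res = np.sum(residuals ** 2)                         # residual sum of squares
ss_tot = np.sum((y_test.values - y_test.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot                                # R-squared
print("Manual MSE:", mse)          # should match mean_squared_error(y_test, y_pred)
print("Manual R-squared:", r2)     # should match r2_score(y_test, y_pred)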