Gradient Boosting To Predict Hospital Length Of Stay
Dr. Mostafa Samy, BDS
Dentist | Healthcare Data Science / A.i | Generative A.i for healthcare "clinical LLM's" | Digital Healthcare Transformation & Integration | MSc Operations Research | Applied Statistics(FGSSR) | DTQM( AICPD).
We are still in the context of prediction. The goal is to move away from the narrow capabilities of spreadsheets, which limit us to "descriptive analytics", and to enjoy comparing different machine learning algorithms against linear regression, since Python has made this relatively simple and easy.
You do not have to be a pioneer in math and statistics to start wrangling your own custom data sets (at least for now); I am talking here to my friends from a healthcare background.
Before we upload our data to Google Colaboratory, let's first define gradient boosting and see how it differs from gradient descent.
Gradient boosting is a technique for building an ensemble of weak models such that the predictions of the ensemble minimize a loss function.
So what is an ensemble?
Ensemble modeling is a process in which multiple diverse base models are used to predict an outcome.
Gradient descent, by contrast, is an algorithm for finding a set of parameters that optimizes a given loss function. In short, gradient descent adjusts the parameters of one model step by step, while gradient boosting keeps adding new weak models, each fitted to reduce the loss left by the models before it.
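To make the gradient descent definition concrete, here is a minimal sketch that fits a straight line by repeatedly stepping against the gradient of the mean squared error. The function name `gradient_descent`, the toy data, the learning rate and the number of steps are all assumptions chosen for illustration; they are not taken from the notebook.

```python
import numpy as np

def gradient_descent(X, y, learning_rate=0.1, n_steps=500):
    """Fit y ~ X @ w + b by gradient descent on the mean squared error."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)  # start from all-zero parameters
    b = 0.0
    for _ in range(n_steps):
        error = X @ w + b - y                    # prediction error with current parameters
        grad_w = 2.0 * X.T @ error / n_samples   # gradient of the loss with respect to w
        grad_b = 2.0 * error.mean()              # gradient of the loss with respect to b
        w -= learning_rate * grad_w              # step against the gradient
        b -= learning_rate * grad_b
    return w, b

# Hypothetical toy data: y = 3x + 2 plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.1, size=200)

w, b = gradient_descent(X, y)
print("slope:", w[0], "intercept:", b)
```

The printed slope and intercept should land close to the 3 and 2 used to generate the toy data. Note that only the parameters `w` and `b` change here; no new model is ever added.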
We can summarize it in 5 steps as follows:
I will add a YouTube video in the first comment for more explanation of the concept. Watch it, then dive into the Python code with me.
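Before moving to the library version, the boosting idea itself can be sketched by hand. This is my own simplified illustration for the squared error loss, not the article's original 5-step figure and not how scikit-learn implements it internally: start from the mean, then repeatedly fit a very small tree (a "stump") to the current residuals and add a shrunken copy of its predictions. The helper names `fit_boosted_stumps` and `predict_boosted` and the toy sine data are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_stumps(X, y, n_rounds=100, learning_rate=0.1):
    """Hand-rolled gradient boosting for squared error, using depth-1 trees."""
    base = float(np.mean(y))                   # 1) start with a constant prediction
    prediction = np.full(len(y), base)
    stumps = []
    for _ in range(n_rounds):                  # 5) repeat for a fixed number of rounds
        residuals = y - prediction             # 2) negative gradient of the squared error
        stump = DecisionTreeRegressor(max_depth=1)
        stump.fit(X, residuals)                # 3) fit a weak model to the residuals
        prediction = prediction + learning_rate * stump.predict(X)  # 4) add a small correction
        stumps.append(stump)
    return base, stumps

def predict_boosted(base, stumps, X, learning_rate=0.1):
    """Sum the constant start and every stump's shrunken contribution."""
    prediction = np.full(X.shape[0], base)
    for stump in stumps:
        prediction += learning_rate * stump.predict(X)
    return prediction

# Hypothetical toy data: a noisy sine curve that a single stump cannot fit
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=300)

base, stumps = fit_boosted_stumps(X, y)
mse_start = float(np.mean((y - base) ** 2))
mse_end = float(np.mean((y - predict_boosted(base, stumps, X)) ** 2))
print("training MSE with the mean only:", mse_start)
print("training MSE after boosting:", mse_end)
```

Each stump on its own is a weak model, but the training error of the ensemble drops as the rounds accumulate. The same `learning_rate` must be passed to both helpers, which is a simplification of this sketch.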
Upload The Data To Google Colaboratory
Load Dependencies Using This Snippet
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
Load our dummy data, or use your own custom data from your electronic medical records, using this snippet. If the data is real, use offline Jupyter notebooks for confidentiality.
df = pd.read_csv('/content/Healthcare_Investments_and_Hospital_Stay (1).csv')
df.head(7)
Draw A Heat Map Using This Snippet (Seaborn Python Library)
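The heat map snippet itself did not survive in the text, so here is a minimal sketch of what such a step could look like: a correlation heat map of the numeric columns. The small `demo` table is a hypothetical stand-in for `df`; only `Location` and `Hospital_Stay` are column names taken from the article, while `Hospital_Beds`, `MRI_Units` and all the values are made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for df; replace `demo` with your own DataFrame
demo = pd.DataFrame({
    "Location": ["AUS", "AUS", "CAN", "CAN", "USA", "USA"],
    "Hospital_Stay": [5.2, 5.0, 7.4, 7.1, 6.1, 5.9],
    "Hospital_Beds": [3.8, 3.7, 2.9, 2.8, 3.1, 3.0],
    "MRI_Units": [12.0, 13.5, 8.1, 8.9, 30.2, 34.0],
})

# Correlations are only defined for numeric columns, so drop the text column first
corr = demo.select_dtypes(include="number").corr()

# annot=True writes each correlation value inside its cell
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
ax.set_title("Correlation heat map")
plt.tight_layout()
# In a notebook the figure renders at the end of the cell; in a plain script call plt.show()
```

Cells close to 1 or -1 point to columns that move together with the target, which is a quick sanity check before modeling.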
One Hot Encoding + Train/Test Split
def onehot_encode(df, column):
    # Replace a categorical column with one indicator column per category
    df = df.copy()
    dummies = pd.get_dummies(df[column])
    df = pd.concat([df, dummies], axis=1)
    df = df.drop(column, axis=1)
    return df

def preprocess_inputs(df):
    df = df.copy()

    # One-hot encode Location column
    df = onehot_encode(df, column='Location')

    # Split df into X and y
    y = df['Hospital_Stay'].copy()
    X = df.drop('Hospital_Stay', axis=1).copy()

    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=123)

    # Scale X with a standard scaler fitted on the training rows only
    scaler = StandardScaler()
    scaler.fit(X_train)

    # Keep the original row index so X and y stay aligned
    X_train = pd.DataFrame(scaler.transform(X_train), index=X_train.index, columns=X.columns)
    X_test = pd.DataFrame(scaler.transform(X_test), index=X_test.index, columns=X.columns)

    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = preprocess_inputs(df)
Build Linear Regression Model and Print The R squared
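R squared measures the proportion of the variation in the dependent variable that can be explained by the independent variable(s). It is at most 1. For a linear model with an intercept it lies between 0 and 1 on the training data, but on held-out data it can drop below 0 when the model predicts worse than simply using the mean.
Here it is 0.85.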
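The linear regression snippet is not reproduced in the text, so the following is a sketch of the step rather than the notebook's exact code. It assumes synthetic data from `make_regression` as a stand-in for the output of `preprocess_inputs(df)`, fits a `LinearRegression`, and computes R squared twice: once with the model's `score` method and once by hand from the definition, to show what the number means.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption); with the article's data you would
# reuse X_train, X_test, y_train, y_test from preprocess_inputs(df) instead
X, y = make_regression(n_samples=300, n_features=4, noise=15.0, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=123)

linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

# R squared on the held-out test set, as reported by scikit-learn
r2_sklearn = linear_model.score(X_test, y_test)

# The same quantity from its definition: 1 - (residual sum of squares / total sum of squares)
y_pred = linear_model.predict(X_test)
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("R squared from score():", r2_sklearn)
print("R squared by hand:", r2_manual)
```

The two printed values agree, because `score` on a regressor returns exactly this ratio.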
Let's build the gradient boosting model with a few lines of scikit-learn. The library documentation link is in the second comment.
Here the R squared is 0.93, which means that gradient boosting predicts better than linear regression on this data.
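As with the previous step, the original snippet is missing from the text, so this is a hedged sketch of the comparison rather than the notebook's code. It assumes the synthetic `make_friedman1` data set, which has a nonlinear target, as a stand-in for the hospital data, and uses `GradientBoostingRegressor` with its default settings apart from a fixed `random_state`.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a nonlinear target (an assumption, not the article's data)
X, y = make_friedman1(n_samples=500, noise=1.0, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=123)

# Baseline: ordinary linear regression
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
linear_r2 = linear_model.score(X_test, y_test)

# Gradient boosting with default hyperparameters
gb_model = GradientBoostingRegressor(random_state=123)
gb_model.fit(X_train, y_train)
gb_r2 = gb_model.score(X_test, y_test)

print("Linear regression R squared:", linear_r2)
print("Gradient boosting R squared:", gb_r2)
```

On this nonlinear toy data gradient boosting scores higher, but that is not guaranteed in general: when the relationship really is close to linear, or the data set is very small, linear regression can match or beat it, so always compare both on a held-out test set.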
In the next article we will discuss the use of natural language processing (NLP) on medical records and see whether we can use it to predict unplanned readmission within 30 days of discharge.
The Google Colaboratory notebook is in the third comment.
An awesome Kaggle kernel and some other resources are also attached in the comments.
Resources from the comments:
Kaggle kernel: https://www.kaggle.com/drscarlat/predict-hospital-length-of-stay-los-mimic2
Cross Validated question on overfitting and R squared: https://stats.stackexchange.com/questions/425622/does-over-fitting-a-model-affect-r-squared-only-or-adjusted-r-squared-too
Wikipedia article on gradient boosting: https://en.wikipedia.org/wiki/Gradient_boosting
Google Colaboratory notebook: https://colab.research.google.com/drive/1AfnVfNkfgRK_YbuCIa0ShKqQ6NhZCJSt?usp=sharing
scikit-learn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html?highlight=gradient%20boosting%20regressor#