Machine Learning Framework

Sharing this with people who are just starting their machine learning journey.

1)     Define the Problem and Success Measures

a)      Determine whether the problem is supervised or unsupervised, and whether it calls for regression or classification

b)     Determine the level of accuracy (or other metric threshold) the model must reach to be deemed successful

2)     Data Collection

a)      More data generally yields a more accurate model, though returns diminish and data quality matters as much as quantity

b)     Look beyond the dataset for other relevant domain information/data

3)     Data Preparation

a)      Wrangle the data and prepare it for training (e.g. creating lag features for time series, handling multivariate inputs)

b)     Clean the data where needed (remove duplicates, correct errors, handle missing values, normalize, enforce stationarity, convert data types, remove bias, etc.)

c)      Randomize the data (except for time series), which erases the effects of the particular order in which it was collected and/or prepared

d)     Visualize the data to help detect relevant relationships between variables, class imbalances, or bias, or to perform other exploratory analysis

e)     Use information from visualization as well as other domain knowledge to generate features

f)       Split into training and evaluation sets (70/30, 80/20, etc.)
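The preparation steps above (shuffle, normalize, split) can be sketched in a few lines of plain Python; the dataset here is made up purely for illustration:

```python
import random

# Hypothetical toy dataset of (feature, label) pairs
data = [(x, 2 * x + 1) for x in range(100)]

# Randomize the order (skip this step for time series!)
random.seed(42)
random.shuffle(data)

# Min-max normalize the feature column to [0, 1]
xs = [x for x, _ in data]
lo, hi = min(xs), max(xs)
data = [((x - lo) / (hi - lo), y) for x, y in data]

# 80/20 train/evaluation split
cut = int(0.8 * len(data))
train_set, eval_set = data[:cut], data[cut:]
print(len(train_set), len(eval_set))  # 80 20
```

In a real project a library such as pandas or scikit-learn would typically handle these steps.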

4)     Choose a Model

a)      Select the right algorithm for the specific task; different algorithms perform better at different tasks (e.g. CNNs for vision systems, LSTMs/Transformers for natural language, LSTM/Prophet/ARIMA for time series, K-Means for clustering, XGBoost/LightGBM/CatBoost for tabular data, etc.)

b)     If unsure, run AutoML with an ensemble (H2O.ai, AutoGluon, etc.)

c)      Also weigh runtime: some algorithms train and predict much faster than others
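One way to read "choose a model" is as an empirical comparison: fit each candidate on the same training data and keep the one with the lower validation error. A minimal sketch with two made-up candidates (a mean baseline and a least-squares line on invented data):

```python
# Hypothetical data following y = 3x + 2 exactly
train = [(x, 3 * x + 2) for x in range(20)]
val = [(x, 3 * x + 2) for x in range(20, 25)]

def mean_baseline(train):
    # Predicts the training mean regardless of input
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def least_squares_line(train):
    # Closed-form simple linear regression
    n = len(train)
    sx = sum(x for x, _ in train)
    sy = sum(y for _, y in train)
    sxx = sum(x * x for x, _ in train)
    sxy = sum(x * y for x, y in train)
    m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - m * sx) / n
    return lambda x: m * x + b

def val_mse(model):
    return sum((model(x) - y) ** 2 for x, y in val) / len(val)

candidates = {
    "mean baseline": mean_baseline(train),
    "linear fit": least_squares_line(train),
}
best_name = min(candidates, key=lambda name: val_mse(candidates[name]))
print(best_name)  # linear fit
```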

5)     Train the Model

a)      Training assigns weights/importance to features. Linear regression example: the algorithm needs to learn values for the slope m (or weight W) and intercept b in y = mx + b, where x is the input and y is the output

b)     The more iterations or training steps, the more accurate the model becomes, but accuracy eventually saturates; balance time and computational cost against accuracy
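To make the linear regression example concrete, here is a toy gradient-descent loop that learns m and b from synthetic data generated with m = 2, b = 1 (the data, learning rate, and step count are all illustrative choices):

```python
# Synthetic data on the line y = 2x + 1
data = [(i / 10, 2 * (i / 10) + 1) for i in range(50)]

m, b = 0.0, 0.0   # start with arbitrary weights
lr = 0.05         # learning rate (a hyperparameter)
n = len(data)

for step in range(2000):
    grad_m = grad_b = 0.0
    for x, y in data:
        err = (m * x + b) - y          # prediction error
        grad_m += 2 * err * x / n      # dMSE/dm
        grad_b += 2 * err / n          # dMSE/db
    m -= lr * grad_m                   # step the weights downhill
    b -= lr * grad_b

print(round(m, 2), round(b, 2))  # 2.0 1.0
```

More steps sharpen the fit only up to a point; after convergence, extra iterations just burn compute.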

6)     Evaluate the Model

a)      Use a metric or combination of metrics to measure the objective performance of the model (RMSE, MAE, MAPE, etc.)

b)     Test the model against previously unseen data to further tune the model.

c)      Compare different train/eval splits (80/20, 70/30, etc.) depending on the domain, data availability, dataset particulars, etc.

d)     Prevent overfitting; the model should be able to generalize to unseen data
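The metrics mentioned above are straightforward to compute by hand; the actual/predicted values below are invented for illustration:

```python
import math

# Hypothetical actuals vs. model predictions
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 38.0]

n = len(actual)
errors = [p - a for a, p in zip(actual, predicted)]

rmse = math.sqrt(sum(e ** 2 for e in errors) / n)                   # root mean squared error
mae = sum(abs(e) for e in errors) / n                               # mean absolute error
mape = 100 * sum(abs(e) / a for e, a in zip(errors, actual)) / n    # mean absolute percentage error

print(round(rmse, 3), round(mae, 3), round(mape, 3))  # 2.291 2.25 11.25
```

RMSE punishes large errors more heavily than MAE, while MAPE expresses error relative to the actual value, which is useful when scales vary.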

7)     Parameter Tuning

a)      Tune the algorithm's hyperparameters, especially for neural networks

b)     Manually tune hyperparameters for improved performance, or use automated tools (AutoKeras, Google’s AutoML) for neural networks

c)      Common hyperparameters include the number of training steps (epochs), the learning rate, initialization (seed) values and distribution, the number of nodes, etc.
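Hyperparameter tuning can be as simple as a grid search: train once per combination and keep the settings with the lowest validation error. This sketch tunes the learning rate and epoch count of a toy gradient-descent linear model (all values are illustrative):

```python
def train(data, lr, epochs):
    # Gradient descent for y = m*x + b
    m = b = 0.0
    n = len(data)
    for _ in range(epochs):
        gm = gb = 0.0
        for x, y in data:
            err = (m * x + b) - y
            gm += 2 * err * x / n
            gb += 2 * err / n
        m -= lr * gm
        b -= lr * gb
    return m, b

def mse(data, m, b):
    return sum(((m * x + b) - y) ** 2 for x, y in data) / len(data)

# Synthetic data on y = 2x + 1, split by position
train_set = [(i / 10, 2 * (i / 10) + 1) for i in range(40)]
val_set = [(i / 10, 2 * (i / 10) + 1) for i in range(40, 50)]

best = None
for lr in (0.001, 0.01, 0.05):
    for epochs in (100, 1000):
        m, b = train(train_set, lr, epochs)
        score = mse(val_set, m, b)
        if best is None or score < best[0]:
            best = (score, lr, epochs)

print(best[1], best[2])  # the (lr, epochs) pair with the lowest validation error
```

Libraries such as scikit-learn (GridSearchCV) or Optuna automate exactly this loop at scale.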

8)     Make Predictions

a)      Use further (test-set) data which have, until this point, been withheld from the model (and for which labels are known) to test it; this gives a better approximation of how the model will perform in the real world
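As a tiny illustration of this final check, here is a hypothetical threshold classifier scored on held-out examples whose true labels are known:

```python
# Held-out test data: (score, true label) pairs, invented for illustration
test_set = [(0.2, 0), (0.45, 1), (0.6, 1), (0.9, 1), (0.3, 0)]

def classify(x, threshold=0.5):
    # Predict class 1 when the score clears the threshold
    return 1 if x >= threshold else 0

predictions = [classify(x) for x, _ in test_set]
correct = sum(p == label for p, (_, label) in zip(predictions, test_set))
accuracy = correct / len(test_set)
print(accuracy)  # 0.8 (the 0.45 example is misclassified)
```

Because the test set was never used for training or tuning, this accuracy is the most honest estimate of real-world performance.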
