Artificial Intelligence Unfolded: Article 6 - Synthetic Data, Model Training to Model Interaction
Hrishi Kulkarni
Chief Technology Officer (CTO), Executive Director, Board Member, Innovation and Change Catalyst, Strategic Technologist, Product & Data Engineering, Cloud Computing, AI/ML, GenAI, MLOps, Programme Management
Today, in the sixth article in my series on Artificial Intelligence, I thought I'd share my experience of working on a retail demand prediction project as part of my recent University of Oxford course on "Artificial Intelligence: Generative AI, Cloud and MLOps", run by Course Director Ajit Jaokar and his team. It's been a lot of fun, especially seeing everything come together in a working model, which also required me to deepen my understanding of synthetic data generation, feature engineering, various machine learning algorithms, training the model multiple times with hyperparameter tuning and, more importantly, assessing model performance before deployment. As always, I'll try and keep it simple.
The Role of Synthetic Data in Model Training
One of the biggest challenges in machine learning is obtaining large and diverse datasets that are reflective of real-world scenarios. This is where synthetic data comes in. By generating my own data, I ensured that the model training was robust and comprehensive. I created datasets that mimic real-world sales data, complete with product IDs, sales figures, and economic indicators such as GDP and inflation.
Generating Synthetic Data with ChatGPT
To kick things off, I used ChatGPT to help brainstorm and outline the types of data points relevant for a demand prediction model. Through conversation (I had to ask the right questions), we discussed various attributes like GDP growth rates, inflation rates, and typical sales figures, which helped me shape a dataset that was not only realistic but also tailored to the specific challenges of predicting product demand.
ChatGPT was able to assist me in generating a 500k-record dataset containing sales transactions across various countries, stores, products and time periods.
I had to go through several iterations to produce the data I needed, fine-tuning my prompts and giving ChatGPT improved data rules I wanted it to abide by with each iteration.
#PromptEngineering If interested, please read my previous article https://www.dhirubhai.net/pulse/artificial-intelligence-unfolded-article-5-crafting-hrishi-kulkarni-ytvse
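For readers who would rather script the data than prompt for it, below is a minimal sketch of how a dataset with a similar shape could be generated with pandas and NumPy. This is not how I produced my data (I used ChatGPT), and the column names, value ranges and the simple link between sales and the economic indicators are illustrative assumptions rather than the exact rules I iterated on.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_records = 500_000  # matches the scale of the dataset; the schema below is illustrative

countries = ["UK", "US", "DE", "FR", "IN"]
dates = pd.date_range("2019-01-01", "2023-12-31", freq="D")

df = pd.DataFrame({
    "date": rng.choice(dates, n_records),
    "country": rng.choice(countries, n_records),
    "store_id": rng.integers(1, 51, n_records),
    "product_id": rng.integers(1, 201, n_records),
    # simple per-record economic indicators (the real rules were refined over several prompts)
    "gdp_growth": rng.normal(2.0, 1.0, n_records).round(2),
    "inflation": rng.normal(3.0, 1.5, n_records).round(2),
})

# sales loosely driven by the economic indicators plus noise, clipped at zero
base = 50 + 5 * df["gdp_growth"] - 3 * df["inflation"]
df["units_sold"] = np.clip(base + rng.normal(0, 10, n_records), 0, None).round().astype(int)
```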
The Crucial Steps of Model Training
Loading and Preparing Data: The first step was to load the data into a suitable format for analysis and processing. This included basic cleaning and setting up data structures that support efficient access and manipulation.
Feature Engineering: The heart of a good model lies in its features. I spent considerable time crafting features that could capture the underlying patterns in the data. From rolling averages to more complex calculations like exponential moving averages (EMA), each feature was designed to provide the model with insightful inputs (see the sketch after these steps).
Data Exploration and Feature Selection: Before diving into model building, I explored the data through visualisations to understand the distributions and relationships. This step was crucial for feature selection, ensuring that only the most relevant and impactful features were included in the final model.
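To make the feature engineering step concrete, here is a small sketch of rolling-average and EMA style features, assuming a dataframe with date, product_id and units_sold columns. The names and window sizes are illustrative, not necessarily my exact schema.

```python
import pandas as pd

# df is assumed to hold one row per product per day with columns: date, product_id, units_sold
df = df.sort_values(["product_id", "date"])

grouped = df.groupby("product_id")["units_sold"]

# 7-day rolling average of sales per product
df["rolling_mean_7"] = grouped.transform(lambda s: s.rolling(window=7, min_periods=1).mean())
# exponential moving average gives more weight to recent sales
df["ema_7"] = grouped.transform(lambda s: s.ewm(span=7, adjust=False).mean())
# lag feature so the model only sees information available before the prediction date
df["lag_1"] = grouped.shift(1)
```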
Model Development and Training
Before training, scaling (also known as feature scaling or data normalisation) was a crucial preprocessing step. Its primary purpose was to standardise the range of the independent variables or features, which helps ensure that the machine learning algorithm functions optimally.
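As a sketch of that step with scikit-learn's StandardScaler; the feature list and split settings here are illustrative, carried over from the earlier snippets.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

feature_cols = ["rolling_mean_7", "ema_7", "lag_1", "gdp_growth", "inflation"]  # illustrative
model_df = df.dropna(subset=feature_cols)
X = model_df[feature_cols]
y = model_df["units_sold"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
# fit on the training data only, then reuse the same scaler for test data and live predictions
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```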
Choosing the right model was key. I tried Linear Regression and a Random Forest Regressor, but settled on the Gradient Boosting Regressor for its robustness and effectiveness in handling diverse datasets. I spent a good amount of time analysing the output of each of these models before deciding which one to use; a quick comparison sketch follows below.
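A rough sketch of that comparison, using cross-validated MSE as the yardstick. The model parameters and fold count are illustrative, not my exact configuration.

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

candidates = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

for name, model in candidates.items():
    # negative MSE is scikit-learn's convention, so flip the sign for readability
    scores = cross_val_score(model, X_train_scaled, y_train,
                             scoring="neg_mean_squared_error", cv=3)
    print(f"{name}: mean MSE = {-scores.mean():.2f}")
```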
Using GridSearchCV was a game-changer: it automated the tuning process, trying out various combinations of hyperparameters to find the best fit. Hyperparameters are the settings of a model that are fixed before training (such as tree depth or learning rate) and significantly influence performance. GridSearchCV scores each combination using cross-validation, so the winning settings are judged on held-out folds rather than on the data they were trained on.
The fitting process itself was quite interesting. Across different runs, GridSearchCV fitted the model 96 times, 24 times and 12 times with different parameter grids. These counts essentially reflect the number of training runs needed to find the best-fit hyperparameters: the number of candidate combinations multiplied by the number of cross-validation folds.
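A minimal sketch of such a search, with an illustrative grid rather than my exact search space. Here 2 x 2 x 2 = 8 candidate combinations evaluated over 3 folds gives the "24 fits" that GridSearchCV reports; a larger grid produces counts like 96 in the same way.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [3, 5],
    "learning_rate": [0.01, 0.1],
}

# 8 candidates x 3 folds = 24 fits in total
search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
    verbose=1,
)
search.fit(X_train_scaled, y_train)
print(search.best_params_)
```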
Hyperparameter Tuning and Retraining
I experimented with different settings, adjusting parameters like the number of estimators and the depth of the decision trees. This not only enhanced the model's accuracy but also gave me a deep dive into how each parameter impacts the model’s performance.
Another interesting observation was how the learning rate can impact model training and outcomes. I tried training models with learning rates of 0.1 and 0.01. On balance, the higher learning rate (0.1, compared with 0.01 in one of the earlier iterations), together with well-chosen tree parameters, helped regain much of the accuracy and model fit that had been lost in some of the intervening iterations.
Below are the key aspects I considered when assessing the model's suitability for deployment:
Stability and Generalisation: The model settings had to offer a good balance between accuracy and generalisation. The model parameters needed to be well tuned to prevent overfitting while still capturing the essential patterns in the data.
Model Robustness: Consistent performance across the training and testing datasets, as indicated by similar Mean Squared Error (MSE) values, was an important consideration (see the sketch after this list).
Further Tuning: While the model performed very well, there will always be ways to improve the accuracy.
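As a quick sketch of the robustness check mentioned above, comparing MSE on the training and test splits (the variable names carry over from the earlier illustrative snippets):

```python
from sklearn.metrics import mean_squared_error

best_model = search.best_estimator_
train_mse = mean_squared_error(y_train, best_model.predict(X_train_scaled))
test_mse = mean_squared_error(y_test, best_model.predict(X_test_scaled))

# a large gap between the two values is a sign of overfitting
print(f"Train MSE: {train_mse:.2f}  Test MSE: {test_mse:.2f}")
```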
From Model Training to Interaction
The final piece of the puzzle was the forecast demand function, which uses past sales data to calculate key features like EMA and feed them into the model. This function became the bridge between raw data and actionable predictions, allowing me to interact with the model in a meaningful way.
Using the exact same scaler or transformation applied during training when making predictions was fundamental for maintaining consistency and accuracy.
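Below is a simplified sketch of what such a forecast function could look like. The function name, signature and feature names are hypothetical and carried over from the earlier illustrative snippets, but it shows the key idea: recompute the engineered features from recent sales and reuse the scaler fitted at training time.

```python
import pandas as pd

def forecast_demand(past_sales, gdp_growth, inflation, model, scaler):
    """Turn recent daily sales into the engineered features and predict demand.

    `past_sales` is a list or Series of recent daily unit sales; the feature
    names and order must match those used at training time (illustrative here).
    """
    s = pd.Series(past_sales)
    features = pd.DataFrame([{
        "rolling_mean_7": s.rolling(window=7, min_periods=1).mean().iloc[-1],
        "ema_7": s.ewm(span=7, adjust=False).mean().iloc[-1],
        "lag_1": s.iloc[-1],
        "gdp_growth": gdp_growth,
        "inflation": inflation,
    }])
    # reuse the scaler fitted on the training data; never re-fit it here
    return float(model.predict(scaler.transform(features))[0])

# example call with hypothetical values
# forecast_demand([42, 38, 51, 47, 45, 50, 44], gdp_growth=2.1, inflation=3.0,
#                 model=best_model, scaler=scaler)
```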
Deploying the Best Model
After numerous training sessions and adjustments, I identified the best-performing model based on its accuracy and generalisation capabilities. Deploying this model and plugging it into a user-friendly interface was the culmination of all the hard work — a tool that not only predicts but also adapts and learns from new data.
Conclusion
This journey from data preparation to model deployment has been incredibly rewarding. I kept the model deployment simple and didn't deploy it as an endpoint. It was real fun though, and I enjoyed every step, from the nitty-gritty of tuning models to the thrill of seeing accurate predictions unfold. It's a testament to how AI can transform data into insights, and insights into actionable intelligence.
Stay tuned for more updates as I continue exploring new aspects of AI. If you're embarking on a similar journey, I suggest you get your hands dirty and understand the principles.
Kelly Coutinho Anjali Jain tagging you as felt you'd like to read this :-)