RapidMiner: Simplifying Machine Learning Models for M5 Competition
Alkiviadis Vazacopoulos
Educator & World Leading Expert in Combining AI & Prescriptive & Predictive Application Development and Deployment
Kara Pietrowicz & Alkis Vazacopoulos, Stevens Institute of Technology
Overview
Artificial intelligence (AI) is implemented in a wide range of industries for predictive analysis, pattern recognition systems, and other applications. Regardless of the end goals, organizations utilize artificial intelligence to create machine learning (ML) models to address various problems they face. Most of us are exposed to ML algorithms without knowing it. For example, let’s say you purchased a handbag from Macy’s online store. A few days later, you receive emails for shopping suggestions or notice online advertisements related to Macy’s or similar handbags. It is great that your shopping experience has been refined, but how did Macy’s know your personal preferences? Well, your product recommendations are made based on the results of a machine learning model! Your behavior on Macy’s website, past purchases, products added to your cart, brand preferences, and even filters applied to your product searches are all important details used by Macy’s ML models to create shopping suggestions [2].
Machine learning models learn general patterns from training data and apply that knowledge to new data to make the desired predictions. Even though the shopping example makes ML models sound easy to design, the process is quite difficult. There are numerous aspects to consider when building an accurate ML model. The following are a few questions that should be asked for any model [1].
- What data is necessary for training the model?
- How much data is needed?
- What is the current quantity and quality of the data?
- Do we need to categorize or label the data?
- How should the data be organized for the model?
- Do we have to replace any incorrect data?
- Should we remove any irrelevant data?
Working with a large dataset makes even the simplest tasks for data preparation challenging and time-consuming. After collecting and organizing data, the ML model will be trained on the data by applying various model selection techniques and algorithms. This phase is carried out in Python or other programming languages that support AI and ML concepts. However, there is no step-by-step guide for building ML models. The majority of the time dedicated to programming the model is spent on trial and error. The following is a summary of the general programming process [1].
- Identify the relevant data features that will provide the best model results
- Develop multiple models for improved performance
- Test each model
- Determine which model best meets the requirements for its operational goal
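As a rough illustration of the process summarized above, the following minimal sketch builds three very simple forecasting models, tests each on held-out days, and picks the one with the lowest error. The sales values, the model names, and the use of mean absolute error are assumptions made for this example and are not taken from the M5 data or from RapidMiner.

```python
from statistics import mean

# Hypothetical daily unit sales for one product (illustrative values, not M5 data).
sales = [3, 4, 2, 5, 4, 3, 6, 5, 4, 5, 6, 4]

# Hold out the last four days for testing; train on the rest.
train, test = sales[:-4], sales[-4:]


def mean_model(history):
    """Predict the average of all past observations."""
    return mean(history)


def last_value_model(history):
    """Predict that the next day equals the most recent observation."""
    return history[-1]


def moving_average_model(history, window=3):
    """Predict the average of the last `window` observations."""
    return mean(history[-window:])


def mean_absolute_error(model, train, test):
    """Walk forward through the test days, predicting one day at a time."""
    history = list(train)
    errors = []
    for actual in test:
        errors.append(abs(model(history) - actual))
        history.append(actual)  # reveal the true value before the next step
    return mean(errors)


# Develop multiple models, test each, and keep the best one.
candidates = {
    "mean": mean_model,
    "last_value": last_value_model,
    "moving_average": moving_average_model,
}
scores = {name: mean_absolute_error(m, train, test) for name, m in candidates.items()}
best_model = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: MAE = {score:.3f}")
print("best:", best_model)
```

Even in this toy setting, most of the code handles evaluation rather than the models themselves, which is where much of the trial and error tends to go.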
Again, training a model on a large dataset lengthens the time spent creating and testing it. More often than not, programming errors will arise, whether from a miscalculated variable or a logic error. Programming an entire ML model in pursuit of an accurate prediction can be chaotic and tedious. Fortunately, there is a platform that reduces the effort required to develop ML models. Introducing RapidMiner!
What is RapidMiner?
RapidMiner is a data science platform that removes much of the hassle involved in data preparation and ML model development. Here is a quick overview of the platform’s operations [3].
- Data Engineering
  o Visualize and explore data
  o Prepare data for analysis
  o Select necessary features
- Model Building
  o Build models with or without code
  o Monitor models’ progress over time
  o 1500+ algorithms and functions
  o Easily validate your models and present intuitive results
RapidMiner provides other functionalities, such as AI app building and easy collaboration, but for the purpose of this article, we will focus on how RapidMiner can build ML models for the M5 competition.
M5 Competition with RapidMiner
As a reminder, the M5 competition concentrated on forecasting methodologies for the retail industry using hierarchical sales data provided by Walmart stores in California, Texas, and Wisconsin. The dataset included the unit sales of over three thousand U.S. products, classified into three categories (Hobbies, Foods, and Households), and seven product departments. Walmart selected this data with the objective of representing the shopping habits of various selling locations.
To demonstrate RapidMiner’s capabilities, we will use a subset of the M5 data. From the Hobbies category of Store 1 in California, one year’s worth of data for two hundred products was selected to predict the unit sales of each selected product on a certain day. The following list introduces each feature of the adjusted product data [6].
- date: The date in a “y-m-d” format.
- wm_yr_wk: The id of the week the date belongs to.
- weekday: The type of the day (Saturday, Sunday, …, Friday).
- wday: The id of the weekday, starting from Saturday.
- event_name_1: If the date includes an event, the name of this event.
- event_type_1: If the date includes an event, the type of this event.
- event_name_2: If the date includes a second event, the name of this event.
- event_type_2: If the date includes a second event, the type of this event.
- snap_CA: A binary variable (0 or 1) indicating whether the CA store allowed SNAP (Supplemental Nutrition Assistance Program) purchases on the examined date. 1 indicates that SNAP purchases are allowed.
- item_id: The id of the product.
- cat_id: The id of the category the product belongs to.
- d_1, d_2, …, d_i, … d_1941: The number of units sold at day i, starting from 2011-01-29.
- sell_price: The price of the product for the given week/store. If not available, this means that the product was not sold during the examined week.
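To make the relationship between these features concrete, here is a minimal sketch of how the wide d_1, d_2, … columns could be reshaped into one row per product and day, then joined with the calendar and price information. The tiny tables and all values are invented for illustration, and the "d" calendar key and the "units_sold" column name are assumptions of this sketch rather than part of the adjusted dataset described above.

```python
# Hypothetical miniature versions of the calendar, sales, and price tables.
# All values are invented; only the feature names follow the list above.
calendar = [
    {"d": "d_1", "date": "2011-01-29", "wm_yr_wk": 11101, "weekday": "Saturday", "wday": 1},
    {"d": "d_2", "date": "2011-01-30", "wm_yr_wk": 11101, "weekday": "Sunday", "wday": 2},
]
sales_wide = [
    {"item_id": "HOBBIES_1_001", "cat_id": "HOBBIES", "d_1": 0, "d_2": 1},
    {"item_id": "HOBBIES_1_002", "cat_id": "HOBBIES", "d_1": 3, "d_2": 2},
]
# Weekly prices keyed by (item_id, wm_yr_wk). A missing key means the product
# was not sold during that week.
prices = {("HOBBIES_1_001", 11101): 9.58}


def to_long(sales_wide, calendar, prices):
    """Return one row per product and day, with calendar and price attached."""
    rows = []
    for product in sales_wide:
        for day in calendar:
            rows.append({
                "item_id": product["item_id"],
                "cat_id": product["cat_id"],
                "date": day["date"],
                "weekday": day["weekday"],
                "units_sold": product[day["d"]],
                # None when no price is recorded for that product and week.
                "sell_price": prices.get((product["item_id"], day["wm_yr_wk"])),
            })
    return rows


long_rows = to_long(sales_wide, calendar, prices)
for row in long_rows:
    print(row)
```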
RapidMiner displays loaded data in a clean, comprehensible format. Different operators appear above the dataset allowing manipulation of the data. The “Transform” function allows the user to remove columns, filter data, change the data type, and much more. “Cleanse” can remove duplicates, improve low-quality data, replace missing values, and perform normalization. These operators and more are located in RapidMiner’s TurboPrep, a user interface where data is always visible and step-by-step changes provide immediate results. It’s a convenient tool that eliminates the struggle of preparing clean, good-quality data for ML models.
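For readers curious about what such cleansing steps amount to in code, the sketch below removes duplicate rows, replaces missing values with the column mean, and applies min-max normalization. It is only an approximation written for this article: the row values and helper names are hypothetical, and nothing here reflects how TurboPrep is implemented internally.

```python
from statistics import mean

# Hypothetical rows; None marks a missing value.
rows = [
    {"item_id": "A", "sell_price": 8.0, "units": 2},
    {"item_id": "A", "sell_price": 8.0, "units": 2},   # exact duplicate
    {"item_id": "B", "sell_price": None, "units": 0},  # missing price
    {"item_id": "C", "sell_price": 4.0, "units": 6},
]


def remove_duplicates(rows):
    """Keep the first occurrence of each identical row, preserving order."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items(), key=lambda kv: kv[0]))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def fill_missing(rows, column):
    """Replace missing values in `column` with the mean of the observed ones."""
    fill = mean(r[column] for r in rows if r[column] is not None)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def min_max_normalize(rows, column):
    """Rescale `column` to the range [0, 1]."""
    values = [r[column] for r in rows]
    low, span = min(values), max(values) - min(values)
    return [{**r, column: (r[column] - low) / span if span else 0.0} for r in rows]


cleaned = min_max_normalize(
    fill_missing(remove_duplicates(rows), "sell_price"), "units"
)
for row in cleaned:
    print(row)
```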
After data preparation, it’s time to develop a machine learning model WITHOUT Python programming. RapidMiner allows users to build models manually (if certain requirements must be met for a model) or use the Auto Model extension that accelerates the process of constructing and testing models. Focusing on the Auto Model, the five major steps are described below [4].
1. Load Data
2. Select Task
   a. Auto Model can perform three tasks: make predictions, identify clusters, and identify outliers
   b. To make predictions, the desired data variable must be a column within the imported data
3. Prepare Target
   a. The data in the selected column might be applicable for classification. If not, classification is skipped.
4. Select Inputs
   a. Auto Model summarizes the importance of including each data variable within a model using a red, yellow, or green status bubble next to each variable. It is the user’s decision to include certain variables, depending on the variable’s effect towards the model’s prediction.
5. Model Types
   a. The user is presented with the available models that can be constructed based on the imported data. After deciding which models to build, each model’s results are compared.
In the image above, three models were built to predict the unit sales for two hundred “Hobbies” products at Walmart’s Store 1 in California on the last day of the year (d_365). Relative error measures how far each model’s predicted quantities are from the true values on d_365 [5]. The smaller the relative error, the more accurate the model’s predictions. Other values, such as root mean square error or absolute error, can also be calculated if desired.
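The three error measures mentioned here can be sketched in a few lines. The definitions below are common textbook forms and are assumptions of this example. In particular, relative error is taken as the average of |actual − predicted| / |actual|, and days with zero actual sales are skipped to avoid division by zero; RapidMiner’s exact handling may differ. The sample values are invented.

```python
from math import sqrt


def absolute_error(actual, predicted):
    """Average absolute deviation between predictions and true values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared deviation."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def relative_error(actual, predicted):
    """Average of |actual - predicted| / |actual|.

    Assumption of this sketch: entries with a true value of zero are skipped,
    because the ratio is undefined for them.
    """
    ratios = [abs(a - p) / abs(a) for a, p in zip(actual, predicted) if a != 0]
    return sum(ratios) / len(ratios)


# Hypothetical true and predicted unit sales for four products on one day.
actual = [2, 4, 5, 1]
predicted = [2, 3, 6, 1]

print("absolute error:", absolute_error(actual, predicted))
print("RMSE:", root_mean_squared_error(actual, predicted))
print("relative error:", relative_error(actual, predicted))
```

A relative error of zero can only occur when every prediction equals its true value exactly.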
From the image above, the Decision Tree model provided the best results because it had a relative error of zero. The Deep Learning and Generalized Linear Models had relative errors of 51% and 57%, respectively.
Auto Model displays the predictions from each model, and one can see that the Decision Tree predicted the exact unit sales for each product, whereas the Deep Learning model’s predictions deviated from the true values.
If the user wants to predict the unit sales for each product on a specific day, Auto Model creates a simulator for each model that makes a prediction depending on the weights for each variable listed on the left.
Conclusion
RapidMiner has an abundance of operations to offer compared to the traditional process of programming ML models. The visualizations of data and models create a friendly environment for the user to better decide what adjustments should be made to the data and construct a model based on their requirements. The three-month-long M5 competition gave first place to a team whose score was roughly 0.52 on the competition’s error metric, the weighted root mean squared scaled error (WRMSSE), where lower is better. On our subset, the best model created in RapidMiner resulted in 0% relative error, meaning each prediction was spot on.
It is not a fluke that the decision tree model outperformed the deep learning (neural network) model. On tabular data such as this, tree-based methods often outperform neural networks. Both methods split data features into various categories to optimize the information gain, but neural networks apply a probabilistic view whereas trees apply a deterministic one. Using automatic feature selection, tree-based models become simplifications of neural networks, resulting in better accuracy [7].
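To give a feel for the deterministic splitting that trees perform, the sketch below finds the single best threshold on one feature for a numeric target. It uses reduction in squared error, the regression counterpart of information gain, which is an assumption of this example and not necessarily the criterion RapidMiner’s Decision Tree applies. The price and sales values are invented.

```python
from statistics import mean


def sum_squared_error(values):
    """Squared deviation of the values from their own mean."""
    if not values:
        return 0.0
    center = mean(values)
    return sum((v - center) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, error) for the split of xs that best separates ys.

    Returns (None, error_without_split) if no split improves the error.
    """
    pairs = sorted(zip(xs, ys))
    best = (None, sum_squared_error(ys))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        error = sum_squared_error(left) + sum_squared_error(right)
        if error < best[1]:
            best = (threshold, error)
    return best


# Hypothetical data: cheap products sell many units, expensive ones sell few.
sell_price = [1, 2, 3, 10, 11, 12]
units_sold = [9, 10, 11, 2, 3, 1]

threshold, error = best_split(sell_price, units_sold)
print("split at sell_price <=", threshold, "with remaining squared error", error)
```

A full decision tree repeats this search recursively on each side of the chosen threshold.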
Even though not all of the data from the M5 competition was used, it is still impressive that RapidMiner can create an accurate model to predict products’ unit sales using a year’s worth of data. RapidMiner is a beneficial tool for beginner or professional model developers, but it is still crucial for users to fully grasp ML concepts. Without understanding the functions and common models used in ML, accuracy will be lost. RapidMiner is only an assistant in the complex world of AI and ML.
Resources: