Product Matching: A Comparative Analysis of Various Machine Learning Algorithms using Word2Vec and TF-IDF Embedding Techniques
Abiola A. David, MSc, MVP
Microsoft Fabric & Excel MVP [5X] | Senior Data Engineer & BI Developer | Microsoft Fabric, Azure, Power BI, Databricks, SQL, Excel, Snowflake, Google BigQuery | MSc, Big Data & BI | Fabric Engineer | C# Corner MVP
Abstract
This master's thesis presents a comparative analysis of various machine learning algorithms using Word2Vec and TF-IDF embedding techniques for product matching. The objectives are to compare the performance of the Word2Vec and TF-IDF embedding techniques when used with different machine learning algorithms for product matching; to investigate the impact of hyperparameter tuning on the performance of machine learning algorithms in product matching; to identify the combination of machine learning algorithm and embedding technique that achieves the highest accuracy in product matching; to evaluate the performance of machine learning algorithms in classifying matching and non-matching products for each embedding technique; and to determine the machine learning algorithm that demonstrates the most balanced performance for both matching and non-matching product classifications.
The master's project begins with a comprehensive literature review of the existing research in the fields of product matching, machine learning algorithms, and text embedding techniques. The review provides a solid foundation for understanding state-of-the-art approaches and sets the stage for the experimental analysis.
To conduct the comparative analysis, the Product Data Corpus dataset from the product matching binary classification challenge (Semantic Web Challenge ISWC 2020, ir-ischool-uos.github.io) was used for this master's project. Extensive Exploratory Data Analysis (EDA) was performed. The EDA provided insights into the distribution of the data, which aided the identification of anomalies and outliers and informed feature selection for training the models. In addition, detailed data visualizations were produced after the EDA, which enhanced the understanding of the dataset (Few, 2004), the identification of patterns and relationships (Tufte, 2001), and the detection of anomalies and outliers (Cleveland & McGill, 1984).
The dataset was preprocessed and transformed into feature vectors using the Word2Vec and TF-IDF embedding techniques. Subsequently, multiple machine learning algorithms, including logistic regression (Hosmer et al., 2013), k-nearest neighbors (KNN) (Cover & Hart, 1967), decision tree (Breiman et al., 1984), support vector machines (SVM) (Cortes & Vapnik, 1995), random forest (Breiman, 2001), naïve Bayes (Rish, 2001), gradient boosting (Friedman, 2001), XGBoost (Chen & Guestrin, 2016), and multi-layer perceptron (MLP) (Rumelhart et al., 1986), were applied to the transformed dataset.
The findings reveal that XGBoost, Random Forest, Gradient Boosting, and MLP classifiers exhibit strong performance in terms of accuracy, precision, recall, F1-score, and AUC-ROC. These classifiers, when combined with Word2Vec and TF-IDF embeddings, offer significant improvements in product matching accuracy.
The Word2Vec and TF-IDF embedding techniques prove effective in capturing semantic information and representing textual data for product matching. Both techniques demonstrate their potential to enhance decision-making processes in industries such as e-commerce, supply chain management, and data integration.
The practical implications of this research are twofold. First, the recommended classifiers and embedding techniques provide valuable insights for practitioners seeking to improve product matching accuracy. Second, the study highlights the importance of leveraging machine learning algorithms in decision-making processes, leading to improved customer experiences, optimized inventory management, and streamlined data integration.
While this research contributes to the understanding of machine learning algorithms and embedding techniques for product matching, further investigations are encouraged. Future research directions include exploring advanced embedding techniques like BERT and investigating ensemble methods for even higher accuracy. Additionally, real-world case studies and scalability assessments are needed to validate the findings in large-scale product matching applications.
Keywords: Product Matching, Machine Learning Algorithms, Logistic Regression, Decision Tree, Random Forest, SVM, Naïve Bayes, Multi-Layer Perceptron, Gradient Boosting, XGBoost, K-Nearest Neighbors (KNN) Algorithm, Accuracy, Precision, Recall, F1-Score, Feature Selection, Anomaly Detection, Word2Vec, TF-IDF, E-Commerce, Supply Chain Management, Retail, Healthcare, Manufacturing, Text Embedding.
Chapter One
1. Introduction
The e-commerce industry has grown exponentially in the past few years, with more retailers and brands setting up online stores to tap into the global market (Rong et al., 2017). This digital transformation has also led to an increase in the number of online marketplaces and comparison sites, where customers can compare and purchase products from various sources or businesses (Zhang & Zhang, 2018). However, it has also increased the number of products and sellers, making it difficult for customers to find the right products and for sellers to reach their target customers. To address these challenges, product matching has become a crucial task for effective decision-making in the supply chain management and retail industries.
Product matching refers to the process of matching similar or identical products across various sources, such as different e-commerce websites, based on certain attributes or features. It helps retailers and brands to identify the products that are in demand, improve their inventory management, and increase sales. It also helps customers to find the products they are looking for, compare prices, and make informed purchase decisions. However, product matching is a complex and challenging task, as products can have different names, descriptions, and attributes, and can be listed in various categories and subcategories.
Since the turn of the century, the use of machine learning algorithms for product matching has gained popularity due to their ability to learn from data and improve accuracy (Rong et al., 2017). Product matching approaches can be classified into rule-based, machine learning-based, and hybrid approaches. Rule-based approaches involve defining a set of rules to match products based on certain criteria, such as brand, color, and price. However, these approaches can be limited in their ability to match products with varying attributes and can result in low accuracy.
Machine learning-based approaches, on the other hand, use machine learning algorithms to learn from data and identify patterns and similarities between products offered by businesses within the same sector. A classic example is matching products offered on two giant e-commerce websites, Amazon and Walmart. These approaches can be more accurate than rule-based approaches, but they require substantial amounts of high-quality data to train the algorithms (Rong et al., 2017). Hybrid approaches combine both rule-based and machine learning-based approaches to improve accuracy and scalability.
Despite the growing interest in using machine learning algorithms for product matching, there is no clear consensus on the most effective algorithm for this task. Each algorithm has its strengths and weaknesses, and their suitability depends on the type and volume of data and the specific requirements of the task. Therefore, there is a need for a comparative analysis of different machine learning algorithms for product matching to identify the most effective algorithm for different scenarios.
In this context, this master's thesis is implemented to conduct a comprehensive comparative analysis of various machine learning algorithms for product matching, specifically leveraging on the utilization of Word2Vec and TF-IDF embedding techniques (Mikolov et al., 2013; Hu et al., 2019). These embedding techniques have gained prominence in natural language processing tasks and have shown promising results in capturing semantic relationships and context within textual data.
The proposed framework for product matching in this thesis incorporates several machine learning algorithms, including Logistic Regression, Random Forest, Decision Tree, Support Vector Machine, Gradient Boosting, Naïve Bayes, KNN, XGBoost, and Multi-Layer Perceptron, each employing the Word2Vec and TF-IDF embedding techniques. By comparing the performance of these algorithms, this thesis aims to identify the most effective algorithm for product matching in the e-commerce sector.
The performance evaluation of the machine learning algorithms will be based on key performance metrics, including accuracy, precision, recall, and F1-score (Hastie et al., 2009). These metrics are expected to provide a comprehensive assessment of the algorithms’ abilities to accurately match similar or identical products across diverse sources when deployed, considering the strengths and weaknesses of Word2Vec and TF-IDF embedding techniques.
Furthermore, this thesis will shed light on the importance of effective product matching specifically in the e-commerce industry, emphasizing its impact on supply chain management, inventory management, pricing strategies, customer satisfaction and profitability (Breiman, 2001). It will also address the challenges and limitations associated with product matching, including data accessibility, data quality, scalability, complexity, limited information, and the dynamic nature of product data.
The rest of the thesis is organized as follows: Chapter 2 provides a comprehensive literature review, while Chapter 3 delves into the research methodology employed for this thesis. In Chapter 4, the results are presented and analyzed, while Chapter 5 focuses on the discussion of findings, recommendations, and conclusion.
1.1 Background to the Study
The rise of e-commerce and online marketplaces has brought about a need for product matching, which involves identifying identical or related products from different players in the same industry or field. Product matching is critical to the success of businesses operating in the e-commerce, retail, healthcare, supply chain management, and manufacturing industries. The ability to match products effectively can lead to increased sales revenue, improved customer satisfaction, and streamlined operations.
The traditional approach to product matching involves manual matching, which is laborious, time-consuming, and prone to errors. With the explosion of data in recent years, manual matching is no longer feasible, and businesses are turning to machine learning algorithms for efficient and accurate product matching when combined with embedding techniques such as Word2Vec and TF-IDF.
1.2 Motivation
The need for effective product matching has driven significant research in the field of machine learning. Researchers have proposed and compared different machine learning algorithms for product matching, but the optimal algorithm for this task is still an open question. The selection of the appropriate machine learning algorithm for product matching is critical for businesses to make effective decisions and gain a competitive advantage.
Furthermore, the performance of machine learning algorithms varies with different datasets and features, making it challenging to determine the best algorithm for a given task. Consequently, there is a need for comparative studies that evaluate the performance of different machine learning algorithms for product matching, leveraging embedding techniques on a real-world business dataset.
This thesis aims to address this gap by performing a comparative analysis of various machine learning algorithms for product matching in the e-commerce industry. The study proposes a framework that incorporates several machine learning algorithms, including Logistic Regression, Random Forest, Decision Tree, Support Vector Machine, Gradient Boosting, XGBoost, Naïve Bayes, KNN, and Multi-Layer Perceptron.
The findings of this study are expected to provide valuable insights into the suitability of different machine learning algorithms (with and without hyperparameter tuning), when combined with the selected embedding techniques, for product matching tasks and to help businesses make informed decisions. The study will contribute to the existing body of knowledge on machine learning algorithms for product matching and provide recommendations for future research in this field.
1.3 Research Objectives
The research objectives of this dissertation are, among others, to compare the performance of the Word2Vec and TF-IDF embedding techniques when used with different machine learning algorithms for product matching; to investigate the impact of hyperparameter tuning on the performance of machine learning algorithms in product matching; to identify the combination of machine learning algorithm and embedding technique that achieves the highest accuracy in product matching; to evaluate the performance of machine learning algorithms in classifying matching and non-matching products for each embedding technique; and to determine the machine learning algorithm that demonstrates the most balanced performance for both matching and non-matching product classifications.
1.4 Research Questions
Succinctly enumerated, the following are five broad-based research questions addressed in this master’s thesis:
1.5 Scope of the Study
This study will be restricted specifically to the following supervised machine learning algorithms, which are suitable for binary classification tasks: logistic regression, decision tree, random forest, support vector machines, naïve Bayes, KNN, gradient boosting, extreme gradient boosting (XGBoost), and multi-layer perceptrons (MLP). Other models, such as AdaBoost, Convolutional Neural Networks, Recurrent Neural Networks, and LightGBM, will not be considered in this master's project.
The evaluation of the algorithms will be based on a set of predefined metrics, namely accuracy, precision, recall, the F1 score, and the Area Under the Receiver Operating Characteristic Curve (AUC-ROC). This project will not cover other evaluation metrics such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R-squared (R2).
The classification report will be used to provide a detailed and comprehensive summary of each model's performance across the evaluation metrics mentioned above. In addition, the confusion matrix will be used to provide a tabular representation of each model's predictions by showing the counts of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) predictions for each class. Heatmaps and bar plots will be employed to visualize the confusion matrices and F1 scores. Finally, to optimize each model for performance improvement, model selection, avoidance of overfitting or underfitting, and effective comparative analysis, hyperparameter tuning will be deployed for this project.
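As a brief illustration of this evaluation tooling, the following sketch (a minimal, assumed setup using scikit-learn and seaborn, with placeholder labels rather than the thesis dataset) prints a classification report and plots a confusion-matrix heatmap:

```python
# Minimal sketch of the evaluation tooling described above; the label arrays
# below are illustrative placeholders, not results from the thesis dataset.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report, confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # ground-truth match / non-match labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # a model's predictions for the same pairs

# Per-class precision, recall, F1-score and support
print(classification_report(y_true, y_pred, target_names=["non-match", "match"]))

# Confusion matrix: rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["non-match", "match"],
            yticklabels=["non-match", "match"])
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.show()
```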
1.6 Significance of the Study
The significance of this dissertation lies in its contribution to the field of product matching and effective decision making. By comparing and evaluating the performance of different machine learning algorithms combined with embedding techniques, this project will provide valuable insights into the strengths and weaknesses of each model. This knowledge can be useful for businesses and organizations seeking to improve their product matching processes, increase efficiency, and reduce errors.
Additionally, the study has practical implications for various industries, including e-commerce, retail, logistics, manufacturing, healthcare, and agriculture, to name a few. The findings can be used to optimize product recommendations, improve supply chain management, enhance the overall customer experience, and ensure business profitability.
Furthermore, this thesis serves as a bedrock for future research in the field of product matching and machine learning. The results and insights gained from this study are expected to inspire new approaches and methodologies for product matching, leading to further advancements in the field.
Chapter Two
2. Literature Review
2.1 Introduction to Product Matching
Product matching is a critical task in decision making, particularly in industries such as e-commerce, supply chain management, and retail. It involves the identification of identical or similar products from different sources. With the proliferation of online marketplaces and the need for automated operations, product matching has gained significant attention in computer science and data analytics research.
Accurately and efficiently identifying matching products poses challenges due to variations in product names, descriptions, and attributes across different sources. For instance, products may have different names or descriptions on various websites, or they may have missing or conflicting attributes. Such variability makes it difficult to achieve consistent and accurate product matching, leading to errors and inefficiencies in decision making.
To overcome these challenges, researchers have developed a myriad of techniques for product matching, including both rule-based approaches and machine learning algorithms. Rule-based approaches rely on predefined rules and heuristics to match products. On the other hand, machine learning algorithms leverage data to identify matching patterns and make predictions. In recent years, advanced embedding techniques such as Word2Vec and TF-IDF have emerged as powerful tools in enhancing the performance of product matching algorithms.
Word2Vec and TF-IDF embedding techniques offer ways to represent and encode textual information in a meaningful manner. Word2Vec represents words as dense vectors capturing semantic relationships, while TF-IDF measures the importance of terms in a document. BERT, a state-of-the-art transformer-based model, provides contextualized embeddings that capture the semantic meaning of words based on their surrounding context.
In this chapter, a comprehensive review of the extant literature on product matching, with a focus on the incorporation of the Word2Vec and TF-IDF embedding techniques, will be provided. This chapter will discuss the challenges associated with product matching and explore the techniques and approaches proposed in the field, including those by Wang and Li (2016), Zhang and Yang (2019), and Chen et al. (2017). Furthermore, this thesis examines how these embedding techniques have been applied to enhance the accuracy and efficiency of product matching algorithms. By reviewing the strengths and weaknesses of these techniques, this project aims to identify gaps and limitations in the existing literature and provide insights for future research directions.
2.2 Machine Learning Algorithms for Product Matching
Product matching is a complex task that requires a sophisticated approach to achieving accurate results. One promising approach is the use of machine learning algorithms, which are capable of handling large volumes of data and identifying patterns that are difficult for humans to detect. In this section, we will discuss the different machine learning algorithms used for product matching and their strengths and limitations.
2.3 Supervised Learning Algorithms
Supervised machine learning is a subfield of machine learning that deals with training algorithms to make predictions or decisions based on labeled training data. In supervised learning, the algorithm learns from examples where the input data is paired with the corresponding output or target labels. The goal is to generalize from the training data to make accurate predictions on unseen data for effective decision making.
Comprehensive Overview of Supervised Machine Learning
Several supervised learning algorithms can be used for product matching; they are considered below in the order of their usage and execution in the Jupyter Notebook source code of this master's thesis:
2.3.1 Logistic Regression Algorithm: Strengths and Limitations
Logistic regression is a popular supervised learning algorithm used for classification problems. It is a statistical method that is used to model the relationship between a dependent variable (categorical) and one or more independent variables (continuous or categorical). The goal of logistic regression is to estimate the probability of a certain event occurring, based on input data.
In terms of strengths, Logistic Regression is a simple and easy-to-understand algorithm that can be readily implemented and interpreted. It is also fast and can be trained efficiently on large datasets. In addition, the algorithm is robust and can handle noise and outliers in the data, and it provides clear insights into the relationships between the dependent and independent variables.
In terms of weaknesses, logistic regression assumes a linear relationship between the independent variables and the log-odds of the dependent variable, which does not hold for every dataset. In addition, the algorithm can overfit the supplied data if the number of features is too large compared to the number of samples. The algorithm does not perform feature selection, and therefore feature engineering is required to select relevant features for the model to return accurate output.
2.3.2 Random Forest Algorithm: Strengths and Limitations
An ensemble learning algorithm, Random Forest uses a combination of decision trees to improve the accuracy of classification or regression tasks. It works by constructing multiple decision trees and then combining the results to obtain the final output. Each tree is built using a random subset of data and features. This randomness helps to reduce overfitting and increase the model's generalization ability.
Among the strengths of the Random Forest algorithm are that it can handle both categorical and continuous data, making it versatile, and that it is highly accurate, often outperforming other algorithms in terms of prediction accuracy. In addition, the algorithm can handle missing data while maintaining accuracy, can handle large datasets with many variables and observations, and provides an estimate of feature importance, which can be useful in feature selection.
On the downside, Random Forest can be slow and computationally expensive, especially for large datasets or when using many trees. Another limitation is that it can be difficult to interpret the results because of the generation of multiple trees with different decision paths. It may overfit the data if the number of trees is too high or the data is noisy, and it may not work well with imbalanced datasets, where one class has far fewer samples than the other.
2.3.3 Support Vector Machines (SVM): Strengths and Limitations
Support Vector Machines (SVM) is a powerful machine learning algorithm used for classification and regression analysis. It is a supervised learning algorithm that works by separating data into two or more classes based on the available training data. The algorithm constructs a hyperplane in high-dimensional space that separates the data into different classes.
In terms of strengths, SVM can handle large feature spaces and works well with high-dimensional datasets. It is also effective in cases where the number of features is greater than the number of samples. In addition, SVM works well in cases where the data is not linearly separable, as it can use kernel functions to transform the data to a higher dimension, and it often returns reliable accuracy and generalization performance when the number of training samples is small.
On the flip side, SVM is sensitive to the choice of kernel function, and selecting the right kernel can be challenging. It can also be slow to train and requires significant computational resources, especially for large datasets. It has the potential to overfit the data if the regularization parameter is incorrectly specified. SVM cannot handle missing data, and data pre-processing is often required to handle missing values.
2.3.4 Decision Tree Algorithm: Strengths and Limitations
A popular machine learning algorithm, Decision Tree is widely used for solving classification and regression problems. It is a tree-based model that divides the data into smaller subsets based on a set of rules and then assigns a class label or a value to each subset.
The Decision Trees algorithm works by recursively dividing the data into subsets based on the most significant features. The algorithm selects the feature that provides the most information gain, which is calculated using entropy or Gini impurity. Once a feature is selected, the data is divided into subsets based on the possible values of the selected feature. This process is repeated until a stopping criterion is met, such as reaching a maximum depth or having a subset with only one class label.
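To make the split criteria concrete, the short sketch below (illustrative labels only) computes the entropy and Gini impurity of the class labels reaching a hypothetical node:

```python
# Illustrative computation of the two split criteria mentioned above.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

node_labels = np.array([1, 1, 1, 0, 0, 0, 0, 1])   # labels at a hypothetical node
print(f"entropy = {entropy(node_labels):.3f}")      # 1.000 for a 50/50 split
print(f"gini    = {gini(node_labels):.3f}")         # 0.500 for a 50/50 split
```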
One of the strengths of the Decision Trees algorithm is that it is simple to understand and interpret. The decision tree structure is easy to visualize, and the rules used to make the decisions are easily explainable. Decision Trees can also handle both categorical and continuous data, and they can handle missing data.
Another strength of the Decision Trees algorithm is that it can handle both classification and regression problems. The algorithm can be adapted to handle continuous outputs by using a regression tree, where the output value is the average of the values in the final subset.
One of the main limitations of the algorithm is that it is prone to overfitting. If the tree is allowed to grow too deep or if the stopping criterion is not well-tuned, the model may fit the training data too closely and fail to generalize well to new data. This problem can be addressed by using techniques such as pruning or setting a minimum number of samples required to split a node.
Another limitation of the Decision Trees algorithm is that it is sensitive to small variations in the data. A small change in the training data can lead to a completely different tree structure, which may affect the accuracy of the model.
2.3.5 KNN Algorithm: Strengths and Limitations
A versatile and intuitive algorithm, K-Nearest Neighbors (KNN) is a popular machine learning algorithm that falls under the category of supervised learning. It is a non-parametric algorithm, meaning it does not make any assumptions about the underlying data distribution. The algorithm is based on the concept that similar instances are located close to each other in the feature space. It classifies new instances by finding the k nearest neighbors in the training dataset and assigning the majority class label among them to the new instance. The value of k, a user-defined hyperparameter, determines the number of neighbors considered for classification.
One of the main strengths of the KNN algorithm is that it is easy to understand and implement, making it a popular choice for beginners in machine learning. Additionally, KNN can be applied to both classification and regression problems. In classification, the algorithm predicts the class label, while in regression, it estimates the numerical value associated with the new instance.
Another advantage of the KNN algorithm is that it does not make any assumptions about the underlying data distribution which makes it a versatile algorithm that can work well with various types of datasets. It can handle complex decision boundaries and can adapt to different data patterns.
In addition, by considering the majority class among the k nearest neighbors, the algorithm can assign the accurate class label to the new instance. Moreover, KNN can be easily updated with new training data without the need to retrain the entire model, making it an efficient algorithm in scenarios where data is constantly changing.
On the downside, since the KNN algorithm requires calculating the distances between the new instance and all instances in the training dataset, it can be time-consuming, especially for large datasets, which can lead to performance issues depending on the number of training instances and the dimensionality of the feature space. Furthermore, a small value of k can result in overfitting, where the model becomes too specific to the training data and performs poorly on unseen data. On the other hand, a large value of k may lead to underfitting, where the model oversimplifies the data and fails to capture the underlying patterns.
In addition, KNN does not provide explicit explanations or feature importance rankings, making it difficult to interpret the reasoning behind its predictions. The lack of interpretability may be a concern in certain domains where transparency and elucidation are crucial.
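One practical way to navigate the overfitting/underfitting trade-off described above is to screen several values of k with cross-validation; the sketch below does this on synthetic data (illustrative only, not the thesis setup):

```python
# Screening candidate values of k with 5-fold cross-validation (synthetic data).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

for k in (1, 5, 15, 51):
    knn = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn, X, y, cv=5, scoring="f1").mean()
    print(f"k={k:>3}  mean CV F1 = {score:.3f}")
```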
2.3.6 XGBoost: Strengths and Limitations
XGBoost, also known as eXtreme Gradient Boosting, is a highly effective and extensively utilized machine learning algorithm belonging to the gradient boosting family. It has garnered immense recognition and adoption in diverse domains, ranging from data science competitions to real-world applications. By iteratively combining weak predictive models, such as decision trees, XGBoost constructs a robust model that yields precise predictions. Its remarkable performance sets it apart and establishes it as a favored choice among data scientists and practitioners.
Specifically, XGBoost is known for its outstanding accuracy and is often considered a state-of-the-art algorithm for many machine learning tasks. It excels in both regression and classification problems, consistently delivering top-performing models. In addition, the algorithm employs advanced regularization techniques that help prevent overfitting and improve generalization. Regularization techniques, such as L1 and L2 regularization, can be applied to control the complexity of the model and avoid excessive reliance on specific features.
Another benefit associated with the algorithm is that it provides a wide range of hyperparameters that allow users to customize the model according to their specific requirements. It offers control over the learning rate, tree depth, regularization parameters, and more, enabling fine-tuning to optimize model performance. XGBoost provides valuable insights into feature importance, allowing users to understand which features contribute the most to the model's predictions. This information aids in feature selection, dimensionality reduction, and overall model interpretability.
Conversely, XGBoost can be computationally expensive, especially for large datasets or when using a high number of trees and complex models. Training and tuning XGBoost models may require significant computational resources and time. In addition, while the flexibility of XGBoost is a strength, it also poses a challenge in terms of parameter tuning. Selecting the optimal combination of hyperparameters can be a time-consuming process, requiring expertise and careful experimentation. The boosted ensemble nature of XGBoost makes it less interpretable compared to simpler models like linear regression. Understanding the inner workings of the model, feature interactions, and the impact of individual features on predictions can be more challenging.
Another limitation of the algorithm is that it tends to favour classes with larger sample sizes in imbalanced datasets. It may require additional techniques, such as class weighting or oversampling, to handle class imbalance effectively, and it can consume significant memory, particularly when dealing with large datasets or complex models. Memory optimization techniques or distributed computing frameworks may be required to handle memory limitations.
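The sketch below is a hedged illustration of these points: an XGBoost classifier configured with the learning-rate, depth, L1/L2 regularization, and class-weighting options discussed above, on synthetic imbalanced data (all values are illustrative, not the thesis's tuned settings):

```python
# Illustrative XGBoost configuration on synthetic, imbalanced data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=30,
                           weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = XGBClassifier(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=4,
    reg_alpha=0.1,          # L1 regularization
    reg_lambda=1.0,         # L2 regularization
    scale_pos_weight=4.0,   # one way to compensate for class imbalance
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
print("top feature importances:", model.feature_importances_[:5])
```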
Efficiency and Performance Improvement Techniques of XGBoost Algorithm
2.3.7 Naïve Bayes Algorithm: Strengths and Limitations
Naïve Bayes is a popular classification algorithm used in machine learning for various applications, including product matching. The algorithm is based on Bayes' theorem, which states that the probability of a hypothesis is updated by incorporating new evidence. In the context of classification, the algorithm calculates the probability of a given input belonging to a particular class based on the occurrence of features in the input.
The Naïve Bayes algorithm assumes that the features are independent, which is a simplifying assumption that enables efficient computation. The algorithm also assumes that the distribution of the features is normal. In practice, the algorithm performs well even when these assumptions are not strictly met.
On the positive side, the algorithm is a relatively simple algorithm to understand and implement because it requires only a small amount of training data, making it particularly useful for applications with limited data. In addition, the algorithm is computationally efficient and can handle many features, making it suitable for large datasets. It can perform incredibly well with categorical data because it can handle large numbers of discrete features. The algorithm can also handle irrelevant features by simply ignoring them, which can improve the efficiency of the algorithm output.
On the negative side, the algorithm assumes that features in any data are independent and normally distributed, which may not always be true in practice. These assumptions can lead to inaccurate predictions if the data violates the assumptions. In addition, it has limited expressiveness and may not capture complex relationships between features which can lead to underfitting if the model is too simple or overfitting if the model is too complex. The algorithm is, to a large extent, sensitive to interactions between features, which can lead to inaccuracies in predictions.
2.3.8 Multi-Layer Perceptron: Strengths and Limitations
Multi-Layer Perceptron (MLP) is a popular artificial neural network algorithm that falls under the category of supervised learning. It is widely used for various tasks, including classification, regression, and pattern recognition. The MLP consists of multiple layers of interconnected nodes, called neurons. The neurons are organized in a series of input, hidden, and output layers. Each neuron receives input signals, applies a non-linear activation function to the weighted sum of those inputs, and passes the output to the next layer. The connections between neurons have associated weights that are adjusted during the training process.
One of the main strengths of MLP is its ability to model complex non-linear relationships between inputs and outputs. The presence of multiple hidden layers allows MLP to learn and represent intricate patterns in the data. This flexibility makes it a powerful algorithm for tasks where the decision boundaries are complex and highly non-linear.
MLP is also known for its universal approximation capability which implies that a properly trained MLP with enough hidden neurons can approximate any continuous function to any desired degree of accuracy. This property makes MLP a versatile algorithm that can handle a wide range of problem domains.
Another benefit of the algorithm is that the hidden layers of an MLP learn to extract relevant features from the input data as part of the training process. This eliminates the need for manual feature engineering, which can be time-consuming and error-prone. In addition, MLP can handle raw input data, reducing the burden on the user to pre-process and extract meaningful features. Known for its capability of handling large-scale datasets, the algorithm can leverage parallel computing and distributed processing frameworks to efficiently process vast amounts of data. This scalability is crucial in modern machine learning applications where datasets are often large and dynamic.
On the downside, due to its sensitivity to the choice of hyperparameters, selecting the appropriate number of hidden layers, the number of neurons in each layer, and the learning rate can be challenging. Poorly chosen hyperparameters can lead to overfitting or underfitting, resulting in degraded performance on unseen data.
Another weakness of the MLP algorithm is that its optimization process involves finding the set of weights that minimizes a loss function; due to the high dimensionality of the weight space, MLP may converge to suboptimal solutions. However, various techniques, such as different weight initialization strategies and regularization methods, can be employed to mitigate this issue. In addition, training an MLP can be computationally expensive, especially for large-scale datasets and complex architectures, which can limit the scalability of MLP in certain scenarios or use cases. Additionally, the complex nature of the model makes it difficult to understand the reasoning behind its predictions. This lack of interpretability may be a concern in domains where elucidation is essential, such as the healthcare or finance sectors.
2.3.9 Gradient Boosting Algorithm: Strengths and Limitations
Gradient Boosting is a popular machine learning algorithm that can be used for both regression and classification problems. The algorithm involves building a series of weak models, such as decision trees, and then combining them to create a stronger model. The models are built iteratively, with each new model attempting to correct the errors of the previous models. The final model is the combination of all the individual models.
One of the strengths of Gradient Boosting is that it has high predictive accuracy and is considered one of the most powerful algorithms for predictive modelling. It can also handle complex data, including data with a mix of categorical and numerical features. Furthermore, Gradient Boosting algorithm provides information on feature importance, allowing users to identify which features are most important in making predictions.
On the downside, the algorithm can be slow to train, particularly when working with large datasets, because it requires building multiple individual models. In addition, the algorithm can be prone to overfitting, particularly when the data is noisy or when there are too many trees in the model. It is also considered a black-box model, which implies that it can be difficult to interpret how the algorithm is making its predictions.
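To tie this section together, the condensed sketch below trains the reviewed scikit-learn classifiers on the same synthetic, illustrative feature matrix and compares their F1 scores; XGBoost would be added analogously via the xgboost package:

```python
# Condensed, illustrative comparison of the classifiers reviewed in this section.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "MLP": MLPClassifier(max_iter=500, random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name:<20} F1 = {f1_score(y_test, model.predict(X_test)):.3f}")
```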
2.4 Evaluation Metrics for Product Matching
Evaluation metrics for product matching are specifically used to measure the effectiveness and accuracy of product matching algorithms. These metrics often help to determine the performance of the algorithms in identifying and matching products across various sources such as e-commerce websites, social media platforms, and other online marketplaces.
2.4.1 Accuracy Evaluation Metric: Strengths and Weaknesses
Accuracy is one of the most commonly used evaluation metrics for product matching. It measures the proportion of correct matches out of the total number of matches made by the model. One of the strengths of accuracy is that it is easy to understand and interpret; it also provides a clear indication of how well the model is performing overall. On the flip side, the metric can be misleading when the dataset is imbalanced. For example, if most of the products in the dataset do not match each other, a model that predicts that no products match will have a high accuracy even though it is not useful. The metric also does not distinguish between false positives and false negatives, which can be important in certain applications. For example, in a product matching system, a false positive (when the model matches two products that are not actually the same) may be more costly than a false negative (when the model fails to match two products that are the same).
2.4.2 Accuracy Score Evaluation Metric Formula and Mathematical Notation
The formula for calculating the accuracy score is as follows:
Accuracy = (Number of Correct Predictions) / (Total Number of Predictions)
In mathematical notation, the accuracy score (ACC) can be represented as:
ACC = (TP + TN) / (TP + TN + FP + FN)
Where:
TP (True Positive) represents the number of instances that are correctly predicted as positive.
TN (True Negative) represents the number of instances that are correctly predicted as negative.
FP (False Positive) represents the number of instances that are incorrectly predicted as positive (a type I error).
FN (False Negative) represents the number of instances that are incorrectly predicted as negative (a type II error).
To calculate the accuracy score, the number of true positives and true negatives are summed up and then divided by the sum of all four values (true positives, true negatives, false positives, and false negatives).
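A tiny worked example with illustrative counts makes the formula concrete:

```python
# Accuracy from illustrative confusion counts.
TP, TN, FP, FN = 80, 90, 10, 20
accuracy = (TP + TN) / (TP + TN + FP + FN)
print(accuracy)   # 170 / 200 = 0.85
```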
2.4.3 Precision Evaluation Metric: Strengths and Weaknesses
Precision is another evaluation metric used in product matching that measures the proportion of correctly matched products among all the products that were identified as a match by a matching algorithm. In other words, precision is the ratio of true positive matches to the total number of matches proposed by the algorithm.
Precision is a useful metric for product matching, especially when the cost of a false positive (i.e., a product that is incorrectly matched) is high. For example, in e-commerce, a false positive match could lead to a customer purchasing the wrong product, resulting in negative reviews, returns, and loss of revenue. Another benefit of the metric is that it provides a measure of how accurate a matching algorithm is in identifying true matches, regardless of how many actual matches exist in the dataset. This makes it useful for evaluating the effectiveness of the algorithm across different datasets and matching tasks.
On the other hand, the metric can be misleading if the dataset is imbalanced, with many negative (non-matching) examples. In such cases, a high precision score may be achieved simply by classifying almost all products as non-matches, without considering the true matches. This can result in a low recall score and poor performance overall. Additionally, the precision metric does not consider false negatives (i.e., products that were not identified as matches by the algorithm but should have been), which can also have negative consequences in certain applications.
2.4.4 Precision Score Evaluation Metric Formula and Mathematical Notation
The formula for calculating precision is as follows:
Precision = TP / (TP + FP)
In mathematical notation, the precision score (P) can be represented as:
P = TP / (TP + FP)
Where:
TP (True Positive) represents the number of instances that are correctly predicted as positive.
FP (False Positive) represents the number of instances that are incorrectly predicted as positive (a type I error).
To calculate the precision score, the number of true positive predictions is divided by the sum of true positive and false positive predictions.
2.4.5 Recall Evaluation Metric: Strengths and Weaknesses
Recall is another popular evaluation metric used to assess the performance of product matching algorithms. It is defined as the ratio of correctly matched products to the total number of products that should have been matched. In other words, recall measures the ability of the algorithm to correctly identify all relevant products.
One of the key strengths of the metric is that it is useful when the cost of false negatives is high, such as in the case of product matching. It emphasizes the importance of correctly identifying all relevant products and minimizing the risk of missing a match. In addition, it provides insight into the completeness of the product matching process; a low recall score indicates that the algorithm is missing relevant matches, which can be used to identify areas for improvement. The metric is also simple and easy to understand, so it can be readily communicated to stakeholders and decision makers.
On the downside, the metric does not consider the number of false positives generated by the algorithm, which can be a problem when the cost of false positives is high. In addition, the metric only reflects the number of correctly matched products and does not provide information on overall accuracy or precision. The metric can also be skewed by class imbalance, where the number of positive examples is much smaller than the number of negative examples; in such cases, a high recall score may not necessarily indicate good performance.
2.4.6 Recall Score Evaluation Metric Formula and Mathematical Notation
The formula for calculating recall is as follows:
Recall = TP / (TP + FN)
In mathematical notation, the recall score (R) can be represented as:
R = TP / (TP + FN)
Where:
TP (True Positive) represents the number of instances that are correctly predicted as positive.
FN (False Negative) represents the number of instances that are incorrectly predicted as negative (a type II error).
To calculate the recall score, the number of true positive predictions is divided by the sum of true positive and false negative predictions.
2.4.7 F1 Evaluation Metric: Strengths and Weaknesses
The F1 score is a commonly used evaluation metric for product matching that combines both precision and recall. It provides a single score that balances both measures, making it useful when both precision and recall are important.
The F1 score is calculated as the harmonic mean of precision and recall:
F1 = 2 × (precision × recall) / (precision + recall)
The F1 score ranges from 0 to 1, where 1 indicates perfect precision and recall.
One of the strengths of the F1 metric is that it provides a single score that balances both the precision and recall metrics. In addition, the metric is a commonly used evaluation metric for product matching and is easy to compute.
On the flip side, F1 metric does not consider the true negatives, which can be a limitation in some cases. In addition, it assumes that precision and recall are equally important, which may not always be the case. In some scenarios, precision may be more important than recall, or vice versa.
2.4.8 F1 Score Evaluation Metric Formula and Mathematical Notation
The F1 score is calculated using the following formula:
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
In mathematical notation, the F1 score (F1) can be represented as:
F1 = 2 × (P × R) / (P + R)
Where: P represents precision, which is the ratio of true positive predictions to the sum of true positive and false positive predictions, R represents recall, which is the ratio of true positive predictions to the sum of true positive and false negative predictions.
To calculate the F1 score, the harmonic mean of precision and recall is computed, giving equal importance to both metrics. The harmonic mean places more weight on lower values, making the F1 score more sensitive to imbalances between precision and recall.
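A short worked example, reusing the illustrative counts from the accuracy example above (TP = 80, FP = 10, FN = 20), ties the precision, recall, and F1 formulas together:

```python
# Precision, recall, and F1 from illustrative confusion counts.
TP, FP, FN = 80, 10, 20

precision = TP / (TP + FP)                             # 80 / 90  ≈ 0.889
recall = TP / (TP + FN)                                # 80 / 100 = 0.800
f1 = 2 * (precision * recall) / (precision + recall)   # ≈ 0.842

print(f"precision={precision:.3f}  recall={recall:.3f}  f1={f1:.3f}")
```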
2.4.9 AUC-ROC Evaluation Metric: Strengths and Limitations
A popular evaluation metric, the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) is used to assess the performance of binary classification models. It provides a comprehensive measure of the model's ability to discriminate between positive and negative instances across various classification thresholds.
The ROC curve is a graphical representation of the true positive rate (TPR) against the false positive rate (FPR) as the classification threshold varies. The AUC-ROC is the area under this curve and ranges from 0 to 1. A higher AUC-ROC score indicates better discriminatory power and performance of the model.
One of the strengths of the evaluation metric is its ability to handle imbalanced datasets, where one class significantly outweighs the other. It considers the entire range of thresholds and is less affected by the imbalance in class distribution. In addition, the metric evaluates model performance across all possible classification thresholds, providing a comprehensive overview of a model's ability to distinguish between positive and negative instances; this contrasts with single-point metrics like accuracy, which depend largely on a specific threshold. The AUC-ROC is also insensitive to the actual probability values predicted by the model. It only considers the rank order of instances, making it robust to calibration issues and different probability scales across different models.
Furthermore, because the metric considers only the rank ordering of instances, it can provide a fair comparison of models even when they have different calibration characteristics. Finally, among its strengths, the AUC-ROC captures the overall performance of a binary classifier across different operating points, considering both the true positive rate (sensitivity) and the false positive rate (1 - specificity) simultaneously and providing a balanced assessment of the model's performance. This makes it useful for comparing models and selecting the best-performing one in terms of discrimination ability.
On the contrary, while the AUC-ROC is robust to class imbalance, it can provide an overly optimistic assessment of model performance when the class distribution is highly imbalanced; a high AUC-ROC score may not necessarily indicate good performance on the minority class, as the metric focuses on overall discrimination ability. In such cases, it is crucial to consider additional evaluation metrics that specifically address class imbalance, such as precision, recall, or the F1 score. Regarding threshold optimization, the metric does not provide guidance on selecting an optimal threshold for classification: it assesses model performance across all possible thresholds but does not indicate the most suitable threshold for a specific task or application. In addition, the metric treats false positives and false negatives equally, assuming an equal cost for both types of errors. However, in real-world scenarios, the cost associated with misclassifying positive and negative instances may differ. For example, in medical diagnosis, the cost of a false negative (missing a positive case) may be much higher than that of a false positive. In such cases, the AUC-ROC may not provide a complete picture of the model's performance, and customized evaluation metrics that account for the specific costs should be utilized.
2.4.10 AUC-ROC Evaluation Metric Formula and Mathematical Notation
The mathematical notation and formula for the AUC-ROC can be represented as follows:
Denote the predicted probability or score for the positive class of each instance as P_pos, with the corresponding negative-class probability P_neg = 1 - P_pos.
Step 1: Compute the True Positive Rate (TPR) and False Positive Rate (FPR) at different classification thresholds.
Start with a threshold T, ranging from 0 to 1.
Calculate the number of true positive predictions (TP) as the count of instances where P_pos ≥ T and the true class label is positive.
Calculate the number of false positive predictions (FP) as the count of instances where P_pos ≥ T and the true class label is negative.
Calculate the number of true negative predictions (TN) as the count of instances where P_pos < T and the true class label is negative.
Calculate the number of false negative predictions (FN) as the count of instances where P_pos < T and the true class label is positive.
Compute the TPR as TP / (TP + FN).
Compute the FPR as FP / (FP + TN).
Step 2: Plot the (FPR, TPR) pairs on a graph to create the ROC curve.
Step 3: Calculate the AUC-ROC by integrating the area under the ROC curve.
The mathematical notation for the AUC-ROC can be represented as:
AUC-ROC = ∫ TPR d(FPR)
Where:
TPR denotes the True Positive Rate (Sensitivity), which is the ratio of true positive predictions to the sum of true positive and false negative predictions.
FPR denotes the False Positive Rate (1 - Specificity), which is the ratio of false positive predictions to the sum of false positive and true negative predictions.
The integration is performed over the range of FPR values.
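The steps above can be reproduced with scikit-learn, which sweeps the thresholds, returns the (FPR, TPR) pairs, and integrates the area under the curve; the scores below are illustrative placeholders:

```python
# Illustrative AUC-ROC computation following the steps described above.
from sklearn.metrics import auc, roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1, 1, 0]                      # true labels
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.65, 0.5]    # predicted P_pos per instance

fpr, tpr, thresholds = roc_curve(y_true, scores)       # (FPR, TPR) at each threshold
print("AUC via trapezoidal integration:", auc(fpr, tpr))
print("AUC via roc_auc_score:          ", roc_auc_score(y_true, scores))
```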
2.5 Embedding Techniques
Embedding techniques play a vital role in natural language processing (NLP) and machine learning tasks by transforming textual data into numerical representations. These representations capture semantic relationships, contextual information, and underlying patterns in the text, enabling machine learning algorithms to process and analyze language effectively. In this section, the two embedding techniques selected for this master's thesis, Word2Vec and TF-IDF, will be discussed extensively.
2.5.1 Word2Vec Embedding Technique: Strengths and Limitations
Word2Vec is a popular word embedding technique in natural language processing (NLP) designed to capture semantic relationships and meanings of words in a continuous vector space. Developed by Tomas Mikolov and his team at Google, Word2Vec has revolutionized the field of NLP by providing a dense representation of words that can be used as input features for various downstream tasks, such as sentiment analysis, machine translation, and named entity recognition. In this section, we will delve into the details of Word2Vec and discuss its key concepts, training process, and applications.
Word2Vec is based on the distributional hypothesis, which states that words appearing in similar contexts tend to have similar meanings. The main idea behind Word2Vec is to learn word representations by training a neural network on a large corpus of text. The two popular algorithms for training Word2Vec models are Continuous Bag of Words (CBOW) and Skip-gram.
Both CBOW and Skip-gram use a neural network with a hidden layer to learn the word embeddings. The hidden layer represents the word vectors, and each word in the vocabulary is assigned a unique vector. The dimensionality of the word vectors is a hyperparameter that needs to be specified before training.
During training, Word2Vec adjusts the word vectors by updating the weights of the neural network to minimize the loss function, such as the negative log-likelihood. The training process involves iterating through the entire corpus multiple times to optimize the word embeddings.
One of the main advantages of Word2Vec is that it provides a dense representation of words in a continuous vector space. This representation captures semantic relationships between words, allowing similar words to have similar vector representations. It preserves meaningful semantic information, such as word analogies and semantic similarities. For example, in the Word2Vec space, the vector representation of “king” might be close to “queen” and “royal,” indicating their semantic similarities. In addition, Word2Vec considers the context of words when generating word embeddings. By using neighbouring words as input, it captures the contextual information associated with a word. This context-awareness helps in capturing the meaning of a word based on its surrounding words, leading to more meaningful and contextually rich word embeddings. Word2Vec also reduces the high-dimensional representation of words in a text corpus to a lower-dimensional continuous vector space, which facilitates efficient storage and processing of word embeddings and allows for faster computation and scalability when dealing with large vocabularies and voluminous datasets.
Furthermore, one of the major strengths of Word2Vec is its ability to capture analogies between words. In other words, trained Word2Vec models can perform word vector arithmetic, where mathematical operations on word vectors can result in meaningful analogies. For example, “father” – “man” + “woman” might yield a word vector close to “mother.” This capability has been used to solve analogy-based reasoning tasks and showcases the model's understanding of semantic relationships.
On a final note, Word2Vec embeddings can be pre-trained on a large corpus of text and then transferred to downstream NLP tasks. The pre-trained embeddings capture general language patterns and semantics, providing a useful initialization for subsequent models. This transfer learning approach enables leveraging the knowledge captured by Word2Vec across different NLP tasks, even with limited task-specific data.
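As a hedged, minimal sketch of these ideas, the snippet below trains a tiny Word2Vec model with gensim on a few made-up product titles (sg=1 selects Skip-gram, sg=0 CBOW; vector_size is the embedding dimensionality) and probes similarities and analogy-style arithmetic; the corpus and parameters are illustrative, not the thesis configuration:

```python
# Tiny, illustrative Word2Vec training run with gensim (toy product titles).
from gensim.models import Word2Vec

sentences = [
    ["apple", "iphone", "11", "64gb", "black"],
    ["apple", "iphone", "11", "pro", "256gb"],
    ["samsung", "galaxy", "s10", "128gb", "black"],
    ["samsung", "galaxy", "s10", "plus", "case"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

print(model.wv.most_similar("iphone", topn=3))    # nearest words in the vector space
print(model.wv.similarity("iphone", "galaxy"))    # cosine similarity of two words
# Analogy-style vector arithmetic of the kind discussed above:
print(model.wv.most_similar(positive=["galaxy", "apple"], negative=["samsung"], topn=1))
```

On a corpus this small the outputs are essentially noise; meaningful embeddings require the large training corpora discussed below.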
On the downside, the Word2Vec embedding technique relies on a predefined vocabulary learned during the training phase, which means it struggles to process words that are not present in the training corpus or are rare. Out-of-vocabulary (OOV) words can lead to missing or unreliable word embeddings, impacting the performance of downstream tasks such as binary classification. Word2Vec also treats each word as a single entity and does not explicitly capture different meanings or senses of a word, because it assigns a single vector representation to all occurrences of a word, irrespective of its context or sense. This limitation can be problematic when dealing with highly ambiguous words that have multiple meanings; the model might fail to distinguish between different senses of a word, leading to less precise representations for such words. In addition, the Word2Vec embedding technique operates at the word level and does not directly capture relationships between phrases or longer sequences of words, because it treats each word independently and neglects the compositionality of phrases and idiomatic expressions. As a result, it may struggle to represent and understand the meaning of phrases, limiting its ability to capture certain linguistic phenomena.
The quality and size of the training corpus significantly affect the performance of Word2Vec, because the model relies heavily on the patterns and distribution of words in the training data. If the corpus is small or unrepresentative, the resulting word embeddings may lack generalization and fail to capture the nuances of the language adequately. Finally, Word2Vec generates fixed embeddings, which implies that the word vectors remain static and do not adapt to changing contexts or data. This inflexibility limits the model's ability to handle dynamic or evolving language phenomena. In contrast, newer techniques such as contextual word embeddings (e.g., BERT, GPT) capture word meaning based on the surrounding context and can better handle context-dependent word representations.
2.5.2 Term Frequency-Inverse Document Frequency Embedding Technique: Strengths and Limitations
TF-IDF (Term Frequency-Inverse Document Frequency) is a widely used embedding technique in natural language processing (NLP) that represents textual data based on the importance of each word in a document collection. TF-IDF captures the relative significance of words by considering both their frequency in a document and their rarity across the entire corpus.
TF-IDF calculates a weight for each term in a document based on its term frequency (TF) and inverse document frequency (IDF). The TF measures the frequency of a term within a specific document, while the IDF measures the rarity of the term across the entire document collection. The combination of these two measures provides a representation that highlights important terms while downplaying common or uninformative words.
The TF component of TF-IDF is calculated by dividing the number of times a term appears in a document by the total number of terms in the document. This value represents the local importance of the term within the document. A higher TF value indicates that the term is more relevant or significant in the document.
The IDF component of TF-IDF is calculated by taking the logarithm of the inverse of the document frequency of a term. The document frequency refers to the number of documents in the corpus that contain the term. The IDF value represents the global importance of the term across the document collection. Terms that appear in a small number of documents have higher IDF values, indicating their rarity and potential informativeness.
To obtain the TF-IDF weight for each term in a document, the TF value is multiplied by the IDF value. This multiplication emphasizes terms that are frequent within a document but rare across the corpus, giving them higher weights. On the other hand, terms that are frequent both within the document and across the corpus receive lower weights.
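As an illustration of this weighting scheme, the sketch below computes textbook TF-IDF weights for a few terms over a toy corpus of hypothetical product offers. Library implementations (for example scikit-learn's TfidfVectorizer, used later in this thesis) apply additional smoothing and normalization, so their exact values differ slightly.

# A minimal sketch of the textbook TF-IDF weighting described above.
import math

docs = [
    "apple iphone 12 64gb black",            # hypothetical product offers
    "apple iphone 12 case black leather",
    "samsung galaxy s21 128gb grey",
]
corpus = [d.split() for d in docs]

def tf(term, doc_tokens):
    # term frequency: occurrences of the term divided by the document length
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    # inverse document frequency: log(total documents / documents containing the term)
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

for term in ("64gb", "apple", "samsung"):
    print(term, round(tf_idf(term, corpus[0], corpus), 4))

In this toy example "64gb" receives the highest weight in the first offer because it is frequent there but rare across the corpus, while "apple" is down-weighted for appearing in most documents.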
TF-IDF embeddings provide several advantages in NLP tasks. Firstly, they effectively highlight important terms in a document and suppress common or less informative words, which makes them suitable for tasks such as information retrieval, keyword extraction, and document similarity analysis. Secondly, TF-IDF embeddings are computationally efficient and easy to calculate, making them scalable for large document collections. Additionally, TF-IDF embeddings can handle out-of-vocabulary terms and are relatively interpretable, as the weights reflect the relative importance of terms.
On the flip side, the embedding technique does not capture the semantic relationships between words or consider the context of the document. Each term is treated independently, disregarding the surrounding words or sentence structure. TF-IDF does not account for word order, sentence meaning, or document semantics, which may limit its performance in tasks requiring deeper language understanding. Moreover, TF-IDF is sensitive to document length, as longer documents may have higher term frequencies and potentially skew the importance of terms.
2.6 Product Matching Application in Various Fields and Industries
Product matching is applicable in a wide range of fields and industries where accurate identification and comparison of similar products are essential. To give context to this project, the relevant fields where product matching is applicable are discussed below:
2.7 Summary of the Literature Review
The literature review covered in this dissertation provided a comprehensive and detailed overview of the various algorithms suitable for binary classification tasks such as product matching. It highlighted the strengths and limitations of the selected embedding techniques and machine learning algorithms. The supervised learning algorithms discussed include Decision Trees, Random Forest, Support Vector Machines, Naïve Bayes, Logistic Regression, KNN, Gradient Boosting, XGBoost, and MLP. The embedding techniques discussed focus on Word2Vec and TF-IDF.
The evaluation metrics discussed in the review include Accuracy, Precision, Recall, F1-score, and AUC-ROC. Finally, the application of product matching in various fields and industries was discussed with substantial examples.
In the next chapter, I discuss the research design, data collection, data pre-processing, feature extraction, machine learning algorithms, and evaluation metrics, which together make up Chapter Three, Research Methodology.
Chapter Three
3. Research Methodology
3.1 Introduction
This chapter outlines the methodology used in this dissertation to analyze and compare various machine learning algorithms for product matching combined with selected embedding techniques. The chapter begins with a discussion of the research design, followed by a description of the data collection and pre-processing techniques deployed. Then, the chapter describes the different machine learning algorithms used in this thesis. Finally, the chapter presents the performance metrics used to evaluate the algorithms and the experimental setup.
3.2 Experimental Setup
In this section, we describe the experimental setup used to evaluate the performance of the machine learning algorithms for product matching. The experimental setup encompasses the dataset used, the pre-processing steps applied, and the configuration of the algorithms.
3.2.1 Dataset
The Product Data Corpus dataset for the product matching binary classification challenge, available on the Semantic Web Challenge ISWC2020 website (ir-ischool-uos.github.io), was used for this thesis. Compressed as a JSON file, the dataset has several key/value pairs. After reading the dataset into a pandas DataFrame using Jupyter Notebook, the data was displayed in a tabular format. The dataset comprises several instances, where each instance includes two product descriptions (product offers), known as description_left and description_right, and a label indicating whether the products are a match or not. A label of 1 indicates that the two product descriptions match, while a label of 0 indicates that they do not.
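For illustration, a loading step along these lines can be used; the file name and the line-delimited, gzip-compressed layout shown below are assumptions about the corpus files rather than a verbatim reproduction of the project code.

# A minimal sketch of reading the product-pair corpus into a pandas DataFrame.
# The file name is hypothetical; line-delimited, gzip-compressed JSON is assumed.
import pandas as pd

df = pd.read_json("product_pairs_train.json.gz", lines=True, compression="gzip")

print(df.shape)
print(df[["description_left", "description_right", "label"]].head())
print(df["label"].value_counts())   # 1 = matching pair, 0 = non-matching pair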
3.2.2 Pre-processing
Prior to training the machine learning algorithms, the dataset underwent pre-processing steps to prepare the textual data for feature extraction. The following pre-processing techniques were applied:
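Since the individual steps are not reproduced here, the sketch below illustrates a typical pipeline consistent with the NLTK, string, and re tooling described in Section 3.3; the exact sequence of steps (lowercasing, punctuation removal, tokenization, stop word removal) is an assumption for illustration.

# A minimal pre-processing sketch (assumed steps), continuing from the
# DataFrame loaded in Section 3.2.1.
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

def preprocess(text):
    text = str(text).lower()
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    text = re.sub(r"\s+", " ", text).strip()                          # collapse whitespace
    tokens = word_tokenize(text)
    return [t for t in tokens if t not in STOP_WORDS]                 # drop stop words

df["tokens_left"] = df["description_left"].apply(preprocess)
df["tokens_right"] = df["description_right"].apply(preprocess)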
3.2.3 Algorithm Hyperparameter Tuning Configuration
We employed established best practices and prior literature to configure the relevant hyperparameters for each algorithm. Through cross-validation, we fine-tuned these hyperparameters to maximize the performance of the algorithms.
To ensure a fair and consistent evaluation, we maintained the same set of hyperparameters and configuration for each algorithm in all experiments. Furthermore, we consistently used an 80:20 train-test split ratio throughout the experiments.
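As a sketch of this set-up, the snippet below shows an 80:20 split followed by a cross-validated grid search for one classifier; the feature matrix X, labels y, and the parameter grid are illustrative assumptions rather than the exact grids used in this project.

# A minimal sketch of the tuning set-up: 80:20 train-test split plus 5-fold
# cross-validated grid search. X is the assumed feature matrix (Word2Vec or
# TF-IDF vectors) and y the match labels; the grid shown is illustrative.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 20]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="f1", n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)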
3.2.4 Model Training and Evaluation
The models were evaluated using various performance metrics such as accuracy, precision, recall, F1-score, and AUC-ROC Curve to assess their effectiveness in product matching.
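For reference, these metrics can be obtained with scikit-learn as in the sketch below, assuming a fitted classifier clf (with predict_proba support for the AUC-ROC value) and the held-out test split from the previous section.

# A minimal sketch of the evaluation metrics used in this thesis, assuming a
# fitted classifier `clf` and the held-out X_test / y_test split.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, classification_report)

y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]     # probability of the matching class

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
print("AUC-ROC  :", roc_auc_score(y_test, y_prob))
print(classification_report(y_test, y_pred))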
3.3 Libraries and Tools
To support various stages of the research process, several additional Python libraries were utilized. These libraries facilitated tasks such as data acquisition, exploratory data analysis, and other pre-processing steps. The following libraries were particularly instrumental:
3.3.1 Pandas
Pandas is a widely used library for data manipulation and analysis. It offers data structures and functions that facilitate reading datasets from various file formats, such as CSV or Excel. In this study, Pandas was employed to read and pre-process the acquired dataset, enabling efficient data handling and exploration.
3.3.2 Matplotlib and Seaborn
Matplotlib and Seaborn are popular data visualization libraries in Python. Matplotlib provides a wide range of plotting functionalities, while Seaborn offers higher-level abstractions for creating aesthetically pleasing and informative visualizations. These libraries were used to generate charts, histograms, and other visual representations during the exploratory data analysis phase.
3.3.3 NLTK (Natural Language Toolkit)
NLTK is a powerful library specifically designed for NLP tasks. It provides a suite of tools and resources for tasks such as tokenization, stemming, and part-of-speech tagging. In this research, NLTK was utilized for text pre-processing steps, including tokenization and stop word removal.
3.3.4 NumPy
NumPy is a fundamental library for scientific computing in Python. It provides support for handling multidimensional arrays and a wide range of mathematical operations. NumPy was used in this study for various numerical computations and array manipulations, supporting the implementation of machine learning algorithms.
3.3.5 Jupyter Notebook
Jupyter Notebook is an interactive computing environment that allows the creation and sharing of documents containing code, visualizations, and explanatory text. Jupyter Notebook was utilized for writing and executing the research code, ensuring reproducibility, and providing a clear narrative flow.
In addition, the built-in string library in Python provides a collection of useful functions and constants for working with strings. It was used in this research for tasks such as text cleaning and manipulation, including operations like removing punctuation and special characters from the text data. The functions and methods offered by the string library helped ensure that the text data was properly processed before further analysis.
3.3.6 re (Regular Expression)
The re module in Python provides support for regular expressions, which are powerful tools for pattern matching and string manipulation. In this study, the re library was utilized to perform more advanced text pre-processing tasks, such as pattern matching and substitution. Regular expressions helped identify and remove specific patterns or structures from the text data, enabling more precise text cleaning and pre-processing.
3.3.7 Wordcloud
The wordcloud library allows for the creation of visually appealing word clouds, which are visual representations of the most frequent words in a text dataset. Word clouds are useful for gaining insights into the most prominent terms and concepts within the data. In this research, the wordcloud library was utilized to generate word clouds, helping visualize and understand the key themes and patterns present in the textual data.
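A representative usage, assuming the combined product descriptions from the DataFrame introduced earlier, is sketched below.

# A minimal word cloud sketch over the combined product descriptions
# (DataFrame columns as assumed elsewhere in this chapter).
import matplotlib.pyplot as plt
from wordcloud import WordCloud

text = " ".join(df["description_left"].astype(str)) + " " + \
       " ".join(df["description_right"].astype(str))
wc = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()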
3.3.8 Gensim
Gensim is a widely used library in the natural language processing (NLP) domain which provides an efficient implementation of the Word2Vec model, employed here for creating word embeddings. Gensim's robust functionality enables the training of high-quality word vectors from the dataset.
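The sketch below shows one way this can look: training Word2Vec on the tokenized descriptions and averaging word vectors into a fixed-length feature per offer. The hyperparameter values and the averaging strategy are assumptions for illustration, not the exact configuration used in this thesis.

# A minimal Word2Vec feature-extraction sketch with Gensim (assumed settings),
# using the tokens_left / tokens_right columns from the pre-processing step.
import numpy as np
from gensim.models import Word2Vec

sentences = df["tokens_left"].tolist() + df["tokens_right"].tolist()
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=2, workers=4)

def average_vector(tokens, model, dim=100):
    # mean vector of in-vocabulary tokens; zeros if every token is OOV
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

left = np.vstack(df["tokens_left"].apply(lambda t: average_vector(t, w2v)).to_list())
right = np.vstack(df["tokens_right"].apply(lambda t: average_vector(t, w2v)).to_list())
X = np.hstack([left, right])     # one row of pair features per instance
y = df["label"].values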
3.3.9 Scikit-learn
Scikit-learn is a powerful machine learning library that offers a wide range of tools and algorithms for various tasks, including text processing. In this study, Scikit-learn was employed for TF-IDF vectorization, allowing the conversion of text documents into numerical feature vectors suitable for machine learning algorithms.
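A comparable TF-IDF feature-extraction sketch is shown below; fitting the vectorizer on the combined text of both columns and capping the vocabulary size are assumptions for illustration.

# A minimal TF-IDF feature-extraction sketch with scikit-learn (assumed settings).
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = df["description_left"].astype(str).tolist() + \
         df["description_right"].astype(str).tolist()
vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")
vectorizer.fit(corpus)

X_tfidf = hstack([vectorizer.transform(df["description_left"].astype(str)),
                  vectorizer.transform(df["description_right"].astype(str))])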
3.3.10 Torch
Torch is a widely adopted library for deep learning in Python. It provides tensor computations and automatic differentiation capabilities, essential for building and training neural network models. In this research, Torch was employed to support the implementation of deep learning models and facilitate training and evaluation processes.
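Only to illustrate how the library is typically used, a small binary-classification network in Torch is sketched below; it assumes dense NumPy feature arrays (for example the averaged Word2Vec pair features) and is not necessarily the architecture behind the MLP results reported later.

# A minimal Torch sketch of a binary match / non-match classifier (illustrative
# architecture and training loop; input dimension of 200 is assumed).
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(200, 64),
    nn.ReLU(),
    nn.Linear(64, 1),            # single logit: match vs. non-match
)
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

X_t = torch.tensor(X_train, dtype=torch.float32)
y_t = torch.tensor(y_train, dtype=torch.float32).unsqueeze(1)

for epoch in range(10):          # illustrative number of epochs
    optimizer.zero_grad()
    loss = loss_fn(model(X_t), y_t)
    loss.backward()
    optimizer.step()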
3.3.11 Other Libraries
Several other libraries played smaller but significant roles in the research project. These include:
3.4 Upgrade of Hardware
To ensure efficient and optimal execution of the research project, an upgrade to the hardware setup was implemented. The initial hardware configuration consisted of an HP laptop with 8GB RAM and an Intel Core i3 processor. However, recognizing the potential limitations in terms of computational capacity and processing speed, a decision was made to upgrade the hardware to an HP Envy laptop equipped with 16GB RAM and an Intel Core i7 processor.
This upgrade to the hardware setup played a crucial role in facilitating the execution of various computational tasks involved in the project. With the increased RAM capacity and processing power, the new laptop provided substantial improvements in terms of performance and efficiency. It enabled faster processing times and enhanced the ability to handle computationally demanding operations of this thesis.
The upgraded hardware configuration ensured that the machine learning algorithms, embedding techniques, and other computational processes could be executed with improved efficiency and reduced computation time. This upgrade played a significant role in mitigating potential challenges related to processing limitations, allowing for a smoother and more expedited research process.
By leveraging the enhanced hardware capabilities, the research project benefited from reduced computation times, enabling a more comprehensive exploration of hyperparameter settings, extensive experimentation, and rigorous evaluation of the machine learning models. It also facilitated the comparison and analysis of the different embedding techniques (Word2Vec and TF-IDF) with greater precision and accuracy.
3.5 Legal, Social, Ethical and Professional Issues
In this master’s thesis, I recognize the significance of addressing various legal, social, ethical, and professional issues. These considerations are crucial in ensuring the responsible and ethical conduct of the research. Here, I outline the key issues in each category:
By considering and addressing these legal, social, ethical, and professional issues, I am committed to conducting this research in a responsible and ethical manner. This approach will not only ensure the validity and reliability of the results but also contribute to the development of a fair, transparent, and ethically sound product matching system.
3.6 Ethical Considerations
Throughout the research, ethical considerations will be given due importance. Proper permissions and data usage policies will be followed while collecting the dataset. Anonymization and data privacy measures will be implemented to protect the privacy of individuals and organizations involved.
3.7 Summary of the Research Methodology
To summarize, this chapter outlined the research methodology adopted for this master’s project. The research design, data collection and pre-processing, feature extraction, model selection, evaluation metrics, experimental setup, statistical analysis, and ethical considerations were described. The next chapter will present the results and analysis obtained from the experiments conducted.
Chapter Four
4. Results and Analysis
4.1 Introduction
This chapter presents the results and analysis of the comparative analysis of various machine learning algorithms using Word2Vec and TF-IDF embedding techniques for product matching. The chapter provides an in-depth evaluation of the performance of each algorithm and examines their effectiveness in achieving accurate product matching. Furthermore, the results are analyzed and interpreted to gain insights into the strengths and weaknesses of each algorithm.
4.2 Evaluation of the ML Models using the Word2Vec Embedding Technique (Non-hyperparameter Tuning and Hyperparameter Tuning)
The table below shows the evaluation of each of the machine learning models using the Word2Vec Embedding Technique with and without hyperparameter tuning:
4.3 Evaluation of the ML Models using the TF-IDF Embedding Technique (Non-hyperparameter Tuning and Hyperparameter Tuning)
The table below shows the evaluation of each of the machine learning models using the TF-IDF Embedding Technique with and without hyperparameter tuning:
4.4 Interpretation of the Results
The evaluation results of the machine learning models using the Word2Vec and TF-IDF embedding techniques, with and without hyperparameter tuning, were presented in Tables 4.2.1 and 4.2.2 respectively. Each algorithm's performance was measured in terms of precision, recall, F1-score, and support for both class 0 and class 1.
4.4.1 Logistic Regression Model
Non-hyperparameter Tuning:
Hyperparameter Tuning:
4.4.2 Random Forest Model
Non-hyperparameter Tuning:
Hyperparameter Tuning:
4.4.3 Support Vector Machines
Non-hyperparameter Tuning:
Hyperparameter Tuning
4.4.4 Decision Tree Model
Non-hyperparameter Tuning:
Hyperparameter Tuning:
4.4.5 KNN Model
Non-hyperparameter Tuning:
Hyperparameter Tuning
4.4.6 XGBoost Model
Non-hyperparameter Tuning:
Hyperparameter Tuning
4.4.7 Naïve Bayes
Non-hyperparameter Tuning:
Hyperparameter Tuning:
4.4.8 Multi-Layer Perceptron Model
Non-hyperparameter Tuning:
Hyperparameter Tuning:
4.4.9 Gradient Boosting
Non-hyperparameter Tuning:
Hyperparameter Tuning:
Chapter Five
5. Discussion of Findings
5.1 Introduction
This chapter presents the discussion of findings for the various machine learning algorithms (both without and with hyperparameter tuning), using the Word2Vec and TF-IDF embedding techniques for product matching. The chapter begins with a comparative analysis of the models and the embedding techniques deployed, and then focuses on a comprehensive discussion of the findings.
5.2 Discussion of Findings
Considering research question 1 in this master's thesis, the utilization of Word2Vec and TF-IDF embedding techniques for product matching reveals distinct performance variations across different machine learning algorithms.
With respect to research question 2, by adjusting the hyperparameters, the models were able to achieve better accuracy, precision, recall, F1-score, and AUC-ROC values compared to their non-hyperparameter-tuned counterparts. This indicates that fine-tuning the hyperparameters allowed the models to better capture the underlying patterns and relationships in the data, resulting in improved performance metrics. In addition, hyperparameter tuning led to significant improvements in recall, precision, and F1-score, indicating better overall performance in correctly identifying matching pairs.
Regarding research question 3, the MLP model with hyperparameter tuning and the TF-IDF embedding technique exhibited the highest accuracy score of 0.97 for matching pairs. This accuracy of 97% indicates that the model successfully identified and classified matching products with a remarkable level of precision, effectively minimizing both false positives and false negatives.
Regarding research question 4, the performance of machine learning algorithms in categorizing matching and non-matching products across different embedding techniques can be summarized as follows:
- XGBoost Model: Accuracy = 0.94, Precision = 0.92, Recall = 0.72, F1-Score = 0.81, AUC-ROC = 0.95
XGBoost Word2Vec Non-hyperparameter Tuning Confusion Matrix Visual
XGBoost Word2Vec Non-hyperparameter Tuning AUC-ROC Curve Visual
Random Forest Word2Vec Non-hyperparameter Tuning Confusion Matrix Visual
Random Forest Word2Vec Non-hyperparameter Tuning Classification Report Visual
Random Forest Word2Vec Non-hyperparameter Tuning AUC-ROC Curve Visual
Gradient Boosting Word2Vec Non-hyperparameter Tuning Confusion Matrix Visual
Gradient Boosting Word2Vec Non-hyperparameter Tuning Classification Report Visual
Gradient Boosting Word2Vec Non-hyperparameter Tuning AUC-ROC Curve Visual
XGBoost Word2Vec Hyperparameter Tuning Confusion Matrix Visual
5.3 Changes and Challenges
During the research project, a handful of changes and challenges were encountered that influenced the overall methodology and implementation. One of the major changes was the revision of the research questions, leading to a more targeted analysis of different machine learning algorithms using Word2Vec and TF-IDF embedding techniques for product matching.
Another significant change that was made to the original research plan was the decision to drop the BERT (Bidirectional Encoder Representations from Transformers) embedding technique from the analysis and the project altogether.
The decision to exclude BERT embeddings stemmed from the realization that the processing time for generating BERT embeddings for the project’s dataset was significantly longer on my local machine compared to other embedding techniques such as Word2Vec and TF-IDF. Therefore, it was necessary to reassess the embedding techniques and focus on the remaining Word2Vec and TF-IDF approaches.
Although the exclusion of BERT embeddings was an adjustment from the original plan, it did not compromise the overall research objectives. Word2Vec and TF-IDF still provided valuable insights into the product matching task, and their computational requirements were more manageable within the project's constraints. By adapting the methodology to exclude BERT embeddings, the research project was able to maintain its focus on comparing and analyzing the effectiveness of Word2Vec and TF-IDF in product matching tasks.
In addition to the change regarding the exclusion of BERT embeddings, another notable challenge encountered during the research project was the time-consuming nature of hyperparameter tuning for the Word2Vec and TF-IDF embedding techniques.
The process of hyperparameter tuning for Word2Vec and TF-IDF involved iterating through various parameter configurations, training multiple models, and evaluating their performance. However, due to the large number of hyperparameters and the extensive range of possible values, the tuning process became computationally intensive and time-consuming. Consequently, it was necessary to strike a balance between the thoroughness of the hyperparameter search and the available computational resources.
5.4 Recommendations and Justifications
In line with the presentation of results and discussion of findings, the following are recommendations put forward:
This recommendation is based on the classifier's strong performance across multiple evaluation metrics, including accuracy, precision, recall, F1-score, and AUC-ROC. Its ability to achieve high scores in these metrics indicates its effectiveness in accurately classifying matching and non-matching products. The Word2Vec embedding technique complements the XGBoost model, providing meaningful representations of product descriptions. Therefore, the XGBoost Model with non-hyperparameter tuning is a reliable option for product classification tasks.
Although the Random Forest Model may have slightly lower accuracy compared to XGBoost, it still demonstrates respectable scores in precision, recall, F1-score, and AUC-ROC. This classifier can be considered a viable alternative for product classification tasks, particularly when interpretability and ensemble-based learning are preferred. The Word2Vec embedding technique enhances the model's ability to capture semantic relationships between words, contributing to its overall performance.
The Gradient Boosting Model exhibits balanced performance across multiple evaluation metrics, including accuracy, precision, recall, F1-score, and AUC-ROC. This classifier is particularly suitable for scenarios where a good trade-off between different evaluation metrics is desired. The Word2Vec embedding technique enhances the model's understanding of product descriptions, leading to improved classification accuracy and overall performance.
The recommendation for the XGBoost classifier with hyperparameter tuning is based on its consistent and competitive performance in terms of accuracy, precision, recall, F1-score, and AUC-ROC. By optimizing the model's hyperparameters, it can effectively leverage the Word2Vec embedding technique to capture complex relationships between words and improve classification accuracy. This well-optimized model is a strong choice when higher performance is required for product classification tasks.
The MLP classifier demonstrates excellent scores across all evaluation metrics, particularly in accuracy, precision, and AUC-ROC. The TF-IDF embedding technique, which represents text documents based on term frequencies and inverse document frequencies, complements the MLP model by capturing important features in product descriptions. This classifier is recommended for scenarios that prioritize high precision and overall performance in product matching tasks.
The XGBoost classifier with hyperparameter tuning delivers robust performance in terms of accuracy, precision, recall, F1-score, and AUC-ROC. By optimizing the hyperparameters, this classifier maximizes its effectiveness in leveraging the TF-IDF embedding technique to represent text data. It proves to be an effective choice for product classification tasks when using TF-IDF embeddings.
5.5 Future Research Recommendations
Future research should focus on:
Investigating the applicability of advanced embedding techniques, such as BERT, and their impact on product matching performance.
Exploring ensemble methods that combine multiple classifiers to achieve even higher accuracy and robustness in product matching.
Conducting real-world case studies and experiments to validate the findings and assess the scalability of the recommended classifiers and embedding techniques in large-scale product matching applications.
By addressing these future research directions, we can further enhance the accuracy, efficiency, and scalability of product matching algorithms, making them more applicable and valuable in various domains, such as e-commerce, supply chain management, and recommendation systems.
5.6 Practical Implications
Algorithm Selection: The findings provide valuable insights into the performance of different machine learning algorithms for product matching tasks. Businesses can use this information to select the most suitable algorithm based on their specific requirements and data characteristics.
Embedding Technique: Understanding the effectiveness of Word2Vec and TF-IDF embedding techniques in product matching allows practitioners to make informed decisions about which technique to employ. This knowledge can improve the accuracy and efficiency of product matching systems.
Hyperparameter Tuning: The examination of hyperparameter tuning highlights its impact on algorithm performance in product matching. Organizations can optimize their machine learning models by fine-tuning hyperparameters to achieve better accuracy, precision, recall, and F1 scores.
Industry Applications: Identifying scenarios or industries where specific machine learning algorithms exhibit superior effectiveness in product matching can guide practitioners in selecting appropriate algorithms for their domain-specific applications. This knowledge enhances the applicability and effectiveness of product matching solutions across different industries.
Performance Evaluation: The evaluation of machine learning algorithms in classifying matching and non-matching products provides practical insights into their performance characteristics. This information aids in setting realistic expectations, assessing the reliability of results, and understanding the strengths and limitations of different algorithms.
5.7 Conclusion
In conclusion, this master’s thesis has provided valuable insights into the performance of different machine learning algorithms and embedding techniques for product matching tasks. The key findings highlight the following:
The recommended classifiers, such as XGBoost, Random Forest, Gradient Boosting, and MLP, demonstrated strong performance in terms of accuracy, precision, recall, F1-score, and AUC-ROC. These classifiers can significantly improve product matching accuracy and serve as reliable options for decision-making processes.
In addition, the Word2Vec and TF-IDF embedding techniques showed their effectiveness in capturing semantic information and representing textual data for product matching. Both techniques offer valuable features for classification tasks, enabling better accuracy and overall performance.
The significance of the recommended classifiers and embedding techniques lies in their potential to enhance product matching accuracy, which is crucial for industries such as e-commerce, supply chain management, healthcare, and other industries where product matching is applicable.
While this research provides valuable insights, there is still room for further exploration and application of machine learning algorithms in product matching binary classification tasks.
By continuing to advance research in this field and applying machine learning algorithms effectively, we can unlock the full potential of product matching in various industries. This will lead to improved decision-making processes, enhanced customer satisfaction, and increased operational efficiency. It is imperative for researchers and practitioners to collaborate and drive innovation in the application of machine learning algorithms for product matching tasks.