Discovering the Spectrum of Machine Learning: A Beginner's Comprehensive Guide
Kay Chansiri, Ph.D.
Research Scientist | ML & GenAI for Social Impacts | Human-Computer Interaction
Hello there, data enthusiasts! I'm back with another article, this time unraveling the various types of Machine Learning (ML) and their real-world applications. The article is designed for beginners as well as intermediate and advanced readers who wish to brush up on their conceptual knowledge of ML types. It covers not only key concepts but also practical code applications. So buckle up for an insightful and hands-on journey!
First, let's start by understanding what ML actually is from different perspectives.
Machine Learning Definitions
Layperson's Definition: Machine learning is often seen as a branch of artificial intelligence. More specifically, ML is the process by which computers learn from data to develop artificial intelligence.
Data Scientist's View: If you ask a data scientist, they may say that ML involves providing a training dataset to a computer, enabling it to learn optimal parameter values and patterns while minimizing error for a specific algorithm. As the algorithm encounters new datasets, or testing sets, it uses the previously learned parameters to predict trends and categorize groups, continually improving with new data.
Definition as Found in Academic Journal Articles: Frequently, academic journal articles describe ML as a computer program that improves its performance on a specific task (T), based on experience (E) and guided by a performance measure (P) (Mitchell, 1997). This definition emphasizes the program's ability to enhance its effectiveness in performing task T as it accumulates more experience E, as evaluated by performance measure P.
I agree with all these definitions. However, ML is not just about training or testing data. It's also about how we can develop a system capable of addressing our research questions or fulfilling specific objectives, regardless of the industry.
Take child welfare, for instance, where I currently work. Here, ML extends beyond simply training a machine on large administrative datasets to predict a child's potential involvement in the juvenile justice system. It's also about creating an ML system that can be deployed, monitored, and maintained responsibly within an agency's day-to-day operations.
In essence, beyond using training and testing data to build and refine a specific algorithm, ML encompasses the deployment, monitoring, and ongoing maintenance of the system. Whether you work for a non-profit or private organization, it's crucial to align your ML model's deployment, scalability, and maintenance with the company’s strategy and operations.
Machine Learning Terminology Explained
In ML, some terms differ from those typically encountered in traditional statistics. A few common correspondences: 'features' are what statisticians call independent variables or predictors; the 'target' or 'label' is the dependent variable or outcome; 'training' a model corresponds to fitting or estimating it; and the learned 'weights' or 'parameters' play the role of coefficients.
Returning to Our Main Question: "How Many Types of ML Are There?"
The answer depends on the criteria used for categorization. In this article, I will discuss the types of machine learning based on three classification criteria: styles of supervision in training, patterns of processing incoming data, and abilities in generalization. Let's begin with the classification of ML based on the style of training supervision:
1. Supervised Learning: The model is trained on labeled data, learning to map inputs to known outputs (classification and regression are the classic tasks).
2. Unsupervised Learning: The model finds structure in unlabeled data, such as clusters of similar cases or lower-dimensional representations.
3. Semi-Supervised Learning: The model learns from a small amount of labeled data combined with a larger pool of unlabeled data.
4. Reinforcement Learning: An agent learns by interacting with an environment, receiving rewards or penalties for its actions and adjusting its strategy accordingly.
Exploring Supervised vs. Unsupervised Machine Learning: KNN and K-Means Clustering
To gain a clearer insight into the distinctions between supervised and unsupervised ML, let’s explore two commonly used algorithms: K-nearest neighbors (KNN) for supervised learning and K-means clustering for unsupervised learning.
1. Supervised Learning with K-Nearest Neighbors (KNN)
Imagine you're a researcher at a furniture company like IKEA (I love their Scandinavian decor!). Your task is to predict whether customers will purchase 'as-is' showroom furniture pieces. Factors such as age, gender, race, purchasing history, and credit scores play a role in this decision. In your dataset, these features are accompanied by labels indicating whether customers have purchased the 'as-is' furniture (0 = not purchased, 1 = purchased).
KNN, a supervised machine learning algorithm, shines in this scenario. It is trained on a dataset with known outcomes (customers who have already made purchases). For a new customer, KNN finds the 'K' nearest neighbors in the dataset based on similarities in features like age, gender, race, etc. The algorithm then predicts the new customer's likelihood of purchasing based on the majority decision of these neighbors.
For instance, if K=3, the KNN algorithm will find the three customers in your dataset that are closest to the new customer in terms of age, gender, and race. If two of these customers have purchased as-is showroom furniture pieces and one has not, the algorithm will predict that the new customer is likely to make a purchase. To find the nearest neighbors, KNN measures the distance between points using metrics such as Euclidean distance (the most common), Manhattan distance, or others.
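As a quick illustration of these distance metrics, here is a toy example with made-up feature values (the numbers are purely illustrative, not from the IKEA dataset):

```python
import numpy as np

# Two customers described by (age, gender, race) -- toy values for illustration
a = np.array([25.0, 1.0, 3.0])
b = np.array([28.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))  # straight-line distance
manhattan = np.sum(np.abs(a - b))          # sum of absolute differences

print(euclidean)  # sqrt(9 + 1 + 0) ~ 3.162
print(manhattan)  # 3 + 1 + 0 = 4.0
```

Notice that age dominates both distances here, which previews the scaling issue discussed next.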
It's important to note that KNN is sensitive to the scale of the data. Variables such as income and age, which often have widely varying scales, can disproportionately influence the outcome if they are not scaled or standardized. Therefore, scaling or standardizing your variables is an essential step before applying KNN.
Additionally, the choice of the K value is crucial. A smaller K makes the prediction more sensitive to noise, since a few atypical neighbors can sway the majority vote. Conversely, a larger K smooths out predictions but increases computational cost. KNN may also struggle when the number of features or dimensions is high: data becomes sparse as dimensions increase, which complicates the search for genuinely close neighbors. Determining an appropriate K depends on multiple factors, such as research objectives and sample sizes, which I will discuss further in one of my upcoming articles.
For now, let's explore the following code snippets to gain a better understanding of KNN as a supervised algorithm.
Step 1: Import the necessary Python libraries. The key player here is scikit-learn — a comprehensive library for machine learning. For this tutorial, I've also included pandas, which is excellent for handling data in a dataframe format. Additionally, numpy, known for its array manipulations, is part of our toolkit. If you're curious about the difference between dataframes and arrays, you can refer to the article I previously wrote on this topic.
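Since the original snippet is not reproduced here, a minimal sketch of the imports this tutorial assumes might look like:

```python
# Libraries assumed for this tutorial (a typical scikit-learn workflow)
import numpy as np                                     # array manipulation
import pandas as pd                                    # dataframe handling
from sklearn.model_selection import train_test_split   # data splitting
from sklearn.preprocessing import StandardScaler       # feature scaling
from sklearn.neighbors import KNeighborsClassifier     # the KNN algorithm
```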
Step 2: Import data. For this tutorial, I've generated a fake dataset tailored to our IKEA scenario.
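The original snippet is not shown here, so below is a hedged reconstruction of such a dataset; the column names and value ranges are my assumptions for the IKEA scenario:

```python
import numpy as np
import pandas as pd

np.random.seed(42)  # for reproducibility (the seed value is my choice)

# 100 random values per characteristic; column names are my guesses
df = pd.DataFrame({
    'age': np.random.randint(18, 70, 100),               # 18..69 (70 excluded)
    'gender': np.random.randint(0, 2, 100),              # 0 or 1
    'race': np.random.randint(1, 6, 100),                # groups 1..5
    'purchase_history': np.random.randint(0, 101, 100),  # score 0..100
    'credit_score': np.random.randint(300, 851, 100),    # 300..850
    'purchased': np.random.randint(0, 2, 100)            # target: 0 = no, 1 = yes
})
print(df.shape)  # (100, 6)
```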
According to the code snippet above, I created a dataset with 100 random numbers for each characteristic.
A quick note for those used to R: Python behaves a bit differently. In Python, when you use the np.random.randint function (from the NumPy library) to generate random numbers, it includes the lower number but not the upper one. For example, if you write np.random.randint(18, 70, 100), it will produce numbers from 18 to 69, but not 70.
Having generated the dataset, the next step is to split the data into features and the target. In the following code snippet, all variables except 'purchased' are designated as features (also known as predictors).
After identifying the roles of the variables, I divided the data into training and testing sets, allocating 80% for training and 20% for testing.
As mentioned earlier, the KNN algorithm is sensitive to varying scales among features. Hence, standardizing the data is a crucial step.
Now that the data is ready, we can proceed to train the KNN model. After training, we'll use it to make predictions on the testing dataset set aside earlier.
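The splitting, scaling, training, and prediction steps described above can be sketched end to end as follows; this is a self-contained reconstruction, and the variable and column names are my assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Recreate the toy dataset (column names are my assumption)
np.random.seed(42)
df = pd.DataFrame({
    'age': np.random.randint(18, 70, 100),
    'gender': np.random.randint(0, 2, 100),
    'race': np.random.randint(1, 6, 100),
    'purchase_history': np.random.randint(0, 101, 100),
    'credit_score': np.random.randint(300, 851, 100),
    'purchased': np.random.randint(0, 2, 100)
})

# Everything except 'purchased' is a feature; 'purchased' is the target
X = df.drop(columns='purchased')
y = df['purchased']

# 80/20 train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Standardize features (KNN is scale-sensitive)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)   # reuse the training statistics

# Train KNN with K=3 and predict on the held-out set
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_scaled, y_train)
predictions = knn.predict(X_test_scaled)
print(predictions)
```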
Looking at the resulting predictions, you'll notice that most samples (i.e., cases) are classified as 1. And there you have it! A straightforward approach to executing a supervised machine learning algorithm.
2. Unsupervised Learning with K-Means Clustering
Now, imagine you switch gears to work on a project with the marketing team. Here, the goal is to categorize customers based on age, gender, income, race, postcode, and purchasing history for targeted email advertisements. This is where K-Means, an unsupervised learning algorithm, comes into play.
Unlike KNN, K-Means doesn't rely on labeled data. It groups customers into 'K' clusters (let's say K=3 for simplicity) based on their similarity. The algorithm starts by selecting three random points in the dataset as initial centroids. Each customer is then assigned to the nearest centroid based on their features, minimizing the variance within each cluster. Like KNN, the "nearest" is usually determined based on Euclidean distance. Once all data points are assigned to clusters, the centroids are recalculated as the mean (center) of all data points in their respective clusters.
This step adjusts the position of each centroid to be truly representative of its cluster. The process is repeated iteratively. With each iteration, the assignments of data points to clusters and the positions of the centroids are refined. The algorithm continues to iterate until a stopping criterion is met (e.g., a set number of iterations) or until the algorithm has converged (i.e., the centroids no longer move significantly between iterations).
Keep in mind that the initial choice of centroids can affect the final clusters. To address this, K-Means is often run multiple times with different initial centroids. The best clustering result (based on a criterion like within-cluster sum of squares) is chosen.
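The assign-and-update loop described above can be written from scratch in just a few lines. This sketch uses synthetic 2-D data (not the IKEA dataset) purely to show the mechanics:

```python
import numpy as np

np.random.seed(0)
# Synthetic 2-D points in three loose groups (toy data for illustration)
data = np.vstack([
    np.random.randn(30, 2) + [0, 0],
    np.random.randn(30, 2) + [5, 5],
    np.random.randn(30, 2) + [0, 5],
])

k = 3
# Initialization: pick k distinct data points as starting centroids
centroids = data[np.random.choice(len(data), k, replace=False)]

for _ in range(100):
    # Assignment step: each point joins its nearest centroid (Euclidean distance)
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: move each centroid to the mean of its assigned points
    # (keep the old centroid if a cluster happens to be empty)
    new_centroids = np.array([
        data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(k)
    ])
    # Convergence check: stop when centroids no longer move significantly
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(labels[:10])
```

In practice you would use scikit-learn's KMeans, which handles multiple restarts and smarter initialization (k-means++) for you.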
Returning to the IKEA example discussed previously: to categorize a new customer, the K-Means algorithm simply measures their proximity to the existing cluster centroids and assigns them to the closest cluster. This method enables the marketing team to develop targeted strategies for customers with similar characteristics.
Let's look at the code snippets below to gain a deeper understanding of K-Means as an unsupervised algorithm. The process starts with importing all the necessary libraries, building upon those previously imported. This time, I've also included matplotlib for visualization purposes.
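A hedged sketch of those imports, assuming scikit-learn's KMeans and matplotlib:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt               # for visualizing the clusters
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
```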
Next, I enhance the dataset by generating additional synthetic data for clustering. In this iteration, I've refined our example to incorporate income and postcode as new features.
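Again, the original snippet is not shown, so here is a plausible reconstruction; the column names and value ranges are my assumptions:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
# Extended toy dataset: income and postcode added (names are my assumption)
df = pd.DataFrame({
    'age': np.random.randint(18, 70, 100),
    'gender': np.random.randint(0, 2, 100),
    'income': np.random.randint(20000, 150001, 100),
    'race': np.random.randint(1, 6, 100),
    'postcode': np.random.randint(10000, 100000, 100),
    'purchase_history': np.random.randint(0, 101, 100),
    'credit_score': np.random.randint(300, 851, 100),
})
print(df.shape)  # (100, 7)
```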
The next step is to select features for clustering, which is then followed by data standardization. Similar to KNN, the KMeans algorithm is sensitive to variations in data scales. Therefore, it is crucial to standardize and scale the variables appropriately before training the model.
Now, let's train the model.
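A minimal sketch of the standardize-then-fit step, assuming a toy dataset with the seven features described in this section (column names are my guesses). Note that I leave 'n_init' at its default here, which is what triggers the warning discussed next on older scikit-learn versions:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

np.random.seed(42)
df = pd.DataFrame({
    'age': np.random.randint(18, 70, 100),
    'gender': np.random.randint(0, 2, 100),
    'income': np.random.randint(20000, 150001, 100),
    'race': np.random.randint(1, 6, 100),
    'postcode': np.random.randint(10000, 100000, 100),
    'purchase_history': np.random.randint(0, 101, 100),
    'credit_score': np.random.randint(300, 851, 100),
})

# Standardize all seven features (K-Means is scale-sensitive too)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)

# Train K-Means with three clusters
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X_scaled)
print(kmeans.cluster_centers_.shape)  # (3, 7): one centroid per cluster
```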
Notice that the warning I encountered above indicates that in an upcoming scikit-learn release (version 1.4), the default value of the 'n_init' parameter in the KMeans algorithm will change to 'auto'. The 'n_init' parameter dictates how many times the algorithm is run with different centroid seeds. In the version I'm using, its default value is 10, meaning the algorithm automatically runs ten times with different centroid seeds and keeps the best outcome. To preemptively address this warning, you can set the 'n_init' parameter explicitly when initializing the KMeans algorithm, as shown:
```python
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
```
Returning to our machine learning journey, once the model is trained, we proceed to retrieve the cluster labels and attempt to visualize the clusters.
Note that the plot output displays only two features, although our model includes seven. This limitation stems from the scatter plot's inherently two-dimensional nature, capable of illustrating only two dimensions at a time. While Python offers methods to visualize data in higher dimensions, we will maintain simplicity in this tutorial.
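Putting it together, here is a self-contained sketch of retrieving the labels and plotting two of the seven features; the column names, value ranges, and plotting choices are my assumptions:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

np.random.seed(42)
df = pd.DataFrame({
    'age': np.random.randint(18, 70, 100),
    'gender': np.random.randint(0, 2, 100),
    'income': np.random.randint(20000, 150001, 100),
    'race': np.random.randint(1, 6, 100),
    'postcode': np.random.randint(10000, 100000, 100),
    'purchase_history': np.random.randint(0, 101, 100),
    'credit_score': np.random.randint(300, 851, 100),
})

scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_scaled)

labels = kmeans.labels_                       # cluster label for each customer
# Map the centroids back to original units so they sit on the same axes
centers = scaler.inverse_transform(kmeans.cluster_centers_)

# Plot two of the seven features: age (column 0) vs. income (column 2)
plt.scatter(df['age'], df['income'], c=labels, cmap='viridis', alpha=0.6)
plt.scatter(centers[:, 0], centers[:, 2], c='red', marker='X', s=200)
plt.xlabel('Age')
plt.ylabel('Income')
plt.title('Customer clusters (2 of 7 features shown)')
plt.savefig('clusters.png')  # use plt.show() instead when running interactively
```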
Let's create a new, randomly generated data point to introduce into the model. This data point is characterized by an age of 36, female gender, an annual income of $110,637, classification in race group 4, a postcode of 63484, a purchasing history score of 85, and an excellent credit score of 818.
Having generated the new data point, I scaled it, as always, to match the scale of the existing data. In the code below, care has been taken to ensure that the format of the data being transformed, including its feature names, matches the data used to fit the scaler earlier.
To achieve this, I convert my numpy array into a pandas DataFrame, ensuring it mirrors the column names of the original DataFrame before performing the transformation. This step is vital because transforming a numpy array, which lacks inherent feature names, with this scaler function would trigger a warning due to the expected consistency in feature names.
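Here is a self-contained sketch of that conversion and prediction; the dataset, the column names, and the coding of gender as 0 for female are my assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

np.random.seed(42)
df = pd.DataFrame({
    'age': np.random.randint(18, 70, 100),
    'gender': np.random.randint(0, 2, 100),
    'income': np.random.randint(20000, 150001, 100),
    'race': np.random.randint(1, 6, 100),
    'postcode': np.random.randint(10000, 100000, 100),
    'purchase_history': np.random.randint(0, 101, 100),
    'credit_score': np.random.randint(300, 851, 100),
})

scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_scaled)

# New customer: age 36, female (coded 0 here -- the coding is my assumption),
# income $110,637, race group 4, postcode 63484, purchase history 85, credit 818
new_point = np.array([[36, 0, 110637, 4, 63484, 85, 818]])

# Wrap the array in a DataFrame with the original column names so the scaler
# sees consistent feature names and raises no warning
new_df = pd.DataFrame(new_point, columns=df.columns)
new_scaled = scaler.transform(new_df)

cluster = kmeans.predict(new_scaled)
print(cluster)  # the cluster index this customer is assigned to
```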
Finally, we predict the cluster classification of the new data point. The results indicate that this new data point is assigned to Group 2.
Keep in mind that the journey of machine learning isn't just about training, testing, and predicting. There are also the crucial steps of evaluating, scaling, and maintaining your model. I will share more about these aspects in my next article, so stay tuned.
Online Versus Batch Learning
In addition to categorizing ML by supervision style, ML can also be categorized by how it processes incoming data:
1. Online learning: The system learns incrementally, updating its parameters as each new observation (or small mini-batch) arrives. This suits streaming data and settings with limited memory.
2. Batch learning: The system is trained on the entire dataset at once and must be retrained from scratch to incorporate new data.
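To make the contrast concrete, here is a small sketch using scikit-learn's SGDClassifier, which supports both whole-dataset fitting and incremental updates via partial_fit; the data is synthetic and purely illustrative:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

np.random.seed(0)
X = np.random.randn(1000, 4)
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # a simple, learnable rule

# Batch learning: the model sees the whole dataset at once
batch_model = SGDClassifier(random_state=0)
batch_model.fit(X, y)

# Online learning: the model is updated incrementally, chunk by chunk,
# as if the data were arriving in a stream
online_model = SGDClassifier(random_state=0)
for start in range(0, 1000, 100):                 # ten mini-batches of 100
    X_chunk, y_chunk = X[start:start + 100], y[start:start + 100]
    online_model.partial_fit(X_chunk, y_chunk, classes=[0, 1])

print(batch_model.score(X, y), online_model.score(X, y))
```

The online model never needs the full dataset in memory, which is the practical advantage of this style.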
Instance-Based vs. Model-Based Learning
As we explore the varied landscapes of ML, let's delve into our final categorization topic for today's post: how ML systems generalize to make predictions.
1. Instance-Based Learning: Learning from Specific Examples. The system essentially memorizes the training examples and generalizes to new cases by comparing them to stored ones with a similarity measure; KNN is a classic instance-based algorithm.
2. Model-Based Learning: Predictive Frameworks from General Patterns. The system builds a general model of the data, such as a regression model, and makes predictions from the learned parameters rather than from the raw examples.
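A compact illustration of the contrast, using KNN as the instance-based learner and logistic regression as the model-based learner; the data is synthetic and the choice of models is mine:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier   # instance-based
from sklearn.linear_model import LogisticRegression  # model-based

np.random.seed(1)
X = np.random.randn(200, 2)
y = (X[:, 0] > 0).astype(int)  # class depends only on the first feature

# Instance-based: stores the training examples and compares new points to them
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Model-based: compresses the data into learned parameters (a decision boundary)
logreg = LogisticRegression().fit(X, y)

new_point = np.array([[2.0, 0.0]])
# KNN consults its stored neighbors; logistic regression consults its coefficients
print(knn.predict(new_point), logreg.predict(new_point))
```

Both should classify this point as class 1, but they arrive there differently: KNN by looking up similar examples, logistic regression by applying a general rule it distilled from the data.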
Key Takeaways
Supervision Styles: We explored supervised, unsupervised, semi-supervised, and reinforcement learning, with hands-on examples of KNN (supervised) and K-Means clustering (unsupervised).
Online vs. Batch Learning: We contrasted systems that learn incrementally from incoming data with systems trained on the full dataset at once.
Instance-Based vs. Model-Based Learning: We concluded with a look at how ML systems generalize, either by learning from specific examples or by creating predictive models from general patterns.
Remember, the world of ML is vast and ever-evolving. While this article covers fundamental concepts and types, there are other types of ML like zero-shot, one-shot, and few-shot learning that we haven't touched upon. So, stay tuned for my upcoming articles.
Thank you for joining me on this exploratory journey. Your curiosity and eagerness to learn are the first steps towards mastering ML. Let's continue this learning adventure in future articles!