Discovering the Spectrum of Machine Learning: A Beginner's Comprehensive Guide

Hello there, data enthusiasts! I'm back with another article, this time unraveling the various types of Machine Learning (ML) and their real-world applications. This article is designed for beginners as well as intermediate and advanced practitioners who want to brush up on their conceptual knowledge of ML types. It covers key concepts and also includes practical code applications. So buckle up for an insightful and hands-on journey!

First, let's start by understanding what ML actually is from different perspectives.

Machine Learning Definitions

Layperson's Definition: Machine learning is often seen as a branch of artificial intelligence. More specifically, ML is the process by which computers learn from data to develop artificial intelligence.

Data Scientist's View: If you ask a data scientist, they may say that ML involves providing a training dataset to a computer, enabling it to generate optimal parameter values, patterns, and minimal errors for a specific algorithm. As the algorithm encounters new datasets or testing sets, it uses previously learned parameters to predict trends and categorize groups, continually improving based on new data.

Definition as Found in Academic Journal Articles: Frequently, academic journal articles describe ML as a computer program that improves its performance on a specific task (T), based on experience (E) and guided by a performance measure (P) (Mitchell, 1997). This definition emphasizes the program's ability to enhance its effectiveness in performing task T as it accumulates more experience E, as evaluated by performance measure P.

I agree with all these definitions. However, ML is not just about training or testing data. It's also about how we can develop a system capable of addressing our research questions or fulfilling specific objectives, regardless of the industry.

Take child welfare, for instance, where I currently work. Here, ML extends beyond simply training a machine on large administrative datasets to predict a child's potential involvement in the juvenile justice system. It's also about creating an ML system that is:

  • Inclusive: Incorporating training data that reflects and addresses racial disparities in child welfare.
  • Reliable: Utilizing data collected from thoroughly validated survey measurements across various groups of children and families at risk.
  • Scalable: Ensuring the model performs efficiently without overfitting or underfitting.
  • Maintainable: Equipping the system with ample resources, like sufficient RAM, for efficient operation.
  • Adaptable: Designing the model or data structure to be flexible, accommodating evolving trends and new datasets as they emerge in the child welfare system.

In essence, beyond using training and testing data to build and refine a specific algorithm, ML encompasses the deployment, monitoring, and ongoing maintenance of the system. Whether you work for a non-profit or private organization, it's crucial to align your ML model's deployment, scalability, and maintenance with the company’s strategy and operations.

Machine Learning Terminology Explained

In ML, some terms may differ from those typically encountered in traditional statistics. Let's delve into these differences:

  • Features, Predictors, Attributes: In traditional statistics, these are known as independent variables.
  • Labels: In ML, a 'label' is the known outcome attached to each sample, what traditional statistics calls the dependent variable. The term is used most often in classification, where outcomes are categorical.
  • Targets: 'Target' is the more general term for whatever a model predicts; in regression settings it corresponds to continuous outcomes in traditional statistics.
  • Samples: Referred to as ‘cases’ or ‘participants’ in the context of statistics.
  • Regression Models: These are models used to predict continuous (interval/ratio) outcomes, similar to certain types of general linear models in statistics, such as linear regression.
  • Classification Models: These models are designed to predict categorical (ordinal/nominal) outcomes, comparable to generalized linear models (e.g., binary logistic regression) in statistics. However, classification models in ML can involve binary or multiclass classification, akin to some types of generalized linear modeling (e.g., multinomial logistic regression). In ML, the number of outcome classes can be extensive, such as when training a machine to identify different types of stones on Mars or to differentiate among thousands of diseases in biopsy analysis. When dealing with a large number of classes, we refer to the task as having high cardinality. It's important to have a sufficient number of samples for each class, which relates to considerations of power and sample size in traditional statistics.
  • Multilabel Classification: Besides multiclass classification, there are also multilabel classification models in ML, where a sample may belong to several classes at once. This differs from multiclass classification problems, where a sample belongs to exactly one among multiple classes. For example, consider building a model to categorize the content of a 'day in the life of a data scientist' YouTube vlog. This content might be classified into several categories simultaneously, such as entertainment, technology, lifestyle, and even drama, especially when code doesn't work as expected! The idea of multilabel classification might be somewhat novel to social science researchers and some traditional statisticians, while it tends to be more familiar territory for machine learning scientists.

Returning to Our Main Question: "How Many Types of ML Are There?"

Figure 1: Types of Machine Learning

The answer depends on the criteria used for categorization. In this article, I will discuss the types of machine learning based on three classification criteria: styles of supervision in training, patterns of processing incoming data, and abilities in generalization. Let's begin with the classification of ML based on the style of training supervision:

1. Supervised Learning:

  • This type of ML involves training data that includes ‘labels’ or the desired outcomes the algorithm is expected to learn and predict.
  • For example, consider a study on how demographic factors like age, gender, race, and religion affect IQ. If you have a dataset including participants' IQs, these IQs serve as labels to train the algorithm. Using regression, the computer will learn from the training data to minimize the loss function (e.g., Mean Absolute Error or Mean Squared Error, which essentially measures the distance between predicted and actual values) and then apply this learning to predict IQs in a new dataset.
  • Everyday examples include Netflix's algorithm suggesting shows based on your viewing history or email systems filtering spam based on word choices, sender information, or your interaction with previous emails.

2. Unsupervised Learning:

  • In contrast to supervised learning, unsupervised learning does not use specifically labeled data for training and testing. Instead, the computer learns to identify patterns within the dataset.
  • If you are a social science researcher, a familiar example is factor analysis, used to perform dimension reduction or to categorize multiple survey items. In this case, the dataset lacks a specific column indicating categories for each survey item. The computer learns to group items based on an algorithm like exploratory factor analysis, clustering them according to coefficients or factor loadings.
  • Beyond clustering survey items, unsupervised learning has a variety of applications. These include generative tasks, such as instructing a computer to generate new images of dogs based on a collection of dog pictures previously fed into the system. Anomaly detection is another area: during the data preparation phase, the machine learns the data's patterns and is thus able to identify outliers, or data points in new datasets that don't align with the training set's patterns.
  • Additionally, unsupervised learning is used in association rule learning. This involves exploring relationships between features and using these relationships to make predictions in new datasets. For instance, if a model in Facebook's advertising system discovers that individuals in a specific postal code A who subscribe to Netflix are also likely to subscribe to Hulu, Facebook might start recommending both Netflix and Hulu to users in that postal code.

3. Semi-Supervised Learning:

  • Semi-supervised learning is beneficial when not all data can be labeled due to time or cost constraints.
  • For instance, with an initial budget of, say, $50,000 from an angel investor, a startup might only be able to hire human annotators to label data from about 4,000 customers. Without the resources for more extensive annotation, the startup can employ semi-supervised learning: the machine uses the limited labeled data to learn and then predicts trends in the broader, unlabeled customer base.

4. Reinforcement Learning:

  • Reinforcement learning is observable in various domains, including robotics, self-driving cars, and interactive systems like ChatGPT that seek feedback on their responses.

Figure 2: ChatGPT Requests Additional Feedback as an Example of Reinforcement Learning.

  • The reinforcement learning approach involves the system observing data patterns, making decisions, and performing actions independently. It receives rewards or penalties based on the outcomes of these actions. Through this process, the system develops a policy or strategy to improve its performance in future tasks.

Exploring Supervised vs. Unsupervised Machine Learning: KNN and K-Means Clustering

To gain a clearer insight into the distinctions between supervised and unsupervised ML, let’s explore two commonly used algorithms: K-nearest neighbors (KNN) for supervised learning and K-means clustering for unsupervised learning.

1. Supervised Learning with K-Nearest Neighbors (KNN)

Imagine you're a researcher at a furniture company like IKEA (I love their Scandinavian decor!). Your task is to predict whether customers will purchase 'as-is' showroom furniture pieces. Factors such as age, gender, race, purchasing history, and credit scores play a role in this decision. In your dataset, these features are accompanied by labels indicating whether customers have purchased the 'as-is' furniture (0 = not purchased, 1 = purchased).

KNN, a supervised machine learning algorithm, shines in this scenario. It is trained on a dataset with known outcomes (customers who have already made purchases). For a new customer, KNN finds the 'K' nearest neighbors in the dataset based on similarities in features like age, gender, race, etc. The algorithm then predicts the new customer's likelihood of purchasing based on the majority decision of these neighbors.

For instance, if K=3, the KNN algorithm will find the three customers in your dataset that are closest to the new customer in terms of age, gender, and race. If two of these customers have purchased as-is showroom furniture pieces and one has not, the algorithm will predict that the new customer is likely to make a purchase. To find the nearest neighbors, KNN measures the distance between points using metrics such as Euclidean distance (the most common), Manhattan distance, or others.

It's important to note that KNN is sensitive to the scale of the data. Variables such as income and age, which often have widely varying scales, can disproportionately influence the outcome if they are not scaled or standardized. Therefore, scaling or standardizing your variables is an essential step before applying KNN.

Additionally, the choice of the K value is crucial. A smaller K value makes predictions more sensitive to noise, since a single atypical or mislabeled neighbor can sway the vote. Conversely, a larger K value smooths out noise but increases computational costs and can blur the boundaries between classes. KNN may also be less effective when dealing with a high number of features or dimensions, as data becomes sparse in high-dimensional spaces, complicating the process of finding close neighbors. Determining an appropriate K depends on multiple factors, such as research objectives and sample sizes, which I will discuss further in one of my upcoming articles.

For now, let's explore the following code snippets to gain a better understanding of KNN as a supervised algorithm.

Step 1: Import the necessary Python libraries. The key player here is scikit-learn — a comprehensive library for machine learning. For this tutorial, I've also included pandas, which is excellent for handling data in a dataframe format. Additionally, numpy, known for its array manipulations, is part of our toolkit. If you're curious about the difference between dataframes and arrays, you can refer to the article I previously wrote on this topic.
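Here's a minimal sketch of what these imports might look like (the exact scikit-learn modules are my assumptions, based on the steps that follow):

# Core libraries for data handling and the KNN workflow sketched below
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier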

Step 2: Import data. For this tutorial, I've generated a fake dataset tailored to our IKEA scenario.
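Below is a minimal sketch of how such a dataset could be generated; the column names are my assumptions, while the value ranges follow the description that comes next:

# Generate a fake dataset of 100 customers (column names are assumptions)
np.random.seed(42)  # fix the seed so results are reproducible
df = pd.DataFrame({
    'age': np.random.randint(18, 70, 100),                  # 18 to 69
    'gender': np.random.randint(0, 2, 100),                 # 0 = male, 1 = female
    'race': np.random.randint(0, 5, 100),                   # five groups, 0 through 4
    'purchasing_history': np.random.randint(0, 100, 100),   # scores from 0 to 99
    'credit_score': np.random.randint(300, 850, 100),       # 300 to 849
    'purchased': np.random.randint(0, 2, 100)               # 0 = not purchased, 1 = purchased
})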

According to the code snippet above, I created a dataset with 100 random numbers for each characteristic.

  • Age: The numbers range from 18 to 69. This range is wide enough to include a diverse group of ages.
  • Gender: Represented by 0 for male and 1 for female.
  • Race: Categorized into five groups, represented by numbers 0 through 4.
  • Purchasing History Scores: These are between 0 and 99, which fits into a 100-point scale.
  • Credit Scores: Set from 300 to 849, mimicking real-life credit score ranges.
  • Purchase Status ('Purchased'): This is marked with 0 for 'not purchased' and 1 for 'purchased'.

A quick note for those used to R: Python behaves a bit differently. In Python, when you use the np.random.randint function (from the NumPy library) to generate random numbers, it includes the lower number but not the upper one. For example, if you write np.random.randint(18, 70, 100), it will produce numbers from 18 to 69, but not 70.

Having generated the dataset, the next step is to split the data into features and the target. In the following code snippet, all variables except 'purchased' are designated as features (also known as predictors).
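A minimal sketch of that split, assuming the column names from the dataset sketch above:

# Separate the features (predictors) from the target
X = df.drop('purchased', axis=1)
y = df['purchased']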

After identifying the roles of the variables, I divided the data into training and testing sets, allocating 80% for training and 20% for testing.
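A sketch of the 80/20 split using scikit-learn's train_test_split:

# Hold out 20% of the samples for testing; random_state makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)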

As mentioned earlier, the KNN algorithm is sensitive to varying scales among features. Hence, standardizing the data is a crucial step.
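One common way to do this, sketched here with scikit-learn's StandardScaler (the choice of scaler is an assumption):

# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)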

Now that the data is ready, we can proceed to train the KNN model. After training, we'll use it to make predictions on the testing dataset set aside earlier.
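A minimal sketch of training and prediction, assuming K=3 as in the earlier walkthrough:

# Train a KNN classifier with K=3 and predict on the held-out test set
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_scaled, y_train)
y_pred = knn.predict(X_test_scaled)
print(y_pred)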

In the output of the prediction step above, you'll notice that most samples (i.e., cases) were classified as 1 in my run. And there you have it! A straightforward approach to executing a supervised machine learning algorithm.

2. Unsupervised Learning with K-Means Clustering

Now, imagine you switch gears to work on a project with the marketing team. Here, the goal is to categorize customers based on age, gender, race, purchasing history, credit score, income, and postcode for targeted email advertisements. This is where K-Means, an unsupervised learning algorithm, comes into play.

Unlike KNN, K-Means doesn't rely on labeled data. It groups customers into 'K' clusters (let's say K=3 for simplicity) based on their similarity. The algorithm starts by selecting three random points in the dataset as initial centroids. Each customer is then assigned to the nearest centroid based on their features, minimizing the variance within each cluster. As with KNN, 'nearest' is usually determined by Euclidean distance. Once all data points are assigned to clusters, the centroids are recalculated as the mean (center) of all data points in their respective clusters.

This step adjusts the position of each centroid to be truly representative of its cluster. The process is repeated iteratively. With each iteration, the assignments of data points to clusters and the positions of the centroids are refined. The algorithm continues to iterate until a stopping criterion is met (e.g., a set number of iterations) or until the algorithm has converged (i.e., the centroids no longer move significantly between iterations).

Keep in mind that the initial choice of centroids can affect the final clusters. To address this, K-Means is often run multiple times with different initial centroids. The best clustering result (based on a criterion like within-cluster sum of squares) is chosen.

Returning to the IKEA example discussed previously, to categorize a new customer, the KMeans algorithm would simply measure their proximity to the existing cluster centroids and assign them to the closest cluster. This method enables the marketing team to develop targeted strategies for customers with similar characteristics.

Let's look at the code snippets below to gain a deeper understanding of K-Means as an unsupervised algorithm. The process starts with importing all the necessary libraries, building upon those previously imported. This time, I've also included matplotlib for visualization purposes.
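A sketch of the additional imports:

# New imports for clustering and plotting
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans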

Next, I enhance the dataset by generating additional synthetic data for clustering. In this iteration, I've refined our example to incorporate income and postcode as new features.
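A minimal sketch of this step; the income and postcode ranges are my assumptions, chosen to cover the example values used later:

# Add income and five-digit postcode columns to the existing dataframe
df['income'] = np.random.randint(20000, 150001, 100)    # annual income in dollars
df['postcode'] = np.random.randint(10000, 100000, 100)  # five-digit postcodes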

The next step is to select features for clustering, which is then followed by data standardization. Similar to KNN, the KMeans algorithm is sensitive to variations in data scales. Therefore, it is crucial to standardize and scale the variables appropriately before training the model.
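A sketch of feature selection and scaling, assuming the seven feature columns from the sketches above:

# Cluster on all seven features, standardized to a common scale
features = ['age', 'gender', 'race', 'purchasing_history', 'credit_score', 'income', 'postcode']
X_cluster = df[features]
scaler = StandardScaler()
X_cluster_scaled = scaler.fit_transform(X_cluster)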

Now, let's train the model.
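A minimal sketch of the training step, with K=3 as in the example above (n_init is deliberately left at its default here, which is what produces the warning discussed next):

# Fit K-Means with three clusters; n_init is left unset on purpose
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X_cluster_scaled)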

Notice that the warning I encountered above indicates that in an upcoming scikit-learn release (version 1.4), the default value of the 'n_init' parameter in the KMeans algorithm will be changed to 'auto'. The 'n_init' parameter dictates how many times the algorithm is executed with varying centroid seeds. In the current version I'm using, its default value is set to 10, meaning it automatically performs ten iterations with different centroid seeds, selecting the most optimal outcome. To preemptively address this warning, you can manually set the 'n_init' parameter when initializing the KMeans algorithm, as shown:

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)

Returning to our machine learning journey, once the model is trained, we proceed to retrieve the cluster labels and attempt to visualize the clusters.
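A sketch of retrieving the labels and plotting; which two features to display is an arbitrary choice, and age versus income is my pick here:

# Color each customer by cluster assignment, plotting two of the seven features
labels = kmeans.labels_
plt.scatter(df['age'], df['income'], c=labels, cmap='viridis')
plt.xlabel('Age')
plt.ylabel('Income')
plt.title('Customer clusters (two of seven features shown)')
plt.show()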

Note that the plot only displays two features, although our model includes seven. This limitation stems from the scatter plot's inherent two-dimensional nature, capable of illustrating only two dimensions at a time. While Python offers methods to visualize data in higher dimensions, we will maintain simplicity in this tutorial.

Let's create a new, randomly generated data point to introduce into the model. This data point is characterized by an age of 36, female gender, an annual income of $110,637, classification in race group 4, a postcode of 63484, a purchasing history score of 85, and an excellent credit score of 818.
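A sketch of that data point as a NumPy array, with values in the same column order as the feature list above:

# One new customer: age, gender, race, purchasing history, credit score, income, postcode
new_point = np.array([[36, 1, 4, 85, 818, 110637, 63484]])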

Now that I've generated a new data point, as always, I scaled it to match the existing data's scale. In the code below, care has been taken to ensure that the format of the data undergoing transformation, including feature names, aligns with the data used to fit the scaler previously.

To achieve this, I convert my numpy array into a pandas DataFrame, ensuring it mirrors the column names of the original DataFrame before performing the transformation. This step is vital because transforming a numpy array, which lacks inherent feature names, with this scaler function would trigger a warning due to the expected consistency in feature names.
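A minimal sketch of the conversion and scaling:

# Wrap the array in a DataFrame with the original column names, then reuse the fitted scaler
new_point_df = pd.DataFrame(new_point, columns=features)
new_point_scaled = scaler.transform(new_point_df)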

Finally, we predict the cluster classification of the new data point. The results indicate that this new data point is assigned to Group 2.
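A sketch of that final prediction step:

# Assign the new customer to the nearest cluster centroid
cluster = kmeans.predict(new_point_scaled)
print(cluster)  # in the run described above, this point lands in Group 2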

Keep in mind that the journey of machine learning isn't just about training, testing, and predicting. There are also the crucial steps of evaluating, scaling, and maintaining your model. I will share more about these aspects in my next article, so stay tuned.

Online Versus Batch Learning

In addition to categorizing ML by supervision training style, ML can also be categorized by how it processes incoming data:

1. Online learning

  • This type of ML learns at a fast pace and adapts in real time.
  • The system is trained incrementally, processing data instances sequentially, either one by one or in small clusters known as mini-batches. This approach is particularly valuable in dynamic business environments where real-time updates on consumer demand and behavior are critical for profitability.
  • The learning rate in online learning can be adjusted according to need. For instance, in high-stakes environments like day trading, where staying abreast of the stock market is crucial, a fast learning rate is beneficial. However, this speed can lead to the model quickly forgetting initial parameters, despite being quick at adapting to new data.
  • On the other hand, a slower learning rate allows the model to retain initial parameters more effectively, albeit with slower adaptation to new data. This slower pace might be more suitable for fields like index fund management or 401(k)s, where real-time updates are less critical. Additionally, a slower rate reduces the model's sensitivity to noise in new data, due to its familiarity with initial parameters.

2. Batch learning

  • Now, let’s shift gears to Batch Learning. Here, the ML system learns everything it can from a large dataset in one comprehensive pass, akin to a thorough study session.
  • Once trained, it doesn’t update its knowledge base. This method is stable and thorough but might struggle to stay current in a rapidly changing environment. It's like mastering a subject but not keeping up with the latest developments.
  • If you are a social science researcher in academia, batch learning might be more familiar than online learning. This is often due to the nature of social science projects, where data is collected once, analyzed, and then used to inform future data collection, rather than connecting a machine learning system to a continuously updating database.

Instance-Based vs. Model-Based Learning

As we explore the varied landscapes of ML, let's delve into our final categorization topic for today's post: how ML systems generalize to make predictions.

1. Instance-Based Learning: Learning from Specific Examples

  • Instance-based learning in ML operates on the principle of learning from specific examples and using these to predict patterns in new data.
  • Imagine you're training an image recognition program to identify your spouse's face. You'd feed the system various images of your spouse, showcasing different facial expressions. Rather than learning from a single, static image, the model compares and contrasts these multiple examples.
  • Each time you introduce a new image of your spouse, the model assesses its similarity to the previously learned images. It determines whether the new image matches the known facial features of your spouse. Essentially, the model learns by drawing parallels between the new instance and the collected examples.

2. Model-Based Learning: Predictive Frameworks from General Patterns

  • On the other side of the spectrum, we have model-based learning. This approach involves creating a model based on a representative sample of data. You start by selecting a specific model type suited to your data and objectives.
  • The key to success in model-based learning is fine-tuning the model's parameters to align with your training data, along with selecting and transforming informative inputs (a process known as feature engineering). Both are critical in ensuring that the model makes accurate predictions.
  • The accuracy of predictions in model-based learning heavily depends on how well the chosen features represent the training data and whether the selected algorithm aligns with the underlying data patterns. For instance, applying a linear regression model to data that inherently follows a sigmoid function would result in ineffective learning.

Key Takeaways

  1. Machine Learning Definitions: We explored ML from various perspectives – the layperson's view, the data scientist's lens, and the academic eye. These definitions underscore ML's versatility across different fields.
  2. Beyond Basics: ML transcends training and testing data. It's about creating inclusive, reliable, scalable, maintainable, and adaptable systems that meet specific objectives and align with business strategies.
  3. Terminology Differences: We distinguished between terms commonly used in ML and traditional statistics, such as features, labels, targets, and samples, clarifying their roles in ML models.
  4. Types of ML Based on Training Supervision: We delved into supervised, unsupervised, semi-supervised, and reinforcement learning, each with unique applications and challenges.
  5. Practical Applications: Through examples like IKEA's furniture sales and marketing strategies, we illustrated how supervised and unsupervised algorithms like KNN and K-Means Clustering operate.
  6. Online vs. Batch Learning: We compared these two learning methods, highlighting their suitability in different business contexts.
  7. Instance-Based vs. Model-Based Learning: We concluded with a look at how ML systems generalize, either by learning from specific examples or by creating predictive models from general patterns.

Remember, the world of ML is vast and ever-evolving. While this article covers fundamental concepts and types, there are other types of ML like zero-shot, one-shot, and few-shot learning that we haven't touched upon. So, stay tuned for my upcoming articles.

Thank you for joining me on this exploratory journey. Your curiosity and eagerness to learn are the first steps towards mastering ML. Let's continue this learning adventure in future articles!
