An eye on Machine Learning

Steven Murhula

Machine Learning | Data Engineer | Scala | Python | Data Analysis | Big Data Development | SQL | AWS | DevOps | ETL GCP & Azure

Published: January 6, 2020

Introduction

Knowledge is often defined as a model that can be updated or tweaked as new data arrives. Models are domain-specific, ranging from credit risk assessment, face recognition, maximization of quality of service, classification of pathological symptoms of disease, optimization of computer networks, and security intrusion detection to customers' online behavior and purchase history.

Machine learning problems are categorized as classification, prediction, optimization, and regression.

Classification

The purpose of classification is to extract knowledge from historical data. For instance, a classifier can be built to identify a disease from a set of symptoms. The scientist collects information regarding the body temperature (a continuous variable), congestion (a discrete variable with the values HIGH, MEDIUM, and LOW), and the actual diagnosis (flu). This dataset is used to create a model such as IF temperature > 102 AND congestion = HIGH THEN patient has the flu (probability 0.72), which doctors can use in their diagnosis.
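The rule above can be sketched in Python. This is only an illustration, not the author's implementation: the `Record` layout, the function names, the Fahrenheit unit, and the toy patient records are all hypothetical, chosen to show how a rule's probability could be estimated from historical data.

```python
from typing import NamedTuple

class Record(NamedTuple):
    temperature: float   # body temperature (continuous; assumed to be in °F)
    congestion: str      # "HIGH", "MEDIUM" or "LOW" (discrete)
    has_flu: bool        # diagnosis recorded for this patient

def rule_fires(r: Record) -> bool:
    # IF temperature > 102 AND congestion = HIGH
    return r.temperature > 102 and r.congestion == "HIGH"

def rule_confidence(history):
    """Fraction of records matching the rule that were actual flu cases."""
    matched = [r for r in history if rule_fires(r)]
    if not matched:
        return None  # the rule never fired, so no estimate is possible
    return sum(r.has_flu for r in matched) / len(matched)

# Hypothetical toy data, for illustration only.
history = [
    Record(103.1, "HIGH", True),
    Record(102.5, "HIGH", True),
    Record(104.0, "HIGH", False),
    Record(102.2, "HIGH", True),
    Record(99.0, "LOW", False),
    Record(101.0, "MEDIUM", False),
]

confidence = rule_confidence(history)
```

On this toy history the rule matches four records, three of which are flu cases, so the estimated probability plays the same role as the 0.72 in the rule above.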

Prediction

Once the model is extracted and validated against past data, it can be used to draw inferences from future data. A doctor collects symptoms from a patient, such as body temperature and nasal congestion, and anticipates the state of the patient's health.

Optimization

Some global optimization problems are intractable using traditional linear and non-linear optimization methods. Machine learning techniques improve the chances that the optimization method converges toward a solution (intelligent search). You can imagine that fighting the spread of a new virus requires optimizing a process that may evolve over time as more symptoms and cases are uncovered.

Regression

Regression is a technique closely related to classification that is particularly suitable for a continuous model. Linear (least squares), polynomial, and logistic regressions are among the most commonly used techniques to fit a parametric model, or function, y = f(x_j), to a dataset. Regression is sometimes regarded as a specialized case of classification for which the output variables are continuous instead of categorical.
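As a minimal sketch of fitting a parametric function to a dataset, the following Python snippet computes an ordinary least squares line for a single input variable. The function name `fit_line` and the sample points are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both left unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical points lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
```

Because the sample points are noise-free, the fit recovers the slope and intercept of the generating line; with real data the result is the line that minimizes the sum of squared residuals.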

Why Scala?

Like most functional languages, Scala provides developers and scientists with a toolbox to implement iterative computations that can be easily woven dynamically into a coherent dataflow. To some extent, Scala can be regarded as an extension of the popular MapReduce model for distributed computation of large amounts of data.

Among the capabilities of the language, the following features are deemed essential to machine learning and statistical analysis.

In my previous article I discussed at length the importance of Scala across the data analysis spectrum, so here let me focus on scalability.

Scalability

As seen previously, monoids and monads enable parallelization and chaining of data processing functions by leveraging the Scala higher-order methods. In terms of implementation, actors are the core elements that make Scala scalable. Actors act as coroutines, managing the underlying thread pool, and communicate by passing asynchronous messages. Distributed computing frameworks such as Akka and Spark extend the capabilities of the Scala standard library to support computation on very large datasets. Akka and Spark are described in detail in the last chapter of this book [1:3].

In a nutshell, a workflow is implemented as a sequence of activities or computational tasks. Those tasks consist of higher-order Scala methods such as flatMap, map, fold, reduce, collect, join, or filter applied to a large collection of observations. Scala allows these observations to be partitioned, with the tasks executed across a cluster of actors. Scala also supports dispatching and routing of messages between local and remote actors. Engineers can decide to execute a workflow either locally or distributed across CPU cores and servers with little or no code change.
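The pattern of partitioning observations and applying filter, map, and reduce tasks to each partition can be sketched as follows. This is a Python analogue given as a language-neutral illustration, not Scala, Akka, or Spark code: a standard-library thread pool stands in for the cluster of actors, and the function names and sample observations are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def partition(observations, n_parts):
    """Split a list into at most n_parts contiguous chunks."""
    size = max(1, -(-len(observations) // n_parts))  # ceiling division
    return [observations[i:i + size] for i in range(0, len(observations), size)]

def process_partition(chunk):
    """One task: drop negative readings, square the rest, sum them."""
    valid = filter(lambda x: x >= 0, chunk)
    squared = map(lambda x: x * x, valid)
    return reduce(lambda acc, x: acc + x, squared, 0)

def run_workflow(observations, n_workers=1):
    """Run the same task on every partition, then combine the partial results."""
    chunks = partition(observations, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(process_partition, chunks))
    return reduce(lambda a, b: a + b, partial_sums, 0)

observations = [3, -1, 4, 1, -5, 9, 2, 6]
local_result = run_workflow(observations, n_workers=1)
parallel_result = run_workflow(observations, n_workers=4)
```

Only the `n_workers` argument changes between the local and the parallel run, which mirrors the point that the workflow definition itself stays the same.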

These tasks are actually executed over multiple worker nodes that are implemented by the Scala actors. The master node exchanges messages with the workers to manage the state of the execution of the workflow as well as its reliability. High availability of these tasks is implemented through a hierarchy of supervising actors.

Taxonomy of machine learning algorithms

The purpose of machine learning is to teach computers to execute tasks without human intervention. An increasing number of applications such as genomics, social networking, advertising, or risk analysis generate a very large amount of data that can be analyzed or mined to extract knowledge or provide insight into a process, a customer, or an organization. Ultimately, machine learning algorithms consist of identifying and validating models to optimize a performance criterion using historical, present, and future data.

Data mining is the process of extracting or identifying patterns in a dataset.

Unsupervised learning

The goal of unsupervised learning is to discover patterns of regularities and irregularities in a set of observations. The process, known as density estimation in statistics, is broken down into two categories: discovery of data clusters and discovery of latent factors. The methodology consists of processing input data to understand patterns, similar to the natural learning process in infants or animals. Unsupervised learning does not require labeled data and is therefore easy to implement and execute, because no expertise is needed to validate an output.

However, it is possible to label the output of a clustering algorithm and use it for future classification.

Clustering

The purpose of data clustering is to partition a collection of data into a number of clusters or data segments. Practically, a clustering algorithm organizes observations into clusters by minimizing the distance between observations within a cluster and maximizing the distance between observations in different clusters. A clustering algorithm consists of the following steps:

1. Creating a model by making an assumption on the input data.

2. Selecting the objective function or goal of the clustering.

3. Evaluating one or more algorithms to optimize the objective function.

Data clustering is also known as data segmentation or data partitioning.
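The three steps above can be illustrated with k-means, used here as one common example rather than the only choice. In this Python sketch the assumption is that the data forms a known number of compact groups, the objective is the within-cluster sum of squared distances, and the optimization is Lloyd's iteration. The function names, the initial centroids, and the sample points are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def mean_point(cluster):
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)

def within_cluster_ss(centroids, clusters):
    """Objective: total squared distance of each point to its cluster centroid."""
    return sum(sq_dist(p, c) for c, cluster in zip(centroids, clusters) for p in cluster)

def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm, starting from the given initial centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [mean_point(c) if c else old for c, old in zip(clusters, centroids)]
    return centroids, clusters

# Hypothetical toy data: two well-separated groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
initial = [(1.0, 1.0), (8.0, 8.0)]
centroids, clusters = kmeans(points, initial)
```

The result depends on the initial centroids, so practical implementations usually try several starting points and keep the run with the lowest objective value.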

Dimension reduction

Dimension reduction techniques aim at finding the smallest but most relevant set of features that model the dataset reliably. There are many reasons for reducing the number of features or parameters in a model, from avoiding overfitting to reducing computation costs.

There are many ways to classify the different techniques used to extract knowledge from data using unsupervised learning. The following taxonomy breaks down these techniques according to their purpose, although the list is far from being exhaustive, as shown in the following diagram:
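One of the simplest ways to shrink a feature set is to drop features that barely vary, since they carry little information. The Python sketch below shows that idea only; it is not a technique named in this article, it depends on the scale of each feature, and it does not measure relevance to any target. The function name, the threshold, and the sample rows are hypothetical.

```python
from statistics import pvariance

def select_features(rows, names, min_variance=0.01):
    """Keep only the columns whose population variance exceeds min_variance."""
    columns = list(zip(*rows))  # transpose rows into columns
    kept = [i for i, col in enumerate(columns) if pvariance(col) > min_variance]
    reduced = [[row[i] for i in kept] for row in rows]
    return [names[i] for i in kept], reduced

# Hypothetical toy data: sensor_id is constant and therefore uninformative.
names = ["temperature", "sensor_id", "humidity"]
rows = [
    [20.0, 7.0, 30.0],
    [22.5, 7.0, 45.0],
    [19.0, 7.0, 50.0],
    [24.0, 7.0, 35.0],
]

kept_names, reduced_rows = select_features(rows, names)
```

Methods such as principal component analysis go further by building new features from combinations of the original ones, at the cost of more computation.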

Supervised learning

The best analogy for supervised learning is function approximation or curve fitting. In its simplest form, supervised learning attempts to extract a relation or function f: x → y from a training set {x, y}. When labeled data is available, supervised learning is generally more accurate and reliable than other learning strategies. However, a domain expert may be required to label (tag) data as a training set for certain types of problems.

Supervised machine learning algorithms can be broken into two categories:

• Generative models

• Discriminative models

Generative models

To simplify the description of statistical formulas, we adopt the following convention: the probability of an event X is the same as the probability that the discrete random variable X has the value x, p(X) = p(X=x). The notation for joint probability (resp. conditional probability) becomes p(X, Y) = p(X=x, Y=y) (resp. p(X|Y) = p(X=x | Y=y)).
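A generative approach typically estimates the joint distribution p(X, Y) and derives conditional probabilities from it. As a small Python sketch of the notation above, the following estimates joint and conditional probabilities as relative frequencies. The observations and function names are hypothetical, and exact fractions are used so the relationships hold without rounding.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical toy observations: (x = congestion level, y = diagnosis).
observations = [
    ("HIGH", "flu"), ("HIGH", "flu"), ("HIGH", "cold"),
    ("LOW", "cold"), ("LOW", "cold"), ("LOW", "flu"),
    ("HIGH", "flu"), ("LOW", "cold"),
]

n = len(observations)
joint_counts = Counter(observations)
x_counts = Counter(x for x, _ in observations)
y_counts = Counter(y for _, y in observations)

def p_joint(x, y):
    """p(X=x, Y=y), estimated as a relative frequency."""
    return Fraction(joint_counts[(x, y)], n)

def p_x(x):
    return Fraction(x_counts[x], n)

def p_y(y):
    return Fraction(y_counts[y], n)

def p_x_given_y(x, y):
    """p(X=x | Y=y) = p(X=x, Y=y) / p(Y=y)."""
    return p_joint(x, y) / p_y(y)

def p_y_given_x(y, x):
    """p(Y=y | X=x) = p(X=x, Y=y) / p(X=x)."""
    return p_joint(x, y) / p_x(x)
```

For example, `p_joint("HIGH", "flu")` is 3/8 on this data, and both conditionals satisfy Bayes' rule: p(X|Y) p(Y) = p(Y|X) p(X).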

Discriminative models

Contrary to generative models, discriminative models compute the conditional probability p(Y|X) directly, using the same algorithm for training and classification. Generative and discriminative models have their respective advantages and drawbacks. Novice data scientists learn to match the appropriate algorithm to each problem through experimentation.
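Logistic regression is a common example of a discriminative model: it fits p(Y|X) directly without modeling how X itself is distributed. The following Python sketch trains a one-feature logistic model by plain gradient descent. The function names, learning rate, number of epochs, and toy data are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, learning_rate=0.1, epochs=1000):
    """Fit p(Y=1 | X=x) = sigmoid(w * x + b) by gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for each observation under the current parameters.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradient of the mean log-loss with respect to w and b.
        w -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        b -= learning_rate * sum(errors) / n
    return w, b

# Hypothetical toy data: negative x values are class 0, positive are class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]
w, b = train_logistic(xs, ys)

def predict(x):
    """Estimated conditional probability p(Y=1 | X=x)."""
    return sigmoid(w * x + b)
```

The same fitted function serves both roles: its parameters are learned during training, and evaluating it on a new x gives the class probability used for classification.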

Here is a brief guideline describing which type of models makes sense according to the objective or criteria of the project:

We can further refine the taxonomy of supervised learning algorithms by segregating between sequential and random variables for generative models and breaking down discriminative methods as applied to continuous processes (regression) and discrete processes (classification):

Reinforcement learning

Reinforcement learning is not as well understood as supervised and unsupervised learning outside the realms of robotics or game strategy. However, since the 90s, genetic-algorithms-based classifiers have become increasingly popular to solve problems that require collaboration with a domain expert. For some types of applications, reinforcement learning algorithms output a set of recommended actions for the adaptive system to execute. In their simplest form, these algorithms compute or estimate the best course of action. Most complex systems based on reinforcement learning establish and update policies that can be vetoed by an expert.

The foremost challenge facing developers of reinforcement learning systems is that the recommended action or policy may depend on partially observable states, which raises the question of how to deal with uncertainty.

This is a brief overview of machine learning algorithms with a suggested taxonomy. There are almost as many ways to introduce machine learning as there are data and computer scientists. We encourage you to browse through the list of references at the end of the book and find the documentation appropriate to your level of interest and understanding.
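To make "estimating the best course of action" concrete, here is a minimal Python sketch of tabular Q-learning on a toy corridor. It is heavily simplified relative to the systems described above: the state is fully observable, the environment is deterministic, there is no expert veto, and every state-action pair is updated in fixed sweeps instead of being sampled from experience. The environment, constants, and names are all hypothetical.

```python
N_STATES = 5          # corridor cells 0..4; cell 4 is the goal (terminal)
ACTIONS = (-1, +1)    # move left or move right
GAMMA = 0.9           # discount factor
ALPHA = 0.5           # learning rate

def step(state, action):
    """Deterministic environment: returns (next_state, reward)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def learn_q(sweeps=200):
    """Apply the Q-learning update to every (state, action) pair, sweep after sweep."""
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(sweeps):
        for s in range(N_STATES - 1):  # the terminal state is never updated
            for a in ACTIONS:
                s2, r = step(s, a)
                best_next = max(q[(s2, a2)] for a2 in ACTIONS)
                q[(s, a)] += ALPHA * (r + GAMMA * best_next - q[(s, a)])
    return q

q = learn_q()
# Greedy policy: in each non-terminal state, pick the action with the highest value.
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)}
```

The learned policy moves right in every cell, and the value of doing so decays by the discount factor with each extra step away from the goal.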
