Part 2 - Keep it Simple: Machine Learning & Algorithms for Big Boys
Dr. Dinesh Chandrasekar (DC)
Chief Strategy Officer & Country Head, India, Centific AI | Nasscom Deep Tech, Telangana AI Mission & HYSEA - Mentor & Advisor | Alumni of Hitachi, GE & Citigroup | DeepTech Evangelist | Author & Investor | Be Passionate
Part 1 of this article: Click
The picture below summarizes the machine learning algorithms in one picture.
Parameters
Parameters are the knobs a data scientist gets to turn when setting up an algorithm. They are numbers that affect the algorithm's behavior, such as error tolerance or number of iterations, or options between variants of how the algorithm behaves. The training time and accuracy of the algorithm can sometimes be quite sensitive to getting just the right settings. Typically, algorithms with large numbers of parameters require the most trial and error to find a good combination.
Sweeping every combination of parameter values is a great way to make sure you've spanned the parameter space, but the time required to train a model increases exponentially with the number of parameters. The upside is that having many parameters typically indicates that an algorithm has greater flexibility, and it can often achieve very good accuracy, provided you can find the right combination of parameter settings. A quick sketch of this trade-off follows below.
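To make the sweep idea concrete, here is a minimal sketch of my own using scikit-learn's GridSearchCV (the article itself works with Azure Machine Learning, so treat this purely as an illustration): every extra parameter you sweep multiplies the number of models that have to be trained.

```python
# Minimal parameter-sweep sketch with scikit-learn (illustrative only).
# Each additional swept parameter multiplies the number of models to train.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_grid = {
    "n_estimators": [50, 100, 200],   # 3 values
    "learning_rate": [0.01, 0.1],     # x 2 values
    "max_depth": [2, 3, 4],           # x 3 values -> 18 combinations in total
}

search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, cv=3)
search.fit(X, y)                      # trains 18 combinations x 3 folds = 54 models
print(search.best_params_, search.best_score_)
```

Adding a fourth swept parameter with, say, four values would quadruple the work again, which is exactly why sweeps get expensive so quickly.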
Number of features
For certain types of data, the number of features can be very large compared to the number of data points. This is often the case with genetic or textual data. The large number of features can bog down some learning algorithms, making training time infeasibly long.
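To put a rough number on it (a toy sketch of my own, using scikit-learn rather than Azure Machine Learning), turning even a few documents into bag-of-words features already produces more features than data points:

```python
# Illustrative sketch: a handful of documents can yield more features than samples.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "machine learning algorithms for classification and regression",
    "support vector machines handle high dimensional text data well",
    "decision forests average many trees to avoid overfitting",
]

X = TfidfVectorizer().fit_transform(docs)
print(X.shape)  # (3 documents, ~20 distinct terms); real corpora easily reach 10^5+ features
```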
Algorithms
A brief look at some of these algorithms helps us understand a bit more about their feasibility in real-life use cases.
Linear regression
As mentioned previously, linear regression fits a line (or plane, or hyperplane) to the data set. It's a workhorse, simple and fast, but it may be overly simplistic for some problems.
Data with a linear trend
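Fitting that line takes only a couple of calls in scikit-learn (my own illustrative sketch, not from the article):

```python
# Fit a straight line y ~= a*x + b to noisy data with a linear trend (illustrative sketch).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))               # one feature
y = 2.5 * X.ravel() + 1.0 + rng.normal(0, 1, 100)   # linear trend plus noise

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)             # should come out close to 2.5 and 1.0
```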
Logistic regression
Although it confusingly includes 'regression' in the name, logistic regression is actually a powerful tool for two-class and multiclass classification. It's fast and simple. The fact that it uses an 'S'-shaped curve instead of a straight line makes it a natural fit for dividing data into groups. Logistic regression gives linear class boundaries, so when you use it, make sure a linear approximation is something you can live with.
A logistic regression fit to two-class data with just one feature; the class boundary is the point at which the logistic curve is equally close to both classes
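Here is a hedged one-feature sketch of the same idea in scikit-learn (an illustration of my own, not the article's tooling); the boundary sits where the predicted probability crosses 0.5:

```python
# Two-class logistic regression with a single feature (illustrative sketch).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)]).reshape(-1, 1)
y = np.array([0] * 50 + [1] * 50)

clf = LogisticRegression().fit(X, y)

# The linear class boundary is where the S-curve gives probability 0.5.
boundary = -clf.intercept_[0] / clf.coef_[0, 0]
print(boundary)                       # near 0, halfway between the two clusters
print(clf.predict([[-3.0], [3.0]]))   # -> [0 1]
```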
Trees, forests, and jungles
Decision forests (regression, two-class, and multiclass), decision jungles (two-class and multiclass), and boosted decision trees (regression and two-class) are all based on decision trees, a foundational machine learning concept. There are many variants of decision trees, but they all do the same thing—subdivide the feature space into regions with mostly the same label. These can be regions of consistent category or of constant value, depending on whether you are doing classification or regression.
A decision tree subdivides a feature space into regions of roughly uniform values
Because a feature space can be subdivided into arbitrarily small regions, it's easy to imagine dividing it finely enough to have one data point per region. This is an extreme example of overfitting. In order to avoid this, a large set of trees are constructed with special mathematical care taken that the trees are not correlated. The average of this "decision forest" is a tree that avoids overfitting. Decision forests can use a lot of memory. Decision jungles are a variant that consumes less memory at the expense of a slightly longer training time.
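A quick sketch of that contrast using scikit-learn's random forest (my own illustration; Azure's decision forest module differs in detail): a single unconstrained tree can memorize the training set, while the averaged forest generalizes better.

```python
# A single unconstrained tree can carve out one region per point; a forest of
# de-correlated trees averages that behaviour away (illustrative sketch).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, flip_y=0.1, random_state=0)

tree = DecisionTreeClassifier(random_state=0)             # free to overfit
forest = RandomForestClassifier(n_estimators=200, random_state=0)

print(cross_val_score(tree, X, y, cv=5).mean())           # typically the lower score
print(cross_val_score(forest, X, y, cv=5).mean())         # typically the higher score
```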
Boosted decision trees avoid overfitting by limiting how many times they can subdivide and how few data points are allowed in each region. The algorithm constructs a sequence of trees, each of which learns to compensate for the error left by the tree before. The result is a very accurate learner that tends to use a lot of memory.
Fast forest quantile regression is a variation of decision trees for the special case where you want to know not only the typical (median) value of the data within a region, but also its distribution in the form of quantiles.
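Both ideas can be sketched with scikit-learn's gradient boosting (again my own stand-in for Azure's boosted decision tree and fast forest quantile regression modules): each tree corrects the errors of the one before it, and with a quantile loss the same machinery estimates the spread of the data rather than just its centre.

```python
# Boosted trees fit a sequence of small correctors; with a quantile loss they
# estimate the 10th, 50th and 90th percentiles of y given x (illustrative sketch).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X.ravel()) + rng.normal(0, 0.3, 500)

models = {
    q: GradientBoostingRegressor(loss="quantile", alpha=q,
                                 n_estimators=200, max_depth=2).fit(X, y)
    for q in (0.1, 0.5, 0.9)
}

x_new = [[5.0]]
for q, m in models.items():
    print(q, m.predict(x_new)[0])   # lower, median, and upper estimates at x = 5
```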
Neural networks and perceptrons
Neural networks are brain-inspired learning algorithms covering multiclass, two-class, and regression problems. They come in an infinite variety, but the neural networks within Machine Learning are all of the form of directed acyclic graphs. That means that input features are passed forward (never backward) through a sequence of layers before being turned into outputs. In each layer, inputs are weighted in various combinations, summed, and passed on to the next layer. This combination of simple calculations results in the ability to learn sophisticated class boundaries and data trends, seemingly by magic. Many-layered networks of this sort perform the "deep learning" that fuels so much tech reporting and science fiction.
This high performance doesn't come for free, though. Neural networks can take a long time to train, particularly for large data sets with lots of features. They also have more parameters than most algorithms, which means that parameter sweeping expands the training time a great deal. And for those overachievers who wish to specify their own network structure, the possibilities are inexhaustible.
The boundaries learned by neural networks can be complex and irregular
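For a feel of those irregular boundaries, here is a small feed-forward network on a deliberately non-linear problem (a hedged sketch with scikit-learn; real deep learning uses far larger, specialised networks and frameworks):

```python
# A small feed-forward network learning a curved class boundary (illustrative sketch).
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(20, 20),   # two hidden layers
                    max_iter=2000, random_state=0)
net.fit(X, y)
print(net.score(X, y))   # fits the curved "two moons" boundary a linear model cannot
```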
The two-class averaged perceptron is neural networks' answer to skyrocketing training times. It uses a network structure that gives linear class boundaries. It is almost primitive by today's standards, but it has a long history of working robustly and is small enough to learn quickly.
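scikit-learn does not ship an averaged perceptron, but its plain Perceptron is the nearest off-the-shelf analogue and shows the same trade: a linear boundary in exchange for very fast training (my own hedged sketch, not the Azure module).

```python
# A plain perceptron: linear class boundary, very fast to train (illustrative sketch;
# this is not the averaged variant, just the closest scikit-learn analogue).
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
clf = Perceptron(max_iter=100, random_state=0).fit(X, y)
print(clf.score(X, y))
```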
SVMs
Support vector machines (SVMs) find the boundary that separates classes by as wide a margin as possible. When the two classes can't be cleanly separated, the algorithms find the best boundary they can. As written in Machine Learning, the two-class SVM does this with a straight line only. (In SVM-speak, it uses a linear kernel.) Because it makes this linear approximation, it is able to run fairly quickly. Where it really shines is with feature-intensive data, like text or genomic data. In these cases SVMs are able to separate classes more quickly and with less overfitting than most other algorithms, in addition to requiring only a modest amount of memory.
A typical support vector machine class boundary maximizes the margin separating two classes
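A linear SVM on sparse text features looks like this in scikit-learn (an illustrative toy example of my own, not the Azure module): fast and memory-light even when the feature count is huge.

```python
# A linear SVM on sparse text features (illustrative sketch with toy data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = ["the match ended in a late goal",
        "the striker scored twice in the final",
        "the court dismissed the appeal",
        "the judge ruled on the new evidence"]
labels = ["sports", "sports", "legal", "legal"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(docs, labels)
print(model.predict(["the striker scored a goal"]))   # -> ['sports']
```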
Bayesian methods
Bayesian methods have a highly desirable quality: they avoid overfitting. They do this by making some assumptions beforehand about the likely distribution of the answer. Another byproduct of this approach is that they have very few parameters. Machine Learning has Bayesian algorithms for both classification (Two-class Bayes point machine) and regression (Bayesian linear regression). Note that these assume that the data can be split or fit with a straight line.
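scikit-learn's BayesianRidge is one concrete example of Bayesian linear regression (a stand-in of my own for the modules named above); notice how little there is to hand-tune, and that predictions come with an uncertainty estimate.

```python
# Bayesian linear regression: the priors act as built-in regularisation and
# there is almost nothing to hand-tune (illustrative sketch).
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(3)
X = rng.uniform(0, 5, size=(60, 1))
y = 1.8 * X.ravel() + 0.5 + rng.normal(0, 0.5, 60)

model = BayesianRidge().fit(X, y)
mean, std = model.predict([[2.0]], return_std=True)   # prediction plus its uncertainty
print(mean[0], std[0])
```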
I would definitely recommend checking out some more articles on machine learning and related data science concepts. I am currently exploring Microsoft Azure Machine Learning and will share more on it in the next few weeks. What's in store for us? A quick preview picture is below.
Regards
Dinesh Chandrasekar (DC)