Feature Clustering: A Simple Solution to Many Machine Learning Problems
Feature clustering is an unsupervised machine learning technique that separates the features of a dataset into homogeneous groups. In short, it is a clustering procedure, but performed on the features rather than on the observations. Such techniques typically rely on a similarity metric that measures how close two features are to each other. In this article, I use the absolute value of the correlation between two features. An immediate consequence is that the technique is scale-invariant: it does not depend on the units of measurement in your dataset. Of course, in some instances, it makes sense to transform the data with a logit or log transform before applying the technique, to turn a multiplicative setting into an additive one.
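To make the metric concrete, here is a minimal sketch of the similarity computation. The column names and random data are placeholders for illustration, not the Kaggle dataset used later in the article.

```python
import numpy as np
import pandas as pd

# Toy dataset with hypothetical feature names; substitute your own data
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)),
                  columns=["glucose", "bmi", "age", "insulin"])

# Similarity metric: absolute value of the pairwise Pearson correlation.
# It is scale-invariant: rescaling any column leaves it unchanged.
similarity = df.corr().abs()
print(similarity.round(2))
```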
The technique can also support traditional clustering performed on the observations. In that case, it is useful in the presence of wide data: a large number of features but a small number of observations, sometimes smaller than the number of features, as in clinical trials. When applied to features, it allows you to break down a high-dimensional problem (the dimension being the number of features) into a number of low-dimensional problems. It can accelerate many algorithms, in particular those whose computing time grows exponentially with the dimension, while avoiding issues related to the “curse of dimensionality”. In fact, it can be used as a data reduction technique, where feature clusters with a low average correlation (in absolute value) are removed from the dataset, as sketched below.
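As an illustration of that data reduction step, the sketch below drops any feature cluster whose average within-cluster absolute correlation falls below a threshold. The cluster list and the 0.3 threshold are assumptions for the example, not values from the article.

```python
def reduce_features(df, clusters, threshold=0.3):
    """Keep only clusters with high average within-cluster |correlation|.

    `clusters` is a list of lists of column names (assumed precomputed).
    """
    sim = df.corr().abs()
    keep = []
    for features in clusters:
        if len(features) == 1:
            mean_sim = 0.0  # a singleton has no within-cluster correlation
        else:
            block = sim.loc[features, features].to_numpy()
            n = len(features)
            # Average of the off-diagonal entries of the |corr| block
            # (the diagonal contributes n ones, which we subtract out)
            mean_sim = (block.sum() - n) / (n * (n - 1))
        if mean_sim >= threshold:
            keep.extend(features)
    return df[keep]

# Hypothetical usage with the toy df from the previous sketch:
# reduced = reduce_features(df, [["glucose", "insulin"], ["age"], ["bmi"]])
```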
Applications are numerous. In my case, I used it in the context of synthetic data generation, especially with generative adversarial networks (GAN). The idea is to identify clusters of related features, apply a separate GAN to each of them, then reassemble the synthetizations into one dataset. The benefits are faster processing with little to no loss in terms of capturing the full correlation structure present in the dataset. It also increases the robustness and explainability of the method, making it less volatile across the successive epochs of GAN training.
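The sketch below shows the shape of that pipeline under one loud assumption: the per-cluster GAN is replaced by a trivial stand-in (bootstrap resampling of rows), since the GAN itself is beyond the scope of a short example. In practice you would train a separate generative model on each feature cluster and sample from it instead.

```python
import numpy as np
import pandas as pd

def synthesize(block: pd.DataFrame, n: int, rng) -> pd.DataFrame:
    # Hypothetical stand-in for a per-cluster GAN: resample rows
    # with replacement, preserving within-cluster correlations
    idx = rng.integers(0, len(block), size=n)
    return block.iloc[idx].reset_index(drop=True)

def synthesize_by_cluster(df, clusters, n, seed=0):
    """Synthesize each feature cluster independently, then reassemble."""
    rng = np.random.default_rng(seed)
    parts = [synthesize(df[features], n, rng) for features in clusters]
    return pd.concat(parts, axis=1)
```

Note the design trade-off this makes explicit: cross-cluster correlations are not modeled, which is acceptable precisely because the clusters were built so that between-cluster correlations are weak.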
I summarize the feature clustering results in section 2. I used the technique on a Kaggle dataset with 9 features, consisting of medical measurements. I offer two Python implementations: one based on hierarchical clustering in section 3.1, and one based on connected components (a fundamental graph theory algorithm) in section 3.2. In addition, the technique leads to a simple visualization of the 9-dimensional dataset, with one scatterplot and two colors: orange for diabetes and blue for non-diabetes. Here diabetes is the binary response feature. This is because the largest feature cluster contains only 3 features, and one of them is the response. In any well-designed experiment, you would expect the response to always be in a large feature cluster.
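For a flavor of the connected-components approach, the snippet below is a sketch in the same spirit as section 3.2, not the article's exact code. It treats each feature as a graph node, adds an edge whenever the absolute correlation between two features exceeds a threshold, and returns each connected component as one feature cluster. The 0.3 threshold is an assumption.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components

def cluster_features(df, threshold=0.3):
    sim = df.corr().abs().to_numpy()
    # Adjacency matrix: edge between features with |corr| >= threshold
    adjacency = (sim >= threshold).astype(int)
    np.fill_diagonal(adjacency, 0)  # no self-loops
    n, labels = connected_components(adjacency, directed=False)
    # Group column names by component label
    clusters = [[] for _ in range(n)]
    for feature, label in zip(df.columns, labels):
        clusters[label].append(feature)
    return clusters
```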
Access and download the free article and Python code from this link.