
Data distributions where K-means clustering fails; can DBSCAN be a solution? Examples with R, Python and Spark

Fisseha Berhane, PhD

Senior Principal Data Scientist

Published: November 4, 2018

K-means clustering tends to work well when each cluster is roughly spherical (similar spread in every direction), all variables have similar variance, and the clusters contain roughly equal numbers of observations. Can DBSCAN be a solution for datasets that lack these properties?

Let's see examples with R, Python and Spark.
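
As a minimal sketch of the question above (this is not the article's own code), the snippet below compares K-means and DBSCAN on scikit-learn's synthetic two-moons data, whose clusters are clearly not spherical. The dataset, the `eps=0.2` and `min_samples=5` settings, and the use of the adjusted Rand index as the score are all assumptions chosen for illustration.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-moons: an assumed toy dataset with non-spherical clusters.
X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)

# K-means assigns each point to the nearest centroid, so it splits the
# plane with a straight boundary that cuts across both moons.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN grows clusters through dense neighborhoods, so it can follow
# the curved shape. eps and min_samples are illustrative, untuned values.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping.
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)

print(f"K-means ARI: {ari_kmeans:.2f}")
print(f"DBSCAN  ARI: {ari_dbscan:.2f}")
```

On data like this, DBSCAN should recover the two moons far better than K-means. That said, DBSCAN's results depend heavily on `eps` and `min_samples`, and it can struggle when clusters have very different densities, so it is not a drop-in fix for every dataset.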

Article is available here

Nargess Memarsadeghi, PhD

Associate Branch Head

6 years ago

Nice. How would it compare with IsoData/IsoClus, in which the algorithm adaptively adjusts the final number of clusters and one can specify heuristics such as the minimum number of points needed to establish a cluster, the maximum standard deviation, etc.?

Debjani Ghatak

Lecturer, Loyola University Chicago

6 years ago

Interesting and useful.

Amin Dezfuli

NASA Climate Scientist | Consultant | Academic | Communicator | Children’s Author

6 years ago

Interesting! I'd be curious to see the comparison with hierarchical clustering methods, particularly for applications like climate regionalization.

1 reply

Mulugeta Gebregziabher

Professor at Medical University of South Carolina

6 years ago

Excellent demonstration! How would these procedures perform when all the items/variables are categorical (e.g., binary)? Looking forward to reading your article closely!

1 reply


