Issues to pay attention to when performing PCA in Spark, Python and R

Fisseha Berhane, PhD

Senior Principal Data Scientist

Published: October 8, 2018

Recently, I was running PCA in Spark on a sparse matrix with millions of rows. To make sure everything was right, I first started with a small dataset and compared the results from Spark, Python and R, which led to this blog post.

In this post, I will cover the data preprocessing required, how to implement PCA in R, Python and Spark, and how to translate the results between them.

These are my final remarks:

Unlike in R, in Spark and Python you have to center and optionally standardize the data yourself before applying PCA.
In Spark, unlike in R and Python, the list of input columns has to be assembled into a single vector column.
The loadings from R and Spark have the same layout, but in Python they are transposed. So you have to be clear about which rows and columns you are accessing in the loading matrix; otherwise you may end up working with the wrong data.
If your variables are in the same unit and on a similar scale, use the covariance matrix for your PCA (center the data).
If your columns are in different units and have different scales, use the correlation matrix (center the data and then standardize it so that each column has a mean of zero and a standard deviation of one).
Remember that standardizing the data may lead to loss of information.
Once you center and/or standardize your data, the matrix is no longer sparse.
If you are dealing with millions of rows like me, Spark will be your best friend.
If you are working with big data and want to apply some algorithm, first understanding the mechanics using small data could save you time and help you avoid mistakes.
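
The remarks above on centering, standardizing, loading layout and sparsity can be tried on a small dataset with a minimal NumPy sketch like the one below. It is not the code from the article: the helper `pca_loadings`, the toy data and all variable names are hypothetical, and the PCA is done by a plain eigendecomposition rather than by any of the three libraries. The only library facts assumed are that R's `prcomp` (`rotation`) and Spark's `PCAModel.pc` store one feature per row, while scikit-learn's `components_` (taken here as the "Python" implementation) stores one component per row.

```python
import numpy as np


def pca_loadings(X, standardize=False):
    """PCA by eigendecomposition of the covariance or correlation matrix.

    Returns (loadings, explained_variance). The loadings have shape
    (n_features, n_components): one column per principal component.
    """
    X = np.asarray(X, dtype=float)
    Z = X - X.mean(axis=0)               # centering -> covariance-matrix PCA
    if standardize:
        Z = Z / X.std(axis=0, ddof=1)    # unit variance -> correlation-matrix PCA
    cov = Z.T @ Z / (Z.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # sort components by variance, descending
    return eigvecs[:, order], eigvals[order]


# Hypothetical toy data: three columns on very different scales.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.normal(0.0, 1.0, n),
    rng.normal(0.0, 1000.0, n),
    rng.normal(5.0, 10.0, n),
])

loadings_cov, var_cov = pca_loadings(X)                    # covariance PCA
loadings_cor, var_cor = pca_loadings(X, standardize=True)  # correlation PCA

# Features-as-rows layout (R, Spark) versus components-as-rows layout
# (scikit-learn): the same numbers, transposed.
components_as_rows = loadings_cov[:, :2].T   # first two components, shape (2, 3)

# Centering fills in the zeros of a mostly-zero matrix.
mostly_zero = np.zeros((100, 3))
mostly_zero[::10] = 1.0
centered = mostly_zero - mostly_zero.mean(axis=0)

print(np.round(loadings_cov[:, 0], 3))   # first covariance component
print(np.round(loadings_cor[:, 0], 3))   # first correlation component
print(np.count_nonzero(mostly_zero), np.count_nonzero(centered))
```

With unscaled columns, the first covariance component is almost entirely the large-scale column, which is why the choice between the covariance and the correlation matrix matters. Component signs are arbitrary, so loadings from different tools may differ by a factor of -1 per component.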

You can read the article here.

Enrico Spada

Tech Lead | Data @ Trade Republic

6 years ago

Fisseha Berhane, PhD thank you, useful article :) I have one question, maybe you can give me a hint: when you perform PCA and then model your data, how can you make sense of the results? How can you explain a model based on principal components which are linear combinations of maybe hundreds of features?
