Issues to pay attention to when performing PCA in Spark, Python and R

Recently, I was running PCA in Spark on a sparse matrix with millions of rows. To make sure everything was right, I first started with a small dataset and compared the results from Spark, Python and R, which led to this blog post.

In this post, I cover the data preprocessing required, how to implement PCA in R, Python and Spark, and how to translate the results between them.

These are my final remarks:

  1. Unlike in R, in Spark and in Python you have to center and optionally standardize the data yourself before applying PCA.
  2. In Spark, unlike in R and Python, the list of input columns has to be assembled into a single vector column.
  3. The loadings from R and Spark have the same layout, but in Python they are transposed. Make sure you know which rows and columns you are accessing in the loading matrix; otherwise you may end up working with the wrong data.
  4. If your variables are in the same unit and on a similar scale, use the covariance matrix for your PCA (center the data).
  5. If your columns are in different units and on different scales, use the correlation matrix (center the data and then standardize it so that each column has a mean of zero and a standard deviation of one).
  6. Remember that standardizing the data may lead to a loss of information.
  7. Once you center and/or standardize your data, the matrix is no longer sparse.
  8. If you are dealing with millions of rows like me, Spark will be your best friend.
  9. If you are working with big data and want to apply some algorithm, first understanding the mechanics on a small dataset can save you time and help you avoid mistakes.
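
To make remarks 1, 3, 4, 5 and 7 concrete, here is a minimal sketch in plain NumPy. It is not the code from the article: it computes PCA through an SVD of the preprocessed data rather than through any particular library's PCA class, and the function name `pca_svd` and the toy data are hypothetical, chosen only for illustration.

```python
import numpy as np


def pca_svd(X, standardize=False):
    """PCA via SVD of the centered (optionally standardized) data.

    Hypothetical helper for illustration. Returns the components as
    rows (the layout NumPy's SVD gives) and the explained variances.
    """
    X = np.asarray(X, dtype=float)
    Xp = X - X.mean(axis=0)                # centering -> covariance-based PCA
    if standardize:
        Xp = Xp / X.std(axis=0, ddof=1)    # unit variance -> correlation-based PCA
    _, s, Vt = np.linalg.svd(Xp, full_matrices=False)
    explained_variance = s ** 2 / (X.shape[0] - 1)
    return Vt, explained_variance


# Assumed toy data: three columns on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 100.0])

components_cov, var_cov = pca_svd(X)                     # covariance PCA
components_cor, var_cor = pca_svd(X, standardize=True)   # correlation PCA

# Rows of Vt are components. Transposing gives the
# variables-by-components layout (one column per component).
loadings_cov = components_cov.T

# Centering destroys sparsity: zeros in a column with a nonzero mean
# become nonzero after the mean is subtracted.
X_sparse = np.zeros((5, 4))
X_sparse[0, 1] = 3.0
X_sparse[2, 3] = 1.0
X_centered = X_sparse - X_sparse.mean(axis=0)
nnz_before = np.count_nonzero(X_sparse)
nnz_after = np.count_nonzero(X_centered)
```

With unscaled columns, the covariance-based result is dominated by the largest-scale column, while the standardized version treats the columns equally. The signs of individual components can differ between tools, so compare loadings up to a sign flip.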

You can read the article here.

Enrico Spada

Tech Lead | Data @ Trade Republic

6 years ago

Fisseha Berhane, PhD thank you, useful article :) I have one question, maybe you can give me a hint: when you perform PCA and then model your data, how can you make sense of the results? How can you explain a model based on principal components that are linear combinations of maybe hundreds of features?
