Which one to pick? GPU Computing vs Apache Spark to scale your next Big Task


Parallel computing is a type of computation in which calculations or the execution of processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. Both Apache Spark and GPU computing provide this capability and power of parallel computing. I have utilized the power of both at different times while trying to solve business problems.

In this blog post, I intend to explain how, when given a use case, I choose between the two, or how I break down the problem and decide which part to solve with which framework, Spark or the GPU. This will be a short post that comes directly to my rule of selection, drawn from personal experience. Do reach out to me if you have any use case at work that you are trying to solve and need guidance. I will be more than happy to help.

1. Brief about Apache Spark and GPU Computing

Apache Spark

Apache Spark is a lightning-fast cluster computing technology designed for fast computation. It is based on Hadoop MapReduce and extends the MapReduce model to efficiently support more types of computations, including interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.
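
To make this concrete, here is a minimal PySpark sketch (the dataset path and column names are hypothetical, not from any specific project) showing how a cached DataFrame lets several computations reuse data held in executor memory instead of re-reading it from disk.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-vs-gpu-demo").getOrCreate()

# Read a large dataset that is partitioned across the cluster (hypothetical path).
df = spark.read.parquet("hdfs:///data/events.parquet")

df.cache()  # keep partitions in executor memory for repeated access

# Two different computations reuse the same cached data instead of re-reading it.
daily_counts = df.groupBy("event_date").count()
top_users = (df.groupBy("user_id")
               .agg(F.count("*").alias("n"))
               .orderBy(F.desc("n"))
               .limit(10))

daily_counts.show()
top_users.show()
```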

GPU Computing

GPU computing is the use of a GPU (graphics processing unit) as a co-processor to accelerate CPUs for general-purpose scientific and engineering computing.

A CPU consists of four to eight cores, while a GPU consists of hundreds of smaller cores. Together, they work through the data in the application. This massively parallel architecture is what gives the GPU its high compute performance, and a number of GPU-accelerated applications provide an easy way to tap into high-performance computing (HPC).
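
As a rough illustration of what offloading work to a GPU looks like in practice, here is a small sketch assuming CuPy is installed and a CUDA-capable GPU is available; the same element-wise expression is evaluated across the GPU's many cores instead of the CPU's handful of cores.

```python
import numpy as np
import cupy as cp

n = 10_000_000
x_cpu = np.random.rand(n).astype(np.float32)

# CPU version: element-wise math on a few cores (NumPy often uses just one).
y_cpu = np.sqrt(x_cpu) * 2.0 + 1.0

# GPU version: move the data to device memory, run the same expression there.
x_gpu = cp.asarray(x_cpu)
y_gpu = cp.sqrt(x_gpu) * 2.0 + 1.0

# Bring the result back to the host and check it matches the CPU result.
assert np.allclose(y_cpu, cp.asnumpy(y_gpu), atol=1e-5)
```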

2. Rule of Thumb for selection of the Framework

The rules that I use for the selection of the framework are -

  1. Amount of Data
  2. Number of Computations

Amount of Data

If the amount of data is large, for example if I have to process the whole Wikipedia dump to filter the profiles of sportspersons or movie celebrities, I will go with Apache Spark. I have a blog post doing that. Similarly, if I have to run a machine learning model for multiple markets, I will go with Apache Spark.
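
To illustrate this "lots of data, simple computation" case, here is a hedged PySpark sketch of that kind of filtering job; the input location, column names, and keyword list are purely illustrative and not taken from the original Wikipedia post.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("wiki-filter").getOrCreate()

# Hypothetical location of pre-parsed Wikipedia pages with a "text" column.
pages = spark.read.json("s3://my-bucket/wikipedia/pages.json")

keywords = ["footballer", "cricketer", "athlete", "actor", "actress"]
pattern = "|".join(keywords)

# Each partition is filtered independently on a different executor,
# so the work scales out with the size of the cluster.
profiles = pages.filter(F.lower(F.col("text")).rlike(pattern))
profiles.write.parquet("s3://my-bucket/wikipedia/celebrity_profiles")
```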

Number of Computations

If the amount of data is not the caveat but the number of operations or computations is, I will prefer the GPU.
For example, suppose I have k-dimensional embeddings of n items and I have to find the top-3 most similar items for each item in the list. The task requires comparing each item with every other item using a k-dimensional cosine similarity, so its complexity is O(n²k). If there are 1 million items and a 100-dimensional embedding for each item, the number of computations to be performed is about 10¹⁴ (n²k = 10⁶ × 10⁶ × 100 = 10¹⁴). But finding the top-3 items for one item is independent of the others, so I can utilise a GPU for it. I have a blog post doing that.
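
To make this compute-heavy case concrete, here is a rough sketch using CuPy (one possible choice; the original GPU post may have used a different stack) that finds the top-3 most similar items per item via cosine similarity. The sizes are scaled down from the 1-million-item example, and queries are processed in chunks so the similarity matrix never has to hold all n² entries at once.

```python
import cupy as cp

n, k = 50_000, 100           # number of items and embedding dimension (illustrative sizes)
emb = cp.random.rand(n, k, dtype=cp.float32)

# Normalise rows once so cosine similarity reduces to a matrix multiply.
emb /= cp.linalg.norm(emb, axis=1, keepdims=True)

chunk = 1_000                # how many query items to process per batch
top3 = cp.empty((n, 3), dtype=cp.int64)

for start in range(0, n, chunk):
    q = emb[start:start + chunk]
    sims = q @ emb.T                                   # (chunk, n) cosine similarities
    rows = sims.shape[0]
    # Mask out each item's similarity with itself so it never appears in its own top-3.
    sims[cp.arange(rows), cp.arange(start, start + rows)] = -2.0
    # argpartition finds the 3 largest entries per row without a full sort.
    top3[start:start + rows] = cp.argpartition(sims, n - 3, axis=1)[:, -3:]

print(top3[:5])
```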

3. Conclusion

In this post, I covered, rather naively, my rule of thumb for selecting the right parallel computing framework. Whenever I mentioned GPU computing, all I meant was one or two GPUs and not a GPU cluster. There are many other important factors that were not considered, in order to keep the post simple. With advancements in GPU and Spark technology, other approaches, such as Spark-based GPU clusters, are being tried. In the near future, things will change a lot due to these advancements.

About the author:
Abhishek Mungoli is a seasoned Data Scientist with experience in the ML field, a Computer Science background spanning various domains, and a problem-solving mindset. He has excelled in various machine learning and optimization problems specific to retail, and is enthusiastic about implementing machine learning models at scale and about knowledge sharing via blogs, talks, meetups, papers, etc.
My motive is always to simplify the toughest of things to their most simplified version. I love problem-solving, data science, product development, and scaling solutions. I love to explore new places and work out in my leisure time. Follow me on Medium, LinkedIn, or Instagram and check out my previous posts. I welcome feedback and constructive criticism.


4. References

  1. https://databricks.com/session/deep-learning-with-apache-spark-and-gpus
  2. https://www.boston.co.uk/info/nvidia-kepler/what-is-gpu-computing.aspx
  3. https://towardsdatascience.com/process-wikipedia-using-apache-spark-to-create-spicy-hot-datasets-1a59720e6e25
  4. https://medium.com/walmartlabs/how-gpu-computing-literally-saved-me-at-work-fc1dc70f48b6
  5. https://nusit.nus.edu.sg/services/hpc-newsletter/apache-spark-achieves-hadoop-didnt/

