Why use Spark?
The first reason you'll find on the internet is to work with Big Data, but why not use Pandas, Hadoop, or Dask instead? Let's compare them in a few words.
Pandas handles datasets with millions of rows and works perfectly well in most cases, but when your data is too large to fit on one machine and must be distributed across a network of machines, Pandas is no longer a valid choice.
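To make that concrete, here is a minimal sketch of the same read-and-aggregate step in Pandas and in PySpark. The file name and column names are hypothetical; the point is that the Pandas call must load everything into one machine's RAM, while the Spark version can operate on data partitioned across a cluster.

```python
import pandas as pd
from pyspark.sql import SparkSession

# Pandas: the entire file must fit in this one machine's memory
pdf = pd.read_csv("large_data.csv")
print(pdf.groupby("country")["sales"].sum())

# PySpark: the same logic, but the data can live across many machines
spark = SparkSession.builder.appName("why-spark").getOrCreate()
sdf = spark.read.csv("large_data.csv", header=True, inferSchema=True)
sdf.groupBy("country").sum("sales").show()
```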
Hadoop is one of the leading Big Data technologies out there, so why do we need Spark? Hadoop also manages data very efficiently, but there is a thin line of difference between the two: Hadoop does most of its computation by reading from and writing to the hard disk, whereas Spark keeps data in memory (RAM). Because RAM is more expensive than disk, Spark tends to cost more to run, so it's for you to decide between speed and cost. Spark has been shown to be faster than Hadoop precisely because it works in memory.
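As a rough illustration of that in-memory behavior, PySpark lets you explicitly ask for a DataFrame to be cached in RAM so that later actions reuse it instead of re-reading from disk. This is a minimal sketch with a placeholder file path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Hypothetical input path; any large dataset works the same way
df = spark.read.parquet("events.parquet")

# Keep the DataFrame in memory after the first action is computed;
# Hadoop MapReduce would instead write intermediate results to disk
df.cache()

df.count()                             # first action: reads from disk, then caches
df.filter(df.status == "ok").count()   # reuses the in-memory copy
```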
Dask is a Python library for parallel computing that also helps you work with data in HDFS, but it still has drawbacks compared to Spark. Spark supports Scala, Java, Python, R, and SQL, whereas Dask supports only Python. Spark gives you a complete package for all your needs, while Dask depends on other libraries to get its job done. On the plus side, both Dask and Spark can scale to clusters of around 1,000 nodes.
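One concrete consequence of Spark's broader language support is that you can express the same job in plain SQL on top of the DataFrame API, which Dask does not offer. A small sketch, assuming the same hypothetical CSV file as above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()
df = spark.read.csv("large_data.csv", header=True, inferSchema=True)

# Register the DataFrame as a temporary view so it can be queried in SQL
df.createOrReplaceTempView("sales")
spark.sql(
    "SELECT country, SUM(sales) AS total FROM sales GROUP BY country"
).show()
```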
Based on the differences above, it's for you to decide whether or not you need Spark.