Apache Spark is a popular distributed computing tool for tabular datasets that is growing to become a dominant name in Big Data analysis today. Dask has several elements that appear to intersect this space, and we are often asked, “How does Dask compare with Spark?”
Answering such comparison questions in an unbiased and informed way is hard, particularly when the differences can be somewhat technical. This document tries to do so.
Generally, Dask is smaller and lighter weight than Spark. This means that it has fewer features and, instead, is used in conjunction with other libraries, particularly those in the numeric Python ecosystem. It couples with libraries like Pandas or Scikit-Learn to achieve high-level functionality.
Language
- Spark is written in Scala with some support for Python and R. It interoperates well with other JVM code.
- Dask is written in Python and only really supports Python. It interoperates well with C/C++/Fortran/LLVM or other natively compiled code linked through Python.
Ecosystem
- Spark is an all-in-one project that has inspired its own ecosystem. It integrates well with many other Apache projects.
- Dask is a component of the larger Python ecosystem. It couples with and enhances other libraries like NumPy, Pandas, and Scikit-Learn.
Age and Trust
- Spark is older (since 2010) and has become a dominant and well-trusted tool in the Big Data enterprise world.
- Dask is younger (since 2014) and is an extension of the well-trusted NumPy/Pandas/Scikit-learn/Jupyter stack.
Scope
- Spark is more focused on traditional business intelligence operations like SQL and lightweight machine learning.
- Dask is applied more generally both to business intelligence applications, as well as a number of scientific and custom situations.
Internal Design
- Spark’s internal model is higher level, providing good high level optimizations on uniformly applied computations, but lacking flexibility for more complex algorithms or ad-hoc systems. It is fundamentally an extension of the Map-Shuffle-Reduce paradigm.
- Dask’s internal model is lower level, and so lacks high level optimizations, but is able to implement more sophisticated algorithms and build more complex bespoke systems. It is fundamentally based on generic task scheduling.
Scale
- Spark scales from a single node to thousand-node clusters.
- Dask scales from a single node to thousand-node clusters.
APIs
DataFrames
- Spark DataFrame has its own API and memory model. It also implements a large subset of the SQL language. Spark includes a high-level query optimizer for complex queries.
- Dask DataFrame reuses the Pandas API and memory model. It implements neither SQL nor a query optimizer. It is able to do random access, efficient time series operations, and other Pandas-style indexed operations.
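For a rough sense of what that Pandas-style API looks like on Dask, here is a minimal sketch; the file path and column names are hypothetical:

```python
import dask.dataframe as dd

# Read many CSV files as one logical DataFrame (path and columns are hypothetical)
df = dd.read_csv("data/2023-*.csv", parse_dates=["timestamp"])

# Pandas-style indexed, time-series operations
df = df.set_index("timestamp")
hourly = df["value"].resample("1h").mean()

# Work is lazy until .compute() is called
print(hourly.compute().head())
```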
Machine Learning
- Spark MLLib is a cohesive project with support for common operations that are easy to implement with Spark’s Map-Shuffle-Reduce style system. People considering MLLib might also want to consider other JVM-based machine learning libraries like H2O, which may have better performance.
- Dask relies on and interoperates with existing libraries like Scikit-Learn and XGBoost. These can be more familiar or higher performance, but this generally results in a less cohesive whole. See the dask-ml project for integrations, and the sketch below.
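As one hedged illustration of the “Dask plus existing libraries” approach, Scikit-Learn’s joblib-based parallelism can be pointed at a Dask cluster; the dataset here is synthetic and the parameter grid is only illustrative:

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

client = Client()  # local cluster by default; could point at a remote scheduler

X, y = make_classification(n_samples=10_000, n_features=20)
search = GridSearchCV(RandomForestClassifier(),
                      {"n_estimators": [50, 100, 200]})

# Run Scikit-Learn's internal parallelism on the Dask cluster
with joblib.parallel_backend("dask"):
    search.fit(X, y)

print(search.best_params_)
```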
Arrays
- Spark does not include native support for multi-dimensional arrays (this would be challenging given its computation model), although some support for two-dimensional matrices may be found in MLLib. People may also want to look at the Thunder project, which combines Apache Spark with NumPy arrays.
- Dask fully supports the NumPy model for scalable multi-dimensional arrays.
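A minimal sketch of that NumPy-style array model; the shapes and chunk sizes are arbitrary:

```python
import dask.array as da

# A large random array split into 1,000 x 1,000 chunks
x = da.random.random((100_000, 100_000), chunks=(1_000, 1_000))

# NumPy-style expressions build a task graph over the chunks
y = (x + x.T) - x.mean(axis=0)

# Only the requested slice is actually computed
print(y[:5, :5].compute())
```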
Streaming
- Spark’s support for streaming data is first-class and integrates well into their other APIs. It follows a mini-batch approach. This provides decent performance on large uniform streaming operations.
- Dask provides a real-time futures interface that is lower-level than Spark streaming. This enables more creative and complex use cases, but requires more work than Spark streaming.
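A small sketch of that futures interface, assuming a stand-in `process` function and an in-memory source of records:

```python
from dask.distributed import Client

client = Client()  # local cluster by default

def process(record):
    # stand-in for arbitrary per-record work
    return record ** 2

incoming_records = range(100)  # hypothetical stream of records

# Submit work as it arrives; each call returns a future immediately
futures = [client.submit(process, r) for r in incoming_records]

results = client.gather(futures)
```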
Graphs / complex networks
- Spark provides GraphX, a library for graph processing.
- Dask provides no such library.
Custom parallelism
- Spark generally expects users to compose computations out of their high-level primitives (map, reduce, groupby, join, …). It is also possible to extend Spark through subclassing RDDs, although this is rarely done.
- Dask allows you to specify arbitrary task graphs for more complex and custom systems that are not part of the standard set of collections.
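One way to build such a custom graph with Dask is through dask.delayed; this toy example only sketches the idea:

```python
import dask

@dask.delayed
def inc(x):
    return x + 1

@dask.delayed
def double(x):
    return 2 * x

@dask.delayed
def add(x, y):
    return x + y

# An arbitrary task graph: any Python call structure, not just map/reduce
output = [add(inc(i), double(i)) for i in range(10)]
total = dask.delayed(sum)(output)

print(total.compute())
```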
Reasons you might choose Spark
- You prefer Scala or the SQL language
- You have mostly JVM infrastructure and legacy systems
- You want an established and trusted solution for business
- You are mostly doing business analytics with some lightweight machine learning
- You want an all-in-one solution
Reasons you might choose Dask
- You prefer Python or native code, or have large legacy code bases that you do not want to entirely rewrite
- Your use case is complex or does not cleanly fit the Spark computing model
- You want a lighter-weight transition from local computing to cluster computing
- You want to interoperate with other technologies and don’t mind installing multiple packages
Reasons to choose both
It is easy to use both Dask and Spark on the same data and on the same cluster.
They can both read and write common formats, like CSV, JSON, ORC, and Parquet, making it easy to hand results off between Dask and Spark workflows.
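For example, a Dask job can pick up a Parquet dataset written by Spark and write results back for Spark to consume; the paths here are hypothetical, and reading from object storage may require extra packages such as s3fs:

```python
import dask.dataframe as dd

# Read a Parquet dataset produced by a Spark job
df = dd.read_parquet("s3://my-bucket/events/")

# ... Dask-side cleaning or analysis ...
cleaned = df.dropna()

# Write back to Parquet for downstream Spark workflows
cleaned.to_parquet("s3://my-bucket/events-cleaned/")
```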
They can both deploy on the same clusters. Most clusters are designed to support many different distributed systems at the same time, using resource managers like Kubernetes and YARN. If you already have a cluster on which you run Spark workloads, it’s likely easy to also run Dask workloads on your current infrastructure and vice versa.
In particular, users coming from traditional Hadoop/Spark clusters (such as those sold by Cloudera/Hortonworks) are likely using the YARN resource manager. You can deploy Dask on these systems using the Dask-Yarn project, as well as other projects, like JupyterHub on Hadoop.
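A minimal Dask-Yarn sketch, with illustrative resource settings and a packaged Python environment whose archive name here is hypothetical:

```python
from dask_yarn import YarnCluster
from dask.distributed import Client

# Start Dask as a YARN application; the environment archive and sizes are illustrative
cluster = YarnCluster(environment="environment.tar.gz",
                      worker_vcores=2,
                      worker_memory="4GiB")
cluster.scale(10)  # ask for 10 workers

client = Client(cluster)
```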