Lecture notes: an intro to Apache Spark programming

In Lecture 7 of our Big Data in 30 hours class, we discussed Apache Spark and did some hands-on programming. The purpose of this memo is to summarize the terms and ideas presented.

Apache Spark is currently one of the most popular platforms for parallel execution of computing jobs in a distributed environment. The idea is not new. Starting in the late 1980s, the HPC (high performance computing) community executed jobs in parallel over clusters, supercomputers and compute farms. Technologies of the time, broadly related to scheduling jobs, included: PVM, MPI, PBS, Platform LSF, Sun Grid Engine, Globus, Moab, and many more. In the first decade of the 2000s, cluster computing went mainstream with the advent of high-level APIs, cloud environments and the Hadoop MapReduce model (discussed in the previous lecture).

Hadoop (original credits to Doug Cutting and Mike Cafarella, who built it on ideas from Google's MapReduce and GFS papers; development was later backed by Yahoo!, and the project is now maintained by Apache) became so popular that for a while it was the de-facto standard in the field. However, the MapReduce model had some deficiencies:

  • a focus on batch operations, while market demand drifted toward real-time, online processing
  • a limited, inflexible API: only some types of computations could be expressed in this model
  • missing abstractions for advanced workflows: streaming data, interactive queries, DAG workflows, heterogeneous tasks

Apache Spark (original credits to Matei Zaharia at UC Berkeley's AMPLab) came to light in the early 2010s because it responded to these deficiencies. Spark is a distributed processing engine:

  • written in Scala, with programming interfaces in Scala, Python, R and SQL
  • focused on in-memory processing
  • reportedly 10–100 times faster than Hadoop MapReduce, depending on the workload
  • allows for a wide range of workflows; flexible and easy to program
  • many distributed operations happen implicitly: the programmer only implies them in the source code (see the sketch after this list)
  • leverages a lot of Hadoop-related infrastructure underneath: Mesos, YARN, HDFS
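
To make the last two points concrete, here is a minimal word-count sketch in PySpark (assuming a local Spark installation; the input file path is hypothetical). The flatMap, map and reduceByKey calls only describe the computation; Spark partitions the data and distributes the work implicitly when the collect action runs.

```python
from pyspark.sql import SparkSession

# Local session for illustration only; on a cluster the master would differ.
spark = SparkSession.builder.appName("wordcount-sketch").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Transformations are lazy: they build a plan, nothing runs yet.
lines = sc.textFile("data/sample.txt")              # hypothetical input file
counts = (lines.flatMap(lambda line: line.split())  # split lines into words
               .map(lambda word: (word, 1))         # pair each word with 1
               .reduceByKey(lambda a, b: a + b))    # sum counts per word

# The action triggers the distributed execution across partitions.
print(counts.collect())

spark.stop()
```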

Read more here to find out about the first steps in Spark programming: building RDD datasets, understanding partitioning and the DAG scheduler. A minimal example of those first steps is sketched below.
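
As a small illustration, the following sketch (again assuming a local PySpark installation) builds an RDD from a Python range and inspects how Spark partitions it.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-partitions").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Build an RDD and explicitly ask for 8 partitions.
rdd = sc.parallelize(range(1000), numSlices=8)
print(rdd.getNumPartitions())        # -> 8

# glom() groups the elements by partition, making the data layout visible.
print(rdd.glom().map(len).collect()) # roughly equal chunk sizes

spark.stop()
```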
