Lecture notes: an intro to Apache Spark programming

In Lecture 7 of our Big Data in 30 hours class, we discussed Apache Spark and did some hands-on programming. The purpose of this memo is to summarize the terms and ideas presented.

Apache Spark is currently one of the most popular platforms for parallel execution of computing jobs in a distributed environment. The idea is not new. Starting in the late 1980s, the HPC (high performance computing) community executed jobs in parallel over clusters, supercomputers and compute farms. Technologies of the time, broadly related to scheduling jobs, included: PVM, MPI, PBS, Platform LSF, Sun Grid Engine, Globus, Moab, and many more. In the first decade of the 2000s, cluster computing went mainstream with the advent of high-level APIs, cloud environments and the Hadoop MapReduce model (discussed in the previous lecture).

Hadoop (original credits to Doug Cutting and Mike Cafarella, who built it on ideas from Google's MapReduce and GFS papers; development was later backed by Yahoo!, and the project is now maintained by Apache) became so popular that for a while it was the de-facto standard in the field. However, the MapReduce model had some deficiencies:

  • a focus on batch operations, while market demand drifted toward real-time, online processing
  • a limited, inflexible API: only some types of computations could be expressed in this model
  • missing abstractions for advanced workflows: streaming data, interactive queries, DAG workflows, heterogeneous tasks

Apache Spark (original credits to Matei Zaharia at UC Berkeley's AMPLab) came to light in the early 2010s because it responded to these deficiencies. Spark is a distributed processing engine:

  • written in Scala, with programming interfaces in Scala, Python, R and SQL
  • focused on in-memory processing
  • reportedly 10–100 times faster than Hadoop MapReduce, depending on the workload
  • allows for a wide range of workflows; flexible and easy to program
  • many distributed operations happen implicitly: the programmer only implies them in the source code (see the sketch after this list)
  • leverages a lot of Hadoop-related infrastructure underneath: Mesos, YARN, HDFS
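
To make the last two points concrete, here is a minimal word-count sketch in PySpark (assuming a local Spark installation; the input file path is hypothetical). The flatMap, map and reduceByKey calls only describe the computation; Spark partitions the data and distributes the work implicitly when the collect action runs.

```python
from pyspark.sql import SparkSession

# Local session for illustration only; on a cluster the master would differ.
spark = SparkSession.builder.appName("wordcount-sketch").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Transformations are lazy: they build a plan, nothing runs yet.
lines = sc.textFile("data/sample.txt")              # hypothetical input file
counts = (lines.flatMap(lambda line: line.split())  # split lines into words
               .map(lambda word: (word, 1))         # pair each word with 1
               .reduceByKey(lambda a, b: a + b))    # sum counts per word

# The action triggers the distributed execution across partitions.
print(counts.collect())

spark.stop()
```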

Read more here to find out about the first steps in Spark programming: building RDD datasets, understanding partitioning and the DAG scheduler. A minimal example of those first steps is sketched below.
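
As a small illustration, the following sketch (again assuming a local PySpark installation) builds an RDD from a Python range and inspects how Spark partitions it.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-partitions").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Build an RDD and explicitly ask for 8 partitions.
rdd = sc.parallelize(range(1000), numSlices=8)
print(rdd.getNumPartitions())        # -> 8

# glom() groups the elements by partition, making the data layout visible.
print(rdd.glom().map(len).collect()) # roughly equal chunk sizes

spark.stop()
```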
