A Guide to Spark Streaming: Code Examples Included

Jim Scott

Executive experienced in strategic planning, and leveraging innovative solutions to create new revenue streams.

Published: July 1, 2016

Apache Spark is great for processing large amounts of data over large clusters, but wouldn’t it be great if you could process data in near real time? You can with Spark Streaming.

What Is Spark Streaming?

Spark Streaming is built around the StreamingContext, a counterpart to the standard SparkContext that you use for processing data in near real time. Where the standard SparkContext is geared toward batch operations, Spark Streaming uses a little trick: it divides the incoming stream into small batch windows (micro batches). That lets it offer all of the advantages of Spark, namely safe, fast data handling and lazy evaluation, combined with near-real-time processing. It’s a combination of both batch and interactive processing.

You can tune the batch window down to about half a second to reduce processing latency, but smaller windows are more memory intensive. Spark Streaming is used for everything ranging from credit card fraud detection to the identification of threats on the Internet.
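To make the micro-batch idea concrete, here is a minimal sketch in plain Scala with no Spark involved. The Event type, the toBatches helper, and the timestamps are all hypothetical names invented for this illustration; the sketch only mimics how a stream is cut into fixed-length windows, and unlike Spark it simply skips windows that contain no data.

```scala
object MicroBatchSketch {
  // Hypothetical event model: arrival time in milliseconds plus a payload.
  final case class Event(timestampMs: Long, payload: String)

  // Cut a sequence of events into consecutive windows of `batchMs` milliseconds.
  // Each inner Seq plays the role of one micro batch.
  def toBatches(events: Seq[Event], batchMs: Long): Seq[Seq[String]] =
    events
      .groupBy(e => e.timestampMs / batchMs) // window index for each event
      .toSeq
      .sortBy { case (window, _) => window } // oldest window first
      .map { case (_, es) => es.sortBy(_.timestampMs).map(_.payload) }

  def main(args: Array[String]): Unit = {
    val events = Seq(Event(0, "a"), Event(400, "b"), Event(600, "c"))
    // With a 500 ms window, "a" and "b" share a batch and "c" lands in the next one.
    println(toBatches(events, 500))
  }
}
```

A shorter window means each event waits less before it is processed, which is the latency trade-off described above.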

What kinds of data can you analyze?

You might be wondering what kind of data you can ingest into Apache Spark. The short answer is pretty much everything.

More specifically, you can import data from Twitter, Flume, Kafka, ZeroMQ, a custom feed, and HDFS. You can also export into HDFS, as well as other databases, applications, and dashboards.
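As one illustration of reading from and writing to HDFS, the sketch below watches a directory for new files and writes each batch back out. The two paths, the ten-second batch interval, the local[2] master, and the isUsable helper are assumptions made for this example, not settings from the article.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object HdfsInOut {
  // Keep only lines that contain something besides whitespace.
  def isUsable(line: String): Boolean = line.trim.nonEmpty

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("HdfsInOut")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Hypothetical paths; replace them with directories that exist on your cluster.
    val inputDir = "hdfs:///tmp/streaming/in"
    val outputPrefix = "hdfs:///tmp/streaming/out/batch"

    // Picks up files that appear in inputDir after the job has started.
    val lines = ssc.textFileStream(inputDir)

    // Writes one output directory per batch, named from the prefix,
    // the batch time, and the suffix.
    lines.filter(isUsable _).saveAsTextFiles(outputPrefix, "txt")

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Other sources such as Kafka or Flume need their own connector libraries, whose APIs differ between Spark versions, so they are not shown here.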

DStream and RDDs

How is all this possible? It all comes down to the primary data type of Spark: the RDD, or Resilient Distributed Dataset. An RDD is an abstraction of the data that’s held in memory, which is a lot faster than storing and fetching things from disk. This already gives you a significant speed boost over other systems.

RDDs are also safer to use because the transformations keep the original data in lineages, returning new RDDs with the transformations applied. This allows Spark to reconstruct the data with all the changes should something go wrong with one of the nodes in the cluster, such as a power failure.
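The lineage idea can be seen in a few lines. This is a small sketch, assuming a local run (the local[2] master and the sample numbers are choices made for this example): each transformation returns a new RDD, and toDebugString prints the chain of parent RDDs that Spark would replay to rebuild lost data.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageSketch {
  def main(args: Array[String]): Unit = {
    // local[2] and the app name are assumptions for a laptop run.
    val conf = new SparkConf().setMaster("local[2]").setAppName("LineageSketch")
    val sc = new SparkContext(conf)
    try {
      val numbers = sc.parallelize(Seq(1, 2, 3, 4))
      val scaled = numbers.map(_ * 10)  // a new RDD; `numbers` is unchanged
      val large = scaled.filter(_ > 20) // another new RDD derived from `scaled`

      // Shows the recorded lineage: the filter, the map, and the source RDD.
      println(large.toDebugString)

      // Nothing is computed until an action such as collect() runs.
      println(large.collect().mkString(",")) // prints 30,40
    } finally {
      sc.stop()
    }
  }
}
```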

DStream takes the concept of RDDs and applies it to streams. A DStream is simply a stream of RDDs, giving all of the advantages of speed and safety in near real time. The DStream API offers a limited set of transformations compared to the standard Apache Spark.
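One way to see that a DStream really is a sequence of RDDs is to build one by hand. The sketch below is only an illustration under assumed settings (local[2] master, one-second batches, a five-second run, made-up numbers): it feeds three prepared RDDs through queueStream, which delivers one RDD per batch, and uses foreachRDD to reach the RDD behind each batch.

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamOfRdds {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamOfRdds")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Three hand-made RDDs stand in for three micro batches (sample data).
    val queue = mutable.Queue[RDD[Int]](
      ssc.sparkContext.parallelize(Seq(1, 2)),
      ssc.sparkContext.parallelize(Seq(3, 4, 5)),
      ssc.sparkContext.parallelize(Seq(6))
    )
    val stream = ssc.queueStream(queue, oneAtATime = true)

    // foreachRDD hands over the ordinary RDD that backs each batch.
    stream.foreachRDD { rdd =>
      println(s"batch with ${rdd.count()} elements")
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000) // run for about five seconds
    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}
```

Once the queue is empty, the remaining batches are empty RDDs, so the later lines report zero elements.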

Spark Streaming Transformations

If you’re wondering what kind of transformations you can do on DStreams, they’re pretty similar to the standard Spark transformations.

We’ll borrow some examples from the Apache Spark Reference Card to give you a taste. Let’s pretend we’re reading data over some kind of stream, such as from a social media feed.

Let’s start our Spark Streaming Context:
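import org.apache.spark.streaming.{Seconds, StreamingContext}
// sc is an existing SparkContext, such as the one the Spark shell provides
val ssc = new StreamingContext(sc, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)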

For example, map(func) takes func as an argument and applies it to each element, returning a new DStream.

Here’s an example multiplying each line by 10:
lines.map(x=>x.toInt*10).print()

We’ll send some data with the Netcat or nc program available on most Unix-like systems. Spark is reading from port 9999, so we’ll have to make sure Netcat points there.

prompt> nc -lk 9999
12
34

Here’s what the output looks like:
120
340
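The one-line snippets in this guide leave out the setup and the call that starts processing; nothing is read from the socket until the context is started. The following is a minimal, self-contained sketch of the map example. The object name, the timesTen helper, and the local[2] master are assumptions made for this illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MapOverSocket {
  // Parse one line of input as an integer and multiply it by 10.
  // A non-numeric line would throw a NumberFormatException.
  def timesTen(line: String): Int = line.trim.toInt * 10

  def main(args: Array[String]): Unit = {
    // local[2]: the socket receiver occupies one thread, so a local run
    // needs at least two threads to also process the data.
    val conf = new SparkConf().setMaster("local[2]").setAppName("MapOverSocket")
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999)
    lines.map(timesTen).print()

    ssc.start()            // processing begins only after start()
    ssc.awaitTermination() // block until the job is stopped or fails
  }
}
```

Start Netcat first, then run the program, so the receiver has something to connect to on port 9999.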

flatMap() is similar, but can return 0 or more items for each input element.

This example splits text, putting each word on its own line:
lines.flatMap(_.split(" ")).print()

Let’s try it with the string “Spark is fun”:
prompt> nc -lk 9999
Spark is fun

And here’s the output:
Spark
is
fun

count() is obvious enough. It counts the number of data elements in each batch.

We can count the words in our stream:
lines.flatMap(_.split(" ")).count().print()

And here are some lines:
prompt> nc -lk 9999
say
hello
to
spark

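The output should be 4, provided all four lines arrive within the same one-second batch.

reduce() also collapses each batch to a single value, but it combines the data elements with a function you supply instead of just counting them.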

We can use this to add up all the numbers:
lines.map(x=>x.toInt).reduce(_+_).print()

And let’s get some numbers into Spark Streaming:
prompt> nc -lk 9999
1
3
5
7

The answer should be 16, again assuming the numbers arrive in the same batch.

countByValue() counts the number of occurrences of each distinct element.

With one word per line, we can use this to count the number of times each word occurs:
lines.countByValue().print()

We’ll include some duplicate lines just to show you how it works:
prompt> nc -lk 9999
spark
spark
is
fun
fun

The output will look like this, though the order of the pairs may vary:
(is,1)
(spark,2)
(fun,2)

An alternative is to pair each word with a 1 and sum the pairs with the reduceByKey() function:
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
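To see why the two approaches agree, it can help to run the same logic on an ordinary Scala collection standing in for a single batch. This is a plain-Scala sketch, not Spark code: the two helper functions are hypothetical stand-ins named after the DStream operations they imitate.

```scala
object BatchWordCount {
  // Imitates countByValue on one batch: occurrences of each distinct element.
  def countByValue(batch: Seq[String]): Map[String, Long] =
    batch.groupBy(identity).map { case (w, ws) => (w, ws.size.toLong) }

  // Imitates map(x => (x, 1)).reduceByKey(_ + _) on one batch:
  // pair every word with 1, then add up the 1s that share a key.
  def reduceByKey(batch: Seq[String]): Map[String, Int] =
    batch.map(w => (w, 1)).foldLeft(Map.empty[String, Int]) {
      case (acc, (w, n)) => acc.updated(w, acc.getOrElse(w, 0) + n)
    }

  def main(args: Array[String]): Unit = {
    val batch = Seq("spark", "spark", "is", "fun", "fun")
    println(countByValue(batch))
    println(reduceByKey(batch))
  }
}
```

Both functions report 2 for spark, 1 for is, and 2 for fun. The reduceByKey form is the more general of the two, because the 1 can be replaced by any value you want to aggregate per key.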

Conclusion

By now, you have seen the power of Spark Streaming and what it can do for your near-real-time big data needs. If you want to experiment, you can download a private Sandbox, a full virtual machine with Apache Spark that you can play around with.

Originally published at www.smartdatacollective.com.
