Few Points on Apache Spark 2.0 Streaming Over a Cluster

Experience with Apache Spark 2.0 Streaming over a cluster

The Apache Spark Streaming documentation covers the basics, usage and other details well.

Please check https://spark.apache.org/docs/latest/streaming-programming-guide.html

What is the way to start Spark Streaming in 2.0?

val spark = SparkSession
  .builder
  .config(sparkConf)
  .getOrCreate()

val ssc = new StreamingContext(spark.sparkContext, Seconds(windowtime.toInt))

 

Spark 2.0 introduced a builder pattern for starting any Spark component.

Read more https://databricks.com/blog/2016/08/15/how-to-use-sparksession-in-apache-spark-2-0.html
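Once the session and streaming context are in place, the computation still has to be defined, started and awaited explicitly. A minimal sketch; the socket source and the count/print output below are hypothetical placeholders, not part of the original job:

// Hypothetical input and output, just to make the job complete
val lines = ssc.socketTextStream("localhost", 9999)
lines.count().print()

// Start the computation and block until it is stopped or fails
ssc.start()
ssc.awaitTermination()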

 

How to calculate the window length or the right batch interval?

 

For stable Spark Streaming, each batch of data must be processed within the batch interval, or the queue of pending batches will grow, and an excessive queue will eventually crash the job. It happened to me several times.

There are several things to consider; a few of them:

  • If possible, slow down the data rate. If the processing logic itself cannot be tuned any further, try a lower input rate.
  • Increase the window size. If your application can accept a larger window, then intelligent programming plus a larger window will make the streaming very stable. Intelligent programming means using the full power and tricks of Spark, like caching as much as possible, avoiding unbalanced joins, and repartitioning or coalescing based on your needs. Multiple write operations during streaming are sometimes mandatory but resource inefficient, so consider partitioning/repartitioning/coalescing, the shuffle block size and other tricks to optimize them.
  • Introduce a cool-off period. After every interval of X time, stop the data flow to let the streaming job clear its existing queue, and then resume the data rate.
  • Check the Spark configuration options and use them properly, e.g. the CMS garbage collector, spark.memory.fraction, spark.shuffle.file.buffer, spark.shuffle.spill.compress, spark.memory.offHeap.size and others (see the sketch right after this list).
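
As a starting point, such settings can be applied on the SparkConf used to build the session. A minimal sketch; the values are illustrative assumptions, not recommendations:

import org.apache.spark.SparkConf

// Illustrative values only; tune them against your own workload
val sparkConf = new SparkConf()
  .setAppName("streaming-job")
  .set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC") // CMS GC on executors
  .set("spark.memory.fraction", "0.6")
  .set("spark.shuffle.file.buffer", "64k")
  .set("spark.shuffle.spill.compress", "true")
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "2g")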

 

Things to consider
  • When running a Spark Streaming job for a long time, you might get a no-space error, but that may not reflect the real situation. Most of the time it happens because spark.local.dir is set to /tmp, so change it in the Spark configuration (see the sketch after this list).
  • While using Kafka streaming, be careful about the number of receiver threads, because you must not ask for all the threads available on your CPU. If a single Kafka stream does not perform well, consider the approach of using multiple Kafka streams for each topic, as in the snippet below.
  • Keep checking each of your executors. Sometimes a few executors are very busy while the others are not, which wastes time during any shuffle operation and makes the streaming queue grow. Monitor all the jobs properly and keep cleaning up your garbage.
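
For the spark.local.dir point above, a minimal sketch of moving the scratch directory off /tmp (the path is an assumed example; the same property can also be set in spark-defaults.conf):

import org.apache.spark.SparkConf

// Assumed local disk with enough free space; /tmp is the default and is often too small
val sparkConf = new SparkConf()
  .set("spark.local.dir", "/data/spark/tmp")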

 

val kafkaStreams = (1 to 3).map { i =>
  KafkaUtils.createStream(ssc, broker, "kfkrawdatagroup", Map(topicList(0) -> 1))
    .map(_._2.toString)
}

// Union the parallel receivers and repartition before the heavier processing
val unifiedStream = ssc.union(kafkaStreams)
val repartitionedStream = unifiedStream.repartition(3)
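Each createStream call creates its own receiver, so unioning several of them increases the ingestion parallelism for a single topic, and the repartition spreads the received blocks across the cluster before the heavier processing stages.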

 

Spark Streaming Logging

 

Logging during Spark Streaming can sometimes be important, but one can imagine the amount of logs it produces. There is a small API that is handy:

Valid log levels include: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN

spark.sparkContext.setLogLevel("DEBUG")

Or consider using the Log4j RollingFileAppender:

log4j.rootLogger=INFO, rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.conversionPattern=[%d] %p %m (%c)%n
log4j.appender.rolling.maxFileSize=50MB
log4j.appender.rolling.maxBackupIndex=5
log4j.appender.rolling.file=/var/log/spark/${dm.logging.name}.log
log4j.appender.rolling.encoding=UTF-8
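
Note that this log4j.properties must be visible to the driver and executor JVMs for the appender to take effect; one common approach (adjust to your own deployment) is to ship the file with --files and point -Dlog4j.configuration at it via spark.driver.extraJavaOptions and spark.executor.extraJavaOptions.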

 

Spark Dynamic Allocation of Resources 

There may be times when your streaming is slow, or the data rate is not constant all the time and cores sit idle during certain hours; in that case, why not let other jobs that need them use those cores?

Changes needed in spark-defaults.conf:

 spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
spark.dynamicAllocation.minExecutors 2
spark.dynamicAllocation.maxExecutors 56

 

And while submitting the job, don't set --total-executor-cores and --executor-memory.

*While working with Spark 2.0, I found that even when these are set, dynamic allocation automatically manages the core allocation; this needs further checking.
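
A minimal spark-submit sketch under this setup; the class and jar names are hypothetical placeholders:

spark-submit \
  --master spark://master:7077 \
  --class com.example.StreamingJob \
  streaming-job.jar

Note that there is no --total-executor-cores or --executor-memory here; with dynamic allocation enabled, executors are added and removed within the min/max bounds configured above.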

 

Setting up Apache Spark on a Cluster

Very basic

 

Master node: master

Slaves: slave1, slave2, slave3

On each of the nodes:

spark-env.sh:
export HADOOP_CONF_DIR=$HADOOP_CONF_DIR
export SPARK_MASTER_IP=master
export JAVA_HOME=/home/user/jdk1.8.0_91
export PYSPARK_PYTHON=PYTHON_PATH

spark-defaults.conf:
spark.master spark://master:7077
spark.eventLog.enabled true
spark.eventLog.dir file:///home/home/spark-events
spark.history.fs.logDirectory file:///home/home/spark-events
spark.shuffle.service.enabled true

slaves (conf/slaves):
slave1
slave2
slave3

Now run sbin/start-all.sh from the master node.
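
If everything comes up correctly, the standalone master web UI (http://master:8080 by default) should list all three workers, and spark://master:7077 is the master URL to use when submitting jobs.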

 

 
