Few Points on Apache Spark 2.0 Streaming Over a Cluster

Experience with Apache Spark 2.0 Streaming over a cluster

The Apache Spark Streaming documentation covers the basics, usage and other details well.

Please check https://spark.apache.org/docs/latest/streaming-programming-guide.html

What is the way to start Spark Streaming in 2.0?

val spark = SparkSession
  .builder
  .config(sparkConf)
  .getOrCreate()

val ssc = new StreamingContext(spark.sparkContext, Seconds(windowtime.toInt))

 

Spark 2.0 introduced a builder pattern for starting any Spark component.

Read more https://databricks.com/blog/2016/08/15/how-to-use-sparksession-in-apache-spark-2-0.html
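Once the session and streaming context are in place, the computation still has to be defined, started and awaited explicitly. A minimal sketch; the socket source and the count/print output below are hypothetical placeholders, not part of the original job:

// Hypothetical input and output, just to make the job complete
val lines = ssc.socketTextStream("localhost", 9999)
lines.count().print()

// Start the computation and block until it is stopped or fails
ssc.start()
ssc.awaitTermination()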

 

How to calculate the window length or the right batch interval?

 

For stable Spark Streaming, each batch of data must be processed within the batch interval, or the queue of pending batches will grow, and an excessive queue will eventually crash the job. It happened to me several times.

There are several things to consider; a few of them:

  • If possible, slow down the data rate. If the processing logic itself cannot be tuned any further, try a lower input rate.
  • Increase the window size. If your application can accept a larger window, then intelligent programming plus a larger window will make the streaming very stable. Intelligent programming means using the full power and tricks of Spark, like caching as much as possible, avoiding unbalanced joins, and repartitioning or coalescing based on your needs. Multiple write operations during streaming are sometimes mandatory but resource inefficient, so consider partitioning/repartitioning/coalescing, the shuffle block size and other tricks to optimize them.
  • Introduce a cool-off period. After every interval of X time, stop the data flow to let the streaming job clear its existing queue, and then resume the data rate.
  • Check the Spark configuration options and use them properly, e.g. the CMS garbage collector, spark.memory.fraction, spark.shuffle.file.buffer, spark.shuffle.spill.compress, spark.memory.offHeap.size and others (see the sketch right after this list).
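
As a starting point, such settings can be applied on the SparkConf used to build the session. A minimal sketch; the values are illustrative assumptions, not recommendations:

import org.apache.spark.SparkConf

// Illustrative values only; tune them against your own workload
val sparkConf = new SparkConf()
  .setAppName("streaming-job")
  .set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC") // CMS GC on executors
  .set("spark.memory.fraction", "0.6")
  .set("spark.shuffle.file.buffer", "64k")
  .set("spark.shuffle.spill.compress", "true")
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "2g")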

 

Things to consider
  • When running a Spark Streaming job for a long time, you might get a no-space error, but that may not reflect the real situation. Most of the time it happens because spark.local.dir is set to /tmp, so change it in the Spark configuration (see the sketch after this list).
  • While using Kafka streaming, be careful about the number of receiver threads, because you must not ask for all the threads available on your CPU. If a single Kafka stream does not perform well, consider the approach of using multiple Kafka streams for each topic, as in the snippet below.
  • Keep checking each of your executors. Sometimes a few executors are very busy while the others are not, which wastes time during any shuffle operation and makes the streaming queue grow. Monitor all the jobs properly and keep cleaning up your garbage.
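
For the spark.local.dir point above, a minimal sketch of moving the scratch directory off /tmp (the path is an assumed example; the same property can also be set in spark-defaults.conf):

import org.apache.spark.SparkConf

// Assumed local disk with enough free space; /tmp is the default and is often too small
val sparkConf = new SparkConf()
  .set("spark.local.dir", "/data/spark/tmp")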

 

val kafkaStreams = (1 to 3).map { i =>
  KafkaUtils.createStream(ssc, broker, "kfkrawdatagroup", Map(topicList(0) -> 1))
    .map(_._2.toString)
}

// Union the parallel receivers and repartition before the heavier processing
val unifiedStream = ssc.union(kafkaStreams)
val repartitionedStream = unifiedStream.repartition(3)
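Each createStream call creates its own receiver, so unioning several of them increases the ingestion parallelism for a single topic, and the repartition spreads the received blocks across the cluster before the heavier processing stages.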

 

Spark Streaming Logging

 

Logging during Spark Streaming can sometimes be important, but one can imagine the amount of logs it produces. There is a small API that is handy:

Valid log levels include: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN

spark.sparkContext.setLogLevel("DEBUG")

Or consider using the Log4j RollingFileAppender:

log4j.rootLogger=INFO, rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.conversionPattern=[%d] %p %m (%c)%n
log4j.appender.rolling.maxFileSize=50MB
log4j.appender.rolling.maxBackupIndex=5
log4j.appender.rolling.file=/var/log/spark/${dm.logging.name}.log
log4j.appender.rolling.encoding=UTF-8
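
Note that this log4j.properties must be visible to the driver and executor JVMs for the appender to take effect; one common approach (adjust to your own deployment) is to ship the file with --files and point -Dlog4j.configuration at it via spark.driver.extraJavaOptions and spark.executor.extraJavaOptions.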

 

Spark Dynamic Allocation of Resources 

There may be times when your streaming is slow, or the data rate is not constant all the time and cores sit idle during certain hours; in that case, why not let other jobs that need them use those cores?

Changes needed in spark-defaults.conf:

 spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
spark.dynamicAllocation.minExecutors 2
spark.dynamicAllocation.maxExecutors 56

 

And while submitting the job, don't set --total-executor-cores and --executor-memory.

*While working with Spark 2.0, I found that even when these are set, dynamic allocation automatically manages the core allocation; this needs further checking.
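
A minimal spark-submit sketch under this setup; the class and jar names are hypothetical placeholders:

spark-submit \
  --master spark://master:7077 \
  --class com.example.StreamingJob \
  streaming-job.jar

Note that there is no --total-executor-cores or --executor-memory here; with dynamic allocation enabled, executors are added and removed within the min/max bounds configured above.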

 

Setting up Apache Spark on a Cluster

Very basic

 

Master node: master

Slaves: slave1, slave2, slave3

On each of the nodes:

spark-env.sh:
export HADOOP_CONF_DIR=$HADOOP_CONF_DIR
export SPARK_MASTER_IP=master
export JAVA_HOME=/home/user/jdk1.8.0_91
export PYSPARK_PYTHON=PYTHON_PATH

spark-defaults.conf:
spark.master spark://master:7077
spark.eventLog.enabled true
spark.eventLog.dir file:///home/home/spark-events
spark.history.fs.logDirectory file:///home/home/spark-events
spark.shuffle.service.enabled true

slaves (conf/slaves):
slave1
slave2
slave3

Now run sbin/start-all.sh from the master node.
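
If everything comes up correctly, the standalone master web UI (http://master:8080 by default) should list all three workers, and spark://master:7077 is the master URL to use when submitting jobs.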

 

 
