Apache Spark Serialization issue
Abhishek Choudhary
Data Infrastructure Engineering in RWE/RWD | Healthtech DhanvantriAI
It's fairly common to hit a Spark serialization issue while working with Spark Streaming or even a basic Spark job:
org.apache.spark.SparkException: Task not serializable
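A hypothetical minimal reproduction (class and field names assumed, not from the original post): a closure that references a field of a non-serializable enclosing class forces Spark to serialize the whole instance.

import org.apache.spark.SparkContext

// Hypothetical example: Processor is not Serializable, but the map() closure
// references the `multiplier` field, which captures `this` and makes Spark try
// to serialize the whole Processor instance -> Task not serializable.
class Processor(sc: SparkContext) {
  val multiplier = 10
  def run(): Long =
    sc.parallelize(1 to 100)
      .map(x => x * multiplier) // captures `this`, not just an Int
      .count()
}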
It is very annoying and, frankly, hard to debug: you have to work out exactly which object caused the problem. In essence, something captured by a closure is not serializable, and that is what throws the exception. Databricks publishes some basic guidelines to avoid the scenario:
- Declare functions inside an object as much as possible
- If you need to use SparkContext or SQLContext inside closures (e.g. inside foreachRDD), obtain them with SparkContext.getOrCreate() and SQLContext.getOrCreate() instead of capturing a reference from the driver
- Redefine variables provided to class constructors inside functions (guidelines 1 and 3 are illustrated in the sketch after the foreachRDD example below)
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{col, from_unixtime}
import org.apache.spark.streaming.Minutes

stream.map(addOne).window(Minutes(1)).foreachRDD { rdd =>
  // Access the SQLContext via getOrCreate instead of capturing one from the driver
  val _sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  _sqlContext.createDataFrame(rdd).toDF("value", "time")
    .withColumn("date", from_unixtime(col("time") / 1000)) // or import _sqlContext.implicits._ and use $"time"
    .registerTempTable("demo_numbers")
}
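And here is a minimal sketch (class and method names assumed) of guidelines 1 and 3: keep helper functions in an object, and copy constructor arguments into local vals so closures capture only those values rather than the enclosing instance.

import org.apache.spark.SparkContext

object Helpers {
  def addOne(x: Int): Int = x + 1 // defined in an object, per guideline 1
}

class Job(sc: SparkContext, factor: Int) {
  def run(): Long = {
    val localFactor = factor // guideline 3: redefine the constructor variable locally
    sc.parallelize(1 to 100)
      .map(Helpers.addOne)       // no enclosing instance dragged into the closure
      .map(x => x * localFactor) // captures an Int, not the Job instance
      .count()
  }
}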
Comment (Product manager | audio engineer):
Or define your classes to be Serializable, preferably in combination with Kryo (instead of native Java serialization).
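A minimal sketch of that suggestion (the Event class name is assumed): mark your classes as Serializable, switch the data serializer to Kryo, and register the classes with it. Note that Kryo covers shuffled and cached data; closures themselves still go through Java serialization, so anything captured in a closure still needs to be Serializable.

import org.apache.spark.SparkConf

// Hypothetical record class that will travel inside RDDs.
class Event(val id: Long, val value: Double) extends Serializable

val conf = new SparkConf()
  .setAppName("kryo-demo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Event]))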