Optimizing Spark with Cache and Persist.
Komal Khakal
Data Engineer @Mastercard | Big Data | PySpark | Databricks | SQL | Azure | Hive | Sqoop | Azure Data Lake | Azure Data Factory | Snowflake
In PySpark, cache and persist are methods used to improve the performance of your Spark jobs by storing intermediate results in memory or on disk. These methods are particularly useful when you have operations that are reused multiple times in your computations.
Cache
- Purpose: The cache method stores the data so that it does not have to be recomputed every time an action runs on it.
- Usage: When you call cache() on a DataFrame or RDD, Spark saves the data at its default storage level.
- Syntax: df.cache()
- Storage Level: For RDDs the default is MEMORY_ONLY; for DataFrames and Datasets it is MEMORY_AND_DISK (kept in memory first, spilling to disk if it does not fit).
Example:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("CacheExample").getOrCreate()
data = [(1, 'Alice'), (2, 'Bob'), (3, 'Cathy')]
df = spark.createDataFrame(data, ["id", "name"])
# Cache the DataFrame
df.cache()
# Perform an action to materialize the cache
df.count()
# Subsequent actions will use the cached data
df.show()
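If you want to confirm that caching actually took effect, a minimal sketch using the standard DataFrame properties is_cached and storageLevel looks like this:
# Check whether the DataFrame is cached and which storage level was assigned
print(df.is_cached)       # True once cache() has been called
print(df.storageLevel)    # the level chosen by cache() for DataFrames
# Release the cached data when it is no longer needed
df.unpersist()
print(df.is_cached)       # False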
Persist
- Purpose: The persist method is more flexible than cache and allows you to specify different storage levels.
- Usage: When you call persist() on a DataFrame or RDD, you can choose how you want to store the data (in-memory, on disk, or both).
- Syntax: df.persist(storageLevel)
- Storage Levels: Common storage levels include MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, MEMORY_ONLY_SER, and MEMORY_AND_DISK_SER (the _SER variants apply to the Scala/Java API; on the Python side data is always serialized, so recent PySpark versions do not expose them as separate constants).
Example:
from pyspark import StorageLevel
# Persist the DataFrame with MEMORY_AND_DISK storage level
df.persist(StorageLevel.MEMORY_AND_DISK)
# Perform an action to materialize the persist
df.count()
# Subsequent actions will use the persisted data
df.show()
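One thing to keep in mind: Spark keeps the storage level first assigned to a DataFrame or RDD, so to switch levels you generally need to unpersist first. A small illustrative sketch:
# Switching storage levels: drop the existing persisted copy first
df.unpersist()
# Re-persist with a different level and materialize it again
df.persist(StorageLevel.DISK_ONLY)
df.count()
# Clean up when the data is no longer needed
df.unpersist()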
Comparison
Default Storage Level:
- cache() uses MEMORY_ONLY by default for RDDs and MEMORY_AND_DISK by default for DataFrames/Datasets.
- persist() lets you specify the storage level explicitly; called with no argument, it uses the same default as cache().
Flexibility:
- cache() is simpler but less flexible.
- persist() offers more control over how and where the data is stored.
Use Cases:
- Use cache() when you want to store data in memory and expect it to fit.
- Use persist() when you need more control over storage, such as when dealing with larger datasets that might not fit entirely in memory. The short sketch after this list shows the two side by side.
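The following sketch puts the comparison side by side on two small DataFrames built with spark.range (illustrative only; the exact level printed for cache() can vary slightly between Spark versions):
from pyspark import StorageLevel
df1 = spark.range(1000000)
df2 = spark.range(1000000)
df1.cache()                               # default level (MEMORY_AND_DISK for DataFrames)
df2.persist(StorageLevel.MEMORY_ONLY)     # explicitly chosen level
df1.count()                               # actions materialize both
df2.count()
print(df1.storageLevel)                   # level assigned by cache()
print(df2.storageLevel)                   # MEMORY_ONLY, as requested
df1.unpersist()
df2.unpersist()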
Storage Levels in Detail
- MEMORY_ONLY: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly as needed.
- MEMORY_AND_DISK: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk and read them from there when needed.
- DISK_ONLY: Store RDD partitions only on disk.
- MEMORY_ONLY_SER: Store RDD as serialized Java objects (one byte array per partition) in the JVM. This is more space-efficient but more CPU-intensive to read.
- MEMORY_AND_DISK_SER: Similar to MEMORY_ONLY_SER but spill partitions that don't fit in memory to disk. (See the sketch after this list for how storage levels look on the PySpark side.)
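A note for PySpark specifically: data is always serialized on the Python side, so the _SER distinction mainly matters for the Scala/Java API, and recent PySpark releases expose levels such as MEMORY_AND_DISK_DESER instead. Each StorageLevel is a small object whose flags you can inspect, as in this sketch:
from pyspark import StorageLevel
# A StorageLevel records whether it uses disk, memory, off-heap storage,
# whether data is kept deserialized, and how many replicas to keep
level = StorageLevel.MEMORY_AND_DISK
print(level.useDisk, level.useMemory, level.deserialized, level.replication)
# A custom level can be built directly:
# StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication)
replicated = StorageLevel(True, True, False, False, 2)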
By understanding and using cache and persist appropriately, you can significantly improve the performance and efficiency of your PySpark jobs, especially when dealing with iterative algorithms or repeated access to the same data.
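As a closing illustration, here is a minimal end-to-end sketch of the repeated-access case, assuming a local SparkSession (the app name and column names are arbitrary):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.appName("ReuseExample").getOrCreate()
# A DataFrame that several downstream actions will reuse
base = spark.range(5000000).withColumn("mod", F.col("id") % 10)
base.persist()            # same default level as cache()
base.count()              # materialize it once
# Both actions below reuse the persisted data instead of recomputing base
base.groupBy("mod").count().show()
base.filter(F.col("mod") == 3).count()
base.unpersist()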
Data Engineer at Brillquest Technologies | SQL | Python | Spark | Azure | Delta Lake | Power BI
6 months ago: Hyy Komal Khakal the default storage level for cache() in Spark is MEMORY_AND_DISK, not MEMORY_ONLY.
Data Engineer at Brillquest Technologies | SQL | Python | Spark | Azure | Delta Lake | Power BI
8 months ago: Nice explanation