Skip to content

Latest commit

 

History

History
63 lines (42 loc) · 2.34 KB

spark-rdd-caching.adoc

File metadata and controls

63 lines (42 loc) · 2.34 KB

RDD Caching and Persistence

Caching or persistence are very important optimisation techniques for iterative and interactive Spark computations. They help saving interim partial results so they can be reused in subsequent stages. These interim results as RDDs are thus kept in memory (default) or more solid storages like disk and/or replicated.

RDDs can be cached using cache operation. They can also be persisted using persist operation.

The difference between cache and persist operations is purely syntactic. cache is a synonym of persist or persist(MEMORY_ONLY), i.e. cache is merely persist with the default storage level MEMORY_ONLY.

Note
Due to the very small and purely syntactic difference between caching and persistence of RDDs the two terms are often used interchangeably and I will follow the "pattern" here.

RDDs can be unpersisted.

cache method

Caution
FIXME

persist method

Caution
FIXME

Storage Levels

StorageLevel describes how an RDD is persisted (and addresses the following concerns):

  • Does RDD use disk?

  • How much of RDD is in memory?

  • Does RDD use off-heap memory?

  • Should an RDD be serialized (while persisting)?

  • How many replicas (default: 1) to use (can only be less than 40)?

There are the following StorageLevel (number _2 in the name denotes 2 replicas):

  • NONE (default)

  • DISK_ONLY

  • DISK_ONLY_2

  • MEMORY_ONLY (default for cache() operation)

  • MEMORY_ONLY_2

  • MEMORY_ONLY_SER

  • MEMORY_ONLY_SER_2

  • MEMORY_AND_DISK

  • MEMORY_AND_DISK_2

  • MEMORY_AND_DISK_SER

  • MEMORY_AND_DISK_SER_2

  • OFF_HEAP

You can check out the storage level using getStorageLevel() operation.

val lines = sc.textFile("README.md")

scala> lines.getStorageLevel
res0: org.apache.spark.storage.StorageLevel = StorageLevel(disk=false, memory=false, offheap=false, deserialized=false, replication=1)

Unpersisting RDDs (Clearing Blocks)

When unpersist(blocking: Boolean = true) method is called, you should see the following INFO message in the logs:

INFO [RddName]: Removing RDD [id] from persistence list

It then calls SparkContext.unpersistRDD(id, blocking) and sets StorageLevel.NONE as the storage level.