Spark Internals and Design Basics

Apache Spark is an open-source, general-purpose cluster computing engine built around speed, ease of use, and sophisticated analytics, and is primarily used for Big Data. It emerged as an alternative to the complexity and tedium of running Machine Learning algorithms on Hadoop MapReduce. Spark introduces two main abstractions: resilient distributed datasets (RDDs) and parallel operations on them. Spark is written in the Scala programming language and runs on the Java Virtual Machine (JVM); it exploits Scala's functional programming style to great effect for elegance and simplicity.

Limitation of Classic MapReduce

MapReduce is inherently designed around an acyclic (DAG-style) data flow: each job reads its input, pushes it through a fixed map-and-reduce pipeline, and writes the result out. This puts a severe limitation on the kinds of workloads it can run efficiently. Machine Learning algorithms, on the other hand, are frequently iterative, applying multiple parallel operations to the same working dataset over and over; the PageRank algorithm, for example, refines its estimates of link values over a number of iterations. Dryad and Map-Reduce-Merge generalize the acyclic data flow model, but like MapReduce they can only reuse data between computations by writing it to storage, and so remain IO-intensive, unlike Spark.

MapReduce appeared at a time when RAM was not as cheap as it is today, and for that reason it is disk-oriented and IO-intensive. Every map-reduce pass transforms the complete dataset and flushes the output to disk at the end, which is then read back as input by the subsequent pass, and so on. This can be seen as "vertical" execution: each operation is applied to all the entries in the dataset, top to bottom, before the next operation starts.

In Spark the execution flow is "horizontal": wherever possible, Spark applies a sequence of transformations row by row without materializing intermediate data. More precisely, on a given node a sequence of transforms that involves no "shuffling" (referred to as a stage) is clubbed together and applied to each row as a whole, which avoids unnecessary copies after each transform. An intermediate RDD is not created unless the code explicitly asks for one by caching. For example, if you want the top five restaurants for a given zip code from a dataset of restaurants, the job need not read, transform, or write any further entries once it has found them. The comparison is loosely analogous to depth-first versus breadth-first execution. For the same reason, each Pig or Hive query on Hadoop incurs significant latency (tens of seconds), whereas Spark can answer in milliseconds. This is a fundamental architectural distinction between MapReduce and Spark.
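A minimal sketch of that restaurant example in Spark's Scala API, assuming a SparkContext sc (e.g. from spark-shell) and a hypothetical restaurants.csv whose third field is the zip code: asking for only five results lets Spark stop scanning once it has enough matching rows, with no intermediate dataset written out.

    // Assumes a SparkContext `sc` and a hypothetical CSV whose third field is the zip code.
    val topFive = sc.textFile("restaurants.csv")
      .map(_.split(","))
      .filter(fields => fields.length > 2 && fields(2) == "94107")
      .take(5)   // take() pulls partitions incrementally and stops once it has 5 rows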

The Programming Model

The components involved in the Spark programming model are: Driver, Cluster Manager/Master, Worker, and Executor.
[Figure: Spark cluster overview, showing the Driver connecting through the Cluster Manager to Worker nodes running Executors]

There are 3 major constructs of the Spark programming model:

  • Resilient Distributed Datasets (RDDs)
  • Parallel Operations (Actions & Transformations)
  • Shared Variables (Broadcast Variable & Accumulator)

Resilient Distributed Datasets (RDDs)

Abstruse name for an elegant data structure! To simplify,

  • Resilient: if lost, it can be recreated
  • Distributed: existing across the cluster
  • Dataset: data fetched from filesystem or created programmatically

RDDs represent a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. A key attribute of an RDD is its lineage: each RDD carries enough information about how it was derived from other RDDs to recompute any lost partition.

RDDs can be thought of as a restricted form of Distributed Shared Memory (DSM). However, there are significant differences.

RDD                                    DSM
Restricted programming model           Generic programming model
Coarse granularity                     Fine granularity
Fault tolerance through lineage        Fault tolerance through checkpointing
Trivial consistency                    Consistency is left to the application

An RDD can be created in 4 ways (a short sketch follows the list):

  • file read operation
  • parallelize operation
  • transform operation on other RDDs
  • changing the persistence of existing RDD (cache or save to filesystem)
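A minimal sketch of the four paths, assuming an existing SparkContext sc and placeholder paths:

    val fromFile    = sc.textFile("hdfs:///data/restaurants.csv")  // 1. file read
    val fromMemory  = sc.parallelize(1 to 100)                     // 2. parallelize a local collection
    val transformed = fromMemory.map(_ * 2)                        // 3. transform of an existing RDD
    val persisted   = transformed.cache()                          // 4. change persistence: keep in memory
    // ...or change persistence by writing out to the filesystem:
    // transformed.saveAsTextFile("hdfs:///data/doubled")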

Parallel Operations

The driver program runs the user's main function, which executes the parallel operations on the cluster.
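A minimal self-contained driver sketch; the application name, master URL, and data below are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    object MinimalDriver {
      def main(args: Array[String]): Unit = {
        // The driver builds a SparkContext, the handle through which all
        // parallel operations are submitted to the cluster.
        val conf = new SparkConf().setAppName("MinimalDriver").setMaster("local[*]")
        val sc   = new SparkContext(conf)

        val total = sc.parallelize(1 to 1000).map(_ * 2).reduce(_ + _)
        println(s"total = $total")

        sc.stop()
      }
    }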

Actions

Action operations fetch a value back to the driver program (or, in the case of saveAsTextFile, write the dataset out) and trigger actual execution of the job. Examples:
reduce, collect, count, first, take, takeSample, countByKey, saveAsTextFile
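A short sketch of a few of these, again assuming a SparkContext sc; each call launches a job and hands a value back to the driver:

    val nums = sc.parallelize(1 to 10)
    nums.count()                    // 10
    nums.first()                    // 1
    nums.reduce(_ + _)              // 55
    nums.take(3)                    // Array(1, 2, 3)
    // nums.saveAsTextFile("hdfs:///tmp/nums")   // placeholder output path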

Transforms

A transform applies a function to an RDD to create one or more new RDDs. Transforms are lazy: they only record how the new RDD is derived, and no computation happens until an action is invoked. Examples:
map, flatMap, filter, mapPartitions, mapPartitionsWithIndex, sample, union, intersection, distinct, groupByKey, reduceByKey, aggregateByKey, sortByKey, join
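A short word-count style sketch over illustrative in-memory data; none of the transforms below does any work until the final action is invoked:

    val lines  = sc.parallelize(Seq("spark is fast", "spark is simple"))
    val words  = lines.flatMap(_.split(" "))    // new RDD, nothing executed yet
    val pairs  = words.map(w => (w, 1))         // still nothing executed
    val counts = pairs.reduceByKey(_ + _)       // still nothing executed
    counts.collect()                            // action: the whole chain runs now
    // e.g. Array((spark,2), (is,2), (fast,1), (simple,1)), order may vary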

Stages of operations: Spark executes a job as a set of stages separated by distributed "shuffle" operations. In other words, a sequence of operations that requires no shuffle is executed on the same node, over the same partition of data, and is referred to as a stage; a shuffle redistributes data across the cluster and starts a new stage.
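A small sketch of a stage boundary, reusing the hypothetical restaurant file: the file read and the map are pipelined into one stage, while reduceByKey needs a shuffle and so starts a new one. RDD.toDebugString prints the lineage with the shuffle dependency visible.

    val perZip = sc.textFile("restaurants.csv")
      .map(line => (line.split(",")(2), 1))   // stays in the same stage as the read
      .reduceByKey(_ + _)                     // shuffle: a new stage begins here
    println(perZip.toDebugString)             // lineage, with the ShuffledRDD on top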

Shared Variables

Shared variables are distributed across the cluster so that the tasks running on the nodes can access them.

Broadcast Variables

These refer to (large) read-only chunks of data distributed across the cluster, one copy per node. The copy is shared among multiple tasks on the same node.

As the operations proceed from stage to stage, Spark implicitly broadcasts data needed by the tasks in a subsequent stage. Furthermore, this data is cached in serialized form and deserialized before running each task.

However, explicitly creating broadcast variables is useful only when (see the sketch after this list):
  • tasks across multiple stages need the same data, or
  • the data must be cached in deserialized form
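A minimal sketch with a hypothetical zip-code lookup table that tasks in several operations reuse:

    // A small read-only lookup table, shipped to each node once instead of
    // with every task closure.
    val zipToCity = sc.broadcast(Map("94107" -> "San Francisco", "10001" -> "New York"))

    val cities = sc.textFile("restaurants.csv")
      .map(_.split(",")(2))                                  // zip-code field (assumed position)
      .map(zip => zipToCity.value.getOrElse(zip, "unknown")) // read-only access on the executors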

Accumulators

Accumulators are shared variables distributed across the nodes that support only an "add" operation; they are used to implement counters and sums. For updates applied inside actions, Spark guarantees that each task's update is applied to the accumulator only once, even if the task is restarted. For updates applied inside transforms, however, the update may be applied more than once if a task is re-executed, and guarding against that is the programmer's responsibility. Accumulators also respect Spark's lazy evaluation model: their value is only updated once the RDD involved is actually computed as part of an action.
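A sketch of a counter using the SparkContext.accumulator API of the Spark 1.x line this post draws on (newer releases expose sc.longAccumulator instead); the file and field layout are the same hypothetical ones used above:

    val badLines = sc.accumulator(0)          // add-only from tasks, readable on the driver

    val zips = sc.textFile("restaurants.csv").flatMap { line =>
      val fields = line.split(",")
      if (fields.length < 3) { badLines += 1; None }   // update inside a transform: may be
      else Some(fields(2))                             // applied again if the task is re-run
    }

    zips.count()                                       // the accumulator is only populated
    println("malformed lines: " + badLines.value)      // once an action has executed the RDD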

References:

[1] Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S. and Stoica, I., 2010. Spark: Cluster Computing with Working Sets. HotCloud, 10, pp. 10-10.

[2] Cluster Mode Overview. https://spark.apache.org/docs/1.1.0/cluster-overview.html

[3] Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J., Shenker, S. and Stoica, I., 2012, April. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (pp. 2-2). USENIX Association.
