Spark Execution

At the core of every Spark application is the Spark driver program, which creates a SparkSession object.
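For example, a minimal PySpark program creates its SparkSession like this (the application name below is an arbitrary placeholder):

```python
from pyspark.sql import SparkSession

# The driver program builds a SparkSession, the entry point to Spark's APIs.
spark = (SparkSession.builder
         .appName("SparkExecutionDemo")  # hypothetical app name
         .getOrCreate())
```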

During interactive sessions with Spark shells, the driver converts your Spark application into one or more Spark jobs. It then transforms each job into a directed acyclic graph (DAG), which is Spark's execution plan: each node within a DAG can be a single Spark stage or multiple stages.

Within the DAG nodes, stages are created based on which operations can be performed serially or in parallel. Each stage comprises Spark tasks (a unit of execution), which are then federated across each Spark executor. Each task maps to a single core and works on a single partition of data.
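As a minimal sketch of the task-to-partition mapping, reusing the SparkSession created above (the partition count is arbitrary):

```python
# A DataFrame with 8 partitions: a stage that scans it runs 8 tasks,
# each task on a single core, each processing one partition.
df = spark.range(0, 1_000_000, numPartitions=8)
print(df.rdd.getNumPartitions())  # 8
```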

Transformations, Actions, and Lazy Evaluation

Spark operations on distributed data can be classified into two types: transformations and actions. Transformations transform a Spark DataFrame into a new DataFrame without altering the original data, giving DataFrames the property of immutability. Examples include select() and filter().
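A minimal sketch of this immutability, reusing the SparkSession above (the column names and data are made up):

```python
from pyspark.sql.functions import col

# A small DataFrame of hypothetical temperature readings.
df = spark.createDataFrame([(1, 18.5), (2, 27.3), (3, 31.0)], ["id", "temp"])

# Each transformation returns a *new* DataFrame; df itself is never modified.
hot_df = df.filter(col("temp") > 25.0)
ids_df = hot_df.select("id")
```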

All transformations are evaluated lazily. Their results are not computed immediately; instead they are recorded, or remembered, as lineage. This allows Spark to rearrange certain transformations, coalesce them, or optimize them into stages for more efficient execution. Lineage and data immutability also provide fault tolerance: Spark can reproduce a DataFrame's original state by replaying the recorded lineage, giving it resiliency in the event of failures.

An action triggers the lazy evaluation of all the recorded transformations.
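Continuing the sketch above, nothing runs when the transformations are declared; only the action at the end triggers a job:

```python
# These transformations only extend the lineage; no computation happens yet.
query = df.filter(col("temp") > 25.0).select("id")

# count() is an action: Spark now optimizes the recorded lineage into a
# physical plan, schedules tasks on the executors, and runs them.
print(query.count())  # 2
```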

Transformations    Actions
---------------    ---------
orderBy()          show()
groupBy()          take()
filter()           count()
select()           collect()
join()             save()

Actions and transformations together contribute to a Spark query plan. Nothing in a query plan is executed until an action is invoked.
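You can inspect the plan Spark has recorded so far with explain(), which prints the plan without running the query (reusing df from the sketch above):

```python
# explain(True) prints the parsed, analyzed, optimized, and physical plans;
# none of them execute until an action such as count() or show() is called.
df.filter(col("temp") > 25.0).select("id").explain(True)
```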

Narrow and Wide Transformations

Transformations can be classified as having either narrow dependencies or wide dependencies. Any transformation where a single output partition can be computed from a single input partition is a narrow transformation. Two examples are filter() and contains(), because they can operate on a single partition and produce the resulting output partition without any exchange of data.
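A minimal sketch of a narrow transformation, reusing the SparkSession above (the data is made up):

```python
# filter() with a contains() predicate inspects each row in place:
# every output partition is computed from exactly one input partition,
# so no data moves between partitions.
names_df = spark.createDataFrame([("alice",), ("bob",), ("carol",)], ["name"])
narrow_df = names_df.filter(col("name").contains("al"))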

However, transformations such as groupBy() or orderBy() instruct Spark to perform wide transformations, where data from other partitions is read in, combined, and written to disk; this exchange of data across partitions is known as a shuffle. Wide transformations require output from other partitions to compute the final aggregations.
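A minimal sketch of a wide transformation, again reusing df from above; the physical plan reveals the shuffle as an Exchange node:

```python
from pyspark.sql.functions import avg

# groupBy() is wide: rows with the same key may live in different input
# partitions, so Spark must shuffle data before aggregating.
by_id = df.groupBy("id").agg(avg("temp").alias("avg_temp"))
by_id.explain()  # the physical plan contains an Exchange (shuffle) step
```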
