Spark offers four distinct components as libraries for diverse workloads: Spark SQL, Spark Structured Streaming, MLlib, and GraphX. Each of these components is separate from Spark's core fault-tolerant engine.
At a high level in the Spark architecture, a Spark application consists of a driver program that is responsible for orchestrating parallel operations on the Spark cluster. The driver accesses the distributed components in the cluster (the cluster manager and the Spark executors) through a SparkSession.
The driver is responsible for instantiating a SparkSession. It communicates with the cluster manager; it requests resources (CPU, memory, etc.) from the cluster manager for Spark's executors (JVMs); and it transforms all the Spark operations into DAG computations, schedules them, and distributes their execution as tasks across the Spark executors.
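To make that concrete, here is a minimal sketch of a driver instantiating a SparkSession; the application name, master URL, and resource values are illustrative, not prescribed by the text above.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: the driver instantiates a SparkSession, which negotiates with the
// cluster manager for executor resources. The app name, master URL, and
// resource values below are illustrative.
val spark = SparkSession.builder()
  .appName("ExampleApp")
  .master("local[*]")                     // or a cluster manager, e.g. "yarn"
  .config("spark.executor.memory", "2g")  // memory requested per executor JVM
  .config("spark.executor.cores", "2")    // CPU cores requested per executor
  .getOrCreate()
```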
As a user, you can set JVM runtime parameters, define DataFrames and Datasets, read from data sources, access catalog metadata, and issue Spark SQL queries. SparkSession provides a single unified entry point to all of Spark's functionality.
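Reusing the spark session from the earlier sketch, the following lines touch most of those capabilities; the JSON path and the temp-view name are hypothetical.

```scala
// Sketch, reusing the `spark` session built above; the file path and the
// view name "people" are hypothetical.
val df = spark.read.json("/path/to/people.json")  // read from a data source
df.createOrReplaceTempView("people")              // register the DataFrame for SQL
val adults = spark.sql("SELECT name FROM people WHERE age >= 18")  // issue a Spark SQL query
spark.catalog.listTables().show()                 // access catalog metadata
```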
Actual physical data is distributed across storage as partitions. While the data is distributed as partitions across the physical cluster, Spark treats each partition as a high-level logical data abstraction: a DataFrame in memory.
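A short sketch of inspecting and changing a DataFrame's partitioning, again reusing the spark session from above; the row count and target partition count are arbitrary.

```scala
// Sketch: inspect and control how a DataFrame's rows are partitioned.
// The range size (10,000 rows) and target count (8) are arbitrary.
val numbers = spark.range(0, 10000)     // a Dataset of 10,000 rows
println(numbers.rdd.getNumPartitions)   // partitions currently backing it
val evened = numbers.repartition(8)     // redistribute rows into 8 partitions
println(evened.rdd.getNumPartitions)    // prints 8
```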