Spark offers four distinct components as libraries for diverse workloads: Spark SQL, Spark Structured Streaming, MLlib, and GraphX. Each of these components is separate from Spark's core fault-tolerant engine.
At a high level in the Spark architecture, a Spark application consists of a driver program that is responsible for orchestrating parallel operations on the Spark cluster. The driver accesses the distributed components in the cluster (the cluster manager and the Spark executors) through a SparkSession.
The driver is responsible for instantiating a SparkSession. It communicates with the cluster manager; it requests resources (CPU, memory, etc.) from the cluster manager for Spark's executors (JVMs); and it transforms all the Spark operations into DAG computations, schedules them, and distributes their execution as tasks across the Spark executors.
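To make that concrete, here is a minimal sketch of a driver instantiating a SparkSession; the application name, master URL, and resource values are illustrative, not prescribed by the text above.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: the driver instantiates a SparkSession, which negotiates with the
// cluster manager for executor resources. The app name, master URL, and
// resource values below are illustrative.
val spark = SparkSession.builder()
  .appName("ExampleApp")
  .master("local[*]")                     // or a cluster manager, e.g. "yarn"
  .config("spark.executor.memory", "2g")  // memory requested per executor JVM
  .config("spark.executor.cores", "2")    // CPU cores requested per executor
  .getOrCreate()
```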
As a user, you can set JVM runtime parameters, define DataFrames and Datasets, read from data sources, access catalog metadata, and issue Spark SQL queries. SparkSession provides a single unified entry point to all of Spark's functionality.
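Reusing the spark session from the earlier sketch, the following lines touch most of those capabilities; the JSON path and the temp-view name are hypothetical.

```scala
// Sketch, reusing the `spark` session built above; the file path and the
// view name "people" are hypothetical.
val df = spark.read.json("/path/to/people.json")  // read from a data source
df.createOrReplaceTempView("people")              // register the DataFrame for SQL
val adults = spark.sql("SELECT name FROM people WHERE age >= 18")  // issue a Spark SQL query
spark.catalog.listTables().show()                 // access catalog metadata
```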
Actual physical data is distributed across storage as partitions. While the data is distributed as partitions across the physical cluster, Spark treats each partition as a high-level logical data abstraction: a DataFrame in memory.
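A short sketch of inspecting and changing a DataFrame's partitioning, again reusing the spark session from above; the row count and target partition count are arbitrary.

```scala
// Sketch: inspect and control how a DataFrame's rows are partitioned.
// The range size (10,000 rows) and target count (8) are arbitrary.
val numbers = spark.range(0, 10000)     // a Dataset of 10,000 rows
println(numbers.rdd.getNumPartitions)   // partitions currently backing it
val evened = numbers.repartition(8)     // redistribute rows into 8 partitions
println(evened.rdd.getNumPartitions)    // prints 8
```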