The RDD is the most basic abstraction in Spark, and three vital characteristics are associated with an RDD: its dependencies, its partitions, and a compute function that produces an Iterator[T] over the data stored in the RDD.
One way Spark optimizes computation is by relying on common patterns found in data analysis. These patterns are expressed as high-level operations such as filtering, selecting, counting, aggregating, averaging, and grouping. This provides added clarity and simplicity.
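As a rough sketch, these three characteristics also surface through the public RDD API; the numbers and transformations below are purely illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("RDDCharacteristics").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// An RDD derived from a parent RDD via map
val parent = sc.parallelize(1 to 100, numSlices = 4)
val doubled = parent.map(_ * 2)

// Partitions: how the data is split up for parallel computation
println(doubled.getNumPartitions)   // 4

// Dependencies: the parent RDDs this RDD was derived from (its lineage)
println(doubled.dependencies)       // e.g. a OneToOneDependency on `parent`

// Compute function: runs per partition and produces an Iterator[T];
// mapPartitions exposes that iterator-per-partition view directly
val partitionSums = doubled.mapPartitions(iter => Iterator(iter.sum))
println(partitionSums.collect().toSeq)
```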
In addition to better performance and space efficiency across Spark components, other advantages of the structured APIs are expressivity, simplicity, composability, and uniformity.
Developers no longer need to deal with low-level operations, which are more prone to mistakes. Instead, using the high-level abstractions not only makes the code more readable but also lets Spark do all of the optimization under the hood.
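For instance, here is a minimal sketch (with made-up names and ages) contrasting a hand-rolled RDD average with the equivalent high-level DataFrame operations; only the latter tells Spark what we want, so the optimizer can decide how to execute it:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder.appName("AvgAges").master("local[*]").getOrCreate()
import spark.implicits._

val data = Seq(("Brooke", 20), ("Brooke", 25), ("Denny", 31), ("Jules", 30))

// Low-level RDD version: we spell out how to compute the average,
// and Spark cannot see our intent inside the opaque lambdas
val avgByNameRDD = spark.sparkContext.parallelize(data)
  .mapValues(age => (age, 1))
  .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
  .mapValues { case (sum, count) => sum.toDouble / count }

// High-level structured version: we say what we want (group, then average),
// and Spark's optimizer figures out the execution plan
val avgByNameDF = data.toDF("name", "age").groupBy("name").agg(avg("age"))
avgByNameDF.show()
```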
Spark DataFrames are like distributed in-memory tables with named columns and schemas, where each column has a specific data type. DataFrames are immutable and Spark keeps a lineage of all transformations.
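A small sketch of that table-like, immutable behavior (the columns are hypothetical): a transformation never modifies the original DataFrame; it returns a new one whose lineage Spark tracks.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("ImmutableDF").master("local[*]").getOrCreate()
import spark.implicits._

// A DataFrame behaves like a table: named columns, each with a data type
val people = Seq(("Brooke", 20), ("Denny", 31), ("Jules", 30)).toDF("name", "age")
people.printSchema()   // name: string, age: int

// Transformations return a new DataFrame; `people` itself is never changed,
// and Spark records the lineage of operations that produced `adults`
val adults = people.where("age > 21")
adults.show()
```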
Spark also supports data types that match each of its supported programming languages, and you can declare your own complex types as well.
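As a sketch in Scala, the basic types (IntegerType, StringType, and so on) mirror the language's own types, while the structured types let you compose your own shapes; the field names below are just illustrative:

```scala
import org.apache.spark.sql.types._

// Basic Spark data types corresponding to Scala's Int and String
val ageType: DataType = IntegerType
val nameType: DataType = StringType

// Complex (structured) types composed from the basic ones
val addressType = StructType(Seq(
  StructField("street", StringType, nullable = true),
  StructField("city", StringType, nullable = true)))
val tagsType = ArrayType(StringType)
val attributesType = MapType(StringType, StringType)
```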
A schema in Spark defines the column names and associated data types for a DataFrame. It is most often used when reading data from an external data source. Defining a schema up front, as opposed to taking a schema-on-read approach, offers three benefits: Spark is relieved of the onus of inferring data types; Spark does not need to create a separate job just to read a large portion of the file to infer the schema, which can be expensive and slow; and errors can be detected early if the data does not match the schema.
There are two ways to define a schema: define it programmatically, or employ a Data Definition Language (DDL) string.
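A sketch of both approaches (the column names are illustrative, not tied to any particular dataset):

```scala
import org.apache.spark.sql.types._

// 1. Programmatically, with the StructType/StructField API
val schema = StructType(Seq(
  StructField("author", StringType, nullable = false),
  StructField("title", StringType, nullable = false),
  StructField("pages", IntegerType, nullable = false)))

// 2. As a DDL string, which Spark parses into the same structure
val schemaDDL = "author STRING, title STRING, pages INT"

// Either form can be handed to a reader up front, for example:
//   spark.read.schema(schema).json("<path>")
//   spark.read.schema(schemaDDL).json("<path>")
```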