The RDD is the most basic abstraction in Spark, and three vital characteristics are associated with an RDD: its dependencies, its partitions, and a compute function that produces an Iterator[T] over the data stored in the RDD.
One way Spark optimizes computation is by relying on common patterns found in data analysis. These patterns are expressed as high-level operations such as filtering, selecting, counting, aggregating, averaging, and grouping. This provides added clarity and simplicity.
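As a rough sketch, these three characteristics also surface through the public RDD API; the numbers and transformations below are purely illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("RDDCharacteristics").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// An RDD derived from a parent RDD via map
val parent = sc.parallelize(1 to 100, numSlices = 4)
val doubled = parent.map(_ * 2)

// Partitions: how the data is split up for parallel computation
println(doubled.getNumPartitions)   // 4

// Dependencies: the parent RDDs this RDD was derived from (its lineage)
println(doubled.dependencies)       // e.g. a OneToOneDependency on `parent`

// Compute function: runs per partition and produces an Iterator[T];
// mapPartitions exposes that iterator-per-partition view directly
val partitionSums = doubled.mapPartitions(iter => Iterator(iter.sum))
println(partitionSums.collect().toSeq)
```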
In addition to better performance and space efficiency across Spark components, other advantages of the structured APIs are expressivity, simplicity, composability, and uniformity.
Developers no longer need to deal with low-level operations, which are more prone to mistakes. Instead, using the high-level abstractions not only makes the code more readable but also lets Spark do all of the optimization under the hood.
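For instance, here is a minimal sketch (with made-up names and ages) contrasting a hand-rolled RDD average with the equivalent high-level DataFrame operations; only the latter tells Spark what we want, so the optimizer can decide how to execute it:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder.appName("AvgAges").master("local[*]").getOrCreate()
import spark.implicits._

val data = Seq(("Brooke", 20), ("Brooke", 25), ("Denny", 31), ("Jules", 30))

// Low-level RDD version: we spell out how to compute the average,
// and Spark cannot see our intent inside the opaque lambdas
val avgByNameRDD = spark.sparkContext.parallelize(data)
  .mapValues(age => (age, 1))
  .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
  .mapValues { case (sum, count) => sum.toDouble / count }

// High-level structured version: we say what we want (group, then average),
// and Spark's optimizer figures out the execution plan
val avgByNameDF = data.toDF("name", "age").groupBy("name").agg(avg("age"))
avgByNameDF.show()
```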
Spark DataFrames are like distributed in-memory tables with named columns and schemas, where each column has a specific data type. DataFrames are immutable and Spark keeps a lineage of all transformations.
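A small sketch of that table-like, immutable behavior (the columns are hypothetical): a transformation never modifies the original DataFrame; it returns a new one whose lineage Spark tracks.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("ImmutableDF").master("local[*]").getOrCreate()
import spark.implicits._

// A DataFrame behaves like a table: named columns, each with a data type
val people = Seq(("Brooke", 20), ("Denny", 31), ("Jules", 30)).toDF("name", "age")
people.printSchema()   // name: string, age: int

// Transformations return a new DataFrame; `people` itself is never changed,
// and Spark records the lineage of operations that produced `adults`
val adults = people.where("age > 21")
adults.show()
```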
Spark also supports data types that match each of its supported programming languages, and you can declare your own complex types as well.
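As a sketch in Scala, the basic types (IntegerType, StringType, and so on) mirror the language's own types, while the structured types let you compose your own shapes; the field names below are just illustrative:

```scala
import org.apache.spark.sql.types._

// Basic Spark data types corresponding to Scala's Int and String
val ageType: DataType = IntegerType
val nameType: DataType = StringType

// Complex (structured) types composed from the basic ones
val addressType = StructType(Seq(
  StructField("street", StringType, nullable = true),
  StructField("city", StringType, nullable = true)))
val tagsType = ArrayType(StringType)
val attributesType = MapType(StringType, StringType)
```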
A schema in Spark defines the column names and associated data types for a DataFrame. It is most often used when reading data from an external data source. Defining a schema up front, as opposed to taking a schema-on-read approach, offers three benefits: Spark is relieved of the onus of inferring data types; Spark does not need to create a separate job just to read a large portion of the file to infer the schema, which can be expensive and slow; and errors can be detected early if the data does not match the schema.
There are two ways to define a schema: define it programmatically, or employ a Data Definition Language (DDL) string.
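A sketch of both approaches (the column names are illustrative, not tied to any particular dataset):

```scala
import org.apache.spark.sql.types._

// 1. Programmatically, with the StructType/StructField API
val schema = StructType(Seq(
  StructField("author", StringType, nullable = false),
  StructField("title", StringType, nullable = false),
  StructField("pages", IntegerType, nullable = false)))

// 2. As a DDL string, which Spark parses into the same structure
val schemaDDL = "author STRING, title STRING, pages INT"

// Either form can be handed to a reader up front, for example:
//   spark.read.schema(schema).json("<path>")
//   spark.read.schema(schemaDDL).json("<path>")
```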