Apache Spark is a unified engine designed for large-scale distributed data processing, on premises in data centers or in the cloud. It keeps intermediate results in memory instead of writing them to disk between stages, which makes it much faster than Hadoop MapReduce for many workloads, especially iterative and interactive ones. Spark also incorporates libraries with composable APIs for machine learning (MLlib), SQL (Spark SQL), stream processing (Structured Streaming), and graph processing (GraphX).

This note is a selective summary that covers the basic concepts, with some emphasis on streaming. It aims to give an overview rather than document every API or implementation detail.

Overview

Spark App Concepts

Structured APIs

Structured Streaming

Side - A case study on the Delta Lake architecture, which is built on Spark

Databricks Delta Lake