The amount of data we need to ingest and process is ever-increasing. We need the ability to horizontally scale storage and to dynamically scale compute resources to handle processing and consumption spikes. We also want to perform operations driven by business logic with transactional guarantees, and without having to rewrite large data files.
Over time, this set of requirements has been addressed by two distinct toolsets: data warehouses and data lakes.
Delta Lake brings capabilities such as transactional reliability and support for UPSERTs and MERGEs to data lakes, while maintaining the dynamic horizontal scalability and the separation of storage and compute that data lakes provide.
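To make the UPSERT/MERGE capability concrete, here is a minimal sketch using the delta-spark Python API. The table paths and the customer_id join key are hypothetical, and the example assumes a Spark environment with the Delta Lake package installed; it is an illustration of the idea, not a prescribed implementation.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Build a Spark session with the Delta Lake extensions enabled.
spark = (
    SparkSession.builder
    .appName("delta-merge-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Hypothetical paths: an existing Delta table of customers and a batch of updates.
customers = DeltaTable.forPath(spark, "/data/delta/customers")
updates = spark.read.parquet("/data/incoming/customer_updates")

# MERGE (upsert): update rows that match on customer_id, insert the rest.
# The whole operation commits as a single ACID transaction, with no manual
# rewriting of the underlying data files.
(
    customers.alias("t")
    .merge(updates.alias("u"), "t.customer_id = u.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```

Because the change is recorded in Delta Lake's transaction log, the merge either commits fully or not at all, which is what gives the operation its transactional guarantee.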
A data warehouse is a central relational repository of integrated, historical data from multiple data sources, presenting a single, unified view of the business under a common schema that covers all perspectives of the enterprise.
Benefits:
However, as technology has progressed, the term big data has emerged. Big data is defined as data that arrives in ever-higher volumes, with more velocity, in a greater variety of formats, and with higher veracity. These are known as the four V’s of data:
Volume: the amount of data
Velocity: the speed at which data arrives and must be processed. For instance, stock trading applications need access to near-real-time data.
Variety: the number of different “types” of data that are now available.
Veracity: the trustworthiness of the data. We want to make sure the data is accurate and of high quality.
Data warehouses often have a hard time addressing these four V’s.
Data warehouses suffer from both storage and scalability issues. They also don’t support the types of streaming architectures required for near-real-time data. And since there is no built-in support for tracking the trustworthiness of the data, data warehouse metadata is mainly focused on schema, and less on lineage.
Thus, the concept of the data lake comes into play.