Data Lake Vs Delta Lake

DATA LAKE

A data lake is a storage repository that cheaply stores vast amounts of raw data in its native format.

It consists of current and historical data dumps in various formats, such as XML, JSON, CSV, and Parquet.

Drawbacks in Data Lake

  • No atomicity: a write is not all-or-nothing, so a job that fails midway can leave partial or corrupt data behind.
  • No quality enforcement: nothing validates incoming data, so inconsistent and unusable data can accumulate.
  • No consistency/isolation: it is nearly impossible to read or append reliably while an update is in progress.
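
The missing quality enforcement can be illustrated with a small sketch in plain Python (standard library only, no Spark involved). The file name events.jsonl, the EXPECTED_SCHEMA mapping, and the helper functions are hypothetical names chosen for this illustration. The point is only that a plain file-based lake accepts whatever is appended, and the mismatch is discovered later, at read time.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical schema the downstream consumer expects (an assumption for this sketch).
EXPECTED_SCHEMA = {"id": int, "amount": float}


def append_records(path, records):
    """Append records as JSON lines with no validation, as a plain lake would."""
    with path.open("a", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")


def read_records(path):
    """Read back every JSON line stored in the file."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def find_schema_mismatches(records, expected):
    """Return the records whose keys or value types differ from the expected schema."""
    bad = []
    for rec in records:
        wrong_keys = set(rec) != set(expected)
        # Types are only checked when the keys match, so rec[k] is always present.
        if wrong_keys or any(not isinstance(rec[k], t) for k, t in expected.items()):
            bad.append(rec)
    return bad


with tempfile.TemporaryDirectory() as tmp:
    lake_file = Path(tmp) / "events.jsonl"
    # A well-formed record, followed by two records of a different shape.
    append_records(lake_file, [{"id": 1, "amount": 9.5}])
    append_records(lake_file, [{"id": "2", "amount": "n/a"}, {"user": 3}])

    stored = read_records(lake_file)
    mismatches = find_schema_mismatches(stored, EXPECTED_SCHEMA)

print(f"{len(stored)} records stored, {len(mismatches)} do not match the schema")
```

All three records are written without complaint, and the two malformed ones surface only when a reader checks them.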

DELTA LAKE

Delta Lake allows us to incrementally improve the quality of the data until it is ready for consumption. Data flows like water through Delta Lake from one stage to the next (Bronze -> Silver -> Gold).

  • Delta Lake brings full ACID transactions to Apache Spark. That means a job either completes fully or has no effect at all.
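  • Delta Lake is open source; it was released by Databricks and is hosted by the Linux Foundation. You can store a large amount of data without worrying about locking, because writers coordinate through optimistic concurrency control rather than locks.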
  • Delta Lake is deeply powered by Apache Spark, meaning existing Spark jobs (batch or streaming) can be converted to use it without rewriting them from scratch.
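
The all-or-nothing behavior can be sketched with a toy table in plain Python. This is not Delta Lake's actual implementation or API. It is a simplified model of the general idea behind a transaction log: data files only become visible to readers once a commit entry that references them has been published atomically. The class ToyTable, the _log directory, and the fail_before_commit flag are all hypothetical names invented for this illustration.

```python
import json
import os
import tempfile
from pathlib import Path


class ToyTable:
    """Toy table: a data file is visible only once a commit file lists it."""

    def __init__(self, root):
        self.root = Path(root)
        self.log = self.root / "_log"
        self.log.mkdir(parents=True, exist_ok=True)

    def _next_version(self):
        # Only fully published commits (*.json) count as versions.
        return len(list(self.log.glob("*.json")))

    def write(self, rows, fail_before_commit=False):
        version = self._next_version()
        data_file = self.root / f"part-{version:05d}.json"
        data_file.write_text(json.dumps(rows))
        if fail_before_commit:
            # The data file exists on disk, but no commit references it.
            raise RuntimeError("simulated crash before commit")
        # Publish the commit atomically: write a temp file, then rename it.
        tmp = self.log / f"{version:05d}.json.tmp"
        tmp.write_text(json.dumps({"add": data_file.name}))
        os.replace(tmp, self.log / f"{version:05d}.json")

    def read(self):
        # Readers follow the log and ignore any file it does not mention.
        rows = []
        for commit in sorted(self.log.glob("*.json")):
            entry = json.loads(commit.read_text())
            rows.extend(json.loads((self.root / entry["add"]).read_text()))
        return rows


with tempfile.TemporaryDirectory() as tmp_dir:
    table = ToyTable(tmp_dir)
    table.write([{"id": 1}])
    try:
        table.write([{"id": 2}], fail_before_commit=True)
    except RuntimeError:
        pass
    rows_after_crash = table.read()   # the failed write is invisible
    table.write([{"id": 3}])
    rows_final = table.read()

print(rows_after_crash)
print(rows_final)
```

The failed job leaves an orphaned data file behind, but since no commit points to it, readers never see partial results.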

Delta Lake Architecture

Bronze Tables

Data may come from various sources and could be dirty, so the Bronze tables serve as a dumping ground for raw data.

Silver Tables

Silver tables consist of intermediate data with some cleanup applied.

They are queryable for easy debugging.

Gold Tables

Gold tables consist of clean data that is ready for consumption.
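
The Bronze -> Silver -> Gold flow can be sketched in plain Python, with lists of dictionaries standing in for tables. In a real deployment each stage would be a Delta table written by a Spark job. The sample order records, the cleanup rules, and the function names to_silver and to_gold are assumptions made up for this illustration.

```python
from collections import defaultdict

# Bronze: raw records exactly as they arrived, including dirty ones.
bronze = [
    {"order_id": "1", "country": "us", "amount": "10.50"},
    {"order_id": "2", "country": "US ", "amount": "4.50"},
    {"order_id": "2", "country": "US ", "amount": "4.50"},   # duplicate
    {"order_id": "3", "country": "de", "amount": "oops"},    # invalid amount
    {"order_id": None, "country": "fr", "amount": "7.00"},   # missing key
]


def to_silver(rows):
    """Drop duplicates and unparseable rows, and normalize types and casing."""
    seen = set()
    silver_rows = []
    for row in rows:
        if row["order_id"] is None or row["order_id"] in seen:
            continue
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        seen.add(row["order_id"])
        silver_rows.append({
            "order_id": int(row["order_id"]),
            "country": row["country"].strip().upper(),
            "amount": amount,
        })
    return silver_rows


def to_gold(rows):
    """Aggregate cleaned rows into a consumption-ready total per country."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)

print(silver)
print(gold)
```

Keeping the raw Bronze records untouched means the Silver and Gold stages can be rebuilt whenever the cleanup rules change.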