DATA LAKE
A data lake is a storage repository that cheaply stores vast amounts of raw data in its native format.
It consists of current and historical data dumps in various formats, such as XML, JSON, CSV, and Parquet.
Drawbacks of a Data Lake
- No atomicity: writes are not all-or-nothing, so a job that fails midway can leave partial or corrupt data behind.
- No quality enforcement: nothing validates incoming data, so inconsistent and unusable records can accumulate.
- No consistency/isolation: reads and appends cannot run safely while an update is in progress, because they may see partially written data.
DELTA LAKE
Delta Lake allows us to incrementally improve the quality of data until it is ready for consumption. Data flows like water through Delta Lake from one stage to the next (Bronze -> Silver -> Gold).
- Delta Lake brings full ACID transactions to Apache Spark. That means a job either completes fully or makes no changes at all.
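- Delta Lake was open-sourced by Databricks and is now hosted by the Linux Foundation. It relies on optimistic concurrency control, so you can store a large amount of data without worrying about locking.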
- Delta Lake is deeply powered by Apache Spark, meaning existing Spark jobs (batch or streaming) can be converted without rewriting them from scratch.
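To make the "all or nothing" idea concrete, here is a minimal sketch in plain Python. It is not Delta Lake's API or its actual commit protocol; it only illustrates the principle by staging data in a temporary file and swapping it into place in one step. The function name `atomic_write_json` and the file name `events.jsonl` are hypothetical.

```python
import json
import os
import tempfile


def atomic_write_json(records, target_path):
    """Write records so readers see either the old file or the complete new one."""
    directory = os.path.dirname(os.path.abspath(target_path))
    # Stage the data in a temporary file in the same directory (same filesystem).
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp_file:
            for record in records:
                # json.dumps raises TypeError for values it cannot serialize,
                # which aborts the write before the target is touched.
                tmp_file.write(json.dumps(record) + "\n")
        # The rename is the "commit": on POSIX it is atomic when both paths
        # are on the same filesystem.
        os.replace(tmp_path, target_path)
    except BaseException:
        # Roll back: discard the staged file and leave target_path unchanged.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "events.jsonl")
    atomic_write_json([{"id": 1}], path)
    try:
        # The second record cannot be serialized, so the whole write fails.
        atomic_write_json([{"id": 2}, {"id": object()}], path)
    except TypeError:
        pass
    with open(path) as f:
        print(f.read())  # still only the first, fully committed write
```

A plain data lake write that fails halfway would leave a truncated file behind; here the failed job leaves the previous version intact, which is the behavior the ACID bullet above describes.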
Delta Lake Architecture
Bronze Tables
Data may come from various sources and may be dirty, so Bronze tables serve as a dumping ground for raw data.
Silver Tables
They consist of intermediate data with some cleanup applied.
They are queryable, which makes debugging easier.
Gold Tables
They consist of clean data that is ready for consumption.
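The Bronze -> Silver -> Gold flow can be sketched in a few lines of plain Python. This is only a conceptual illustration using in-memory lists, not Delta Lake or Spark code; the sample records, the cleanup rules, and the names `to_silver` and `to_gold` are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical raw events as they might land in a Bronze table:
# stored as-is, including bad rows.
bronze = [
    {"user": "alice", "amount": "10.5"},
    {"user": "bob", "amount": "abc"},   # unparseable amount
    {"user": None, "amount": "3"},      # missing user
    {"user": "alice", "amount": "4.5"},
]


def to_silver(rows):
    """Drop rows with a missing user or a non-numeric amount; cast amount to float."""
    cleaned = []
    for row in rows:
        if not row.get("user"):
            continue
        try:
            amount = float(row["amount"])
        except (TypeError, ValueError):
            continue
        cleaned.append({"user": row["user"], "amount": amount})
    return cleaned


def to_gold(rows):
    """Aggregate cleaned rows into a per-user total that is ready for reporting."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["user"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'alice': 15.0}
```

Each stage reads only from the previous one, so the raw Bronze data stays available for reprocessing if the cleanup rules change later.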