In Delta Lake, whenever we overwrite or delete a record in a Delta table, it is not permanently deleted from the underlying files. Delta marks those records as deleted so they are not included in our result set, but we can still time travel and read them later if needed using the Delta log. That is one of the main features of Delta Lake.
But if you keep running transactions for an extended period of time while retaining all of the older files, they will consume more and more space. So what is the solution? Just vacuum it. Yes, just like you vacuum your home regularly, you can vacuum your Delta Lake regularly, get rid of those older files, and reclaim some space.
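For instance, here is a minimal sketch of time travel, assuming the same Sample_Data path used later in this post and that version 0 of the table still exists:
# Time travel: read an older snapshot of the Delta table.
# Version 0 is an assumed example; list the available versions with DESCRIBE HISTORY.
old_df = spark.read.format("delta").option("versionAsOf", 0).load("Sample_Data")
old_df.count()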
df = spark.read.format('delta').load("Sample_Data")
df.count()
> 15000000

df = spark.read.parquet("Sample_Data")
df.count()
> 45000000
When I read Sample_Data in Delta format, it returned 15000000 records. But when I read the same Sample_Data in Parquet format, it returned 45000000 records. The above example shows how data is maintained in a Delta lake.
Why did the same path return two different counts? Because, as part of some process, Sample_Data was overwritten several times (approximately 3 times). In Delta format, only the records of the latest version are returned; in Parquet, the records from every version are returned.
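You can verify how many times the table was overwritten by inspecting its history through the DeltaTable API. A sketch, assuming the delta-spark package is available and Sample_Data is the table path:
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "Sample_Data")
# Each overwrite appears as a separate WRITE operation in the table history
deltaTable.history().select("version", "timestamp", "operation", "operationParameters").show(truncate=False)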
So, we must vacuum the Delta table at regular intervals using the commands below.
Note that Azure Databricks does not automatically trigger VACUUM operations on Delta tables.
VACUUM delta.`/data/workPath/` RETAIN 100 HOURS -- vacuum files not required by versions more than 100 hours old
VACUUM delta.`/data/workPath/` DRY RUN -- do a dry run to get the list of files to be deleted
Databricks recommends that you set the retention interval to at least 7 days, because old snapshots and uncommitted files can still be in use by concurrent readers or writers of the table.
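The same cleanup can also be run from PySpark through the DeltaTable API. Below is a minimal sketch, assuming the table path /data/workPath/ from the SQL example above; note that retaining fewer hours than the default 168 (7 days) requires disabling Delta's retention duration check, which should only be done if you are sure no readers or writers still need the older versions:
from delta.tables import DeltaTable

# Only required when retaining fewer than the default 168 hours (7 days)
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

deltaTable = DeltaTable.forPath(spark, "/data/workPath/")
# Delete files no longer referenced by table versions older than 100 hours
deltaTable.vacuum(100)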