In Delta Lake, whenever we overwrite or delete a record in a Delta table, it is not permanently deleted from the underlying files. Delta marks those records as deleted so they are not included in our result set, but we can still time travel and read them later if needed using the Delta log. That is one of the main features of Delta Lake.
But if you keep running transactions for an extended period of time while retaining all of the older files, they will consume more and more space. So what is the solution? Just vacuum it. Yes, just like you vacuum your home regularly, you can vacuum your Delta Lake regularly, get rid of those older files, and reclaim some space.
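For instance, here is a minimal sketch of time travel, assuming the same Sample_Data path used later in this post and that version 0 of the table still exists:
# Time travel: read an older snapshot of the Delta table.
# Version 0 is an assumed example; list the available versions with DESCRIBE HISTORY.
old_df = spark.read.format("delta").option("versionAsOf", 0).load("Sample_Data")
old_df.count()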
df = spark.read.format('delta').load("Sample_Data")
df.count()
> 15000000

df = spark.read.parquet("Sample_Data")
df.count()
> 45000000
When I read Sample_Data in Delta format, it returned 15000000 records. But when I read the same Sample_Data in Parquet format, it returned 45000000 records. The above example shows how data is maintained in a Delta lake.
Why did the same path return two different counts? Because, as part of some process, Sample_Data was overwritten several times (approximately 3 times). In Delta format, only the records of the latest version are returned; in Parquet, the records from every version are returned.
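You can verify how many times the table was overwritten by inspecting its history through the DeltaTable API. A sketch, assuming the delta-spark package is available and Sample_Data is the table path:
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "Sample_Data")
# Each overwrite appears as a separate WRITE operation in the table history
deltaTable.history().select("version", "timestamp", "operation", "operationParameters").show(truncate=False)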
So, we must vacuum the Delta table at regular intervals using the commands below.
Note that Azure Databricks does not automatically trigger VACUUM operations on Delta tables.
VACUUM delta.`/data/workPath/` RETAIN 100 HOURS -- vacuum files not required by versions more than 100 hours old
VACUUM delta.`/data/workPath/` DRY RUN -- do a dry run to get the list of files to be deleted
Databricks recommends that you set the retention interval to at least 7 days, because old snapshots and uncommitted files can still be in use by concurrent readers or writers of the table.
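The same cleanup can also be run from PySpark through the DeltaTable API. Below is a minimal sketch, assuming the table path /data/workPath/ from the SQL example above; note that retaining fewer hours than the default 168 (7 days) requires disabling Delta's retention duration check, which should only be done if you are sure no readers or writers still need the older versions:
from delta.tables import DeltaTable

# Only required when retaining fewer than the default 168 hours (7 days)
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

deltaTable = DeltaTable.forPath(spark, "/data/workPath/")
# Delete files no longer referenced by table versions older than 100 hours
deltaTable.vacuum(100)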