Introduction
Depending on the app type, user activity, and the kind of data stored in the DB, monthly DB storage costs can become noticeable and fast-growing. Rewriting the whole app and redesigning the DB storage sounds great but is often not acceptable due to legacy code, lack of developers, complexity, or similar constraints.
The 'raw' idea: find a way to compress the largest data types using native SQL functionality and keep it compatible with the existing libraries on the app side (a .NET C# web app). But before spending any time on development, there should be a measurement and estimation phase.
Estimation
Is it worth working in this direction at all? That was the first question that needed to be answered.
Pre-conditions
- a cloud multi-tenant web app that has been in production for many years;
- all data must stay available and consistent - no retention/cleaning/archiving;
- RDBMS in the cloud (several DBs for different purposes);
- some DBs are hundreds of gigabytes and growing fast;
- the 'rewrite everything' idea is not acceptable;
Used steps
1. Find the biggest DB tables using any of the many SQL scripts available on the Internet (there are also plenty of tools for this purpose; we have our own DB management area in our portal, which shows general stats by DBs, objects, indexes, fragmentation, etc.).
2. Review the nature of the data in those tables. If they contain data types like nvarchar(max) or varbinary(max) (MS SQL types are used as an example), these tables become candidates for further estimation/review.
3. Write SQL scripts for an overall estimation of the data reduction. Here is a chunk of the SQL used.
-- Estimate potential savings per column; rows below the threshold are not worth compressing
DECLARE @thresholdParam INT = 2000; -- ~2 KB worked well for our text data (see below)

SELECT
    SUM(DATALENGTH(TextColumnName) - DATALENGTH(COMPRESS(TextColumnName))) AS DataReduction
    ,COUNT(*) AS AffectedRows
FROM [dbo].[SomeTable]
WHERE TextColumnName IS NOT NULL
    AND DATALENGTH(TextColumnName) > @thresholdParam;
This script gives a high-level estimation in bytes per column. In practice, the benefit is usually better because of how the RDBMS stores large values at the low level (row/page structure). The @thresholdParam value should be chosen for each particular case and data type; in our case it was ~2 KB for text data.
With these numbers in hand, it is possible to decide whether the implementation is worth the effort. In our case the numbers were promising (~5-10% per table), and the eventual impact was even larger (again due to the peculiarities of the RDBMS storage structure).
Implementation
The implementation took several days of development work and several days of QA testing.
A good choice for storing compressed data in the same DB table is varbinary(max) (later I'll show a better option, though more complicated from an implementation perspective). Adding the new column(s) to the same table is optional but better for data consistency and convenience.
For example, imagine there is a column Description nvarchar(max); a new column BinDescription varbinary(max) is added next to it. The data migration script uses the same approach as the estimation SQL script.
UPDATE TOP(@batchSize) tab
SET
tab.BinDescription = COMPRESS(tab.Description)
,tab.Description = NULL
FROM [dbo].[SomeTable] tab
WHERE tab.Description IS NOT NULL
AND DATALENGTH(tab.Description) > @threshold
AND tab.BinDescription IS NULL;
It is better to wrap the script in a batching loop to avoid long transactions and interruptions. My script was failure tolerant: it could be interrupted at any moment and executed again, and it continued processing from the last committed point (the WHERE clause simply picks up the rows that have not been processed yet). To avoid high CPU and IO consumption on the DB level and to not block users, it is better to introduce an additional delay between batches (we had ~5% load on the DB, and the migration could run even during business hours). A sketch of such a runner is shown below.
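For illustration, here is a minimal C# sketch of such a batched runner, assuming plain ADO.NET via Microsoft.Data.SqlClient; the class name, batch size, and delay values are hypothetical and should be tuned per environment. It simply re-executes the UPDATE above until no rows are affected, which is also what makes it safe to interrupt and restart.
using System;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

public static class CompressionMigrationRunner
{
    // The batched UPDATE from above; re-runnable because processed rows already have BinDescription set.
    private const string BatchSql = @"
UPDATE TOP(@batchSize) tab
SET
    tab.BinDescription = COMPRESS(tab.Description)
    ,tab.Description = NULL
FROM [dbo].[SomeTable] tab
WHERE tab.Description IS NOT NULL
    AND DATALENGTH(tab.Description) > @threshold
    AND tab.BinDescription IS NULL;";

    public static async Task RunAsync(string connectionString)
    {
        using var connection = new SqlConnection(connectionString);
        await connection.OpenAsync();

        while (true)
        {
            using var command = new SqlCommand(BatchSql, connection);
            command.Parameters.AddWithValue("@batchSize", 500);   // small batches keep transactions short
            command.Parameters.AddWithValue("@threshold", 2000);  // same threshold as in the estimation

            var affected = await command.ExecuteNonQueryAsync();
            if (affected == 0)
                break; // nothing left to compress - migration is done

            await Task.Delay(TimeSpan.FromSeconds(1)); // throttle to keep DB CPU/IO impact low
        }
    }
}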
Application layer implementation
The particular implementation depends on the DAL used in the app. For a DAL built on the Dapper micro-ORM, the only changes are in the SQL queries used by that ORM (almost no changes in the C# code) - something like this.
SELECT
ID
,Name
,COALESCE([Description], CAST( DECOMPRESS([BinDescription]) as NVARCHAR(MAX)) ) as Description
,OtherColumn
FROM [dbo].[SomeTable]
WHERE ...
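For completeness, a hedged sketch of how such a query could be consumed via Dapper (the SomeRow class, the ID filter, and the connection handling are illustrative). The point is that the mapped Description property is still a plain string, so the calling C# code does not change.
using System.Collections.Generic;
using Dapper;
using Microsoft.Data.SqlClient;

public class SomeRow
{
    public int ID { get; set; }
    public string Name { get; set; }
    public string Description { get; set; } // arrives already decompressed by the SQL query
}

public static class SomeTableReader
{
    public static IEnumerable<SomeRow> Load(string connectionString, int id)
    {
        // The COALESCE/DECOMPRESS query shown above, with whatever WHERE clause the app needs.
        const string sql = @"
SELECT
    ID
    ,Name
    ,COALESCE([Description], CAST(DECOMPRESS([BinDescription]) AS NVARCHAR(MAX))) AS Description
FROM [dbo].[SomeTable]
WHERE ID = @id";

        using var connection = new SqlConnection(connectionString);
        return connection.Query<SomeRow>(sql, new { id });
    }
}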
For write operations, it is preferable to use EF or something similar to get UnitOfWork behavior, consistency, etc.
If EF is used, then to support compression and decompression the entity needs to be extended with an additional property of type byte[]. All the 'magic' should live in repositories (if such a pattern is used) or in other data access services. The good news is that the SQL COMPRESS function is fully compatible with .NET GZipStream (both work with the GZIP format).
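As a minimal sketch of that compatibility, assuming the text lives in nvarchar columns (SQL COMPRESS on an nvarchar value compresses its UTF-16 bytes, hence Encoding.Unicode below; the helper name is made up):
using System.IO;
using System.IO.Compression;
using System.Text;

public static class TextCompression
{
    // Produces a GZIP payload that SQL Server's DECOMPRESS() can read (COMPRESS() emits the same GZIP format).
    public static byte[] Compress(string text)
    {
        var raw = Encoding.Unicode.GetBytes(text); // nvarchar is UTF-16
        using var output = new MemoryStream();
        using (var gzip = new GZipStream(output, CompressionLevel.Optimal))
        {
            gzip.Write(raw, 0, raw.Length);
        }
        return output.ToArray();
    }

    // Mirrors CAST(DECOMPRESS(col) AS NVARCHAR(MAX)) on the .NET side.
    public static string Decompress(byte[] payload)
    {
        using var input = new MemoryStream(payload);
        using var gzip = new GZipStream(input, CompressionMode.Decompress);
        using var output = new MemoryStream();
        gzip.CopyTo(output);
        return Encoding.Unicode.GetString(output.ToArray());
    }
}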
When saving data to the DB via EF, we only need to check whether the payload is suitable for compression (in my case there was a lightweight service with the encapsulated logic), put the compressed payload into BinColumn, and reset OriginalColumn. On load, the reverse operation is used: check if BinColumn is not null, decompress (via a direct call or some service), and assign the decompressed payload to OriginalColumn. To encapsulate all this 'magic', repositories (or whatever else is used) should expose a DTO without any binary/compression-related properties.
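Putting it together, a hedged repository-level sketch: SomeEntity is a hypothetical EF entity, SomeDto is what callers see, the TextCompression helpers are the ones sketched above, and the threshold value is an assumption.
using System.Text;

// Hypothetical EF entity mapped to [dbo].[SomeTable]
public class SomeEntity
{
    public int ID { get; set; }
    public string Description { get; set; }     // original nvarchar(max) column
    public byte[] BinDescription { get; set; }  // compressed varbinary(max) column
}

// DTO exposed by the repository - no binary/compression-related properties leak out
public class SomeDto
{
    public int ID { get; set; }
    public string Description { get; set; }
}

public class SomeRepository
{
    private const int CompressionThreshold = 2000; // bytes of UTF-16 text; same idea as the estimation threshold

    // Save path: compress large payloads into BinDescription and reset Description
    public void ApplyDescription(SomeEntity entity, string description)
    {
        if (description != null && Encoding.Unicode.GetByteCount(description) > CompressionThreshold)
        {
            entity.BinDescription = TextCompression.Compress(description);
            entity.Description = null;
        }
        else
        {
            entity.Description = description;
            entity.BinDescription = null;
        }
    }

    // Load path: decompress only when the binary column is populated
    public SomeDto ToDto(SomeEntity entity) => new SomeDto
    {
        ID = entity.ID,
        Description = entity.Description
            ?? (entity.BinDescription != null ? TextCompression.Decompress(entity.BinDescription) : null)
    };
}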
Conclusion
As a result: up to 40% DB size reduction, no issues found during testing, and good results in production. No legacy code was changed or affected. Of course, data should still be loaded with pagination, and it is better to do compression/decompression at the code level (not only because of SQL CPU load but also because of network IO between the app and the DB server). Old EF had serious performance issues when large amounts of data were passed through the ORM; I don't think the new EF solves this completely, but it is definitely better.
Overall app performance improved (mostly due to the smaller payload going over the network and the noticeable DB size reduction). Code complexity was barely affected, and the main benefit was a noticeable reduction in the monthly infrastructure bill.