In computing, a cache is a hardware or software component that stores data so that future requests for that data can be served faster; the data stored in a cache might be the result of an earlier computation or a copy of data stored elsewhere. [Ref: Wikipedia]
In other words, we have the actual data in one place and we keep a copy of it somewhere else, mainly to improve performance. Sometimes we also store computed results for future use. The main objective of caching is better performance. We use caching, directly or indirectly, at almost every layer in computing, e.g. client, server, operating system, and network switches.
What happens to the copied/computed results when the actual data is updated? At that point the copied/computed data is called dirty or stale data, and an application reading from the cache will not show the latest data. Ideally, as soon as the source is updated, the targets should be notified/updated in real time.
Whether we should use a cache or not depends on a few factors:
- How frequently does the source change? This is the most important factor, as more changes mean more chances of dirty data in the cached copies.
- What kind of damage, and how much, can occur if a user sees/uses dirty/stale data?
- Is there an easy way to update the targets whenever we want, manually or automatically?
- The cost of maintaining the same data in multiple places vs. the performance gained.
Scenarios for better understanding
Let's look at some cases where we can increase performance using a cache.
Case 1
For example, we have a database table in which we add a new record whenever a user logs into the application, and a page where we show different statistics of logged-in users: day-wise counts over a date range, location-wise daily login counts, and so on.
Solution
Instead of reading the raw table and doing the summation/computation on every request, we can pre-compute the results using a job and show the pre-computed data directly on the UI. In this case only the current day's statistics change as new logins happen. If we want that part to be real time, we could re-compute the current day's data through triggers whenever a new record is inserted, but that is a costly operation. A better solution is to re-run the re-computation job at some interval (e.g. 1, 10 or 30 minutes), chosen according to the transaction volume and the time the re-computation takes.
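If the application runs on ASP.NET Core, one way to host such a recurring job is a background service. The sketch below is a minimal illustration, assuming a hypothetical stored procedure usp_RefreshDailyLoginStats that rebuilds the current day's pre-computed statistics; the procedure name and the 10-minute interval are placeholders.

```csharp
using System;
using System.Data;
using System.Data.SqlClient;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;

// Re-computes the login statistics at a fixed interval instead of on every page request.
public class LoginStatsRefreshJob : BackgroundService
{
    private readonly string _connectionString;

    public LoginStatsRefreshJob(string connectionString)
    {
        _connectionString = connectionString;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            using (var connection = new SqlConnection(_connectionString))
            using (var command = new SqlCommand("usp_RefreshDailyLoginStats", connection))
            {
                command.CommandType = CommandType.StoredProcedure;
                await connection.OpenAsync(stoppingToken);
                await command.ExecuteNonQueryAsync(stoppingToken); // rebuild today's pre-computed counts
            }

            // Wait for the chosen interval before re-computing again.
            await Task.Delay(TimeSpan.FromMinutes(10), stoppingToken);
        }
    }
}
```

The UI then reads only the small pre-computed table, and at most the last interval's worth of logins can be missing from the numbers.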
Case 2
We have multiple tables and we need to fetch data by applying joins on them.
Solution
We can maintain a de-normalized table and populate it at some interval using a job; the job can track changes in the source tables and insert/update only the relevant records. Instead of reading from multiple tables and applying joins on every request, we then read directly from the de-normalized table. We can also use materialized views (called indexed views in SQL Server) if we meet their requirements. The more source tables are involved, the more chances there are of dirty data in the de-normalized (middle) table.
Case 3
We are writing some SQL code to complete a task (e.g. calculating an employee's pay based on their attendance). For this we need to access a table (e.g. TableA) multiple times in our script, and TableA is huge.
Solution
In the script, instead of reading the actual table every time we need it, we can load the relevant data into a temp table (or table variable) once at the start and then use that temp table in the rest of the script. We can also index this temp table to avoid performance issues. The trade-off: if the whole script is supposed to work on the very latest data, we may miss records inserted into TableA after the temp table was loaded.
Case 4
One-way database replication is also a form of keeping a copy of the actual data. Since it replicates in (near) real time, there is almost no chance of a dirty state unless something goes wrong in the replication process.
Case 5
What about data which a web application reads from the DB on every request (e.g. the Country list)? This is mainly configuration data, or data which doesn't change frequently.
Solution
We can load this data once and keep it in a third-party cache (e.g. Redis, Memcached) or in application memory (e.g. the Application object in ASP.NET, or a static class). As for when to load it: we can load everything at startup, or load it into the cache when the first request arrives and serve the cached version for subsequent requests. We can reload the cache after a specific interval, or when a request carries an explicit instruction to fetch a fresh copy; that is, we can give the end user (or an Admin) a front-end option to request the latest data. If our application is load balanced, keeping the cached data in a single server's memory becomes a problem; in such cases a distributed cache (e.g. Redis) is the better solution, because it lets us keep the cached data on other (multiple) servers.
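A minimal sketch of the static-class variant, assuming a hypothetical Country entity and a loadFromDb delegate supplied by the caller; it covers the behaviours above: load on first request, reload after an interval, and reload on explicit request. Note that this copy lives per server, so it does not by itself solve the load-balancing issue.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical entity used only for illustration.
public class Country
{
    public int Id { get; set; }
    public string Name { get; set; }
}

// Lazily loads the country list on the first request, serves the cached copy afterwards,
// and reloads when the copy is older than MaxAge or a caller explicitly asks for fresh data.
public static class CountryCache
{
    private static readonly object SyncRoot = new object();
    private static readonly TimeSpan MaxAge = TimeSpan.FromMinutes(30);
    private static List<Country> _countries;
    private static DateTime _loadedAtUtc;

    public static List<Country> Get(Func<List<Country>> loadFromDb, bool forceRefresh = false)
    {
        lock (SyncRoot)
        {
            bool stale = _countries == null || DateTime.UtcNow - _loadedAtUtc > MaxAge;
            if (forceRefresh || stale)
            {
                _countries = loadFromDb();       // the only place that actually hits the database
                _loadedAtUtc = DateTime.UtcNow;
            }
            return _countries;
        }
    }
}
```

A caller would use it like `CountryCache.Get(() => countryRepository.GetAll())`, and the Admin's "give me fresh data" action would pass `forceRefresh: true` (countryRepository is a placeholder for whatever data access code is in place).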
Case 6
Another approach: what about making the cache our main data source? That means storing transactional data (e.g. login history) directly in the cache, serving it from the cache, and then pushing the changes to the actual database (permanent storage) at intervals, or at the same time, depending on the nature of the data. In this case our application's users never see a dirty state, but any other application reading the database directly may see dirty or missing information. We still need to decide what to write to the cache (volatile memory) first, what to write to the DB first, and what to write to both at the same time.
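A rough write-behind sketch of this idea, assuming hypothetical LoginEvent and ILoginRepository types: the cache is the first write target and a scheduled job flushes pending records to the database. Records sitting only in memory are lost if the process dies, which is part of the "nature of data" trade-off mentioned above.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types used only for illustration.
public class LoginEvent
{
    public string UserName { get; set; }
    public DateTime OccurredAtUtc { get; set; }
}

public interface ILoginRepository
{
    void SaveRange(IReadOnlyList<LoginEvent> events);
}

// Serves login history from memory immediately and pushes it to the permanent
// database in batches; anything not yet flushed is invisible to applications
// that read the database directly.
public class LoginHistoryCache
{
    private readonly object _syncRoot = new object();
    private readonly List<LoginEvent> _all = new List<LoginEvent>();
    private readonly ConcurrentQueue<LoginEvent> _pendingForDb = new ConcurrentQueue<LoginEvent>();

    public void Add(LoginEvent loginEvent)
    {
        lock (_syncRoot) { _all.Add(loginEvent); } // visible to application users right away
        _pendingForDb.Enqueue(loginEvent);         // queued for the next flush to the database
    }

    public IReadOnlyList<LoginEvent> GetAll()
    {
        lock (_syncRoot) { return _all.ToList(); }
    }

    // Called by a scheduled job at some interval (or immediately, if the data demands it).
    public void FlushTo(ILoginRepository repository)
    {
        var batch = new List<LoginEvent>();
        while (_pendingForDb.TryDequeue(out var loginEvent)) batch.Add(loginEvent);
        if (batch.Count > 0) repository.SaveRange(batch);
    }
}
```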
Case 7
Client-side caching: the client makes calls to the server (e.g. APIs) to get data.
Solution
Since the client is another layer, we can keep data on this layer too. If our client is the browser, we can use HTML5-based storage (e.g. localStorage, sessionStorage, IndexedDB) or simply keep data in JavaScript variables (which suits single-page applications where full page refreshes are rare). If the client is not a browser, we can store data in a local DB, in files, or in application memory. To avoid a dirty state altogether, the client can ask the server whether any new data is available or whether its local copy is still fine: if there is new data, it comes back in the response; if not, the client keeps using what it already has. Since the data lives on the client side, the client may be able to tamper with it, depending on the storage approach; encryption or encoding can make the data harder to play with. We should not keep confidential information on the client side, and never confidential information belonging to other users.
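For a non-browser client, a common way to implement the "check before re-downloading" idea is an HTTP conditional request with an ETag. The sketch below is illustrative only: the endpoint URL is a placeholder, and it assumes the server emits an ETag header and answers 304 Not Modified when nothing has changed.

```csharp
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

// Keeps the last response in memory and asks the server whether it has changed,
// using a conditional request (If-None-Match) before re-downloading anything.
public class CountryListClient
{
    private static readonly HttpClient Http = new HttpClient();
    private string _cachedJson;
    private string _etag;

    public async Task<string> GetCountriesAsync()
    {
        var request = new HttpRequestMessage(HttpMethod.Get, "https://example.com/api/countries");
        if (_etag != null)
        {
            request.Headers.TryAddWithoutValidation("If-None-Match", _etag);
        }

        using (var response = await Http.SendAsync(request))
        {
            if (response.StatusCode == HttpStatusCode.NotModified)
            {
                return _cachedJson; // nothing new on the server, keep using the local copy
            }

            _cachedJson = await response.Content.ReadAsStringAsync(); // new data arrived
            _etag = response.Headers.ETag?.ToString();
            return _cachedJson;
        }
    }
}
```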
Case 8
The browser also caches resources (e.g. images, JS, CSS). When a page uses a resource (e.g. an image), the browser checks whether it already has a cached version of that resource before sending a request to the server. It matches the full URL against the cached URL, so for the browser https://learninginurdu.pk and https://learninginurdu.pk?v are different resources. There is also an "expiration" timestamp involved. We can press Ctrl + F5 to ask the browser to reload resources instead of using the cached copies. ASP.NET's Bundling & Minification feature generates a unique URL whenever a file changes: as long as a JS file stays the same, the page keeps emitting the same URL, but once the file changes, a new unique URL is emitted with a new token appended at the end. In ASP.NET Core, we can use the asp-append-version tag helper attribute to add a version number to the URL when the file is modified.
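The idea behind those unique URLs can be sketched in a few lines: derive a version token from the file's content and append it as a query string, so the URL (and therefore the browser's cache key) changes only when the file changes. This is a simplified illustration of the concept, not the actual ASP.NET implementation, and the paths in the usage comment are placeholders.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Appends a content-based version token to a URL so the browser's cached copy
// is bypassed only when the file itself has changed.
public static class StaticFileVersion
{
    // Example: WithVersion("/js/site.js", "wwwroot/js/site.js") -> "/js/site.js?v=Xy2Zk..."
    public static string WithVersion(string url, string physicalPath)
    {
        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(physicalPath))
        {
            string token = Convert.ToBase64String(sha.ComputeHash(stream))
                                  .TrimEnd('=')
                                  .Replace('+', '-')
                                  .Replace('/', '_');
            return url + "?v=" + token;
        }
    }
}
```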
As a developer, if we don't want our website's content cached while we are working on it during development, we can disable the cache in Chrome DevTools: open the Network tab and check "Disable cache (while DevTools is open)". While DevTools is open, the browser will not use cached resources and will fetch fresh resources from the server every time.
Caching increases performance, which is why it is so widely used across layers, but it brings the dirty-data problem whenever the source and the cached copy diverge. Almost every tool or application uses caching directly or indirectly. Caching is a developer's best friend when used wisely; otherwise it becomes the worst enemy.