Introduction
Modern applications consume large volumes of data and generate a considerable amount of their own. Because that data is a source of insight for the business, the Data Architecture has to be planned and built carefully so that you benefit the most from it. A modern application can be a web application, a desktop application, a mobile application, an IoT application, or a chatbot, and the list goes on. Each of these application types consumes and generates data in a different format. The right storage strategy depends on how we plan to consume the data.
In this article, let us discuss the following mechanisms to store data and the factors influencing our decision.
- Database
- Data Warehouse
- Data Lake
Database
A database is best suited to a bounded, operational volume of data. A single application or several applications in your enterprise portfolio can store their data in it. You can keep historical data in a database, but only up to a point: if you keep ingesting data for years and retain all of the history, the growing volume eventually takes a toll on the performance of your application. A database is usually designed around a single format of data. You store relational data, or document data, or graph data, or some other type, and the database is tightly coupled to that format. There are some exceptions, such as multi-model databases.
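The coupling between a database and its data format can be illustrated with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for a relational database; the customers table and the sample records are hypothetical and exist only for illustration.

```python
import sqlite3

# In-memory relational database with a fixed schema (hypothetical table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# A record that matches the schema is accepted.
conn.execute("INSERT INTO customers (id, name) VALUES (?, ?)", (1, "alice"))

# A document-style record with an undeclared field does not fit the schema.
document = {"id": 2, "name": "bob", "tags": "vip"}
columns = ", ".join(document)
placeholders = ", ".join("?" for _ in document)
try:
    conn.execute(
        f"INSERT INTO customers ({columns}) VALUES ({placeholders})",
        tuple(document.values()),
    )
    rejected = False
except sqlite3.Error:
    # The table has no "tags" column, so the insert is refused.
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
```

Only the record that matches the declared columns is stored. A document database would accept the second record as-is, which is the kind of trade-off that drives the choice of data model.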
Data Warehouse
A Data Warehouse can accept data that arrives in multiple formats and can store massive amounts of it. It is best suited to retaining historical data over the long term. From that historical data you can derive insights and use them to make crucial business decisions. The Data Warehouse works on the principle of Extract-Transform-Load (ETL). Multiple applications generate data in multiple formats. The data is first extracted from these sources, then transformed into a structured format that the Data Warehouse understands, and finally loaded into the Data Warehouse. Multiple applications can then query and run analytics on the structured data stored there.
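The ETL flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production pipeline: the two sample exports, the orders schema, and the function names are hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import csv
import io
import json
import sqlite3

# Hypothetical raw exports from two applications, in two different formats.
CSV_ORDERS = "order_id,customer,amount\n1,alice,19.99\n2,bob,5.00\n"
JSON_ORDERS = '[{"id": 3, "user": {"name": "carol"}, "total": "42.50"}]'


def extract():
    """Extract: read the raw records from each source as-is."""
    csv_rows = list(csv.DictReader(io.StringIO(CSV_ORDERS)))
    json_rows = json.loads(JSON_ORDERS)
    return csv_rows, json_rows


def transform(csv_rows, json_rows):
    """Transform: map both formats onto one (order_id, customer, amount) schema."""
    rows = [(int(r["order_id"]), r["customer"], float(r["amount"])) for r in csv_rows]
    rows += [(int(r["id"]), r["user"]["name"], float(r["total"])) for r in json_rows]
    return rows


def load(conn, rows):
    """Load: write the already-structured rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
load(conn, transform(*extract()))

# Analytics run against the structured data only.
order_count, revenue = conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone()
```

The point of the sketch is the ordering: the data is forced into the warehouse schema before it is stored, so anything the transform step does not map is never loaded.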
However, the challenge with a traditional Data Warehouse is that it is usually scaled vertically. Because the data is relational, records depend strongly on one another, and this often makes it hard to scale horizontally: tightly related, structured data is difficult to distribute across multiple servers. You can, however, replicate the data across servers. Note also that although a Data Warehouse can ingest data in multiple formats, it stores data in only one particular format or structure.
Modern applications generate data in a wide variety of formats. We need a data store that can keep all of these formats as-is and that scales horizontally to handle the huge volume of data. Like a Database, a Data Warehouse has a limit on how much data it can handle, although it can store far more than a Database can.
Data Lake
A Data Lake can accept and store multiple formats of data. It addresses the shortfalls of the Database and the Data Warehouse: it can scale horizontally and can store more data than either of them. It works on the principle of Extract-Load-Transform (ELT). Data in a variety of formats is extracted from multiple sources and loaded into the Data Lake as-is. When the data is needed, it is read from the Data Lake and transformed. Once it has been transformed into a structured format, it is either consumed by an application or loaded back into the Data Lake. The Data Lake does not restrict the format of the data it stores, and because it scales horizontally, it can grow as much as needed and retain a huge amount of historical data.
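The ELT flow can be sketched in the same spirit. This is only an illustration under stated assumptions: a temporary local directory stands in for the Data Lake, and the raw/curated folder layout, file names, and sensor payloads are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_raw(lake_dir, name, payload):
    """Load: store the payload bytes untouched, whatever their format."""
    path = Path(lake_dir) / "raw" / name
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(payload)
    return path


def transform_on_read(lake_dir):
    """Transform: parse only the raw files this consumer understands."""
    readings = []
    for path in sorted((Path(lake_dir) / "raw").iterdir()):
        if path.suffix == ".json":
            doc = json.loads(path.read_text())
            readings.append((doc["sensor"], float(doc["temp_c"])))
        elif path.suffix == ".csv":
            for line in path.read_text().splitlines()[1:]:  # skip the header row
                sensor, temp = line.split(",")
                readings.append((sensor, float(temp)))
        # Any other format (images, logs, ...) stays in the lake untouched.
    return readings


def write_curated(lake_dir, readings):
    """Load the structured result back into the lake for other consumers."""
    curated = Path(lake_dir) / "curated" / "readings.json"
    curated.parent.mkdir(parents=True, exist_ok=True)
    curated.write_text(json.dumps(readings))
    return curated


lake = tempfile.mkdtemp()  # stand-in for a real Data Lake
load_raw(lake, "a.json", b'{"sensor": "s1", "temp_c": "21.5"}')
load_raw(lake, "b.csv", b"sensor,temp_c\ns2,19.0\ns3,22.25\n")
load_raw(lake, "c.png", b"\x89PNG...")  # placeholder bytes, stored but not parsed

readings = transform_on_read(lake)
curated_path = write_curated(lake, readings)
```

Unlike the ETL case, nothing is rejected or reshaped at load time. The transform runs later, per consumer, and the raw files remain available for other uses.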
Conclusion
To sum up, if you are dealing with a finite amount of data, you should choose a Database. If you need to retain historical data or run analytics on top of a huge amount of data, you should go for a Data Warehouse. And if you need to store multiple kinds of data as-is and then transform and consume the data later, you can go for a Data Lake. A Data Lake is well suited to IoT and Machine Learning applications, because it can scale horizontally and store far more data than a Data Warehouse.
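The guidance above can be condensed into a small decision helper. This is a deliberately simplified sketch: the function name, its two boolean inputs, and the order of the checks are assumptions made for illustration, and a real decision would weigh more factors such as cost, latency, and existing tooling.

```python
def choose_store(mixed_raw_formats, needs_history_or_analytics):
    """Return the store suggested by the rules of thumb in this article.

    mixed_raw_formats: data arrives in many formats and must be kept as-is.
    needs_history_or_analytics: long-term history or analytics on large volumes.
    """
    if mixed_raw_formats:
        return "Data Lake"
    if needs_history_or_analytics:
        return "Data Warehouse"
    return "Database"


# Example: an IoT workload that keeps raw sensor payloads for later processing.
iot_choice = choose_store(mixed_raw_formats=True, needs_history_or_analytics=True)
```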