Introduction
As your application grows, a large single database structure would be severely hit by performance degradation. The growing database demands scaling, which could be usually done in two ways -
- vertical scaling (scaling up) or
- horizontal scaling (scaling out).
The option of scaling up a database server (adding more computing resources in terms of more RAM/CPU/Disk space) is ultimately limited as the database grows significantly. The other option is to scale out, or rather a horizontal scaling, which involves adding more machines to the existing stack, which in turn distributes the load and adds more processing power.
In this article, we will discuss one such database architecture pattern, known as Database Sharding.
What Is Database Sharding?
Database sharding is the process of separating your database tables horizontally into different portions, known as shards. A shard is usually a horizontal partition compared to a vertical partition. Horizontal partitioning is the practice of splitting rows of a table into multiple shards/partitions. This is in contrast to vertical partitional, where the columns in the table are separated out and put into different partitions. In horizontal partitioning, each partition contains the same schema and columns, but different rows.
Database sharding divides the database into smaller shards (logical shards) and stores them in separate database servers (physical shards). Each of the physical shards could contain multiple logical shards. Collectively, data in all the shards represent the entire dataset.
Note that it is not mandatory to store logical shards in a different physical server. They may coexist on the same machine (coresident) or could be distributed across different machines. The motivation for coexisting shards is to reduce the size of individual indexes.
Types of Sharding Architecture
In this section, we will look into some of the common sharding architectures.
Key Based or Hash-based Sharding
In Key based Sharding, a key (field/fields) of the entity is passed through a hashing function to result in a hash value. The hashing value determines which shard the data should go to. In other words, the output of the hashing function determines the shard Id. A carefully chosen hashing function could ensure there are no database hotspots.
It needs to be noted that the shard keys (values of a column that is fed to hashing function) should not change over time. The main advantage of Key based sharding is that it depends on a hashing function and does not need a _mapping table or index that needs to be maintained. However, this could turn out to be a disadvantage as well, when you need to decide to add a new server, and now your existing data could have data that according to the hashing function, would need to be mapped to the new server.
Horizontal or Range Based Sharding
Range-based sharding involves sharding based on the range of value of the shard key. For example, if the shard key is the first character of the username, users with usernames starting with A-D could be in the first shard, E-G in the second shard, and so on.
While it is easier to implement, Range based sharding could lead to hot spots where some shards get significantly higher data than others.
Directory Based Sharding
Directory-based sharding maintains a lookup table providing a one-to-one mapping between data and shard. Each possible value for the shard key has an entry in the lookup table, which is mapped to different shards. Multiple shard keys could be mapped to the same shard.
The lookup table makes the directory-based sharding quite flexible. However, this again could also turn out to be the single point of failure. If the lookup table is corrupted or fails, it would affect the entire database collection. Furthermore, it would also have an additional step to query the lookup table for each query, which can have its own performance impact even though it could be insignificant compared to the gains.
Advantages of Sharding
As the database grows in your application, the option of scaling vertically is the easiest. You can add more computing resources in terms of storage and RAM. However, this soon hits a limitation. Sharding or horizontal scaling adds more machines to the stack and distributes the database.
Distributing the database also serves to speed up queries. As the database grows, the queries need to check every row. By sharding, we are limiting the size of individual databases and hence each individual database has a very limited set of data to query. Additionally, unlike monolith databases (assuming there are no backup servers), where a failure might result in the entire application stalling, a sharded database would mean the failure is limited to only certain parts of the application.
Disadvantages of Sharding
While sharding provides its own advantages, it comes with a set of drawbacks as well. The sheer complexity of implementing the sharding database can turn out to be cumbersome. The application would need to understand that the data might be in different databases and needs to have the logic to send the queries to the right database. Another problem that is likely to happen is certain shards might turn out to be significantly larger than others. This is more often seen in Range based sharding and would result in what is known as database hotspots. This would cancel out any benefits of sharding.
Furthermore, if you, for any reason, decide to return to unsharded architecture, this would not be easily possible and would require a lot of effort to rebuild the original unsharded version.