Distributing database data

Learning objectives

You know ways how database data is distributed.

There are two primary ways to distribute database data over multiple servers. The first is data replication, where the entire data is replicated over two or more servers, both maintaining copies of the data. The second is data sharding, where data is divided into shards that are stored in two or more servers. In this section, we will discuss both approaches.

Data replication

In data replication, the database data is replicated over a number of servers, each maintaining a copy of the data. This can improve availability and fault tolerance, as a failure of one server does not result in the loss of data. However, it also requires that the data is updated between the servers to maintain consistency.

When using data replication, effort to maintain data consistency is required.

There are two key approaches to updating data between servers: primary-secondary replication and primary-primary replication. In primary-secondary replication, one server is the primary, which is responsible for writing data, and the other servers are secondary, which are responsible for reading data. In primary-primary replication, all servers are primaries, and all servers are responsible for both writing and reading data.

The Figure 1 highlights primary-secondary replication, where one of the database replicas are used for reading content, while the other is used for writing content. When content is written to the primary database, it is then replicated to the secondary database(s).

Primary-secondary replication. Some of the databases are dedicated for writing operations, while some of the databases are dedicated for reading operations.

In the case of a database failure in the primary-secondary replication model, the failure of a secondary database does not affect the availability of the system, as the primary database is still available and it can be replicated to create new secondary databases. However, the failure of the primary database can result in the loss of data; in such a case, one of the the secondary databases are promoted as the primary database.

When considering the primary-secondary replication, the primary database is responsible for making sure that the content is written to the secondary database(s) in the correct order. However, in the case of primary-primary replication, where data can be written to any server, the servers need to coordinate with each other to ensure that the data is written in the correct order. This is more complex, as it requires more data to be updated between servers.

Data sharding

In data sharding, the database data is divided into shards that are stored in two or more servers, each maintaining its own data shard. As each server is responsible for its own shard, there is consistency in place. Sharding can improve performance, as queries requiring data from a single server and shard can be faster when contrasted to a single server with all the data. At the same time, queries requiring data from multiple shards can be slower, as data needs to be retrieved from multiple servers.

When using data sharding, additional effort is required to maintain fault tolerance. If a server responsible for a shard goes down, that shard is not available.

Data sharding also needs a mechanism to identify which shard to use for a given query. This is achieved, for example, by using a hash function or a distributing data to database based on a range of values. Distributing data to databases based on a range of values is highlighted in Figure 2.

Data sharding by distributing database data based on a range of values.

One of the challenges with data sharding is the possibility for imbalanced situations, where some servers are overloaded while others are underloaded. This is similar to the problems related to finding meaningful hash keys for e.g. hash tables. The problem can be mitigated by using a consistent hashing mechanism, where the hash function is used to determine which shard to use for a given query. The consistent hashing mechanism ensures that data is distributed evenly across the shards, and that the data is distributed in a way that minimizes the number of shards that need to be updated when a shard is added or removed.

Data partitioning

Note that data sharding is not the same as data partitioning, where data in a database is divided into parts for performance reasons. Data partitioning typically refers to dividing data into parts in a single database server, while data sharding refers to distributing the data over multiple servers.

Replication and sharding

Note that using data replication does not necessarily mean that the data is not sharded. In practice, with large database applications, it is common to use both data replication and data sharding. In such a case, the data is first divided into shards, and then each shard is replicated over two or more servers. This can improve availability and fault tolerance, as a failure of one server does not result in the loss of data. However, it also requires that the data is updated between the servers to maintain consistency. Figure 3 highlights the use of both data replication and data sharding.

Using both data replication and data sharding.

As an example, Google's AlloyDB and Cloud Spanner use both data replication and data sharding.

How to update distributed data?

The concrete protocols for updating distributed data are out of the scope of the present course. In practice, for updating distributed data, consensus protocols are used to ensure that the data is updated in the correct order. The consensus protocols are often based on the Paxos algorithm.