Azure Cosmos DB Replication

February 22, 2021

While learning about Cosmos DB I had a lot of misunderstandings around consistency levels. And, it's not surprising. Many people, certainly those coming from a SQL Server background like I did, have these misunderstandings. But, before I can jump into Cosmos DB consistency levels (covered in another post) I have to cover replication. This post is about the intricate details of replication that I had to wrap my head around for consistency levels to make sense. Although learning consistency levels does not require understanding replication first, it was helpful for me when developing use case scenarios.

With SQL Server it's understood that data resides in databases that can be spread across File Groups. Those File Groups could simply be different files in the same folder, different folders, and even different drives. When it comes to replicated instances, the data could be spread across servers and even data centers in different states. But, Cosmos DB is very different from SQL Server. Not only is Cosmos DB not a relational database like SQL Server, but there isn't a file structure to worry about.

Cosmos DB has, within each Azure region, 4 replicas that make up a "replica set". One is a "Leader" and another a "Forwarder". The other 2 are "Followers". The Forwarder is a Follower but has the additional responsibility to send data to other regions. As data is received into that region, it is written to every replica. For a Write to be considered "committed" a quorum of replicas must agree they have the data. A quorum, as it pertains to the replicas in each region, is 3 out of the 4. This means that regardless of which region is receiving the data, the Leader replica and 2 others must signal they have received the data. This is also true regardless of the consistency level used on the Write operation.

A code's client connection does not have to worry about replica count, if quorum has been met, or which replica does not yet have the data. The Cosmos DB system manages all of that. For our code, the Insert/Update/Delete operation has either succeeded or not.

Global Replication

Cosmos DB has a simple way of enabling global replication. Using the Azure Portal you can select 1 or more of the many data centers available all over the world. In a matter of minutes, another Cosmos DB instance is available with your data. For the discussion in this post I'm only going to cover Single Master. But, Multi-Master, also known as Multi-Write Enabled, is available. *Just a note on that though, once enabled you cannot turn it back off except with an Azure Support Ticket.

Data stored in Containers are split across Replica Sets by the Partition Key you provide when creating the Container. And, each Replica in the Replica Set contains only your data. The Replica is not shared with other Azure customers. As the amount of data grows, the Cosmos system also manages the partitioning and replication to other regions as needed. So, not only is Cosmos extremely fast but the sizing and replication is automatically handled for us.

With data in 2 Replica Sets, for example, in each region you enabled has an exact copy. Looking at a Replica Set in one region and the matching one in another region, this is known as a Partition Set. The Partition Sets are what manage the replication between regions for their respective Replica Sets.

Replication latency only pertains to global replication. The time to replicate data inside a Replica Set is so fast, it's not part of the latency concerns. However, from one region to another there is some latency. Given the distance data must travel there is inevitable delays. Microsoft, at least in the United States, has a private backbone for region to region networking. This has an effect with your applications if using the Strong consistency level.

The image above depicts that replication from the primary region to the other regions may have many miles to travel. The latency is between the regions. The only latency the client connection will notice is that of the replication to the furthest region away. This is because the replication is concurrent and not sequential.

With all consistency levels, except Strong, once data has hit quorum "committed" then the client connection is notified. At the same time, the data is replicated to the other regions as enabled. With Strong, that quorum is a little different. In this case a "Global Majority" has be met. With 2 regions, this means 6 of the 8 replicas must agree on the data. With 3 regions, at least 2 regions must agree. With 4 regions, at least 3 regions must agree. Basically, once using 3+ regions, N - 1 regions is the quorum. Again, this only applies to the Strong consistency level.