Thursday, April 26, 2018

Attainability of the Lower Bound of the Processing Time of Highly Available Distributed Transactions


Recently I've read another article from the series: "we are better than a two-phase commit". Here I will not analyze the contents of this article (although I'm thinking about giving a detailed analysis). The task of my opus is to offer the most effective version of the distributed commit in terms of number of round trips. Of course, such a commit comes at a price. However, the goal is to assess and show that the two-phase commit is not a drag, as many believe.

It should also be noted that there will be no full-scale experiments and fake comparisons. Algorithms and theoretical analysis will be given simply. If desired, you can independently implement and test it in practice. Of course, it would be much better that this was presented in the current article, but it all depends on free time and motivation. In my opinion, it's more important to describe algorithms than to present charts, because almost everyone can draw the charts based on algorithms while the opposite is not true.

After such an introduction, let's begin.

Two-Phase Commit

Definition. RTT is the time of message back and forth.
Definition. Hop is the time of single shot.

Theorem. 1 RTT is equal to two hops.
Proof. It is obvious.

Definition. Distributed commit is the process of making atomic changes between at least two distributed participants.

Definition. A two-phase commit is a commit consisting of two phases. The first phase is an atomic operation to verify the possibility of initiating a transaction and blocking the participants to perform the commit. The second phase is the collection of responses from participants and the further transaction processing with the releasing of the locks.

Theorem. A two-phase distributed commit can not be made faster than 1 RTT.
Proof. To perform a two-phase commit, it is necessary, as a minimum, to send a request from the client to all participants and receive a response on completion. This requires 2 hops or 1 RTT.

Definition. A highly available transaction commit is a commit that continues execution even if one or more involved participants are failed.

Here we assume fail-stop model for simplicity. The algorithm described below however can be easily generalized to cover other models.

Theorem. A two-phase highly available distributed commit for 1 RTT is possible.

To prove this theorem, it is sufficient to provide a method and conditions when it is possible. It is clear that this is not always possible, because in case of concurrent access of the same resource, the involved transactions should be serialized on accessing the resource. Thuse they will be executed sequentially. In this case, talking about 1 RTT will be somewhat funny. Nevertheless, even the usual algorithms provide under good conditions timings much more than 1 RTT.

The remainder of this article will be devoted to the proof of this theorem.


Consider the classical scheme of a two-phase commit with the Transaction Coordinator.

Definition. Transaction Coordinator coordinates the distributed transaction making final decision to commit or abort the transaction based on the responses from participants.

The sequence is as follows:

1st hop. The client sends the request to the Transaction Coordinator.
2nd hop. The Transaction Coordinator sends the request to the participants to prepare for the transaction: the 1st phase.
3rd hop. Participants successfully performed the preparation and send a response that they are ready to execute transactions.
4th hop. The Transaction Coordinator sends a message to all participants about the execution of the transaction: the 2nd phase.
5th hop. Participants send back the success of the transaction execution to the coordinator.
6th hop. The coordinator responds to the client.

Total 3 RTT.

Now we add fault tolerance to achieve high availability. We will assume that the coordinator and the participants belong to the corresponding consensus groups. We will also assume favorable conditions, i.e. we have stable leaders of the groups and the consensus terminates. Let us prove the lemma:

Lemma. Distributed consensus based on the leader can not be done faster than 1 RTT.
Proof. To achieve consensus, the request should be directed to the leader. Wherein:

1st hop. The leader sends the request to the other participants of the consensus, usually known as followers.
2nd hop. Participants send confirmation to the leader.

Without these hops consensus is impossible.

Lemma. 1 RTT consensus is possible.
Proof: Consider Raft algorithm. In the case of a stable leader and the presence of majority of consensus participants, an agreement on the leader takes place after receiving responses from the participants, i.e. after 1 RTT.

It is worth noting that after this the system guarantees that this agreement will remain in the system, even though the agreement at this point has not yet reached by other participants. If leader fails, a failover occurs where the new leader is responsible to reconcile these changes. However, this is not the subject of the lemma; we are considering a potential opportunity, i.e. some ideal conditions that can lead to the desired result - the achievement of the consensus. Why do not we consider all possible conditions? The reason that there is a theorem that consensus in an asynchronous system is impossible. Therefore, it is important to understand what is the minimum possible delay in the most favorable situations without violating the correctness of the algorithm, which is required to maintain its invariants in the event of violation of these favorable conditions at any stage. Two of these lemmas give an exhaustive answer, which suggests that the minimum possible time to achieve the distributed agreement is attainable.

This theorem can be generalized, proving that it is impossible to reach a consensus faster than 1 RTT, throwing out the condition of having a stable leader. However, this is beyond the scope of this article (the idea can be taken from article "Latency of Geo-Distributed Databases"). The idea of ​​the proof is to consider the spreading of knowledge about other participants in the system and having of a corresponding message: using 1 hop you can only send the data, but do not know whether they have received and what the state recipient had.

So, for fault tolerance, let’s consider a consensus with 1 RTT and add it to our two-phase commit:

1st hop. The client sends a request to the leader of the coordinator.
2nd and 3rd hop. The coordinator leader coordinates the beginning of the transaction.
4th hop. The Transaction Coordinator sends a request to the leaders of the participants: the 1st phase.
5th and 6th hop. Participants successfully prepares with the preservation of the decision in their consensus groups.
7th hop. Leaders of participants send the answer that they are ready to execute transaction.
8th and 9th hop. The coordinator's leader performs consensus agreement.
10th hop. The leader of the coordinator sends out a message to all the participants' leaders about the execution of the transaction: the 2nd phase.
11th and 12th hop. Leaders agree on the commit and apply the changes.
13th hop. Participants send the success to the coordinator's leader.
14th hop. The coordinator responds to the client.

Total 7 RTT. Not bad. Fault tolerance costs "only" 4 RTT. The reason is due to the fact that the coordinator and participants consistently come to their own consensus 2 times.

In the above scheme, you can see some non-optimality. Let's fix it.

Commit Optimization

The first obvious optimization is sending a response to the client immediately after collecting the responses of successful preparation from the participants. Because these responses are fault-tolerant, then the participants will never forget about them, which means that the transaction will sooner or later be executed even if the nodes fail, the leader crashes, etc. However, there is one slippery moment.

In fact the coordinator makes the final decision on whether to commit the final transaction or not. Meaning that even if all participants returned OK, but some participant blunted because of, for example, a leader election, then the coordinator can roll back the transaction. And if so, then you can remove only 10-13th hops, but not 8th and 9th. But that’s not bad either, since we have a decrease by 2 RTT, i.e. 5 RTT instead of 7.

At the same time, 10-13 hopes do not disappear anywhere, just the client does not need to wait for them. The coordinator and participants will finish their processing in parallel with the client. And the client will receive his confirmation a little earlier. The commit will be performed in the system, just a little later. Here we use the magic of asynchrony, consensus and the inability to prove to the external participant that we have slightly cheated and cut the corner. If the client suddenly wants to immediately read the data that we just completed and go directly to a participant, it will wait for the lock (if it was not removed by that time by the 2nd phase), and this request will hang until it is released . However, within the framework of our theoretical research this fact is absolutely not important, because we prepare ideal conditions for ourselves. And in the case of nonideal conditions, as already mentioned above, we will wait for several eternities (since consensus will require eternity, but we need to hold them several, and sequentially).

The next move is a bit more complicated and elegant.

Let's consider the very beginning of the transaction. There the client sends a request to the coordinator and then it initiates a two-phase commit by sending requests to the other participants. There is simple idea to execute such requests simultaneously, i.e. send the request to both the coordinator and the participants in parallel. On this way we can be trapped.

The matter is that the client is not a fault-tolerant entity, i.e. it can fall. Imagine that it sent a request to the participants, they took a lock and waited, and the request to the coordinator for some reason did not reach and the client feil. Thus, there is no one to start a two-phase commit and there is no one to roll it back in case of conflicts / problems and so on. Participants will permanently block records and no one will help them. Therefore, such optimization is incorrect. Participants have the right to commit only after the decision of the coordinator, who is responsible for the transaction and rolls it back if necessary.

To go further, we need to take a completely different look at the problem. And for this we begin, oddly enough, with consensus.

Consensus Optimization

It seems that there is nothing to do. After all, Raft achieves the minimum possible execution time - 1 RTT. However, it can be done faster - for 0 RTT.

Let’s recall that in addition to the consensus itself, another 1 RTT is required to send a request from the client to the leader and receive a response. So for a remote consensus group, 2 RTT is required for this case, which we see in the two-phase commit on 2 examples: sending and committing to the coordinator, sending and committing to the participants. A total of 4 RTTs at once, and another 1 RTT - to the second phase commit on the coordinator.

It is clear that a leader-based consensus for a remote client can not be faster than 2 RTTs. In fact, at first we need to deliver a message to the leader, and then the leader must execute the agreement by sending the message to the participants of the consensus group and get an response from them. There is no options.

The only option is to get rid of the weak entity - the leader itself. Indeed, not only all records must pass through it, additionally in case of its fail the group becomes inaccessible for a relatively long time. The leader of consensus is the weakest part, and the the leader election is the most fragile and nontrivial part of the consensus. So you just need to get rid of it.

Definition. Message broadcast is the sending of the same message to all the participants of the group.

To do this, let's take well known in inner circles masterless consensus. The main idea is to achieve the same state on the participants. To do this, it is sufficient to make 2 broadcasts, i.e. just 1 RTT. The first broadcast to the participants can make the client itself. Response on broadcast from participants can be sent to the client. If the client receives the same state (and he receives this in the case, for example, of the absence of concurrent requests), then it will be able to understand, on the analysis of the content of the broadcast responses, that its request will be executed sooner or later. In fact, using described algorithm, all participants in the consensus, including the client, simultaneously realize that the agreement has happened. And this will happen after 2 broadcasts, i.e. 1 RTT. Because the client still has to spend 1 RTT on sending the message to the group and receiving the answer, then we have a paradoxical conclusion that the consensus was performed at 0 RTT effectively.


To go further, we will use the powerful analysis tool - analogy. Let's return to Raft algorithm. What is happening there? It consists of two phases:

1st Phase: The leader sends a request to the participants and is waiting for a response.
2nd Phase: After the response, the leader enters into the agreement individually and sends it to the participants of the system.

Does not it look like anything? That's right, this is a two-phase commit, only with some clauses:

  1. The Raft algorithm does not wait for a response from all participants. In a two-phase commit for a successful transaction, you must wait for a successful response from all participants.
  2. The participants in the Raft algorithm can not say notOK. More theoretically, it can do so (for example, out of disk space), but this notOK will be analogous to the lack of response. In a two-phase commit, everything is stricter: if at least one of the participants make notOK decision, then the entire transaction should be aborted and rolled back. This is the very essence of two-phase commit: first we ask for the agreement of everyone, and only after the unanimous agreement we apply the changes. Consensus in this sense is more democratic, because requires majority agreement.

At the same time, they have in common that there is a dedicated decision driver (leader or coordinator), and there are two phases - preliminary, and final.

Accordingly, all we need is to refuse the coordinator in the two-phase commit, i.e. do exactly the same thing that we did for consensus, giving up the leader.

Let's forget about fault tolerance for a while and see how the commit looks in this case.


Definition. A two-phase commit without a coordinator consists of 2 phases:

  1. All participants send their decision to all other participants: OK or notOK.
  2. Each participant after receiving the OK from everyone commit changes or rolls back them if at least one responds to notOK.

After that, for reliability, each participant can broadcast to everyone else information that a commit has occurred and you can remove the locks, but this is not necessary.

Why did the coordinator suddenly become unnecessary? The fact is that the coordinator followed the transactional process, including whether the nodes are alive. So in case of problems with participants, the coordinator rolled back the transaction. The problem was only in the coordinator itself, because it could not look after itself. Therefore, often a two-phase commit is called blocking commit.

Definition. Self-coordinating transactions are transactions that do not require a dedicated coordinator.

However, by adding fault tolerance, the role of the coordinator becomes unnecessary. Every participant that is represented by consensus group can stand up for itself. Thus, we come to self-coordinating transactions without the need for a dedicated coordinator. An important difference from the usual two-phase commit with the coordinator is that the coordinator can at any time decide to roll back the transaction, even if all the participants gave a positive response. In self-coordinated transactions such nondeterministic behavior is unacceptable, because each participant makes a decision based on the responses of other participants and this decision should be the same.

Theorem. Self-coordinating transactions produce strict consistency (linearizability + serializability).
Proof. Actually, the proof is based on the simple fact that the two-phase commit also provides such a guarantee. Indeed, in a scheme without a coordinator, each participant is itself a coordinator; there is a two-phase commit as if it is the only one. This means that it preserves all the invariants of the two-phase commit. This is easy to verify, if we recall that each participant broadcasts responses to everyone else. So everyone receives OK responses from all the others, acting as a coordinator for performing the transaction commit.

Let's describe the minimum number of hops under a favorable conditions:

1st hop. The client sends a message to all participants in the transaction.
2nd hop. All participants send a reply to the client and to each other.

After the 2nd hop, the client has all the necessary information to make a decision about the commit. This requires only 1 RTT.

Fault Tolerance and Availability

An attentive reader may ask: what to do in case of a client failure? After all, if the participants in the system can be made fault-tolerant, then to the client we can not make such requirement, i.e. it can fail at any moment. It is clear that after the client sends requests to all participants of the system, the distributed commit can be completed without the client. And what if the client managed to send only to some of them and then fail safely?

In this case, we oblige the client to do the following: the client must forward to each participant information about all other participants in our transaction. Thus, each participant knows all the other participants and sends them the result. In this case, any participant, if it did not receive a request from the client, can choose one of the following behaviors:

  1. Immediately reply that it does not accept the transaction, i.e. sends notOK. In this case, the locks are rolled back. The participant at the same time, as always, broadcasts its response to the other participants.
  2. If the request from the other participant contains all the necessary information for executing the transaction commit for this participant, then it is possible to make a decision about the successful locking of the corresponding records (1st phase) and send OK. To this end, the client must send to each participant of the transaction information about all other participants and all the necessary data for executing the distributed commit.

In any case, we get that all participants either get OK, or in the absence of the necessary information, someone responds notOK and the transaction rolls back. So in the event of a client failure, each participant is able either to complete the initiated transaction or to correctly roll back the client's actions.

It remains to make the participants of the distributed system to be fault-tolerant. To do this, we put them in the consensus of the group without a dedicated leader. So each participant will not be represented by a separate node, but a set of nodes in the consensus group.

The commit algorithm will look like this:

  1. The client sends its request to each node belonging to the transaction group of the transaction.
  2. Each node sends a reply to all other nodes and the client about the speculative execution of the first phase of the commit as if it were being executed at the current step of the consensus. In reality, we do not know whether this will actually happen or not, because if there are concurrent requests from other clients, the consensus can reorder the current unapplied actions.
  3. The client receives all requests from all nodes of all participants. If all nodes in the speculative execution responded OK and the consensus step was the same for each node from the group consensus, it means that the speculative execution of the first phase will actually happen and client is able making a decision about the commit.

In fact, the condition for obtaining a response from all nodes of each group is redundant. However, a more detailed consideration of the relaxing of this requirement is beyond the scope of this article.


Total we obtain 2 hops or 1 RTT. Given that the communication between the client and the server can not be removed, the effective processing time of the commit on the server side is zero, i.e. as if the server instantly processed a distributed, high-availability, fault-tolerant transaction and sent a response to the client.

Thus, we have an important theoretical and practical result: the lower bound of the execution time of the distributed fault-tolerant highly available commit is attainable.


Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson, 1983, Impossibility of distributed consensus with one faulty process

G. Demchenko, 2017, Latency of Geo-Distributed Databases

G. Demchenko, 2016, Masterless Consensus Algorithm

Jim Gray, Leslie Lamport, 2003, Consensus on Transaction Commit

No comments :

Post a Comment