Sunday, August 13, 2017

Latency of Geo-Distributed Databases

Theorem 0. The minimum guaranteed latency for the globally highly available strong consistency database is 133 ms.

1 Abstract

The article introduces step-by-step the formal vocabulary and auxiliary lemmas and theorems to prove the main theorem 0. Finally, as a consequence, the CAL theorem is formulated.

2 Introduction

Modern applications require intensive work with huge amount of data. It includes both massive transactional processing and analytical research. As an answer to the current demand, the new generation of databases appears: NewSQL databases. Those databases provide the following important characteristics: horizontal scalability, geo-availability, and strong consistency.

NewSQL era opens new possibilities to store and process so called Big Data. At the same time, the important question appears: "how fast the databases might be?". It is very challenging task to improve the performance and latency parameters because it involves almost all layers while building the databases: from hardware questions about data centers connectivity and availability to software sophisticated algorithms and architectural design.

Thus, we need to understand the degree of latency optimizations and corresponding limitations that we have to deal with. The article tries to find answers to that challenge.

3 Model

The globally distributed databases have different characteristics and assumptions based on the features and scalability properties they provide. I consider the following assumptions to specify the model to be used.

First of all, I consider that all participants belong to the sphere with some radius RR. Secondly, I do not need to restrict the failure mode because the final result does not depend on that. Thus without loss of generality, I consider the non-failure mode and put some consideration about real systems behavior under different failure conditions later.

The participants involved in the process are cooperated through the message exchanging. Each participant AA may send the information II to another participant BB by sending the message with needed information II. Any participant is represented as the point on the sphere and connected to another participant by some curve on the sphere.

3.1 Speed

Definition 3.1.1. The distance between two participants is the length of the curve between two points on the sphere.

Definition 3.1.2. The minimum distance between two participants is the minimum distance across all possible curves on the sphere that may connect two participants.

The message exchanging does not take place immediately and has time characteristics.

Definition 3.1.3. The information propagation speed between two participants AA and BB is l/tl/t where ll is the distance between participants and tt is the time required to send the information by AA and receive the information by BB.

The information has the limited propagation speed.

Definition 3.1.4. The speed of information propagation has the limit cc.

Thus the minimum time required to propagate the information from participant AA to participant BB is lm/cl_m/c where lml_m is the minimum distance between AA and BB.

Definition 3.1.5. The wave propagation between participants AA and BB is the information propagation with minimum distance and maximum speed cc.

Thus the wave propagation is spread on the sphere with speed cc.

Lemma 3.1.1. Information cannot propagate faster than wave propagation.

Proof. By definition of the wave propagation: it uses the largest speed cc and smallest distance dd; thus the ratio d/cd/c is minimum across all possibilities. \Box

3.2 Consistency

Strong consistent databases have a different meaning in different contexts. I consider linearizability property as strong consistency.

Definition 3.2.1. The operation is the message issued by the client to be committed.

The proof is based on the following linearizability characteristics:

  1. There is a global sequence of operations. Each operation is committed using some incremental index ii.
  2. To commit the operation on the ii index the system must have the information about the whole previous history, e.g. must know the information about all previous commits with indexes j<ij < i.
  3. The system may respond to the client with successful operation commit only if it commits the operation into the global sequence of operations.
  4. If the system commits the operation using index ii thus any participant that knows about committed operation for index ii has the same operation for that index. Thus each participant has the same information about prefix of the global sequence of operations.

Theorem 3.2.1. If there are two committed linearizable operations issued by the clients thus there is a participant that knew the information related to the both operations.

Proof. To commit linearizable operations the participant must know the history for the previous operations. Thus, in any case, it knows the current operation and previous one, thus it knows both. \Box

The theorem can be easily generalized for any number of committed operations.

Theorem 3.2.2. If there are several committed linearizable operations issued by the clients thus there is a participant that knew the information related to all operations.

Proof. The same as theorem 3.2.1. \Box

Theorem 3.2.3. If there are several committed linearizable operations issued by the clients thus there is a point on the sphere where all waves propagate from all clients to that point.

Proof by contradiction. If there is no such point thus there is no any participant that has all information related to operations due to lemma 3.1.1. It contradicts to theorem 3.2.2. \Box

Definition 3.2.2. The client commits the operation when it receives the message from a participant that has committed the operation.

Definition 3.2.3. The client commit latency is the time between issuing the operation and committing it by the client.

Definition 3.2.4. The guaranteed commit latency is tgt_g if clienttctg\forall client \Rightarrow t_c \leq t_g where tct_c is the client commit latency.

Definition 3.2.5. The wave half-time is the time needed to propagate from the participant to the diametrically opposite participant.

Definition 3.2.6. The wave full-time is doubled wave half-time.

Theorem 3.2.4. The minimum guaranteed commit latency is the wave half-time.

Proof. Let all clients be spread across the sphere. Let all clients issue the operation to be committed at the same time i.e. concurrently. To commit the sequence of operations according to theorem 3.2.3 there must be at least one point on the sphere where all waves propagate from all clients to that point pp. Let the opposite point be pop_o. At that point, according to our assumptions, we have a client that issues the operation. The commit latency according to lemma 3.1.1 is at least the wave half-time because wave half-time is needed to propagate from the pop_o to pp. \Box

Theorem 3.2.5. The wave half-time guaranteed commit latency is theoretically achievable.

Proof by example. Let all participants be spread across the sphere. In that case, any client issues the operation to the closest participant. Each participant broadcasts the operation to others and waits for the response from all participants, then commits the sequence of received operations. The commit latency is described as the largest wave propagation time between participants and equals to the wave half-time. \Box

3.3 Geo Split

The links are unreliable. Thus at any time there is a possibility when part of the system becomes unavailable for the rest of the system. It strictly affects the algorithms to handle such complex scenarios.

Definition 3.3.1. Group of participants is unavailable when the rest of participants cannot propagate the information to that group.

Definition 3.3.2. Split is the situation when there is an unavailable group.

Definition 3.3.3. The system is highly available when it can handle split.

Theorem 3.3.1. The guaranteed commit of the highly available system requires at least one round trip between available participants.

Proof by contradiction. Let us assume that half round trip is enough to commit. Consider the participant pp received the operation OO from the client to be committed. The half round trip allows to receive the operations from other participants and the participant pp must make a decision to commit the operations. If we assume that the participant pp belongs to unavailable group thus the operation of this participant may not be propagated to other participants. Thus we obtain the situation when the participant pp commits the operation OO while others cannot commit it due to pp unavailability. Thus that commit is inconsistent. \Box

The theorem provides the important clue about the latency. The totally available system requires only half round trip to commit the operations while the highly available system requires at least one round trip.

Definition 3.3.4. The commit group is the participants that make a decision to commit the operations.

Theorem 3.3.2. The minimum guaranteed commit latency for the highly available system is the wave full-time.

Proof. Let all clients be spread across the sphere. Let all clients issue the operation to be committed concurrently. Let GG be the commit group for the client operations. Let pcGp_c \in G be the participant from the group GG. Choose the opposite point oco_c on the sphere for the pcp_c. According to our assumption there is a client ccc_c in point oco_c. Let prGp_r \in G be the participant that sends the commit to the client ccc_c and from the client point of view, it receives the first commit response from the participant prp_r. There are two possibilities: either pcp_c and prp_r are the same or not. If pcp_c and prp_r are the same the proof is trivial: the latency is the doubled half wave time because ccc_c and pcp_c are opposite points. If pcp_c and prp_r are different participants thus according to theorem 3.3.1 the information about client ccc_c operation is propagated in the following way ccpcprccc_c \rightarrow p_c \rightarrow p_r \rightarrow c_c. Because pcp_c and ccc_c are opposite on the sphere thus the latency is at least the wave full-time. \Box

For leader-based consensus, the theorem becomes trivial because we can put the client on the opposite side of the leader and all operations must be propagated through the leader.

Theorem 3.3.3. For the highly available system, the wave full-time guaranteed commit latency is theoretically achievable.

Proof by example. Put three participants around any point on the sphere on equidistant positions with the distance ε\varepsilon. If ε0\varepsilon \rightarrow 0 thus the guaranteed latency goes to wave full-time because from any point the latency is the doubled of maximum latency between two points on the sphere. \Box

Now we can prove the main theorem 0.

Proof of theorem 0. The wave full-time of the Earth according to definitions 3.2.5 and 3.2.6 is equal to 2πRE/c2 \pi R_E / c where RER_E is Earth radius and cc is the speed of light. The result is 133 ms. \Box

4 Medium and Effectiveness

I have introduced the term theoretically achievable meaning the following:

  1. Latency is described only the wave propagation and does not have any packet loss or delay during transmissions/handling/routing etc.
  2. There is unlimited bandwidth for the links.
  3. Each client is connected with the participant.

The reality has another understanding related to the specified items. The common case is that information propagates slower than wave propagation. If we use the ratio n=c/vn = c / v where cc is the speed of wave and vv is the speed of information propagation thus nn is the medium refractive index [1] (BTW, do you know that particle speed can be greater than speed of light? See Cherenkov radiation [2] for details). Reciprocal of refractive index is the efficiency η\eta of the infrastructure. Thus it is the useful measure to compare infrastructure and the common criteria to determine overall efficiency.

The idea and intuition are very simple: the more transparent medium \Rightarrow the less refractive index nn and thus the higher efficiency η=1/n\eta = 1/n.

5 Optimization

Definition 5.1. Operation commit period is the time period between the time point of operation being issued and client commit point of time.

Definition 5.2. Two or more operations are not concurrent if their operation commit periods are not intersected.

Definition 5.3. Nonconcurrent period is the runtime situation when there are no concurrent operations.

It is important to emphasize that nonconcurrent period is the runtime situation based on the client's behavior and not a priory knowledge.

Definition 5.4. Nonconcurrent commit latency is the commit latency during the nonconcurrent period.

Theorem 5.1. The minimum guaranteed nonconcurrent commit latency is the wave half-time.

Proof by contradiction. Let two diametrically opposite clients issue the operations simultaneously. To detect concurrent execution there must be a participant pp that has information about both operations. Thus the minimum required time τ\tau to detect the concurrency is the wave propagation time from the client to the participant pp that is equal to the quarter of wave full-time. The total commit time is doubled of τ\tau because we need to add response time from the participant pp and thus it must be at least the wave half-time. \Box

Theorem 5.2. The guaranteed nonconcurrent commit latency more than the wave half-time is theoretically achievable.

Proof by example. Let all participants be placed equidistantly on the equator. If a number of participants are N=2K1N = 2 K - 1 where KK is the natural number thus concurrent execution detection requires KK participants to have the information about the operation. Assuming n=η=1n = \eta = 1 as an ideal (theoretically achievable) situation we conclude that commit latency is equal to τ2/(11/2K)\tau_2 / (1 - 1/2K) where τ2\tau_2 is the wave half-time. Increasing the number of participants NN allows achieving the latency arbitrarily close to the wave half-time. \Box

6 Solar System

Here is a table that represents the minimum guaranteed latencies for the solar system objects in the universe:

Planet System Latency
Earth 133 ms
Moon 36 ms
Mercury 51 ms
Mars 71 ms
Venus 126 ms
Jupiter 1.4 s
Sun 14 s
Earth-Moon 2.6 s
Earth-Mars 6-44 min*

*Depending on the position of the planets.

7 Connection Relaxing

Theorem 3.3.2 assumes that the participants are connected with each other using surface links thus the wave propagates on the sphere. Let us relax this requirement and allow the wave propagates directly through the surface. One of the possibility how to achieve this is to use neutrino particles [3] (BTW, do you know, that neutrino is detected by using Cherenkov radiation [2] due to extremely low interaction with the matter).

Definition 7.1. Straight connection is the shortest geometrical connection in space.

Using the mentioned shortcut we may obtain the following theorem.

Theorem 7.1. The minimum guaranteed latency for the globally highly available strong consistency database using straight connections is 85 ms.

Proof. The wave full-time of Earth for the straight connections is 4RE/c4 R_E / c which is equal to 85 ms. \Box

8 Place Relaxing

Theorem 7.1 assumes that all participants are placed on the sphere. In addition to connection relaxing let us assume that the database participants may be placed at any point within the sphere.

Definition 8.1. The arbitrarily placed database is the database where participants can be placed at any point within sphere using straight connections.

Theorem 8.1. The minimum guaranteed latency for the globally available strong consistency arbitrarily placed database is 42 ms.

Proof. Let all clients are spread across the sphere and they issue the operation to be committed at the same time. To commit the sequence of operations according to theorem 3.2.3 there must be at least one point on the sphere where all waves propagate from all clients to that point. Let us choose the earliest moment of the time where that point pp appears. The shortest path among all clients to the point pp occurs when pp is the center of the sphere (otherwise we could find the client with larger distance). Thus the minimum guaranteed latency is 2RE/c2 R_E / c or 42 ms. \Box

Theorem 8.2. The minimum guaranteed latency 42 ms for the globally available strong consistency arbitrarily placed database is theoretically achievable.

Proof by example. Let pcp_c be the center of the sphere and the only participant of the consistent database. It is evident that the time required to propagate wave from any point on the surface to the pcp_c and back is equal to 42 ms. \Box

9 Reliability and Fault-Tolerance

Theorem 3.3.2 is proved under assumptions that any participant lives forever and handles the incoming messages correctly. If we introduce additional failure modes (fail-stop, fail-silent etc) it is evident that latencies cannot become less than theorems 3.3.2, 7.1 and 8.1 state because the system must resend the information to redundant participants increasing availability and timings as well.

Nevertheless, it could be easily proved that latencies are achievable at the same values even in the case of arbitrary node failures.

Definition 9.1. Byzantine participant group is the group of nodes that uses Byzantine consensus to handle received and send information under arbitrary failure.

Definition 9.2. Live Byzantine participant group is the Byzantine participant group that achieve consensus under a limited fixed number of round trips between participants within the group.

Theorem 9.1. Live Byzantine participant group behaves as a single participant without failure.

Proof. If Byzantine consensus makes the progress due to liveness conditions it means that it operates normally even in the case of arbitrary failures because consensus algorithm preserves safety. \Box

Definition 9.3. The link has efficiency η=v/c\eta = v / c where cc is the wave propagation speed and vv is the information propagation speed.

Note that η1\eta \leq 1 in accordance with lemma 3.1.1.

Definition 9.4. Group efficiency is ηG=min(ηl)\eta_G = \min(\eta_l) across all links ηl\eta_l within the group.

Definition 9.5. The group is flexible when group efficiency preserved under proportional distance changing between nodes.

Theorem 9.2. The commit latency of flexible live Byzantine participant group can be arbitrarily small.

Proof. tcdNr/ηct_c \leq d N_r / {\eta c} where tct_c is the commit latency, dd is the maximal distance between participants, NrN_r is the limited number of round trips and cc is the wave propagation speed. Decreasing dd causes tct_c decreasing as well. \Box

Based on theorem 9.2 we conclude that latencies are achievable because we could use live flexible groups to handle arbitrary failures that introduce negligible overhead on the overall latency timings.

Thus the theorem allows to easily build a fault-tolerant solution based on the simple non-fail model without sacrificing the timings of the database system.

10 Geo-Locality

Flexible groups and their commit latency allow introducing the notion of geo-locality. The idea is the following. Sometimes we know that the client at any time can be only in one place or near some place geographically. Thus we could choose the closest data center to handle client requests. The locality allows us to reduce the commit latency.

The only question is what to do if the client moves to another distant region. To preserve consistency we must apply group membership change protocol and move corresponding commit group to another closest data center preserving the minimum response timings. It may work if we implement a rather complex algorithm to change the group and put data centers around the world to be as close as possible to the client. But it does not work if there are concurrent clients across the planet at any time and space.

11 Consistency Relaxing

Another way to decrease the latency of operations is to relax the consistent requirements. E.g. one may consider applying sequential consistency still preserving geo-availability of the data.

The idea of sequential consistency can be represented by the following model. The client may issue any operation at any time. But instead of execution, the participant puts the operation into the queue to execute it later. The queue is the global across all participants. Another participant may dequeue the elements one by one and execute it in a serial way. Thus we serialize execution of all operations.

To reduce the write latency we could do the following approach. Instead of using the global queue to schedule operation the client chooses the closest consensus group inside some data center and put the operation into that queue. Periodically the operation is dequeued from that group and enqueued into the global queue transactionally using exactly once semantics.

The same approach can be used to reduce read latency by having the closest replicated data based. The replicated data is obtained by reading the operations from the global queue.

The approach allows to significantly reduce the latency of write operations by reducing the consistency level. At the same time, the actual latency to commit or execute the operation still cannot be less than mentioned theorems.

This scheme illustrates the idea of how to reduce latency without sacrificing geo-availability. The cost is the reduction of consistency guarantees.

12 CAL Theorem

CAL definition. CAL abbreviation means:

  1. "C" is the consistency that implies linearizability, e.g. strict consistency (linearizability+serializability) or linearizability.
  2. "A" is geo-availability meaning that client can be placed at any point on Earth surface.
  3. "L" is the guaranteed low latency, lower than related theorems states, e.g. submillisecond latency.

CAL theorem. It is impossible for a distributed database to simultaneously provide more than two out of the CAL guarantees.

Proof. See theorem 0 and considerations related to consistency relaxing and geo-locality. \Box


I am very sad to present the theorem because it disallows building the best database with smartest geo-replicated consistent algorithms. It happened unintentionally, sorry for that.

On the other hand, it allows me to save my energy not to invent unworkable algorithms and relax by accepting the inevitability.

Nevertheless, the developed approach turned out to be powerful and productive. The wave propagation model allows understanding the minimum achievable latency in ideal situations even without knowing the connection topology and algorithms that are used to achieve consistency. Thus it is applicable always regardless of the system.

As the consequence, it provides the boundary to search and investigate the possible optimizations. One of the important application is to improve consensus algorithms by reducing latency based on the wave and information propagation consideration. In that perspective, the nonconcurrent optimization gives the key understanding how to further improve the consensus algorithm characteristics based on runtime situation.

At the same time, the article introduces the locality approach to significantly improve the latency. Limiting the distance between all participants including clients gives the possibility to further improve system characteristics without sacrificing the other parameters like consistency. The only requirement is to ensure that client cannot be at any point on geosphere simultaneously.

The theorem states the timings regarding ideal system meaning that there is no information (network packets) drops, node failure and other mechanisms that may potentially increase the overall latency. To deal with that complexity the article introduces the notion of efficiency and defines the way to obtain the direct value of the efficiency. The total latency can be easily recalculated based on the ideal latencies and efficiency. The medium model is thoughtful while considering the nonideality of the environment.

Moreover, the article provides the way how to shift consideration from totally available nodes to partially available nodes including arbitrary failures. The further application the Byzantine consensus allows creating a robust infrastructure and treating each participant as totally available. Of course, the model can be easily reduced to the normal consensus if the arbitrary failures are forbidden. The locality consideration to reduce the latency is applicable for the consensus groups as well.

The newly introduced CAL theorem allows tuning important entities describing the tradeoffs and connectivity between them. The possible way to preserve geo-availability and low latency is to use more relaxed consistency models e.g. briefly described sequential consistency application.

Now the turn from the wave propagation model to concrete consensus and transaction algorithms to make the closest bridge between them in the context of achieving the described guaranteed minimum latency. Those algorithms are coming.

Self-Examination Question. What is the minimum possible number of round trips required to commit the client operation excluding round trip from the client to the participants?

The answer is: eiπ+1e^{i \pi} + 1.


[1] Wikipedia: Refractive index
[2] Wikipedia: Cherenkov radiation
[3] Wikipedia: Neutrino

No comments :

Post a Comment