Sunday, September 20, 2015

Replicated Object. Part 1: Introduction

1 Abstract

The present article explains an early prototype that introduces the concept of replicated object or replob. Such object is a further rethinking how to deal with complexity related to distributed systems development. Replob eliminates the dependency on the external reliable service and incorporates the consistent data manipulation into the user-defined objects representing data and related functionality. The idea is based on using the power of C++ language and object-oriented programming that allows complex logic utilization within distributed transactions and significantly simplifies development of the reliable applications and services. Subsequent articles will explain presented approach in detail step-by-step.

2 Introduction

Disclaimer. Almost all methods specified in the article contain dirty memory hacks and abnormal usage of C++ language. So if you are not tolerant to system and C++ perversions please stop reading this article.

Today, topics related to distributed systems are one of the most interesting and attract many people including developers and computer scientists. The popularity can be explained in a simple manner: we need to create robust fault-tolerant systems that provide safe environment to perform execution of operations and data storing.

Along with that, the consistency of distributed system plays important role. It comes with a price if you want to have stronger notion of consistency level. There are a set of systems provides a weakest form of consistency: so called eventual consistency. While those systems have relatively good performance they cannot be used in many areas where you need to have transactional semantics for your operations. The thing is that it is much simpler to meditate and reason about a system under consideration using one of the strong forms of consistency like strict consistency or linearizability. Due to those consistency levels, it is much easier to develop reliable application with safe semantics of operations.

3 Overview

The most common way developing a distributed system is using special building blocks. Those building blocks should provide the convenient way to deal with a complexity related to asynchronous nature of distributed services and a various types of failures including networking issues, process crashes and hardware malfunction. In distributed environments, those failures should not be treated as exceptional and must be handled as a normal code execution. Thus the task of having reliable and consistent building block to deal with a distributed issues is appeared on the scene.

Today's systems use fault-tolerant centralized coordination services like Zookeeper (mostly) or etcd (still under active development). They use consensus-based algorithms like Zab (Zookeeper) or Raft (etcd) to provide linearizability. The idea here is the following. At the first stage the leader is elected and at the second stage the designated leader (master) commits the messages in a sequential order providing necessary consistency level. Whereas Zookeeper documentation states that Zookeeper uses primary-backup instead of state machine replication, it is evident that the only difference between those notions that primary-backup based on replica sequence of requests while state machine replication based on client sequence. I think that the only matters the fact that they agree upon the sequence of deterministic operations using the developed master-based consensus algorithms.

4 Discussion of Existent Approaches

The drawback of master-based consensus algorithm is obvious: when the master fails it requires a time period to handle committing messages. The master timeout cannot be very small because it can have negative impact on performance due to high probability of new master election. It cannot be very large either due to significantly increasing latency during master failure. Thus the actual timeout is a tradeoff between latency and reelection probability depending on network conditions and replicas performance. The performance of consensus algorithm strictly depends on the master liveness and it takes significant time to restore the operability and efficiency due to timeout period and consistency preserving logic. Such logic requires at least several round trips, agreement on uncommitted entries and it does not guarantee the convergence with limited amount of round trips because almost every participant may become a new master (leader). So the system may become unavailable for a relatively long period:

  1. Chubby: most outages were 15s or less, and 52 were under 30s.
  2. MongoDB: it varies, but a replica set will select a new primary within a minute... During the election, the cluster is unavailable for writes.
  3. Zookeeper: After 15 seconds or so, a new leader is elected in the majority component, and writes may proceed again. However, only the clients which can see one of [n3 n4 n5] can write: clients connected to [n1 n2] time out while waiting to make contact with the leader.

4.1 Transactional Semantics and Complex Scenarios

One of the most difficult challenges is to apply transactional semantics for the complex logic. Let's assume that we have reliable storage like Zookeeper and we would like to perform the following sequence of operations:

  1. Load some portion of data from the storage into memory to deal with.
  2. Apply complex logic to process the data and obtain the result.
  3. Save the result to the storage.

This scenario could be solved by applying several approaches.

4.1.1 Pessimistic Locking Scheme

Pessimistic locking or concurrency control scheme based on explicit locking mechanism like using mutex for multithreaded applications. Task mentioned above could be solved by applying the following sequence of operations:

  1. Obtain exclusive lock to perform the operations.
  2. Perform operations mentioned above (load, apply and save).
  3. Release the lock.

The disadvantage of that scheme is derived from the exclusive locking mode:

  1. Mutual exclusion increases waiting times needed to propagate lock/unlock actions. Therefore, it increases overall operations latency.
  2. In case of process failure we potentially can have inconsistent data (fortunately, Zookeeper has multi update functionality to apply all results atomically on the final stage). It requires considerable amount of time to spread the process failure knowledge to the system to be able to release the obtained lock.

I would like to emphasize that the systems like Zookeeper do not have explicit lock/unlock functionality. One has to use special lock recipe to be able to utilize pessimistic locking scheme. It introduces additional penalty on the overall transaction latency (see also: Addressing the ZooKeeper Synchronization Inefficiency).

Due to mentioned issues the second approach appears on the scene.

4.1.2 Optimistic Locking Scheme

Optimistic scheme tries to get around the performance issues from the previous approach. The idea is to verify the actual state of data before committing:

  1. Load the state of data under consideration from the storage.
  2. Apply complex logic locally and create batch of writes.
  3. Atomically verify that no other transaction has changed the data and apply batch of writes.
  4. If verification fails => repeat from the 1st step.

All action on the 3rd step must be executed atomically including verification and applying. This scheme can be implemented by using the incremental version counter: on any successful update operation we increase the counter by one. The idea is to apply compare-and-swap operation that atomically checks the version counter to verify that the data has not been changed and sets the new value.

This scheme still has the following drawbacks:

  1. Implementation complexity: service must implement CAS and batch writes operations as a single atomic operation.
  2. High cost on contention: when there are many concurrent updates the algorithm requires repeating steps from the beginning wasting the process resources due to version conflicts.

Additionally for both pessimistic and optimistic schemes we need to serialize our internal data into hierarchical key space of the corresponding system (e.g. Zookeeper "znodes" or etcd "nodes"). All mentioned facts lead the application to become more complex and error prone. Thus I would like to go to completely another direction.

5 Replicated Object Concept

Let's step back and remember object oriented programming (OOP). We have a notion of objects. Each object has the underlying data representing an object state. An object contains a set of methods that transforms the object from one state to another state.

The idea is to replicate actions (object methods) across the nodes instead of data (object state) replication. Those actions change the object state deterministically and create the illusion that the object itself is replicated. Linearizability guarantees that all replicas are agreed on the same sequence of operations thus the distributed object state will remain consistent. It is very similar to state machine replication. The only difference is that I use ordinary object to represent the state and methods to represent the events transforming the object. This mapping significantly reduces the complexity and allows using the C++ language power because it supports OOP natively without code bloating.

6 Replicated Object Proposal

My replicated object (aka replob) proposal has the following features:

  1. Embedded.
  2. Masterless.
  3. In-memory.
  4. Linearizability.
  5. FIFO process guarantee.
  6. Fast local readings.
  7. Concurrent flexible distributed transactions.
  8. Parallel independent transactions option.
  9. Supports any native data structures.
  10. CAP tunable.
  11. Smooth set of replicas degradation.
  12. Safety and liveness under network issues:
    1. Partitioning.
    2. Partial partitioning like "bridging".
    3. Temporary network instability.
    4. Partial network packets direction.

Below I briefly explain each item.

Embedded. It is not a standalone service. The functionality operates within user process thus allowing reducing latency by decreasing the number of round trips and corresponding overhead. The approach completely eliminates the dependency on external services like Zookeeper or etcd and utilizes native interfaces dramatically simplifying interoperation with replication logic making it completely transparent from the developer perspective.

Masterless. The algorithm does not have designated master (leader). Thus any node is indistinguishable from each other. Masterless algorithm significantly reduces the fail recovering timings and provides predictable behavior under most conditions.

In-Memory. Current implementation does not have the persistent layer and every item is distributed across the replica nodes inside the processes memory. Algorithm still allows adding persistence property.

Linearizability. Replicated object algorithm provides linearizable consistency.

First in First out Process Guarantee. For the specified process all operations are completed in the order they scheduled (FIFO order).

Fast Local Readings. A special mode allows reading the data locally by reducing the consistency to sequential consistency level. It significantly decreases the latency and overall system overhead.

Concurrent Flexible Distributed Transactions. Deterministic user-specific functionality of any complexity can be placed inside distributed transactions. Those transactions are handled concurrently.

Parallel Independent Transactions Option. User may decide to have several consensus instances to parallelize the agreement on a sequence of independent transactions.

Supports Any Native Data Structures. Developer can use standard containers like std::vector, std::map etc as well as boost::optional, boost::variant or other data structures that provides copy semantics.

CAP Tunable. User may choose between linearizable consistency and availability under network partitioning.

Smooth Set of Replicas Degradation. The system preserves consistency even if the number of nodes reduces dramatically, e.g. from five replicas to two replicas or even to one replica under appropriate conditions.

Safety and Liveness under Network Issues. There are plenty of different kinds of network issues (see Aphyr: The network is reliable). Algorithm retains consistency and operability under mentioned issues.

All those items will be discussed in detail in subsequent articles.

7 Example: Key-Value Storage

To demonstrate the flexibility and power of the approach I consider the following example. The task is to implement replicated key-value storage with the following interface (I omit std:: and boost:: namespaces):

struct KV
    optional<string> get(const string& key) const;
    void set(const string& key, const optional<string>& value);
    unordered_map<string, string> kv_;

I chose symmetric interface for simplicity. set method deletes appropriate key if value is empty. Corresponding implementation in case of representing the normal object is the following:

optional<string> KV::get(const string& key) const
    if (kv_.count(key) == 0)
        return {};

void KV::set(const string& key, const optional<string>& value)
    if (value)
        kv_[key] = *value;

Now I would like to turn the ordinary object into replicated object. To do that I just add the following:

DECL_REPLOB(KV, get, set)

Hint: the implementation DECL_REPLOB is the following:


Then I can use the following code snippet to replicate my data across the replicas:

replob<KV>().set(string{"hello"}, string{"world!"});

All KV instances from replica set contain specified key-value pair when the invocation of KV::set completes. Please note that the object is referenced by the type KV meaning that each replica contains it is own single object instance.

To read the data in a linearizable manner I write:

auto world = replob<KV>().get(string{"hello"});

To improve the performance I just write down:

auto localWorld = replobLocal<KV>().get(string{"hello"});

That's it!

7.1 Transactions

Let's suppose I want to update the item. The naive approach is to use the following code:

auto world = replobLocal<KV>().get(string{"hello"}).value_or("world!");
replob<KV>().set(string{"hello"}, "hello " + world);

The only problem is that two atomic operations together are not atomic (race condition of the second kind). Thus we need to put those actions inside the transaction:

MReplobTransactInstance(KV) {
    auto world = $.get(string{"hello"}).value_or("world!");
    $.set(string{"hello"}, "hello " + world);

Now those actions are applied atomically across the all replicas.

7.2 Transactions with Results

Let's consider the following task: calculate the length of value for specified key. Nothing's easier:

// use local instance because we do not need to update the object
auto valueLength = MReplobTransactLocalInstance(KV) {
    return $.get(string{"hello"}).value_or("").size();

The same approach can be applied for update operation:

auto valueLength = MReplobTransactInstance(KV) {
    auto world = $.get(string{"hello"});
    $.set(string{"another"}, world);
    return world.value_or("").size();

All the mentioned operations are applied on the replicas atomically.

7.3 Multiple Replob Transactions

Let's assume that we have two independent instances of key-value storages: KV1 and KV2. We can combine operations for corresponding instances by using the modifier MReplobTransact:

// the first transaction is distributed
// performs value copying from KV2 to KV1 for the same key
MReplobTransact {
// the second transaction is applied locally
// returns total value size calculation for the same key
auto totalSize = MReplobTransactLocal {
    auto valueSize = [](auto&& val) {
        return val.value_or("").size();
    return valueSize($.instance<KV1>().get(string{"hello"}))
         + valueSize($.instance<KV2>().get(string{"hello"}));

Should I mention that all those actions are performed atomically and the first transaction is spread across the all replicas?

7.4 Advanced Example

Let's consider iteration through the collection with user-defined function:

struct KV
    optional<string> get(const string& key) const;
    void set(const string& key, const optional<string>& value);

    // generic method to iterate through the collection
    template<typename F>
    void forEach(F f) const
        for (auto&& v: kv_)

    unordered_map<string, string> kv_;

Now the task is calculating the total size of all values:

auto valuesSize = MReplobTransactLocalInstance(KV) {
    size_t sz = 0;
    $.forEach([&sz](auto&& v) {
        sz += v.second.size();
    return sz;

As you can see the way is completely straightforward.

8 Further Directions

Previously I consider several simple but powerful examples how to use replicated object approach. Further articles introduce utilized ideas and concepts step-by-step:

  1. God adapter.
  2. Nonblocking deadlock-free synchronization or subjector model.
  3. Uniform actor model or funactor model.
  4. Overgeneralized serialization.
  5. Behavior modifiers.
  6. IO and coroutines.
  7. Consistency and CAP theorem applicability.
  8. Phantom, replob and masterless consensus algorithm.
  9. Implementation examples:
    1. Atomic failure detector.
    2. Distributed scheduler.

9 Conclusion

We consider the introduction into the fault tolerant distributed replicated object with the set of outstanding features. It allows significantly reducing the complexity of reliable distributed application creation and opens the door to use it in a wide range of areas.

Masterless consensus algorithm allows handling fails in a predictable way without wasting the time. Embedded approach eliminates network delays required to cooperate with external services. Whereas strong consistency model provides convenient way to interact with replob in a transactional and flexible manner.

Special thanks to Sergey Polovko, Yauheni Akhotnikau and Petr Prokhorenkov for useful advices and comments.

10 Test Questions

  1. How is DECL_REPLOB implemented?
  2. What is the difference between local and nonlocal operations?
  3. Is it possible to implement masterless consensus algorithm?
  4. Specify all behavior modifiers mentioned in the article.


[1] Documentation: Zookeeper.

[2] Documentation: etcd.

[3] Article: Zab: High-performance Broadcast For Primary-Backup Systems.

[4] Article: In Search of an Understandable Consensus Algorithm (Extended Version).

[5] Zookeeper documentation: Zab vs. Paxos.

[6] Wikipedia: State Machine Replication.

[7] Article: The Chubby Lock Service For Loosely-Coupled Distributed Systems.

[8] MongoDB documentation: How long does replica set failover take?

[9] Aphyr blog: Zookeeper.

[10] Documentation: ZooKeeper Recipes and Solutions: Locks.

[11] Article: Addressing the ZooKeeper Synchronization Inefficiency.

[12] Wikipedia: Compare-and-swap

[13] Documentation: Zookeeper znodes.

[14] Documentation: etcd nodes.

[15] Aphyr blog: The Network Is Reliable.

[16] Article: ZooKeeper: Wait-Free Coordination For Internet-Scale Systems.

No comments :

Post a Comment