Distributed Systems and Low-Level Programming: Memory Model: Brief Description

Introduction

My previous post was related to discussion about memory model in C++11. Now let’s talk about memory model itself briefly.

So, memory model contains 3 different use cases. It can be classified differently. Let’s use the following criteria: how many threads are affected on atomic operation:

One thread, known as relaxed atomic operations.
Two threads, known as acquire-release operations.
More than two threads, the strongest guarantee, known as sequentially consistent operations.

One Thread: Relaxed Atomics

It’s quite simple: if you just need to have eventual atomic consistency you may use relaxed atomics. It has the best performance but it guarantees only atomic operation for that value. Other threads may read old values, but eventually updated value will appear. It’s useful, e.g., for statistics, debugging flags etc, or during intermediate atomic operations in some complex scenarios.

Two Threads: Acquire-Release Atomics

It has stronger guarantee than relaxed atomic operations thus having additional performance penalty (at least it has compiler barriers). The use-cases are similar to mutex-based approach on lock()/unlock() operations:

Locking => resource acquiring => acquire atomic operation
Unlocking => resource releasing => release atomic operation

It’s the most common use case: writer releases resource in one thread while reader acquires the same resource in another thread. That’s why it affects two threads.

Consume-Release Atomics

Acquire-release semantics has additional sub-case: consume-release semantics. The main and the only difference is that it can be used for pointer manipulations:

std::atomic<Object*> atomicObject;
// load pointer
Object* object = atomicObject.load(std::memory_order_consume);
// dependent read uses the same loaded pointer
object->value1 = 1;
// another dependent read uses the same loaded pointer
object->value2 = …;

The idea is that subsequent read operations use loaded pointer. Almost all processor models understand that there is dependent subsequent loadings and don’t require additional memory barriers (hello, DEC Alpha!). Thus the performance of consume-release is at most as good as acquire-release and sometimes is even better.

Three or More Threads: Sequentially Consistent Atomics

This is the strongest guarantee and slowest as well, no surprises. All examples have something in common: at least three threads are used to demonstrate usefulness of using sequentially consistent atomics. They are very complicated and very unlikely to see those examples in real applications. But it’s the most expected behavior among all guarantees. That’s why it is used as default value for all atomic operations in C++11.

Examples

Here are a couple of interesting examples to demonstrate usage in different scenarios.

Lock Contention Statistics

It’s quite natural to use mutexes for complex data structure and logic. But if you are developing high performance application you have to know about lock contention to prevent long waiting on lock() operations (see my previous post C++ User Group in St. Petersburg). Simple snippet below provides you the needed information:

std::mutex mutex;
std::atomic<int> lockCounter; // no contention
std::atomic<int> waitCounter; // wait due to contention
// … locking
if (mutex.try_lock())
{
    lockCounter.fetch_add(1, std::memory_order_relaxed);
}
else
{
    waitCounter.fetch_add(1, std::memory_order_relaxed);
    mutex.lock();
}
// mutex acquired, continue

As you see there is no magic on atomic manipulations. If you obtain that waitCounter is not small as expected you should reconsider locking mechanism usage by applying fine-grained approach.

Double-Checked Locking

Not so long ago double-checked locking was considered as anti-pattern. Now it can be implemented correctly and safely using C++11 memory model:

std::mutex mutex;
// returns singleton object of the type T
template<typename T>
T& single()
{
    static std::atomic<T*> ptr;
    // try to load pointer => consume
    T* t = ptr.load(std::memory_order_consume);
    if (t != nullptr)
        return *t;
    std::lock_guard<std::mutex> lock(mutex);
    // try again under mutex => relaxed
    t = ptr.load(std::memory_order_relaxed);
    if (t != nullptr)
        return *t;
    t = new T();
    ptr.store(t, std::memory_order_release);
    return *t;
}

Done!

Atomic Counter for Shared Pointers

As you (maybe) know std::shared_ptr<T> contains atomic counters to preserve number of owners. On increase operation there is no any surprises:

std::atomic<int> counter;
// ...
counter.fetch_add(1, std::memory_order_relaxed);

Decreasing is a little bit much more complicated:

int c = counter.fetch_sub(1, std::memory_order_acq_rel);
if (c == 1)
    delete ptr;

The reason of using acquire and release semantics simultaneously is the following. Let’s suppose we have two threads that own shared_ptr and associated counter == 2. Thread #1 decreases counter from 2 to 1 and thread #2 decreases from 1 to 0 and invokes delete operation. To correctly obtain the whole object state on delete operation thread #2 has to apply acquire semantics while thread #1 has to apply release semantics to "commit" its changes. Because you don’t known in advance the counter value you have to apply both semantics in single atomic operation meaning that reading is performed using acquire semantics while writing decreases value using release semantics.

Conclusion

I would recommend using the following receipts:

If you just need atomic operation with eventually consistency you may use relaxed atomics.
If you have mutex-like interaction between threads you should use acquire-release atomics. For pointers you should consider consume-release atomics.
If you are not sure about correctness or you have some complex scenarios involving different threads and operations you may fallback to default memory order: sequentially consistent ordering.

Distributed Systems and Low-Level Programming

Monday, August 4, 2014

Memory Model: Brief Description