Monday, September 10, 2018

Exactly Once is NOT Exactly the Same: Article Analysis

Introduction

I decided to analyze an article that describes some interesting details of exactly-once stream processing. The thing is that authors sometimes misunderstand important terms. Analyzing the article will clarify many aspects and details, because revealing its illogicalities and oddities lets you fully grasp the concepts and their meaning.

Let’s get started.

Analysis

Everything starts very well:

Distributed event stream processing has become an increasingly hot topic in the area of Big Data. Notable Stream Processing Engines (SPEs) include Apache Storm, Apache Flink, Heron, Apache Kafka (Kafka Streams), and Apache Spark (Spark Streaming). One of the most notable and widely discussed features of SPEs is their processing semantics, with “exactly-once” being one of the most sought after and many SPEs claiming to provide “exactly-once” processing semantics.

Meaning: data processing is extremely important, blah-blah-blah, and the topic under discussion is exactly-once processing. So let's discuss it.

There exists a lot of misunderstanding and ambiguity, however, surrounding what exactly “exactly-once” is, what it entails, and what it really means when individual SPEs claim to provide it.

Indeed, it is very important to understand what it is. To do that, it would be nice to give a correct definition before launching into lengthy reasoning. And who am I to give such damn sensible advice?

Thursday, July 19, 2018

Heterogeneous Concurrent Exactly-Once Real-Time Processing

Concurrent sausage

Abstract

Exactly-once data processing in real time is an extremely non-trivial task and requires a serious and thoughtful approach to the entire pipeline. Some even believe that such a task is impossible. In reality, one wants an approach that provides generic fault-tolerant processing without any delays while using different data storages, which puts forward an even stronger requirement for the system: concurrent exactly-once together with heterogeneity of the persistence layer. To date, none of the existing systems supports this requirement.

The proposed approach consistently reveals the secret ingredients and the necessary concepts that make it possible to implement heterogeneous concurrent exactly-once processing relatively easily, literally on top of two components.
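
To make the promise less abstract, here is a minimal sketch of the standard trick hiding behind most "exactly-once" claims; this is my own illustration, not the two components from the post. The idea: at-least-once delivery plus deduplication by message identifier, so that a redelivered message does not change the state twice.

    # "Effectively exactly-once" on top of at-least-once delivery.
    class ExactlyOnceConsumer:
        def __init__(self):
            self.processed_ids = set()  # in reality: persisted together with the state
            self.total = 0              # the state being updated

        def handle(self, msg_id, amount):
            if msg_id in self.processed_ids:
                return                  # duplicate: the effect is already applied
            self.total += amount
            self.processed_ids.add(msg_id)  # must be atomic with the update above

    consumer = ExactlyOnceConsumer()
    for msg_id, amount in [(1, 10), (2, 5), (1, 10)]:  # message 1 is delivered twice
        consumer.handle(msg_id, amount)
    assert consumer.total == 15  # the duplicate did not double the effect

All the real difficulty is compressed into a single comment: the dedup set and the state change must be atomic and durable, and they must stay that way under concurrency and across heterogeneous storages.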

Introduction

A developer of distributed systems passes through several stages:

Stage 1: Algorithms. Here we study the basic algorithms, data structures, programming approaches such as OOP, and so on. The code is strictly single-threaded. This is the initial phase of entering the profession; nevertheless, it is rather difficult and can last for years.

Stage 2: Multithreading. Then come the questions of extracting maximum efficiency from the hardware: multithreading, asynchrony, races (a minimal example of one is sketched right after this list), debugging, stracing, sleepless nights... Many get stuck at this stage and, from some point on, even start to catch an inexplicable buzz from it. But only a few come to understand the architecture of virtual memory and memory models, lock-free/wait-free algorithms, and the various asynchrony models. And almost no one ever gets to the verification of multithreaded code.

Stage 3: Distributed programming. Here such shit happens that words cannot describe it.
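
Since races come up again and again at Stage 2, here is a minimal sketch of the defining bug of that stage, a lost update caused by an unsynchronized read-modify-write. Python is used purely for brevity, and whether the loss actually shows up on a given run depends on thread scheduling:

    import threading

    counter = 0
    lock = threading.Lock()

    def unsafe_worker():
        global counter
        for _ in range(100_000):
            tmp = counter       # read
            counter = tmp + 1   # write: another thread may have written in between

    def safe_worker():
        global counter
        for _ in range(100_000):
            with lock:          # makes the read-modify-write atomic
                counter += 1

    def run(worker):
        global counter
        counter = 0
        threads = [threading.Thread(target=worker) for _ in range(4)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return counter

    print(run(unsafe_worker))   # often less than 400000: updates get lost
    print(run(safe_worker))     # always 400000

The fix is a single lock around the read-modify-write; understanding why the unsynchronized version fails, and how to get the same guarantee without a lock, is what the rest of Stage 2 is about.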

Thursday, April 26, 2018

Attainability of the Lower Bound of the Processing Time of Highly Available Distributed Transactions

Introduction

Recently I read another article from the series "we are better than two-phase commit". I will not analyze the contents of that article here (although I'm thinking about giving it a detailed analysis). The task of my opus is to offer the most efficient version of a distributed commit in terms of the number of round trips. Of course, such a commit comes at a price. Still, the goal is to assess it and to show that two-phase commit is not the drag that many believe it to be.

It should also be noted that there will be no full-scale experiments or fake comparisons: only the algorithms and their theoretical analysis will be given. If desired, you can implement and test them in practice yourself. Of course, it would be much better to present that in this article as well, but it all depends on free time and motivation. In my opinion, describing algorithms is more important than presenting charts, because almost anyone can draw charts from the algorithms, while the opposite is not true.
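
For reference, here is what the baseline itself looks like: a minimal in-memory sketch of classic two-phase commit. This is my own illustration, not the algorithm promised above: one round trip to collect the votes, a second one to announce the decision.

    # Classic two-phase commit, stripped of networking and durable logging.
    class Participant:
        def __init__(self, name, can_commit=True):
            self.name = name
            self.can_commit = can_commit
            self.state = "initial"

        def prepare(self):
            # A real participant durably records its vote before answering.
            self.state = "prepared" if self.can_commit else "aborted"
            return self.can_commit

        def finish(self, commit):
            self.state = "committed" if commit else "aborted"

    def two_phase_commit(participants):
        votes = [p.prepare() for p in participants]   # round trip 1: gather votes
        decision = all(votes)                         # a single "no" aborts everyone
        for p in participants:                        # round trip 2: broadcast decision
            p.finish(decision)
        return decision

    nodes = [Participant("a"), Participant("b"), Participant("c", can_commit=False)]
    print(two_phase_commit(nodes))   # False: participant "c" voted no
    print([p.state for p in nodes])  # ['aborted', 'aborted', 'aborted']

Everything discussed below hides in these two sequential round trips, plus the durable logging on the coordinator and the participants that this in-memory sketch omits.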