Life of Xunzhang

# 16 Apr 2016 Fault Tolerance in Action -- Preface

During last few years, I am working on developing distributed computational tools. I wrote Paracel project, a distributed training framework designed for machine learning and graph algorithms. The question I want to point out is that Does a distributed computational framework needs to be fault tolerant? Before drawing an immediate conclusion, let’s just do a little bit calculation.

Suppose a distributed computational system can’t move forward because of the crash of one worker, therefore we should care about the failure. Let’s denote $Prob$ as the crash probability of one worker in $H$ hours, denote $N$ as the number of workers we use. Then we can easily deduce the failure probability of the system in $H$ hours is: $1 - (1 - Prob)^{N}$. In real scenario, $Prob$ is close to 0.0005 with $H = 10$, so the formulation tells us the failure probability with 500 workers is $\approx 20\%$. Under this consideration, fault tolerance is not that critical comparing to performance(importing fault tolerance always sacrifice some efficiency).

However, the conclusion may be misleading. The key problem is the scale depends on your system. For example, if $N$ is greater than 1500, then the failure probability above is $> 50\%$. In other words, if you run your application twice, the system could probably fail once. It is really a disaster! Besides distributed computational field, traditional distributed storage systems always care more about the failure problem. For example, a distributed SQL engine needs to run forever, the reliability is key important.

So is it difficult to handle failure? In mapreduce frameworks such as hadoop and spark, it is quite simple since workers are stateless. In this case, all coordinating jobs are handled by master, so we just need to pick up another healthy worker to take over. That’s a clever abstraction, but it also limits the flexibility of the whole system. Moreover, data locality is poor in mapreduce since its multi-stage shufferling abstraction. In more advanced computational framework, the fault tolerance is extremely hard to design.

In the following series of blog posts, I am trying to dive into detail with implementing fault tolerance feature of distributed systems to archieve high availability. A presupposed outline might looks like below:

1. RPC
2. Fault recovery in Mapreduce
3. A global view service strategy
4. Consensus problem