#
16 Apr 2016
Fault Tolerance in Action -- Preface

During last few years, I am working on developing distributed computational tools.
I wrote Paracel project, a distributed training framework
designed for machine learning and graph algorithms. The question I want to point
out is that __Does a distributed computational framework needs to be fault tolerant__?
Before drawing an immediate conclusion, let’s just do a little bit calculation.

Suppose a distributed computational system can’t move forward because of the crash of one worker, therefore we should care about the failure. Let’s denote \(Prob\) as the crash probability of one worker in \(H\) hours, denote \(N\) as the number of workers we use. Then we can easily deduce the failure probability of the system in \(H\) hours is: \( 1 - (1 - Prob)^{N} \). In real scenario, \(Prob\) is close to 0.0005 with \( H = 10 \), so the formulation tells us the failure probability with 500 workers is \( \approx 20\% \). Under this consideration, fault tolerance is not that critical comparing to performance(importing fault tolerance always sacrifice some efficiency).

However, the conclusion may be misleading. The key problem is the scale depends on your system. For example, if \(N\) is greater than 1500, then the failure probability above is \( > 50\% \). In other words, if you run your application twice, the system could probably fail once. It is really a disaster! Besides distributed computational field, traditional distributed storage systems always care more about the failure problem. For example, a distributed SQL engine needs to run forever, the reliability is key important.

__So is it difficult to handle failure?__ In mapreduce frameworks such as hadoop and
spark, it is quite simple since workers are stateless. In this case, all
coordinating jobs are handled by master, so we just need to pick up another
healthy worker to take over. That’s a clever abstraction, but it also limits the
flexibility of the whole system.
Moreover, data locality is poor in mapreduce since its multi-stage
shufferling abstraction. In more advanced computational framework, the fault tolerance is extremely hard to design.

In the following series of blog posts, I am trying to dive into detail with implementing fault tolerance feature of distributed systems to archieve high availability. A presupposed outline might looks like below:

**RPC****Fault recovery in Mapreduce****A global view service strategy****Consensus problem****Raft made simple****Paxos protocol and other strategies in real systems[optional]****TBD**

Notice: Lots of ideas are motivated by the labs of 6.824, a famous MIT course. But since the course doesn’t allow publishing code of original project, I will rewrite it in C++ to avoid potential problem. Please don’t copy my code for assignment and other particular usage.

Hong Wu at 21:19