I’ve been reading about failures in distributed computing. A lot of it is thanks to Vaidehi Joshi’s Year of Distributed Computing at https://medium.com/@vaidehijoshi . Here’s a small summary of what I’ve learned.

A fault is a flaw in the system. It leads to an error, which means the program is in a bad state. That, in turn, can result in a server failure.

Hierarchy of faults

  • Transient - Occurs once and leaves (like a Panda)
  • Intermittent - Occurs sometimes.
  • Permanent - Always occurs.
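To make the three kinds of fault concrete, here is a rough sketch that models each one as a function answering "does the fault show up on this attempt?". The function names and the 30% rate are made up for illustration; they don't come from any library or from the articles.

```python
import random

def transient_fault(attempt: int) -> bool:
    """Shows up once (here: on the first attempt) and then goes away."""
    return attempt == 0

def intermittent_fault(attempt: int, rng: random.Random, rate: float = 0.3) -> bool:
    """Shows up on some attempts, with no fixed pattern (rate is an assumption)."""
    return rng.random() < rate

def permanent_fault(attempt: int) -> bool:
    """Shows up on every attempt until something is repaired."""
    return True

rng = random.Random(42)  # seeded so repeated runs behave the same
attempts = range(5)
print("transient:   ", [transient_fault(a) for a in attempts])
print("intermittent:", [intermittent_fault(a, rng) for a in attempts])
print("permanent:   ", [permanent_fault(a) for a in attempts])
```

One practical takeaway: a simple retry gets past a transient fault, might get past an intermittent one, and never gets past a permanent one.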

Types of Errors

I don’t think there is a concept of error types.

Hierarchy of failures

These are listed from least severe to most severe.

This means that any system that can handle a problem lower in the list can also handle anything higher in the list.

  • Fail Stop - server crashes, and other servers know it
  • Crash - server crashes, no more messages will be delivered
  • Omission - a message is not sent or not received
  • Timing or Performance - server doesn’t respond quickly enough
  • Authentication Byzantine - server responds with any ol message, but won’t lie about other messages
  • Arbitrary or Byzantine - server responds with any ol message and may lie about other messages, possibly compromised

Byzantine failures are Value Failures (also called Response Failures). The rest are Timing Failures.
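The ordering above can be sketched as an enum whose numeric values follow the list, least severe first. This is only a minimal sketch of the summary as written here: the names `Failure`, `can_handle`, and `kind` are my own, and treating the hierarchy as a strict total order is a simplification.

```python
from enum import IntEnum

class Failure(IntEnum):
    # Values follow the order of the list: least severe first.
    FAIL_STOP = 1
    CRASH = 2
    OMISSION = 3
    TIMING = 4
    AUTH_BYZANTINE = 5
    BYZANTINE = 6

def can_handle(tolerates: Failure, occurred: Failure) -> bool:
    """A system built to tolerate `tolerates` also covers anything less severe."""
    return occurred <= tolerates

def kind(failure: Failure) -> str:
    """Value vs. timing split: only the Byzantine failures count as value failures."""
    byzantine = {Failure.AUTH_BYZANTINE, Failure.BYZANTINE}
    return "value" if failure in byzantine else "timing"

print(can_handle(Failure.OMISSION, Failure.CRASH))      # True
print(can_handle(Failure.OMISSION, Failure.BYZANTINE))  # False
print(kind(Failure.TIMING), kind(Failure.BYZANTINE))    # timing value
```

So a system designed around omission failures gets crash and fail-stop handling along with it, but nothing about it protects against a server that sends wrong answers.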

Fault tolerance means the system as a whole continues even during a partial failure.

Papers to read:

Understanding Fault-Tolerant Distributed Systems, by Flaviu Cristian (pdf)