Learning about Faults, Errors, and Failures.

I’ve been reading about failures in distributed computing. A lot of it is thanks to Vaidehi Joshi’s Year of Distributed Computing at https://medium.com/@vaidehijoshi . Here’s a small summary of what I’ve learned.

A fault is a flaw in the system. It leads to an error, which means the program is in a bad state. That results in a server failure.
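To make the chain concrete for myself, here is a tiny Python sketch. The `Counter` class and its stock-tracking behaviour are made up for illustration; they don't come from the articles. The fault is a missing bounds check, the error is the negative stock it produces, and the failure is what the caller finally sees.

```python
class Counter:
    """Hypothetical service that tracks how many items are in stock."""

    def __init__(self, stock):
        self.stock = stock

    def remove(self, n):
        # Fault: there is no bounds check here (the flaw in the code).
        self.stock -= n

    def report(self):
        # Failure: the bad state becomes visible to the caller.
        if self.stock < 0:
            raise RuntimeError("stock is negative")
        return self.stock


counter = Counter(stock=1)
counter.remove(2)  # Error: stock is now -1, a bad internal state.

try:
    counter.report()
    failed = False
except RuntimeError:
    failed = True  # The failure is what the outside world observes.
```

Note that the error exists as soon as `remove` returns, but nobody notices until `report` is called.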

Hierarchy of faults⌗

Transient - Occurs once and leaves (like a Panda)
Intermittent - Occurs sometimes.
Permanent - Always occurs.
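One way I picture the three kinds of fault is as functions of the attempt number. This is only a sketch: the function names are mine, and the "every third attempt" pattern for the intermittent case is an arbitrary choice to make it deterministic.

```python
def transient(attempt):
    # Shows up on the first attempt only, then goes away.
    return attempt == 0


def intermittent(attempt):
    # Comes and goes; here every third attempt hits it (arbitrary pattern).
    return attempt % 3 == 0


def permanent(attempt):
    # Shows up on every attempt until someone fixes the system.
    return True


def faulty_attempts(fault, attempts=6):
    """Return the attempt numbers on which the given fault occurs."""
    return [i for i in range(attempts) if fault(i)]


print(faulty_attempts(transient))     # [0]
print(faulty_attempts(intermittent))  # [0, 3]
print(faulty_attempts(permanent))     # [0, 1, 2, 3, 4, 5]
```

Seen this way, a simple retry gets past a transient fault, might get past an intermittent one, and never gets past a permanent one.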

Types of Errors⌗

I don’t think there is a concept of error types.

Hierarchy of failures⌗

These are listed from least severe to most severe.

This means that any system that can handle a problem lower in the list can also handle anything higher in the list.

Fail Stop - server crashes, and other servers know it
Crash - server crashes, no more messages will be delivered
Omission - a message is not sent or not received
Timing or Performance - server doesn’t respond quickly enough
Authenticated Byzantine - server responds with any ol’ message, but won’t lie about other messages
Arbitrary or Byzantine - server responds with any ol’ message and may lie about other messages, possibly compromised
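The ordering in the list above can be sketched as an `IntEnum`, where a bigger number means a more severe failure. The names `Failure` and `tolerates` are my own, and the numeric values only encode the order of the list, nothing more.

```python
from enum import IntEnum


class Failure(IntEnum):
    # Values follow the list order: least severe first.
    FAIL_STOP = 1
    CRASH = 2
    OMISSION = 3
    TIMING = 4
    AUTH_BYZANTINE = 5
    BYZANTINE = 6


def tolerates(designed_for, observed):
    """A system designed for one failure mode also covers the less severe ones."""
    return observed <= designed_for


print(tolerates(Failure.TIMING, Failure.CRASH))     # True
print(tolerates(Failure.CRASH, Failure.BYZANTINE))  # False
```

So a system built to survive timing failures also survives crashes, but a system that only expects crashes has no answer for a Byzantine server.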

Byzantine failures are Value Failures (also called Response Failures); the rest are Timing Failures.

Fault tolerance means the system as a whole continues even during a partial failure.
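Here is a minimal sketch of that idea, assuming a made-up `Replica` class and a `read_any` helper (neither is from a real library). One replica is down, but the read still succeeds because another replica answers.

```python
class Replica:
    """Hypothetical server holding a copy of a value."""

    def __init__(self, name, value, up=True):
        self.name = name
        self.value = value
        self.up = up

    def read(self):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return self.value


def read_any(replicas):
    """Return the value from the first replica that answers."""
    for replica in replicas:
        try:
            return replica.read()
        except ConnectionError:
            continue  # Partial failure: move on to the next replica.
    raise RuntimeError("all replicas failed")


replicas = [Replica("a", 42, up=False), Replica("b", 42), Replica("c", 42)]
result = read_any(replicas)
print(result)  # 42
```

This only covers the crash end of the list. It trusts whatever the first live replica says, so it would not help against a Byzantine server that answers with a wrong value.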

Papers to read:⌗

Understanding Fault-Tolerant Distributed Systems, by Flaviu Cristian (pdf)