The TLDR is simple: if you have a disappearing/reappearing bug, just run it again.

In 1985 Jim Gray popularized the term Heisenbug: a bug that disappears when you try to replicate it. That’s in contrast to a Bohrbug, which is easy to replicate.

Jim Gray’s key insight was that since a Heisenbug disappears quickly, you can solve it by just running the system again. This way, instead of spending time tracking down a bug and fixing it, you can get wicked high uptime just by having multiple copies of a system ready to go.
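Here’s a minimal sketch of the “just run it again” idea. The names run_with_retry and FlakyOperation are made up for illustration (they aren’t from Gray’s paper), and the sketch assumes the failure really is transient. A Bohrbug would fail every attempt and still surface as an error.

```python
import time


def run_with_retry(operation, attempts=3, delay_seconds=0.0):
    """Call operation(); if it raises, simply run it again, up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except Exception as error:  # assumption: the failure is transient
            last_error = error
            time.sleep(delay_seconds)
    # Every attempt failed, so this looks more like a Bohrbug: re-raise it.
    raise last_error


class FlakyOperation:
    """Hypothetical stand-in for a Heisenbug: fails on the first call only."""

    def __init__(self):
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls == 1:
            raise RuntimeError("transient failure")
        return "ok"
```

Calling run_with_retry(FlakyOperation()) hits the failure once, runs again, and returns "ok" on the second call.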

Anyways, there are three papers which tend to get cited a lot when it comes to Heisenbugs. Here’s my mental dump of them.

Why Do Computers Stop and What Can Be Done About It? by Jim Gray (1985)

Jim Gray looks at the problem of creating a highly available system. He defines Availability in terms of MTBF and MTTR.

  • MTBF: Mean Time Between Failures
  • MTTR: Mean Time To Repair
  • Availability = MTBF / (MTBF + MTTR)

The point of his paper is to reduce MTTR to almost zero, making Availability almost 100%. He proposes to do this with two key concepts, Modularity and Redundancy. The hope is that if your components are nicely modular, and if they fail independently, and if you have redundant modules, then your MTTR will be nothing but a blip. True downtime will only happen when all your redundancies fail at the same time.
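To put rough numbers on that, here’s a small sketch of the availability formula. The function names and sample figures are made up for illustration. The redundant_availability helper is not from the paper: it’s the standard parallel-reliability formula, and it assumes the copies fail independently.

```python
def availability(mtbf, mttr):
    """Availability = MTBF / (MTBF + MTTR); both arguments in the same time unit."""
    return mtbf / (mtbf + mttr)


def redundant_availability(single, copies):
    """Chance that at least one of `copies` modules is up.

    Assumption (not from the paper): modules fail independently, so the
    system is down only when every copy is down at the same time.
    """
    return 1 - (1 - single) ** copies


# Hypothetical module: fails every 1000 hours on average.
slow_repair = availability(mtbf=1000, mttr=10)    # ten hours to repair
fast_repair = availability(mtbf=1000, mttr=0.1)   # spare swaps in within minutes
two_copies = redundant_availability(single=0.99, copies=2)

print(f"slow repair: {slow_repair:.4%}")
print(f"fast repair: {fast_repair:.4%}")
print(f"two 99% copies: {two_copies:.4%}")
```

Shrinking MTTR and adding copies both push the result toward 100%, which is the whole argument for spare modules.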

“The key to providing high availability is to modularize the system so that modules are the unit of failure and replacement. Spare modules are configured to give the appearance of instantaneous repair: if MTTR is tiny, then the failure is ‘seen’ as a delay rather than a failure.”

Sidenote: There’s an ongoing debate at my work about Java v. JavaScript. One of the major arguments is about compiler type checking, and the counterargument is that you can work around the lack of it with more unit tests. Page 16 of this paper lends some support to the unit-test side.

Although compiler checking and exception handling provided by programming languages are real assets, history seems to have favored the run-time checks

“If a transaction commits, the messages on the session will be reliably delivered EXACTLY once [Spector].” I don’t think this is true. It would be more accurate to say, “will be reliably delivered AT LEAST once, and as long as it has a unique”

“We can’t hope for better people. The only hope is to simplify and reduce human intervention in these aspects of the system.”

Heisenbugs and Bohrbugs: Why are they different? (2003)

Are Heisenbugs and Bohrbugs actually different? Yes, this paper argues, yes they are.

A software process depends on elements like the program, the stack, and data. It also depends on external factors, like the hardware, OS, and other programs to sync with.

  • Bohrbugs come from elements in the first set, the program, stack, and data.
  • Heisenbugs come from interactions with the second set, the external factors.

This is the difference between these types of bugs. When it comes to Heisenbugs, there are too many factors in the external set to consider, or to easily and fully replicate. At the very least, factors like wear and tear on the hardware, clock drift, and network connectivity are too random to replicate in a controlled environment.

Protecting Applications Against Heisenbugs by Chris Hobbs (2010)

Programming has moved from simple “execute until completion” to “synchronous multithreaded processes.”

In the simple model you can map the input space to the output space, and each input variable becomes a single dimension of the output space.

In the SMP model, you’re subject to issues like preemption and shared state. Hobbs calls this “the trajectory of the input space.” The idea is that the exact same input, under different trajectories, can produce different output. As in, it’s not just your context, it’s the context of your context. Heisenbugs thrive in this area like fever in a dirty hospital.
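Here’s a toy sketch of same input, different trajectory. It’s my own illustration, not code from Hobbs’s paper, and it treats a “trajectory” as nothing more than the order in which two workers get scheduled. Each worker does a two-step increment (read the shared counter, then write back the read value plus one), and the simulation replays every possible interleaving deterministically.

```python
from itertools import permutations


def run_schedule(schedule):
    """Replay two increment 'threads' in the given interleaving; return the final counter.

    `schedule` is a sequence of worker ids. A worker's first turn is its read
    step and its second turn is its write step.
    """
    counter = 0
    local = {}  # each worker's private copy of the value it read
    for worker in schedule:
        if worker not in local:
            local[worker] = counter       # step 1: read the shared counter
        else:
            counter = local[worker] + 1   # step 2: write back read value + 1
    return counter


def outcomes_by_schedule():
    """Map every distinct interleaving of two two-step workers to its result."""
    schedules = sorted(set(permutations([0, 0, 1, 1])))
    return {schedule: run_schedule(schedule) for schedule in schedules}


for schedule, result in outcomes_by_schedule().items():
    print(schedule, "->", result)
```

The input never changes (two increments starting from zero), yet only the schedules where one worker finishes before the other starts end at 2. The rest lose an update and end at 1. On real hardware you don’t get to pick the schedule, which is why the bug comes and goes.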

Heisenbugs are a software issue. You can’t build better hardware to solve them.

The solution is virtual-synchronous replication. When a request comes in, it is sent to multiple copies of the server, and we return the first response back to the sender.
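A rough sketch of the fan-out part, assuming each server copy can be modeled as a plain Python callable. The names first_response, healthy_replica, and crashing_replica are hypothetical. This only shows “send to every copy, take the first answer”; it does not implement the ordering and membership guarantees that virtual synchrony itself is about.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def first_response(replicas, request):
    """Send `request` to every replica and return the first successful reply."""
    errors = []
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    try:
        futures = [pool.submit(replica, request) for replica in replicas]
        for future in as_completed(futures):
            try:
                return future.result()
            except Exception as error:  # this copy hit a bug; wait for another
                errors.append(error)
    finally:
        # Don't block on slower copies once we have an answer.
        pool.shutdown(wait=False)
    raise RuntimeError(f"all {len(replicas)} replicas failed: {errors}")


def healthy_replica(request):
    """Hypothetical server copy that handles the request normally."""
    return request.upper()


def crashing_replica(request):
    """Hypothetical server copy that trips over a Heisenbug."""
    raise RuntimeError("Heisenbug struck this copy")


print(first_response([crashing_replica, healthy_replica, healthy_replica], "ping"))
```

As long as one copy dodges the bug, the sender gets its answer and never sees the failure.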