If you’re responsible for a large, legacy code base, here’s my hint to you: delete your failing tests.

Good code bases have a large suite of unit tests, probably around 80% coverage, that are run automatically by integration scripts. Getting there is step one. Step two is managing those tests, and the most important job now is to make sure they aren’t failing.
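
As a rough sketch, such an integration gate might look something like the script below. This assumes pytest and coverage.py are installed; the 80% bar is the one mentioned above, and the rest is illustrative.

    # ci_gate.py -- a minimal sketch of an integration gate, assuming
    # pytest and coverage.py; the 80% threshold matches the coverage
    # bar discussed above, everything else is illustrative.
    import subprocess
    import sys

    # Run the whole test suite under coverage measurement.
    tests = subprocess.run(["coverage", "run", "-m", "pytest"])
    if tests.returncode != 0:
        sys.exit("Tests failed: block the merge.")

    # coverage.py's --fail-under makes the report exit non-zero
    # when total coverage drops below the threshold.
    report = subprocess.run(["coverage", "report", "--fail-under=80"])
    if report.returncode != 0:
        sys.exit("Coverage below 80%: block the merge.")

    print("Tests green and coverage bar met.")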

Unfortunately, in any large code base, failing tests are everywhere.

This causes two problems: alert fatigue and wrong prioritization.

Alert Fatigue

When you see a failing test, it should mean that something just broke and should be fixed. But legacy code bases are often filled with enough failing tests that a RED ALERT email is a daily false positive. You learn to ignore them.

Learning to ignore failing tests is bad. In critical systems, this is called alert fatigue, where there are too many unimportant or non-actionable alerts, and operators learn to ignore them.

Which is trouble, because if an alert is real, we’ll miss it.

Wrong Prioritization

When a test fails, it should be fixed, right?

No.

Legacy systems have been in production for a long time, and are, to some extent, working. Also, all systems have a list of bugs against them. Why should a failing test be more important than a bug without a test?

In a well-groomed suite of tests, one failing test is straightforward to fix, and will likely point to where the bug is. But in legacy code bases with long-term failing tests, tests are hard to debug, and it’s just as likely that the issue isn’t a bug at all. If it is a real bug, but the test has been failing for years, then it’s just not high priority.

Solution

The answer is simple:

  1. Delete the failing test.
  2. Add the bug to the backlog.

This solves both problems: your integration system stops sending out ignored emails, so when you do get that “FAILING TEST” email, you know it’s due to a recent change. And then your product owner (or product manager, team lead, boss, loudest customer, whoever) can own the actual prioritization of bug fixes.
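
If you’re not sure which tests to delete in the first place, a quick script can list the current failures so each one can be removed and turned into a backlog item. Here’s a rough sketch, assuming pytest; the output parsing leans on pytest’s “FAILED path::test” summary lines and is illustrative, not robust.

    # list_failures.py -- a hypothetical helper for step 1 above.
    # Assumes pytest; relies on the "FAILED path::test - reason" lines
    # in pytest's short test summary, so treat the parsing as a sketch.
    import subprocess

    result = subprocess.run(
        ["pytest", "--tb=no", "-q"],
        capture_output=True,
        text=True,
    )

    for line in result.stdout.splitlines():
        if line.startswith("FAILED"):
            test_id = line.split()[1]
            print(f"delete {test_id}; file a backlog bug for what it covered")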