For my own reference, here are notes I took while reading Google’s book on SRE.

Site Reliability Engineering - Edited by Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Richard Murphy

Chapter 7 has great stories about the growth and setbacks of automation at Google.

Chapter 13 has stories of emergencies. My favorite involved a disk erasing system which used 0 to mean “all”.

They namedrop Perl. Jeez.

All bug fixes are submitted to the mainline first, then cherry-picked into the release branch.

Eventually autonomous processes become complex and SREs lose their mental model of the system. Simplicity is important to prevent that.

SRE is heavily based on human psychology: it recognizes that engineers fight for what they are responsible for, and that burn-out is real.

  • Burn-out is prevented by ensuring (and enforcing) that SREs spend no more than 50% of their time on “ops” pager-duty stuff.
  • Expertise is built in by forcing SREs to spend 50% of their time on “ops” pager-duty stuff.
  • In sum, SREs’ time is split between 50% project and engineering, 25% on-call, and 25% on ops non-emergencies. Each shift can expect 2 pager events and 5 daily tickets.

SRE recognizes a trade-off between new features and stability, and assigns different weights based on the product. Growing products in new markets focus on features; backbone infrastructure products focus on stability. This trade-off is defined explicitly by how many 9s you want (99%, 99.9%, 99.99%) on whichever metric you pick. The unreliability that the target still permits is your error budget. Development teams use this error budget to become self-correcting: they’ll back off of new features as their budget is spent.
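To make the arithmetic concrete for myself, here is a minimal sketch of an error budget measured in minutes of downtime. The 30-day window, the function names, and the choice of downtime as the metric are my own assumptions, not something the book prescribes.

```python
def allowed_downtime_minutes(slo_percent, period_days=30):
    """Minutes of unavailability the SLO permits over the period (assumed 30 days)."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


def budget_remaining(slo_percent, downtime_so_far_minutes, period_days=30):
    """Budget left after the downtime already incurred; negative means overspent."""
    return allowed_downtime_minutes(slo_percent, period_days) - downtime_so_far_minutes


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.99):
        print(f"{slo}% -> {allowed_downtime_minutes(slo):.1f} minutes per 30 days")

    # Hypothetical team with a 99.9% target that has already had 50 minutes of downtime.
    if budget_remaining(99.9, 50) < 0:
        print("Budget spent: hold new feature launches")
```

Over 30 days the three targets work out to roughly 432, 43.2, and 4.3 minutes. A team at 99.9% that has already lost 50 minutes is over budget, which is the signal to slow down on features.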

Interesting note: each additional 9 takes 100x the effort.

“Unlike a detective story, the lack of excitement, suspense, and puzzles is actually a desirable property of source code.” - Robert Muth

A lot of their best practices are grown and perfected, not designed.

They use text scraping over HTTP to monitor their servers, rather than receiving structured data over SNMP, because text scraping over HTTP is good enough.

Interesting note: when monitoring, it is essential to think about the difference between counts and gauges. Counts only ever go up, so they capture whatever happened between polls. Gauges report the value at the instant of the poll, so they may miss it.
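A small simulation helped me see the difference. The traffic numbers and the 5-second poll interval below are made up for illustration. The gauge is the request rate at the moment of each poll, and the count is the running total of requests.

```python
# Hypothetical requests handled in each second; the spike at t=3 and t=4 is brief.
requests_per_second = [10, 10, 10, 500, 500, 10, 10, 10, 10, 10]
POLL_INTERVAL = 5  # assumed: the monitor polls at t=0 and t=5

gauge_samples = []    # instantaneous rate seen at each poll
counter_samples = []  # cumulative request total seen at each poll

total = 0
for t, rate in enumerate(requests_per_second):
    if t % POLL_INTERVAL == 0:
        gauge_samples.append(rate)
        counter_samples.append(total)
    total += rate

# Differences between consecutive counter readings recover the traffic
# that arrived between polls, including the spike the gauge never saw.
counter_deltas = [b - a for a, b in zip(counter_samples, counter_samples[1:])]

if __name__ == "__main__":
    print("gauge samples:  ", gauge_samples)
    print("counter deltas: ", counter_deltas)
```

Both gauge samples read 10, so the spike is invisible. The single counter delta is 1030 requests over 5 seconds, far above the 50 that steady traffic would produce.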

They built their systems with easily-queryable components, simple APIs available from curl. From there, triaging becomes simpler.

What I’m learning: Google offers its internal projects two types of data stores, both distributed: low latency (quickly get a result) or high throughput (quickly process a lot of data). The former is for instant queries, and will be given to a nearby datacenter; the latter is for batch operations, and will be given to an underutilized datacenter. Projects can pick based on their needs. That’s a good example of how Google SRE works with product engineers: giving developers clear options and letting them pick.

By using words like “budget” (as in “error budget” and “per-client retry budget”) Google taps into our intuitive notions of the economy, where supply isn’t infinite. Programmers have to account for the fact that their demand exceeds supply and make decisions accordingly.
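Here is how I picture a per-client retry budget, as a rough sketch rather than Google's implementation. The class name, the method names, and the 10% ratio are my own illustrative assumptions. The idea is that a client may only retry while retries stay a small fraction of the requests it has sent.

```python
class RetryBudget:
    """Hypothetical per-client budget: retries may not exceed a fixed share of requests."""

    def __init__(self, max_retry_ratio=0.1):  # 10% is an assumed, illustrative value
        self.max_retry_ratio = max_retry_ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        """Count one original (non-retry) request."""
        self.requests += 1

    def try_spend_retry(self):
        """Return True and charge the budget if one more retry is affordable."""
        if self.requests == 0:
            return False
        if (self.retries + 1) / self.requests > self.max_retry_ratio:
            return False  # supply exhausted: fail fast instead of piling on
        self.retries += 1
        return True


if __name__ == "__main__":
    budget = RetryBudget()
    for _ in range(25):
        budget.record_request()
    # Five requests fail and each asks for a retry; only some can be granted.
    granted = [budget.try_spend_retry() for _ in range(5)]
    print(granted)
```

With 25 requests sent, two retries fit under the 10% ceiling (2/25 = 8%) and the third does not (3/25 = 12%), so the remaining failures are returned to the caller rather than retried.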

They admit where work still needs to be done in SRE culture. “Balancing these constraints to pick a good deadline can be something of an art.” (Chapter 22.)

You could spend a lifetime learning just the pieces in Chapter 23. It’s all CAP theorem and consensus protocols.

Once again, the focus on real-world lessons is such a refreshing choice. I recently read quite a bit on Agile and Scrum, and that world is filled with extremists and armchair consultants.

Advice for debugging systemic problems: think statistically, not procedurally.