Solid Snakes or: How to Take 5 Weeks of Vacation

No matter whether you run a web app, search for gravitational waves, or maintain a backup script: the reliability of your systems makes the difference between sweet dreams and production nightmares at 4am.

So far, I’ve held it at PyCon UA 2017 in Lviv, Ukraine, PyCon US 2017 in Portland, Oregon, EuroPython 2017 in Rimini, Italy, PyCon ZA 2017 in Cape Town, South Africa, PyCon Belarus 2018 in Minsk, Belarus, and DevOps Pro Moscow 2019 in Moscow, Russia.

Slides on Speaker Deck

N.B. I’ve kept something resembling a diary while writing this talk in case you’re interested in how a talk like this comes to be.

Big picture

There are two seminal books that are related to this topic:
1. Release It by Michael Nygard, which completely changed the way I think about this whole topic when I read it the first time. Of the two books, this one focuses more on the application development side. There seems to be a new edition coming up and I can’t wait!
2. Site Reliability Engineering – How Google Runs Production Systems which you can read for free now. Super interesting but large parts aren’t relevant to people working for small- to mid-sized companies. Still a lot to draw inspiration from.
Reddit nowadays has a pretty good reliability track record. And William Ting shared his experiences. The slides on their own are interesting too.
SE-Radio Episode 284: John Allspaw on System Failures. Great podcast with lots of insights.
Things I Learned Managing Site Reliability for Some of the World’s Busiest Gambling Sites
The SRE Weekly newsletter is a great source for staying in the loop on all things reliability.
A Fresh Look on Failure

Attitude

Reliability does not come for free: you have to embrace the idea of writing better software. David MacIver (the author of the fantastic Hypothesis package) has some helpful tips for you.
Testing is an important part of quality software. But it’s more subtle than “always have 100% code coverage” as Itamar Turner-Trauring explains in Why and how you should test your software.

Simplicity

How Complex Systems Fail
Simple Made Easy: an incredibly good talk about the difference between “simple” and “easy”. By the inventor of Clojure, Rich Hickey.
No Silver Bullet — Essence and Accident in Software Engineering. Solving essential complexity is what pays the bills. Don’t get lost in accidental problems.
Is it really “Complex”? Or did we just make it “Complicated”? by Alan Kay, no less.

Code

“Also the subclassing based design was a huge mistake” is probably the most-commonly uttered sentence in programming.
— Cory Benfield, Tweet

Don’t reach for those meta classes right away. Try simpler constructs first.
Call some functions with some arguments and look at the results.
— David Reid
None and its cousins have been called a “billion-dollar mistake”. And indeed, being able to avoid handling None in your code is nice, especially because it’s the default return value of any callable, so its expressiveness as a return value is a bit dubious. The Null Object Pattern – aka Active Nothing – is one way to avoid it.
- As Sandi Metz puts it: Nothing is Something
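A minimal sketch of the Null Object Pattern in Python. The class names, the in-memory store, and find_user are hypothetical and only serve the illustration: the lookup always returns an object with the same interface, so callers never have to check for None.

```python
class User:
    """A regular user with a real name."""

    def __init__(self, name):
        self.name = name

    def greeting(self):
        return f"Hello, {self.name}!"


class AnonymousUser:
    """Null object: same interface as User, but with neutral behavior."""

    name = "guest"

    def greeting(self):
        return "Hello, guest!"


# Hypothetical in-memory user store, for illustration only.
_USERS = {1: User("Sandi")}


def find_user(user_id):
    """Always return something that answers greeting(); never None."""
    return _USERS.get(user_id, AnonymousUser())


# Callers need no `if user is None` branch.
greetings = [find_user(user_id).greeting() for user_id in (1, 2)]
```

The unknown ID 2 yields an AnonymousUser instead of None, so the list comprehension works for both IDs without special-casing.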
I believe strongly in Composition over Inheritance. Subclassing saves you typing at the expense of clarity and readability. Subclassing always adds complexity through namespace confusion.
- No matter how you feel about subclassing, remember that it’s intended for specialization, not code sharing. The aforementioned Nothing is Something talk gives examples of the problems arising from abusing inheritance for code sharing.
- I cannot link this amazing talk from PyCon 2013 often enough: The End Of Object Inheritance & The Beginning Of A New Modularity. Me and my composition/delegation posse like to refer to it as “The Talk.”
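To make the composition argument concrete, here is a small sketch with made-up names (JsonSerializer, LineSerializer, ReportExporter). The exporter has a serializer instead of being one, so swapping the format needs neither a subclass nor a shared namespace.

```python
import json


class JsonSerializer:
    def dumps(self, data):
        return json.dumps(data, sort_keys=True)


class LineSerializer:
    def dumps(self, data):
        return "\n".join(f"{key}={data[key]}" for key in sorted(data))


class ReportExporter:
    """Delegates formatting to whatever serializer it is given."""

    def __init__(self, serializer):
        self._serializer = serializer

    def export(self, report):
        summary = {"title": report["title"], "rows": len(report["rows"])}
        return self._serializer.dumps(summary)


report = {"title": "Q1", "rows": [1, 2]}
as_json = ReportExporter(JsonSerializer()).export(report)
as_lines = ReportExporter(LineSerializer()).export(report)
```

Each class can be read and tested on its own; the only contract between them is a dumps() method.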
In concurrency, complexity is even more likely to sneak in. Glyph takes the time to explain why both threads and green threads are terrible primitives to use in application code in Unyielding.
SOLID stands for single responsibility, open-closed, Liskov substitution, interface segregation and dependency inversion. Not all of those principles are relevant to dynamic languages like Python but it’s good to know them all. The goal is to help produce loosely coupled, maintainable code.
- Sandi Metz has a good intro using Ruby which should be understandable to everyone using dynamic languages: SOLID Object-Oriented Design. Yes, Sandi Metz again. She’s my favorite superhero.
- I’m especially fond of Single responsibility principle (SRP).
  - Which is why I wrote attrs. Because writing small, loosely coupled classes in Python ought to be easy.
  - Some call it The One Python Library Everyone Needs. And I agree.
The less you promise about your APIs and software, the more likely you are to deliver on your promises: Deliver Your Software In An Envelope.

Operations

The Microservice Premium is a thing.
- Hence you should go Monolith First:
  [Even] experienced architects working in familiar domains have great difficulty getting boundaries right at the beginning. By building a monolith first, you can figure out what the right boundaries are, before a microservices design brushes a layer of treacle over them.
- Cookie Cutter Scaling – it’s good enough for Facebook and Etsy.
If you have doubts about the complexity of running a Kubernetes cluster, enjoy Kubernetes the Hard Way by Kelsey Hightower. If its features overlap a lot with your problems, it’s certainly worth it. If not: look for simpler alternatives like HashiCorp’s Nomad.
12 Factor App helps you to detach your application from the running environment, saving you application and operational complexity.
- 12 Fractured Apps overlaps with later topics in this talk and talks you through what plenty of people do wrong when trying to adopt 12FA.
- There is something to be said about storing secrets in env variables, though.
- environ-config gets it right for you!
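The core idea of environment-based configuration can be sketched with the standard library alone. This is not environ-config’s API; the APP_ prefix, the setting names, and load_config are assumptions made up for this example. All settings are read in one place at startup, and a missing required one fails loudly.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class AppConfig:
    db_url: str
    debug: bool
    workers: int


def load_config(environ=os.environ):
    """Read every setting in one place; fail loudly if a required one is missing."""
    try:
        db_url = environ["APP_DB_URL"]
    except KeyError:
        raise RuntimeError("APP_DB_URL is required but not set") from None
    return AppConfig(
        db_url=db_url,
        debug=environ.get("APP_DEBUG", "0") == "1",
        workers=int(environ.get("APP_WORKERS", "4")),
    )


# Passing a plain dict instead of os.environ keeps the example deterministic.
cfg = load_config({"APP_DB_URL": "postgresql://localhost/app", "APP_WORKERS": "8"})
```

Because the environment is passed in as a mapping, the rest of the application never touches os.environ directly and stays easy to test.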
Adding more components to your infrastructure makes it more complex and fragile. Sometimes it also turns out that good old PostgreSQL can be better and faster than the cool NoSQL fad du jour.
The more complex your infra becomes, the more relevant they become: Fallacies of distributed computing.

No such thing as human errors

Processes that depend on humans being perfect are stupid processes that are doomed to fail.
— Ceej Silverio, Tweet

To learn more about failures, it’s great to read good postmortems like the ones Dan Luu is collecting. You’ll notice that many of the things could happen to you too.
- Recent fascinating ones are the ones from GitLab’s database oops…
- …and S3’s extended downtime. The most important quote is:
  While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future.
This is how you handle such incidents: You fix systems that failed their operators. You don’t blame overwhelmed people.
Brittle Failure, Blame-and-Train, and More Productive Reactions to Failure. A discussion of systemic brittleness and “blame-and-train” in light of a post by a junior dev who killed a production database while setting up their first dev environment. Spoilers: it’s not the junior dev’s fault if this happens.

Data validation / time bombs

PSA: If a function accepts a string then it’s a parser. Parsers are hard to get right and dangerous to get wrong. Write fewer of them.
— David R. MacIver, Tweet

I fundamentally disagree with Postel’s law. I believe it’s the reason why so many things on the Internet are broken beyond repair. Please be conservative in what you receive too.
To be fair, he had no chance to anticipate what became of it.
- I’m not the only one of this opinion.
- I love the RFC1122Error in Python’s HTTP/2 implementation.
I’m a huge fan of voluptuous for data validation in Python. It’s an old and stable project that’s still receiving regular releases and allows even for data normalization. I also haven’t found a library with equally amazing programmatic introspection into validation errors. Yes, the name is horrible.
hasattr() is terrible on Python 2 and suboptimal on Python 3. Avoid it.
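One way hasattr() still bites on Python 3: it runs properties and treats any AttributeError raised inside them as “attribute missing”. The sketch below uses a made-up Connection class with a deliberate typo. inspect.getattr_static() from the standard library is shown as one possible alternative that looks a name up without executing it; it is not necessarily the replacement the linked article recommends.

```python
import inspect


class Connection:
    @property
    def peer(self):
        # Bug: the misspelled attribute raises AttributeError inside the property.
        return self._sokcet


conn = Connection()

# hasattr() executes the property and turns the bug's AttributeError into False,
# so a broken attribute looks like a missing one.
hidden_bug = hasattr(conn, "peer")

# getattr_static() finds the property without running it.
_MISSING = object()
really_defined = inspect.getattr_static(conn, "peer", _MISSING) is not _MISSING

# Plain attribute access surfaces the actual bug.
try:
    conn.peer
except AttributeError as exc:
    error = str(exc)
```

Here hasattr() claims that peer does not exist although the class defines it, while direct access reports the misspelled _sokcet.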
Putting Python packages into the root project directory masks packaging errors. Use an un-importable src directory to prevent that. If anyone suggests omitting src, ask them why. So far, in 100% of all cases the reason is “looks like Java.”
Making your Python software less robust because a directory name reminds you of another programming language doesn’t seem like a rational decision to me. You may be surprised, but not everything is bad about Java.
Using application containers like Docker with a fully-featured distribution base image and its packaged Python will get you into trouble unless you use a virtual environment.
Bare except clauses mask memory errors, sys.exit(), Ctrl-C, and more. Avoid them, and if you really need to catch everything, explicitly catch BaseException.
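The difference between the three spellings can be shown in a few lines. The function names are hypothetical, and re-raising after cleanup in the BaseException variant is an assumption about typical usage rather than something prescribed above.

```python
def quit_task():
    # Stand-in for code that asks the interpreter to shut down.
    raise SystemExit(1)


def run_bare(task):
    try:
        task()
    except:  # noqa: E722 (deliberately bad: also catches SystemExit and Ctrl-C)
        return "swallowed"
    return "ok"


def run_narrow(task):
    try:
        task()
    except Exception:  # SystemExit and KeyboardInterrupt pass through
        return "handled"
    return "ok"


def run_with_cleanup(task, cleanup):
    try:
        task()
    except BaseException:  # explicit catch-all: clean up, then re-raise
        cleanup()
        raise


bare_result = run_bare(quit_task)  # the exit request is silently lost

try:
    run_narrow(quit_task)
except SystemExit:
    narrow_result = "propagated"

cleaned = []
try:
    run_with_cleanup(quit_task, lambda: cleaned.append("done"))
except SystemExit:
    pass
```

SystemExit and KeyboardInterrupt derive from BaseException but not from Exception, which is why `except Exception` lets them through while a bare except does not.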
The burrito I’m eating in the picture on the slide is from Taqueria Castillo Mason and it’s delicious.

Failure is inevitable

Apollo 11 almost failed its lunar descent due to an error in a checklist. Margaret Hamilton’s software was smart enough to handle the resulting problems. In other words: it expected failures and dealt with them.
She also kind of invented software engineering while writing it. My other favorite superhero.
The best way we’ve found to prevent disaster is to actively engage with it.
Designs, Lessons and Advice from Building Large Distributed Systems – slide 10 (“The Joys of Real Hardware”) is pure gold.
Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications – TL;DR: network partitions are rare but real.
Software Systems Will Fail.

Preparations

Monitoring

I spoke about monitoring and instrumentation in the past two years and I’d love to invite you to check those talks out:
- Beyond grep talks about error reporting, centralized logging and metrics in general.
- Get Instrumented is about instrumenting your systems and software using Prometheus including alerting.
Take That Vacation: Eliminate Alerts Dragging You Back to the Office.
- Actionable Alerts

Timeouts

Infinity Is a Bad Timeout
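A small sketch of putting an explicit deadline on a call that might hang, using asyncio.wait_for from the standard library. slow_backend and the fallback value are hypothetical stand-ins for a real network call and a real degraded response.

```python
import asyncio


async def slow_backend():
    # Stand-in for a network call that hangs far longer than acceptable.
    await asyncio.sleep(10)
    return "response"


async def fetch_with_deadline(timeout):
    try:
        return await asyncio.wait_for(slow_backend(), timeout=timeout)
    except asyncio.TimeoutError:
        # The pending call is cancelled; answer with a degraded result instead.
        return "fallback"


result = asyncio.run(fetch_with_deadline(0.05))
```

Without the timeout argument the caller would wait the full ten seconds, or forever if the backend never answered.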

Redundancy

HAProxy is a great way to improve the reliability and uptime of your applications. Tools like this also allow for zero-downtime deployments. HAProxy is good software.
You need multiple backups, you need offsite backups, and you need to test your backups regularly. Otherwise you don’t have backups. Ask GitLab.
While important, redundancy isn’t a silver bullet either: Redundancy does not imply fault tolerance: analysis of distributed storage reactions to single errors and corruptions.

Documentation

Pilots won’t even start the plane without a checklist. Be like a pilot and have a checklist for everything important and every failure scenario. You will not remember everything important if shit hits the fan and being a knowledge silo means phone calls when you’d rather lie on the beach.
- The Checklist Manifesto is an excellent book that puts numbers on my claims.
- Please note that a checklist doesn’t just consist of a sequence of “type xyz” into a terminal. It can also be things like: “Tell Sharon about this incident.” or “Update status page.”
  - Don’t host your status page on the same domain as the services it’s reporting status on. Ideally use a different hoster and DNS provider. Otherwise it goes down together with your hosting/DNS and is quite useless.
- Obviously don’t write checklists that could be executed by a computer. They are called programs. Write a Python program instead.

Dealing with it

Not making it worse

AWS Architecture Blog on Exponential Backoff And Jitter. Always use a backoff unless you want to DoS yourself/get on blacklists.
- Try my stamina package for easy-to-use exponential backoff and jitter in Python.
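A standard-library sketch of exponential backoff with “full jitter”, one of the variants discussed in the AWS article. This is not stamina’s API: retry_with_backoff, its parameters, and the defaults are assumptions for illustration. The sleep and rng hooks exist only so the example runs instantly and deterministically.

```python
import random
import time


def retry_with_backoff(
    func,
    *,
    attempts=5,
    base=0.1,
    cap=5.0,
    retry_on=(ConnectionError,),
    sleep=time.sleep,
    rng=random.random,
):
    """Call func, retrying on the given errors with capped exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            # Full jitter: wait a random time between 0 and the capped delay.
            sleep(rng() * min(cap, base * 2**attempt))


calls = {"n": 0}


def flaky():
    # Hypothetical backend that fails twice before recovering.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("backend unavailable")
    return "ok"


waits = []
result = retry_with_backoff(flaky, sleep=waits.append, rng=lambda: 1.0)
```

With rng pinned to 1.0 the recorded waits are the upper bounds of 0.1 and 0.2 seconds; with the default random.random each wait falls somewhere below its bound, which spreads out clients that failed at the same moment.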
Queues are often considered the silver bullet for overload and other ailments. But alas, Queues Don’t Fix Overload.
- Unbounded queues are fundamentally broken. The command you’re looking for is celery purge.
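A bounded queue makes overload visible at the producer instead of letting work pile up without limit. A minimal sketch with queue.Queue; the submit helper and the limit of 2 are made up for the example.

```python
import queue

# At most two pending jobs; anything beyond that is rejected immediately.
jobs = queue.Queue(maxsize=2)


def submit(job):
    """Return True if the job was accepted, False if it was shed."""
    try:
        jobs.put_nowait(job)
    except queue.Full:
        return False
    return True


accepted = [submit(n) for n in range(4)]
```

The third and fourth jobs are turned away, which gives the caller the chance to back off or report an error rather than silently growing a backlog.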
Circuit Breakers help you to take load off erring systems.
- Protect your users with Circuit Breakers is a Python-specific talk on that topic.
- pybreaker is a Python implementation of the idea.
- Make sure you configure them correctly.
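To show the mechanism, here is a deliberately small circuit breaker. It is not pybreaker’s API; the class, its parameters, and the injectable clock are assumptions for illustration. After max_failures consecutive errors it rejects calls without touching the backend, and once reset_timeout has passed it lets a single trial call through.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the backend."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self._max_failures = max_failures
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                raise CircuitOpenError("circuit open; backend not called")
            # Timeout elapsed: half-open, allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._max_failures:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        self._failures = 0
        self._opened_at = None
        return result


def broken():
    raise ConnectionError("backend down")


now = [0.0]  # fake clock so the example needs no real waiting
breaker = CircuitBreaker(max_failures=2, reset_timeout=10.0, clock=lambda: now[0])

outcomes = []
for _ in range(3):
    try:
        breaker.call(broken)
    except CircuitOpenError:
        outcomes.append("rejected")
    except ConnectionError:
        outcomes.append("failed")

now[0] = 11.0  # the reset timeout has passed
outcomes.append(breaker.call(lambda: "ok"))
```

The third call never reaches the failing backend, and the successful trial call after the timeout closes the circuit again.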
Scaling your API with rate limiters talks about how to make your web APIs more robust in the face of abuse, traffic spikes, and user errors.
- It’s usually easier to get started with facilities provided by e.g. nginx than implementing it inside of your applications.
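For intuition about what such a limiter does, here is a token bucket sketch. It is an illustration only: the class and its parameters are made up, and in production the facilities of a proxy like nginx are usually the easier route, as noted above.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self._rate = rate
        self._capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._updated = clock()

    def allow(self):
        now = self._clock()
        elapsed = now - self._updated
        self._updated = now
        # Refill for the time that has passed, never above capacity.
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


now = [0.0]  # fake clock so the example is deterministic
bucket = TokenBucket(rate=1, capacity=2, clock=lambda: now[0])

burst = [bucket.allow() for _ in range(3)]
now[0] = 1.0  # one second later, one token has been refilled
later = bucket.allow()
```

The first two requests use up the burst capacity, the third is rejected, and a request one second later is allowed again.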

Crash fast & loudly

The faster you crash, the faster your service is back up. If you didn’t anticipate the problem, report it first (in other words: don’t tell Sentry about a database restart).
Redis takes the “loud” to a different level. They actually run a quick memory test on crashes so they have better chances to understand what went wrong.
MTTR is more important than MTBF (for most types of F)
Crash-Only Software is a highly readable paper making the point that you should focus on fast recovery instead of adding complexity to heal yourself.
- Microreboot – A Technique for Cheap Recovery (by the same author) is one of the techniques described; however, I haven’t found a practical way to do that in Python.

Credits

“Lifting a Dreamer” aka “Fail Whale” art by @YiyingLu. Used with explicit permission of the artist.
Sharp
Land’s End
Ravioli
Touched by His Noodly Appendage
Circuit Breakers
Juggler
Apollo 11
Margaret Hamilton
Complicated mechanism at work
The Abyss of Hell
Microservices graph is by Andrew Godwin, used with his kind permission.
Some licensed icons from Symbolicons
Picture of burned out server rack is not retraceable anymore and has been used all over the internet.