No matter whether you run a web app, search for gravitational waves, or maintain a backup script: the reliability of your systems makes the difference between sweet dreams and production nightmares at 4am.
So far, I’ve held it at PyCon UA 2017 in Lviv, Ukraine, PyCon US 2017 in Portland, Oregon, EuroPython 2017 in Rimini, Italy, PyCon ZA 2017 in Cape Town, South Africa, PyCon Belarus 2018 in Minsk, Belarus, and DevOps Pro Moscow 2019 in Moscow, Russia.
N.B. I’ve kept something resembling a diary while writing this talk, in case you’re interested in how a talk like this comes to be.
- There are two seminal books that are related to this topic:
- Release It by Michael Nygard, which completely changed the way I think about this whole topic when I read it the first time. Of the two books, this one focuses more on the application development side. There seems to be a new edition coming up and I can’t wait!
- “Site Reliability Engineering – How Google Runs Production Systems” which you can read for free now. Super interesting but large parts aren’t relevant to people working for small- to mid-sized companies. Still a lot to draw inspiration from.
- Reddit nowadays has a pretty good reliability track record, and William Ting shared his experiences. The slides on their own are interesting too.
- SE-Radio Episode 284: John Allspaw on System Failures. Great podcast with lots of insights.
- Things I Learned Managing Site Reliability for Some of the World’s Busiest Gambling Sites
- The SRE Weekly newsletter is a great source for staying in the loop on all things reliability.
- A Fresh Look on Failure
- Reliability does not come for free: you have to embrace the idea of writing better software. David MacIver (the author of the fantastic Hypothesis package) has some helpful tips for you.
- Testing is an important part of quality software. But it’s more subtle than “always have 100% code coverage,” as Itamar Turner-Trauring explains in Why and how you should test your software.
- How Complex Systems Fail
- Simple Made Easy: an incredibly good talk about the difference between “simple” and “easy” by the inventor of Clojure, Rich Hickey.
- No Silver Bullet — Essence and Accident in Software Engineering. Solving essential complexity is what pays the bills. Don’t get lost in accidental problems.
- Is it really “Complex”? Or did we just make it “Complicated”? by Alan Kay no less.
“Also the subclassing based design was a huge mistake” is probably the most-commonly uttered sentence in programming.— Cory Benfield (@Lukasaoz) April 6, 2017
Don’t reach for those metaclasses right away. Try simpler constructs first.
Call some functions with some arguments and look at the results.
None and its cousins have been called a “billion-dollar mistake.” And indeed, being able to avoid handling None in your code is nice. Especially because it’s the default return value for any callable, so its expressiveness as a return value is a bit dubious. The Null Object Pattern – aka Active Nothing – is one way to avoid it.
- As Sandi Metz puts it: Nothing is Something
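To make the Null Object pattern concrete, here is a minimal sketch. The names `NullLogger`, `ListLogger`, and `process` are hypothetical and only serve the illustration; the point is that the caller always receives an object with the expected interface, so nothing has to check for None.

```python
class NullLogger:
    """Null object: offers the logger interface but does nothing."""

    def info(self, message):
        pass


class ListLogger:
    """A trivial real implementation that records messages in a list."""

    def __init__(self):
        self.messages = []

    def info(self, message):
        self.messages.append(message)


# The null object is stateless, so one shared instance can act as the default.
_NULL_LOGGER = NullLogger()


def process(items, logger=_NULL_LOGGER):
    """Sum the items and report progress to whatever logger was passed in."""
    total = 0
    for item in items:
        # No "if logger is not None" branch: every logger answers info().
        logger.info(f"processing {item}")
        total += item
    return total
```

Calling `process([1, 2, 3])` works without any logger, and passing a `ListLogger()` changes the behavior without adding a branch to `process`.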
I believe strongly in Composition over Inheritance. Subclassing saves you typing at the expense of clarity and readability. Subclassing always adds complexity through namespace confusion.
- No matter how you feel about subclassing, remember that it’s intended for specialization, not code sharing. The aforementioned Nothing is Something talk gives examples of the problems arising from abusing inheritance for code sharing.
- I cannot link this amazing talk from PyCon 2013 often enough: The End Of Object Inheritance & The Beginning Of A New Modularity. Me and my composition/delegation posse like to refer to it as “The Talk.”
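As a small illustration of composition over inheritance, the sketch below has a repository that uses a storage object instead of subclassing it. `Storage` and `UserRepository` are hypothetical names invented for this example, not taken from any of the linked talks.

```python
class Storage:
    """Hypothetical collaborator that knows how to persist values."""

    def __init__(self):
        self._data = {}

    def save(self, key, value):
        self._data[key] = value

    def load(self, key):
        return self._data[key]


class UserRepository:
    """Holds a Storage instead of inheriting from it."""

    def __init__(self, storage):
        self._storage = storage

    def add(self, user_id, name):
        self._storage.save(f"user:{user_id}", name)

    def get(self, user_id):
        return self._storage.load(f"user:{user_id}")


repo = UserRepository(Storage())
repo.add(1, "alice")
```

Because `UserRepository` does not inherit from `Storage`, its public namespace contains only `add` and `get`. The storage methods cannot leak into it, and the storage can be replaced by anything that offers `save` and `load`.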
In concurrency, complexity is even more likely to sneak in. Glyph takes the time to explain why both threads and green threads are terrible primitives to use in application code in Unyielding.
SOLID stands for single responsibility, open-closed, Liskov substitution, interface segregation and dependency inversion. Not all of those principles are relevant to dynamic languages like Python but it’s good to know them all. The goal is to help to produce loosely coupled, maintainable code.
- Sandi Metz has a good intro using Ruby which should be understandable to everyone using dynamic languages: SOLID Object-Oriented Design. Yes, Sandi Metz again. She’s my favorite superhero.
- I’m especially fond of Single responsibility principle (SRP).
The less you promise about your APIs and software, the more likely you are to deliver on your promises: Deliver Your Software In An Envelope.
- The Microservice Premium is a thing.
Hence you should go Monolith First:
[Even] experienced architects working in familiar domains have great difficulty getting boundaries right at the beginning. By building a monolith first, you can figure out what the right boundaries are, before a microservices design brushes a layer of treacle over them.
Cookie Cutter Scaling – it’s good enough for Facebook and Etsy.
- If you have doubts about the complexity of running a Kubernetes cluster, enjoy Kubernetes the Hard Way by Kelsey Hightower. If its features overlap a lot with your problems, it’s certainly worth it. If not: look for simpler alternatives like HashiCorp’s Nomad.
- 12 Factor App helps you to detach your application from the running environment, saving you application and operational complexity.
- Adding more components to your infrastructure makes it more complex and fragile. Sometimes it also turns out that good old PostgreSQL can be better and faster than the cool NoSQL fad du jour.
- The more complex your infra becomes, the more relevant they become: Fallacies of distributed computing.
No Such Thing as Human Errors
Processes that depend on humans being perfect are stupid processes that are doomed to fail.— Ceej has shot 1 of 2. (@ceejbot) July 18, 2017
To learn more about failures, it’s great to read good postmortems like the ones Dan Luu is collecting. You’ll notice that many of the things could happen to you too.
Recent fascinating ones are the ones from GitLab’s database oops…
…and S3’s extended downtime. The most important quote is:
While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future.
This is how you handle such incidents: You fix systems that failed their operators. You don’t blame overwhelmed people.
Brittle Failure, Blame-and-Train, and More Productive Reactions to Failure. A discussion of systemic brittleness and “blame-and-train” in light of a post by a junior dev who killed a production database while setting up their first dev environment. Spoilers: it’s not the junior dev’s fault if this happens.
Data Validation / Time Bombs
PSA: If a function accepts a string then it's a parser. Parsers are hard to get right and dangerous to get wrong. Write fewer of them.— David R. MacIver (@DRMacIver) May 11, 2017
I fundamentally disagree with Postel’s law. I believe it’s the reason so many things on the Internet are broken beyond repair. Please be conservative in what you receive too.
To be fair, he had no chance to anticipate what became of it.
I’m a huge fan of voluptuous for data validation in Python. It’s an old and stable project that’s still receiving regular releases and allows even for data normalization. I also haven’t found a library with equally amazing programmatic introspection into validation errors. Yes, the name is horrible.
hasattr() is terrible on Python 2 and suboptimal on Python 3. Avoid it.
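One way hasattr() can bite on Python 3 is that it runs property code and reports any AttributeError raised inside it as “attribute missing.” The sketch below demonstrates this with a hypothetical `Config` class containing a deliberate typo. The two alternatives shown (plain attribute access and `inspect.getattr_static`) are options I picked for illustration, not necessarily the ones the talk recommends.

```python
import inspect


class Config:
    def __init__(self, raw):
        self._raw = raw

    @property
    def timeout(self):
        # Deliberate bug: "_rwa" is a typo, so this raises AttributeError
        # from inside the property.
        return self._rwa["timeout"]


config = Config({"timeout": 5})

# hasattr() executes the property and converts the inner AttributeError
# into False, so the typo is reported as "no such attribute".
looks_missing = not hasattr(config, "timeout")

# Accessing the attribute directly lets the real error surface.
try:
    config.timeout
except AttributeError as exc:
    error_message = str(exc)

# getattr_static() looks the name up without running the property code.
declared = inspect.getattr_static(config, "timeout", None) is not None
```

Here `looks_missing` ends up true even though the class clearly declares `timeout`, while the direct access produces an error message that names the misspelled `_rwa`.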
Putting Python packages into the root project directory masks packaging errors. Use an un-importable src directory to prevent that. If anyone suggests omitting src, ask them why. So far, in 100% of all cases the reason is “looks like Java.”
Making your Python software less robust because a directory name reminds you of another programming language doesn’t seem like a rational decision to me. You may be surprised, but not everything is bad about Java.
Using application containers like Docker with a fully-featured distribution base image and its packaged Python will get you into trouble unless you use a virtual environment.
Bare except clauses mask memory errors, Ctrl-C, and more. Avoid doing that and if you need to, catch explicitly
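A minimal sketch of what a bare except clause swallows. Using `except Exception` as the explicit alternative is my assumption for the sake of the example. Note that it still catches MemoryError, so it only addresses the Ctrl-C and interpreter-exit part of the problem. All function names are hypothetical.

```python
def swallow_everything(func):
    try:
        func()
    except:  # bare except: also catches KeyboardInterrupt and SystemExit
        return "swallowed"
    return "ok"


def swallow_errors_only(func):
    try:
        func()
    except Exception:  # KeyboardInterrupt and SystemExit propagate
        return "swallowed"
    return "ok"


def press_ctrl_c():
    # Simulates what Python raises when the user hits Ctrl-C.
    raise KeyboardInterrupt


def broken():
    raise ValueError("ordinary bug")
```

With the bare clause, `swallow_everything(press_ctrl_c)` quietly returns `"swallowed"` and the program keeps running. With the explicit class, the KeyboardInterrupt travels up to the caller, while ordinary errors such as the ValueError from `broken` are still handled.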
The burrito I’m eating in the picture on the slide is from Taqueria Castillo Mason and it’s delicious.
Failure Is Inevitable
Apollo 11 almost failed its lunar descent due to an error in a checklist. Margaret Hamilton’s software was smart enough to handle the resulting problems. In other words: it expected failures and dealt with them.
She also kind of invented software engineering while writing it. My other favorite superhero.
Designs, Lessons and Advice from Building Large Distributed Systems – slide 10 (“The Joys of Real Hardware”) is pure gold.
Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications – TL;DR: network partitions are rare but real.
- I spoke about monitoring and instrumentation in the past two years and I’d love to invite you to check those talks out:
- Take That Vacation: Eliminate Alerts Dragging You Back to the Office.
- HAProxy is a great way to improve the reliability and uptime of your applications. Tools like this also allow for zero-downtime deployments. HAProxy is good software.
- You need multiple backups, you need offsite backups, and you need to test your backups regularly. Otherwise you don’t have backups. Ask GitLab.
- While important, redundancy isn’t a silver bullet either: Redundancy does not imply fault tolerance: analysis of distributed storage reactions to single errors and corruptions.
- Pilots won’t even start the plane without a checklist. Be like a pilot and have a checklist for everything important and every failure scenario. You will not remember everything important if shit hits the fan and being a knowledge silo means phone calls when you’d rather lie on the beach.
- The Checklist Manifesto is an excellent book that puts numbers on my claims.
- Please note that a checklist doesn’t just consist of a sequence of “type xyz into a terminal.” It can also be things like “Tell Sharon about this incident.” or “Update status page.”
- Don’t host your status page on the same domain as the services it’s reporting status on. Ideally use a different hoster and DNS provider. Otherwise it goes down together with your hosting/DNS and is quite useless.
- Obviously don’t write checklists that could be executed by a computer. They are called programs. Write a Python program instead.
Dealing With It
Not Making Problems Worse
- AWS Architecture Blog on Exponential Backoff And Jitter. Always use a backoff unless you want to DoS yourself/get on blacklists.
- The backoff Python package gives you an easy-to-use decorator for retrying using backoff.
- Queues are often considered the silver bullet for overload and other ailments. But alas, Queues Don’t Fix Overload.
- Unbounded queues are fundamentally broken. The command you’re looking for is
- Circuit Breakers help you to take load off erring systems.
- Scaling your API with rate limiters talks about how to make your web APIs more robust in the face of abuse, traffic spikes, and user errors.
- It’s usually easier to get started with facilities provided by e.g. nginx than to implement rate limiting inside of your applications.
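The exponential backoff with jitter mentioned at the top of this list can be sketched using only the standard library. This is not the API of the backoff package. `retry_with_backoff` and its parameters are hypothetical, and the delay follows the “full jitter” variant: a random value between zero and a capped, exponentially growing bound.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base=0.1, cap=5.0, sleep=time.sleep):
    """Call func until it succeeds, waiting a random, growing delay between tries."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:  # real code should catch only retryable errors
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the failure
            # Full jitter: random delay up to the capped exponential bound.
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            sleep(delay)
```

The `sleep` parameter exists only so the waiting can be replaced in tests. The randomness matters because it keeps many clients that failed at the same moment from retrying in lockstep.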
Crash Fast & Loudly
- The faster you crash, the faster your service is back up. If you didn’t anticipate the problem, report it first (in other words: don’t tell Sentry about a database restart).
- Redis takes the “loud” to a different level. They actually run a quick memory test on crashes so they have better chances to understand what went wrong.
- MTTR is more important than MTBF (for most types of F)
- Crash-Only Software is a highly readable paper making the point that you should focus on fast recovery instead of adding complexity to heal yourself.
- Microreboot – A Technique for Cheap Recovery (by the same author) is one of the techniques described; however, I haven’t found a practical way to do that in Python.
- Every photo not mentioned below is private with all rights reserved by Hynek Schlawack.
- “Lifting a Dreamer” aka “Fail Whale” art by @YiyingLu. Used with explicit permission by the artist.
- Land’s End
- Touched by His Noodly Appendage
- Circuit Breakers
- Apollo 11
- Margaret Hamilton
- Complicated mechanism at work
- The Abyss of Hell
- Microservices graph is by Andrew Godwin, used with his kind permission.
- Some licensed icons from Symbolicons
- Picture of burned out server rack is not retraceable anymore and has been used all over the internet.