No matter whether you run a web app, search for gravitational waves, or maintain a backup script: the reliability of your systems makes the difference between sweet dreams and production nightmares at 4am.

So far, I’ve held it at PyCon UA 2017 in Lviv, Ukraine, PyCon US 2017 in Portland, Oregon, EuroPython 2017 in Rimini, Italy, PyCon ZA 2017 in Cape Town, South Africa, PyCon Belarus 2018 in Minsk, Belarus, and DevOps Pro Moscow 2019 in Moscow, Russia.

Slides on Speaker Deck

Solid Snakes or How to Take 5 Weeks of Vacation PyCon 2017

N.B. I’ve kept something resembling a diary while writing this talk in case you’re interested in how a talk like this comes to be.

Big picture

Attitude

Simplicity

Code

“Also the subclassing based design was a huge mistake” is probably the most-commonly uttered sentence in programming.

Cory Benfield, Tweet
  • Don’t reach for those meta classes right away. Try simpler constructs first.

    Call some functions with some arguments and look at the results.

    David Reid
  • None and its cousins have been called a “billion-dollar mistake”. And indeed, being able to avoid handling None in your code is nice. Especially because it’s the default return value for any callable, so its expressiveness as a return value is a bit dubious. The Null Object pattern – aka Active Nothing – is one way to avoid it.

  • I believe strongly in Composition over Inheritance. Subclassing saves you typing at the expense of clarity and readability. Subclassing always adds complexity through namespace confusion.

    • No matter how you feel about subclassing, remember that it’s intended for specialization, not code sharing. The aforementioned Nothing is Something talk gives examples of the problems arising from abusing inheritance for code sharing.
    • I cannot link this amazing talk from PyCon 2013 often enough: The End Of Object Inheritance & The Beginning Of A New Modularity. Me and my composition/delegation posse like to refer to it as “The Talk.”
  • In concurrency, complexity is even more likely to sneak in. Glyph takes the time to explain why both threads and green threads are terrible primitives to use in application code in Unyielding.

  • SOLID stands for single responsibility, open-closed, Liskov substitution, interface segregation and dependency inversion. Not all of those principles are relevant to dynamic languages like Python but it’s good to know them all. The goal is to help to produce loosely coupled, maintainable code.

  • The less you promise about your APIs and software, the more likely you are to deliver on your promises: Deliver Your Software In An Envelope.
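
As a small illustration of the Null Object pattern and of composition mentioned above, here is a minimal sketch. All class names (NullLogger, ListLogger, Backup) are hypothetical and made up for this example. The idea is to replace None once, at the boundary, so the rest of the code never has to check for it.

```python
class NullLogger:
    """Hypothetical 'active nothing': same interface as a real logger, does nothing."""

    def info(self, message):
        pass


class ListLogger:
    """Hypothetical real collaborator that records messages in memory."""

    def __init__(self):
        self.messages = []

    def info(self, message):
        self.messages.append(message)


class Backup:
    """Composes a logger instead of inheriting from one."""

    def __init__(self, logger=None):
        # Swap None for a null object once, so no method needs a None check.
        self._logger = logger if logger is not None else NullLogger()

    def run(self, paths):
        for path in paths:
            self._logger.info(f"backing up {path}")
        return len(paths)
```

Backup() and Backup(ListLogger()) behave the same from the caller's point of view; only the collaborator differs.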

Operations

  • The Microservice Premium is a thing.
    • Hence you should go Monolith First:

      [Even] experienced architects working in familiar domains have great difficulty getting boundaries right at the beginning. By building a monolith first, you can figure out what the right boundaries are, before a microservices design brushes a layer of treacle over them.

    • Cookie Cutter Scaling – it’s good enough for Facebook and Etsy.

  • If you have doubts about the complexity of running a Kubernetes cluster, enjoy Kubernetes the Hard Way by Kelsey Hightower. If its features overlap a lot with your problems, it’s certainly worth it. If not: look for simpler alternatives like HashiCorp’s Nomad.
  • 12 Factor App helps you to detach your application from the running environment, saving you application and operational complexity.
  • Adding more components to your infrastructure makes it more complex and fragile. Sometimes it also turns out that good old PostgreSQL can be better and faster than the cool NoSQL fad du jour.
  • The more complex your infra becomes, the more relevant the Fallacies of distributed computing become.

No such thing as human errors

Processes that depend on humans being perfect are stupid processes that are doomed to fail.

Ceej Silverio, Tweet
  • To learn more about failures, it’s great to read good postmortems like the ones Dan Luu is collecting. You’ll notice that many of the things could happen to you too.

    • Recent fascinating examples are the ones from GitLab’s database oops…

    • …and S3’s extended downtime. The most important quote is:

      While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future.

    This is how you handle such incidents: You fix systems that failed their operators. You don’t blame overwhelmed people.

  • Brittle Failure, Blame-and-Train, and More Productive Reactions to Failure. A discussion of systemic brittleness and “blame-and-train” in light of a post by a junior dev who killed a production database while setting up their first dev environment. Spoilers: it’s not the junior dev’s fault if this happens.

Data validation / time bombs

PSA: If a function accepts a string then it’s a parser. Parsers are hard to get right and dangerous to get wrong. Write fewer of them.

David R. MacIver, Tweet
  • I fundamentally disagree with Postel’s law. I believe it’s the reason why so many things on the Internet are broken beyond repair. Please be conservative in what you receive too.

    To be fair, he had no chance to anticipate what became of it.

  • I’m a huge fan of voluptuous for data validation in Python. It’s an old and stable project that’s still receiving regular releases and allows even for data normalization. I also haven’t found a library with equally amazing programmatic introspection into validation errors. Yes, the name is horrible.

  • hasattr() is terrible on Python 2 and suboptimal on Python 3. Avoid it.

  • Putting Python packages into the root project directory masks packaging errors. Use an un-importable src directory to prevent that. If anyone suggests omitting src, ask them why. So far, in 100% of all cases the reason is “looks like Java.”

    Making your Python software less robust because a directory name reminds you of another programming language doesn’t seem like a rational decision to me. You may be surprised, but not everything is bad about Java.

  • Using application containers like Docker with a fully-featured distribution base image and its packaged Python will get you into trouble unless you use a virtual environment.

  • Bare except clauses mask memory errors, sys.exit(), Ctrl-C, and more. Avoid them, and if you really need to catch everything, explicitly catch BaseException.

  • The burrito I’m eating in the picture on the slide is from Taqueria Castillo Mason and it’s delicious.
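
To make the point about bare except clauses concrete, here is a minimal sketch. The function names (run_guarded, run_with_cleanup) are hypothetical. It relies on one fact about Python's exception hierarchy: SystemExit and KeyboardInterrupt derive from BaseException, not from Exception.

```python
def run_guarded(task):
    """Run task(); report ordinary failures, never swallow exit requests."""
    try:
        return ("ok", task())
    except Exception as exc:
        # SystemExit and KeyboardInterrupt are not subclasses of Exception,
        # so sys.exit() and Ctrl-C still propagate to the caller.
        return ("failed", repr(exc))


def run_with_cleanup(task, cleanup):
    """Catch everything on purpose, clean up, then re-raise."""
    try:
        return task()
    except BaseException:
        # Spelling out BaseException tells readers that catching
        # everything is intended; the re-raise keeps the exit intact.
        cleanup()
        raise
```

Note that except Exception still catches MemoryError, so in real code it is worth narrowing the clause further to the exceptions you actually expect.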

Failure is inevitable

Preparations

Monitoring

Timeouts

Redundancy

Documentation

  • Pilots won’t even start the plane without a checklist. Be like a pilot and have a checklist for everything important and every failure scenario. You will not remember everything important if shit hits the fan and being a knowledge silo means phone calls when you’d rather lie on the beach.
    • The Checklist Manifesto is an excellent book that puts numbers on my claims.
    • Please note that a checklist doesn’t just consist of a sequence of “type xyz” into a terminal. It can also be things like: “Tell Sharon about this incident.” or “Update status page.”
      • Don’t host your status page on the same domain as the services it’s reporting status on. Ideally use a different hoster and DNS provider. Otherwise it goes down together with your hosting/DNS and is quite useless.
    • Obviously don’t write checklists that could be executed by a computer. They are called programs. Write a Python program instead.

Dealing with it

Not making it worse

Crash fast & loudly

  • The faster you crash, the faster your service is back up. If you didn’t anticipate the problem, report it first (in other words: don’t tell Sentry about a database restart).
  • Redis takes the “loud” to a different level. They actually run a quick memory test on crashes so they have better chances to understand what went wrong.
  • MTTR is more important than MTBF (for most types of F)
  • Crash-Only Software is a highly readable paper making the point that you should focus on fast recovery instead of adding complexity to heal yourself.
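
One way to read “crash fast & loudly” in code is sketched below. Everything here is an assumption for illustration: AnticipatedError is a hypothetical marker for failures you expect (like a database restart), and report is a hypothetical callable standing in for whatever error tracker you use. It also assumes a supervisor that restarts the process.

```python
class AnticipatedError(Exception):
    """Hypothetical marker for failures we expect, e.g. a database restart."""


def main(work, report):
    """Run work(); exit quickly on expected failures, crash loudly otherwise."""
    try:
        work()
    except AnticipatedError:
        # Expected problem: no report, just a non-zero exit code so the
        # (assumed) supervisor restarts the service.
        return 1
    except Exception as exc:
        # Unanticipated problem: report it first, then crash by re-raising
        # instead of trying to heal in-process.
        report(exc)
        raise
    return 0
```

The return value is meant to be passed to sys.exit() by the caller; the re-raise leaves the traceback intact for whoever reads the logs.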

Credits