Get Instrumented: How Prometheus Can Unify Your Metrics

31 May 2016

To get real-time insight into running applications, you need to instrument them and collect metrics: count events, measure times, expose numbers. That used to be a clusterfuck of technologies and approaches. Prometheus changes that.

So far, I’ve given this talk at PiterPy 2016 and WGDF 2016 in Saint Petersburg, PyCon US 2016 in Portland, EuroPython 2016 in Bilbao, PyCon ZA 2016 in Cape Town, and DevOpsPro Moscow 2016.

Slides on Speaker Deck

Metrics In General

Two timeless talks about the importance of metrics:
- Coda Hale: Metrics, Metrics Everywhere
- Jeff Hodges: Distributed Systems in Production
Logging v. instrumentation.
Why Averages Suck and Percentiles are Great.
Statistics for Software. Fortunately you can ignore the implementation parts as Prometheus does that for you. :)
Is Everything Worth Maximizing? A rather philosophical disquisition on choosing and using business metrics.
Push vs Pull for Monitoring.
Site Reliability Engineering – How Google Runs Production Systems talks quite a bit about metrics too.
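
To make the averages-versus-percentiles point from the list above concrete, here is a minimal sketch using only the Python standard library. The latency numbers are made up for illustration; the point is only that a mean can describe a request that never happened, while percentiles show both the typical case and the tail.

```python
import statistics

# Hypothetical request latencies in milliseconds:
# 98 fast requests and 2 very slow ones.
latencies_ms = [20] * 98 + [3000] * 2

mean_ms = statistics.mean(latencies_ms)

# quantiles(n=100) returns the 99 cut points between the percentiles,
# so index 49 is the 50th percentile and index 98 is the 99th.
percentiles = statistics.quantiles(latencies_ms, n=100, method="inclusive")
p50_ms = percentiles[49]
p99_ms = percentiles[98]

print(f"mean={mean_ms:.1f}ms p50={p50_ms:.1f}ms p99={p99_ms:.1f}ms")
```

The mean lands near 80 ms even though no request took anywhere close to that long: the median stays at the fast value and the 99th percentile exposes the slow outliers.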

Prometheus

Another record with the server improvements of the upcoming 0.18 release: 800k samples/s ingestion rate with 1.7M series and 2100 targets.
— Official Prometheus Twitter Account, Tweet

Homepage and documentation.
Thoughts on push vs pull based metrics/monitoring:
- From the FAQ: Why do you pull rather than push?.
- Pull doesn’t scale - or does it?
It might be tempting to just use the push gateway and keep everything push-based. However, the push gateway is intended for a very specific use case, and it’s important to know When to Use the Pushgateway. If the success of your instrumentation depends on using Prometheus for use cases outside this realm, Prometheus may not be the best choice for you. Those cases are rather rare, though, so you may want to double-check any preconceived notions.
Prometheus offers two (modern) strategies for storing sample data: “double delta” (default) and “varbit” (new in 0.18, referred to in the tweet above). They let you trade disk space and I/O load for query runtime: When (not) to use varbit chunks.
- If you’re interested in how the TSDB behind Prometheus works, this talk by its author from PromCon 2016 is fascinating.
Prometheus proper is intended to be scaled using federation. However, third parties have started building horizontally scalable solutions:
- DigitalOcean’s Vulcan that builds on Kafka, Elasticsearch, and Cassandra.
- Weaveworks’ Prism that relies on AWS services.
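
The “double delta” storage strategy mentioned above builds on a general idea that is easy to show in a few lines: regularly scraped samples change by nearly the same amount each time, so storing the change of the change yields mostly tiny numbers. The following is a generic sketch of delta-of-delta encoding under that assumption; the function names are hypothetical, and Prometheus’ actual chunk format differs in detail.

```python
def double_delta_encode(values):
    """Return (first value, first delta, list of delta-of-deltas)."""
    if len(values) < 2:
        raise ValueError("need at least two samples")
    deltas = [b - a for a, b in zip(values, values[1:])]
    double_deltas = [b - a for a, b in zip(deltas, deltas[1:])]
    return values[0], deltas[0], double_deltas


def double_delta_decode(first, first_delta, double_deltas):
    """Rebuild the original values from the encoded form."""
    values = [first, first + first_delta]
    delta = first_delta
    for dd in double_deltas:
        delta += dd
        values.append(values[-1] + delta)
    return values


# Hypothetical scrape timestamps in seconds: every ~15s with a little jitter.
timestamps = [1000, 1015, 1030, 1046, 1060, 1075]
encoded = double_delta_encode(timestamps)
decoded = double_delta_decode(*encoded)
print(encoded)
```

The encoded delta-of-deltas stay close to zero, which is what makes them cheap to store compactly.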

PromQL

When coming from a different monitoring system, it’s interesting to see how their languages translate to each other.
How to Query Prometheus (on Ubuntu 14.04). A gentle, hands-on two-part series on PromQL from one of the Prometheus authors for DigitalOcean.
irate() vs rate(). TL;DR: irate() is better for graphs, rate() is better for monitoring.
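
To see why the two functions behave so differently, here is a deliberately simplified sketch of what they compute over one range of counter samples. It is an assumption-laden toy: the function names are hypothetical, and the real rate() and irate() also handle counter resets (and rate() extrapolates to the window boundaries), which this ignores.

```python
def simple_rate(samples):
    """Average per-second increase between the first and last sample."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def simple_irate(samples):
    """Per-second increase between the last two samples only."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical (timestamp_seconds, counter_value) pairs:
# steady traffic for 45 seconds, then a burst in the last scrape interval.
window = [(0, 0), (15, 15), (30, 30), (45, 45), (60, 360)]

avg_per_second = simple_rate(window)
instant_per_second = simple_irate(window)
print(avg_per_second, instant_per_second)
```

The averaged value smooths the burst across the whole window, while the instant value reacts to the last two samples only. That responsiveness is nice on a graph but makes for a jumpy signal.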

alertmanager

Prometheus: A Next Generation Monitoring System (DCU Techweek 2016). A general introduction to Prometheus with a slant towards monitoring.
Prometheus as an Engine for MySQL Monitoring
My Philosophy on Alerting by a former Google SRE.

Instrumenting Your Environment

node_exporter for instrumenting from inside (metal, KVM, LXC, jails, …)
cAdvisor for instrumenting from outside (mainly Docker; but also LXC).
mtail for extracting metrics from log files based on regular expressions.
- The apache_metrics example can extract better metrics than the Apache status-based exporter.
- If you like grok, you may also be interested in the grok_exporter that will allow you to reuse your patterns.
Prometheus can be introduced step by step by using one of the bridging exporters. Here’s an example for Graphite.
More official and unofficial exporters can be found here.
Sometimes you want to know the load on your database servers. That’s when machine roles come in handy.
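
The idea behind extracting metrics from log files with regular expressions, as mentioned for mtail above, can be sketched in plain Python. This is not mtail’s own program syntax, just an illustration of the technique; the log format, the pattern, and the function name are assumptions, and real Apache log formats vary.

```python
import re
from collections import Counter

# Assumed, simplified access-log shape: "METHOD PATH PROTOCOL" STATUS SIZE
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) '
)


def count_requests(lines):
    """Count requests per (method, status); lines that don't match are skipped."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[(match["method"], match["status"])] += 1
    return counts


sample_log = [
    '127.0.0.1 - - [31/May/2016:10:00:00 +0000] "GET / HTTP/1.1" 200 512',
    '127.0.0.1 - - [31/May/2016:10:00:01 +0000] "GET /missing HTTP/1.1" 404 128',
    '127.0.0.1 - - [31/May/2016:10:00:02 +0000] "POST /login HTTP/1.1" 200 64',
    "not an access log line",
]
request_counts = count_requests(sample_log)
print(request_counts)
```

Each (method, status) pair would map naturally to one labeled counter series.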

Adding Prometheus to Your App

Don’t shy away from adding instrumentation code. It takes a while to get used to it, but it should be an integral part of your software, not an afterthought.

Official Python client.
- Async (Twisted & asyncio) extensions for the official client by yours truly.
- Recently, the Python client added support for multi-process applications (like gunicorn or uWSGI web apps). Another approach is to expose your metrics per process. In uWSGI you can use uwsgi.worker_id() as a label; gunicorn sadly doesn’t support something like that, but I was able to implement it using its callbacks.
Other client libraries.
A start to end instrumentation of Go and Python applications including DNS service discovery.
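
The per-process approach mentioned above boils down to every worker exposing its own series, told apart by a label. Here is a minimal sketch of what such output looks like in the Prometheus text format, hand-rolled with the standard library purely for illustration. The helper name, the metric name, and the `worker` label are hypothetical; in a real application the official client renders this for you, and the worker id would come from something like uwsgi.worker_id().

```python
def render_counter(name, help_text, value, worker_id):
    """Render one counter sample with a 'worker' label in the text format."""
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} counter\n"
        f'{name}{{worker="{worker_id}"}} {float(value)}\n'
    )


# Stand-in worker id; inside uWSGI this would be uwsgi.worker_id().
worker_id = 2
exposition = render_counter(
    "app_requests_total", "Requests handled by this worker.", 3, worker_id
)
print(exposition, end="")
```

Because each worker reports under its own label value, the per-worker series can be summed at query time to get the application-wide total.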

Credits

Icons: Symbolicons
A380 cockpit: Wikipedia
Container ship: Wikipedia
Wikipedia server rack: Wikipedia

Would you like me to give a talk at your conference or company? Get in touch!

Hynek Schlawack

Code Bohemian in ❤️ with Python 🐍, Go 🐹, and DevOps 🔧. Blogger 📝, speaker 📢, YouTuber 📺, PSF fellow 🏆, substance over flash 🧠.
