To get real-time insight into running applications you need to instrument them and collect metrics: count events, measure times, expose numbers. That used to be a clusterfuck of technologies and approaches. Prometheus changes that.
So far, I’ve held it at PiterPy 2016 and WGDF 2016 in Saint Petersburg, PyCon US 2016 in Portland, EuroPython 2016 in Bilbao, PyCon ZA 2016 in Cape Town, and DevOpsPro Moscow 2016.
Metrics In General
- Two timeless talks about the importance of metrics:
- Coda Hale: Metrics, Metrics Everywhere
- Jeff Hodges: Distributed Systems in Production
- Logging v. instrumentation.
- Why Averages Suck and Percentiles are Great.
- Statistics for Software. Fortunately you can ignore the implementation parts as Prometheus does that for you. :)
- Is Everything Worth Maximizing? A rather philosophical disquisition on choosing and using business metrics.
- Push vs Pull for Monitoring.
- Site Reliability Engineering – How Google Runs Production Systems talks quite a bit about metrics too.
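To make the point about averages and percentiles concrete, here is a minimal sketch using only Python's standard library. The latency numbers are made up for illustration: 95 fast requests and 5 very slow ones.

```python
import statistics

# Hypothetical request latencies in milliseconds:
# most requests are fast, a handful are very slow.
latencies_ms = [20] * 95 + [2000] * 5

mean_ms = statistics.mean(latencies_ms)

# quantiles(n=100) returns the 99 cut points between the 1st and 99th percentile.
cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
p50_ms = cuts[49]  # median
p99_ms = cuts[98]  # 99th percentile

print(f"mean={mean_ms:.0f}ms p50={p50_ms:.0f}ms p99={p99_ms:.0f}ms")
```

The mean describes a request that never happened: no user saw anything close to it. The median shows what most users experience, and the 99th percentile shows how bad it gets for the unlucky ones.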
Prometheus
Another record with the server improvements of the upcoming 0.18 release: 800k samples/s ingestion rate with 1.7M series and 2100 targets.
— Official Prometheus Twitter Account, Tweet
- Homepage and documentation.
- Thoughts on push vs pull based metrics/monitoring:
- It might be tempting to just use the push gateway and leave everything push based. However, the push gateway is intended for a very specific use case and it’s important to know When to Use the Pushgateway. If the success of your instrumentation depends on using Prometheus for use cases outside this realm, Prometheus may not be the best choice for you. However, those cases are rather rare, so you may want to double-check preconceived notions.
- Prometheus offers two (modern) different strategies for storing sample data: “double delta” (default) and “varbit” (new in 0.18, referred to in the tweet above). They let you trade disk space and I/O load for query runtime: When (not) to use varbit chunks.
- If you’re interested in how the TSDB behind Prometheus works, this talk by its author from PromCon 2016 is fascinating.
- Prometheus proper is intended to be scaled using federation. However third parties have started horizontal solutions:
PromQL
- When coming from a different monitoring system, it’s interesting to see how their languages translate to each other.
- How to Query Prometheus (on Ubuntu 14.04). A gentle, hands-on two-part series on PromQL from one of the Prometheus authors for DigitalOcean.
- irate() vs rate(). TL;DR: irate() is better for graphs, rate() is better for monitoring.
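The difference between the two functions can be sketched in plain Python. This is a deliberate simplification and not how Prometheus implements them: it ignores counter resets and the extrapolation to window boundaries that the real rate() performs. The helper names `simple_rate` and `simple_irate` and the sample data are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> float:
    """Average per-second increase between the first and last sample of the window."""
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


def simple_irate(samples: List[Sample]) -> float:
    """Per-second increase between only the last two samples of the window."""
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    return (v_last - v_prev) / (t_last - t_prev)


# A counter scraped every 15 seconds: steady traffic, then a burst in the last interval.
window = [(0, 0), (15, 15), (30, 30), (45, 45), (60, 360)]

avg_rate = simple_rate(window)
instant_rate = simple_irate(window)
print(f"rate={avg_rate} irate={instant_rate}")
```

The averaged value smooths the burst over the whole window, which makes it stable enough to alert on. The instant value follows the last interval only, so it shows spikes clearly on a graph but jumps around too much for alerting.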
alertmanager
- Prometheus: A Next Generation Monitoring System (DCU Techweek 2016). A general introduction to Prometheus with a slant towards monitoring.
- Prometheus as an Engine for MySQL Monitoring
- My Philosophy on Alerting by a former Google SRE.
Instrumenting Your Environment
- node_exporter for instrumenting from inside (metal, KVM, LXC, jails, …)
- cAdvisor for instrumenting from outside (mainly Docker; but also LXC).
- mtail for extracting metrics from log files based on regular expressions.
- The apache_metrics example can extract better metrics than the Apache status-based exporter.
- If you like grok, you may also be interested in the grok_exporter that will allow you to reuse your patterns.
- Prometheus can be introduced step by step by using one of the bridging exporters. Here’s an example for Graphite.
- More official and unofficial exporters can be found here.
- Sometimes you want to know the load on your database servers. That’s when machine roles come in handy.
Adding Prometheus to Your App
Don’t shy away from adding instrumentation code. It takes a while to get used to it, but it should be an integral part of your software, not an afterthought.
- Official Python client.
- Async (Twisted & asyncio) extensions for the official client by yours truly.
- Recently the Python client added support for multi-process applications (like gunicorn or uWSGI web apps). Another approach is to expose your metrics per process. In uWSGI you can use uwsgi.worker_id() as a label; gunicorn sadly doesn’t support something like that, but I was able to implement it using its callbacks.
- Other client libraries.
- A start-to-finish instrumentation of Go and Python applications, including DNS service discovery.
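To give an idea of how little code basic instrumentation takes, here is a minimal sketch using the official Python client (the `prometheus_client` package, which has to be installed separately). The metric names, the `handle_request` function, and the simulated work are made up for illustration. A dedicated registry is used only to keep the example isolated.

```python
import time

from prometheus_client import CollectorRegistry, Counter, Histogram, generate_latest

# A separate registry keeps this example independent of the global default one.
registry = CollectorRegistry()

# Hypothetical metrics: a labeled counter for events, a histogram for durations.
REQUESTS = Counter(
    "app_requests", "Requests handled.", ["method"], registry=registry
)
LATENCY = Histogram(
    "app_request_seconds", "Time spent handling a request.", registry=registry
)


def handle_request(method: str) -> str:
    """Pretend request handler that counts and times itself."""
    REQUESTS.labels(method=method).inc()
    with LATENCY.time():
        time.sleep(0.01)  # stand-in for real work
        return "ok"


handle_request("GET")
handle_request("GET")

# Render the metrics in the text format that Prometheus scrapes.
exposition = generate_latest(registry).decode()
print(exposition)
```

In a real application you would usually not call `generate_latest()` yourself. Instead you would expose the metrics over HTTP, for example with the client's `start_http_server()`, and let Prometheus pull them.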
Credits
- Icons: Symbolicons
- A380 cockpit: Wikipedia
- Container ship: Wikipedia
- Wikipedia server rack: Wikipedia