Monitoring

with Prometheus, Alertmanager & Grafana

KSS Con'19

Adam Płaczek

Why do we need monitoring?

Know when things go wrong before the business does
Investigate the system when something does go wrong
Analyze long-term trends
Alert the on-call person
Compare performance of different software versions
Post-mortem analysis
Overall system health dashboards

What needs monitoring?

INFRASTRUCTURE static	SERVICES dynamic
Servers	Containers
Network	Virtual machines
Storage	PaaS
Datacenter	Microservices

Tools

Old school

New school

Prometheus

Created at SoundCloud in 2012 by ex-Google engineers. Inspired by Google's Borgmon.

Joined the Cloud Native Computing Foundation in 2016 as the second hosted project, after Kubernetes.

Comes with a built-in time series database (TSDB).

Time series collection happens via a pull model over HTTP. A push model is still possible through the Pushgateway.

Targets are discovered via service discovery or static configuration

No built-in HA: each server runs independently. Multiple servers can be combined through federation.

Time series & metrics

Every time series is uniquely identified by its metric name and a set of key-value pairs, also known as labels.

Monitored systems expose metrics over an HTTP /metrics endpoint.

Prometheus stores metrics in its time series database and makes them available via the PromQL query language.
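To make "metric name plus labels" concrete, here is a minimal sketch of how samples look in the text format served from /metrics. The helper `render_sample` and the sample values are made up for illustration, and label-value escaping is left out.

```python
def render_sample(name, labels, value):
    """Render one sample in the Prometheus text exposition format."""
    if labels:
        # Labels are sorted only to make the output deterministic.
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        return f"{name}{{{pairs}}} {value}"
    return f"{name} {value}"


# Two series of the same metric: the differing label value makes them distinct.
samples = [
    ("http_requests_total", {"handler": "/api/comments", "method": "GET"}, 1027),
    ("http_requests_total", {"handler": "/api/comments", "method": "POST"}, 3),
]

body = "\n".join(
    ["# TYPE http_requests_total counter"]
    + [render_sample(*s) for s in samples]
)
print(body)
```

Each distinct combination of name and label values is stored as its own time series.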

/metrics

Getting data into Prometheus

Client Libraries

Go Java Scala Python Ruby Node.js Perl PHP Rust

When Prometheus scrapes your instance's HTTP endpoint, the client library sends the current state of all tracked metrics to the server
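The scrape flow can be sketched with the Python standard library alone. The `Counter` class below is a hypothetical stand-in, not the API of any official client library, and the metric name and port are assumptions.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class Counter:
    """Hypothetical stand-in for a client-library counter (not a real API)."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self._value = 0
        self._lock = threading.Lock()

    def inc(self, amount=1):
        with self._lock:
            self._value += amount

    def expose(self):
        """Return the current state in the text exposition format."""
        with self._lock:
            value = self._value
        return (
            f"# HELP {self.name} {self.help_text}\n"
            f"# TYPE {self.name} counter\n"
            f"{self.name} {value}\n"
        )


REQUESTS = Counter("app_requests_total", "Requests handled by the app.")


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        payload = REQUESTS.expose().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8000):
    """Block and serve /metrics; the port is an arbitrary choice."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


# Application code only increments; the value is read at scrape time.
REQUESTS.inc()
REQUESTS.inc(2)
print(REQUESTS.expose())
```

Calling `serve()` would start the endpoint. In a real service an official client library would normally replace all of this.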

Exporters

node_exporter Blackbox AWS CloudWatch StatsD SNMP

Useful when the target system does not expose an HTTP metrics endpoint natively, for example Linux system stats.

Pushgateway

Intermediary service which allows you to push metrics from jobs which cannot be scraped
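A short sketch of what a push from a batch job could look like. The gateway address, the job name `nightly_backup` and the metric are assumptions. The request is only built here, not sent.

```python
import urllib.request


def build_push_request(gateway_url, job, body):
    """Build (but do not send) a request that replaces the metrics of one job."""
    url = f"{gateway_url.rstrip('/')}/metrics/job/{job}"
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


# Hypothetical metric written by a backup job when it finishes.
payload = (
    "# TYPE backup_last_success_timestamp_seconds gauge\n"
    "backup_last_success_timestamp_seconds 1570000000\n"
)
request = build_push_request("http://localhost:9091", "nightly_backup", payload)
print(request.get_method(), request.full_url)
# To actually push: urllib.request.urlopen(request)
```

Prometheus then scrapes the Pushgateway like any other target.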

What are the key KPIs?

The RED Method - microservices

Rate (R): The number of requests per second.
Errors (E): The number of failed requests.
Duration (D): The amount of time to process a request.
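As a rough illustration, the three RED numbers can be derived from a window of request records. The `Request` record, treating status codes of 500 and above as failures, and the sample data are all assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    duration_s: float  # time spent handling the request, in seconds


def red_summary(requests, window_s):
    """Return (rate per second, error count, mean duration) for one window."""
    rate = len(requests) / window_s
    errors = sum(1 for r in requests if r.status >= 500)
    mean_duration = (
        sum(r.duration_s for r in requests) / len(requests) if requests else 0.0
    )
    return rate, errors, mean_duration


# Four requests observed in a 2-second window, one of them failed.
window = [
    Request(200, 0.10),
    Request(200, 0.30),
    Request(500, 1.10),
    Request(200, 0.50),
]
rate, errors, duration = red_summary(window, window_s=2)
print(rate, errors, duration)
```

In practice duration is usually tracked as a histogram or quantiles rather than a plain mean.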

The USE Method - hardware

Utilization (U): The percentage of time a resource is in use.
Saturation (S): The amount of work the resource cannot service yet (the “queue” of work).
Errors (E): A count of errors.
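A similarly small sketch for USE, assuming a single disk observed over a fixed window. The function names and numbers are illustrative only.

```python
def utilization_percent(busy_seconds, window_seconds):
    """Share of the window during which the resource was busy."""
    return 100.0 * busy_seconds / window_seconds


def mean_queue_length(queue_samples):
    """Average number of requests waiting; a simple proxy for saturation."""
    return sum(queue_samples) / len(queue_samples)


# Hypothetical disk: busy for 45 s of a 60 s window, queue sampled four times.
util = utilization_percent(busy_seconds=45, window_seconds=60)
saturation = mean_queue_length([0, 2, 4, 2])
error_count = 0  # e.g. I/O errors reported by the device during the window
print(util, saturation, error_count)
```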

PromQL

Get all time series with the metric http_requests_total

http_requests_total

Get the request counters for the handler /api/comments

http_requests_total{handler="/api/comments"}

Get the per-second rate of all requests, averaged over the last 5 minutes:

rate(http_requests_total[5m])

Sum the rate over all instances, grouped by handler

sum(rate(http_requests_total[5m])) by (handler)
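To build intuition for what rate() computes over a range of counter samples, here is a simplified sketch. It handles counter resets but skips the extrapolation to the window edges that Prometheus performs, so the real results can differ slightly. The sample data is made up.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp, value) samples.

    Simplification: unlike Prometheus's rate(), this does not extrapolate
    to the window boundaries; it only accounts for counter resets.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero.
        increase += curr if curr < prev else curr - prev
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# (timestamp in seconds, counter value), scraped every 15 s.
# The process restarts between t=30 and t=45.
window = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(simple_rate(window))
```

Because resets are absorbed, a restart of the monitored process does not show up as a negative rate.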

Alertmanager

Handles alerts received from Prometheus.

Alertmanager manages alerts, including silencing, inhibition, aggregation and sending out notifications via email, Slack, Symphony and many more channels. Example alerting rule (evaluated by Prometheus, which forwards firing alerts to Alertmanager):

groups:
- name: example
  rules:
  - alert: APIHighRequestLatency
    expr: api_http_request_latencies_second{quantile="0.5"} > 1
    for: 10m
    annotations:
      summary: "High request latency on {{ $labels.instance }}"
      description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"
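The effect of the `for: 10m` clause in the rule above can be sketched as a small state machine: the alert stays pending until the expression has been true for the whole hold period, and only then fires. This is a simplified model; `alert_state` and the evaluation timestamps are assumptions.

```python
def alert_state(evaluations, for_seconds):
    """Return the alert state after a series of rule evaluations.

    evaluations: list of (timestamp_s, condition_is_true), in time order.
    """
    active_since = None
    state = "inactive"
    for ts, is_true in evaluations:
        if not is_true:
            # Any false evaluation resets the hold timer.
            active_since, state = None, "inactive"
            continue
        if active_since is None:
            active_since = ts
        state = "firing" if ts - active_since >= for_seconds else "pending"
    return state


TEN_MINUTES = 600
print(alert_state([(0, True), (300, True)], TEN_MINUTES))
print(alert_state([(0, True), (300, True), (600, True)], TEN_MINUTES))
print(alert_state([(0, True), (300, False), (600, True)], TEN_MINUTES))
```

Short latency spikes therefore never reach Alertmanager, which keeps notifications quiet.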

Grafana

Grafana is a feature-rich platform for metrics visualisation, dashboard creation and alerting.

Retrieves data from the Prometheus database.

Monitoring stack architecture

Prometheus federation

Federation allows a Prometheus server to scrape selected time series from another Prometheus server.
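Under the hood, the federating server scrapes the /federate endpoint of the other server and selects series with `match[]` parameters. The sketch below only builds such a URL; the host name and selectors are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def federate_url(base_url, selectors):
    """Build the URL a federating server would scrape on another server."""
    query = urlencode([("match[]", s) for s in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


# Pull all series of one job plus aggregated recording-rule series.
url = federate_url(
    "http://prometheus-dc1:9090",
    ['{job="api"}', '{__name__=~"job:.*"}'],
)
print(url)
print(parse_qs(urlsplit(url).query)["match[]"])
```

In a real setup the same selectors go under `params` in a scrape job with `metrics_path: /federate`.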

Demo

Deploy Prometheus, Alertmanager and Grafana locally.
https://github.com/PagerTree/prometheus-grafana-alertmanager-example

Deploy Dragon Ball Mapper to Kubernetes on GKE

Add Dragon Ball Mapper to Prometheus targets

Create a graph in Grafana

Create an EC2 instance in AWS

Expose system metrics with node_exporter

Add instance to Prometheus targets

Send an alert to a Slack channel