- Observability
- the ability to understand and measure the state of a system based on the data it generates
- lets you generate actionable outputs from unexpected scenarios
- helps you better understand the internals of your system
- greater need for observability in distributed systems & microservices
- troubleshooting - e.g. why are error rates high?
- 3 pillars are logging, metrics, traces:
a. Logs - records of events that have occurred, encapsulating info about the specific event
b. Metrics - numerical information about the state of a system; data can be aggregated over time; contains name, value, timestamp, dimensions
c. Traces - follow operations (trace-id) as they travel through different hops; spans are the events that form a trace
- Prometheus only handles metrics, not logs or traces!
- SLO/SLA/SLI
a. SLI (service level indicators) = quantitative measure of some aspect of the level of service provided (availability, latency, error rate etc.)
- not all metrics make for good SLIs, you want to find metrics that accurately measure a user's experience
- high CPU, high memory are poor SLIs as they don't necessarily affect user's experience
b. SLO (service level objectives) = target value or range for an SLI
- examples: SLI latency -> SLO latency < 100ms; SLI availability -> SLO 99.99% uptime
- should be directly related to the customer experience
- purpose is to quantify reliability of a product to a customer
- may be tempted to set overly aggressive values
- goal is not to achieve perfection, but make customers happy
c. SLA (service level agreement) = contract between a vendor and a user that guarantees SLO
- Prometheus fundamentals
- use cases:
- collect metrics from different locations like West DC, central DC, East DC, AWS etc.
- alert on high memory on the host running the MySQL DB and notify the operations team via email
- find out at which uploaded video length the application starts to degrade
- open source monitoring tool that collects metrics data and provides tools to visualize it
- allows you to generate alerts when a threshold is reached
- collects data by scraping targets that expose metrics through an HTTP endpoint
- scraped data is stored in a time series DB and can be queried with the built-in PromQL
- what can it monitor:
- CPU/memory
- disk space
- service uptime
- app specific data - number of exceptions, latency, pending requests
- networking devices, databases etc.
- exclusively monitors numeric time-series data
- does not monitor events, system logs, traces
- originally sponsored by SoundCloud
- written in Go
- Prometheus Architecture
- 3 core components:
- retrieval (scrapes metric data)
- TSDB (stores metric data)
- HTTP server (accepts PromQL query)
- lots of other components make up the whole solution:
- exporters (mini-processes running on the targets) that the retrieval component pulls metrics from
- pushgateway (short-lived jobs push their data to it, and Prometheus retrieves it from there)
- service discovery provides the list of targets so you don't have to hardcode those values
- alertmanager handles all of the emails, SMS, Slack messages etc. after an alert is pushed to it
- Prometheus Web UI or Grafana etc.
- collects by sending an HTTP request to the /metrics endpoint of each target (the path can be changed)
- several native exporters:
- node exporters (Linux)
- Windows
- MySQL
- Apache
- HAProxy
- client libraries to monitor application metrics (# of errors/exceptions, latency, job execution duration) for Go, Java, Python, Ruby, Rust
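Whatever the exporter or client library, the scraped output is plain text in the Prometheus exposition format; a response from a /metrics endpoint looks roughly like this (the HELP text here is illustrative):

```text
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 258277.86
# HELP node_memory_Active_bytes Memory information field Active_bytes.
# TYPE node_memory_Active_bytes gauge
node_memory_Active_bytes 419124
```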
- Pull-based pros:
- easier to tell if the target is down
- does not DDoS the metrics server
- definitive list of targets to monitor (central source of truth)
- Prometheus Installation
- Download the tar from http://prometheus.io/download
- untarred folder contains console_libraries, consoles, prometheus (binary), prometheus.yml (config) and promtool (CLI utility)
- Run ./prometheus
- Open http://localhost:9090
- Execute the query up in the console to see the one target (itself) - should work OK, so we can turn it into a systemd service
- Create a user: sudo useradd --no-create-home --shell /bin/false prometheus
- Create a config folder: sudo mkdir /etc/prometheus
- Create /var/lib/prometheus for the data
- Move executables: sudo cp prometheus /usr/local/bin ; sudo cp promtool /usr/local/bin
- Move the config file: sudo cp prometheus.yml /etc/prometheus/
- Copy the consoles folders: sudo cp -r consoles /etc/prometheus/ ; sudo cp -r console_libraries /etc/prometheus/
- Change owner for these folders & executables: sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus /usr/local/bin/prometheus /usr/local/bin/promtool
- The command (ExecStart) will then look like this: sudo -u prometheus /usr/local/bin/prometheus --config.file /etc/prometheus/prometheus.yml --storage.tsdb.path /var/lib/prometheus --web.console.templates /etc/prometheus/consoles --web.console.libraries /etc/prometheus/console_libraries
- Create a service file with this information at /etc/systemd/system/prometheus.service and reload: sudo systemctl daemon-reload
- Start the daemon: sudo systemctl start prometheus ; sudo systemctl enable prometheus
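The service file mentioned above might look like this minimal sketch (systemd's User= replaces the sudo -u from the command; paths follow the steps above):

```ini
# /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file /etc/prometheus/prometheus.yml \
  --storage.tsdb.path /var/lib/prometheus \
  --web.console.templates /etc/prometheus/consoles \
  --web.console.libraries /etc/prometheus/console_libraries

[Install]
WantedBy=multi-user.target
```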
- Node exporter
- Download the tar from http://prometheus.io/download
- untarred folder contains basically just the binary node_exporter
- Run ./node_exporter and then curl localhost:9100/metrics
- Run in the background & start on boot using systemd:
- sudo cp node_exporter /usr/local/bin
- sudo useradd --no-create-home --shell /bin/false node_exporter
- sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
- sudo vi /etc/systemd/system/node_exporter.service
- sudo systemctl daemon-reload
- sudo systemctl start node_exporter ; sudo systemctl enable node_exporter
- Prometheus configuration
- Sections:
a. global - default parameters; can be overridden by the same variables in sub-sections
b. scrape_configs - defines targets and job_name, which is a collection of instances that need to be scraped
c. alerting
d. rule_files
e. remote_read & remote_write
f. storage
- Some examples:
scrape_configs:
- job_name: 'nodes' # call it whatever
scrape_interval: 30s # from the target every X seconds
scrape_timeout: 3s # timeouts after X seconds
scheme: https # http or https
metrics_path: /stats/metrics # non-default path that you send requests to
static_configs:
- targets: ['10.231.1.2:9090', '192.168.43.9:9090'] # two IPs
      # basic_auth # this is the next section
- Reload the config:
sudo systemctl restart prometheus
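Putting it together, a global section with a per-job override might look like this (addresses and intervals are illustrative):

```yaml
global:
  scrape_interval: 1m    # default scrape interval for all jobs
  scrape_timeout: 10s

scrape_configs:
  - job_name: 'nodes'
    scrape_interval: 30s # overrides the global 1m for this job only
    static_configs:
      - targets: ['10.231.1.2:9100']
```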
- Encryption & Authentication
- between Prometheus and targets
- On the targets, you need to generate the keys:
sudo openssl req -new -newkey rsa:2048 -days 465 -nodes -x509 -keyout node_exporter.key -out node_exporter.crt -subj "..." -addext "subjectAltName = DNS:localhost"
- this will generate a key & crt pair
- the node exporter config will have to be customized:
tls_server_config:
  cert_file: node_exporter.crt
  key_file: node_exporter.key
- run it with the config: ./node_exporter --web.config=config.yml
- verify: curl -k https://localhost:9100/metrics
On the server:
- copy the node_exporter.crt from the target to the Prometheus server
- update the scheme to https in prometheus.yml and add tls_config with ca_file (e.g. /etc/prometheus/node_exporter.crt that we copied in the previous step) and insecure_skip_verify if self-signed
- restart the prometheus service
scrape_configs:
- job_name: "node"
scheme: https
tls_config:
ca_file: /etc/prometheus/node_exporter.crt
      insecure_skip_verify: true
- Authentication is done via a generated hash (sudo apt install apache2-utils or httpd-tools etc.) and htpasswd -nBC 12 "" | tr -d ':\n' (will prompt for a password and spit out the hash):
- add basic_auth_users and the username + hash underneath it:
# /etc/node_exporter/config.yml
basic_auth_users:
  prometheus: $2y$12$daXru320983rnofkwehj4039F
- restart the node_exporter service
- update Prometheus server's config with the same auth:
- job_name: "node"
basic_auth:
username: prometheus
    password: <PLAIN TEXT PASSWORD!>
- Metrics
- 3 properties:
- name - general feature of a system to be measured; may contain ASCII letters, numbers, underscores and colons ([a-zA-Z_:][a-zA-Z0-9_:]*); colons are reserved for recording rules; metric names cannot start with a number; the name is technically a label too (e.g. __name__=node_cpu_seconds_total)
- {labels (key/value pairs)} - allow splitting a metric up by a specified criterion (e.g. multiple CPUs, specific HTTP methods, API endpoints etc.); metrics can have more than 1 label; ASCII letters, numbers, underscores ([a-zA-Z0-9_]*); labels surrounded by __ are considered internal to Prometheus; every metric is assigned 2 labels by default (instance and job)
- metric value
- Example: node_cpu_seconds_total{cpu="0",mode="idle"} 258277.86 - the labels provide us information on which CPU this metric is for
- when Prometheus scrapes a target and retrieves metrics, it also stores the time at which the metric was scraped
- Example: 1668215300 (unix epoch timestamp, seconds since Jan 1st 1970 UTC)
- time series = stream of timestamped values sharing the same metric and set of labels
- metrics have a TYPE (counter, gauge, histogram, summary) and a HELP (description of what the metric is) attribute:
- counter can only go up; how many times did X happen
- gauge can go up or down, what is the current value of X
- histogram tells how long or how big something is; groups observations into configurable bucket sizes (e.g. cumulative response time buckets <0.2s, <0.5s, <1s)
- summary is similar to histogram and tells us how many observations fell below X; you do not have to define quantiles ahead of time (similar to histogram, but percentages: response time 20% = <0.3s, 50% = <0.8s, 80% = <1s)
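The cumulative bucketing a histogram does can be sketched in a few lines of Python (the observations and bounds below are made-up values):

```python
# Sketch of how a Prometheus histogram buckets observations.
# All numbers below are made up for illustration.
observations = [0.05, 0.12, 0.31, 0.9, 1.4]  # e.g. response times in seconds
bounds = [0.2, 0.5, 1.0, float("inf")]       # upper bounds, like the `le` label

# Buckets are cumulative: an observation is counted in every bucket
# whose upper bound it does not exceed.
counts = {b: 0 for b in bounds}
for obs in observations:
    for b in bounds:
        if obs <= b:
            counts[b] += 1

# A real histogram also exposes a running sum and a total count.
total_sum = sum(observations)
total_count = len(observations)
print(counts)  # {0.2: 2, 0.5: 3, 1.0: 4, inf: 5}
```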
Q: How many total unique time series are there in this output?
node_arp_entries{instance="node1", job="node"} 200
node_arp_entries{instance="node2", job="node"} 150
node_cpu_seconds_total{cpu="0", instance="node1", mode="iowait"}
node_cpu_seconds_total{cpu="1", instance="node1", mode="iowait"}
node_cpu_seconds_total{cpu="0", instance="node1", mode="idle"}
node_cpu_seconds_total{cpu="1", instance="node1", mode="idle"}
node_cpu_seconds_total{cpu="1", instance="node2", mode="idle"}
node_memory_Active_bytes{instance="node1", job="node"} 419124
node_memory_Active_bytes{instance="node2", job="node"} 55589
A: 9
Q: What metric should be used to report the current memory utilization? A: gauge
Q: What metric should be used to report the amount of time a process has been running? A: counter
Q: Which of these is not a valid metric? A: 404_error_count
Q: How many labels does the following time series have? http_errors_total{instance="1.1.1.1:80", job="api", code="400", endpoint="/user", method="post"} 55234 A: 5
Q: A web app is being built that allows users to upload pictures, management would like to be able to track the size of uploaded pictures and report back the number of photos that were less than 10Mb, 50Mb, 100MB, 500MB, and 1Gb. What metric would be best for this? A: histogram
Q: What are the two labels every metric is assigned by default? A: instance, job
Q: What are the 4 types of prometheus metrics? A: counter, gauge, histogram, summary
Q: What are the two attributes provided by a metric? A: Help, Type
Q: For the metric http_requests_total{path="/auth", instance="node1", job="api"} 7782 ; What is the metric name? A: http_requests_total
- Expression browser
- Web UI for Prometheus server to query data
- up - returns which targets are in the up state (you can see the instance and job labels, and the value on the right - 0 or 1)
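For example, querying up on a fresh install shows the one self-scraped target (instance/port follow the default config):

```text
up{instance="localhost:9090", job="prometheus"}    1
```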
- Prometheus on Docker
- Pull the image: docker pull prom/prometheus
- Configure prometheus.yml
- Expose ports, bind mounts: docker run -d -v /path-to/prometheus.yml:/etc/prometheus/prometheus.yml -p 9090:9090 prom/prometheus
- PromTools
- check & validate configuration before applying it (e.g. to production)
- prevent downtime while config issues are being identified
- validate metrics passed to it are correctly formatted
- can perform queries on a Prom server
- debugging & profiling a Prom server
- perform unit tests against recording/alerting rules
promtool check config /etc/prometheus/prometheus.yml
- Container metrics
- metrics can be scraped from containerized envs
- docker engine metrics (how much CPU does Docker use etc.; no metrics specific to a container!):
- edit /etc/docker/daemon.json and add: { "metrics-addr": "127.0.0.1:9323", "experimental": true }
- sudo systemctl restart docker
- curl localhost:9323/metrics
- prometheus job update:
scrape_configs:
  - job_name: "docker"
    static_configs:
      - targets: ["12.1.13.4:9323"]
- cAdvisor (how much memory does each container use? container uptime? etc.):
- edit docker-compose.yml to pull gcr.io/cadvisor/cadvisor
- docker-compose up (or docker compose up)
- curl localhost:8080/metrics
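A minimal docker-compose.yml sketch for cAdvisor (the read-only mounts follow cAdvisor's documented setup; adjust to your host):

```yaml
version: "3"
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
```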
Last update: Mon Jan 16 04:27:57 UTC 2023