Prometheus Certified Associate (PCA)

Observability

the ability to understand and measure the state of a system based on data generated by the system
lets you generate actionable outputs from unexpected scenarios
helps you better understand the internals of your system
greater need for observability in distributed systems & microservices
troubleshooting - e.g. why are error rates high?
the 3 pillars are logs, metrics, and traces:

a. Logs - records of events that have occurred; each record encapsulates info about the specific event

b. Metrics - numerical information about the state of the system; data can be aggregated over time; each metric contains a name, value, timestamp, and dimensions

c. Traces - follow operations (identified by a trace-id) as they travel through different hops; spans are the events that form a trace

Prometheus only handles metrics, not logs or traces!
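
A minimal sketch of the four parts of a metric listed above (name, value, timestamp, dimensions), rendered as one line in the style of the Prometheus text exposition format. The `Sample` class, the metric name `http_requests_total`, the labels, and the timestamp are all made up for illustration; real applications would use a client library instead.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    """One metric data point: name, value, timestamp, and dimensions (labels)."""
    name: str
    value: float
    timestamp_ms: int
    labels: dict = field(default_factory=dict)

    def render(self) -> str:
        """Render the sample as a single text line: name{labels} value timestamp."""
        if self.labels:
            pairs = ",".join(f'{k}="{v}"' for k, v in sorted(self.labels.items()))
            return f"{self.name}{{{pairs}}} {self.value} {self.timestamp_ms}"
        return f"{self.name} {self.value} {self.timestamp_ms}"


# Hypothetical sample: all values below are illustrative assumptions.
sample = Sample(
    name="http_requests_total",
    value=1027.0,
    timestamp_ms=1673566491000,
    labels={"method": "post", "code": "200"},
)
line = sample.render()
print(line)
```

The labels are the "dimensions": the same metric name with a different label set is a different time series.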

SLO/SLA/SLI

a. SLI = quantitative measure of some aspect of the level of service provided (availability, latency, error rate etc.)

not all metrics make for good SLIs, you want to find metrics that accurately measure a user's experience
high CPU and high memory are poor SLIs, as they don't necessarily affect the user's experience

b. SLO = target value or range for an SLI

examples:
- SLI - Latency, SLO - Latency < 100ms
- SLI - Availability, SLO - 99.99% uptime
should be directly related to the customer experience
purpose is to quantify reliability of a product to a customer
you may be tempted to set overly aggressive values
goal is not to achieve perfection, but make customers happy
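
A small worked example of the SLI/SLO relationship, assuming availability is measured as successful requests divided by total requests and a 30-day window. The function names and the request counts are hypothetical; the 99.99% target comes from the example above.

```python
def availability(successful: int, total: int) -> float:
    """SLI: fraction of requests that were served successfully."""
    if total == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return successful / total


def meets_slo(sli: float, target: float) -> bool:
    """SLO check: the measured SLI must be at or above the target."""
    return sli >= target


def allowed_downtime_minutes(target: float, window_days: int = 30) -> float:
    """Downtime budget implied by an availability target over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - target) * window_minutes


# Hypothetical traffic: 50 failed requests out of one million.
sli = availability(successful=999_950, total=1_000_000)
slo_met = meets_slo(sli, 0.9999)
budget = allowed_downtime_minutes(0.9999)

print(sli, slo_met, round(budget, 2))
```

A 99.99% target leaves only about 4.32 minutes of downtime per 30 days, which is why overly aggressive SLOs are expensive to keep.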

c. SLA = contract between a vendor and a user that guarantees an SLO

Prometheus fundamentals

use cases:
- collect metrics from different locations like West DC, central DC, East DC, AWS etc.
- detect high memory on the host running a MySQL db and notify the operations team via email
- find out at which uploaded video length the application starts to degrade
open source monitoring tool that collects metrics data and provides tools to visualize the data
lets you generate alerts when a threshold is reached
collects data by scraping targets that expose metrics through an HTTP endpoint
data is stored in a time series db and can be queried with the built-in query language, PromQL
what can it monitor:
- CPU/memory
- disk space
- service uptime
- app specific data - number of exceptions, latency, pending requests
- networking devices, databases etc.
exclusively monitors numeric time-series data
does not monitor events, system logs, traces
originally sponsored by SoundCloud
written in Go
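
A toy illustration of the ideas above: numeric samples are stored as time series identified by a metric name plus labels, and an alert condition compares the latest value to a threshold. This is only a sketch of the data model, not how the Prometheus TSDB or alerting rules are actually implemented; `TinySeriesStore`, `memory_used_ratio`, and `mysql-host-1` are made-up names.

```python
from collections import defaultdict


class TinySeriesStore:
    """Toy in-memory store: one list of (timestamp, value) points per series."""

    def __init__(self):
        self._series = defaultdict(list)

    @staticmethod
    def _key(name, labels):
        # A series is identified by its metric name plus its label set.
        return (name, tuple(sorted(labels.items())))

    def append(self, name, labels, timestamp, value):
        self._series[self._key(name, labels)].append((timestamp, float(value)))

    def latest(self, name, labels):
        points = self._series.get(self._key(name, labels))
        return points[-1][1] if points else None


def over_threshold(store, name, labels, threshold):
    """Return True when the most recent value of the series exceeds the threshold."""
    value = store.latest(name, labels)
    return value is not None and value > threshold


store = TinySeriesStore()
labels = {"instance": "mysql-host-1"}  # hypothetical target
for ts, used in [(0, 0.62), (15, 0.71), (30, 0.93)]:  # hypothetical samples
    store.append("memory_used_ratio", labels, ts, used)

firing = over_threshold(store, "memory_used_ratio", labels, threshold=0.90)
print(firing)
```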

Prometheus Architecture

3 core components:
- retrieval (scrapes metric data)
- TSDB (stores metric data)
- HTTP server (accepts PromQL queries)
a number of other components make up the whole solution:
- exporters (mini-processes running on the targets) that the retrieval component pulls metrics from
- pushgateway (short-lived jobs push their data to it, and Prometheus scrapes it from there)
- service discovery provides the list of targets so you don't have to hardcode those values
- alertmanager handles all of the emails, SMS, Slack messages etc. after alerts are pushed to it
- Prometheus Web UI or Grafana etc.
collects metrics by sending an HTTP request to the /metrics endpoint of each target; the path can be changed
several native exporters:
- node exporter (Linux)
- Windows
- MySQL
- Apache
- HAProxy
- client libraries to monitor application metrics (# of errors/exceptions, latency, job execution duration) for Go, Java, Python, Ruby, Rust
pros of the pull-based model:
- easier to tell if the target is down
- does not DDoS the metrics server
- definitive list of targets to monitor (central source of truth)
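
A minimal sketch of the pull model using only the Python standard library: a target exposes `/metrics` over HTTP, and a scraper pulls it. The scraper returns `up = 1` on success and `up = 0` on failure, mirroring how a failed scrape makes a down target easy to spot. The handler, the `app_pending_requests` metric, and the very simplified line parser (no labels or timestamps) are assumptions for illustration, not Prometheus code.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical metrics payload in the text exposition style.
METRICS_BODY = (
    "# HELP app_pending_requests Requests waiting to be handled.\n"
    "# TYPE app_pending_requests gauge\n"
    "app_pending_requests 7\n"
)


class MetricsHandler(BaseHTTPRequestHandler):
    """Plays the role of a target exposing metrics on /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = METRICS_BODY.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def scrape(url, timeout=2.0):
    """Pull metrics from one target; return (up, samples)."""
    # Bypass any system proxy settings for this local example.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            text = resp.read().decode("utf-8")
    except OSError:
        return 0, {}  # target unreachable: report it as down
    samples = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, value = line.rsplit(" ", 1)
        samples[name] = float(value)
    return 1, samples


# Port 0 lets the OS pick a free port for the fake target.
server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{port}/metrics"

up_running, samples = scrape(url)   # target is running
server.shutdown()
server.server_close()
up_stopped, _ = scrape(url)         # target has been stopped

print(up_running, samples, up_stopped)
```

Because the scraper initiates every request, it knows immediately when a target stops answering; with a push model, silence could mean either "down" or "nothing to report".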

Last update: Thu Jan 12 23:34:51 UTC 2023
