- Observability
- the ability to understand and measure the state of a system based on data generated by the system
- allows you to generate actionable outputs from unexpected scenarios
- helps you better understand the internals of your system
- greater need for observability in distributed systems & microservices
- troubleshooting - e.g. why are error rates high?
- 3 pillars are logs, metrics, and traces:
a. Logs - records of events that have occurred, encapsulating info about the specific event
b. Metrics - numerical information about the state of a system; data can be aggregated over time; each metric contains a name, value, timestamp, and dimensions
c. Traces - follow operations (identified by a trace-id) as they travel through different hops; spans are the events that form a trace
- Prometheus only handles metrics, not logs or traces!
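To make the three pillars concrete, here is a minimal sketch of what one record of each kind might look like. The class names, field names, and sample values are hypothetical and chosen only for illustration; they are not any specific tool's data model.

```python
from dataclasses import dataclass, field
import time


@dataclass
class LogRecord:
    """A record of one event, with details specific to that event."""
    timestamp: float
    message: str
    context: dict = field(default_factory=dict)


@dataclass
class MetricSample:
    """A numeric value with a name, timestamp, and dimensions."""
    name: str
    value: float
    timestamp: float
    dimensions: dict = field(default_factory=dict)


@dataclass
class Span:
    """One event (hop) within a trace; spans share the same trace_id."""
    trace_id: str
    name: str
    start: float
    end: float


now = time.time()

# Hypothetical sample data, for illustration only.
log = LogRecord(now, "payment failed", {"order_id": "1234"})
metric = MetricSample("http_requests_total", 1027.0, now,
                      {"method": "POST", "status": "500"})
trace = [
    Span("abc123", "gateway", now, now + 0.120),
    Span("abc123", "auth-service", now + 0.010, now + 0.030),
    Span("abc123", "payment-service", now + 0.035, now + 0.110),
]

# All spans of one operation carry the same trace-id.
trace_ids = {span.trace_id for span in trace}
```

Of these three shapes, only `MetricSample` resembles what Prometheus stores.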
- SLO/SLA/SLI
a. SLI = quantitative measure of some aspect of the level of service provided (availability, latency, error rate etc.)
- not all metrics make for good SLIs, you want to find metrics that accurately measure a user's experience
- high CPU and high memory are poor SLIs as they don't necessarily affect the user's experience
b. SLO = target value or range for an SLI
- examples:
- SLI: latency, SLO: latency < 100ms
- SLI: availability, SLO: 99.99% uptime
- should be directly related to the customer experience
- purpose is to quantify reliability of a product to a customer
- you may be tempted to set overly aggressive values
- the goal is not to achieve perfection, but to make customers happy
c. SLA = contract between a vendor and a user that guarantees a certain SLO
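A rough sketch of how the two example SLOs above could be checked against measured SLIs. The request counts and latency values are made up, and using the 90th percentile as the latency SLI is an assumption of this sketch (the notes only say "latency < 100ms").

```python
def availability_sli(successful: int, total: int) -> float:
    """Availability SLI: fraction of requests that succeeded."""
    if total == 0:
        return 1.0  # no traffic, nothing failed
    return successful / total


def percentile(values: list, p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = -(-p * len(ordered) // 100)  # integer ceiling of p% of n
    return ordered[rank - 1]


# Hypothetical measurements for illustration.
availability = availability_sli(successful=999_950, total=1_000_000)
latencies_ms = [20, 35, 40, 45, 50, 60, 70, 80, 90, 250]
latency_p90 = percentile(latencies_ms, 90)

# SLO = target value or range for each SLI.
availability_slo_met = availability >= 0.9999
latency_slo_met = latency_p90 < 100

print(f"availability SLO met: {availability_slo_met}")
print(f"latency SLO met: {latency_slo_met}")
```

Note that one slow request (250ms) does not break the latency SLO here, which fits the point that the goal is happy customers rather than perfection.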
- Prometheus fundamentals
- use cases:
- collect metrics from different locations like West DC, central DC, East DC, AWS etc.
- detect high memory usage on the host running the MySQL db and notify the operations team via email
- find out at which uploaded video length the application starts to degrade
- open source monitoring tool that collects metrics data and provides tools to visualize the data
- allows you to generate alerts when a threshold is reached
- collects data by scraping targets that expose metrics through an HTTP endpoint
- metrics are stored in a time series db and can be queried with the built-in PromQL
- what can it monitor:
- CPU/memory
- disk space
- service uptime
- app specific data - number of exceptions, latency, pending requests
- networking devices, databases etc.
- designed exclusively to monitor numeric time-series data
- does not monitor events, system logs, traces
- originally built at SoundCloud
- written in Go
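As a sketch of what "exposing metrics through an HTTP endpoint" can look like, the snippet below renders a counter in the Prometheus text exposition format and serves it at `/metrics` using only the Python standard library. The metric name, label values, counts, and port are hypothetical; a real application would normally use an official client library instead of hand-rolling this.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory counter values, keyed by (method, status).
REQUESTS = {("GET", "200"): 1027, ("POST", "500"): 3}


def render_metrics() -> str:
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), count in sorted(REQUESTS.items()):
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {count}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the /metrics path is served in this sketch.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever, serving metrics for a scraper to pull."""
    HTTPServer(("127.0.0.1", port), MetricsHandler).serve_forever()
```

Calling `serve()` and then requesting `http://127.0.0.1:8000/metrics` should return the rendered text, which is the kind of response a scrape collects.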
- Prometheus Architecture
- 3 core components:
- retrieval (scrapes metric data)
- TSDB (stores metric data)
- HTTP server (accepts PromQL queries)
- a lot of other components make up the whole solution:
- exporters (mini-processes running on the targets) that the retrieval component pulls the metrics from
- pushgateway (short-lived jobs push their data to it, and Prometheus retrieves it from there)
- service discovery is all about providing a list of targets so you don't have to hardcode those values
- alertmanager handles all of the emails, SMS, Slack messages etc. after alerts are pushed to it
- Prometheus Web UI or Grafana etc.
- collects by sending an HTTP request to the `/metrics` endpoint of each target, path can be changed
- several native exporters:
- node exporters (Linux)
- Windows
- MySQL
- Apache
- HAProxy
- client libraries to monitor application metrics (# of errors/exceptions, latency, job execution duration) for Go, Java, Python, Ruby, Rust
- Pros of the pull-based model:
- easier to tell if the target is down
- does not DDoS the metrics server
- definitive list of targets to monitor (central source of truth)
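The first pro can be illustrated with a toy scrape loop: because the monitoring side initiates every request, a failed scrape immediately tells it the target is down. Prometheus records this as the `up` metric (1 if the scrape succeeded, 0 if it failed). The function names and target addresses below are hypothetical, and this is only a sketch of the idea, not how Prometheus is implemented.

```python
import urllib.request
from typing import Callable, Dict, List


def http_fetch(url: str, timeout: float = 2.0) -> str:
    """Fetch a metrics page over HTTP; raises OSError on failure."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def scrape_all(targets: List[str],
               fetch: Callable[[str], str] = http_fetch) -> Dict[str, int]:
    """Scrape every target in the central list and record up=1 or up=0."""
    up = {}
    for target in targets:
        try:
            fetch(f"http://{target}/metrics")
            up[target] = 1
        except OSError:
            # Connection refused, timeout, or HTTP error: target counts as down.
            up[target] = 0
    return up


# Stand-in fetcher so the example runs without real targets.
def fake_fetch(url: str) -> str:
    if "db-1" in url:
        raise ConnectionRefusedError("target down")
    return "# metrics payload"


status = scrape_all(["web-1:9100", "db-1:9104"], fetch=fake_fetch)
print(status)
```

The `targets` list also plays the role of the central source of truth: anything not in it is simply never scraped.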
Last update: Thu Jan 12 23:34:51 UTC 2023