- Observability
- the ability to understand and measure the state of a system based on the data it generates
- lets you generate actionable outputs from unexpected scenarios
- helps you better understand the internals of your system
- greater need for observability in distributed systems & microservices
- troubleshooting - e.g. why are error rates high?
- 3 pillars are logging, metrics, traces:
a. Logs - records of events that have occurred, encapsulating info about the specific event
b. Metrics - numerical information about the state of a system; data can be aggregated over time; contains name, value, timestamp, dimensions
c. Traces - follow operations (trace-id) as they travel through different hops; spans are the events that form a trace
- Prometheus only handles metrics, not logs or traces!
- SLO/SLA/SLI
a. SLI (service level indicators) = quantitative measure of some aspect of the level of service provided (availability, latency, error rate etc.)
- not all metrics make for good SLIs, you want to find metrics that accurately measure a user's experience
- high CPU, high memory are poor SLIs as they don't necessarily affect user's experience
b. SLO (service level objectives) = target value or range for an SLI
- examples: SLI latency -> SLO latency < 100ms; SLI availability -> SLO 99.99% uptime
- should be directly related to the customer experience
- purpose is to quantify reliability of a product to a customer
- may be tempted to set overly aggressive values
- goal is not to achieve perfection, but make customers happy
c. SLA (service level agreement) = contract between a vendor and a user that guarantees SLO
- Prometheus fundamentals
- use cases:
- collect metrics from different locations like West DC, central DC, East DC, AWS etc.
- alert on high memory on the host running the MySQL DB and notify the operations team via email
- find out at which uploaded video length the application starts to degrade
- open source monitoring tool that collects metrics data and provides tools to visualize it
- allows you to generate alerts when a threshold is reached
- collects data by scraping targets that expose metrics through an HTTP endpoint
- scraped data is stored in a time series DB and can be queried with the built-in PromQL
- what can it monitor:
- CPU/memory
- disk space
- service uptime
- app specific data - number of exceptions, latency, pending requests
- networking devices, databases etc.
- exclusively monitors numeric time-series data
- does not monitor events, system logs, traces
- originally sponsored by SoundCloud
- written in Go
- Prometheus Architecture
- 3 core components:
- retrieval (scrapes metric data)
- TSDB (stores metric data)
- HTTP server (accepts PromQL query)
- lots of other components make up the whole solution:
- exporters (mini-processes running on the targets) that the retrieval component pulls metrics from
- pushgateway (short-lived jobs push their data to it, and Prometheus retrieves it from there)
- service discovery provides the list of targets so you don't have to hardcode those values
- alertmanager handles all of the emails, SMS, Slack messages etc. after an alert is pushed to it
- Prometheus Web UI or Grafana etc.
- collects by sending an HTTP request to the /metrics endpoint of each target (the path can be changed)
- several native exporters:
- node exporters (Linux)
- Windows
- MySQL
- Apache
- HAProxy
- client libraries to monitor application metrics (# of errors/exceptions, latency, job execution duration) for Go, Java, Python, Ruby, Rust
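Whatever the exporter or client library, the scraped output is plain text in the Prometheus exposition format; a response from a /metrics endpoint looks roughly like this (the HELP text here is illustrative):

```text
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 258277.86
# HELP node_memory_Active_bytes Memory information field Active_bytes.
# TYPE node_memory_Active_bytes gauge
node_memory_Active_bytes 419124
```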
- Pull-based pros:
- easier to tell if the target is down
- does not DDoS the metrics server
- definitive list of targets to monitor (central source of truth)
- Prometheus Installation
- Download the tar from http://prometheus.io/download
- untarred folder contains console_libraries, consoles, prometheus (binary), prometheus.yml (config) and promtool (CLI utility)
- Run ./prometheus
- Open http://localhost:9090
- Execute the query up in the console to see the one target (itself) - should work OK, so we can turn it into a systemd service
- Create a user: sudo useradd --no-create-home --shell /bin/false prometheus
- Create a config folder: sudo mkdir /etc/prometheus
- Create /var/lib/prometheus for the data
- Move executables: sudo cp prometheus /usr/local/bin ; sudo cp promtool /usr/local/bin
- Move the config file: sudo cp prometheus.yml /etc/prometheus/
- Copy the consoles folders: sudo cp -r consoles /etc/prometheus/ ; sudo cp -r console_libraries /etc/prometheus/
- Change owner for these folders & executables: sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus /usr/local/bin/prometheus /usr/local/bin/promtool
- The command (ExecStart) will then look like this: sudo -u prometheus /usr/local/bin/prometheus --config.file /etc/prometheus/prometheus.yml --storage.tsdb.path /var/lib/prometheus --web.console.templates /etc/prometheus/consoles --web.console.libraries /etc/prometheus/console_libraries
- Create a service file with this information at /etc/systemd/system/prometheus.service and reload: sudo systemctl daemon-reload
- Start the daemon: sudo systemctl start prometheus ; sudo systemctl enable prometheus
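The service file mentioned above might look like this minimal sketch (systemd's User= replaces the sudo -u from the command; paths follow the steps above):

```ini
# /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file /etc/prometheus/prometheus.yml \
  --storage.tsdb.path /var/lib/prometheus \
  --web.console.templates /etc/prometheus/consoles \
  --web.console.libraries /etc/prometheus/console_libraries

[Install]
WantedBy=multi-user.target
```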
- Node exporter
- Download the tar from http://prometheus.io/download
- untarred folder contains basically just the binary node_exporter
- Run ./node_exporter and then curl localhost:9100/metrics
- Run in the background & start on boot using systemd:
- sudo cp node_exporter /usr/local/bin
- sudo useradd --no-create-home --shell /bin/false node_exporter
- sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
- sudo vi /etc/systemd/system/node_exporter.service
- sudo systemctl daemon-reload
- sudo systemctl start node_exporter ; sudo systemctl enable node_exporter
- Prometheus configuration
- Sections:
a. global - default parameters; can be overridden by the same variables in sub-sections
b. scrape_configs - defines targets and job_name, which is a collection of instances that need to be scraped
c. alerting
d. rule_files
e. remote_read & remote_write
f. storage
- Some examples:
scrape_configs:
- job_name: 'nodes' # call it whatever
scrape_interval: 30s # from the target every X seconds
scrape_timeout: 3s # timeouts after X seconds
scheme: https # http or https
metrics_path: /stats/metrics # non-default path that you send requests to
static_configs:
- targets: ['10.231.1.2:9090', '192.168.43.9:9090'] # two IPs
      # basic_auth # this is the next section
- Reload the config:
sudo systemctl restart prometheus
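Putting it together, a global section with a per-job override might look like this (addresses and intervals are illustrative):

```yaml
global:
  scrape_interval: 1m    # default scrape interval for all jobs
  scrape_timeout: 10s

scrape_configs:
  - job_name: 'nodes'
    scrape_interval: 30s # overrides the global 1m for this job only
    static_configs:
      - targets: ['10.231.1.2:9100']
```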
- Encryption & Authentication
- between Prometheus and targets
- On the targets, you need to generate the keys:
sudo openssl req -new -newkey rsa:2048 -days 465 -nodes -x509 -keyout node_exporter.key -out node_exporter.crt -subj "..." -addext "subjectAltName = DNS:localhost"
- this will generate a key & crt pair
- the node exporter config will have to be customized:
tls_server_config:
  cert_file: node_exporter.crt
  key_file: node_exporter.key
- run it with the config: ./node_exporter --web.config=config.yml
- verify: curl -k https://localhost:9100/metrics
On the server:
- copy the node_exporter.crt from the target to the Prometheus server
- update the scheme to https in prometheus.yml and add tls_config with ca_file (e.g. /etc/prometheus/node_exporter.crt that we copied in the previous step) and insecure_skip_verify if self-signed
- restart the prometheus service
scrape_configs:
- job_name: "node"
scheme: https
tls_config:
ca_file: /etc/prometheus/node_exporter.crt
      insecure_skip_verify: true
- Authentication is done via a generated hash (sudo apt install apache2-utils or httpd-tools etc.) and htpasswd -nBC 12 "" | tr -d ':\n' (will prompt for a password and spit out the hash):
- add basic_auth_users and the username + hash underneath it:
# /etc/node_exporter/config.yml
basic_auth_users:
  prometheus: $2y$12$daXru320983rnofkwehj4039F
- restart the node_exporter service
- update Prometheus server's config with the same auth:
- job_name: "node"
basic_auth:
username: prometheus
    password: <PLAIN TEXT PASSWORD!>
- Metrics
- 3 properties:
- name - general feature of a system to be measured; may contain ASCII letters, numbers, underscores and colons ([a-zA-Z_:][a-zA-Z0-9_:]*); colons are reserved for recording rules; metric names cannot start with a number; the name is technically a label too (e.g. __name__=node_cpu_seconds_total)
- {labels (key/value pairs)} - allow splitting a metric up by a specified criterion (e.g. multiple CPUs, specific HTTP methods, API endpoints etc.); metrics can have more than 1 label; ASCII letters, numbers, underscores ([a-zA-Z0-9_]*); labels surrounded by __ are considered internal to Prometheus; every metric is assigned 2 labels by default (instance and job)
- metric value
- Example: node_cpu_seconds_total{cpu="0",mode="idle"} 258277.86 - the labels provide us information on which CPU this metric is for
- when Prometheus scrapes a target and retrieves metrics, it also stores the time at which the metric was scraped
- Example: 1668215300 (unix epoch timestamp, seconds since Jan 1st 1970 UTC)
- time series = stream of timestamped values sharing the same metric and set of labels
- metrics have a TYPE (counter, gauge, histogram, summary) and a HELP (description of what the metric is) attribute:
- counter can only go up; how many times did X happen
- gauge can go up or down, what is the current value of X
- histogram tells how long or how big something is; groups observations into configurable bucket sizes (e.g. cumulative response time buckets <0.2s, <0.5s, <1s)
- summary is similar to histogram and tells us how many observations fell below X; you do not have to define quantiles ahead of time (similar to histogram, but percentages: response time 20% = <0.3s, 50% = <0.8s, 80% = <1s)
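The cumulative bucketing a histogram does can be sketched in a few lines of Python (the observations and bounds below are made-up values):

```python
# Sketch of how a Prometheus histogram buckets observations.
# All numbers below are made up for illustration.
observations = [0.05, 0.12, 0.31, 0.9, 1.4]  # e.g. response times in seconds
bounds = [0.2, 0.5, 1.0, float("inf")]       # upper bounds, like the `le` label

# Buckets are cumulative: an observation is counted in every bucket
# whose upper bound it does not exceed.
counts = {b: 0 for b in bounds}
for obs in observations:
    for b in bounds:
        if obs <= b:
            counts[b] += 1

# A real histogram also exposes a running sum and a total count.
total_sum = sum(observations)
total_count = len(observations)
print(counts)  # {0.2: 2, 0.5: 3, 1.0: 4, inf: 5}
```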
Q: How many total unique time series are there in this output?
node_arp_entries{instance="node1", job="node"} 200
node_arp_entries{instance="node2", job="node"} 150
node_cpu_seconds_total{cpu="0", instance="node1", mode="iowait"}
node_cpu_seconds_total{cpu="1", instance="node1", mode="iowait"}
node_cpu_seconds_total{cpu="0", instance="node1", mode="idle"}
node_cpu_seconds_total{cpu="1", instance="node1", mode="idle"}
node_cpu_seconds_total{cpu="1", instance="node2", mode="idle"}
node_memory_Active_bytes{instance="node1", job="node"} 419124
node_memory_Active_bytes{instance="node2", job="node"} 55589
A: 9
Q: What metric should be used to report the current memory utilization? A: gauge
Q: What metric should be used to report the amount of time a process has been running? A: counter
Q: Which of these is not a valid metric? A: 404_error_count
Q: How many labels does the following time series have? http_errors_total{instance="1.1.1.1:80", job="api", code="400", endpoint="/user", method="post"} 55234 A: 5
Q: A web app is being built that allows users to upload pictures, management would like to be able to track the size of uploaded pictures and report back the number of photos that were less than 10Mb, 50Mb, 100MB, 500MB, and 1Gb. What metric would be best for this? A: histogram
Q: What are the two labels every metric is assigned by default? A: instance, job
Q: What are the 4 types of prometheus metrics? A: counter, gauge, histogram, summary
Q: What are the two attributes provided by a metric? A: Help, Type
Q: For the metric http_requests_total{path="/auth", instance="node1", job="api"} 7782 ; What is the metric name? A: http_requests_total
- Expression browser
- Web UI for Prometheus server to query data
- up - returns which targets are in the up state (you can see the instance and job labels, and the value on the right - 0 or 1)
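For example, querying up on a fresh install shows the one self-scraped target (instance/port follow the default config):

```text
up{instance="localhost:9090", job="prometheus"}    1
```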
- Prometheus on Docker
- Pull the image: docker pull prom/prometheus
- Configure prometheus.yml
- Expose ports, bind mounts: docker run -d -v /path-to/prometheus.yml:/etc/prometheus/prometheus.yml -p 9090:9090 prom/prometheus
- PromTools
- check & validate configuration before applying it (e.g. to production)
- prevent downtime while config issues are being identified
- validate metrics passed to it are correctly formatted
- can perform queries on a Prom server
- debugging & profiling a Prom server
- perform unit tests against recording/alerting rules
promtool check config /etc/prometheus/prometheus.yml
- Container metrics
- metrics can be scraped from containerized envs
- docker engine metrics (how much CPU does Docker use etc.; no metrics specific to a container!):
- edit /etc/docker/daemon.json and add: { "metrics-addr": "127.0.0.1:9323", "experimental": true }
- sudo systemctl restart docker
- curl localhost:9323/metrics
- prometheus job update:
scrape_configs:
  - job_name: "docker"
    static_configs:
      - targets: ["12.1.13.4:9323"]
- cAdvisor (how much memory does each container use? container uptime? etc.):
- edit docker-compose.yml to pull gcr.io/cadvisor/cadvisor
- docker-compose up (or docker compose up)
- curl localhost:8080/metrics
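A minimal docker-compose.yml sketch for cAdvisor (the read-only mounts follow cAdvisor's documented setup; adjust to your host):

```yaml
version: "3"
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
```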
Last update: Mon Jan 16 04:27:57 UTC 2023