- Observability
- the ability to understand and measure the state of a system based on data generated by the system
- allows you to generate actionable outputs from unexpected scenarios
- to better understand the internals of your system
- greater need for observability in distributed systems & microservices
- troubleshooting - e.g. why are error rates high?
- the 3 pillars are logs, metrics, and traces:
  a. Logs - records of events that have occurred; encapsulate info about the specific event
  b. Metrics - numerical information about the state of a system; data can be aggregated over time; contains a name, value, timestamp, and dimensions
  c. Traces - follow operations (trace-id) as they travel through different hops; spans are the events forming a trace
- Prometheus only handles metrics, not logs or traces!
- SLO/SLA/SLI
a. SLI (service level indicators) = quantitative measure of some aspect of the level of service provided (availability, latency, error rate etc.)
- not all metrics make for good SLIs, you want to find metrics that accurately measure a user's experience
- high CPU, high memory are poor SLIs as they don't necessarily affect user's experience
b. SLO (service level objectives) = target value or range for an SLI
- examples: SLI = latency, SLO = latency < 100ms; SLI = availability, SLO = 99.99% uptime
- should be directly related to the customer experience
- purpose is to quantify reliability of a product to a customer
- resist the temptation to set them to overly aggressive values
- the goal is not to achieve perfection, but to make customers happy
c. SLA (service level agreement) = contract between a vendor and a user that guarantees SLO
- Prometheus fundamentals
- use cases:
- collect metrics from different locations like West DC, central DC, East DC, AWS etc.
- alert on high memory on the host running the MySQL DB and notify the operations team via email
- find out at which uploaded video length the application starts to degrade
- open-source monitoring tool that collects metrics data and provides tools to visualize the data
- allows you to generate alerts when a threshold is reached
- collects data by scraping targets that expose metrics through an HTTP endpoint
- scraped data is stored in a time-series DB and can be queried with the built-in PromQL
- what can it monitor:
- CPU/memory
- disk space
- service uptime
- app specific data - number of exceptions, latency, pending requests
- networking devices, databases etc.
- exclusively monitors numeric time-series data
- does not monitor events, system logs, or traces
- originally sponsored by SoundCloud
- written in Go
- Prometheus Architecture
- 3 core components:
- retrieval (scrapes metric data)
- TSDB (stores metric data)
- HTTP server (accepts PromQL query)
- a lot of other components make up the whole solution:
  - exporters (mini-processes running on the targets) that the retrieval component pulls metrics from
  - pushgateway (short-lived jobs push their metrics to it, and Prometheus then scrapes them from there)
  - service discovery provides the list of targets, so you don't have to hardcode those values
  - alertmanager handles the emails, SMS, Slack messages etc. after an alert is pushed to it
  - Prometheus Web UI, Grafana etc. for visualization
- collects metrics by sending an HTTP request to the `/metrics` endpoint of each target (the path can be changed)
- several native exporters:
  - node exporter (Linux)
  - Windows
  - MySQL
  - Apache
  - HAProxy
- client libraries to monitor application metrics (# of errors/exceptions, latency, job execution duration) for Go, Java, Python, Ruby, Rust - see the sketch below
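A minimal sketch of app instrumentation with the official Python client (`prometheus_client`); the metric names, port, and workload are made up for illustration:

```python
from prometheus_client import Counter, Histogram, start_http_server
import random, time

# hypothetical app metrics - a counter and a histogram with default buckets
REQUESTS = Counter('app_requests_total', 'Total requests handled')
LATENCY = Histogram('app_request_latency_seconds', 'Request latency in seconds')

@LATENCY.time()  # observes how long each call takes
def handle_request():
    REQUESTS.inc()  # counters can only go up
    time.sleep(random.random() / 10)  # simulate work

if __name__ == '__main__':
    start_http_server(8000)  # exposes /metrics on :8000 for Prometheus to scrape
    while True:
        handle_request()
```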
- pros of the pull-based model:
- easier to tell if the target is down
- does not DDoS the metrics server
- definitive list of targets to monitor (central source of truth)
- Prometheus Installation
- Download tar from http://prometheus.io/download
- untarred folder contains console_libraries, consoles, prometheus (binary), prometheus.yml (config) and promtool (CLI utility)
- Run `./prometheus`
- Open http://localhost:9090
- Execute the query `up` in the console to see the one target (itself)
- if everything works OK, we can turn it into a systemd service:
  - Create a user: `sudo useradd --no-create-home --shell /bin/false prometheus`
  - Create a config folder: `sudo mkdir /etc/prometheus`
  - Create `/var/lib/prometheus` for the data
  - Move the executables: `sudo cp prometheus /usr/local/bin ; sudo cp promtool /usr/local/bin`
  - Move the config file: `sudo cp prometheus.yml /etc/prometheus/`
  - Copy the consoles folders: `sudo cp -r consoles /etc/prometheus/ ; sudo cp -r console_libraries /etc/prometheus/`
  - Change the owner of these folders & executables: `sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus /usr/local/bin/prometheus /usr/local/bin/promtool`
  - Test-run as the service user - the command (which becomes the unit's ExecStart, minus `sudo -u prometheus`) looks like this: `sudo -u prometheus /usr/local/bin/prometheus --config.file /etc/prometheus/prometheus.yml --storage.tsdb.path /var/lib/prometheus --web.console.templates=/etc/prometheus/consoles --web.console.libraries=/etc/prometheus/console_libraries`
  - Create the service file `/etc/systemd/system/prometheus.service` with this information (sketch below) and reload: `sudo systemctl daemon-reload`
  - Start the daemon: `sudo systemctl start prometheus ; sudo systemctl enable prometheus`
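A sketch of what the unit file could look like, assuming the user and paths created above:

```ini
# /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
    --config.file /etc/prometheus/prometheus.yml \
    --storage.tsdb.path /var/lib/prometheus \
    --web.console.templates=/etc/prometheus/consoles \
    --web.console.libraries=/etc/prometheus/console_libraries

[Install]
WantedBy=multi-user.target
```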
- Node exporter
- Download tar from http://prometheus.io/download
- untarred folder contains basically just the `node_exporter` binary
- Run `./node_exporter`, then `curl localhost:9100/metrics`
- Run in the background & start on boot using systemd:
  - `sudo cp node_exporter /usr/local/bin`
  - `sudo useradd --no-create-home --shell /bin/false node_exporter`
  - `sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter`
  - `sudo vi /etc/systemd/system/node_exporter.service` (sketch below)
  - `sudo systemctl daemon-reload`
  - `sudo systemctl start node_exporter ; sudo systemctl enable node_exporter`
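The node_exporter unit is analogous to the Prometheus one, just with a different user and binary (sketch):

```ini
# /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
```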
- Prometheus configuration
- Sections:
  a. `global` - default parameters; can be overridden by the same variables in sub-sections
  b. `scrape_configs` - defines targets and `job_name`, a collection of instances that need to be scraped
  c. `alerting`
  d. `rule_files`
  e. `remote_read` & `remote_write`
  f. `storage`
- Some examples:
scrape_configs:
- job_name: 'nodes' # call it whatever
scrape_interval: 30s # from the target every X seconds
scrape_timeout: 3s # timeouts after X seconds
scheme: https # http or https
metrics_path: /stats/metrics # non-default path that you send requests to
static_configs:
- targets: ['10.231.1.2:9090', '192.168.43.9:9090'] # two IPs
      # basic_auth  # this is the next section
- Reload the config: `sudo systemctl restart prometheus`
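A full restart works, but Prometheus can also reload its configuration in place; a sketch of the options (the HTTP endpoint only exists when Prometheus was started with `--web.enable-lifecycle`):

```sh
sudo systemctl restart prometheus             # full restart, as above
kill -HUP $(pgrep prometheus)                 # SIGHUP triggers a config reload
curl -X POST http://localhost:9090/-/reload   # lifecycle endpoint, needs --web.enable-lifecycle
```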
- Encryption & Authentication
- between Prometheus and targets
- On the targets, you need to generate the keys:
  `sudo openssl req -new -newkey rsa:2048 -days 465 -nodes -x509 -keyout node_exporter.key -out node_exporter.crt -subj "..." -addext "subjectAltName = DNS:localhost"`
  - this will generate a key & crt pair
- the exporter's web config (e.g. config.yml) will have to be customized:
  tls_server_config:
    cert_file: node_exporter.crt
    key_file: node_exporter.key
- run it with `./node_exporter --web.config=config.yml`
- verify with `curl -k https://localhost:9100/metrics` (`-k` allows the self-signed cert)
- On the server:
  - copy the `node_exporter.crt` from the target to the Prometheus server
  - update the `scheme` to `https` in `prometheus.yml` and add a `tls_config` with `ca_file` (e.g. /etc/prometheus/node_exporter.crt that we copied in the previous step) plus `insecure_skip_verify: true` if self-signed
  - restart the prometheus service
scrape_configs:
- job_name: "node"
scheme: https
tls_config:
ca_file: /etc/prometheus/node_exporter.crt
      insecure_skip_verify: true
- Authentication is done via a generated hash: install `apache2-utils` (or `httpd-tools` etc.) and run `htpasswd -nBC 12 "" | tr -d ':\n'` (it prompts for a password and spits out the hash)
- add `basic_auth_users` with username + hash underneath it, then restart the node_exporter service:
  # /etc/node_exporter/config.yml
  basic_auth_users:
    prometheus: $2y$12$daXru320983rnofkwehj4039F
- update the Prometheus server's config with the same auth:
  - job_name: "node"
    basic_auth:
      username: prometheus
      password: <PLAIN TEXT PASSWORD!>
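Putting the TLS and auth pieces together, the target-side web config might look like this (file path and hash are placeholders):

```yaml
# /etc/node_exporter/config.yml
tls_server_config:
  cert_file: node_exporter.crt
  key_file: node_exporter.key
basic_auth_users:
  prometheus: <bcrypt hash produced by htpasswd>
```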
- Metrics
  - 3 properties:
    - name - the general feature of a system to be measured; may contain ASCII letters, digits, underscores and colons (`[a-zA-Z_:][a-zA-Z0-9_:]*`); colons are reserved for recording rules, and metric names cannot start with a number; the name is technically a label itself (e.g. `__name__="node_cpu_seconds_total"`)
    - {labels (key/value pairs)} - allow splitting up a metric by a specified criterion (e.g. multiple CPUs, specific HTTP methods, API endpoints etc.); metrics can have more than one label; ASCII letters, digits, underscores (`[a-zA-Z0-9_]*`); labels surrounded by `__` are considered internal to Prometheus; every metric is assigned 2 labels by default (`instance` and `job`)
    - metric value
  - Example: `node_cpu_seconds_total{cpu="0",mode="idle"} 258277.86` - the labels tell us which CPU and mode this metric is for
  - when Prometheus scrapes a target and retrieves metrics, it also stores the time at which the metric was scraped
    - Example: 1668215300 (Unix epoch timestamp - seconds since Jan 1st 1970 UTC)
  - time series = a stream of timestamped values sharing the same metric and set of labels
- metrics have a `TYPE` (counter, gauge, histogram, summary) and a `HELP` (description of what the metric is) attribute:
  - counter - can only go up; "how many times did X happen?"
  - gauge - can go up or down; "what is the current value of X?"
  - histogram - tells how long or how big something is; groups observations into configurable buckets (e.g. cumulative response-time buckets <0.2s, <0.5s, <1s) - `request_latency_seconds_bucket{le="0.05"} 50`. Buckets are cumulative: the `le="0.03"` bucket includes all requests that took less than 0.03s, which includes all requests that fall into the buckets below it (e.g. 0.02, 0.01). To calculate a histogram's quantiles, use `histogram_quantile`, an approximation of the value at a specific quantile - "75% of all requests have what latency?" is `histogram_quantile(0.75, request_latency_seconds_bucket)`. To get an accurate value, make sure there is a bucket at the specific value that needs to be met - but every bucket you add slows Prometheus down a little! (see the sample exposition below)
  - summary - similar to histogram; tells us how many observations fell below X, reporting quantiles directly so the server does not have to calculate them (e.g. response time: 20% = <0.3s, 50% = <0.8s, 80% = <1s). Like a histogram it exposes `_count` and `_sum` metrics, plus quantiles such as 0.7, 0.8, 0.9 instead of buckets
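For reference, a hypothetical histogram as it would appear on a target's /metrics endpoint - note the `HELP`/`TYPE` attributes, the cumulative buckets, and the `_sum`/`_count` series:

```
# HELP request_latency_seconds Request latency in seconds
# TYPE request_latency_seconds histogram
request_latency_seconds_bucket{le="0.05"} 50
request_latency_seconds_bucket{le="0.1"} 70
request_latency_seconds_bucket{le="0.5"} 78
request_latency_seconds_bucket{le="+Inf"} 80
request_latency_seconds_sum 4.3
request_latency_seconds_count 80
```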
| histogram | summary |
|---|---|
| bucket sizes can be picked | quantile must be defined ahead of time |
| less taxing on client libraries | more taxing on client libraries |
| any quantile can be selected | only quantiles predefined in client can be used |
| Prometheus server must calculate quantiles | very minimal server-side cost |
Q: How many total unique time series are there in this output?
node_arp_entries{instance="node1", job="node"} 200
node_arp_entries{instance="node2", job="node"} 150
node_cpu_seconds_total{cpu="0", instance="node1", mode="iowait"}
node_cpu_seconds_total{cpu="1", instance="node1", mode="iowait"}
node_cpu_seconds_total{cpu="0", instance="node1", mode="idle"}
node_cpu_seconds_total{cpu="1", instance="node1", mode="idle"}
node_cpu_seconds_total{cpu="1", instance="node2", mode="idle"}
node_memory_Active_bytes{instance="node1", job="node"} 419124
node_memory_Active_bytes{instance="node2", job="node"} 55589
A: 9
Q: What metric should be used to report the current memory utilization? A: gauge
Q: What metric should be used to report the amount of time a process has been running? A: counter
Q: Which of these is not a valid metric? A: 404_error_count
Q: How many labels does the following time series have? http_errors_total{instance="1.1.1.1:80", job="api", code="400", endpoint="/user", method="post"} 55234 A: 5
Q: A web app is being built that allows users to upload pictures, management would like to be able to track the size of uploaded pictures and report back the number of photos that were less than 10Mb, 50Mb, 100MB, 500MB, and 1Gb. What metric would be best for this? A: histogram
Q: What are the two labels every metric is assigned by default? A: instance, job
Q: What are the 4 types of prometheus metrics? A: counter, gauge, histogram, summary
Q: What are the two attributes provided by a metric? A: Help, Type
Q: For the metric http_requests_total{path="/auth", instance="node1", job="api"} 7782 ; What is the metric name? A: http_requests_total
Q: For the http_request_total metric, what is the query/metric name that would be used to get the count of total requests on node node01:3000?
A: http_request_total_count{instance="node01:3000"}
Q: Construct a query to return the total number of requests for the /events route with a latency of less than 0.4s across all nodes.
A: http_request_total_bucket{route="/events",le="0.4"}
Q: Construct a query to find out how many requests took somewhere between 0.08s and 0.1s on node node02:3000 (buckets are cumulative, so subtract the lower bucket from the higher one). A: `http_request_total_bucket{instance="node02:3000", le="0.1"} - http_request_total_bucket{instance="node02:3000", le="0.08"}`
Q: Construct a query to calculate the rate of http requests that took less than 0.08s. Use a time window of 1m across all nodes. A: `sum(rate(http_request_total_bucket{le="0.08"}[1m]))`
Q: Construct a query to calculate the average latency of a request over the past 4 minutes. Use the formula below to calculate average latency of request: rate of sum-of-all-requests / rate of count-of-all-requests A: `rate(http_request_total_sum[4m]) / rate(http_request_total_count[4m])`
Q: Management would like to know what is the 95th percentile for the latency of requests going to node node01:3000. Construct a query to calculate the 95th percentile. A: `histogram_quantile(0.95, http_request_total_bucket{instance="node01:3000"})`
Q: The company is now offering customers an SLO stating that, 95% of all requests will be under 0.15s. What bucket size will need to be added to guarantee that the histogram_quantile function can accurately report whether or not that SLO has been met? A: 0.15
Q: A summary metric http_upload_bytes has been added to track the amount of bytes uploaded per request. What are the percentiles being reported by this metric?
(A) 0.02, 0.05, 0.08, 0.1, 0.13, 0.18, 0.21, 0.24, 0.3, 0.35, 0.4
(B) 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99
(C) events, tickets
(D) 200, 201, 400, 404
A: (B) - a summary's quantiles run from 0 to 1 and typically end at high percentiles like 0.95 and 0.99; (A) looks like bucket boundaries, (C) and (D) are label values
- Expression browser
- Web UI for Prometheus server to query data
- `up` - returns which targets are in up state (you can see the `instance` and `job` labels and the value on the right - `0` or `1`)
- Prometheus on Docker
- Pull the image: `docker pull prom/prometheus`
- Configure `prometheus.yml`
- Expose the port & bind-mount the config: `docker run -d -v /path-to/prometheus.yml:/etc/prometheus/prometheus.yml -p 9090:9090 prom/prometheus`
- Promtool
  - check & validate configuration before applying it (e.g. to production)
  - prevents downtime while config issues are being identified
  - validates that metrics passed to it are correctly formatted
  - can perform queries on a Prometheus server
  - debugging & profiling a Prometheus server
  - performs unit tests against recording/alerting rules
promtool check config /etc/prometheus/prometheus.yml
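A few more promtool invocations matching the bullets above (server URL and file names are examples):

```sh
promtool check rules /etc/prometheus/rules.yml            # validate a rule file on its own
curl -s localhost:9100/metrics | promtool check metrics   # lint metrics exposition format
promtool query instant http://localhost:9090 'up'         # run a query against a live server
promtool test rules tests.yml                             # unit-test recording/alerting rules
```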
- Container metrics
- metrics can be scraped from containerized envs
- Docker Engine metrics (how much CPU does Docker use etc.; no metrics specific to a container!):
  - `vi /etc/docker/daemon.json`: `{ "metrics-addr": "127.0.0.1:9323", "experimental": true }`
  - `sudo systemctl restart docker`
  - `curl localhost:9323/metrics`
  - prometheus job update:
    scrape_configs:
      - job_name: "docker"
        static_configs:
          - targets: ["12.1.13.4:9323"]
- cAdvisor (how much memory does each container use? container uptime? etc.):
  - `vi docker-compose.yml` to pull `gcr.io/cadvisor/cadvisor` (sketch below)
  - `docker-compose up` (or `docker compose up`)
  - `curl localhost:8080/metrics`
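A minimal docker-compose.yml for cAdvisor, close to the one in its docs (the mounts give it read-only access to host and container stats):

```yaml
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
```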
- PromQL
  - short for Prometheus Query Language
  - data returned can be visualized in dashboards
  - used to build alerting rules to notify about thresholds
- 4 data types:
  - String (currently unused)
  - Scalar - a numeric floating-point value (e.g. `54.743`)
  - Instant vector - a set of time series containing a single sample for each time series, all sharing the same timestamp (e.g. `node_cpu_seconds_total` finds all unique label sets and one value for each, all at a single point in time)
  - Range vector - a set of time series containing a range of data points over time for each time series (e.g. `node_cpu_seconds_total[3m]` finds all unique label sets, with all values and timestamps from the past 3 minutes)
- if we only want to return a subset of time series for a metric, use label matchers:
  - exact match `=` (e.g. `node_filesystem_avail_bytes{instance="node1"}`)
  - negative equality `!=` (e.g. `node_filesystem_avail_bytes{device!="tmpfs"}`)
  - regular expression `=~` (e.g. starts with /dev/sda - `node_filesystem_avail_bytes{device=~"/dev/sda.*"}`)
  - negative regular expression `!~` (e.g. mountpoint does not start with /boot - `node_filesystem_avail_bytes{mountpoint!~"/boot.*"}`)
- we can combine multiple selectors with commas (e.g. `node_filesystem_avail_bytes{instance="node1",device!="tmpfs"}`)
- to get historic data, use an `offset` modifier after the label matching (e.g. get the value from 5 minutes ago - `node_memory_Active_bytes{instance="node1"} offset 5m`)
- to go to an exact point in time, use the `@` modifier (e.g. get the value on September 15 - `node_memory_Active_bytes{instance="node1"} @1663265188`)
- you can use both modifiers together and the order does not matter (`@1663265188 offset 5m` = `offset 5m @1663265188`)
- they also work with range vectors (e.g. get 2 minutes worth of data from 5 minutes before September 15 - `node_memory_Active_bytes{instance="node1"}[2m] @1663265188 offset 5m`)
- Operators: between instant vectors and scalars
  - types:
    - Arithmetic `+`, `-`, `*`, `/`, `%`, `^` (e.g. `node_memory_Active_bytes / 1024` - but this drops the metric name in the output, as it is no longer the original metric!)
    - Comparison `==`, `!=`, `>`, `<`, `>=`, `<=` plus the `bool` modifier (e.g. `node_network_flags > 100`, `node_network_receive_packets_total >= 220`; `node_filesystem_avail_bytes < bool 1000` returns `0` or `1`, mostly for generating alerts)
    - Logical `or`, `and`, `unless` (e.g. `node_filesystem_avail_bytes > 1000 and node_filesystem_avail_bytes < 3000`). The `unless` operator results in a vector consisting of the elements on the left side for which there are no matching elements on the right side (e.g. return all vectors greater than 1000 unless they are greater than 30000 - `node_filesystem_avail_bytes > 1000 unless node_filesystem_avail_bytes > 30000`)
  - with more than one operator, the order of precedence applies from highest to lowest; operators on the same precedence level are left-associative (e.g. `2 * 3 % 2` = `(2 * 3) % 2`), except power, which is right-associative (e.g. `2 ^ 3 ^ 2` = `2 ^ (3 ^ 2)`); precedence from highest to lowest:
    1. `^`
    2. `*`, `/`, `%`, `atan2`
    3. `+`, `-`
    4. `==`, `!=`, `<=`, `<`, `>=`, `>`
    5. `and`, `unless`
    6. `or`
Q: Construct a query to return all filesystems that have over 1000 bytes available on all instances under web job.
A: node_filesystem_avail_bytes{job="web"} > 1000
Q: Which of the following queries you will use for loadbalancer:9100 host to return all the interfaces that have received less than or equal to 10000 bytes of traffic?
A: node_network_receive_bytes_total{instance="loadbalancer:9100"} <= 10000
Q: node_filesystem_files tracks the filesystem's total file nodes. Construct a query that only returns time series greater than 500000 and less than 10000000 across all jobs
A: node_filesystem_files > 500000 and node_filesystem_files < 10000000
Q: The metric node_filesystem_avail_bytes lists the available bytes for all filesystems, and the metric node_filesystem_size_bytes lists the total size of all filesystems. Run each metric and see their outputs. There are three properties/labels these will return, device, fstype, and mountpoint. Which of the following queries will show the percentage of free disk space for all filesystems on all the targets under web job whose device label does not match tmpfs?
A: node_filesystem_avail_bytes{job="web", device!="tmpfs"}*100 / node_filesystem_size_bytes{job="web", device!="tmpfs"}
- Vector matching: between 2 instant vectors (e.g. to get the percentage of free space - `node_filesystem_avail_bytes / node_filesystem_size_bytes * 100`)
  - samples with exactly the same labels get matched together (e.g. instance, job, and mountpoint must all be the same to get a match); every element in the vector on the left tries to find a single matching element on the right
  - to perform an operation on 2 vectors with differing labels (like `http_errors` with `code="500"`, `code="501"`, `code="404"`, `method="put"` etc.), use the `ignoring` keyword (e.g. `http_errors{code="500"} / ignoring(code) http_requests`)
  - if entries (e.g. the ones with methods `put` and `del`) have no match in both metrics `http_errors` and `http_requests`, they will not show up in the results!
  - to specify the labels to match on, use the `on` keyword (e.g. `http_errors{code="500"} / on(method) http_requests`)
| vector1 | + vector2 | = resulting vector |
|---|---|---|
| {cpu=0,mode=idle} | {cpu=1,mode=steal} | {cpu=0} |
| {cpu=1,mode=iowait} | {cpu=2,mode=user} | {cpu=1} |
| {cpu=2,mode=user} | {cpu=0,mode=idle} | {cpu=2} |
- the resulting vector will have the matching elements with all labels listed in `on`, or all labels not excluded by `ignoring`: e.g. `vector1{} + on(cpu) vector2{}` or `vector1{} + ignoring(mode) vector2{}`
- another example: `http_errors_total / ignoring(error) http_requests_total` = `http_errors_total / on(instance, job, path) http_requests_total`
Q: Which of the following queries can be used to track the total number of seconds cpu has spent in user + system mode for instance loadbalancer:9100?
A: node_cpu_seconds_total{instance="loadbalancer:9100", mode="user"} + ignoring(mode) node_cpu_seconds_total{instance="loadbalancer:9100", mode="system"}
Q: Construct a query that will find out what percentage of time each cpu on each instance was spent in mode user. To calculate the percentage in mode user, get the total seconds spent in mode user and divide that by the sum of the time spent across all modes. Further, multiply that result by 100 to get a percentage.
A: node_cpu_seconds_total{mode="user"}*100 /ignoring(mode, job) sum by(instance, cpu) (node_cpu_seconds_total)
- the query error `multiple matches for labels: many-to-one matching must be explicit (group_left/group_right)` appears when each vector element on the "one" side can match multiple elements on the "many" side (e.g. `http_errors + on(path) group_left http_requests`) - `group_left` tells PromQL that elements from the right side are matched with multiple elements from the left (`group_right` is the opposite - it depends on which side is the "many" and which is the "one")
| many | + one | = resulting vector |
|---|---|---|
| {error=400,path=/cats} 2 | {path=/cats} 2 | {error=400,path=/cats} 4 |
| {error=500,path=/cats} 5 | {path=/cats} 2 | {error=500,path=/cats} 7 |
| {error=400,path=/dogs} 1 | {path=/dogs} 7 | {error=400,path=/dogs} 8 |
| {error=500,path=/dogs} 7 | {path=/dogs} 7 | {error=500,path=/dogs} 14 |
Q: The api job collects metrics on an API used for uploading files. The API has 3 endpoints /images /videos, and /songs, which are used to upload respective file types. The API provides 2 metrics to track: http_uploaded_bytes_total - tracks the number of uploaded bytes and http_upload_failed_bytes_total - tracks the number of bytes failed to upload. Construct a query to calculate the percentage of bytes that failed for each endpoint. The formula for the same is http_upload_failed_bytes_total*100 / http_uploaded_bytes_total.
A: http_upload_failed_bytes_total*100 / ignoring(error) group_left http_uploaded_bytes_total
- Aggregation operators: allow you to take an instant vector and aggregate its elements, resulting in a new instant vector with fewer elements
  - `sum`, `min`, `max`, `avg`, `group`, `stddev`, `stdvar`, `count`, `count_values`, `bottomk`, `topk`, `quantile`
  - for example `sum(http_requests)`, `max(http_requests)`
  - the `by` keyword allows you to choose which labels to aggregate along (e.g. `sum by(path) (http_requests)`, `sum by(method) (http_requests)`, `sum by(instance) (http_requests)`, `sum by(instance, method) (http_requests)`)
  - the `without` keyword does the opposite of `by` and tells the query which labels not to include in the aggregation (e.g. `sum without(cpu, mode) (node_cpu_seconds_total)`)
Q: On loadbalancer:9100 instance, calculate the sum of the size of all filesystems. The metric to get filesystem size is node_filesystem_size_bytes
A: sum(node_filesystem_size_bytes{instance="loadbalancer:9100"})
Q: Construct a query to find how many CPUs instance loadbalancer:9100 have. You can use the node_cpu_seconds_total metric to find out the same.
A: count(sum by (cpu) (node_cpu_seconds_total{instance="loadbalancer:9100"}))
Q: Construct a query that will show the number of CPUs on each instance across all jobs.
A: `count by(instance) (sum by(instance, cpu) (node_cpu_seconds_total))`
Q: Use the node_network_receive_bytes_total metric to calculate the sum of the total received bytes across all interfaces on per instance basis
A: sum by(instance)(node_network_receive_bytes_total)
Q: Which of the following queries will be used to calculate the average packet size for each instance?
A: sum by(instance)(node_network_receive_bytes_total) / sum by(instance)(node_network_receive_packets_total)
- Functions: sorting, math, label transformations, metric manipulation
  - use the `round` function to round a query's result to the nearest integer value
  - round up to the closest integer: `ceil(node_cpu_seconds_total)`
  - round down: `floor(node_cpu_seconds_total)`
  - absolute value of negative numbers: `abs(1-node_cpu_seconds_total)`
  - date & time: `time()`, `minute()` etc.
  - the `vector` function takes a scalar value and converts it into an instant vector: `vector(4)`
  - the `scalar` function returns the value of a single-element vector as a scalar (and returns `NaN` if the input vector does not have exactly one element): `scalar(process_start_time_seconds)`
  - sorting: `sort` (ascending) and `sort_desc` (descending)
  - rate at which a counter metric increases: `rate` and `irate` (e.g. group data points into 60-second windows, take the last value minus the first value in each window and divide by 60: `rate(http_errors[1m])`; `irate` is similar, but uses only the last two data points in each window: `irate(http_errors[1m])`)
| rate | irate |
|---|---|
| looks at the first and last data points within a range | looks at the last two data points within a range |
| effectively an average rate over the range | instant rate |
| best for slow-moving counters and alerting rules | best for graphing volatile, fast-moving counters |
Note:
- make sure there are at least 4 samples within the time range (e.g. a 15s scrape interval and a 60s window gives 4 samples)
- when combining rate with an aggregation operator, always take `rate()` first, then aggregate (so it can detect counter resets)
- to get the rate of increase of the sum of latency across all requests: `rate(requests_latency_seconds_sum[1m])`
- to calculate the average latency of a request over the past 5m: `rate(requests_latency_seconds_sum[5m]) / rate(requests_latency_seconds_count[5m])`
Q: Management wants to keep track of the rate of bytes received by each instance. Each instance has two interfaces, so the rate of traffic being received on them must be summed up. Calculate the rate of received node_network_receive_bytes_total using 2-minute window, sum the rates across all interfaces, and group the results by instance. Save the query in /root/traffic.txt.
A: sum by(instance) (rate(node_network_receive_bytes_total[2m]))
- Subqueries
  - Syntax: `<instant_query> [<range>:<resolution>] [offset <duration>]`
  - Example: `rate(http_requests_total[1m]) [5m:30s]` - the sample range is 1m, the query range covers the last 5m of data, and the query step for the subquery is 30s (the gap between evaluations)
  - maximum value of a gauge metric over 10 minutes: `max_over_time(node_filesystem_avail_bytes[10m])`
  - for counter metrics, we need the max value of the rate over the past 5 minutes (e.g. maximum rate of requests over the last 5 minutes, with a 30s query step and a sample range of 1m): `max_over_time(rate(http_requests_total[1m]) [5m:30s])`
Q: There were reports of a small outage of an application in the past few minutes, and some alerts pointed to potential high iowait on the CPUs. We need to calculate when the iowait rate was the highest over the past 10 minutes. [Construct a subquery that will calculate the rate at which all cpus spent in iowait mode using a 1 minute time window for the rate function. Find the max value of this result over the past 10 minutes using a 30s query step for the subquery.] A: `max_over_time(rate(node_cpu_seconds_total{mode="iowait"}[1m]) [10m:30s])`
Q: Construct a query to calculate the average over time (avg_over_time) rate of http_requests_total over the past 20m using 1m query step. A: `avg_over_time(rate(http_requests_total[1m]) [20m:1m])`
- Recording rules
  - allow Prometheus to periodically evaluate PromQL expressions and store the resulting time series generated by them
  - speeds up your dashboards
  - provides aggregated results for use elsewhere
  - recording rules go in a separate file called a rule file, referenced from the main config:
    global: ...
    rule_files:
      - rules.yml  # globs can be used here, like /etc/prometheus/rule_files.d/*.yml
    scrape_configs: ...
- Prometheus server must be restarted after this change
- syntax of the `rules.yml` file:
    groups:  # groups are evaluated in parallel
      - name: <group name 1>
        interval: <evaluation interval, global by default>
        rules:  # rules within a group are evaluated sequentially
          - record: <rule name 1>
            expr: <promql expression 1>
            labels:
              <label name>: <label value>
          - record: <rule name 2>  # can also reference the previous rule(s)
            expr: <promql expression 2>
            labels:
      - name: <group name 2>
        ...
- example of the `rules.yml` file:
    groups:
      - name: example1  # it will show up in the Web UI under "Status" -> "Rules"
        interval: 15s
        rules:
          - record: node_memory_MemFree_percent
            expr: 100 - (100 * node_memory_MemFree_bytes / node_memory_MemTotal_bytes)
          - record: node_filesystem_free_percent
            expr: 100 * node_filesystem_free_bytes / node_filesystem_size_bytes
- best practice for rule naming is `level:metric_name:operations`, where level is the aggregation level (the output labels) of the rule. E.g. we have an `http_errors` counter with two instrumentation labels, `method` and `path`; all the rules for a specific job should be contained in a single group. It will look like:
    - record: job_method_path:http_errors:rate5m
      expr: sum without(instance) (rate(http_errors{job="api"}[5m]))
- HTTP API
  - execute queries, gather information on alerts, rules, and service-discovery-related configs
  - send a POST request to `http://<prometheus_ip>:9090/api/v1/query`
  - example: `curl http://<prometheus_ip>:9090/api/v1/query --data 'query=node_arp_entries{instance="192.168.1.168:9100"}'`
  - to query at a specific time, just add another `--data 'time=169386192'`
  - the response comes back as JSON
Last update: Tue Jan 17 05:45:09 UTC 2023