Prometheus Certified Associate (PCA)

Curriculum

  1. 28% PromQL
  • Selecting Data
  • Rates and Derivatives
  • Aggregating over time
  • Aggregating over dimensions
  • Binary operators
  • Histograms
  • Timestamp Metrics
  2. 20% Prometheus Fundamentals
  • System Architecture
  • Configuration and Scraping
  • Understanding Prometheus Limitations
  • Data Model and Labels
  • Exposition Format
  3. 18% Observability Concepts
  • Metrics
  • Understand logs and events
  • Tracing and Spans
  • Push vs Pull
  • Service Discovery
  • Basics of SLOs, SLAs, and SLIs
  4. 18% Alerting & Dashboarding
  • Dashboarding basics
  • Configuring Alerting rules
  • Understand and Use Alertmanager
  • Alerting basics (when, what, and why)
  5. 16% Instrumentation & Exporters
  • Client Libraries
  • Instrumentation
  • Exporters
  • Structuring and naming metrics

Observability Fundamentals

  1. Observability
  • the ability to understand and measure the state of a system based on data generated by the system
  • allows you to generate actionable outputs from unexpected scenarios
  • to better understand the internals of your system
  • greater need for observability in distributed systems & microservices
  • troubleshooting - e.g. why are error rates high?
  • 3 pillars are logging, metrics, traces:

a. Logs - records of events that have occurred, encapsulating info about the specific event
b. Metrics - numerical information about the state of a system; data can be aggregated over time; contains a name, value, timestamp and dimensions
c. Traces - follow operations (trace-id) as they travel through different hops; spans are the events that form a trace

  • Prometheus only handles metrics, not logs or traces!
  2. SLO/SLA/SLI

a. SLI (service level indicators) = quantitative measure of some aspect of the level of service provided (availability, latency, error rate etc.)

  • not all metrics make for good SLIs, you want to find metrics that accurately measure a user's experience
  • high CPU, high memory are poor SLIs as they don't necessarily affect user's experience

b. SLO (service level objectives) = target value or range for an SLI

  • examples: SLI = latency, SLO = latency < 100ms; SLI = availability, SLO = 99.99% uptime
  • should be directly related to the customer experience
  • purpose is to quantify reliability of a product to a customer
  • it may be tempting to set overly aggressive values
  • goal is not to achieve perfection, but make customers happy

c. SLA (service level agreement) = contract between a vendor and a user that guarantees SLO

Prometheus Fundamentals

  1. Prometheus fundamentals
  • use cases:
    • collect metrics from different locations like West DC, central DC, East DC, AWS etc.
    • alert on high memory on the host running the MySQL DB and notify the operations team via email
    • find out at which uploaded video length the application starts to degrade
  • open source monitoring tool that collects metrics data and provides tools to visualize the data
  • allows you to generate alerts when a threshold is reached
  • collects data by scraping targets that expose metrics through an HTTP endpoint
  • scraped data is stored in a time series DB and can be queried with the built-in PromQL
  • what can it monitor:
    • CPU/memory
    • disk space
    • service uptime
    • app specific data - number of exceptions, latency, pending requests
    • networking devices, databases etc.
  • exclusively monitors numeric time-series data
  • does not monitor events, system logs, traces
  • originally sponsored by SoundCloud
  • written in Go
  2. Prometheus Architecture
  • 3 core components:
    • retrieval (scrapes metric data)
    • TSDB (stores metric data)
    • HTTP server (accepts PromQL query)
  • lots of other components make up the whole solution:
    • exporters (mini-processes running on the targets) that the retrieval component pulls metrics from
    • pushgateway (short-lived jobs send their data to it, and Prometheus retrieves it from there)
    • service discovery is all about providing a list of targets so you don't have to hardcode those values
    • alertmanager handles all of the emails, SMS, Slack messages etc. after alerts are pushed to it
    • Prometheus Web UI or Grafana etc.
  • collects by sending HTTP request to /metrics endpoint of each target, path can be changed
  • several native exporters:
    • node exporters (Linux)
    • Windows
    • MySQL
    • Apache
    • HAProxy
    • client libraries to monitor application metrics (# of errors/exceptions, latency, job execution duration) for Go, Java, Python, Ruby, Rust
  • Pull-based approach pros:
    • easier to tell if the target is down
    • does not DDoS the metrics server
    • definitive list of targets to monitor (central source of truth)
  3. Prometheus Installation
  • Download tar from http://prometheus.io/download
  • untarred folder contains console_libraries, consoles, prometheus (binary), prometheus.yml (config) and promtool (CLI utility)
  • Run ./prometheus
  • Open http://localhost:9090
  • Execute the query up in the console to see the one target (itself) - it should work OK, so we can turn it into a systemd service
  • Create a user sudo useradd --no-create-home --shell /bin/false prometheus
  • Create a config folder sudo mkdir /etc/prometheus
  • Create /var/lib/prometheus for the data
  • Move executables sudo cp prometheus /usr/local/bin ; sudo cp promtool /usr/local/bin
  • Move config file sudo cp prometheus.yml /etc/prometheus/
  • Copy the consoles folder sudo cp -r consoles /etc/prometheus/ ; sudo cp -r console_libraries /etc/prometheus/
  • Change owner for these folders & executables sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus /usr/local/bin/prometheus /usr/local/bin/promtool
  • The command (ExecStart) will then look like this: sudo -u prometheus /usr/local/bin/prometheus --config.file /etc/prometheus/prometheus.yml --storage.tsdb.path /var/lib/prometheus --web.console.templates /etc/prometheus/consoles --web.console.libraries=/etc/prometheus/console_libraries
  • Create a service file /etc/systemd/system/prometheus.service with this information (see the sketch below) and reload systemd with sudo systemctl daemon-reload
  • Start the daemon sudo systemctl start prometheus ; sudo systemctl enable prometheus
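  • A minimal /etc/systemd/system/prometheus.service sketch based on the steps above (in the unit file the prometheus user is set via User= instead of sudo -u; adjust paths as needed):
    [Unit]
    Description=Prometheus
    Wants=network-online.target
    After=network-online.target

    [Service]
    User=prometheus
    Group=prometheus
    Type=simple
    ExecStart=/usr/local/bin/prometheus \
      --config.file=/etc/prometheus/prometheus.yml \
      --storage.tsdb.path=/var/lib/prometheus \
      --web.console.templates=/etc/prometheus/consoles \
      --web.console.libraries=/etc/prometheus/console_libraries

    [Install]
    WantedBy=multi-user.target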
  4. Node exporter
  • Download tar from http://prometheus.io/download
  • untarred folder contains basically just the binary node_exporter
  • Run the ./node_exporter and then curl localhost:9100/metrics
  • Run in the background & start on boot using the systemd
  • sudo cp node_exporter /usr/local/bin
  • sudo useradd --no-create-home --shell /bin/false node_exporter
  • sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
  • sudo vi /etc/systemd/system/node_exporter.service
  • sudo systemctl daemon-reload
  • sudo systemctl start node_exporter ; sudo systemctl enable node_exporter
  5. Prometheus configuration
  • Sections:
    a. global - default parameters, which can be overridden by the same variables in sub-sections
    b. scrape_configs - define targets and job_name, which is a collection of instances that need to be scraped
    c. alerting
    d. rule_files
    e. remote_read & remote_write
    f. storage
  • Some examples:
scrape_configs:
  - job_name: 'nodes'               # call it whatever
    scrape_interval: 30s            # from the target every X seconds
    scrape_timeout: 3s              # timeouts after X seconds
    scheme: https                   # http or https
    metrics_path: /stats/metrics    # non-default path that you send requests to
    static_configs:
      - targets: ['10.231.1.2:9090', '192.168.43.9:9090'] # two IPs
    # basic_auth                    # this is the next section
  • Reload the config sudo systemctl restart prometheus
  6. Encryption & Authentication
  • between Prometheus and targets

On the targets, you need to generate the keys:

  • sudo openssl req -new -newkey rsa:2048 -days 465 -nodes -x509 -keyout node_exporter.key -out node_exporter.crt -subj "..." -addext "subjectAltName = DNS:localhost" - this will generate a key & crt pair
  • config will have to be customized:
tls_server_config:
  cert_file: node_exporter.crt
  key_file: node_exporter.key
  • ./node_exporter --web.config=config.yml
  • curl -k https://localhost:9100/metrics

On the server:

  • copy the node_exporter.crt from the target to the Prometheus server
  • update the scheme to https in the prometheus.yml and add tls_config with ca_file (e.g. /etc/prometheus/node_exporter.crt that we copied in the previous step) and insecure_skip_verify if self-signed
  • restart prometheus service
scrape_configs:
  - job_name: "node"
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/node_exporter.crt
      insecure_skip_verify: true

Authentication is done via generated hash (sudo apt install apache2-utils or httpd-tools etc) and htpasswd -nBC 12 "" | tr -d ':\n' (will prompt for password and spits out the hash):

  • add the basic_auth_users and username + hash underneath it:
# /etc/node_exporter/config.yml
basic_auth_users:
  prometheus: $2y$12$daXru320983rnofkwehj4039F
  • restart node_exporter service
  • update Prometheus server's config with the same auth:
- job_name: "node"
  basic_auth:
    username: prometheus
    password: <PLAIN TEXT PASSWORD!>
  7. Metrics
  • 3 properties:
    • name - general feature of a system to be measured, may contain ASCII letters, numbers, underscores and colons ([a-zA-Z_:][a-zA-Z0-9_:]*); colons are reserved for recording rules only. Metric names cannot start with a number. The name is technically a label (e.g. __name__=node_cpu_seconds_total)
    • {labels (key/value pairs)} - allow splitting up a metric by a specified criteria (e.g. multiple CPUs, specific HTTP methods, API endpoints etc); metrics can have more than 1 label; ASCII letters, numbers, underscores ([a-zA-Z_][a-zA-Z0-9_]*). Labels surrounded by __ are considered internal to Prometheus. Every metric is assigned 2 labels by default (instance and job).
    • metric value
  • Example = node_cpu_seconds_total{cpu="0",mode="idle"} 258277.86: the labels provide us information about which cpu this metric is for
  • when Prometheus scrapes a target and retrieves metrics, it also stores the time at which the metric was scraped
  • Example = 1668215300 (unix epoch timestamp, since Jan 1st 1970 UTC)
  • time series = stream of timestamped values sharing the same metric and set of labels
  • metrics have a TYPE (counter, gauge, histogram, summary) and a HELP (description of what the metric is) attribute:
    • counter can only go up, how many times did X happen
    • gauge can go up or down, what is the current value of X
    • histogram tells how long or how big something is, grouping observations into configurable buckets (e.g. cumulative response time buckets <1s, <0.5s, <0.2s) - request_latency_seconds_bucket{le="0.05"} 50. Buckets are cumulative, i.e. the le="0.03" bucket includes all requests that took less than 0.03s, which includes all requests falling into the buckets below it (e.g. 0.02, 0.01). To calculate a histogram's quantiles we use histogram_quantile, an approximation of the value of a specific quantile: "75% of all requests have what latency?" = histogram_quantile(0.75, request_latency_seconds_bucket) (see the example after the table below). To get an accurate value, make sure there is a bucket at the specific value that needs to be met. Every bucket you add slows Prometheus down a little!
    • summary is similar to histogram and tells us how many observations fell below X, but you do not have to define quantiles ahead of time (similar to histogram, but percentages: response time 20% = <0.3s, 50% = <0.8s, 80% = <1s). Similarly to histogram, there will be _count and _sum metrics, as well as quantiles like 0.7, 0.8, 0.9 (instead of buckets).
| histogram | summary |
| --- | --- |
| bucket sizes can be picked | quantiles must be defined ahead of time |
| less taxing on client libraries | more taxing on client libraries |
| any quantile can be selected | only quantiles predefined in the client can be used |
| Prometheus server must calculate quantiles | very minimal server-side cost |
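In practice the quantile is usually calculated from the per-second rate of the buckets; a sketch using the request_latency_seconds histogram above:

histogram_quantile(0.95, sum by(le) (rate(request_latency_seconds_bucket[5m])))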

Q:How many total unique time series are there in this output?

node_arp_entries{instance="node1", job="node"} 200
node_arp_entries{instance="node2", job="node"} 150
node_cpu_seconds_total{cpu="0", instance="node1", mode="iowait"}
node_cpu_seconds_total{cpu="1", instance="node1", mode="iowait"}
node_cpu_seconds_total{cpu="0", instance="node1", mode="idle"}
node_cpu_seconds_total{cpu="1", instance="node1", mode="idle"}
node_cpu_seconds_total{cpu="1", instance="node2", mode="idle"}
node_memory_Active_bytes{instance="node1", job="node"} 419124
node_memory_Active_bytes{instance="node2", job="node"} 55589

A: 9

Q: What metric should be used to report the current memory utilization? A: gauge

Q: What metric should be used to report the amount of time a process has been running? A: counter

Q: Which of these is not a valid metric? A: 404_error_count

Q: How many labels does the following time series have? http_errors_total{instance="1.1.1.1:80", job="api", code="400", endpoint="/user", method="post"} 55234 A: 5

Q: A web app is being built that allows users to upload pictures, management would like to be able to track the size of uploaded pictures and report back the number of photos that were less than 10Mb, 50Mb, 100MB, 500MB, and 1Gb. What metric would be best for this? A: histogram

Q: What are the two labels every metric is assigned by default? A: instance, job

Q: What are the 4 types of prometheus metrics? A: counter, gauge, histogram, summary

Q: What are the two attributes provided by a metric? A: Help, Type

Q: For the metric http_requests_total{path="/auth", instance="node1", job="api"} 7782 ; What is the metric name? A: http_requests_total

Q: For the http_request_total metric, what is the query/metric name that would be used to get the count of total requests on node node01:3000? A: http_request_total_count{instance="node01:3000"}

Q: Construct a query to return the total number of requests for the /events route with a latency of less than 0.4s across all nodes. A: http_request_total_bucket{route="/events",le="0.4"}

Q: Construct a query to find out how many requests took somewhere between 0.08s and 0.1s on node node02:3000. A:
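A possible answer (a sketch, assuming the same http_request_total histogram as above and that buckets for 0.08 and 0.1 exist): http_request_total_bucket{instance="node02:3000", le="0.1"} - http_request_total_bucket{instance="node02:3000", le="0.08"}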

Q: Construct a query to calculate the rate of http requests that took less than 0.08s. Use a time window of 1m across all nodes. A:
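A possible answer (a sketch): rate(http_request_total_bucket{le="0.08"}[1m]); wrap it in sum() if a single total across all nodes is wanted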

Q: Construct a query to calculate the average latency of a request over the past 4 minutes. Use the formula below to calculate average latency of request: rate of sum-of-all-requests / rate of count-of-all-requests A:
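A possible answer (a sketch, using the histogram's _sum and _count series): rate(http_request_total_sum[4m]) / rate(http_request_total_count[4m])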

Q: Management would like to know what is the 95th percentile for the latency of requests going to node node01:3000. Construct a query to calculate the 95th percentile. A:
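A possible answer (a sketch, following the earlier histogram_quantile example): histogram_quantile(0.95, http_request_total_bucket{instance="node01:3000"})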

Q: The company is now offering customers an SLO stating that, 95% of all requests will be under 0.15s. What bucket size will need to be added to guarantee that the histogram_quantile function can accurately report whether or not that SLO has been met? A: 0.15

Q: A summary metric http_upload_bytes has been added to track the amount of bytes uploaded per request. What are the percentiles being reported by this metric?

(A) 0.02, 0.05, 0.08, 0.1, 0.13, 0.18, 0.21, 0.24, 0.3, 0.35, 0.4
(B) 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99
(C) events, tickets
(D) 200, 201, 400, 404
A:


  8. Expression browser
  • Web UI for Prometheus server to query data
  • up - returns which targets are in up state (you can see an instance and job and value on the right - 0 and 1)
  9. Prometheus on Docker
  • Pull image prom/prometheus
  • Configure prometheus.yml
  • Expose ports, bind mounts
  • docker run -d -v /path-to/prometheus.yml:/etc/prometheus/prometheus.yml -p 9090:9090 prom/prometheus
  10. Promtool
  • check & validate configuration before applying it (e.g. to production)
  • prevent downtime while config issues are being identified
  • validate that metrics passed to it are correctly formatted
  • can perform queries on a Prom server
  • debugging & profiling a Prom server
  • perform unit tests against Recording/Alerting rules
  • promtool check config /etc/prometheus/prometheus.yml
  11. Container metrics
  • metrics can be scraped from containerized envs
  • docker engine metrics (how much CPU does Docker use etc. no metrics specific to a container!):
    • vi /etc/docker/daemon.json:
      {
        "metrics-addr": "127.0.0.1:9323",
        "experimental": true
      }
    • sudo systemctl restart docker
    • curl localhost:9323/metrics
    • prometheus job update:
      scrape_configs:
        - job_name: "docker"
          static_configs:
            - targets: ["12.1.13.4:9323"]
  • cAdvisor (how much memory does each container use? container uptime? etc.):
    • vi docker-compose.yml to pull gcr.io/cadvisor/cadvisor (see the sketch below)
    • docker-compose up or docker compose up
    • curl localhost:8080/metrics
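    • a minimal docker-compose.yml sketch for cAdvisor (the image tag and mounts are typical values, not from this course; adjust as needed):
      version: '3'
      services:
        cadvisor:
          image: gcr.io/cadvisor/cadvisor:latest
          ports:
            - 8080:8080
          volumes:
            - /:/rootfs:ro
            - /var/run:/var/run:ro
            - /sys:/sys:ro
            - /var/lib/docker/:/var/lib/docker:ro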

PromQL

  • short for Prometheus Query Language
  • data returned can be visualized in dashboards
  • used to build alerting rules to notify about thresholds

Data Types

  1. String (currently unused)
  2. Scalar - numeric floating point value (e.g. 54.743)
  3. Instant vector - set of time series containing a single sample for each time series, all sharing the same timestamp (e.g. node_cpu_seconds_total finds all unique label sets and a value for each, and they will all be at a single point in time)
  4. Range vector - set of time series containing a range of data points over time for each time series (e.g. node_cpu_seconds_total[3m] finds all unique labels, but all values and timestamps from the past 3 minutes)

Selectors

  • if we only want to return a subset of time series for a metric = label matchers:
    • exact match = (e.g. node_filesystem_avail_bytes{instance="node1"})
    • negative equality != (e.g. node_filesystem_avail_bytes{device!="tmpfs"})
    • regular expression =~ (e.g. starts with /dev/sda - node_filesystem_avail_bytes{device=~"/dev/sda.*"})
    • negative regular expression !~ (e.g. mountpoint does not start with /boot - node_filesystem_avail_bytes{mountpoint!~"/boot.*"})
  • we can combine multiple selectors with comma (e.g. node_filesystem_avail_bytes{instance="node1",device!="tmpfs"})

Modifiers

  • to get historic data, use an offset modifier after the label matching (e.g. get the value from 5 minutes ago - node_memory_Active_bytes{instance="node1"} offset 5m)
  • to get to an exact point in time (e.g. get the value on September 15 - node_memory_Active_bytes{instance="node1"} @1663265188)
  • you can use both modifiers and order does not matter (e.g. @1663265188 offset 5m = offset 5m @1663265188)
  • you can also add range vectors (e.g. get 2 minutes worth of data 5 minutes before September 15: node_memory_Active_bytes{instance="node1"}[2m] @1663265188 offset 5m)

Operators

  • between instant vectors and scalars
  • types:
  1. Arithmetic +, -, *, /, %, ^ (e.g. node_memory_Active_bytes / 1024 - but it drops the metric name in the output as it is no longer the original metric!)
  2. Comparison ==, !=, >, <, >=, <=, bool (e.g. node_network_flags > 100, node_network_receive_packets_total >= 220, node_filesystem_avail_bytes < bool 1000 returns 0 or 1, mostly for generating alerts)
  3. Logical OR, AND, UNLESS (e.g. node_filesystem_avail_bytes > 1000 and node_filesystem_avail_bytes < 3000). The unless operator results in a vector consisting of the elements on the left side for which there are no elements on the right side (e.g. return all vectors greater than 1000 unless they are greater than 30000: node_filesystem_avail_bytes > 1000 unless node_filesystem_avail_bytes > 30000)
  4. When more than one operator is used, the order of precedence applies from highest to lowest; operators at the same precedence level are evaluated from the left (e.g. 2 * 3 % 2 = (2 * 3) % 2), except power, which is evaluated from the right (e.g. 2 ^ 3 ^ 2 = 2 ^ (3 ^ 2)). Precedence from high to low: ^ | *, /, %, atan2 | +, - | ==, !=, <=, <, >=, > | and, unless | or

Q: Construct a query to return all filesystems that have over 1000 bytes available on all instances under web job. A: node_filesystem_avail_bytes{job="web"} > 1000

Q: Which of the following queries would you use for the loadbalancer:9100 host to return all the interfaces that have received less than or equal to 10000 bytes of traffic? A: node_network_receive_bytes_total{instance="loadbalancer:9100"} <= 10000

Q: node_filesystem_files tracks the filesystem's total file nodes. Construct a query that only returns time series greater than 500000 and less than 10000000 across all jobs A: node_filesystem_files > 500000 and node_filesystem_files < 10000000

Q: The metric node_filesystem_avail_bytes lists the available bytes for all filesystems, and the metric node_filesystem_size_bytes lists the total size of all filesystems. Run each metric and see their outputs. There are three properties/labels these will return, device, fstype, and mountpoint. Which of the following queries will show the percentage of free disk space for all filesystems on all the targets under web job whose device label does not match tmpfs? A: node_filesystem_avail_bytes{job="web", device!="tmpfs"}*100 / node_filesystem_size_bytes{job="web", device!="tmpfs"}

Vector matching

  • between 2 instant vectors (e.g. to get the percentage of free space node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 )
  • samples with exactly the same labels get matched together (e.g. instance, job and mountpoint must all be the same to get a match) - every element in the vector on the left tries to find a single matching element on the right
  • to perform an operation on 2 vectors with differing labels like http_errors with code="500", code="501", code="404", method="put" etc., use the ignoring keyword (e.g. http_errors{code="500"} / ignoring(code) http_requests)
  • if the entries with e.g. methods put and del have no match in both metrics http_errors and http_requests, they will not show up in the results!
  • to specify exactly which labels to match on, we use the on keyword (e.g. http_errors{code="500"} / on(method) http_requests)
| vector1 | vector2 | resulting vector |
| --- | --- | --- |
| {cpu=0,mode=idle} | {cpu=1,mode=steal} | {cpu=0} |
| {cpu=1,mode=iowait} | {cpu=2,mode=user} | {cpu=1} |
| {cpu=2,mode=user} | {cpu=0,mode=idle} | {cpu=2} |
  • The resulting vector will have matching elements with all labels listed in on(), or all labels not listed in ignoring(): e.g. vector1{} + on(cpu) vector2{} or vector1{} + ignoring(mode) vector2{}
  • Another example is: http_errors_total / ignoring(error) http_requests_total = http_errors_total / on(instance, job, path) http_requests_total

Q: Which of the following queries can be used to track the total number of seconds cpu has spent in user + system mode for instance loadbalancer:9100? A: node_cpu_seconds_total{instance="loadbalancer:9100", mode="user"} + ignoring(mode) node_cpu_seconds_total{instance="loadbalancer:9100", mode="system"}

Q: Construct a query that will find out what percentage of time each cpu on each instance was spent in mode user. To calculate the percentage in mode user, get the total seconds spent in mode user and divide that by the sum of the time spent across all modes. Further, multiply that result by 100 to get a percentage. A: node_cpu_seconds_total{mode="user"}*100 /ignoring(mode, job) sum by(instance, cpu) (node_cpu_seconds_total)

Many-to-one vector matching

  • error executing the query: "multiple matches for labels: many-to-one matching must be explicit (group_left/group_right)"
  • this is where each vector element on the "one" side can match with multiple elements on the "many" side (e.g. http_errors + on(path) group_left http_requests) - group_left tells PromQL that elements from the right side are now matched with multiple elements from the left (group_right is the opposite of that - depending on which side is the "many" and which side is the "one")
| many | one | resulting vector |
| --- | --- | --- |
| {error=400,path=/cats} 2 | {path=/cats} 2 | {error=400,path=/cats} 4 |
| {error=500,path=/cats} 5 | {path=/cats} 2 | {error=500,path=/cats} 7 |
| {error=400,path=/dogs} 1 | {path=/dogs} 7 | {error=400,path=/dogs} 8 |
| {error=500,path=/dogs} 7 | {path=/dogs} 7 | {error=500,path=/dogs} 14 |

Q: The api job collects metrics on an API used for uploading files. The API has 3 endpoints /images /videos, and /songs, which are used to upload respective file types. The API provides 2 metrics to track: http_uploaded_bytes_total - tracks the number of uploaded bytes and http_upload_failed_bytes_total - tracks the number of bytes failed to upload. Construct a query to calculate the percentage of bytes that failed for each endpoint. The formula for the same is http_upload_failed_bytes_total*100 / http_uploaded_bytes_total. A: http_upload_failed_bytes_total*100 / ignoring(error) group_left http_uploaded_bytes_total

Aggregation operators

  • allow you to take an instant vector and aggregate its elements, resulting in a new instant vector with fewer elements
  • sum, min, max, avg, group, stddev, stdvar, count, count_values, bottomk, topk, quantile
  • for example sum(http_requests), max(http_requests)
  • by keyword allows you to choose which labels to aggregate along (e.g. sum by(path) (http_requests), sum by(method) (http_requests), sum by(instance) (http_requests), sum by(instance, method) (http_requests))
  • without keyword does the opposite of by and tells the query which labels not to include in aggregation (e.g. sum without(cpu, mode) (node_cpu_seconds_total))

Q: On loadbalancer:9100 instance, calculate the sum of the size of all filesystems. The metric to get filesystem size is node_filesystem_size_bytes A: sum(node_filesystem_size_bytes{instance="loadbalancer:9100"})

Q: Construct a query to find how many CPUs instance loadbalancer:9100 have. You can use the node_cpu_seconds_total metric to find out the same. A: count(sum by (cpu) (node_cpu_seconds_total{instance="loadbalancer:9100"}))

Q: Construct a query that will show the number of CPUs on each instance across all jobs. A: ?
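A possible answer (a sketch following the pattern of the previous answer): count by(instance) (sum by(instance, cpu) (node_cpu_seconds_total))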

Q: Use the node_network_receive_bytes_total metric to calculate the sum of the total received bytes across all interfaces on per instance basis A: sum by(instance)(node_network_receive_bytes_total)

Q: Which of the following queries will be used to calculate the average packet size for each instance? A: sum by(instance)(node_network_receive_bytes_total) / sum by(instance)(node_network_receive_packets_total)

Functions

  • sorting, math, label transformations, metric manipulation
  • use the round function to round the query's result to the nearest integer value
  • round up to the closest integer: ceil(node_cpu_seconds_total)
  • round down: floor(node_cpu_seconds_total)
  • absolute value for negative numbers: abs(1-node_cpu_seconds_total)
  • date & time: time(), minute() etc.
  • vector function takes a scalar value and converts it into an instant vector: vector(4)
  • scalar function returns the value of the single element as a scalar (otherwise returns NaN if the input vector does not have exactly one element): scalar(process_start_time_seconds)
  • sorting: sort (ascending) and sort_desc (descending)
  • rate at which a counter metric increases: rate and irate (e.g. group data points into 60-second buckets, take the last value minus the first value in each of these 60s groups and divide it by 60: rate(http_errors[1m]); irate is similar to rate, but uses only the last and second-to-last data points: irate(http_errors[1m]))
| rate | irate |
| --- | --- |
| looks at the first and last data points within a range | looks at the last two data points within a range |
| effectively an average rate over the range | instant rate |
| best for slow-moving counters and alerting rules | should be used for graphing volatile, fast-moving counters |

Note:

  • make sure there are at least 4 samples within the time range (e.g. a 15s scrape interval with a 60s window gives 4 samples)
  • when combining rate with an aggregation operator, always take rate() first, then aggregate, so it can detect counter resets (see the example below)
  • to get the rate of increase of the sum of latency across all requests: rate(requests_latency_seconds_sum[1m])
  • to calculate the average latency of a request over the past 5m: rate(requests_latency_seconds_sum[5m]) / rate(requests_latency_seconds_count[5m])
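  • for example (a sketch using the http_errors counter from earlier), aggregate the per-series rates rather than taking the rate of an already-aggregated series:
    sum by(path) (rate(http_errors[1m]))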

Q: Management wants to keep track of the rate of bytes received by each instance. Each instance has two interfaces, so the rate of traffic being received on them must be summed up. Calculate the rate of received node_network_receive_bytes_total using 2-minute window, sum the rates across all interfaces, and group the results by instance. Save the query in /root/traffic.txt. A: sum by(instance) (rate(node_network_receive_bytes_total[2m]))

Subquery

  • Syntax: <instant_query> [<range>:<resolution>] [offset <duration>]
  • Example: rate(http_requests_total[1m]) [5m:30s] - where the sample range is 1m, the query range is the data from the last 5m, and the query step for the subquery is 30s (the gap between data points)
  • maximum value of a gauge metric over 10 minutes: max_over_time(node_filesystem_avail_bytes[10m])
  • for counter metrics, we need to find the max value of the rate over the past 5 minutes, e.g. the maximum rate of requests over the last 5 minutes with a 30s query step and a sample range of 1m: max_over_time(rate(http_requests_total[1m]) [5m:30s])

Q: There were reports of a small outage of an application in the past few minutes, and some alerts pointed to potential high iowait on the CPUs. We need to calculate when the iowait rate was the highest over the past 10 minutes. [Construct a subquery that will calculate the rate at which all cpus spent in iowait mode using a 1 minute time window for the rate function. Find the max value of this result over the past 10 minutes using a 30s query step for the subquery.] A: ``
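A possible answer (a sketch, assuming the iowait rate is summed across all CPUs): max_over_time(sum(rate(node_cpu_seconds_total{mode="iowait"}[1m]))[10m:30s])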

Q: Construct a query to calculate the average over time (avg_over_time) rate of http_requests_total over the past 20m using 1m query step. A: ``
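A possible answer (a sketch, assuming a 1m sample range for the inner rate): avg_over_time(rate(http_requests_total[1m])[20m:1m])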

Recording rules

  • allow Prometheus to periodically evaluate a PromQL expression and store the resulting time series generated by it
  • speeding up your dashboards
  • provide aggregated results for use elsewhere
  • recording rules go in a separate file called a rule file:
    global: ...
    rule_files:
      - rules.yml # globs can be used here, like /etc/prometheus/rule_files.d/*.yml
    scrape_configs: ...
  • Prometheus server must be restarted after this change
  • syntax of the rules.yml file:
    groups: # groups running in parallel
      - name: <group name 1>
        interval: <evaluation interval, global by default>
        rules: # however, rules evaluated sequentially
          - record: <rule name 1>
            expr: <promql expression 1>
            labels:
              <label name>: <label value>
          - record: <rule name 2> # you can also reference previous rule(s)
            expr: <promql expression 1>
            labels:
      - name: <group name 2>
      ...
  • example of the rules.yml file:
    groups:
      - name: example1 # it will show up in the WebGui under "status" - "rules"
        interval: 15s
        rules:
          - record: node_memory_memFree_percent
            expr:  100 - (100 * node_memory_MemFree_bytes / node_memory_MemTotal_bytes)
          - record: node_filesystem_free_percent
            expr: 100 * node_filesystem_free_bytes / node_filesystem_size_bytes
  • best practices for rule naming: aggregation_level:metric_name:operations, e.g. we have a http_errors counter with two instrumentation labels "method" and "path". All the rules for a specific job should be contained in a single group. It will look like:
    - record: job_method_path:http_errors:rate5m
      expr: sum without(instance) (rate(http_errors{job="api"}[5m]))

HTTP API

  • execute queries, gather information on alerts, rules, and service discovery related configs
  • send the POST request to http://<prometheus_ip>/api/v1/query
  • example: curl http://<prometheus_ip>:9090/api/v1/query --data 'query=node_arp_entries{instance="192.168.1.168:9100"}'
  • query at a specific time, just add another --data 'time=169386192'
  • response back as JSON

Dashboarding & Visualization

  • several different ways:
    • expression browser with graph tab (built-in)
    • console templates (built-in)
    • 3rd party like Grafana
  • expression browser has limited functionality, only for ad-hoc queries and quick debugging, cannot create custom dashboards, not good for day-to-day monitoring, but can have multiple panels and compare graphs

Console Templates

  • allow you to create custom HTML pages using the Go templating language (typically {{ and }})
  • Prometheus metrics, queries and charts can be embedded in the templates
  • ls /etc/prometheus/consoles to see the *.html and example (to see it, go to https://localhost:9090/consoles/index.html.example)
  • boilerplate will typically contain:
    {{ template "head" . }}
    {{ template "prom_content_head" . }}
    <h1>Memory details</h1>
    active memory: {{ template "prom_query_drilldown" (args "node_memory_Active_bytes") }}
    {{ template "prom_content_tail" . }}
    {{ template "tail" . }}
  • an example of inserting a chart:
    {{ template "head" . }}
    {{ template "prom_content_head" . }}
    <h1>Memory details</h1>
    active memory: {{ template "prom_query_drilldown" (args "node_memory_Active_bytes") }}
    <div id="graph"></div>
    <script>
    new PromConsole.Graph({
    node: document.querySelector("#graph"),
    expr: "rate(node_memory_Active_bytes[2m])"
    })
    </script>
    {{ template "prom_content_tail" . }}
    {{ template "tail" . }}
  • another example:
{{ template "head" . }}

{{ template "prom_content_head" . }}

<h1>Node Stats</h1>

<h3>Memory</h3>

<strong>Memory utilization:</strong> {{ template "prom_query_drilldown" (args "100- (node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes*100)") }}

<br/>

<strong>Memory Size:</strong> {{ template "prom_query_drilldown" (args "node_memory_MemTotal_bytes/1000000" "Mb") }}

<h3>CPU</h3>

<strong>CPU Count:</strong> {{ template "prom_query_drilldown" (args "count(node_cpu_seconds_total{mode='idle'})") }}

<br/>

<strong>CPU Utilization:</strong> {{ template "prom_query_drilldown" (args "sum(rate(node_cpu_seconds_total{mode!='idle'}[2m]))*100/56") }}
<!--
Expression explanation: The expression will take the current rate of all cpu modes except idle, because idle means the cpu isn't being used. It will then sum them up and multiply by 100 to give a percentage. This final number is divided by the number of CPUs on the node (we want the utilisation per CPU, so adjust this value as needed).
-->
<div id="cpu"></div>
<script>
new PromConsole.Graph({
node: document.querySelector("#cpu"),
expr: "sum(rate(node_cpu_seconds_total{mode!='idle'}[2m]))*100/2",
})
</script>
<h3>Network</h3>
<div id="network"></div>
<script>
new PromConsole.Graph({
node: document.querySelector("#network"),
expr: "rate(node_network_receive_bytes_total[2m])",
})
</script>

{{ template "prom_content_tail" . }}

{{ template "tail" . }}

Application Instrumentation

  • the Prometheus client libraries provide an easy way to add instrumentation to your code in order to track and expose metrics for Prometheus
  • they do 2 things:
    • Track metrics in the Prometheus expected format
    • Expose metrics via /metrics path so they can be scraped
  • official and unofficial libraries
  • Example for Python:
    • You have an existing API in Flask, run pip install prometheus_client
    • In your code, import it: from prometheus_client import Counter
    • Initialize counter object: REQUESTS = Counter('http_requests_total', 'Total number of requests')
    • When do we want to increment this? Within all of the @app.get("/path") like this: REQUESTS.inc()
    • We can also get total requests per path using different counter objects, but that is not recommended. Instead we can use labels:
      • REQUESTS = Counter('http_requests_total', 'Total number of requests', labelnames=['path'])
      • REQUESTS.labels('/cars').inc()
    • Then you can do the same approach for different HTTP method: labelnames=['path', 'method'] and REQUESTS.labels('/cars', 'post').inc()
    • How to expose to /metrics endpoint though?
      from prometheus_client import Counter, start_http_server
      if __name__ == '__main__':
        start_http_server(8000) # start the metrics server on port
        app.run(port='5001')    # this is the Flask app
    • curl 127.0.0.1:8000 will show the metrics
    • However, you can also expose the metrics from a Flask route and have the Flask app on http://localhost:5001 and the metrics on http://localhost:5001/metrics, e.g. app.wsgi_app = DispatcherMiddleware(app.wsgi_app, { '/metrics': make_wsgi_app() })
  • complete working example:
    from flask import Flask
    from prometheus_client import Counter, start_http_server, Gauge

    REQUESTS = Counter('http_requests_total', 'Total number of requests', labelnames=['path', 'method'])

    ERRORS = Counter('http_errors_total',
                    'Total number of errors', labelnames=['code'])

    IN_PROGRESS = Gauge('inprogress_requests',
                        'Total number of requests in progress')

    def before_request():
        IN_PROGRESS.inc()

    def after_request(response):
        IN_PROGRESS.dec()
        return response

    app = Flask(__name__)
    # register the hooks so they actually run around every request
    app.before_request(before_request)
    app.after_request(after_request)

    @app.get("/products")
    def get_products():
        REQUESTS.labels('products', 'get').inc()
        return "product"

    @app.post("/products")
    def create_product():
        REQUESTS.labels('products', 'post').inc()
        return "created product", 201

    @app.get("/cart")
    def get_cart():
        REQUESTS.labels('cart', 'get').inc()
        return "cart"

    @app.post("/cart")
    def create_cart():
        REQUESTS.labels('cart', 'post').inc()
        return "created cart", 201

    @app.errorhandler(404)
    def page_not_found(e):
        ERRORS.labels('404').inc()
        return "page not found", 404

    if __name__ == '__main__':
        start_http_server(8000)
        app.run(debug=False, host="0.0.0.0", port='6000')

Implementing histogram & summary in your code (example)

# from prometheus_client import Histogram
# add histogram metric to track latency/response time for each request
LATENCY = Histogram('request_latency_seconds', 'Request Latency', labelnames=['path', 'method'])
# in before_request, record the start time via `request.start_time = time.time()`
# in after_request, calculate `request_latency = time.time() - request.start_time` and pass it to:
LATENCY.labels(request.path, request.method).observe(request_latency)
  • client libraries can let you specify bucket sizes (e.g. buckets=[0.01, 0.02, 0.1])
  • to configure summary, it is the exact same, just use LATENCY = Summary('......)

Implementing gauge metric in your code (example)

# track the number of active requests getting processed at the moment
IN_PROGRESS = Gauge('name', 'Description', labelnames=['path', 'method'])
# before_request will then increment IN_PROGRESS.inc()
# but after_request when it's done, then decrement IN_PROGRESS.dec()

Best practices

  • use snake_case naming, all lowercase, e.g. library_name_unit_suffix
  • first word should be app/library name it is used for
  • next add what is it used for
  • add unit (_bytes) at the end, use unprefixed base units (not microseconds or kilobytes)
  • avoid _count, _sum, _bucket suffixes
  • examples: process_cpu_seconds, http_requests_total, redis_connection_errors, node_disk_read_bytes_total
  • not good: container_docker_restarts, http_requests_sum, nginx_disk_free_kilobytes, dotnet_queue_waiting_time
  • three types of services/apps:
    • online - immediate response is expected (tracking queries, errors, latency etc)
    • offline - no one is actively waiting for a response (amount of queued work, work in progress, processing rate, errors etc)
    • batch - similar to offline but runs on a regular schedule, needs the push gateway (time spent processing, overall runtime, last completion time)

Service Discovery

  • allows Prometheus to dynamically update/populate/remove a list of endpoints to scrape
  • several built-ins: file, ec2, azure, gce, consul, nomad, k8s...
  • in the Web ui: "status" - "service discovery"

File SD

  • list of jobs/targets can be imported from a json/yaml file(s)
  • example:
    scrape_configs:
      - job_name: file-example
        file_sd_configs:
          - files:
            - file-sd.json
            - '*.json'
  • then the file-sd.json would look like e.g.:
    [
      {
        "targets": [ "node1:9100", "node2:9100" ],
        "labels": {
          "team": "dev",
          "job": "node"
        }
      }
    ]

AWS

  • just need to configure EC2 discovery in the config:
    scrape_configs:
      - job_name: ec2
        ec2_sd_configs: # IAM with at least AmazonEC2ReadOnly policy
          - region: <region>
            access_key: <access key>
            secret_key: <secret key>
  • automatically extracts metadata for each EC2 instance
  • defaults to using private IPs

Re-labeling

  • classify Prometheus targets & metrics by rewriting their label set
  • e.g. rename instance from node1:9100 to just node1, drop metrics, drop labels etc
  • 2 options:
    • relabel_configs in prometheus.yml, which occurs before the scrape and only has access to labels added by the SD mechanism
    • metric_relabel_configs in prometheus.yml, which occurs after the scrape

relabel_configs

  • example #1: __meta_ec2_tag_env = dev | prod

    - job_name: aws
      relabel_configs:
        - source_labels: [__meta_ec2_tag_env] # array of labels to match on
          regex: prod                         # to match on specific value of that label
          action: keep|drop|replace           # keep=continue to scrape, BUT if the regex does not match, the target will NOT be scraped (there is an implicit invisible catch-all at the end!); drop=no longer scrape this target
  • example #2: when there are more than 1 source labels (array) they will be joined by a ;:

    relabel_configs:
    - source_labels: [env, team]  # if the target has {env=dev} and {team=marketing}, we will keep it
      regex: dev;marketing
      action: keep                # everything else will be dropped
      # separator: "-"            # optional, change the delimiter between labels using the separator property
  • target labels = labels that are added to every time series returned from a scrape; relabeling will drop all auto-discovered labels (starting with __). In other words: target labels are assigned to every metric from that specific target. Discovered labels (labels that start with __) will be dropped after the initial relabeling process and will not get assigned as target labels.

  • example #3 of saving __address__=192.168.1.1:80 label in target label, but need to transform into {ip=192.168.1.1}:

    relabel_configs:
      - source_labels: [__address__]
        regex: (.*):.*    # assign everything before the `:` into a group referenced with `$1` below
        target_label: ip  # name of the new label
        action: replace
        replacement: $1
  • example #4 of combining labels env="dev" & team="web" will turn into info="web-dev"

    relabel_configs:
      - source_labels: [team, env]
        regex: (.*);(.*)  # parentheses allow you to reference the captured values as $1, $2 below
        action: replace
        target_label: info
        replacement: $1-$2
  • example #5: re-label so the label name team changes to organization and the value gets prepended with the org- text:

    relabel_configs:
    - source_labels: [team]
      regex: (.*)
      action: replace
      target_label: organization
      replacement: org-$1
  • to drop the label, use action: labeldrop based on the regex:

    - regex: size
      action: labeldrop
  • the opposite of labeldrop is labelkeep - but keep in mind ALL other labels will be dropped!

    - regex: instance|job
      action: labelkeep
  • to modify the label name (not the value), use labelmap like this:

    - regex: __meta_ec2_(.*)  # match any of these ec2 discovered labels - e.g. __meta_ec2_ami="ami-abcdefgh123456"
      action: labelmap
      replacement: ec2_$1     # we will prepend it with `ec2` - e.g. ec2_ami="ami-abcdefgh123456"

metric_relabel_configs

  • takes place after the scrape is performed and has access to the scraped metrics (not just the labels)
  • configuration is identical to relabel_configs
  • example #1:
    - job_name: example
      metric_relabel_configs: # this will drop a metric http_errors_total
        - source_labels: [__name__]
          regex: http_errors_total
          action: drop        # or keep, which will drop every other metrics
  • example #2:
    - job_name: example
      metric_relabel_configs: # rename a metric name from http_errors_total to http_failures_total
        - source_labels: [__name__]
          regex: http_errors_total
          action: replace
          target_label: __name__            # what's the new name of the label key
          replacement: http_failures_total  # replacement is the new name of the value / the name of the metric
  • example #3:
    - job_name: example
      metric_relabel_configs: # drop a label named code
        - regex: code
          action: labeldrop   # drop a label for a metric
  • example #4:
    - job_name: example
      metric_relabel_configs: # strip off the forward slash and rename {path=/cars} -> {endpoint=cars}. Keep in mind there will now be path as well as endpoint; use labeldrop to get rid of the path label showing the same information.
        - source_labels: [path]
          regex: \/(.*)       # any text after the forward slash (wrapping it in parentheses gives you access with $1)
          action: replace
          target_label: endpoint
          replacement: $1     # match the original value

Push Gateway

  • when a process has already exited before the scrape occurred
  • middle man between batch job and Prometheus server
  • Prometheus will scrape metrics from the PG
  • installation: pushgateway-1.4.3.linux-amd64.tar.gz from the releases page, untar, run ./pushgateway
  • create a new user sudo useradd --no-create-home --shell /bin/false pushgateway
  • copy the binary to /usr/local/bin, change owner to pushgateway, configure service file (same as the Prometheus)
  • systemctl daemon-reload, restart, enable
  • curl localhost:9091/metrics
  • configure Prometheus to scrape the gateway. Same as other targets, but it needs honor_labels: true (allows the pushed metrics to keep the custom labels they specify, like job1, job2 etc)
  • for sending the metrics, you send an HTTP POST request to: http://<pushgateway_addr>:<port>/metrics/job/<job_name>/<label1>/<value1>/<label2>/<value2>... where job_name will be the job label of the pushed metrics, and the labels/values are used as a grouping key, which allows grouping metrics together to update/delete multiple metrics at once. When sending a POST request, only metrics with the same name as the newly pushed ones are replaced (this only applies to metrics in the same group).
    1. see the original metrics:
    processing_time_seconds{quality="hd"} 120
    processed_videos_total{quality="hd"} 10
    processed_bytes_total{quality="hd"} 4400
    
    2. POST the processing_time_seconds{quality="hd"} 999
    3. result:
    processing_time_seconds{quality="hd"} 999
    processed_videos_total{quality="hd"} 10
    processed_bytes_total{quality="hd"} 4400
    
  • example: push metric example_metric 4421 with a job label of {job="db_backup"}: echo "example_metric 4421" | curl --data-binary @- http://localhost:9091/metrics/job/db_backup (@- tells curl to read the binary data from stdin)
  • another example with multiple metrics at once:
    cat <<EOF | curl --data-binary @- http://localhost:9091/metrics/job/video_processing/instance/mp4_node1
    processing_time_seconds{quality="hd"} 120
    processed_videos_total{quality="hd"} 10
    processed_bytes_total{quality="hd"} 4400
    EOF
  • When using an HTTP PUT request however, the behavior is different. All metrics within a specific group get replaced by the new metrics being pushed (preexisting ones are deleted):
    1. start with:
    processing_time_seconds{quality="hd"} 999
    processed_videos_total{quality="hd"} 10
    processed_bytes_total{quality="hd"} 4400
    
    2. PUT the processing_time_seconds{quality="hd"} 666
    3. result:
    processing_time_seconds{quality="hd"} 666
    
  • A DELETE HTTP request will delete all metrics within a group (it does not touch metrics in other groups): curl -X DELETE http://localhost:9091/metrics/job/archive/app/web will only delete the metrics in the group {job="archive", app="web"}

Client library

  • Python: from prometheus_client import CollectorRegistry, pushadd_to_gateway, then initialize registry = CollectorRegistry(). You can then push via pushadd_to_gateway('user2:9091', job='batch', registry=registry) (see the sketch after this list)
  • 3 functions within a library to push metrics:
    • push - same as HTTP PUT (any existing metrics for this job are removed and the pushed metrics added)
    • pushadd - same as HTTP POST (overrides existing metrics with the same names, but all other metrics in the group remain unchanged)
    • delete - same as HTTP DELETE (all metrics for a group are removed)
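  • a minimal Python sketch of the pushadd flow described above (the metric name, job name and gateway address are just examples):
    from prometheus_client import CollectorRegistry, Gauge, pushadd_to_gateway

    registry = CollectorRegistry()
    last_success = Gauge('db_backup_last_success_timestamp_seconds',
                         'Unix timestamp of the last successful DB backup',
                         registry=registry)
    last_success.set_to_current_time()

    # pushadd = HTTP POST semantics: only metrics with the same name in this group get replaced
    pushadd_to_gateway('localhost:9091', job='db_backup', registry=registry)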

Alerting

  • lets you define conditions that, if met, trigger alerts
  • these are standard PromQL expressions (e.g. node_filesystem_avail_bytes < 1000 = 547)
  • Prometheus is only responsible for triggering alerts
  • responsibility of sending notification is offloaded onto alertmanager -> Slack, email, SMS etc.
  • alerts are visible in the web gui under "alerts" and they are green if not alerting
  • alerting rules are similar to recording rules; in fact they live in the same location (rule_files in prometheus.yml):
    groups:
      - name: node
        interval: 15s
        rules:
          - record: ...
            expr: ...
          - alert: LowMemory
            expr: node_memory_memFree_percent < 20
  • The for clause tells Prometheus that an expression must evaluate to true for a specific period of time:
    - alert: node down
      expr: up{job="node"} == 0
      for: 5m   # expects the node to be down for 5 minutes before firing an alert
  • 3 alert states:
    1. inactive - has not returned any results (green)
    2. pending - it hasn't been long enough to be considered firing (related to for) (orange)
    3. firing - active for more than the defined for clause (red)

Labels & Annotations

  • optional labels can be added to alerts to provide a mechanism to classify and match alerts
  • important because they can be used when you set up rules in alert manager so you can match on these and group them together
- alert: node down
  expr: ...
  labels:
    severity: warning
- alert: multiple nodes down
  expr: ...
  labels:
    severity: critical
  • annotations (using Go templating) can be used to provide additional/descriptive information (unlike labels, they do not play a part in the alert's identity)
- alert: node_filesystem_free_percent
  expr: ...
  annotations:
    description: "Filesystem {{.Labels.device}} on {{.Labels.instance}} is low on space, current available space is {{.Value}}"

This is how the templating works:

  • {{.Labels}} to access alert labels
  • {{.Labels.instance}} to get instance label
  • {{.Value}} to get the firing sample value

Alertmanager

  • responsible for receiving alerts generated by Prometheus and converting them to notifications
  • supports multiple Prometheus servers via API
  • workflow:
    1. dispatcher picks up the alerts first,
    2. inhibition allows suppressing certain alerts if other alerts exist,
    3. silencing mutes alerts (e.g. maintenance)
    4. routing is responsible for deciding which alert gets sent where
    5. notification integrates with all 3rd party tools (email, Slack, SMS, etc.)
  • installation tarball (alertmanager-0.24.0.linux-amd64.tar.gz) contains alertmanager binary, alertmanager.yml config file, amtool command line utility and data folder where the notification states are stored. The installation is the same as previous tools (add new user, create /etc/alertmanager, create /var/lib/alertmanager, copy executables to /usr/local/bin, change ownerships, create service file, daemon-reload, start, enable). ExecStart in systemd expects --config.file and --storage.path
  • starting it is simple: ./alertmanager; it listens on port 9093 (you can see the interface at http://localhost:9093)
  • restarting AM can be done via HTTP POST to /-/reload endpoint, systemctl restart alertmanager or killall -HUP alertmanager
  • configure Prometheus to use that alertmanager:
    global: ...
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - 127.0.0.1:9093
                - alertmanager2:9093
  • there are 3 main sections of alertmanager.yml:
    • global - applies across all sections which can be overwritten (e.g. smtp_smarthost)
    • route - set of rules to determine what alerts get matched up (match_re, matchers) with what receiver
      • at the top level, there is a default route - any alerts that don't match any of the other routes will use this default example route:
      route:
        routes:
          - match_re:               # regular expression
              job: (node|windows)
            receiver: infra-email
          - matchers:               # all alerts with job=kubernetes & severity=ticket labels will match this rule
              job: kubernetes
              severity: ticket
            receiver: k8s-slack     # they will be sent to this receiver
      • nested routes are supported:
      routes:
      - matchers:                    # parent route
          job: kubernetes           # 2. all other alerts with this label will match this main route (k8s-email)
        receiver: k8s-email
        routes:                     # sub-route for further route matching (AND)
          - matchers:
              severity: pager       # 1. if the alert also has the label severity=pager, then it will be sent to k8s-pager
            receiver: k8s-pager
      • if you need an alert to match two routes, use continue:
      route:
        routes:
          - receiver: alert-logs    # all alerts to be sent to alert-logs
            continue: true
          - matchers:
              job: kubernetes       # and then if it also has this label, it will be sent to k8s-email
            receiver: k8s-email
      • grouping allows to split up your notification by labels (otherwise all alerts results in one big notification):
      receiver: fallback-pager
      group_by: [team]
      routes:
        - matchers:
            team: infra
          group_by: [region,env]    # infra team has alerts grouped based on region and env labels
          receiver: infra-email
          # any child routes underneath here will inherit the grouping policy and group based on same 2 labels region, env
    • receivers - one or more notifiers to forward alerts to users (e.g. slack_configs)
      • make use of global configurations so all of the receivers don't have to manually define the same key:
      global:
        victorops_api_key: XXX      # this will be automatically provided to all receivers
      receivers:
        - name: infra-pager
          victorops_configs:
            - routing_key: some-route-here
      • you can customize the message by using Go templating:
        • GroupLabels (e.g. title: in slack_configs: {{.GroupLabels.severity}} alerts in region {{.GroupLabels.region}})
        • CommonLabels
        • CommonAnnotations
        • ExternalURL
        • Status
        • Receiver
        • Alerts (e.g. text: in slack_configs: {{.Alerts | len}} alerts:)
          • Labels
          • Annotations ({{range .Alerts}}{{.Annotations.description}}{{"\n"}}{{end}})
          • Status
          • StartsAt
          • EndsAt
  • Example alertmanager.yml config:
global:
  smtp_smarthost: 'localhost:25'
  smtp_from: '[email protected]'
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 2m
  repeat_interval: 1h
  receiver: 'general-email'
  routes:
    - matchers:
            - team=global-infra
      receiver: global-infra-email
    - matchers:
            - team=internal-infra-email
      receiver: internal-infra-email
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'
  - name: global-infra-email
    email_configs:
            - to: [email protected]
              require_tls: false
  - name: internal-infra-email
    email_configs:
            - to: [email protected]
              require_tls: false
  - name: general-email
    email_configs:
            - to: [email protected]
              require_tls: false

Silences

  • alerts can be silenced to prevent generating notifications for a period of time (like maintenance windows)
  • in the "new silence" button - specify start, end/duration, matchers (list of labels), creator, comment
  • you can then view those in the silence tab
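  • silences can also be created from the command line using the amtool utility that ships with Alertmanager, e.g. (a sketch; flags may vary between versions): amtool silence add alertname=LowMemory instance=node1:9100 --duration=2h --comment="planned maintenance" --alertmanager.url=http://localhost:9093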

Monitoring Kubernetes

Conclusion

Mock Exam 1 & 2


Last update: Fri Jan 20 05:13:26 UTC 2023
