    # Prometheus Certified Associate (PCA)

    ### Observability
    - the ability to understand and measure the state of a system based on data generated by the system
    - allows to generate actionable outputs from **unexpected** scenarios
    - to better understand the internals of your system
    - greater need for observability in distributed systems & microservices
    - troubleshooting - e.g. why are error rates high?
    - 3 pillars of observability: logs, metrics, traces
    * collect metrics **from different locations** (e.g. like West DC, central DC, East DC, AWS etc.)
    * high memory on the hosting MySQL db and **notify operations team** via email
    * find out which uploaded video length the application **starts to degrade**
    - allows generating alerts when a **threshold** is reached
    - collects data by **scraping targets** that expose metrics through an HTTP endpoint
    - data is stored in a time series db and can be queried with the built-in **PromQL** (Prometheus Query Language)
    - what can it monitor:
    * MySQL
    * Apache
    * HAProxy
    * client libraries to monitor application metrics (# of errors/exceptions, latency, job execution duration) for Go, Java, Python, Ruby, Rust
    - Pull based approach is better, because:
    * easier to tell if the target is down
    * does not DDoS the metrics server

    ### Prometheus configuration
    - Sections (a combined sketch follows this list):
    1. `global` - Default parameters, it can be overridden by the same variables in sub-sections
    2. `scrape_configs` - Define targets and `job_name`, which is a collection of instances that need to be scraped
    3. `alerting` - Alerting specifies settings related to the Alertmanager
    4. `rule_files` - Rule files specifies a list of globs, rules and alerts are read from all matching files
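
    A minimal sketch of how these four sections fit together in `prometheus.yml` (job names, targets and file paths below are illustrative, not from the original notes):
    ```yaml
    global:
      scrape_interval: 15s     # default, can be overridden per job
      evaluation_interval: 15s
    rule_files:
      - rules.yml
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ["localhost:9093"]
    scrape_configs:
      - job_name: "node"
        static_configs:
          - targets: ["node1:9100", "node2:9100"]
    ```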

    ### Encryption & Authentication
    - between the Prometheus and the targets

    #### Encryption
    1. **On the targets**, you need to generate the key & crt pair first - e.g.:
    - `sudo openssl req -new -newkey rsa:2048 -days 465 -nodes -x509 -keyout node_exporter.key -out node_exporter.crt -subj "..." -addext "subjectAltName = DNS:localhost"`
    - then the target config has to be customized:
    ```yaml
    # /etc/node_exporter/config.yml
    tls_server_config:
      # Certificate and key files for server to use to authenticate to client
      cert_file: node_exporter.crt
      key_file: node_exporter.key
    ```
    - The exporter supports TLS via a new web configuration file: `./node_exporter --web.config=config.yml`
    - Test with: `curl -k https://localhost:9100/metrics`
    2. **On the server**, you need:
    - copy the `node_exporter.crt` from the target to the Prometheus server
    - update the `scheme` to `https` in the `prometheus.yml` and add `tls_config` with `ca_file` (e.g. `/etc/prometheus/node_exporter.crt` that we copied in the previous step) and `insecure_skip_verify` if self-signed:
    ```yaml
    # /etc/prometheus/prometheus.yaml
    scrape_configs:
      - job_name: "node"
        scheme: https
        tls_config:
          # Certificate and key files for client cert authentication to the server
          ca_file: /etc/prometheus/node_exporter.crt
          insecure_skip_verify: true
    ```
    - restart prometheus service

    #### Authentication
    - Authentication is done via a generated hash (`sudo apt install apache2-utils` or `httpd-tools` etc.) and then: `htpasswd -nBC 12 "" | tr -d ':\n'` (prompts for a password and prints the hash)
    - add the `basic_auth_users` and username + generated hash underneath it:
    ```yaml
    # /etc/node_exporter/config.yml
    tls_server_config:
      cert_file: node_exporter.crt
      key_file: node_exporter.key
    basic_auth_users:
      prometheus: $2y$12$daXru320983rnofkwehj4039F
    ```
    - `./node_exporter --web.config=config.yml`
    - `curl -k https://localhost:9100/metrics`

    - restart `node_exporter` service
    - update Prometheus server's config with the same auth and restart Prometheus:
    ```yaml
    # /etc/prometheus/prometheus.yml
    scrape_configs:
      - job_name: "node"
        scheme: https
        tls_config:
          ca_file: /etc/prometheus/node_exporter.crt
          insecure_skip_verify: true
        basic_auth:
          username: prometheus
          password: <PLAIN TEXT PASSWORD!>
    ```

    ### Metrics
    - 3 properties:
    * **name** - general feature of a system to be measured, may contain ASCII, numbers, underscores (`[a-zA-Z_:][a-zA-Z0-9_:]*`), colons are reserved only for recording rules. Metric names cannot start with a number. Name is technically a label (e.g. `__name__=node_cpu_seconds_total`)
    * **{labels (key/value pairs)}** - allow splitting up a metric by specified criteria (e.g. multiple CPUs, specific HTTP methods, API endpoints etc), metrics can have more than 1 label, ASCII, numbers, underscores (`[a-zA-Z0-9_]*`). Labels surrounded by `__` are considered internal to Prometheus. Every metric is assigned 2 labels by default (`instance` and `job`).
    * **value** of the metric
    - Example = `node_cpu_seconds_total{cpu="0",mode="idle"} 258277.86`: labels provide us information on which CPU this metric is for (cpu number zero)
    - when Prometheus scrapes a target and retrieves metrics, it also **stores the time at which the metric was scraped**
    - Example = `1668215300` (unix epoch timestamp, since Jan 1st 1970 UTC)
    - time series = stream of timestamped values **sharing the same metric and set of labels**
    - metrics have a `TYPE` (counter, gauge, histogram, summary) and `HELP` (description of the metric) attribute
    - explanation of each type (a sample `/metrics` output follows the table below):
    * **counter** can only go up, e.g. _how many times did X happen?_
    * **gauge** can go up or down, e.g. _what is the current value of X?_
    * **histogram** tells how long or how big something is, groups observations into configurable bucket sizes (e.g. accumulative response time buckets <1s, <0.5s, <0.2s)
    - e.g. `request_latency_seconds_bucket{le="0.05"} 50` - Buckets are cumulative (i.e. all request in the `le=0.05` bucket will include all requests less than `0.05` which includes all requests that fall into the buckets below it (e.g `0.03`, `0.02`, `0.01`...)
    - e.g. to calculate the histogram's quantiles, we would use `histogram_quantile`, approximation of the value of a specific quantile: _75% of all requests have what latency?_ `histogram_quantile(0.75, request_latency_seconds_bucket)`. To get an accurate value, **make sure there is a bucket at the specific value that needs to be met**. Every time you add a bucket, it will slow the performance of the Prometheus!
    * **summary** is similar to histogram and tells us _how many observations fell below X?_, you do not have to define quantiles ahead of time (similar to histogram, but percentages: response time 20% = <0.3s, 50% = <0.8s, 80% = <1s). Similarly to histogram, there will be `_count` and `_sum` metrics as well as quantiles like `0.7`, `0.8`, `0.9` (instead of buckets).
    - table - difference:

    |histogram|summary|
    |---------|-------|
    |Prometheus server must calculate quantiles|very minimal server-side cost|
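
    To make the `TYPE`/`HELP` attributes and the histogram `_bucket`/`_sum`/`_count` series concrete, a scrape target's `/metrics` output looks roughly like this (names and values are illustrative):
    ```
    # HELP http_requests_total Total number of HTTP requests
    # TYPE http_requests_total counter
    http_requests_total{path="/events",method="get"} 254
    # HELP request_latency_seconds Request latency
    # TYPE request_latency_seconds histogram
    request_latency_seconds_bucket{le="0.05"} 50
    request_latency_seconds_bucket{le="0.1"} 70
    request_latency_seconds_bucket{le="+Inf"} 75
    request_latency_seconds_sum 4.3
    request_latency_seconds_count 75
    ```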


    >Quiz:

    _Q1: How many total unique time series are there in this output?_
    ```
    node_arp_entries{instance="node1" job="node"} 200
    node_arp_entries{instance="node2" job="node"} 150
    node_cpu_seconds_total{cpu="1", instance="node2", mode="idle"}
    node_memory_Active_bytes{instance="node1" job="node"} 419124
    node_memory_Active_bytes{instance="node2" job="node"} 55589
    ```

    A1: **9**

    _Q2: What metric should be used to report the current memory utilization?_

    A2: **gauge**

    _Q3: What metric should be used to report the amount of time a process has been running?_

    A3: **counter**

    _Q4: Which of these is NOT a valid metric?_

    A4: **404_error_count**

    _Q5: How many labels does the following time series have? `http_errors_total{instance="1.1.1.1:80", job="api", code="400", endpoint="/user", method="post"} 55234`_

    A5: **5**

    _Q6: A web app is being built that allows users to upload pictures, management would like to be able to track the size of uploaded pictures and report back the number of photos that were less than 10Mb, 50Mb, 100MB, 500MB, and 1Gb. What metric would be best for this?_

    A6: **histogram**

    _Q7: What are the two labels every metric is assigned by default?_

    A7: **instance, job**

    _Q8: What are the 4 types of Prometheus metrics?_

    A8: **counter, gauge, histogram, summary**

    _Q9: What are the two attributes provided by a metric?_

    A9: **Help, Type**

    _Q10: For the metric `http_requests_total{path="/auth", instance="node1", job="api"} 7782`; What is the metric name?_

    A10: **http_requests_total**

    _Q11: For the `http_request_total` metric, what is the query/metric name that would be used to get the count of total requests on node node01:3000?_

    A11: `http_request_total_count{instance="node01:3000"}`

    _Q12: Construct a query to return the total number of requests for the `/events` route with a latency of less than 0.4s across all nodes._

    A12: `http_request_total_bucket{route="/events",le="0.4"}`

    _Q13: Construct a query to find out how many requests took somewhere between 0.08s and 0.1s on node node02:3000._

    A13: **?**

    _Q14: Construct a query to calculate the rate of http requests that took less than 0.08s. Use a time window of 1m across all nodes._

    A14: **?**

    _Q15: Construct a query to calculate the average latency of a request over the past 4 minutes. Use the formula below to calculate average latency of request: rate of sum-of-all-requests / rate of count-of-all-requests_

    A15: **?**

    _Q16: Management would like to know what is the 95th percentile for the latency of requests going to node node01:3000. Construct a query to calculate the 95th percentile._
    A16: **?**

    _Q17: The company is now offering customers an SLO stating that, 95% of all requests will be under 0.15s. What bucket size will need to be added to guarantee that the histogram_quantile function can accurately report whether or not that SLO has been met?_

    A17: **0.15**

    _Q18: A summary metric `http_upload_bytes` has been added to track the amount of bytes uploaded per request. What are percentiles being reported by this metric?_
    1. 0.02, 0.05, 0.08, 0.1, 0.13, 0.18, 0.21, 0.24, 0.3, 0.35, 0.4
    2. 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99
    3. events, tickets
    4. 200, 201, 400, 404

    A18: **?**

    ### Expression browser
    - Web UI for Prometheus server to query data
    - `up` - returns which targets are in up state (you can see an `instance` and `job` and value on the right - `0` and `1`)

    ### Prometheus on Docker
    - Pull image `prom/prometheus`
    - Configure `prometheus.yml`
    - Expose ports, bind mounts
    - Run: `docker run -d -v /path-to/prometheus.yml:/etc/prometheus/prometheus.yml -p 9090:9090 prom/prometheus` (a docker-compose equivalent is sketched below)
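
    A docker-compose equivalent of the above (a minimal sketch; the bind-mount path is whatever directory holds your `prometheus.yml`):
    ```yaml
    # docker-compose.yml
    services:
      prometheus:
        image: prom/prometheus
        ports:
          - "9090:9090"
        volumes:
          - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ```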

    ### PromTools
    - check & validate configuration before applying (e.g before production)
    - prevent downtime while config issues are being identified
    - validate metrics passed to it are correctly formatted
    - can perform queries on a Prom server
    - debugging & profiling a Prom server
    - perform unit tests against Recording/Alerting rules (see the test-file sketch after this list)
    - To check/validate config, run: `promtool check config /etc/prometheus/prometheus.yml`
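
    A sketch of what such a rule unit test can look like with `promtool test rules` (assuming a recording rule file named `rules.yml`; the series names and values here are made up):
    ```yaml
    # rules_test.yml - run with: promtool test rules rules_test.yml
    rule_files:
      - rules.yml
    evaluation_interval: 1m
    tests:
      - interval: 1m
        input_series:
          - series: 'http_requests_total{job="api"}'
            values: '0+10x10' # 0, 10, 20, ... one sample per minute
        promql_expr_test:
          - expr: rate(http_requests_total{job="api"}[5m])
            eval_time: 10m
            exp_samples:
              - labels: '{job="api"}'
                value: 0.16666666666666666
    ```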

    ### Container metrics
    - metrics can be scraped from containerized envs

    #### Docker engine metrics (how much CPU does Docker use etc. **not** metrics specific to a container!)
    * `vi /etc/docker/daemon.json`:
    ```json
    {
      "metrics-addr": "0.0.0.0:9323",
      "experimental": true
    }
    ```
    * `sudo systemctl restart docker`
    * `curl localhost:9323/metrics`
    * Prometheus job update:
    ```yaml
    scrape_configs:
      - job_name: "docker"
        static_configs:
          - targets: ["12.1.13.4:9323"]
    ```
    #### cAdvisor (how much memory does each container use? container uptime? etc.)
    * `vi docker-compose.yml` to pull `gcr.io/cadvisor/cadvisor` (a minimal compose file is sketched below)
    * `docker-compose up` or `docker compose up`
    * `curl localhost:8080/metrics`
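
    A minimal `docker-compose.yml` sketch for cAdvisor (the read-only host mounts follow the cAdvisor docs; adjust as needed):
    ```yaml
    services:
      cadvisor:
        image: gcr.io/cadvisor/cadvisor
        ports:
          - "8080:8080"
        volumes:
          - /:/rootfs:ro
          - /var/run:/var/run:ro
          - /sys:/sys:ro
          - /var/lib/docker/:/var/lib/docker:ro
    ```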

    ## PromQL
    - short for Prometheus Query Language
    - data returned can be visualized in dashboards
    - used to build alerting rules to notify about thresholds

    ### Data Types
    1. **String** (currently unused)
    2. **Scalar** - numeric floating point value (e.g. `54.743`)
    3. **Instant vector** - set of time series containing a single sample for each time series, all sharing the same timestamp (e.g. `node_cpu_seconds_total` finds all unique labels and a value for each, and they will all be at a single point in time)
    4. **Range vector** - set of time series containing a range of data points over time for each time series (e.g. `node_cpu_seconds_total[3m]` finds all unique labels, but all values and timestamps from the past 3 minutes)

    ### Selectors
    - if we only want to return a subset of time series for a metric = label matchers:
    * **exact** match `=` (e.g. `node_filesystem_avail_bytes{instance="node1"}`)
    * **negative** equality `!=` (e.g. `node_filesystem_avail_bytes{device!="tmpfs"}`)
    * **regular** expression `=~` (e.g. starts with /dev/sda - `node_filesystem_avail_bytes{device=~"/dev/sda.*"}`)
    * **negative regular** expression `!~` (e.g. mountpoint does not start with /boot - `node_filesystem_avail_bytes{mountpoint!~"/boot.*"}`)
    - we can combine multiple selectors with comma `,`: (e.g. `node_filesystem_avail_bytes{instance="node1",device!="tmpfs"}`)

    ### Modifiers
    - to get historic data, use an `offset` modifier after the label matching (e.g. get value 5 minutes ago - `node_memory_active_bytes{instance="node1"} offset 5m`)
    ### Operators
    - between instant vectors and scalars
    - types:
    1. Arithmetic `+`, `-`, `*`, `/`, `%`, `^` (e.g. `node_memory_Active_bytes / 1024` - but it **drops the metric name** in the output as it is no longer the original metric!)
    2. Comparison `==`, `!=`, `>`, `<`, `>=`, `<=`, `bool` (e.g. `node_network_flags > 100`, `node_network_receive_packets_total >= 220`, `node_filesystem_avail_bytes < bool 1000` returns `0` or `1`, mostly for generating alerts)
    3. Logical `or`, `and`, `unless` (e.g. `node_filesystem_avail_bytes > 1000 and node_filesystem_avail_bytes < 3000`). The unless operator results in a vector consisting of elements on the left side for which there are no elements on the right side (e.g. return all vectors greater than 1000 unless they are greater than 30000 `node_filesystem_avail_bytes > 1000 unless node_filesystem_avail_bytes > 30000`)
    4. more than one operator follows the order of precedence from highest to lowest, while operators on the same precedence level are performed from the left (e.g. `2 * 3 % 2 = (2 * 3) % 2`), however power is performed from the right (e.g. `2 ^ 3 ^ 2 = 2 ^ (3 ^ 2)`):

    <pre>
    high ^ ^
    | *, /, %, atan2
    | +, -
    | ==, !=, <=, <, >=, >
    | and, unless
    low | or
    </pre>

    >Quiz

    _Q1: Construct a query to return all filesystems that have over 1000 bytes available on all instances under web job._

    A1: `node_filesystem_avail_bytes{job="web"} > 1000`

    _Q2: Which of the following queries you will use for loadbalancer:9100 host to return all the interfaces that have received less than or equal to 10000 bytes of traffic?_

    A2: `node_network_receive_bytes_total{instance="loadbalancer:9100"} <= 10000`

    _Q3: node_filesystem_files tracks the filesystem's total file nodes. Construct a query that only returns time series greater than 500000 and less than 10000000 across all jobs_

    A3: `node_filesystem_files > 500000 and node_filesystem_files < 10000000`

    _Q4: The metric node_filesystem_avail_bytes lists the available bytes for all filesystems, and the metric node_filesystem_size_bytes lists the total size of all filesystems. Run each metric and see their outputs. There are three properties/labels these will return: device, fstype, and mountpoint. Which of the following queries will show the percentage of free disk space for all filesystems on all the targets under web job whose device label does not match tmpfs?_

    A4: `node_filesystem_avail_bytes{job="web", device!="tmpfs"}*100 / node_filesystem_size_bytes{job="web", device!="tmpfs"}`

    ### Vector matching
    - between 2 instant vectors (e.g. to get the percentage of free space `node_filesystem_avail_bytes / node_filesystem_size_bytes * 100` )
    - samples with exactly the same labels get matched together (e.g. instance and job and mountpoint **must be the same** to get a match) - every element in the vector on the left tries to find a single matching element on the right
    - to perform operation on 2 vectors with differing labels like `http_errors` `code="500"`, `code="501"`, `code="404"`, `method="put"` etc. use the `ignoring` keyword (e.g. `http_errors{code="500"} / ignoring(code) http_requests`)
    - if the entries with e.g. methods `put` and `del` have no match in both metrics `http_errors` and `http_requests`, they will **not** show up in the results!
    - to specify which labels to match on, we use the `on` keyword (e.g. `http_errors{code="500"} / on(method) http_requests`)
    - table - matching:

    |vector1|+ vector2|= resulting vector|
    |-------|-------|--------------------|
    - Resulting vector will have matching elements with all labels listed in `on`, or all labels except the `ignoring`-ed ones: e.g. `vector1{}+on(cpu) vector2{}` or `vector1{}+ignoring(mode) vector2{}`
    - Another example is: `http_errors_total / ignoring(error) http_requests_total` = `http_errors_total / on(instance, job, path) http_requests_total`

    >Quiz

    _Q1: Which of the following queries can be used to track the total number of seconds cpu has spent in `user` + `system` mode for instance `loadbalancer:9100`?_

    A1: `node_cpu_seconds_total{instance="loadbalancer:9100", mode="user"} + ignoring(mode) node_cpu_seconds_total{instance="loadbalancer:9100", mode="system"}`

    _Q2: Construct a query that will find out what percentage of time each cpu on each instance was spent in mode user. To calculate the percentage in mode user, get the total seconds spent in mode user and divide that by the sum of the time spent across all modes. Further, multiply that result by 100 to get a percentage._

    A2: `node_cpu_seconds_total{mode="user"}*100 / ignoring(mode, job) sum by(instance, cpu) (node_cpu_seconds_total)`

    #### Many-to-one vector matching
    - when you get the error `multiple matches for labels: many-to-one matching must be explicit (group_left/group_right)` while executing a query
    - it is where each vector element on the "one" side can match with multiple elements on the "many" side (e.g. `http_errors + on(path) group_left http_requests`) - `group_left` tells PromQL that elements **from the right side** are now matched with multiple elements from the left (`group_right` is the opposite of that - depending on which side is the many and which side is the one)

    |many|+ one|= resulting vector|
    |-------|-------|--------------------|
    |{error=400,path=/dogs} 1|{path=/dogs} 7|{error=400,path=/dogs} 8|
    |{error=500,path=/dogs} 7| |{error=500,path=/dogs} 14|

    >Quiz

    _Q1: The api job collects metrics on an API used for uploading files. The API has 3 endpoints `/images`, `/videos` and `/songs`, which are used to upload respective file types. The API provides 2 metrics to track: `http_uploaded_bytes_total` - tracks the number of uploaded bytes and `http_upload_failed_bytes_total` - tracks the number of bytes failed to upload. Construct a query to calculate the percentage of bytes that failed for each endpoint. The formula for the same is http_upload_failed_bytes_total*100 / http_uploaded_bytes_total._

    A1: `http_upload_failed_bytes_total*100 / ignoring(error) group_left http_uploaded_bytes_total`

    ### Aggregation operators
    - allow you to take an **instant vector** and aggregate its elements, resulting in a new instant vector with **fewer elements**
    - `sum`, `min`, `max`, `avg`, `group`, `stddev`, `stdvar`, `count`, `count_values`, `bottomk`, `topk`, `quantile`
    - for example `sum(http_requests)`, `max(http_requests)`
    - `by` keyword allows you to choose **which labels to aggregate along** (e.g. `sum by(path) (http_requests)`, `sum by(method) (http_requests)`, `sum by(instance) (http_requests)`, `sum by(instance, method) (http_requests)`)
    - `without` keyword does the opposite of `by` and tells the query which labels not to include in aggregation (e.g. `sum without(cpu, mode) (node_cpu_seconds_total)`)

    >Quiz

    _Q1: On loadbalancer:9100 instance, calculate the sum of the size of all filesystems. The metric to get filesystem size is `node_filesystem_size_bytes`_

    A1: `sum(node_filesystem_size_bytes{instance="loadbalancer:9100"})`

    _Q2: Construct a query to find how many CPUs instance loadbalancer:9100 have. You can use the node_cpu_seconds_total metric to find out the same._

    A2: `count(sum by (cpu) (node_cpu_seconds_total{instance="loadbalancer:9100"}))`

    _Q3: Construct a query that will show the number of CPUs on each instance across all jobs._

    A3: `?`

    _Q4: Use the node_network_receive_bytes_total metric to calculate the sum of the total received bytes across all interfaces on per instance basis_

    A4: `sum by(instance)(node_network_receive_bytes_total)`

    _Q5: Which of the following queries will be used to calculate the average packet size for each instance?_

    A5: `sum by(instance)(node_network_receive_bytes_total) / sum by(instance)(node_network_receive_packets_total)`

    ### Functions
    - sorting, math, label transformations, metric manipulation
    - use the `round` function to round the query's result to the nearest integer value
    - truncate/round **up** to the closest integer: `ceil(node_cpu_seconds_total)`
    - round **down**: `floor(node_cpu_seconds_total)`
    - **absolute** value for negative numbers: `abs(1-node_cpu_seconds_total)`
    - date & time: `time()`, `minute()` etc.
    - vector function takes a scalar value and converts it into an instant vector: `vector(4)`
    - scalar function returns the value of the single element as a scalar (otherwise returns `NaN` if the input vector does not have exactly one element): `scalar(process_start_time_seconds)`
    - sorting: `sort` (ascending) and `sort_desc` (descending)
    - rate at which a counter metric increases: `rate` and `irate` (e.g. group together data points by 60 seconds, get last value minus first value in each of these 60s groups and divide it by 60: `rate(http_errors[1m])`; irate is similar to rate, but you get the last value and the second to last data points: `irate(http_errors[1m])`)
    - table - difference:

    |rate|irate|
    |----|-----|
    |looks at the first and last data points within a range|looks at the last two data points within a range|
    |effectively an average rate over the range|instant rate|
    |best for slow moving counters and alerting rules|should be used for graphing volatile fast-moving counters|

    Notes:
    - make sure there is at least **4** samples within the time range (e.g. 15s scrape interval 60s window gives 4 samples)
    - when combining rate with an aggregation operator, always take `rate()` first, then aggregate (so it can detect counter resets)
    - to get the rate of increase of the sum of latency across all requests: `rate(requests_latency_seconds_sum[1m])`
    - to calculate the average latency of a request over the past 5m: `rate(requests_latency_seconds_sum[5m]) / rate(requests_latency_seconds_count[5m])`

    >Quiz

    _Q1: Management wants to keep track of the rate of bytes received by each instance. Each instance has two interfaces, so the rate of traffic being received on them must be summed up. Calculate the rate of received `node_network_receive_bytes_total` using 2-minute window, sum the rates across all interfaces, and group the results by instance._

    A1: `sum by(instance) (rate(node_network_receive_bytes_total[2m]))`

    ### Subquery
    - Syntax: `<instant_query> [<range>:<resolution>] [offset <duration>]`
    - Example: `rate(http_requests_total[1m]) [5m:30s]` - where sample range is 1m, query range is data from the last 5m and query step for subquery is 30s (gap between)
    - maximum value over a 10min of a gauge metric (`max_over_time(node_filesystem_avail_bytes[10m])`)
    - for counter metrics, we need to find the max value of the rate over the past 5min (e.g. maximum rate of requests from the last 5 minutes with a 30s query interval and a sample range of 1m: `max_over_time(rate(http_requests_total[1m]) [5m:30s])`)

    >Quiz

    _Q1: There were reports of a small outage of an application in the past few minutes, and some alerts pointed to potential high iowait on the CPUs. We need to calculate when the iowait rate was the highest over the past 10 minutes. [Construct a subquery that will calculate the rate at which all cpus spent in iowait mode using a 1 minute time window for the rate function. Find the max value of this result over the past 10 minutes using a 30s query step for the subquery.]_

    A1: **?**

    _Q2: Construct a query to calculate the average over time (avg_over_time) rate of http_requests_total over the past 20m using 1m query step._

    A2: **?**

    ### Recording rules
    - allow Prometheus to periodically evaluate PromQL expressions and store the resulting time series generated by them
    - speeding up your dashboards
    - provide aggregated results for use elsewhere
    - recording rules go in a separate file called a rule file:
    ```yaml
    global: ...
    rule_files:
      - <path to rules.yml file(s)>
    ```
    - syntax of the `rules.yml` file:
    ```yaml
    groups: # groups running in parallel
      - name: <group_name_1>
        interval: <evaluation interval, global by default>
        rules: # however, rules evaluated sequentially
          - record: <rule_name_1>
            expr: <promql_expression_1>
            labels:
              <label_name>: <label_value>
          - record: <rule_name_2> # you can also reference previous rule(s)
            expr: <promql_expression_1>
            labels:
      - name: <group_name_2>
        ...
    ```
    - example of the `rules.yml` file:
    ```yaml
    groups:
      - name: <group_name>
        rules:
          - record: node_filesystem_free_percent
            expr: 100 * node_filesystem_free_bytes / node_filesystem_size_bytes
    ```
    - best practices for rule naming: `aggregation_level:metric_name:operations`, e.g. we have a `http_errors` counter with two instrumentation labels "method" and "path". All the rules for a specific job should be contained in a single group. It will look like:
    ```yaml
    - record: job_method_path:http_errors:rate5m
      expr: sum without(instance) (rate(http_errors{job="api"}[5m]))
    ```

    ### HTTP API
    - execute queries, gather information on alerts, rules, service discovery related configs
    - send the POST request to `http://<prometheus_ip>/api/v1/query`
    - example: `curl http://<prometheus_ip>:9090/api/v1/query --data 'query=node_arp_entries{instance="192.168.1.168:9100"}'`
    - query at a specific time, just add another `--data 'time=169386192'`
    - response back as JSON

    ## Dashboarding & Visualization
    - several different ways:
    * expression browser with graph tab (built-in)
    * console templates (built-in)
    * 3rd party like Grafana
    - expression browser has limited functionality, only for ad-hoc queries and quick debugging, cannot create custom dashboards, not good for day-to-day monitoring, but can at least have multiple panels and compare graphs

    ### Console Templates
    - allow to create custom HTML pages using Go templating language (typically between `{{` and `}}`)
    - Prometheus metrics, queries and charts can be embedded in the templates
    - `ls /etc/prometheus/consoles` to see the `*.html` and example (to see it, go to https://localhost:9090/consoles/index.html.example)
    - boilerplate will typically contain:
    ```html
    {{ template "head" . }}
    {{ template "prom_content_head" . }}
    {{ template "prom_content_tail" . }}
    {{ template "tail" . }}
    ```
    - another example with memory/cpu graphs:
    ```html
    {{ template "head" . }}
    {{ template "prom_content_head" . }}
    <h1>Node Stats</h1>
    <h3>Memory</h3>
    <strong>Memory utilization:</strong> {{ template "prom_query_drilldown" (args "100- (node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes*100)") }}
    <br/>
    <strong>Memory Size:</strong> {{ template "prom_query_drilldown" (args "node_memory_MemTotal_bytes/1000000" "Mb") }}
    <h3>CPU</h3>
    <strong>CPU Count:</strong> {{ template "prom_query_drilldown" (args "count(node_cpu_seconds_total{mode='idle'})") }}
    <br/>
    <strong>CPU Utilization:</strong> {{ template "prom_query_drilldown" (args "sum(rate(node_cpu_seconds_total{mode!='idle'}[2m]))*100/56") }}
    <!--
    Expression explanation: The expression will take the current rate of all cpu modes except idle because idle means cpu isn’t being used. It will then sum them up and multiply them by 100 to give a percentage. This final number is divided by 56 (if this server/node has 56 CPUs, we want to get the utilization per CPU, so adjust this value as needed).
    -->
    <div id="cpu"></div>
    <script>
    new PromConsole.Graph({
    node: document.querySelector("#cpu"),
    expr: "sum(rate(node_cpu_seconds_total{mode!='idle'}[2m]))*100/2",
    })
    </script>
    <h3>Network</h3>
    <div id="network"></div>
    <script>
    new PromConsole.Graph({
    node: document.querySelector("#network"),
    expr: "rate(node_network_receive_bytes_total[2m])",
    })
    </script>
    {{ template "prom_content_tail" . }}
    {{ template "tail" . }}
    ```

    ## Application Instrumentation
    - the Prometheus client libraries provide an easy way to add instrumentation to your code in order to track and expose metrics for Prometheus

    ### Implementing histogram & summary in your Python code (example)

    ```python
    # add histogram metric to track latency/response time for each request
    LATENCY = Histogram('request_latency_seconds', 'Request Latency', labelnames=['path', 'method'])
    # in after_request, calculate the latency and pass it to the metric:
    request_latency = time.time() - request.start_time
    LATENCY.labels(request.method, request.path).observe(request_latency)
    ```

    - client libraries can let you specify bucket sizes (e.g. `buckets=[0.01, 0.02, 0.1]`)
    - to configure summary, it is the exact same, just use `LATENCY = Summary('......)`

    ### Implementing gauge metric in your Python code (example)

    ```python
    # track the number of active requests getting processed at the moment
    IN_PROGRESS = Gauge('name', 'Description', labelnames=['path', 'method'])
    ```

    ### Best practices
    - use snake_case naming, all lowercase, e.g. `library_name_unit_suffix`
    - first word should be **app/library name** it is used for
    - next add **what is it used for**
    - add unit (`_bytes`) at the end, use **unprefixed base units** (not microseconds or kilobytes)
    - avoid `_count`, `_sum`, `_bucket` suffixes
    - good examples: `process_cpu_seconds`, `http_requests_total`, `redis_connection_errors`, `node_disk_read_bytes_total`
    - bad examples: `container_docker_restarts`, `http_requests_sum`, `nginx_disk_free_kilobytes`, `dotnet_queue_waiting_time`
    - three types of services/apps:
    * **online** - immediate response is expected (tracking queries, errors, latency etc)
    * **offline** - no one is actively waiting for response (amount of queue, wip, processing rate, errors etc)
    * **batch** - similar to offline but regular, needs push gw (time processing, overall runtime, last completion time)


    ## Service Discovery

    ### File SD
    - list of jobs/targets can be imported from a json/yaml file(s)
    - example #1:
    ```yaml
    scrape_configs:
      - job_name: file-example
        file_sd_configs:
          - files:
              - <path to targets .json/.yml file(s)>
    ```

    ### EC2 SD
    - example:
    ```yaml
    scrape_configs:
      - job_name: ec2
        ec2_sd_configs:
          - region: <region>
            access_key: <access key>
            secret_key: <secret key>
    ```
    - automatically extracts metadata for each EC2 instance
    - defaults to using **private IPs**

    ### Re-labeling
    - classify Prometheus targets & metrics by rewriting their label set
    - e.g. rename instance from `node1:9100` to just `node1`, drop metrics, drop labels etc
    - 2 options:
    * `relabel_configs` (in `Prometheus.yml`) which occurs **before** scrape and only has access to labels added by SD mechanism
    * `metric_relabel_configs` (in `Prometheus.yml`) which occurs **after** the scrape

    #### Examples - `relabel_configs`
    - example #1: `__meta_ec2_tag_env = dev | prod`
    ```yaml
    - job_name: aws
      relabel_configs:
        - source_labels: [__meta_ec2_tag_env]
          regex: prod # to match on specific value of that label
          action: keep|drop|replace # keep=continue to scrape BUT in that case if regex is not match it will NOT be scraped (there is implicit invisible catchall at the end!), drop=no longer scrape this target
    ```
    - example #2: when there are more than 1 source labels (array) they will be joined by a `;`
    ```yaml
    relabel_configs:
      - source_labels: [env, team] # if the target has {env=dev} and {team=marketing}, we will keep it
        regex: dev;marketing
        action: keep # everything else will be dropped
        # separator: "-" # optional, to change the delimiter between joined labels use the separator property
    ```

    - **target labels** = labels that are added to the labels of every time series returned from a scrape; relabeling will drop all auto-discovered labels (starting with `__`). In other words, target labels are assigned to every metric from that specific target. Discovered labels (those starting with `__`) are dropped after the initial relabeling process and do not get assigned as target labels.
    - example #3 of saving `__address__=192.168.1.1:80` label in target label, but need to transform into `{ip=192.168.1.1}`
    ```yaml
    relabel_configs:
      - source_labels: [__address__]
        regex: (.*):.* # capture the IP portion before the port
        action: replace
        target_label: ip
        replacement: $1
    ```
    - example #4 joins two captured source labels into a single target label `info` (the source labels here are illustrative):
    ```yaml
    relabel_configs:
      - source_labels: [env, team]
        regex: (.*);(.*)
        action: replace
        target_label: info
        replacement: $1-$2
    ```
    - example #5 Re-label so the label `team` name changes to the `organization` and the value gets prepended with `org-`
    ```yaml
    relabel_configs:
      - source_labels: [team]
        regex: (.*)
        action: replace
        target_label: organization
        replacement: org-$1
    ```
    - to drop a label (e.g. `size`) from every time series, use the `labeldrop` action:
    ```yaml
    relabel_configs:
      - regex: size
        action: labeldrop
    ```
    - the opposite of labeldrop is `labelkeep` - but keep in mind **ALL** other labels will be dropped!
    ```yaml
    relabel_configs:
      - regex: instance|job
        action: labelkeep
    ```
    - `labelmap` can copy discovered labels into target labels, e.g. mapping every `__meta_ec2_*` label:
    ```yaml
    relabel_configs:
      - regex: __meta_ec2_(.*)
        action: labelmap
        replacement: ec2_$1 # we will prepend it with `ec2` - e.g. ec2_ami="ami-abcdefgh123456"
    ```
    #### Examples - `metric_relabel_configs`
    - takes place **after** the scrape is performed and has access to the scraped metrics (not just the labels)
    - configuration is identical to `relabel_configs`
    - example #1:
    ```yaml
    - job_name: example
      metric_relabel_configs: # this will drop the metric http_errors_total
        - source_labels: [__name__]
          regex: http_errors_total
          action: drop # or keep, which will drop EVERY other metric
    ```
    - example #4:
    ```yaml
    - job_name: example
      metric_relabel_configs: # strips off the forward slash and renames {path=/cars} -> {endpoint=cars}. Keep in mind there will now be a path as well as an endpoint label. Use drop to get rid of the label path showing the same information.
        - source_labels: [path]
          regex: \/(.*) # any text after the forward slash (wrapping it in parentheses gives you access with $)
          action: replace
          target_label: endpoint
          replacement: $1 # match the original value
    ```

    ## Push Gateway
    - By default, Pushgateway listens to port **9091**
    - used when a process has already exited **before** the scrape occurs
    - middle man between batch job and Prometheus server
    - Prometheus will scrape metrics from the PG
    - installation:
    1. `pushgateway-1.4.3.linux-amd64.tar.gz` from the releases page, untar, run `./pushgateway`
    2. create a new user `sudo useradd --no-create-home --shell /bin/false pushgateway`
    3. copy the binary to `/usr/local/bin`, change owner to pushgateway, configure service file (same as the Prometheus)
    4. `systemctl daemon-reload`, restart, enable
    5. Test `curl localhost:9091/metrics`
    - configure Prometheus to scrape gateway. Same as other targets, but needs additional `honor_labels: true` (allows the metrics to specify custom labels like `job1`, `job2` etc)
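    - a sketch of such a scrape config (the target address is an assumption):
    ```yaml
    scrape_configs:
      - job_name: pushgateway
        honor_labels: true # keep the job/instance labels that were pushed with the metrics
        static_configs:
          - targets: ["192.168.1.168:9091"]
    ```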
    - for sending the metrics, you send an HTTP **POST** request: `http://<pushgateway_addr>:<port>/metrics/job/<job_name>/<label1>/<value1>/<label2>/<value2>...` where `job_name` will be the job label of the pushed metrics, and the label/value path segments are used as a grouping key, which allows grouping metrics together to update/delete multiple metrics at once. When sending a POST request, only metrics with the same name as the newly pushed ones are replaced (this only applies to metrics in the same group):
    1. see the original metrics:
    ```
    processing_time_seconds{quality="hd"} 120
    processed_videos_total{quality="hd"} 10
    processed_bytes_total{quality="hd"} 4400
    ```
    2. **POST** the `processing_time_seconds{quality="hd"} 999`
    3. result:
    ```
    processing_time_seconds{quality="hd"} 999
    processed_videos_total{quality="hd"} 10
    processed_bytes_total{quality="hd"} 4400
    ```
    - example: push metric `example_metric 4421` with a job label of `{job="db_backup"}`:
    ```bash
    # ('@-' tells curl to read the binary data from stdin)
    echo "example_metric 4421" | curl --data-binary @- http://localhost:9091/metrics/job/db_backup
    ```
    - another example with sending multiple metrics at once:
    ```bash
    cat <<EOF | curl --data-binary @- http://localhost:9091/metrics/job/video_processing/instance/mp4_node1
    processing_time_seconds{quality="hd"} 120
    processed_videos_total{quality="hd"} 10
    processed_bytes_total{quality="hd"} 4400
    EOF
    ```
    - When using HTTP **PUT** request however, the behavior is different. All metrics within a specific group get replaced by the new metrics being pushed (deletes preexisting):
    1. start with:
    ```
    processing_time_seconds{quality="hd"} 999
    processed_videos_total{quality="hd"} 10
    processed_bytes_total{quality="hd"} 4400
    ```
    2. **PUT** the `processing_time_seconds{quality="hd"} 666`
    3. result:
    ```
    processing_time_seconds{quality="hd"} 666
    ```
    - HTTP **DELETE** request will delete all metrics within a group (not going to touch any metrics in the other groups): `curl -X DELETE http://localhost:9091/metrics/job/archive/app/web` will only delete all with `{app="web"}`

    #### Client library
    - Python: `from prometheus_client import CollectorRegistry, pushadd_to_gateway`, then initialize `registry = CollectorRegistry()`. You can then push via `pushadd_to_gateway('user2:9091', job='batch', registry=registry)`
    - 3 functions within a library to push metrics:
    * `push` - same as HTTP PUT (any existing metrics for this job are removed and the pushed metrics added)
    * `pushadd` - same as HTTP POST (overrides existing metrics with the same names, but all other metrics in group remain unchanged)
    * `delete` - same as HTTP DELETE (all metrics for a group are removed)
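    - a minimal Python sketch putting the above together (the metric itself is an illustrative assumption; the gateway address and job name follow the example above):
    ```python
    from prometheus_client import CollectorRegistry, Gauge, pushadd_to_gateway

    # use a dedicated registry instead of the global default one
    registry = CollectorRegistry()

    # the metric name/value are illustrative assumptions
    last_success = Gauge('batch_last_success_unixtime', 'Last time the batch job finished OK', registry=registry)
    last_success.set_to_current_time()

    # pushadd = HTTP POST semantics; push_to_gateway()/delete_from_gateway() cover PUT/DELETE
    pushadd_to_gateway('user2:9091', job='batch', registry=registry)
    ```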

    ## Alerting
    - lets you define conditions that, if met, trigger alerts
    - these are standard PromQL expressions (e.g. `node_filesystem_avail_bytes < 1000` = 547)
    - Prometheus is only **responsible for triggering** alerts
    - responsibility of sending notification is offloaded onto **alertmanager** -> Slack, email, SMS etc.
    - alerts are visible in the web gui under "alerts" and they are green if not alerting
    - alerting rules are similar to recording rules, in fact they are in the same location (`rule_files` in `prometheus.yaml`):
    ```yaml
    groups:
      - name: node
        rules:
          - alert: LowMemory
            expr: node_memory_memFree_percent < 20
    ```
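    - a minimal sketch of how such a rules file is referenced from `prometheus.yml` (the file name is an assumption):
    ```yaml
    rule_files:
      - rules.yml
    ```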
    - The `for` clause tells Prometheus that an expression must evaluate true **for a specific period of time**:
    ```yaml
    - alert: node_down # note: no spaces in alert names
      expr: up{job="node"} == 0
      for: 5m # expects the node to be down for 5 minutes before firing an alert
    ```
    - 3 alert states:
    1. **inactive** - has not returned any results [**green**]
    2. **pending** - it hasn't been long enough to be considered firing (related to `for`) [**orange**]
    3. **firing** - active for more than the defined `for` clause [**red**]

    ### Labels & Annotations
    - optional labels can be added to alerts to provide a mechanism to **classify and match alerts**
    - important, because they can be used when you set up rules in the alert manager so you can match on these and group them together
    ```yaml
    - alert: node_down
      expr: ...
      labels:
        severity: critical
    ```
    - annotations (use Go templating) can be used to **provide additional/descriptive information** (unlike labels they do not play a part in the alerts identity)
    ```yaml
    - alert: node_filesystem_free_percent
      expr: ...
      annotations:
        description: Instance {{ .Labels.instance }} has low disk space on filesystem {{ .Labels.mountpoint }}, current free space is at {{ .Value }}%
    ```
    This is how the templating works:
    - `{{.Labels}}` to access alert labels
    - `{{.Labels.instance}}` to get instance label
    - `{{.Value}}` to get the firing sample value

    ### Alertmanager
    - By default, Alertmanager is running on port **9093**
    - responsible for receiving alerts generated by Prometheus and converting them to notifications
    - supports multiple Prometheus servers via API
    - workflow:
    1. dispatching receives the incoming alerts and groups them together
    2. inhibition suppresses an alert if another related alert is already firing
    3. silencing mutes alerts (e.g. maintenance)
    4. routing is responsible for deciding where each alert gets sent
    5. notification integrates with all 3rd party tools (email, Slack, SMS, etc.)
    - installation:
    1. tarball (`alertmanager-0.24.0.linux-amd64.tar.gz`) contains `alertmanager` binary, `alertmanager.yml` config file, `amtool` command line utility and `data` folder where the notification states are stored
    2. The installation is the same as previous tools (add new user, create `/etc/alertmanager`, create `/var/lib/alertmanager`, copy executables to `/usr/local/bin`, change ownerships, create service file, daemon-reload, start, enable). `ExecStart` in systemd expects `--config.file` and `--storage.path` (see the unit sketch after this list)!
    3. starting is simple `./alertmanager` and listens on 9093 (you can see the interface on https://localhost:9093)
    4. restarting AM can be done via HTTP **POST** to `/-/reload` endpoint, `systemctl restart alertmanager` or `killall -HUP alertmanager`
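    - a minimal `alertmanager.service` unit sketch for step 2, assuming the user and paths described above:
    ```ini
    [Unit]
    Description=Alertmanager
    Wants=network-online.target
    After=network-online.target

    [Service]
    User=alertmanager
    Group=alertmanager
    Type=simple
    ExecStart=/usr/local/bin/alertmanager \
      --config.file=/etc/alertmanager/alertmanager.yml \
      --storage.path=/var/lib/alertmanager

    [Install]
    WantedBy=multi-user.target
    ```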
    - configure Prometheus to use that alertmanager:
    ```yaml
    global: ...
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - alertmanager1:9093
                - alertmanager2:9093
    ```
    - there are 3 main sections of `alertmanager.yml`:
    1. **global** - applies across all sections which can be overwritten (e.g. `smtp_smarthost`)
    2. **route** - set of rules to determine what alerts get matched up (`match_re`, `matchers`) with what receiver
    - at the top level, there is a default route - any alerts that **don't match any of the other** routes will use this default, example route:
    ```yaml
    route:
      routes:
        - match_re: # regular expression
            job: (node|windows)
          receiver: infra-email
        - matchers: # all alerts with job=kubernetes & severity=ticket labels will match this rule
            job: kubernetes
            severity: ticket
          receiver: k8s-slack # they will be sent to this receiver
    ```
    - nested routes / subroutes are also supported:
    ```yaml
    routes:
      - matchers: # parent route
          job: kubernetes # 2. all other alerts with this label will match this main route (k8s-email)
        receiver: k8s-email
        routes: # sub-route for further route matching (logical AND)
          - matchers:
              severity: pager # 1. if the alert also has label severity=pager, then it will be sent to k8s-pager
            receiver: k8s-pager
    ```
    - by default an alert stops at the first matching route; `continue: true` lets it keep matching subsequent routes:
    ```yaml
    routes:
      - receiver: alert-logs # all alerts to be sent to alert-logs
        continue: true
      - matchers:
          job: kubernetes # AND then if it also has this label job=kubernetes, it will be also sent to k8s-email
        receiver: k8s-email
    ```
    - grouping allows to split up your notifications by labels (otherwise all alerts result in **one big** notification):
    ```yaml
    receiver: fallback-pager
    group_by: [team]
    routes:
      - group_by: [region, env]
        receiver: infra-email
        # any child routes underneath here will inherit the grouping policy and group based on same 2 labels region, env
    ```
    3. **receivers** - one or more notifiers to forward alerts to users (e.g. `slack_configs`)
    - make use of global configurations so all of the receivers don't have to manually define the same key:
    ```yaml
    global:
      victorops_api_key: XXX # this will be automatically provided to all receivers below
    receivers:
      - name: infra-pager
        victorops_configs:
          - routing_key: some-route-here
    ```
    - you can customize the message by using Go templating:
    * **GroupLabels** (e.g. `title:` in `slack_configs`: `{{.GroupLabels.severity}} alerts in region {{.GroupLabels.region}}`)
    * **CommonLabels**
    * **CommonAnnotations**
    * **ExternalURL**
    * **Status**
    * **Receiver**
    * **Alerts** (e.g. `text:` in `slack_configs`: `{{.Alerts | len}} alerts:`)
    * **Labels**
    * **Annotations** (`{{range .Alerts}}{{.Annotations.description}}{{"\n"}}{{end}}`)
    * **Status**
    * **StartsAt**
    * **EndsAt**
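    - a sketch of a Slack receiver combining a few of these (the webhook URL and channel are assumptions):
    ```yaml
    receivers:
      - name: k8s-slack
        slack_configs:
          - api_url: https://hooks.slack.com/services/XXX # assumption - the incoming-webhook URL
            channel: '#alerts'
            title: '{{ .GroupLabels.severity }} alerts in region {{ .GroupLabels.region }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ "\n" }}{{ end }}'
    ```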
    - Example `alertmanager.yml` config:
    ```yaml
    global:
      smtp_smarthost: 'localhost:25'
      smtp_from: '[email protected]'
    route:
      group_by: ['alertname']
      group_wait: 10s
      group_interval: 2m
      repeat_interval: 1h
      receiver: 'general-email'
      routes:
        - matchers:
            - team=global-infra
          receiver: global-infra-email
        - matchers:
            - team=internal-infra-email
          receiver: internal-infra-email
    receivers:
      - name: 'web.hook'
        webhook_configs:
          - url: 'http://127.0.0.1:5001/'
      - name: global-infra-email
        email_configs:
          - to: [email protected]
            require_tls: false
      - name: internal-infra-email
        email_configs:
          - to: [email protected]
            require_tls: false
      - name: general-email
        email_configs:
          - to: [email protected]
            require_tls: false
    ```
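    - before reloading, the file can be validated with `amtool` (the path is an assumption):
    ```bash
    amtool check-config /etc/alertmanager/alertmanager.yml
    ```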

    ### Silences
    - alerts can be silenced to prevent generating notifications for a period of time (like maintenance windows)
    - in the "new silence" button - specify start, end/duration, matchers (list of labels), creator, comment
    - you can then view those in the "silence" tab

    ## Monitoring Kubernetes
    - applications & clusters (control plane components, kubelet/cAdvisor, kube-state-metrics, node-exporter)
    - for both applications & clusters (control plane components, kubelet/cAdvisor, kube-state-metrics, node-exporter)
    - deploy Prometheus as close to targets as possible
    - make use of preexisting Kube infrastructure
    - to get access to cluster level metrics, we need `kube-state-metrics`
    - node-exporter should run on every node (deployed as a DaemonSet)
    - make use of service discovery via Kube API

    ### Installation via Helm chart
    1. source: https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
    2. makes use of the Prometheus Operator (https://github.com/prometheus-operator/prometheus-operator)
    3. a couple of custom resources (CRDs): `Prometheus`, `PrometheusRule`, `AlertmanagerConfig`, `ServiceMonitor`, `PodMonitor`
    4. Add Helm repo: `helm repo add prometheus-community https://prometheus-community.github.io/helm-charts`
    5. Update Helm repo: `helm repo update`
    6. Export all possible values: `helm show values prometheus-community/kube-prometheus-stack > values.yaml`
    7. Install the chart: `helm install prometheus-community/kube-prometheus-stack`
    8. (*Optional* `kubectl patch ds prometheus-prometheus-node-exporter --type "json" -p '[{"op": "remove", "path" : "/spec/template/spec/containers/0/volumeMounts/2/mountPropagation"}]'` - might need this due to node-exporter bug)
    - What does it do?
    - installs 2 StatefulSets (AM, Prometheus), 3 Deployments (Grafana, kube-prometheus-operator, kube-state-metrics), 1 DaemonSet (node-exporter)
    - SD can discover node, service, pod, endpoint (discovers targets from listed endpoints of a service. For each endpoint address one target is discovered per port. If the endpoint is backed up by a pod, all additional container ports of the pod, not bound to an endpoint port, are discovered as targets as well)

    ### Monitor K8s Application
    - once you have an application deployed and listening on some port (i.e. `3000`), you can change the Prometheus value `additionalScrapeConfigs` in the Helm chart and upgrade via `helm upgrade prometheus prometheus-community/kube-prometheus-stack -f new-values.yaml` (this is the less ideal option, it is better to use **service monitors** to apply new scrapes more declaratively)
    - instead, look at CRDs: `kubectl get crd`, specifically `prometheuses`, `servicemonitors` (set of targets to monitor and scrape, they allow to **avoid touching config directly** and give you a declarative Kube syntax to define targets)
    - if you want to scrape e.g. service named `api-service` exposing metrics on `/swagger-stats/metrics`, use:
    ```yaml
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: api-service-monitor
      labels:
        release: prometheus # default label that is used by serviceMonitorSelector - it dynamically discovers it
        app: prometheus
    spec:
      jobLabel: job # look for label job in the Service and take the value
      endpoints:
        - interval: 30s # equivalent of scrape_interval
          port: web # matches up with the port 3000 in the Service definition
          path: /swagger-stats/metrics # equivalent of metrics_path (path where the metrics are exposed)
      selector:
        matchLabels:
          app: service-api
    ```
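    - for context, a sketch of the kind of `Service` such a monitor selects (the pod selector and the `job` label value are assumptions; the name, port name and `app` label follow the comments above):
    ```yaml
    apiVersion: v1
    kind: Service
    metadata:
      name: api-service
      labels:
        app: service-api # matched by the ServiceMonitor selector above
        job: api # value picked up via jobLabel
    spec:
      selector:
        app: api # assumption - the pod labels of the application
      ports:
        - name: web # the ServiceMonitor references this port by name
          port: 3000
          targetPort: 3000
    ```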
    - but also look at the `Prometheus` custom resource (`kubectl get prometheuses`) and what is under `serviceMonitorSelector` (e.g. `matchLabels`: `release: prometheus`) - this label allows Prometheus to find service monitors in the cluster and register them so that it can start scraping the app the service monitor is pointing to (can confirm via Web UI - Status - Configuration)
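    - the relevant part of that resource looks roughly like this (a sketch; only the selector matters here):
    ```yaml
    apiVersion: monitoring.coreos.com/v1
    kind: Prometheus
    metadata:
      name: prometheus
    spec:
      serviceMonitorSelector:
        matchLabels:
          release: prometheus # ServiceMonitors carrying this label are picked up
    ```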
    - to add rules, use CRD called `PrometheusRule` - e.g.:
    ```yaml
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      labels:
        release: prometheus # similar to ServiceMonitor, to add the rule dynamically
      name: api-rules
    spec:
      groups:
        - name: api
          rules:
            - alert: down
              expr: up == 0
              for: 0m
              labels:
                severity: critical
              annotations:
                summary: Prometheus target missing {{$labels.instance}}
    ```
    - to add AM rules, use CRD called `AlertmanagerConfig` - e.g.:
    ```yaml
    apiVersion: monitoring.coreos.com/v1alpha1
    kind: AlertmanagerConfig
    metadata:
      name: alert-config
      labels:
        resource: prometheus # once again, must match alertmanagerConfigSelector - BUT Helm chart does not specify a label, so you need to update this value yourself!
    spec:
      route:
        groupBy: ["severity"]
        groupWait: 30s
        groupInterval: 5m
        repeatInterval: 12h
        receiver: "webhook"
      receivers:
        - name: "webhook"
          webhookConfigs:
            - url: "http://example.com/"
    ```
    - keep in mind the differences between a standard AM config and the Kubernetes one (table below):

    |Standard|Kubernetes|
    |--------|----------|
    |group_by|groupBy|
    |group_wait|groupWait|
    |group_interval|groupInterval|
    |repeat_interval|repeatInterval|
    |matchers job: kubernetes|matchers name: job, value: kubernetes|

    ## Conclusion

    Default ports:

    |Component|Port number|
    |---------|-----------|
    |prometheus|9090|
    |node-exporter|9100|
    |push gateway|9091|
    |alertmanager|9093|

    - KodeKloud slides: https://kodekloud.com/wp-content/uploads/2022/12/Prometheus_Certified_Associate-1.pdf
    - ​You will have 1.5 hours to complete the exam.​
    - The certification is valid for 3 years.​
    * ​To ensure your system meets the exam requirements, visit this link: https://syscheck.bridge.psiexams.com/
    * Important exams instructions to check before scheduling the exam: https://docs.linuxfoundation.org/tc-docs/certification/important-instructions-pca



    ---
    _Author: [@luckylittle](https://github.com/luckylittle)_

    _Last update: Wed Jan 25 05:22:25 UTC 2023_
    ## Mock Exam

    ### 1
    Q1. The metric `node_cpu_temp_celsius` reports the current temperature of a node's CPU in celsius. What query will return the average temperature across all CPUs on a per node basis? The query should return {instance="node1"} 23.5 //average temp across all CPUs on node1 {instance="node2"} 33.5 //average temp across all CPUs on node2.
    ```
    node_cpu_temp_celsius{instance="node1", cpu="0"} 28
    node_cpu_temp_celsius{instance="node1", cpu="1"} 19
    node_cpu_temp_celsius{instance="node2", cpu="0"} 36
    node_cpu_temp_celsius{instance="node2", cpu="1"} 31
    ```
    A1: `avg by(instance) (node_cpu_temp_celsius)`

    Q2: What method does Prometheus use to collect metrics from targets?
    A2: pull

    Q3: An engineer forgot to address an alert, based off the alertmanager config below, how long will they need to wait to see the alert again?
    ```yaml
    route:
    receiver: pager
    group_by: [alertname]
    group_wait: 10s
    repeat_interval: 4h
    group_interval: 5m
    routes:
    - match:
    team: api
    receiver: api-pager
    - match:
    team: frontend
    receiver: frontend-pager
    ```
    A3: 4h
    Q4: Which query below will get all time series for metric `node_disk_read_bytes_total` for job=web, and job=node?
    A4: `node_disk_read_bytes_total{job=~"web|node"}`

    Q5: What type of database does Prometheus use?
    A5: Time Series

    Q6: Analyze the alertmanager configs below. For all the alerts that got generated, how many total notifications will be sent out?
    ```yaml
    route:
      receiver: general-email
      group_by: [alertname]
      routes:
        - receiver: frontend-email
          group_by: [env]
          matchers:
            - team: frontend
    ```
    The following alerts get generated by Prometheus with the defined labels.
    ```
    alert1  team: frontend, env: dev
    alert2  team: frontend, env: dev
    alert3  team: frontend, env: prod
    alert4  team: frontend, env: prod
    alert5  team: frontend, env: staging
    ```
    A6: 3

    Q7: What is the Prometheus client library used for?
    A7: Instrumenting applications to generate prometheus metrics and to push metrics to the Push Gateway

    Q8: Management has decided to offer a file upload service where the SLO states that 97% of all upload should complete within 30s. A histogram metric is configured to track the upload time, which of the following bucket configurations is recommended for the desired SLO?
    A8: 10, 25, 27, 30, 32, 35, 49, 50
    [since histogram quantiles are approximations, to find out if a SLO has been met make sure that a bucket is specified at the desired SLO value]

    Q9: Which of the following is not a valid method for reloading alertmanager configuration?
    A9: hit the reload config button in alertmanager web ui

    Q10: What two labels are assigned to every metric by default?
    A10: instance, job

    Q11: What configuration will make it so Prometheus doesn’t scrape targets with a label of `team: frontend`?
    ```yaml
    #Option A:
    relabel_configs:
    - source_labels: [team]
    regex: frontend
    action: drop
    #Option B:
    relabel_configs:
    - source_labels: [frontend]
    regex: team
    action: drop
    #Option C:
    metric_relabel_configs:
    - source_labels: [team]
    regex: frontend
    action: drop
    #Option D:
    relabel_configs:
    - match: [team]
    regex: frontend
    action: drop
    ```
    A11: Option A
    [relabel_configs is where you will define which targets Prometheus should scrape]

    Q12: Where should alerting rules be defined?
    ```yaml
    scrape_configs:
    - job_name: example
    metric_relabel_configs:
    - source_labels: [__name__]
    regex: database_errors_total
    action: replace
    target_label: __name__
    replacement: database_failures_total
    ```
    A12: separate rules file

    Q13: Which query below will give the 99% quantile of the metric `http_requests_total`?
    A13: `histogram_quantile(0.99, http_requests_total_bucket)`

    Q14: What metric should be used to track the uptime of a server?
    A14: counter

    Q15: Which component of the Prometheus architecture should be used to collect metrics of short-lived jobs?
    A15: push gateway

    Q16: What is the purpose of Prometheus `scrape_interval`?
    A16: Defines how frequently to scrape a target

    Q17: What does the following metric_relabel_config do?
    ```yaml
    scrape_configs:
    - job_name: example
    metric_relabel_configs:
    - source_labels: [__name__]
    regex: database_errors_total
    action: replace
    target_label: __name__
    replacement: database_failures_total
    ```
    A17: Renames the metric `database_errors_total` to `database_failures_total`

    Q18: Which component of the Prometheus architecture should be used to automatically discover all nodes in a Kubernetes cluster?
    A18: service discovery

    Q19: For a histogram metric, what are the different submetrics?
    A19: `_count` [total number of observations], `_bucket` [number of observations for a specific bucket], `_sum` [sum of all observations]

    Q20: What is the default web port of Prometheus?
    A20: 9090

    Q21: Add an annotation to the alert called `description` that will print out the message that looks like this `Instance has low disk space on filesystem, current free space is at %`
    ```yaml
    groups:
    - name: node
    rules:
    - alert: node_filesystem_free_percent
    expr: 100 * node_filesystem_free_bytes{job="node"} / node_filesystem_size_bytes{job="node"} < 10
    ## Examples of the two metrics used in the alert can be seen below.
    # node_filesystem_free_bytes{device="/dev/sda3", fstype="ext4", instance="node1", job="web", mountpoint="/home"}
    # node_filesystem_size_bytes{device="/dev/sda3", fstype="ext4", instance="nodde1", job="web", mountpoint="/home"}
    # Choose the correct answer:
    # Option A:
    description: Instance << $Labels.instance >> has low disk space on filesystem << $Labels.mountpoint >>, current free space is at << .Value >>%
    # Option B:
    description: Instance {{ .Labels.instance }} has low disk space on filesystem {{ .Labels.mountpoint }}, current free space is at {{ .Value }}%
    # Option C:
    description: Instance {{ .Labels=instance }} has low disk space on filesystem {{ .Labels=mountpoint }}, current free space is at {{ .Value }}%
    # Option D:
    description: Instance {{ .instance }} has low disk space on filesystem {{ .mountpoint }}, current free space is at {{ .Value }}%
    ```
    A21: Option B

    Q22: What does the double underscore `__` before a label name signify?
    A22: The label is reserved label

    Q23: The metric `http_errors_total` has 3 labels, `path`, `method`, `error`. Which of the following queries will give the total number of errors for a path of `/auth`, method of `POST`, and error code of `401`?
    A23: `http_errors_total{path="/auth", method="POST", code="401"}`

    Q24: What are the different states a Prometheus alert can be in?
    A24: inactive, pending, firing

    Q25: Which of the following components is responsible for collecting metrics from an instance and exposing them in a format Prometheus expects?
    A25: exporters

    Q26: Which of the following is not a valid time value to be used in a range selector?
    A26: 2mo

    Q27: Analyze the example alertmanager configs and determine when an alert with the following labels arrives on alertmanager, what receiver will it send the alert to `team: api` and `severity: critical`?
    ```yaml
    route:
    receiver: general-email
    routes:
    - receiver: frontend-email
    matchers:
    - team: frontend
    routes:
    - matchers:
    severity: critical
    receiver: frontend-pager
    - receiver: backend-email
    matchers:
    - team: backend
    routes:
    - matchers:
    severity: critical
    receiver: backend-pager
    - receiver: auth-email
    matchers:
    - team: auth
    routes:
    - matchers:
    severity: critical
    receiver: auth-pager
    receiver: auth-pager
    ```
    A27: general-email

    Q28: A metric to track requests to an api `http_requests_total` is created. Which of the following would not be a good choice for a label?
    A28: email

    Q29: Which query below will return a range vector?
    A29: `node_boot_time_seconds[5m]`

    Q30: Based off the metrics below, which query will return the same result as the query database_write_timeouts / ignoring(error) database_error_total
    ```
    database_write_timeouts{instance="db1", job="db", error="212", type="mysql"} 12
    database_error_total{instance="db1", job="db", type="mysql"} 67
    ```
    A30: `database_write_timeouts / on(instance, job, type) database_error_total`

    Q31: What is the purpose of the for attribute in a Prometheus alert rule?
    A31: Determines how long a rule must be true before firing an alert

    Q32: Which query will give sum of all filesystems on the machine? The metric `node_filesystem_size_bytes` will list out all of the filesystems and their total size.
    ```
    node_filesystem_size_bytes{device="/dev/sda2", fstype="vfat", instance="192.168.1.168:9100", mountpoint="/boot/efi"} 536834048
    node_filesystem_size_bytes{device="/dev/sda3", fstype="ext4", instance="192.168.1.168:9100", mountpoint="/"} 13268975616
    node_filesystem_size_bytes{device="tmpfs", fstype="tmpfs", instance="192.168.1.168:9100", mountpoint="/run"} 727924736
    node_filesystem_size_bytes{device="tmpfs", fstype="tmpfs", instance="192.168.1.168:9100", mountpoint="/run/lock"} 5242880
    node_filesystem_size_bytes{device="tmpfs", fstype="tmpfs", instance="192.168.1.168:9100", mountpoint="/run/snapd/ns"} 727924736
    node_filesystem_size_bytes{device="tmpfs", fstype="tmpfs", instance="192.168.1.168:9100", mountpoint="/run/user/1000"} 727920640
    ```
    A32: `sum(node_filesystem_size_bytes{instance="192.168.1.168:9100"})`

    Q33: What are the 3 components of the prometheus server?
    A33: retrieval node, tsdb, http server

    Q34: What selector will match on time series whose `mountpoint` label doesn’t start with /run?
    ```
    node_filesystem_avail_bytes{device="/dev/sda2", fstype="vfat", instance="node1", mountpoint="/boot/efi"}​
    node_filesystem_avail_bytes{device="/dev/sda2", fstype="vfat", instance="node2", mountpoint="/boot/efi"}​
    node_filesystem_avail_bytes{device="/dev/sda3", fstype="ext4", instance="node1", mountpoint="/"}​
    node_filesystem_avail_bytes{device="/dev/sda3", fstype="ext4", instance="node2", mountpoint="/"}​
    node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node1", mountpoint="/run"}​
    node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node1", mountpoint="/run/lock"}​
    node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node1", mountpoint="/run/snapd/ns"}​
    node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node1", mountpoint="/run/user/1000"}​
    node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node2", mountpoint="/run"}​
    node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node2", mountpoint="/run/lock"}​
    node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node2", mountpoint="/run/snapd/ns"}​
    node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node2", mountpoint="/run/user/1000"}
    ```
    A34: `node_filesystem_avail_bytes{mountpoint!~"/run.*"}`

    Q35: Which statement is true about the rate/irate functions?
    A35: `rate()` calculates average rate over entire interval, `irate()` calculates the rate only between the last two datapoints in an interval

    Q36: What is the default path Prometheus will scrape to collect metrics?
    A36: `/metrics`

    Q37: The following PromQL expression is trying to divide `node_filesystem_avail_bytes` by `node_filesystem_size_bytes`: `node_filesystem_avail_bytes / node_filesystem_size_bytes`. The PromQL expression does not return any results; fix the expression so that it successfully divides the two metrics. This is what the two metrics look like before the division operation:
    ```
    node_filesystem_avail_bytes{device="/dev/sda2", fstype="vfat", class="SSD", instance="192.168.1.168:9100", job="test", mountpoint="/boot/efi"}
    node_filesystem_size_bytes{device="/dev/sda2", fstype="vfat", instance="192.168.1.168:9100", job="test", mountpoint="/boot/efi"}
    ```
    A37: `node_filesystem_avail_bytes / ignoring(class) node_filesystem_size_bytes`

    Q38: What are the 3 components of observability?
    A38: logging, metrics, traces

    Q39: Which of the following statements are true regarding Alert `labels` and `annotations`?
    ```yaml
    route:
    receiver: staff
    group_by: ['severity']
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 12h
    routes:
    - matchers:
    job: kubernetes
    receiver: infra
    group_by: ['severity']
    ```
    A39: Alert labels can be used as metadata so alertmanager can match on them and perform routing policies, whereas annotations should be used for cosmetic descriptions of the alerts

    Q40: The metric http_errors_total{code=”404”} tracks the number of 404 errors a web server has seen. Which query returns what is the average rate of 404s a server has seen for the past 2 hours? Use a 2m sample range and a query interval of 1m:
    A40: `avg_over_time(rate(http_errors_total{code="404"}[2m]) [2h:1m])`
    [since we need the average for the past 2 hours, the first value in the subquery will be 2h and the second number is the query interval]

    Q41: Which query will return all time series for the metric `node_network_transmit_drop_total` this is greater than 20 and less than 100?
    A41: `node_network_transmit_drop_total > 20 and node_network_transmit_drop_total < 100`

    Q42: What does the following `metric_relabel_config` do?
    ```yaml
    scrape_configs:
    - job_name: example
    metric_relabel_configs:
    - source_labels: [datacenter]
    regex: (.*)
    action: replace
    target_label: location
    replacement: dc-$1
    ```
    A42: changes the datacenter label to location and prepends the value with dc-

    Q43: What type of data should Prometheus monitor?
    A43: numeric

    Q44: Which type of observability would be used to track a request/transaction as it traverses a system?
    A44: traces

    Q45: Add an annotation to the alert called description that will print out the message that looks like this Instance has low disk space on filesystem , current free space is at %
    ```yaml
    groups:
    - name: node
    rules:
    - alert: node_filesystem_free_percent
    expr: 100 * node_filesystem_free_bytes{job="node"} / node_filesystem_size_bytes{job="node"} < 10
    # Examples of the two metrics used in the alert can be seen below
    # node_filesystem_free_bytes{device="/dev/sda3", fstype="ext4", instance="node1", job="web", mountpoint="/home"}
    # node_filesystem_size_bytes{device="/dev/sda3", fstype="ext4", instance="nodde1", job="web", mountpoint="/home"}
    # Choose the correct option:
    #Option A:
    description: Instance << $Labels.instance >> has low disk space on filesystem << $Labels.mountpoint >>, current free space is at << .Value >>%
    #Option B:
    description: Instance {{ .Labels.instance }} has low disk space on filesystem {{ .Labels.mountpoint }}, current free space is at {{ .Value }}%
    #Option C:
    description: Instance {{ .Labels=instance }} has low disk space on filesystem {{ .Labels=mountpoint }}, current free space is at {{ .Value }}%
    #Option D:
    description: Instance {{ .instance }} has low disk space on filesystem {{ .mountpoint }}, current free space is at {{ .Value }}%
    ```
    A45: Option B

    Q46: Regarding histogram and summary metrics, which of the following are true?
    A46: histogram is calculated server side and summary is calculated client side
    [for histograms, quantiles must be calculated server side thus they are less taxing on client libraries, whereas summary metrics are the opposite]

    Q47: What is this an example of? 'Service provider guaranteed 99.999% uptime each month or else customer will be awarded $10k'
    A47: SLA

    Q48: Which of the following is Prometheus’ built in dashboarding/visualization feature?
    A48: Console templates

    Q49: Which query below will give the active bytes on instance 10.1.1.1:9100 45m ago?
    A49: `node_memory_Active_bytes{instance="10.1.1.1:9100"} offset 45m`

    Q50: What type of metric should be used for measuring internal temperature of a server?
    A50: gauge

    Q51: What is the name of the cli utility that comes with Prometheus?
    A51: promtool

    Q52: How can alertmanager prevent certain alerts from generating notification for a temporary period of time?
    A52: Configuring a silence

    Q53: In the scrape configs for a pushgateway, what is the purpose of the `honor_labels: true`
    ```yaml
    scrape_configs:
    - job_name: pushgateway
    honor_labels: true
    static_configs:
    - targets: ["192.168.1.168:9091"]
    ```
    A53: Allows metrics to specify the instance and job labels instead of pulling it from scrape_configs

    Q54: Analyze the example alertmanager configs and determine when an alert with the following labels arrives on alertmanager, what receiver will it send the alert to team: backend and severity: critical
    ```yaml
    route:
    receiver: general-email
    routes:
    - receiver: frontend-email
    matchers:
    - team: frontend
    routes:
    - matchers:
    severity: critical
    receiver: frontend-pager
    - receiver: backend-email
    matchers:
    - team: backend
    routes:
    - matchers:
    severity: critical
    receiver: backend-pager
    - receiver: auth-email
    matchers:
    - team: auth
    routes:
    - matchers:
    severity: critical
    receiver: auth-pager
    receiver: auth-pager
    ```
    A54: backend-pager

    Q55: Which of the following would make for a poor SLI?
    A55: high disk utilization
    [things like CPU, memory, disk utilization are poor as user may not experience any degradation of service during these events]

    Q56: Which of the following is not a valid way to reload Prometheus configuration?
    A56: promtool config reload

    Q57: Which of the following is not something that is tracked in a span within a trace?
    A57: complexity

    Q58: You are writing your own exporter for a Redis database. Which of the following would be the correct name for a metric to represent used memory on the by the Redis instance?
    A58: `redis_mem_used_bytes`
    [the first should be the app, second metric name, third the unit]

    Q59: Which cli command can be used to verify/validate prometheus configurations?
    A59: `promtool check config`

    Q60: Which query will return targets who have more than 50 arp entries?
    A60: `node_arp_entries{job="node"} > 50`
    ## Mock Exam

    ### 2

    Q1: What data type do Prometheus metric values use?
    A1: 64bit floats

    Q2: The metric `node_fan_speed_rpm` tracks the current fan speeds. The `location` label specifies where on the server the fan is located. Which query will return the fan speeds for all fans except the `rear` fan
    A2: `node_fan_speed_rpm{location!="rear"}`

    Q3: With the following alertmanager configs, after a notification has been sent out, a new alert comes in. How long will alertmanager wait before firing a new notification?
    ```yaml
    route:
    receiver: staff
    group_by: ['severity']
    group_wait: 60s
    group_interval: 15m
    repeat_interval: 12h
    routes:
    - matches:
    job: kubernetes
    receiver: infra
    group_by: ['severity']
    ```
    A3: 15m
    [the group_interval property determines how long alertmanager will wait after sending a notification before it sends a new notification for a group]
    Q4: What is the purpose of Prometheus `scrape_interval`?
    A4: defines how frequently to scrape a target

    Q5: The metric http_requests tracks the total number of requests across each endpoint and method. What query will return the total number of requests for each path
    ```
    http_requests{method="get", path="/auth"} 3​
    http_requests{method="post", path="/auth"} 1​
    http_requests{method="get", path="/user"} 4​
    http_requests{method="post", path="/user"} 8​
    http_requests{method="post", path="/upload"} 2​
    http_requests{method="get", path="/tasks"} 4​
    http_requests{method="put", path="/tasks"} 6​
    http_requests{method="post", path="/tasks"} 1​
    http_requests{method="get", path="/admin"} 3​
    http_requests{method="post", path="/admin"} 9
    ```
    A5: `sum by(path) (http_requests)`

    Q6: An application is advertising metrics at the path `/monitoring/stats`. What property in the scrape configs needs to be modified?
    A6: `metrics_path: "/monitoring/stats"`

    Q7: Analyze the alertmanager configs below : Based off the alert below, which receiver will send the notification for the alert alert labels: `team: frontend`
    ```yaml
    route:
    group_wait: 20s
    receiver: general
    group_by: ['alertname']
    routes:
    - match:
    org: kodekloud
    receiver: kodekloud-pager
    - match:
    org: apple
    receiver: apple
    ```
    A7: general

    Q8: What type of database does Prometheus use?
    A8: Time-series database

    Q9: Which of the following is Prometheus’ built in dashboarding/visualization feature?
    A9: Console templates

    Q10: What command should be used to verify that a Prometheus config is valid?
    A10: `promtool check config prometheus.yml`

    Q11: What type of data should prometheus monitor?
    A11: numeric

    Q12: What is the default port that Prometheus listens on?
    A12: 9090

    Q13: A car reports the number of miles it has been driven with the metric `car_total_miles` Which query returns what is the average rate of miles the car has driven the past 2 hours. Use a 4m sample range and a query interval of 1m.
    A13: `avg_over_time(rate(car_total_miles[4m]) [2h:1m])`

    Q14: Groups and rules within a group are run sequentially
    A14: Alert labels can be used as metadata so alertmanager can match on them and perform routing policies, annotations should be used for cosmetic descriptions of the alerts

    Q15: What method does Prometheus use to collect metrics from targets?
    A15: pull

    Q16: Which of the following is not a form of observability?
    A16: streams

    Q17: How is application instrumentation achieved?
    A17: Client libraries

    Q18: Which query below will give the 95% quantile of the metric `http_file_upload_bytes`?
    A18: `histogram_quantile(0.95, http_file_upload_bytes_bucket)`

    Q19: What is this an example of 99% availability with a median latency less than 300ms?
    A19: SLO

    Q20: What is the default path Prometheus will scrape to collect metrics?
    A20: `/metrics`

    Q21: Where are alert rules defined?
    A21: In a separate rules file on the Prometheus server

    Q22: `kafka_topic_partition_replicas` metric tracks the number of partitions for a topic/partition. Which query will get the number of partitions for the past 2 hours. Result should return a range vector.
    A22: `kafka_topic_partition_replicas[2h]`

    Q23: The metric `http_errors_total` has 3 labels: `path`, `method`, `error`. Which of the following queries will give the total number of errors for a path of `/auth`, method of `POST`, and error code of `401`?
    A23: `http_errors_total{path="/auth", method="POST", code="401"}`

    Q24: What update needs to occur to add an annotation called `description` that prints out the message `redis server <insert instance name> is down!`
    A24: `description: "redis server {{.Labels.instance}} is down!"`

    Q25: Which statement is true regarding Prometheus rules?
    A25: Groups are run in parallel, and rules within a group are run sequentially

    Q26: What does the following config do?
    ```yaml
    scrape_configs:
    - job_name: "demo"
    metric_relabel_configs:
    - regex: fstype
    action: labeldrop
    ```
    A26: The label `fstype` will be dropped for all metrics

    Q27: The metric `node_filesystem_avail_bytes` reports the available bytes for each filesystem on a node. Which query will return all filesystems that has either less than 1000 available bytes or greater than 50000 bytes
    A27: `node_filesystem_avail_bytes < 1000 or node_filesystem_avail_bytes > 50000`

    Q28: For `metric_relabel_configs` and `relabel_configs`, when matching on multiple source labels, what is the default delimiter
    A28: `;`

    Q29: Which of the following is not a valid method for reloading alertmanager configuration?
    A29: hit the reload config button in alertmanager web ui

    Q30: Which of the following components is responsible for receiving metrics from short lived jobs?
    A30: pushgateway

    Q31: For a histogram metric, what are the different submetrics?
    A31: `_count`, `_bucket`, `_sum`

    Q32: Which query will return whether or not a target is currently able to be scraped?
    A32: `up`

    Q33: What does the double underscore `__` before a label name signify?
    A33: The label is a reserved label

    Q34: Which configuration in alertmanager will wait 2 minutes before firing off an alert to prevent unnecessary notifications getting sent?
    A34: `group_wait: 2m`
    [when an alert arrives on alertmanager, it will wait for the amount of time specified in group_wait to wait for other alerts to arrive before firing off a notification]

    Q35: Which of the following is not a component of the Prometheus solution?
    A35: influxdb

    Q36: Which component of the Prometheus architecture should be used to automatically discover all nodes in a Kubernetes cluster?
    A36: service discovery

    Q37: The metric `mealplanner_consumed_calories` tracks the number of calories that have been consumed by the user. What query will return the amount of calories that had been consumed 4 days ago?
    A37: `mealplanner_consumed_calories offset 4d`

    Q38: Which of the following would make for a good SLI?
    A38: request failures
    [For good SLIs metrics, use metrics that impact the user's experience. Disk utilization, memory utilization, fan speed, and server temperature are not things that impact the user. Request failures will impact a user’s experience for sure]

    Q39: What does the following config do?
    ```yaml
    scrape_configs:
      - job_name: "demo"
        metric_relabel_configs:
          - source_labels: [__name__]
            regex: docker_container_crash_total
            action: replace
            target_label: __name__
            replacement: docker_container_restart_total
    ```
    A39: Renames the metric `docker_container_crash_total` to `docker_container_restart_total`

    Q40: What type of metric should be used to track the number of miles a car has driven?
    A40: counter

    Q41: What type of metric should be used for measuring a user's heart rate?
    A41: gauge

    Q42: What is the purpose of `repeat_interval` in alertmanager?
    A42: How long to wait before sending a notification again if it has already been sent successfully for an alert

    Q43: Which of the following components is responsible for collecting metrics from an instance and exposing them in a format Prometheus expects?
    A43: exporters

    Q44: What are the two attributes that metrics can have?
    A44: TYPE, HELP
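
    As a reminder, this is how those two attributes appear in the text exposition format served on `/metrics` (the metric itself is illustrative):
    ```
    # HELP http_requests_total Total number of HTTP requests handled.
    # TYPE http_requests_total counter
    http_requests_total{method="get", path="/auth"} 3
    ```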

    Q45: What query will return all the instances whose active memory bytes is less than 10000?
    A45: `node_memory_Active_bytes < 10000`

    Q46: How many labels does the following time series have: `node_fan_speed{instance="node8", job="server", fan="2"}`?
    A46: 3

    Q47: In the Prometheus configuration, what is the purpose of the `scheme` field?
    A47: Determines if Prometheus will use HTTP or HTTPS

    Q48: The metric `health_consumed_calories` tracks how many calories a user has eaten and `health_burned_calories` tracks the number of calories burned while exercising. To calculate net calories for the day, subtract `health_burned_calories` from `health_consumed_calories`. Based on the time series below, which expression successfully calculates net calories?
    ```
    health_consumed_calories{job="health", meal="dinner"} 800
    health_burned_calories{job="health", activity="cardio"} 200
    ```
    A48: `health_consumed_calories - ignoring(meal, activity) health_burned_calories`

    Q49: What does the following config do?
    ```yaml
    scrape_configs:
      - job_name: example
        relabel_configs:
          - source_labels: [env, team]
            regex: dev;marketing
            action: drop
    ```
    A49: Drops all targets whose `env` label is set to `dev` and `team` label is set to `marketing`
    [the `env` and `team` values are joined with the default `;` separator before being matched against the regex `dev;marketing`]

    Q50: What is the name of the Prometheus query language?
    A50: PromQL

    Q51: You are writing an exporter for RabbitMQ and are creating a metric to track the size of the message queue. Which of the following would be an appropriate name for the metric?
    A51: `rabbitmq_message_bytes`

    Q52: What are the different states a Prometheus alert can be in?
    A52: inactive, pending, firing
    [inactive = the alert expression is not currently true; pending = the expression is true but the `for` duration has not yet elapsed; firing = the expression has been true for at least the `for` duration]

    Q53: Which statement is true about the rate/irate functions?
    A53: `rate()` calculates average rate over entire interval, `irate()` calculates the rate only between the last two datapoints in an interval

    Q54: What does the following config do?
    ```yaml
    scrape_configs:
      - job_name: "example"
        metric_relabel_configs:
          - source_labels: [team]
            regex: (.*)
            action: replace
            target_label: organization
            replacement: org-$1
    ```
    A54: renames the `team` label to `organization` and the value of the label will get prepended with `org-`
    [strictly, the `replace` action sets a new `organization` label; the original `team` label stays in place unless it is dropped separately]

    Q55: Analyze the alertmanager configs below. Based off the following alert, which receiver will receive the notification? alertname: `node_filesystem_full`, labels: `team: frontend`, `notification: pager`
    ```yaml
    route:
      receiver: general-email
      group_by: [alertname]
      routes:
        - receiver: frontend-email
          matchers:
            - team: frontend
          routes:
            - matchers:
                - notification: pager
              receiver: frontend-pager
        - receiver: backend-email
          matchers:
            - team: backend
        - receiver: auth-email
          matchers:
            - team: auth
    ```
    A55: frontend-pager
    [the alert first matches the `team: frontend` route, then its child route matching `notification: pager`, so the `frontend-pager` receiver is used]

    Q56: A database backup service has an SLO that states that 97% of all backup jobs will be completed within 60s. A histogram metric is configured to track the backup process time, which of the following bucket configurations is recommended for the desired SLO?
    A56: 35, 45, 55, 60, 65, 75, 100
    [Since histogram quantiles are approximations, to find out if a SLO has been met, make sure that a bucket is specified at the desired SLO value of 60s. The exact number (60s) must be present in the list.]
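
    With a bucket boundary at exactly 60s, the SLO can then be checked directly; a sketch, assuming the histogram is called `backup_duration_seconds`:
    ```
    sum(rate(backup_duration_seconds_bucket{le="60"}[5m])) / sum(rate(backup_duration_seconds_count[5m])) > 0.97
    ```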

    Q57: Which of the following is not a valid time value to be used in a range selector?
    A57: 3hr

    Q58: What type of data does Prometheus collect?
    A58: numeric

    Q59: The `node_cpu_seconds_total` metric tracks the number of seconds a CPU has spent in a specific mode, broken down per CPU using the `cpu` label. Which query will return the total time all CPUs on an instance spent in a mode that is not `idle`? Make sure to group the result on a per-instance basis.
    ```
    node_cpu_seconds_total{cpu="0", instance="192.168.1.168:9100", job="test", mode="idle"}
    node_cpu_seconds_total{cpu="0", instance="192.168.1.168:9100", job="test", mode="iowait"}
    node_cpu_seconds_total{cpu="0", instance="192.168.1.168:9100", job="test", mode="irq"}
    node_cpu_seconds_total{cpu="0", instance="192.168.1.168:9100", job="test", mode="nice"}
    node_cpu_seconds_total{cpu="0", instance="192.168.1.168:9100", job="test", mode="softirq"}
    node_cpu_seconds_total{cpu="0", instance="192.168.1.168:9100", job="test", mode="steal"}
    node_cpu_seconds_total{cpu="0", instance="192.168.1.168:9100", job="test", mode="system"}
    node_cpu_seconds_total{cpu="1", instance="192.168.1.168:9100", job="test", mode="idle"}
    node_cpu_seconds_total{cpu="1", instance="192.168.1.168:9100", job="test", mode="iowait"}
    node_cpu_seconds_total{cpu="1", instance="192.168.1.168:9100", job="test", mode="irq"}
    node_cpu_seconds_total{cpu="1", instance="192.168.1.168:9100", job="test", mode="nice"}
    node_cpu_seconds_total{cpu="1", instance="192.168.1.168:9100", job="test", mode="softirq"}
    node_cpu_seconds_total{cpu="1", instance="192.168.1.168:9100", job="test", mode="steal"}
    node_cpu_seconds_total{cpu="1", instance="192.168.1.168:9100", job="test", mode="system"}
    ```
    A59: `sum by(instance) (node_cpu_seconds_total{mode!="idle"})`

    Q60: The following time series return values with many decimal places. What query will return the values rounded down to the closest integer? `node_cpu_seconds_total{cpu="0", mode="idle"} 115.12`, `{cpu="0", mode="irq"} 87.4482`, `{cpu="0", mode="steal"} 44.245`
    A60: `floor(node_cpu_seconds_total)`

    ---
    _Author: [@luckylittle](https://github.com/luckylittle)_

    _Last update: Wed Jan 25 02:58:41 UTC 2023_
  3. @luckylittle luckylittle revised this gist Jan 24, 2023. 1 changed file with 307 additions and 12 deletions.
    319 changes: 307 additions & 12 deletions Prometheus Certified Associate (PCA).md
    Original file line number Diff line number Diff line change
    @@ -1301,24 +1301,23 @@ route:

    The following alerts get generated by Prometheus with the defined labels.
    alert1
    team: frontend
    env: dev
    team: frontend
    env: dev

    alert2
    team: frontend
    env: dev
    alert2team: frontend
    env: dev

    alert3
    team: frontend
    env: prod
    team: frontend
    env: prod

    alert4
    team: frontend
    env: prod
    team: frontend
    env: prod

    alert5
    team: frontend
    env: staging
    team: frontend
    env: staging
    ```
    A6: 3
    @@ -1703,5 +1702,301 @@ A60: `node_arp_entries{job="node"} > 50`

    ### 2

    Q1: What data type do Prometheus metric values use?
    A1: 64bit floats

    Q2: The metric `node_fan_speed_rpm` tracks the current fan speeds. The `location` label specifies where on the server the fan is located. Which query will return the fan speeds for all fans except the `rear` fan
    A2: `node_fan_speed_rpm{location!="rear"}`

    Q3: With the following alertmanager configs, after a notification has been sent out, a new alert comes in. How long will alertmanager wait before firing a new notification?
    ```yaml
    route:
    receiver: staff
    group_by: ['severity']
    group_wait: 60s
    group_interval: 15m
    repeat_interval: 12h
    routes:
    - matches:
    job: kubernetes
    receiver: infra
    group_by: ['severity']
    ```
    A3: 15m
    [the group_interval property determines how long alertmanager will wait after sending a notification ebfore it sends a new notification for a group]

    Q4: What is the purpose of Prometheus `scrape_interval`?
    A4: defines how frequently to scrape a target

    Q5: The metric http_requests tracks the total number of requests across each endpoint and method. What query will return the total number of requests for each path
    ```
    http_requests{method="get", path="/auth"} 3​
    http_requests{method="post", path="/auth"} 1​
    http_requests{method="get", path="/user"} 4​
    http_requests{method="post", path="/user"} 8​
    http_requests{method="post", path="/upload"} 2​
    http_requests{method="get", path="/tasks"} 4​
    http_requests{method="put", path="/tasks"} 6​
    http_requests{method="post", path="/tasks"} 1​
    http_requests{method="get", path="/admin"} 3​
    http_requests{method="post", path="/admin"} 9
    ```
    A5: `sum by(path) (http_requests)`

    Q6: An application is advertising metrics at the path `/monitoring/stats`. What property in the scrape configs needs to be modified?
    A6: `metrics_path: "/monitoring/stats"`

    Q7: Analyze the alertmanager configs below : Based off the alert below, which receiver will send the notification for the alert alert labels: `team: frontend`
    ```yaml
    route:
    group_wait: 20s
    receiver: general
    group_by: ['alertname']
    routes:
    - match:
    org: kodekloud
    receiver: kodekloud-pager
    - match:
    org: apple
    receiver: apple
    ```
    A7: general

    Q8: What type of database does Prometheus use?
    A8: Time-series database

    Q9: Which of the following is Prometheus’ built in dashboarding/visualization feature?
    A9: Console templates

    Q10: What command should be used to verify that a Prometheus config is valid?
    A10: `promtool check config prometheus.yml`

    Q11: What type of data should prometheus monitor?
    A11: numeric

    Q12: What is the default port that Prometheus listens on?
    A12: 9090

    Q13: A car reports the number of miles it has been driven with the metric `car_total_miles` Which query returns what is the average rate of miles the car has driven the past 2 hours. Use a 4m sample range and a query interval of 1m.
    A13: `avg_over_time(rate(car_total_miles[4m]) [2h:1m])`

    Q14: Groups and rules within a group are run sequentially
    A14: Alert labels can be used as metadata so alertmanager can match on them and perform routing policies; annotations should be used for cosmetic descriptions of the alerts

    Q15: What method does Prometheus use to collect metrics from targets?
    A15: pull

    Q16: Which of the following is not a form of observability?
    A16: streams

    Q17: How is application instrumentation achieved?
    A17: Client libraries

    Q18: Which query below will give the 95% quantile of the metric `http_file_upload_bytes`?
    A18: `histogram_quantile(0.95, http_file_upload_bytes_bucket)`

    Q19: What is this an example of 99% availability with a median latency less than 300ms?
    A19: SLO

    Q20: What is the default path Prometheus will scrape to collect metrics?
    A20: `/metrics`

    Q21: Where are alert rules defined?
    A21: In a separate rules file on the Prometheus server

    Q22: `kafka_topic_partition_replicas` metric tracks the number of partitions for a topic/partition. Which query will get the number of partitions for the past 2 hours. Result should return a range vector.
    A22: `kafka_topic_partition_replicas[2h]`

    Q23: The metric `http_errors_total` has 3 labels: `path`, `method`, `error`. Which of the following queries will give the total number of errors for a path of `/auth`, method of `POST`, and error code of `401`?
    A23: `http_errors_total{path="/auth", method="POST", code="401"}`

    Q24: What update needs to occur to add an annotation called `description` that prints out the message `redis server <insert instance name> is down!`
    A24: `description: "redis server {{.Labels.instance}} is down!"`

    Q25: Which statement is true regarding Prometheus rules?
    A25: Groups are run in parallel, and rules within a group are run sequentially

    Q26: What does the following config do?
    ```yaml
    scrape_configs:
    - job_name: "demo"
    metric_relabel_configs:
    - regex: fstype
    action: labeldrop
    ```
    A26: The label `fstype` will be dropped for all metrics

    Q27: The metric `node_filesystem_avail_bytes` reports the available bytes for each filesystem on a node. Which query will return all filesystems that has either less than 1000 available bytes or greater than 50000 bytes
    A27: `node_filesystem_avail_bytes < 1000 or node_filesystem_avail_bytes > 50000`

    Q28: For `metric_relabel_configs` and `relabel_configs`, when matching on multiple source labels, what is the default delimiter
    A28: `;`

    Q29: Which of the following is not a valid method for reloading alertmanager configuration?
    A29: hit the reload config button in alertmanager web ui

    Q30: Which of the following components is responsible for receiving metrics from short lived jobs?
    A30: pushgateway

    Q31: For a histogram metric, what are the different submetrics?
    A31: `__count`, `__bucket`, `__sum`

    Q32: Which query will return whether or not a target is currently able to be scraped?
    A32: `up`

    Q33: What does the double underscore `__` before a label name signify?
    A33: The label is a reserved label

    Q34: Which configuration in alertmanager will wait 2 minutes before firing off an alert to prevent unnecessary notifications getting sent?
    A34: `group_wait: 2m`
    [when an alert arrives on alertmanager, it will wait for the amount of time specified in group_wait for other alerts to arrive before firing off a notification]
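    As a reminder of where this sits (a minimal sketch, receiver name made up):
    ```yaml
    route:
      receiver: staff
      group_by: ['alertname']
      group_wait: 2m # wait up to 2 minutes for related alerts to arrive before sending the first notification
    ```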

    Q35: Which of the following is not a component of the Prometheus solution?
    A35: influxdb

    Q36: Which component of the Prometheus architecture should be used to automatically discover all nodes in a Kubernetes cluster?
    A36: service discovery

    Q37: The metric `mealplanner_consumed_calories` tracks the number of calories that have been consumed by the user. What query will return the amount of calories that had been consumed 4 days ago?
    A37: `mealplanner_consumed_calories offset 4d`

    Q38: Which of the following would make for a good SLI?
    A38: request failures
    [For good SLIs metrics, use metrics that impact the user's experience. Disk utilization, memory utilization, fan speed, and server temperature are not things that impact the user. Request failures will impact a user’s experience for sure]

    Q39: What does the following config do?
    ```yaml
    scrape_configs:
    - job_name: "demo"
    metric_relabel_configs:
    - source_labels: [__name__]
    regex: docker_container_crash_total
    action: replace
    target_label: __name__
    replacement: docker_container_restart_total
    ```
    A39: Renames the metric `docker_container_crash_total` to `docker_container_restart_total`

    Q40: What type of metric should be used to track the number of miles a car has driven?
    A40: counter

    Q41: What type of metric should be used for measuring a users heart rate?
    A41: gauge

    Q42: What is the purpose of `repeat_interval` in alertmanager?
    A42: How long to wait before sending a notification again if it has already been sent successfully for an alert

    Q43: Which of the following components is responsible for collecting metrics from an instance and exposing them in a format Prometheus expects?
    A43: exporters

    Q44: What are the two attributes that metrics can have?
    A44: TYPE, HELP

    Q45: What query will return all the instances whose active memory bytes is less than 10000?
    A45: `node_memory_Active_bytes < 10000`

    Q46: How many labels does the following time series have `node_fan_speed{instance=“node8”, job=“server”, fan=“2”}`?
    A46: 3

    Q47: In the prometheus configuration, what is the purpose of the `scheme` field?
    A47: Determines if Prometheus will use HTTP or HTTPS

    Q48: The metric `health_consumed_calories` tracks how many calories a user has eaten and `health_burned_calories` tracks the number of calories burned while exercising. To calculate net calories for the day subtract health_burned_calories from health_consumed_calories. Based on the time series below, which expression successfully calculates net calories.
    ```
    health_consumed_calories{job=“health”, meal=“dinner”} 800
    health_burned_calories{job=“health”, activity=“cardio”} 200
    ```
    A48: `health_consumed_calories - ignoring(meal, activity) health_burned_calories`

    Q49: What does the following config do?
    ```yaml
    scrape_configs:
    - job_name: example
    relabel_configs:
    - source_labels: [env, team]
    regex: dev;marketing
    action: drop
    ```
    A49: Drops all targets whose `env` label is set to `dev` and `team` label is set to `marketing`

    Q50: What is the name of the Prometheus query language?
    A50: PromQL

    Q51: You are writing an exporter for RabbitMQ and are creating a metric to track the size of the message queue. Which of the following would be an appropriate name for the metric.
    A51: `rabbitmq_message_bytes`

    Q52: What are the different states a Prometheus alert can be in?
    A52: inactive, pending, firing

    Q53: Which statement is true about the rate/irate functions?
    A53: `rate()` calculates average rate over entire interval, `irate()` calculates the rate only between the last two datapoints in an interval

    Q54: What does the following config do?
    ```yaml
    scrape_configs:
    - job_name: "example"
    metric_relabel_configs:
    - source_labels: [team]
    regex: (.*)
    action: replace
    target_label: organization
    replacement: org-$1
    ```
    A54: renames the `team` label to `organization` and the value of the label will get prepended with `org-`

    Q55: Analyze the alertmanager configs below. Based off the following alert, which receiver will receive the notification? alertname: `node_filesystem_full`, labels: `team: frontend`, `notification: pager`
    ```yaml
    route:
      receiver: general-email
      group_by: [alertname]
      routes:
      - receiver: frontend-email
        matchers:
        - team: frontend
        routes:
        - matchers:
            notification: pager
          receiver: frontend-pager
      - receiver: backend-email
        matchers:
        - team: backend
      - receiver: auth-email
        matchers:
        - team: auth
    ```
    A55: frontend-pager

    Q56: A database backup service has an SLO that states that 97% of all backup jobs will be completed within 60s. A histogram metric is configured to track the backup process time, which of the following bucket configurations is recommended for the desired SLO?
    A56: 35, 45, 55, 60, 65, 75, 100
    [Since histogram quantiles are approximations, to find out if a SLO has been met, make sure that a bucket is specified at the desired SLO value of 60s. The exact number (60s) must be present in the list.]
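    For illustration only (not part of the exam), defining such buckets with the Python client could look like this; the metric name is an assumption:
    ```python
    from prometheus_client import Histogram

    # 60 is included so the SLO threshold falls exactly on a bucket boundary
    BACKUP_SECONDS = Histogram('backup_duration_seconds', 'Backup job duration',
                               buckets=[35, 45, 55, 60, 65, 75, 100])
    BACKUP_SECONDS.observe(42.0)
    ```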

    Q57: Which of the following is not a valid time value to be used in a range selector?
    A57: 3hr

    Q58: What type of data does Prometheus collect?
    A58: numeric


    ## Monitoring Kubernetes
    - applications & clusters (control plane components, kubelet/cAdvisor, kube-state-metrics, node-exporter)
    - deploy Prometheus as close to targets as possible
    - make use of preexisting Kube infrastructure
    - to get access to cluster level metrics, we need `kube-state-metrics`
    - every host should run node-exporter on every node (DaemonSet)
    - make use of service discovery via Kube API

    ### Installation via Helm chart
    - source: https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
    - makes use of the Prometheus Operator (https://github.com/prometheus-operator/prometheus-operator)
    - a couple of custom resources: `Prometheus`, `PrometheusRule`, `AlertmanagerConfig`, `ServiceMonitor`, `PodMonitor`
    - `helm repo add prometheus-community https://prometheus-community.github.io/helm-charts`
    - `helm repo update`
    - `helm show values prometheus-community/kube-prometheus-stack > values.yaml`
    - `helm install prometheus prometheus-community/kube-prometheus-stack`
    - `kubectl patch ds prometheus-prometheus-node-exporter --type "json" -p '[{"op": "remove", "path" : "/spec/template/spec/containers/0/volumeMounts/2/mountPropagation"}]'` - might need this due to node-exporter bug
    - installs 2 StatefulSets (AM, Prometheus), 3 Deployments (Grafana, kube-prometheus-operator, kube-state-metrics), 1 DaemonSet (node-exporter)
    - SD can discover node, service, pod, endpoint (discovers targets from listed endpoints of a service. For each endpoint address one target is discovered per port. If the endpoint is backed up by a pod, all additional container ports of the pod, not bound to an endpoint port, are discovered as targets as well)

    ### Monitor K8s Application
    - once you have an application deployed and listening on some port (e.g. `3000`), you can change the Prometheus value `additionalScrapeConfigs` in the Helm chart and upgrade via `helm upgrade prometheus prometheus-community/kube-prometheus-stack -f new-values.yaml` (this is the less ideal option; service monitors apply new scrape targets more declaratively) - see the sketch below
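    - a minimal sketch of such a `new-values.yaml` override (job name, path and target address are assumptions; check `helm show values` for the exact structure):
    ```yaml
    prometheus:
      prometheusSpec:
        additionalScrapeConfigs:
        - job_name: api
          metrics_path: /swagger-stats/metrics
          static_configs:
          - targets: ["api-service.default.svc:3000"]
    ```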
    - instead, look at CRDs: `kubectl get crd`, specifically `prometheuses`, `servicemonitors` (a set of targets to monitor and scrape; they let you avoid touching the config directly and give you a declarative Kubernetes syntax to define targets)
    - if you want to scrape e.g. service named `api-service` exposing metrics on `/swagger-stats/metrics`, use:
    ```yaml
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: api-service-monitor
      labels:
        release: prometheus # default label that is used by serviceMonitorSelector - it dynamically discovers it
        app: prometheus
    spec:
      jobLabel: job # look for label job in the Service and take the value
      endpoints:
      - interval: 30s # equivalent of scrape_interval
        port: web # matches up with the port 3000 in the Service definition
        path: /swagger-stats/metrics # equivalent of metrics_path (path where the metrics are exposed)
      selector:
        matchLabels:
          app: service-api
    ```
    - but also look at `kind: Prometheus` (CRD `prometheuses`) and what is under `serviceMonitorSelector` (e.g. `matchLabels`: `release: prometheus`) - this label allows Prometheus to find service monitors in the cluster and register them so that it can start scraping the app the service monitor is pointing to (can confirm via Web UI - Status - Configuration); see the trimmed sketch below
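    - roughly, the relevant part of the `Prometheus` custom resource (a trimmed sketch, field values assumed from the Helm chart defaults):
    ```yaml
    apiVersion: monitoring.coreos.com/v1
    kind: Prometheus
    metadata:
      name: prometheus
    spec:
      serviceMonitorSelector:
        matchLabels:
          release: prometheus # ServiceMonitors carrying this label are discovered automatically
    ```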
    - to add rules, use CRD called `PrometheusRule` - e.g.:
    ```yaml
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      labels:
        release: prometheus # similar to ServiceMonitor, to add the rule dynamically
      name: api-rules
    spec:
      groups:
      - name: api
        rules:
        - alert: down
          expr: up == 0
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: Prometheus target missing {{$labels.instance}}
    ```
    - to add AM rules, use CRD called `AlertmanagerConfig` - e.g.:
    ```yaml
    apiVersion: monitoring.coreos.com/v1alpha1
    kind: AlertmanagerConfig
    metadata:
      name: alert-config
      labels:
        resource: prometheus # once again, must match alertmanagerConfigSelector - BUT Helm chart does not specify a label, so you need to update this value yourself!
    spec:
      route:
        groupBy: ["severity"]
        groupWait: 30s
        groupInterval: 5m
        repeatInterval: 12h
        receiver: "webhook"
      receivers:
      - name: "webhook"
        webhookConfigs:
        - url: "http://example.com/"
    ```
    - keep in mind the differences between a standard AM config and the Kubernetes one:

    |Standard|Kube|
    |--------|----|
    |group_by|groupBy|
    |group_wait|groupWait|
    |group_interval|groupInterval|
    |repeat_interval|repeatInterval|
    |matchers job: kubernetes|matchers name: job, value: kubernetes|

    ## Conclusion
    - https://kodekloud.com/wp-content/uploads/2022/12/Prometheus_Certified_Associate-1.pdf
    - ​You will have 1.5 hours to complete the exam.​
    - The certification is valid for 3 years.​
    - This exam is online, proctored with multiple-choice questions.
    * ​To ensure your system meets the exam requirements, visit this link: https://syscheck.bridge.psiexams.com/
    * Important exams instructions to check before scheduling the exam: https://docs.linuxfoundation.org/tc-docs/certification/important-instructions-pca

    ## Mock Exam 1 & 2

    ### 1
    Q1: The metric `node_cpu_temp_celsius` reports the current temperature of a node's CPU in celsius. What query will return the average temperature across all CPUs on a per node basis? The query should return {instance="node1"} 23.5 //average temp across all CPUs on node1 {instance="node2"} 33.5 //average temp across all CPUs on node2.
    ```
    node_cpu_temp_celsius{instance="node1", cpu="0"} 28
    node_cpu_temp_celsius{instance="node1", cpu="1"} 19
    node_cpu_temp_celsius{instance="node2", cpu="0"} 36
    node_cpu_temp_celsius{instance="node2", cpu="1"} 31
    ```
    A1: `avg by(instance) (node_cpu_temp_celsius)`

    Q2: What method does Prometheus use to collect metrics from targets?
    A2: pull

    Q3: An engineer forgot to address an alert, based off the alertmanager config below, how long will they need to wait to see the alert again?
    ```yaml
    route:
    receiver: pager
    group_by: [alertname]
    group_wait: 10s
    repeat_interval: 4h
    group_interval: 5m
    routes:
    - match:
    team: api
    receiver: api-pager
    - match:
    team: frontend
    receiver: frontend-pager
    ```
    A3: 4h

    Q4: Which query below will get all time series for metric `node_disk_read_bytes_total` for job=web, and job=node?
    A4: `node_disk_read_bytes_total{job=~"web|node"}`

    Q5: What type of database does Prometheus use?
    A5: Time Series

    Q6: Analyze the alertmanager configs below. For all the alerts that got generated, how many total notifications will be sent out?
    ```yaml
    route:
    receiver: general-email
    group_by: [alertname]
    routes:
    - receiver: frontend-email
    group_by: [env]
    matchers:
    - team: frontend

    The following alerts get generated by Prometheus with the defined labels.
    alert1
    team: frontend
    env: dev

    alert2
    team: frontend
    env: dev

    alert3
    team: frontend
    env: prod

    alert4
    team: frontend
    env: prod

    alert5
    team: frontend
    env: staging
    ```
    A6: 3

    Q7: What is the Prometheus client library used for?
    A7: Instrumenting applications to generate prometheus metrics and to push metrics to the Push Gateway

    Q8: Management has decided to offer a file upload service where the SLO states that 97% of all upload should complete within 30s. A histogram metric is configured to track the upload time, which of the following bucket configurations is recommended for the desired SLO?
    A8: 10, 25, 27, 30, 32, 35, 49, 50
    [since histogram quantiles are approximations, to find out if a SLO has been met make sure that a bucket is specified at the desired SLO value]

    Q9: Which of the following is not a valid method for reloading alertmanager configuration?
    A9: hit the reload config button in alertmanager web ui

    Q10: What two labels are assigned to every metric by default?
    A10: instance, job

    Q11: What configuration will make it so Prometheus doesn’t scrape targets with a label of `team: frontend`?
    ```yaml
    #Option A:
    relabel_configs:
    - source_labels: [team]
    regex: frontend
    action: drop

    #Option B:
    relabel_configs:
    - source_labels: [frontend]
    regex: team
    action: drop

    #Option C:
    metric_relabel_configs:
    - source_labels: [team]
    regex: frontend
    action: drop

    #Option D:
    relabel_configs:
    - match: [team]
    regex: frontend
    action: drop
    ```
    A11: Option A
    [relabel_configs is where you will define which targets Prometheus should scrape]

    Q12: Where should alerting rules be defined?
    ```yaml
    scrape_configs:
    - job_name: example
    metric_relabel_configs:
    - source_labels: [__name__]
    regex: database_errors_total
    action: replace
    target_label: __name__
    replacement: database_failures_total
    ```
    A12: separate rules file

    Q13: Which query below will give the 99% quantile of the metric `http_requests_total`?
    A13: `histogram_quantile(0.99, http_requests_total_bucket)`

    Q14: What metric should be used to track the uptime of a server?
    A14: counter

    Q15: Which component of the Prometheus architecture should be used to collect metrics of short-lived jobs?
    A15: push gateway

    Q16: What is the purpose of Prometheus `scrape_interval`?
    A16: Defines how frequently to scrape a target

    Q17: What does the following metric_relabel_config do?
    ```yaml
    scrape_configs:
    - job_name: example
    metric_relabel_configs:
    - source_labels: [__name__]
    regex: database_errors_total
    action: replace
    target_label: __name__
    replacement: database_failures_total
    ```
    A17: Renames the metric `database_errors_total` to `database_failures_total`

    Q18: Which component of the Prometheus architecture should be used to automatically discover all nodes in a Kubernetes cluster?
    A18: service discovery

    Q19: For a histogram metric, what are the different submetrics?
    A19: `__count` [total number of observations], `__bucket` [number of observations for a specific bucket], `__sum` [sum of all observations]
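    In the exposition format these appear as `_bucket`, `_sum` and `_count` suffixes, e.g. (illustrative values):
    ```
    http_request_duration_seconds_bucket{le="0.5"} 129
    http_request_duration_seconds_bucket{le="1"} 133
    http_request_duration_seconds_bucket{le="+Inf"} 144
    http_request_duration_seconds_sum 53.4
    http_request_duration_seconds_count 144
    ```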

    Q20: What is the default web port of Prometheus?
    A20: 9090

    Q21: Add an annotation to the alert called `description` that will print out the message that looks like this `Instance has low disk space on filesystem, current free space is at %`
    ```yaml
    groups:
    - name: node
    rules:
    - alert: node_filesystem_free_percent
    expr: 100 * node_filesystem_free_bytes{job="node"} / node_filesystem_size_bytes{job="node"} < 10
    ## Examples of the two metrics used in the alert can be seen below.
    # node_filesystem_free_bytes{device="/dev/sda3", fstype="ext4", instance="node1", job="web", mountpoint="/home"}
    # node_filesystem_size_bytes{device="/dev/sda3", fstype="ext4", instance="nodde1", job="web", mountpoint="/home"}
    # Choose the correct answer:
    # Option A:
    description: Instance << $Labels.instance >> has low disk space on filesystem << $Labels.mountpoint >>, current free space is at << .Value >>%
    # Option B:
    description: Instance {{ .Labels.instance }} has low disk space on filesystem {{ .Labels.mountpoint }}, current free space is at {{ .Value }}%
    # Option C:
    description: Instance {{ .Labels=instance }} has low disk space on filesystem {{ .Labels=mountpoint }}, current free space is at {{ .Value }}%
    # Option D:
    description: Instance {{ .instance }} has low disk space on filesystem {{ .mountpoint }}, current free space is at {{ .Value }}%
    ```
    A21: Option B

    Q22: What does the double underscore `__` before a label name signify?
    A22: The label is reserved label

    Q23: The metric `http_errors_total` has 3 labels, `path`, `method`, `error`. Which of the following queries will give the total number of errors for a path of `/auth`, method of `POST`, and error code of `401`?
    A23: `http_errors_total{path="/auth", method="POST", code="401"}`

    Q24: What are the different states a Prometheus alert can be in?
    A24: inactive, pending, firing

    Q25: Which of the following components is responsible for collecting metrics from an instance and exposing them in a format Prometheus expects?
    A25: exporters

    Q26: Which of the following is not a valid time value to be used in a range selector?
    A26: 2mo

    Q27: Analyze the example alertmanager configs and determine when an alert with the following labels arrives on alertmanager, what receiver will it send the alert to `team: api` and `severity: critical`?
    ```yaml
    route:
      receiver: general-email
      routes:
      - receiver: frontend-email
        matchers:
        - team: frontend
        routes:
        - matchers:
            severity: critical
          receiver: frontend-pager
      - receiver: backend-email
        matchers:
        - team: backend
        routes:
        - matchers:
            severity: critical
          receiver: backend-pager
      - receiver: auth-email
        matchers:
        - team: auth
        routes:
        - matchers:
            severity: critical
          receiver: auth-pager
    ```
    A27: general-email

    Q28: A metric to track requests to an api `http_requests_total` is created. Which of the following would not be a good choice for a label?
    A28: email

    Q29: Which query below will return a range vector?
    A29: `node_boot_time_seconds[5m]`

    Q30: Based off the metrics below, which query will return the same result as the query database_write_timeouts / ignoring(error) database_error_total
    ```
    database_write_timeouts{instance="db1", job="db", error="212", type="mysql"} 12
    database_error_total{instance="db1", job="db", type="mysql"} 67
    ```
    A30: `database_write_timeouts / on(instance, job, type) database_error_total`

    Q31: What is the purpose of the for attribute in a Prometheus alert rule?
    A31: Determines how long a rule must be true before firing an alert

    Q32: Which query will give sum of all filesystems on the machine? The metric `node_filesystem_size_bytes` will list out all of the filesystems and their total size.
    ```
    node_filesystem_size_bytes{device="/dev/sda2", fstype="vfat", instance="192.168.1.168:9100", mountpoint="/boot/efi"} 536834048
    node_filesystem_size_bytes{device="/dev/sda3", fstype="ext4", instance="192.168.1.168:9100", mountpoint="/"} 13268975616
    node_filesystem_size_bytes{device="tmpfs", fstype="tmpfs", instance="192.168.1.168:9100", mountpoint="/run"} 727924736
    node_filesystem_size_bytes{device="tmpfs", fstype="tmpfs", instance="192.168.1.168:9100", mountpoint="/run/lock"} 5242880
    node_filesystem_size_bytes{device="tmpfs", fstype="tmpfs", instance="192.168.1.168:9100", mountpoint="/run/snapd/ns"} 727924736
    node_filesystem_size_bytes{device="tmpfs", fstype="tmpfs", instance="192.168.1.168:9100", mountpoint="/run/user/1000"} 727920640
    ```
    A32: `sum(node_filesystem_size_bytes{instance="192.168.1.168:9100"})`

    Q33: What are the 3 components of the prometheus server?
    A33: retrieval node, tsdb, http server

    Q34: What selector will match on time series whose `mountpoint` label doesn’t start with /run?
    ```
    node_filesystem_avail_bytes{device="/dev/sda2", fstype="vfat", instance="node1", mountpoint="/boot/efi"}​
    node_filesystem_avail_bytes{device="/dev/sda2", fstype="vfat", instance="node2", mountpoint="/boot/efi"}​
    node_filesystem_avail_bytes{device="/dev/sda3", fstype="ext4", instance="node1", mountpoint="/"}​
    node_filesystem_avail_bytes{device="/dev/sda3", fstype="ext4", instance="node2", mountpoint="/"}​
    node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node1", mountpoint="/run"}​
    node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node1", mountpoint="/run/lock"}​
    node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node1", mountpoint="/run/snapd/ns"}​
    node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node1", mountpoint="/run/user/1000"}​
    node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node2", mountpoint="/run"}​
    node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node2", mountpoint="/run/lock"}​
    node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node2", mountpoint="/run/snapd/ns"}​
    node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node2", mountpoint="/run/user/1000"}
    ```
    A34: `node_filesystem_avail_bytes{mountpoint!~"/run.*"}`

    Q35: Which statement is true about the rate/irate functions?
    A35: `rate()` calculates average rate over entire interval, `irate()` calculates the rate only between the last two datapoints in an interval

    Q36: What is the default path Prometheus will scrape to collect metrics?
    A36: `/metrics`

    Q37: The following PromQL expression is trying to divide `node_filesystem_avail_bytes` by `node_filesystem_size_bytes` (`node_filesystem_avail_bytes / node_filesystem_size_bytes`), but it does not return any results. Fix the expression so that it successfully divides the two metrics. This is what the two metrics look like before the division operation:
    ```
    node_filesystem_avail_bytes{device="/dev/sda2", fstype="vfat", class="SSD", instance="192.168.1.168:9100", job="test", mountpoint="/boot/efi"}
    node_filesystem_size_bytes{device="/dev/sda2", fstype="vfat", instance="192.168.1.168:9100", job="test", mountpoint="/boot/efi"}
    ```
    A37: `node_filesystem_avail_bytes / ignoring(class) node_filesystem_size_bytes`

    Q38: What are the 3 components of observability?
    A38: logging, metrics, traces

    Q39: Which of the following statements are true regarding Alert `labels` and `annotations`?
    ```yaml
    route:
    receiver: staff
    group_by: ['severity']
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 12h
    routes:
    - matchers:
    job: kubernetes
    receiver: infra
    group_by: ['severity']
    ```
    A39: Alert labels can be used as metadata so alertmanager can match on them and perform routing policies, whereas annotations should be used for cosmetic descriptions of the alerts

    Q40: The metric http_errors_total{code=”404”} tracks the number of 404 errors a web server has seen. Which query returns what is the average rate of 404s a server has seen for the past 2 hours? Use a 2m sample range and a query interval of 1m:
    A40: `avg_over_time(rate(http_errors_total{code="404"}[2m]) [2h:1m])`
    [since we need the average for the past 2 hours, the first value in the subquery will be 2h and the second number is the query interval]

    Q41: Which query will return all time series for the metric `node_network_transmit_drop_total` this is greater than 20 and less than 100?
    A41: `node_network_transmit_drop_total > 20 and node_network_transmit_drop_total < 100`

    Q42: What does the following `metric_relabel_config` do?
    ```yaml
    scrape_configs:
    - job_name: example
    metric_relabel_configs:
    - source_labels: [datacenter]
    regex: (.*)
    action: replace
    target_label: location
    replacement: dc-$1
    ```
    A42: changes the datacenter label to location and prepends the value with dc-

    Q43: What type of data should Prometheus monitor?
    A43: numeric

    Q44: Which type of observability would be used to track a request/transaction as it traverses a system?
    A44: traces

    Q45: Add an annotation to the alert called description that will print out the message that looks like this Instance has low disk space on filesystem , current free space is at %
    ```yaml
    groups:
    - name: node
    rules:
    - alert: node_filesystem_free_percent
    expr: 100 * node_filesystem_free_bytes{job="node"} / node_filesystem_size_bytes{job="node"} < 10
    # Examples of the two metrics used in the alert can be seen below
    # node_filesystem_free_bytes{device="/dev/sda3", fstype="ext4", instance="node1", job="web", mountpoint="/home"}
    # node_filesystem_size_bytes{device="/dev/sda3", fstype="ext4", instance="nodde1", job="web", mountpoint="/home"}
    # Choose the correct option:
    #Option A:
    description: Instance << $Labels.instance >> has low disk space on filesystem << $Labels.mountpoint >>, current free space is at << .Value >>%
    #Option B:
    description: Instance {{ .Labels.instance }} has low disk space on filesystem {{ .Labels.mountpoint }}, current free space is at {{ .Value }}%
    #Option C:
    description: Instance {{ .Labels=instance }} has low disk space on filesystem {{ .Labels=mountpoint }}, current free space is at {{ .Value }}%
    #Option D:
    description: Instance {{ .instance }} has low disk space on filesystem {{ .mountpoint }}, current free space is at {{ .Value }}%
    ```
    A45: Option B

    Q46: Regarding histogram and summary metrics, which of the following are true?
    A46: histogram is calculated server side and summary is calculated client side
    [for histograms, quantiles must be calculated server side, thus they are less taxing on client libraries, whereas summary metrics are the opposite]

    Q47: What is this an example of? `Service provider guaranteed 99.999% uptime each month or else customer will be awarded $10k`
    A47: SLA

    Q48: Which of the following is Prometheus’ built in dashboarding/visualization feature?
    A48: Console templates

    Q49: Which query below will give the active bytes on instance 10.1.1.1:9100 45m ago?
    A49: `node_memory_Active_bytes{instance="10.1.1.1:9100"} offset 45m`

    Q50: What type of metric should be used for measuring internal temperature of a server?
    A50: gauge

    Q51: What is the name of the cli utility that comes with Prometheus?
    A51: promtool

    Q52: How can alertmanager prevent certain alerts from generating notification for a temporary period of time?
    A52: Configuring a silence

    Q53: In the scrape configs for a pushgateway, what is the purpose of the `honor_labels: true`
    ```yaml
    scrape_configs:
    - job_name: pushgateway
      honor_labels: true
      static_configs:
      - targets: ["192.168.1.168:9091"]
    ```
    A53: Allows metrics to specify the instance and job labels instead of pulling it from scrape_configs
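    For context, a short-lived job pushes its own `job`/`instance` labels via the Pushgateway URL path, e.g. (host and metric are assumptions):
    ```sh
    echo "backup_duration_seconds 42" | curl --data-binary @- \
      http://192.168.1.168:9091/metrics/job/backup/instance/db1
    ```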

    Q54: Analyze the example alertmanager configs and determine: when an alert with the following labels arrives on alertmanager, which receiver will it send the alert to? team: backend and severity: critical
    ```yaml
    route:
      receiver: general-email
      routes:
      - receiver: frontend-email
        matchers:
        - team: frontend
        routes:
        - matchers:
            severity: critical
          receiver: frontend-pager
      - receiver: backend-email
        matchers:
        - team: backend
        routes:
        - matchers:
            severity: critical
          receiver: backend-pager
      - receiver: auth-email
        matchers:
        - team: auth
        routes:
        - matchers:
            severity: critical
          receiver: auth-pager
    ```
    A54: backend-pager

    Q55: Which of the following would make for a poor SLI?
    A55: high disk utilization
    [things like CPU, memory, disk utilization are poor as user may not experience any degradation of service during these events]

    Q56: Which of the following is not a valid way to reload Prometheus configuration?
    A56: promtool config reload

    Q57: Which of the following is not something that is tracked in a span within a trace?
    A57: complexity

    Q58: You are writing your own exporter for a Redis database. Which of the following would be the correct name for a metric to represent used memory on the by the Redis instance?
    A58: `redis_mem_used_bytes`
    [the first should be the app, second metric name, third the unit]

    Q59: Which cli command can be used to verify/validate prometheus configurations?
    A59: `promtool check config`

    Q60: Which query will return targets who have more than 50 arp entries?
    A60: `node_arp_entries{job="node"} > 50`


    ## Curriculum

    1. **28%** PromQL
    - Selecting Data
    - Rates and Derivatives
    - Aggregating over time
    - Aggregating over dimensions
    - Binary operators
    - Histograms
    - Timestamp Metrics
    2. **20%** Prometheus Fundamentals
    - System Architecture
    - Configuration and Scraping
    - Understanding Prometheus Limitations
    - Data Model and Labels
    - Exposition Format
    3. **18%** Observability Concepts
    - Metrics
    - Understand logs and events
    - Tracing and Spans
    - Push vs Pull
    - Service Discovery
    - Basics of SLOs, SLAs, and SLIs
    4. **18%** Alerting & Dashboarding
    - Dashboarding basics
    - Configuring Alerting rules
    - Understand and Use Alertmanager
    - Alerting basics (when, what, and why)
    5. **16%** Instrumentation & Exporters
    - Client Libraries
    - Instrumentation
    - Exporters
    - Structuring and naming metrics


    ## Alerting
    - lets you define conditions that, if met, trigger alerts
    - these are standard PromQL expressions (e.g. `node_filesystem_avail_bytes < 1000` = 547)
    - Prometheus is only responsible for triggering alerts
    - responsibility of sending notification is offloaded onto **alertmanager** -> Slack, email, SMS etc.
    - alerts are visible in the web gui under "alerts" and they are green if not alerting
    - alerting rules are similar to recording rules, in fact they are in the same location (`rule_files` in `prometheus.yaml`):
    ```yaml
    groups:
    - name: node
      interval: 15s
      rules:
      - record: ...
        expr: ...
      - alert: LowMemory
        expr: node_memory_memFree_percent < 20
    ```
    - The `for` clause tells Prometheus that an expression must evaluate true for specific period of time:
    ```yaml
    - alert: node down
      expr: up{job="node"} == 0
      for: 5m # expects the node to be down for 5 minutes before firing an alert
    ```
    - 3 alert states:
    1. inactive - has not returned any results **green**
    2. pending - it hasn't been long enough to be considered firing (related to `for`) **orange**
    3. firing - active for more than the defined `for` clause **red**

    ### Labels & Annotations
    - optional labels can be added to alerts to provide a mechanism to classify and match alerts
    - important because they can be used when you set up rules in alert manager so you can match on these and group them together
    ```yaml
    - alert: node down
      expr: ...
      labels:
        severity: warning
    - alert: multiple nodes down
      expr: ...
      labels:
        severity: critical
    ```
    - annotations (use Go templating) can be used to provide additional/descriptive information (unlike labels they do not play a part in the alerts identity)
    ```yaml
    - alert: node_filesystem_free_percent
      expr: ...
      annotations:
        description: "Filesystem {{.Labels.device}} on {{.Labels.instance}} is low on space, current available space is {{.Value}}"
    ```
    This is how the templating works:
    - `{{.Labels}}` to access alert labels
    - `{{.Labels.instance}}` to get instance label
    - `{{.Value}}` to get the firing sample value

    ### Alertmanager
    - responsible for receiving alerts generated by Prometheus and converting them to notifications
    - supports multiple Prometheus servers via API
    - workflow:
    1. dispatcher picks up the alerts first,
    2. inhibition allows suppress certain alerts if other alerts exist,
    3. silencing mutes alerts (e.g. maintenance)
    4. routing is responsible what alert gets to send where
    5. notification integrates with all 3rd party tools (email, Slack, SMS, etc.)
    - installation tarball (`alertmanager-0.24.0.linux-amd64.tar.gz`) contains `alertmanager` binary, `alertmanager.yml` config file, `amtool` command line utility and `data` folder where the notification states are stored. The installation is the same as previous tools (add new user, create /etc/alertmanager, create /var/lib/alertmanager, copy executables to /usr/local/bin, change ownerships, create service file, daemon-reload, start, enable). `ExecStart` in systemd expects `--config.file` and `--storage.path`
    - starting is simple `./alertmanager` and listens on 9093 (you can see the interface on https://localhost:9093)
    - restarting AM can be done via HTTP POST to `/-/reload` endpoint, `systemctl restart alertmanager` or `killall -HUP alertmanager`
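    - for example, a config reload without a restart (assuming the default port):
    ```sh
    curl -X POST http://localhost:9093/-/reload
    ```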
    - configure Prometheus to use that alertmanager:
    ```yaml
    global: ...
    alerting:
      alertmanagers:
      - static_configs:
        - targets:
          - 127.0.0.1:9093
          - alertmanager2:9093
    ```
    - there are 3 main sections of `alertmanager.yml`:
    * global - applies across all sections which can be overwritten (e.g. `smtp_smarthost`)
    * route - set of rules to determine what alerts get matched up (`match_re`, `matchers`) with what receiver
    - at the top level, there is a default route - any alerts that don't match any of the other routes will use this default
    example route:
    ```yaml
    route:
      routes:
      - match_re: # regular expression
          job: (node|windows)
        receiver: infra-email
      - matchers: # all alerts with job=kubernetes & severity=ticket labels will match this rule
          job: kubernetes
          severity: ticket
        receiver: k8s-slack # they will be sent to this receiver
    ```
    - nested routes are supported:
    ```yaml
    routes:
    - matchers: # parent route
        job: kubernetes # 2. all other alerts with this label will match this main route (k8s-email)
      receiver: k8s-email
      routes: # sub-route for further route matching (AND)
      - matchers:
          severity: pager # 1. if the alert also has label severity=pager, then it will be sent to k8s-pager
        receiver: k8s-pager
    ```
    - if you need an alert to match two routes, use `continue`:
    ```yaml
    route:
      routes:
      - receiver: alert-logs # all alerts to be sent to alert-logs
        continue: true
      - matchers:
          job: kubernetes # and then if it also has this label, it will be sent to k8s-email
        receiver: k8s-email
    ```
    - grouping allows to split up your notification by labels (otherwise all alerts results in one big notification):
    ```yaml
    receiver: fallback-pager
    group_by: [team]
    routes:
    - matchers:
        team: infra
      group_by: [region, env] # infra team has alerts grouped based on region and env labels
      receiver: infra-email
      # any child routes underneath here will inherit the grouping policy and group based on same 2 labels region, env
    ```
    * receivers - one or more notifiers to forward alerts to users (e.g. `slack_configs`)
    - make use of global configurations so all of the receivers don't have to manually define the same key:
    ```yaml
    global:
      victorops_api_key: XXX # this will be automatically provided to all receivers
    receivers:
    - name: infra-pager
      victorops_configs:
      - routing_key: some-route-here
    ```
    - you can customize the message by using Go templating:
    * GroupLabels (e.g. `title:` in `slack_configs`: `{{.GroupLabels.severity}} alerts in region {{.GroupLabels.region}}`)
    * CommonLabels
    * CommonAnnotations
    * ExternalURL
    * Status
    * Receiver
    * Alerts (e.g. `text:` in `slack_configs`: `{{.Alerts | len}} alerts:`)
    * Labels
    * Annotations (`{{range .Alerts}}{{.Annotations.description}}{{"\n"}}{{end}}`)
    * Status
    * StartsAt
    * EndsAt
    - Example alertmanager.yml config:
    ```yaml
    global:
      smtp_smarthost: 'localhost:25'
      smtp_from: '[email protected]'
    route:
      group_by: ['alertname']
      group_wait: 10s
      group_interval: 2m
      repeat_interval: 1h
      receiver: 'general-email'
      routes:
      - matchers:
        - team=global-infra
        receiver: global-infra-email
      - matchers:
        - team=internal-infra-email
        receiver: internal-infra-email
    receivers:
    - name: 'web.hook'
      webhook_configs:
      - url: 'http://127.0.0.1:5001/'
    - name: global-infra-email
      email_configs:
      - to: [email protected]
        require_tls: false
    - name: internal-infra-email
      email_configs:
      - to: [email protected]
        require_tls: false
    - name: general-email
      email_configs:
      - to: [email protected]
        require_tls: false
    ```

    ### Silences
    - alerts can be silence to prevent generating notifications for a period of time (like maintenance windows)
    - in the "new silence" button - specify start, end/duration, matchers (list of labels), creator, comment
    - you can then view those in the silence tab
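    - silences can also be created from the CLI with `amtool` - a sketch (matcher values and timing are just an illustration):
    ```sh
    amtool silence add alertname="node down" instance="node1:9100" \
      --duration="2h" --author="ops" --comment="planned maintenance" \
      --alertmanager.url=http://localhost:9093
    ```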


    ## Dashboarding & Visualization
    - several different ways:
    * expression browser with graph tab (built-in)
    * console templates (built-in)
    * 3rd party like Grafana
    - expression browser has limited functionality, only for ad-hoc queries and quick debugging, cannot create custom dashboards, not good for day-to-day monitoring, but can have multiple panels and compare graphs

    ### Console Templates
    - allow to create custom HTML pages using Go templating language (typically `{{` and `}}`)
    - Prometheus metrics, queries and charts can be embedded in the templates
    - `ls /etc/prometheus/consoles` to see the `*.html` and example (to see it, go to https://localhost:9090/consoles/index.html.example)
    - boilerplate will typically contain:
    ```html
    {{ template "head" . }}
    {{ template "prom_content_head" . }}
    <h1>Memory details</h1>
    active memory: {{ template "prom_query_drilldown" (args "node_memory_Active_bytes") }}
    {{ template "prom_content_tail" . }}
    {{ template "tail" . }}
    ```
    - an example of inserting a chart:
    ```html
    {{ template "head" . }}
    {{ template "prom_content_head" . }}
    <h1>Memory details</h1>
    active memory: {{ template "prom_query_drilldown" (args "node_memory_Active_bytes") }}
    <div id="graph"></div>
    <script>
    new PromConsole.Graph({
    node: document.querySelector("#graph"),
    expr: "rate(node_memory_Active_bytes[2m])"
    })
    </script>
    {{ template "prom_content_tail" . }}
    {{ template "tail" . }}
    ```
    - another example:
    ```html
    {{ template "head" . }}
    {{ template "prom_content_head" . }}
    <h1>Node Stats</h1>
    <h3>Memory</h3>
    <strong>Memory utilization:</strong> {{ template "prom_query_drilldown" (args "100- (node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes*100)") }}
    <br/>
    <strong>Memory Size:</strong> {{ template "prom_query_drilldown" (args "node_memory_MemTotal_bytes/1000000" "Mb") }}
    <h3>CPU</h3>
    <strong>CPU Count:</strong> {{ template "prom_query_drilldown" (args "count(node_cpu_seconds_total{mode='idle'})") }}
    <br/>
    <strong>CPU Utilization:</strong> {{ template "prom_query_drilldown" (args "sum(rate(node_cpu_seconds_total{mode!='idle'}[2m]))*100/56") }}
    <!--
    Expression explanation: The expression takes the current rate of all CPU modes except idle (idle means the CPU isn't being used), sums them up, and multiplies by 100 to give a percentage. This final number is divided by the number of CPUs on the node, so adjust this value as needed.
    -->
    <div id="cpu"></div>
    <script>
    new PromConsole.Graph({
    node: document.querySelector("#cpu"),
    expr: "sum(rate(node_cpu_seconds_total{mode!='idle'}[2m]))*100/2",
    })
    </script>
    <h3>Network</h3>
    <div id="network"></div>
    <script>
    new PromConsole.Graph({
    node: document.querySelector("#network"),
    expr: "rate(node_network_receive_bytes_total[2m])",
    })
    </script>
    {{ template "prom_content_tail" . }}
    {{ template "tail" . }}
    ```

    ## Application Instrumentation
    - the Prometheus client libraries provide an easy way to add instrumentation to your code in order to track and expose metrics for Prometheus
    - they do 2 things:
    * Track metrics in the Prometheus expected format
    * Expose metrics via `/metrics` path so they can be scraped
    - official and unofficial libraries
    - Example for Python:
    * You have an existing API in Flask, run `pip install prometheus_client`
    * In your code, import it: `from prometheus_client import Counter`
    * Initialize counter object: `REQUESTS = Counter('http_requests_total', 'Total number of requests')`
    * When do we want to increment this? Within all of the `@app.get("/path")` like this: `REQUESTS.inc()`
    * We can also get total requests per path using different counter objects, but that is not recommended. Instead we can use labels:
    * `REQUESTS = Counter('http_requests_total', 'Total number of requests', labelnames=['path'])`
    * `REQUESTS.labels('/cars').inc()`
    * Then you can do the same approach for different HTTP method: `labelnames=['path', 'method']` and `REQUESTS.labels('/cars', 'post').inc()`
    * How to expose to `/metrics` endpoint though?
    ```python
    from prometheus_client import Counter, start_http_server
    if __name__ == '__main__':
        start_http_server(8000)  # start the metrics server on port 8000
        app.run(port='5001')     # this is the Flask app
    ```
    * `curl 127.0.0.1:8000` will show the metrics
    * However, you can also expose the metrics from a Flask route, so the Flask app runs on `http://localhost:5001` and the metrics are served at `http://localhost:5001/metrics`, e.g. `app.wsgi_app = DispatcherMiddleware(app.wsgi_app, { '/metrics': make_wsgi_app() })`
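    * a minimal sketch of that wiring (imports per `werkzeug`/`prometheus_client`; adjust to your app):
    ```python
    from flask import Flask
    from prometheus_client import make_wsgi_app
    from werkzeug.middleware.dispatcher import DispatcherMiddleware

    app = Flask(__name__)
    # serve the app on / and the Prometheus metrics on /metrics from the same port
    app.wsgi_app = DispatcherMiddleware(app.wsgi_app, {'/metrics': make_wsgi_app()})
    ```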
    - complete working example:
    ```python
    from flask import Flask
    from prometheus_client import Counter, start_http_server, Gauge

    REQUESTS = Counter('http_requests_total', 'Total number of requests', labelnames=['path', 'method'])
    ERRORS = Counter('http_errors_total', 'Total number of errors', labelnames=['code'])
    IN_PROGRESS = Gauge('inprogress_requests', 'Total number of requests in progress')

    app = Flask(__name__)

    # register the gauge hooks so in-flight requests are tracked
    @app.before_request
    def before_request():
        IN_PROGRESS.inc()

    @app.after_request
    def after_request(response):
        IN_PROGRESS.dec()
        return response

    @app.get("/products")
    def get_products():
        REQUESTS.labels('products', 'get').inc()
        return "product"

    @app.post("/products")
    def create_product():
        REQUESTS.labels('products', 'post').inc()
        return "created product", 201

    @app.get("/cart")
    def get_cart():
        REQUESTS.labels('cart', 'get').inc()
        return "cart"

    @app.post("/cart")
    def create_cart():
        REQUESTS.labels('cart', 'post').inc()
        return "created cart", 201

    @app.errorhandler(404)
    def page_not_found(e):
        ERRORS.labels('404').inc()
        return "page not found", 404

    if __name__ == '__main__':
        start_http_server(8000)
        app.run(debug=False, host="0.0.0.0", port='6000')
    ```

    ### Implementing histogram & summary in your code (example)

    ```python
    # add histogram metric to track latency/response time for each request
    LATENCY = Histogram('request_latency_seconds', 'Request Latency', labelnames=['path', 'method'])
    # get before_request time via `request.start_time = time.time()`
    # calculate after_request as `request_latency = time.time() - request.start_time` and pass it to:
    LATENCY.labels(request.method, request.path).observe(request_latency)
    ```

    - client libraries can let you specify bucket sizes (e.g. `buckets=[0.01, 0.02, 0.1]`)
    - to configure summary, it is the exact same, just use `LATENCY = Summary('......)`
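    - a short sketch of the summary variant (same label scheme as the histogram above):
    ```python
    from prometheus_client import Summary

    LATENCY = Summary('request_latency_seconds', 'Request Latency', labelnames=['path', 'method'])
    LATENCY.labels('/cars', 'get').observe(0.27)  # observations feed the _sum and _count series
    ```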

    ### Implementing gauge metric in your code (example)

    ```python
    # track the number of active requests getting processed at the moment
    IN_PROGRESS = Gauge('name', 'Description', labelnames=['path', 'method'])
    # before_request will then increment IN_PROGRESS.inc()
    # but after_request when it's done, then decrement IN_PROGRESS.dec()
    ```

    ### Best practices
    - use snake_case naming, all lowercase, e.g. `library_name_unit_suffix`
    - first word should be app/library name it is used for
    - next add what is it used for
    - add unit (`_bytes`) at the end, use unprefixed base units (not microseconds or kilobytes)
    - avoid `_count`, `_sum`, `_bucket` suffixes
    - examples: `process_cpu_seconds`, `http_requests_total`, `redis_connection_errors`, `node_disk_read_bytes_total`
    - not good: `container_docker_restarts`, `http_requests_sum`, `nginx_disk_free_kilobytes`, `dotnet_queue_waiting_time`
    - three types of services/apps:
    * online - immediate response is expected (tracking queries, errors, latency etc)
    * offline - no one is actively waiting for response (amount of queue, wip, processing rate, errors etc)
    * batch - similar to offline but regular, needs push gw (time processing, overall runtime, last completion time)
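    Applying these naming rules with a client library might look like this (names are illustrative):
    ```python
    from prometheus_client import Counter, Gauge

    # <app>_<what is measured>_<base unit>; counters end in _total
    REDIS_ERRORS = Counter('redis_connection_errors_total', 'Total Redis connection errors')
    NGINX_DISK_FREE = Gauge('nginx_disk_free_bytes', 'Free disk space available to the nginx cache')
    ```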


    ## Service Discovery
    - allows Prometheus to dynamically update/populate/remove a list of endpoints to scrape
    - several built-ins: file, ec2, azure, gce, consul, nomad, k8s...
    - in the Web ui: "status" - "service discovery"

    ### File SD
    - list of jobs/targets can be imported from a json/yaml file(s)
    - example:
    ```yaml
    scrape_configs:
    - job_name: file-example
      file_sd_configs:
      - files:
        - file-sd.json
        - '*.json'
    ```
    - then the `file-sd.json` would look like e.g.:
    ```json
    [
      {
        "targets": [ "node1:9100", "node2:9100" ],
        "labels": {
          "team": "dev",
          "job": "node"
        }
      }
    ]
    ```
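    - the resulting configuration can be sanity-checked with promtool before reloading Prometheus (paths are examples):
    ```bash
    promtool check config /etc/prometheus/prometheus.yml
    # discovered targets then show up in the Web UI under "Status" -> "Service Discovery"
    ```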

    ### AWS
    - just need to configure EC2 discovery in the config:
    ```yaml
    scrape_configs:
      - job_name: ec2
        ec2_sd_configs: # needs an IAM user with at least the AmazonEC2ReadOnlyAccess policy
          - region: <region>
            access_key: <access key>
            secret_key: <secret key>
    ```
    - automatically extracts metadata for each EC2 instance
    - defaults to using private IPs
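    - if instances should be scraped on their public address instead, a relabel rule (see the next section) can rewrite `__address__` from the discovered metadata - a sketch, assuming node_exporter on port 9100:
    ```yaml
    scrape_configs:
      - job_name: ec2
        ec2_sd_configs:
          - region: <region>
        relabel_configs:
          - source_labels: [__meta_ec2_public_ip]
            regex: (.*)
            action: replace
            target_label: __address__
            replacement: ${1}:9100
    ```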

    ### Re-labeling
    - classify Prometheus targets & metrics by rewriting their label set
    - e.g. rename instance from `node1:9100` to just `node1`, drop metrics, drop labels etc
    - 2 options:
    * `relabel_configs` in `prometheus.yml` - occurs **before** the scrape and only has access to labels added by the SD mechanism
    * `metric_relabel_configs` in `prometheus.yml` - occurs **after** the scrape

    #### relabel_configs
    - example #1: `__meta_ec2_tag_env = dev | prod`
    ```yaml
    - job_name: aws
      relabel_configs:
        - source_labels: [__meta_ec2_tag_env] # array of labels to match on
          regex: prod                         # match on a specific value of that label
          action: keep|drop|replace # keep = only scrape targets whose labels match the regex (everything else is implicitly dropped); drop = stop scraping targets that match; replace = rewrite a label
    ```
    - example #2: when there is more than one source label (an array), their values are joined with a `;`:
    ```yaml
    relabel_configs:
      - source_labels: [env, team] # if the target has {env="dev"} and {team="marketing"}, we will keep it
        regex: dev;marketing
        action: keep # everything else will be dropped
        # separator: "-" # optional - to change the delimiter between the joined labels, use the `separator` property
    ```

    - target labels = labels that are added to every time series returned from a scrape of that target; relabeling drops all auto-discovered labels (those starting with `__`). In other words: target labels are assigned to every metric coming from that specific target, while discovered labels (starting with `__`) are dropped after the initial relabeling and do not become target labels.
    - example #3: save the discovered `__address__="192.168.1.1:80"` label as a target label, transformed into `{ip="192.168.1.1"}`:
    ```yaml
    relabel_configs:
      - source_labels: [__address__]
        regex: (.*):.*        # capture everything before the `:` into a group referenced as `$1` below
        target_label: ip      # name of the new label
        action: replace
        replacement: $1
    ```
    - example #4: combining the labels `env="dev"` & `team="web"` into `info="web-dev"`:
    ```yaml
    relabel_configs:
      - source_labels: [team, env]
        regex: (.*);(.*)      # the parentheses capture the values, referenced as $1 and $2 below
        action: replace
        target_label: info
        replacement: $1-$2
    ```
    - example #5 Re-label so the label `team` name changes to the `organization` and the value gets prepended with `org-` text:
    ```yaml
    relabel_configs:
      - source_labels: [team]
        regex: (.*)
        action: replace
        target_label: organization
        replacement: org-$1
    ```
    - to drop the label, use `action: labeldrop` based on the `regex`:
    ```yaml
    - regex: size
      action: labeldrop
    ```
    - the opposite of labeldrop is `labelkeep` - but keep in mind ALL other labels will be dropped!
    ```yaml
    - regex: instance|job
      action: labelkeep
    ```
    - to modify the label name (not the value), use `labelmap` like this:
    ```yaml
    - regex: __meta_ec2_(.*) # match any EC2-discovered label, e.g. __meta_ec2_ami="ami-abcdefgh123456"
      action: labelmap
      replacement: ec2_$1    # prefix the captured part with `ec2_`, e.g. ec2_ami="ami-abcdefgh123456"
    ```
    #### metric_relabel_configs
    - takes place after the scrape has been performed, so it has access to the scraped metrics (not just the target labels)
    - configuration is identical to `relabel_configs`
    - example #1:
    ```yaml
    - job_name: example
      metric_relabel_configs: # this will drop the metric http_errors_total
        - source_labels: [__name__]
          regex: http_errors_total
          action: drop # or keep, which would drop every other metric instead
    ```
    - example #2:
    ```yaml
    - job_name: example
      metric_relabel_configs: # rename the metric http_errors_total to http_failures_total
        - source_labels: [__name__]
          regex: http_errors_total
          action: replace
          target_label: __name__           # the label to write to (the metric name)
          replacement: http_failures_total # the new metric name
    ```
    - example #3:
    ```yaml
    - job_name: example
      metric_relabel_configs: # drop the label named code
        - regex: code
          action: labeldrop # drop a label for a metric
    ```
    - example #4:
    ```yaml
    - job_name: example
      metric_relabel_configs: # strip the forward slash and rename {path="/cars"} -> {endpoint="cars"}. Both path and endpoint will now exist; add a labeldrop for path if it only duplicates information.
        - source_labels: [path]
          regex: \/(.*)       # capture any text after the forward slash (the parentheses make it available as $1)
          action: replace
          target_label: endpoint
          replacement: $1     # the captured value
    ```

    ## Push Gateway
    - used when a process (typically a batch job) has already exited before the scrape occurs
    - acts as a middleman between the batch job and the Prometheus server
    - Prometheus will scrape metrics from the PG
    - installation: pushgateway-1.4.3.linux-amd64.tar.gz from the releases page, untar, run `./pushgateway`
    - create a new user: `sudo useradd --no-create-home --shell /bin/false pushgateway`
    - copy the binary to /usr/local/bin, change its owner to pushgateway, and configure a service file (same pattern as for Prometheus - a sketch follows below)
    - `systemctl daemon-reload`, then restart and enable the service
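    - a minimal sketch of the unit file, mirroring the Prometheus service setup (paths, user and filename are assumptions):
    ```
    # /etc/systemd/system/pushgateway.service
    [Unit]
    Description=Prometheus Pushgateway
    Wants=network-online.target
    After=network-online.target

    [Service]
    User=pushgateway
    Group=pushgateway
    Type=simple
    ExecStart=/usr/local/bin/pushgateway

    [Install]
    WantedBy=multi-user.target
    ```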
    - `curl localhost:9091/metrics`
    - configure Prometheus to scrape the gateway. Same as other targets, but it needs `honor_labels: true`, so the pushed metrics keep the labels they were pushed with (e.g. their own `job` label) instead of having them overwritten by the scrape target's labels.
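    - for example, a minimal scrape job for the gateway (the address is a placeholder):
    ```yaml
    scrape_configs:
      - job_name: pushgateway
        honor_labels: true
        static_configs:
          - targets: ["<pushgateway_addr>:9091"]
    ```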
    - to send metrics, issue an HTTP POST request to `http://<pushgateway_addr>:<port>/metrics/job/<job_name>/<label1>/<value1>/<label2>/<value2>...` - `<job_name>` becomes the `job` label of the pushed metrics, and the label/value pairs form a grouping key that allows metrics to be grouped together so several of them can be updated/deleted at once. With a POST request, only metrics with the same name as the newly pushed ones are replaced (and only within the same group).
    1. see the original metrics:
    ```
    processing_time_seconds{quality="hd"} 120
    processed_videos_total{quality="hd"} 10
    processed_bytes_total{quality="hd"} 4400
    ```
    2. POST the `processing_time_seconds{quality="hd"} 999`
    3. result:
    ```
    processing_time_seconds{quality="hd"} 999
    processed_videos_total{quality="hd"} 10
    processed_bytes_total{quality="hd"} 4400
    ```
    - example: push the metric `example_metric 4421` with a job label of `{job="db_backup"}`: `echo "example_metric 4421" | curl --data-binary @- http://localhost:9091/metrics/job/db_backup` (`@-` tells curl to read the POST body from stdin)
    - another example with multiple metrics at once:
    ```bash
    cat <<EOF | curl --data-binary @- http://localhost:9091/metrics/job/video_processing/instance/mp4_node1
    processing_time_seconds{quality="hd"} 120
    processed_videos_total{quality="hd"} 10
    processed_bytes_total{quality="hd"} 4400
    EOF
    ```
    - when using an HTTP PUT request, however, the behavior is different: all metrics within the specified group get replaced by the newly pushed metrics (pre-existing ones are deleted):
    1. start with:
    ```
    processing_time_seconds{quality="hd"} 999
    processed_videos_total{quality="hd"} 10
    processed_bytes_total{quality="hd"} 4400
    ```
    2. PUT the `processing_time_seconds{quality="hd"} 666`
    3. result:
    ```
    processing_time_seconds{quality="hd"} 666
    ```
    - an HTTP DELETE request deletes all metrics within a group (without touching metrics in other groups): `curl -X DELETE http://localhost:9091/metrics/job/archive/app/web` will only delete the metrics in the group `{job="archive", app="web"}`

    #### Client library
    - Python: `from prometheus_client import CollectorRegistry, pushadd_to_gateway`, then initialize `registry = CollectorRegistry()`. You can then push via `pushadd_to_gateway('user2:9091', job='batch', registry=registry)`
    - 3 functions within a library to push metrics:
    * `push` - same as HTTP PUT (any existing metrics for this job are removed and the pushed metrics added)
    * `pushadd` - same as HTTP POST (overrides existing metrics with the same names, but all other metrics in the group remain unchanged)
    * `delete` - same as HTTP DELETE (all metrics for a group are removed)
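    - a minimal end-to-end sketch in Python (the gateway address and metric name are placeholders):
    ```python
    from prometheus_client import CollectorRegistry, Gauge, pushadd_to_gateway

    # use a dedicated registry so only our own metrics are pushed
    registry = CollectorRegistry()

    last_success = Gauge('db_backup_last_success_timestamp_seconds',
                         'Unixtime of the last successful DB backup',
                         registry=registry)
    last_success.set_to_current_time()

    # behaves like HTTP POST: only metrics with the same name in the 'db_backup' group are replaced
    pushadd_to_gateway('localhost:9091', job='db_backup', registry=registry)
    ```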

    ## Alerting

    ## Monitoring Kubernetes

    ## Conclusion
    - You will have 1.5 hours to complete the exam.
    - The certification is valid for 3 years.
    - This exam is online and proctored, with multiple-choice questions.
    - One retake is available for this exam.
    - Important links:
    * Prometheus Certified Associate (PCA) registration link: https://training.linuxfoundation.org/certification/prometheus-certified-associate/
    * Exam curriculum: https://github.com/cncf/curriculum/blob/master/PCA_Curriculum.pdf
    * Certification FAQs: https://docs.linuxfoundation.org/tc-docs/certification/frequently-asked-questions-pca
    * Candidate Handbook: https://docs.linuxfoundation.org/tc-docs/certification/lf-handbook2
    * To ensure your system meets the exam requirements, visit this link: https://syscheck.bridge.psiexams.com/
    * Important exam instructions to check before scheduling the exam: https://docs.linuxfoundation.org/tc-docs/certification/important-instructions-pca

    ### Mock Exam 1 & 2

    ---
    _Last update: Thu Jan 19 05:14:42 UTC 2023_