@vdt
Forked from luckylittle/Mock Exam 1.md
Created February 27, 2024 13:47
Prometheus Certified Associate (PCA)

  1. Observability
  • the ability to understand and measure the state of a system based on data generated by the system
  • allows you to generate actionable outputs from unexpected scenarios
  • helps you better understand the internals of your system
  • greater need for observability in distributed systems & microservices
  • troubleshooting - e.g. why are error rates high?
  • 3 pillars are logging, metrics, traces:

a. Logs - records of events that have occurred, each encapsulating information about the specific event

b. Metrics - numerical information about the state of a system; data can be aggregated over time; each metric contains a name, value, timestamp, and dimensions

c. Traces - follow operations (identified by a trace-id) as they travel through different hops; spans are the events that form a trace

  • Prometheus only handles metrics, not logs or traces!
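
To make the four parts of a metric (name, value, timestamp, dimensions) concrete, here is a minimal sketch in Python. The `Sample` class, its `series_id` helper, and the metric name are made up for illustration; this is not how Prometheus stores data internally.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Sample:
    """One metric sample: name, value, timestamp, and dimensions (labels)."""
    name: str
    value: float
    timestamp_ms: int
    labels: Dict[str, str] = field(default_factory=dict)

    def series_id(self) -> str:
        # The name plus the sorted label set identifies which series
        # this sample belongs to.
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(self.labels.items()))
        return f"{self.name}{{{pairs}}}"


sample = Sample(
    name="http_requests_total",  # hypothetical metric name
    value=1027.0,
    timestamp_ms=1700000000000,
    labels={"method": "GET", "status": "200"},
)
print(sample.series_id())
```

Two samples with the same name but different label values belong to different series, which is what makes aggregation by dimension possible.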
  1. SLO/SLA/SLI

a. SLI = quantitative measure of some aspect of the level of service provided (availability, latency, error rate etc.)

  • not all metrics make for good SLIs, you want to find metrics that accurately measure a user's experience
  • high CPU, high memory are poor SLIs as they don't necessarily affect user's experience

b. SLO = target value or range for an SLI

  • examples:
    • SLI: latency, SLO: latency < 100 ms
    • SLI: availability, SLO: 99.99% uptime
  • should be directly related to the customer experience
  • purpose is to quantify reliability of a product to a customer
  • it is tempting to set overly aggressive targets
  • goal is not to achieve perfection, but make customers happy

c. SLA = contract between a vendor and a user that guarantees an SLO
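
The relationship between an SLI and an SLO can be sketched as a measurement compared against a target. The function names and request counts below are hypothetical, chosen only to illustrate the 99.99% availability example.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the fraction of requests that were served successfully."""
    if total_requests == 0:
        # No traffic means nothing failed (a convention chosen for this sketch).
        return 1.0
    return good_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """SLO check: is the measured SLI at or above the target value?"""
    return sli >= slo_target


SLO_TARGET = 0.9999  # 99.99% availability

good_week = availability_sli(good_requests=999_950, total_requests=1_000_000)
bad_week = availability_sli(good_requests=999_800, total_requests=1_000_000)

print(meets_slo(good_week, SLO_TARGET))
print(meets_slo(bad_week, SLO_TARGET))
```

The first week stays within the target and the second does not, even though both look close to 100% at a glance.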

  1. Prometheus fundamentals
  • use cases:
    • collect metrics from different locations like West DC, central DC, East DC, AWS etc.
    • detect high memory usage on the host running a MySQL database and notify the operations team via email
    • find out at which uploaded video length the application starts to degrade
  • open source monitoring tool that collects metrics data and provides tools to visualize the data
  • can generate alerts when a threshold is reached
  • collects data by scraping targets that expose metrics through an HTTP endpoint
  • data is stored in a time series database and can be queried with the built-in PromQL
  • what can it monitor:
    • CPU/memory
    • disk space
    • service uptime
    • app specific data - number of exceptions, latency, pending requests
    • networking devices, databases etc.
  • exclusively monitors numeric time-series data
  • does not monitor events, system logs, or traces
  • originally developed at SoundCloud
  • written in Go
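
As a rough illustration of a target exposing metrics through an HTTP endpoint, the sketch below uses only the Python standard library rather than an official Prometheus client library. The metric names, the `METRICS` dict, and `run_server` are made up for illustration, and the one `name value` pair per line output is a simplified form of the text a scraper would read.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical application state that we want to expose as metrics.
METRICS = {"app_pending_requests": 3, "app_exceptions_total": 12}


def render_metrics(metrics):
    """Render each metric as one 'name value' line, sorted by name."""
    return "".join(f"{name} {value}\n" for name, value in sorted(metrics.items()))


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the /metrics path is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def run_server(port):
    """Serve /metrics until interrupted (not called automatically here)."""
    HTTPServer(("127.0.0.1", port), MetricsHandler).serve_forever()


# Show what a scraper would receive from the endpoint.
print(render_metrics(METRICS), end="")
```

Calling `run_server(8000)` and opening `http://127.0.0.1:8000/metrics` would return the same text over HTTP. The application only has to publish its current values; collection is left to whoever scrapes the endpoint.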
  1. Prometheus Architecture
  • 3 core components:
    • retrieval (scrapes metric data)
    • TSDB (stores metric data)
    • HTTP server (accepts PromQL queries)
  • many other components make up the whole solution:
    • exporters (small processes running on the targets) from which the retrieval component pulls the metrics
    • pushgateway (short-lived jobs push their data to it, and Prometheus scrapes it from there)
    • service discovery provides the list of targets so you don't have to hardcode those values
    • alertmanager handles the notifications (email, SMS, Slack etc.) after alerts are pushed to it
    • Prometheus Web UI or Grafana etc.
  • collects metrics by sending an HTTP request to the /metrics endpoint of each target; the path can be changed
  • several native exporters:
    • node exporters (Linux)
    • Windows
    • MySQL
    • Apache
    • HAProxy
    • client libraries to monitor application metrics (# of errors/exceptions, latency, job execution duration) for Go, Java, Python, Ruby, Rust
  • pros of the pull-based model:
    • easier to tell if the target is down
    • targets cannot overload (DDoS) the metrics server, because the server decides when to scrape
    • definitive list of targets to monitor (central source of truth)
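
The pull model can be sketched as a loop over a fixed target list in which a failed scrape is itself a signal. This is a simplified illustration in Python, not the retrieval component's actual code: `http_fetch`, `parse_metrics`, `scrape_all`, and the target names are hypothetical, and the parser only handles plain `name value` lines. Prometheus records a comparable `up` series for each scraped target.

```python
import http.client


def http_fetch(target, path="/metrics"):
    """Fetch the metrics page from a 'host:port' target over HTTP."""
    host, port = target.rsplit(":", 1)
    conn = http.client.HTTPConnection(host, int(port), timeout=5)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        if resp.status != 200:
            raise OSError(f"unexpected status {resp.status}")
        return resp.read().decode("utf-8")
    finally:
        conn.close()


def parse_metrics(text):
    """Parse simple 'name value' lines, skipping blanks and '#' comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, value = line.rsplit(" ", 1)
        samples[name] = float(value)
    return samples


def scrape_all(targets, fetch=http_fetch):
    """Scrape every target once; a failed scrape is recorded as up=0."""
    results = {}
    for target in targets:
        try:
            samples = parse_metrics(fetch(target))
            samples["up"] = 1.0
        except (OSError, ValueError, http.client.HTTPException):
            samples = {"up": 0.0}
        results[target] = samples
    return results


def fake_fetch(target):
    # Stand-in for the network: only app-1 is reachable in this demo.
    pages = {"app-1:8000": "app_pending_requests 3\n"}
    if target not in pages:
        raise ConnectionRefusedError(target)
    return pages[target]


results = scrape_all(["app-1:8000", "app-2:8000"], fetch=fake_fetch)
print(results)
```

Because the scraper owns the list of targets, an unreachable target shows up immediately as `up` equal to 0. In a push model, silence from a target is harder to tell apart from a target that simply has nothing to report.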

Last update: Thu Jan 12 23:34:51 UTC 2023
