Notes on SRE and DevOps engineering, for certification exam study

Resources

Study

"Class SRE implements DevOps" is a specific way to do DevOps.

  • Job/Role
  • SRE System
  • CI/CD and tools to implement
  • Operations in Google Cloud and tools to implement

Suggested background (pre-reqs)

Certification as either ACE or PCA (the PCA involves more depth and tradeoff thinking), plus some development or operations experience running systems.

Layered understanding (BOPT)

  • Business
    • external forces
  • Organizational (teams and techniques)
    • internal forces
  • Process/Techniques
    • human considerations
  • Technology/Tools
    • nuts and bolts (making things happen)

Job/Role description

Professional Cloud DevOps Engineers are responsible for efficient development operations that can balance service reliability and delivery speed. They are skilled at using Google Cloud Platform to build software delivery pipelines, deploy and monitor services, and manage and learn from incidents.


What is (the business of) software development?

Alignment of development and operations, but doing so to add value to the business. The investment in software should have ROI (returned value > investment).

Value (increase)

  • Sales/Marketing - attracting new clients to bring in more revenue
  • Client Support - fulfilling obligations to clients and keeping them happy
  • Supplier Integration - interfacing or integrating with partners and suppliers to run the business
  • Internal Automation - improving employee efficiency and happiness

Costs (decrease)

  • Initial Development - planning and building the very first version of the system
  • Operations - that system needs to keep running and those computers need to stay safe
  • Maintenance (Dev) - packages need to be updated and any discovered security issues must be fixed
  • Enhancements (Dev) - running a system invariably offers new info and opportunities for enhancement

How does software deliver value?

A 50%-good solution that people actually have solves more problems and survives longer than a 99%-good solution that nobody has. Shipping is a feature. A really important feature. Your product must have it.

  • Joel Spolsky (co-founder of Stack Overflow)

  • Deltas

    • fundamental unit of software development is a code change
  • Changes

    • value
    • cost
    • risk!
  • "Responsible software development requires risk mitigation in advance."

  • Team dynamic

    • every person is on the team
    • team needs to work together
    • integrating work from multiple people is key (hence continuous integration [CI])

What is development process data flow?

When the people who develop a system are also responsible and accountable for running it properly, you get a different result. Changing developers' incentives this way leads them to do more to improve quality and to shrink the impact of changes, e.g., through smaller code changes.

  • Codebase (dev)
    • Feedback (product)
      • Idea on Backlog (product)
        • Code change (dev)
          • Build (dev)
            • Deployable build (dev/ops = shared responsibility)
              • Running system (operations)
                • Released feature (feature flags, gradually with canaries)

Cost of change is only understood by developers, so product and dev need to work together to assess cost versus value delivered.


Operations

  • setting things up, initially
  • security things
  • deploying new versions of the software/system
  • scaling to meet demand
  • patching infrastructure
  • backing up
  • addressing outages
  • recovering from back up
  • etc.

Dev vs Ops

  • Dev
    • buying a machine
    • judged by features (not by quality)
  • Ops
    • running the machine
    • judged by availability (regardless of system quality)

What is a DevOps Engineer?

DevOps is not a person. DevOps is a way to structure a team.

DevOps team will have shared responsibility for all of

  • developing changes to their system
  • operating their system
  • ensuring quality of their system
  • managing risk (together)

Some misconceptions and job postings include various meanings (or all)

  • just another (newer) name for operations/sysadmin
  • CI/CD
  • a dev that does ops
  • an ops person that does dev - like scripting
  • an ops person that does dev - more than scripting

What is a Site Reliability Engineer (SRE)?

"What happens when a software engineer is tasked with what used to be called operations." - Benjamin Treynor Sloss, founder of Google's SRE team

Developing software to automate tasks all throughout the software development cycle

  • not just ops
  • not just CI/CD
  • definitely also includes quality management

An intentional development risk manager. The true subject of Google's "Professional Cloud DevOps Engineer" certification.

Common Problems

Scale

Problem

  • more users than expected
  • bad actors (e.g. DDoS)
  • bad handling

Solution

  • architect service for scale
  • design scaling into ops, too
  • build in protections

Randomness

Problem

  • bad design/assumptions
  • intermittent failures
  • uncommon events (corner cases)
  • bad failure handling

Solution

  • quality control/assurance
  • code reviews
  • automated testing
  • gradual rollouts

Change (the cause of ~70% of breakage in running systems)

Problem

  • code changes
  • config changes
  • infrastructure changes

Solution

  • automated CI/CD (no manual steps)
  • progressive rollouts (including canaries)
  • timely monitoring
  • quick response (automatic)
  • safe rollbacks (automatic)
  • minimizing impact
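
The progressive-rollout solution above boils down to comparing the canary's error rate against the stable baseline and rolling back automatically when it degrades. A minimal sketch, where the 1.5x tolerance and the traffic/error counts are illustrative assumptions rather than prescribed values:

```python
def canary_is_healthy(canary_errors: int, canary_requests: int,
                      baseline_errors: int, baseline_requests: int,
                      max_relative_degradation: float = 1.5) -> bool:
    """Illustrative policy: the canary's error rate must stay within a
    multiple of the stable baseline's error rate."""
    canary_rate = canary_errors / max(canary_requests, 1)
    baseline_rate = baseline_errors / max(baseline_requests, 1)
    if baseline_rate == 0:
        return canary_rate < 0.001  # fall back to an absolute floor
    return canary_rate <= baseline_rate * max_relative_degradation

# Canary taking ~1% of traffic; numbers are made up
if not canary_is_healthy(12, 1_000, 40, 99_000):
    print("roll back the canary")  # quick, automatic, safe rollback
```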

Tensions (all about tradeoffs)

"100% is always the wrong availability target"

  • use data to make your informed decisions

SRE principles

1.1 Balance change, velocity, & reliability of the service

  • Discover SLIs (availability, latency, etc.)
  • Define SLOs and understand SLAs
  • Agree to consequences of not meeting the error budget
  • Construct feedback loops to decide what to build next

1.2 Manage service life cycle

  • Manage a service (e.g., introduce a new service, deploy it, maintain & retire it)
  • Plan for capacity (e.g., quotas and limits management)

1.3 Ensure healthy communication and collaboration for operations

  • Prevent burnout (e.g., set up automation processes to prevent burnout)
  • Foster a learning culture
  • Foster a culture of blamelessness (team responsibility to understand what happened and what to do differently in future)

CI/CD

2.1 Design CI/CD pipelines

  • Immutable artifacts with Container Registry
  • Artifact repositories with Container Registry
  • Deployment strategies with Cloud Build, Spinnaker
  • Deployment to hybrid & multi-cloud environments with Anthos, Spinnaker, K8s
  • Artifact versioning strategy with Cloud Build, Container Registry
  • CI/CD pipeline triggers with
    • Cloud source repositories
    • Cloud Build GitHub App
    • Cloud Pub/Sub
  • Testing a new version with Spinnaker
  • Configure deployment processes (e.g., approval flows)

2.2 Implement CI/CD pipelines

  • CI with Cloud Build
  • CD with Cloud Build
  • Open source tooling (e.g., Jenkins, Spinnaker, GitLab, Concourse)
  • Audit and tracing of deployments (e.g., CSR, Cloud Build, Cloud Audit Logs)

2.3 Manage configuration and secrets

  • Secure storage methods
  • Secret rotation and config changes

2.4 Manage infrastructure as code (IaC)

  • Terraform / Cloud Deployment Manager
  • Infrastructure code versioning
  • Make infrastructure changes safer
  • Immutable architecture (help prevent drift)

2.5 Deploy CI/CD tooling

  • Centralized tools vs. multiple tools (single vs. multi-tenant)
  • Security of CI/CD tooling

2.6 Managing different development environments (e.g., staging, production, etc.)

  • Decide on number of environments and their purpose
  • Create environments dynamically per feature branch with GKE, Cloud Deployment Manager
    • don't want changes living in a branch for a long time (the opposite of CI)
  • Local development environments with Docker, Cloud Code, Skaffold

2.7 Secure the deployment pipeline

  • Vulnerability analysis with Container Registry
  • Binary Authorization
  • IAM policies per environment

Operations (Ops)

Implementing service monitoring strategies

3.1 Manage application logs

3.2 Manage application metrics with Stackdriver Monitoring

3.3 Manage Stackdriver Monitoring platform

3.4 Manage Stackdriver Logging platform

3.5 Implementing logging and monitoring access controls

Optimizing service performance

4.1 Identify service performance issues

  • cloud trace, profiler, monitoring

4.2 Debug application code

  • debugger (break points in code)

4.3 Optimize resource utilization

  • architecture

Managing service incidents

5.1 Coordinate roles & implement comm. channels during a service incident

  • on call, preventing burnout

5.2 Investigate incident symptoms impacting users with Stackdriver IRM

5.3 Mitigate incident impact on users

5.4 Resolve issues (e.g., Cloud Build, Jenkins)

5.5 Document issue in a postmortem (e.g., 5 Whys - not required but good to know)

Key Pillars of DevOps

  • Reduce organization silos
    • bridge teams together
    • increase communication
    • share company vision
  • Accept failure as normal
    • try to anticipate
    • incidents bound to occur
    • failures help team learn
  • Implement gradual change (opens door to continuous change culture)
    • small updates are better
    • easier to review
    • easier to rollback
  • Leverage tooling & automation
    • reduce manual tasks
    • heart of CI/CD pipelines
    • fosters speed and consistency
  • Measure everything
    • critical gauge of success
    • ci/cd needs full monitoring
    • synthetic, proactive monitoring

Site Reliability Engineering

"What happens when a software engineer is tasked with what used to be called operations." - Ben Treynor Sloss

Why reliability?

  • most important: does the product work
  • reliability is the absence of errors
  • unstable service likely indicates variety of issues
  • must attend to reliability all the time - not just when your hair is on fire

"Class SRE implements DevOps"

  • DevOps is the "what"
  • SRE is the "how"

Reduce organization silos (DevOps pillar)

Share ownership (SRE pillar)

  • developers + operations
  • implement same tooling
  • share same techniques

Accept failure as normal (DevOps pillar)

No-fault post mortems & SLOs (SRE pillar)

  • no two failures the same
  • track incidents (SLIs)
  • map SLIs to objectives (SLOs)

Implement gradual change (DevOps pillar)

Reduce cost of failures (SRE pillar)

  • limited canary rollouts
  • impact fewest users
  • automate everything possible

Leverage tooling & automation (DevOps pillar)

Automate this year's job away (SRE pillar)

  • automation is force multiplier
  • autonomous automation best
  • centralizes mistakes

Measure everything (DevOps pillar)

Measure toil and reliability (SRE pillar)

  • key to SLOs and SLAs
  • reduce toil, up engineering
  • monitor all over time

Goal: make better software, faster

  1. Define availability
  • SLO
  2. Determine level of availability
  • SLI
  3. Detail what happens when availability fails
  • SLA

Service Level Indicators (SLIs)

A carefully defined quantitative measure of some aspect of the level of service that is provided

  • SLIs are metrics over time, specific to user journey such as request/response, data processing, or storage, that show how well a service is doing

Examples:

  • Request latency - how long it takes to return a response
  • Failure rate - fraction of all requests received that fail (unsuccessful requests / all requests)
  • Batch throughput - proportion of time the data processing rate exceeds a threshold

User journey: a sequence of tasks central to the user experience and crucial to the service

Request/response journey:

  • availability - proportion of valid requests served successfully
  • latency - proportion of valid requests served faster than a threshold
  • quality - proportion of valid requests served maintaining quality

Data processing journey:

  • freshness - proportion of valid data updated more recently than a threshold
  • correctness - proportion of valid data producing correct output
  • throughput - proportion of time where the data processing rate is faster than a threshold

4 Golden Signals:

  • Latency - time it takes for the service to fulfill a request
  • Errors - rate at which your service fails
  • Traffic - how much demand is directed at your service
  • Saturation - measure of how close to fully utilized the service's resources are

SLI = ( good events / valid events ) * 100

  • Bad SLI: variance and overlap in metrics prior to and during outages are problematic (jagged line)
  • Good SLI: stable signal with a strong correlation to outage is best (consistent line)
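
The SLI formula above is just a ratio of event counts; a minimal sketch with made-up counts:

```python
def sli_percent(good_events: int, valid_events: int) -> float:
    """SLI = (good events / valid events) * 100."""
    if valid_events == 0:
        return 100.0  # no valid events means nothing has failed yet
    return good_events / valid_events * 100

# e.g., 999,543 successful requests out of 1,000,000 valid requests
print(f"{sli_percent(999_543, 1_000_000):.3f}%")  # 99.954%
```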

Best practices

  • limit number of SLIs (3-5 per user journey, avoid contradictions)
  • reduce complexity (not all metrics make good SLIs, avoid false positives and impact response time)
  • prioritize journeys (select most valuable to users, identify user-centric events)
  • aggregate similar SLIs (collect data over time, turn into a rate/avg/percentile)
  • bucket to distinguish response classes (not all requests are the same; they may come from humans, background jobs, apps, or bots; combine or bucket SLIs)
  • collect data at load balancer (most efficient method, closer to user's experience)

Service Level Objectives (SLOs)

Service Level Objectives specify a target level for the reliability of your service. Need buy-in across the organization.

Why isn't 100% reliability a good objective?

  • closer you get to 100%, more expensive
  • technically complex
  • users don't need 100% to be acceptable
  • leaves room for new features (error budgets)

SLOs are tied to your SLIs

  • measured by SLI
  • can be a single target value or range of values
  • SLI <= SLO

SLI example

  • metrics over time which detail the health of a service
  • site homepage request latency < 300ms over the last 5 minutes at the 95th percentile

SLO example

  • agreed-upon bounds on how often the SLIs must be met
  • the 95th-percentile homepage latency SLI will be met 99.9% of the time over the next year
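
To make the SLI/SLO pairing above concrete, here is a sketch that computes the 95th-percentile latency for one window and checks it against the 300 ms threshold from the example; the sample latencies are fabricated:

```python
import statistics

def latency_sli_met(latencies_ms: list, threshold_ms: float = 300.0) -> bool:
    """True if the 95th-percentile latency in this window is under the threshold."""
    p95 = statistics.quantiles(latencies_ms, n=100)[94]  # 95th percentile cut point
    return p95 < threshold_ms

# One 5-minute window of homepage request latencies (fabricated sample data)
window_ms = [120, 180, 210, 250, 270, 290, 310, 150, 175, 205] * 30
print(latency_sli_met(window_ms))
# The SLO then asks: is this check satisfied in >= 99.9% of windows over the year?
```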

Make SLOs achievable!

  • based on past performance
  • if not historical data, collect some
  • keep in mind: measurement <> user satisfaction

How about Aspirational SLOs?

  • typically higher than achievable
  • set a reasonable target and begin measuring
  • compare user feedback to SLOs

Service Level Agreement (SLAs)

An explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain. Should reliability fail, there are consequences.

SLA characteristics

  • a business-level agreement
  • can be explicit or implicit
  • explicit contracts contain consequences:
    • refund for services paid for
    • service cost reduction on sliding scale
  • may be offered on a per-service basis

Example SLA on GCP service

  • describes SLOs
  • sliding scale

SLIs drive SLOs, which inform SLAs

  • example: a latency SLI might have an SLO of 200 ms and an SLA of 300 ms

SLO

  • internal targets that guide prioritization
  • represents desired user experience
  • missing objective should also have consequences

SLA

  • set level just enough to keep customers
  • incentivizes minimum level of service
  • looser than corresponding objectives

Error Budget

A quantitative measure shared between the product and SRE teams to balance innovation and stability.

  1. Management buys into SLO
  2. SLO used to determine uptime for quarter
  3. Monitoring service measures actual uptime
  4. Calculate difference between SLO and uptime
  5. Push new release if error budget allows

"Risky business" so why risk it?

  • balances innovation and reliability
  • manages release velocity
  • developers oversee own risk
    • will push for more stability and slow down velocity
  • if error budget exceeded:
    • releases temporarily halted
    • system testing and dev expanded
    • performance improved

Error budget = 100% - SLO

  • example: SLO = 99.8%
  • 100% - 99.8% = 0.2% budget
    • 0.002 × 30 days/month × 24 hrs/day × 60 min/hour
    • = 86.4 minutes per month
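
The same arithmetic as a small helper:

```python
def error_budget_minutes_per_month(slo_percent: float, days_per_month: int = 30) -> float:
    """Downtime allowed per month for a time-based SLO."""
    budget_fraction = (100.0 - slo_percent) / 100.0
    return budget_fraction * days_per_month * 24 * 60

print(round(error_budget_minutes_per_month(99.8), 1))  # 86.4 minutes per month
```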

What about global services?

  • time-based error budgets not valid
  • better to define availability in terms of request success rate
  • referred to as aggregate availability
    • Availability = successful requests / total requests
    • example: 99.9% allows ~8.76 hr/yr; 2.16 hr/qtr; 43.2 min/mo; 10.1 min/wk; 1.44 min/day of downtime
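
For a request-based budget, request counts replace wall-clock time; a minimal sketch with illustrative traffic numbers:

```python
def aggregate_availability(successful_requests: int, total_requests: int) -> float:
    """Availability = successful requests / total requests."""
    return successful_requests / total_requests if total_requests else 1.0

def allowed_failed_requests(total_requests: int, slo: float) -> int:
    """Request-based error budget: how many failures the SLO permits."""
    return int(total_requests * (1 - slo))

print(aggregate_availability(9_990_000, 10_000_000))  # 0.999
print(allowed_failed_requests(10_000_000, 0.999))     # 10000 failed requests allowed
```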

What are they good for then? (Piggy bank)

  • releasing new features
  • expected system changes
  • inevitable failure in networks
  • planned downtime
  • risky experiments
  • unforeseen circumstances

Defining and reducing toil

Work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.

  • Manual - includes running a script by hand; the script may save time, but someone still has to run it
  • Repetitive - if a task is repeated multiple times, not just once or twice, then the work is toil
  • Automatable - should the task be done by a machine just as well as by a person, you can consider it toil
  • Tactical - is not proactive or strategy-driven. Rather it's reactive and interrupt-driven, e.g., pager alerts
  • Devoid of enduring value - work that does not change the state, or doesn't add permanent improvement
  • Scales linearly as service grows - tasks that scale up with service size are toil

What is not toil? Overhead, since it is not tied to a production service

  • Email
  • Commuting
  • Expense reports
  • Meetings

Toil reduction benefits

  • increased engineering time
  • higher team morale, lower burnout
  • increased process standardization
  • enhanced team technical skills
  • fewer human error outages
  • shorter incident response times

3 top tips for reduction of toil

  • identify toil
  • estimate time to automate (make sure benefit > cost)
  • measure everything (account for cost of context switching; don't overdo it)
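
The "benefit > cost" check in the second tip can be made explicit; a rough sketch with invented numbers:

```python
def automation_pays_off(manual_minutes_per_run: float, runs_per_month: float,
                        automation_hours: float, horizon_months: int = 12) -> bool:
    """True if the engineering time saved over the horizon exceeds the time
    spent automating (ignores context-switching costs, which also matter)."""
    hours_saved = manual_minutes_per_run * runs_per_month * horizon_months / 60.0
    return hours_saved > automation_hours

# e.g., a 15-minute manual task run 20 times a month vs. 40 hours to automate it
print(automation_pays_off(15, 20, 40))  # True: ~60 hours saved over a year
```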

Generating SRE Metrics

Monitoring

Collecting, processing, aggregating, and displaying real-time quantitative data about a system, such as query counts and types, error counts and types, processing times, and server lifetimes.

Why monitor?

  • analyzing long-term trends (helps setting SLOs)
  • comparing over time or between experiment groups
  • alerting (real time)
  • exposing in dashboards
  • debugging
  • raw input for business analytics

White-box

  • metrics exposed by the system
  • focusing on predicting problems
  • heavy use recommended
  • best for detecting imminent issues

Black-box

  • testing externally visible behavior as a user would see it
  • symptom oriented, active problems
  • moderate use, for critical issues
  • best for paging of incidents

Metrics - numerical measurements representing attributes and events

  • GCP Cloud Monitoring
  • Collects a large number of metrics from every Google Cloud service
  • Provides much less granular information, but in near real time
  • better for alerts and dashboards
  • real-time nature means engineers notified of problems rapidly
  • it's most critical to visualize the data in a dashboard

Logging - append-only record of events

  • GCP Cloud Logging
  • Can contain large volumes of highly-granular data
  • Inherent delay between when an event occurs and when it's visible in logs
  • Logs can be processed with a batch system, interrogated with ad hoc queries, and visualized with dashboards
  • use logs to find root cause of an issue, as the information needed is often not available as a metric
  • for non-time-sensitive reporting, generate detailed reports using logs processing systems
  • logs will nearly always produce more accurate data than metrics
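
As a hedged illustration of the logging side, the google-cloud-logging Python client can write structured (highly granular) entries that are easy to interrogate later with ad hoc queries. The log name and payload below are made up, and client setup details can vary by library version:

```python
from google.cloud import logging as cloud_logging

client = cloud_logging.Client()             # uses Application Default Credentials
logger = client.logger("checkout-service")  # hypothetical log name

# Structured payloads are far easier to query ad hoc later than free-form text.
logger.log_struct(
    {"event": "payment_failed", "order_id": "A-1234", "latency_ms": 872},
    severity="ERROR",
)
```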

Alerting

Alerts give timely awareness to problems in your cloud application so you can resolve the problems quickly.

Set up monitoring

  • conditions are continuously monitored
  • monitoring can track SLOs
  • can look for missing metric
  • can watch thresholds

Track metrics over time

  • track if condition persists for given amount of time
  • time window (due to technical constraints) less than 24 hrs

Notify when condition is passed

  • incident created and displayed
  • alerts can be sent via
    • Email
    • SMS text message
    • App, e.g., Slack
    • Cloud Pub/Sub

Error budget burn rate formula

  • allowed failures = (100% - SLO) × events over a set time
  • example: SLO = 98%, so error budget = 2% (0.02)
    • 12,000 events over 30 days = 400 events/day (400/24 ≈ 16.7 events/hr)
    • 0.02 × 400 = 8 allowed failures per day
    • burn rate compares observed failures to those allowed failures over the same window
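
A small sketch of the budget and burn-rate arithmetic above; the 24 observed failures are an invented figure:

```python
def allowed_failures(slo: float, events: int) -> float:
    """Error budget in events: (1 - SLO) * events over the window."""
    return (1.0 - slo) * events

def burn_rate(observed_failures: int, slo: float, events: int) -> float:
    """How fast the budget is consumed: 1.0 means exactly on budget."""
    budget = allowed_failures(slo, events)
    return observed_failures / budget if budget else float("inf")

# 12,000 events over 30 days at a 98% SLO -> 400 events and 8 allowed failures per day
print(round(allowed_failures(0.98, 400), 1))          # 8.0
print(round(burn_rate(24, slo=0.98, events=400), 1))  # 3.0 -> burning budget 3x too fast
```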

Slow burn alerting policy

  • warns that rate of consumption could exhaust error budget before end of compliance period
  • less urgent than fast-burn condition
  • requires longer lookback period (24 hour max)
  • threshold should be slightly higher than baseline

Fast burn alerting policy

  • warns of a sudden, large change in consumption that, if uncorrected, will exhaust the error budget quickly
  • shorter lookback period (e.g., 1-2 hours or less)
  • set threshold higher than baseline (e.g., 10X; too low results in false positives)
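
A hedged sketch of how the two policies might be combined: the 10x fast-burn multiplier comes from the note above, while the 2x slow-burn threshold and the window names are illustrative assumptions:

```python
def should_page(burn_rate_1h: float, burn_rate_24h: float,
                fast_threshold: float = 10.0, slow_threshold: float = 2.0):
    """Return which alerting policy (if any) fires for the current burn rates."""
    if burn_rate_1h >= fast_threshold:
        return "fast-burn: page immediately"
    if burn_rate_24h >= slow_threshold:
        return "slow-burn: budget will be exhausted before the compliance period ends"
    return None

print(should_page(burn_rate_1h=14.0, burn_rate_24h=1.2))  # fast-burn fires
```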

Establishing an SLO policy

  • select SLO to monitor
  • construct a condition for alerting policy
  • identify notification channel
  • provide documentation
  • create alerting policy

SRE tools

DevTools

  • Kubernetes Engine - managed, production-ready environment for running containerized applications
  • Container Registry - single place for team to securely manage Docker images, used by Kubernetes
  • Cloud Build - service that executes your builds in a series of steps where each step is run in a Docker container
  • Cloud Source Repositories - fully managed private Git repos with integrations for CI, delivery, and deployment
  • Spinnaker for GCP - integrates Spinnaker with other GCP services, allowing you to extend your CI/CD pipeline
  • Cloud Monitoring - visibility into the performance, uptime, and overall health of cloud-powered applications
  • Cloud Logging - allows you to store, search, analyze, monitor, and alert on log data and events from Google Cloud
  • Cloud Debugger - lets you inspect state of a running app in real time, without stopping or slowing it down
  • Cloud Trace - distributed tracing system that collects latency data from apps and displays in console
  • Cloud Profiler - continuously gathers CPU and memory allocation information from your production applications

Incidents

Handling incident response

Good example

  1. the on-caller initiates the incident response protocol and appoints an incident commander
  2. the incident commander works with one operations team and passes control at the end of the day if the issue persists
  3. the communications lead is in the loop from the start and can coordinate public and departmental responses
  4. no freelancing: nobody works the incident outside their assigned role

Management characteristics

  1. separation of responsibilities
  • specific roles should be designated to team members, each with full autonomy in their role
  • roles should include
    • incident commander
    • operational team
    • communication lead
    • planning lead
  2. established command post
  • post could be a physical location or a communication venue, such as a Slack channel
  3. live incident state document
  • shared document that reflects the current state of the incident, updated as necessary and retained for the postmortem
  4. clear, real-time handoff
  • if the day is ending and the issue remains unresolved, an explicit handoff to another incident commander must take place

3 questions to determine an incident

  • is another team needed to resolve it?
  • is the outage visible to users?
  • is it still unresolved after an hour?

"Yes" to any of above is an incident.

Best practices:

  • develop and document procedures
  • prioritize damage and restore service
  • trust team members in specified roles
  • if overwhelmed, get help
  • consider response alternatives
  • practice procedure routinely
  • rotate roles among team members

Managing service lifecycle

Architecture and Design

  • integrate best practices for dev team
  • recommend best infrastructure systems
  • co-design part of service with dev team
  • avoid costly re-designs

Active Development

  • SRE begins productionizing the service
  • planning for capacity
  • adding resources for redundancy
  • planning for spikes and overloads
  • implementing load balancing
  • adding monitoring, alerting and performance tuning

Limited availability

  • measure increasing performance
  • evaluate reliability
  • define SLOs
  • build capacity models
  • establish incident responses, shared with dev team

General availability

  • Production Readiness Review (PRR) passed
  • SRE handle majority of op work
  • incident responses
  • track operational load and SLOs

Deprecation

  • SREs operate existing system
  • support transition with dev team
  • work with dev team on designing new system; adjust staffing accordingly

"SRE principles aim to maximize the engineering velocity of developer teams while keeping products reliable." - SRE Workbook

Postmortem

Agenda

  • get metadata
  • recreate timeline
  • generate report

"No blame"!!!

Production meeting collaboration

  • upcoming production changes (near-term horizon visibility)
  • metrics (review current SLOs)
  • outages (summary of postmortem or update on status)
  • paging events (tactical view of pages and details that followed [valid or not])
  • non-paging events (what events didn't get paged that should have)