Notes on SRE and DevOps engineering, for certification exam study

Resources

Study

"Class SRE implements DevOps" is a specific way to do DevOps.

  • Job/Role
  • SRE System
  • CI/CD and tools to implement
  • Operations in Google Cloud and tools to implement

Suggested background (pre-reqs)

Certification as either ACE or PCA (the PCA involves more depth and tradeoff thinking), plus some development or operations experience running systems.

Layered understanding (BOPT)

  • Business
    • external forces
  • Organizational (teams and techniques)
    • internal forces
  • Process/Techniques
    • human considerations
  • Technology/Tools
    • nuts and bolts (making things happen)

Job/Role description

Professional Cloud DevOps Engineers are responsible for efficient development operations that can balance service reliability and delivery speed. They are skilled at using Google Cloud Platform to build software delivery pipelines, deploy and monitor services, and manage and learn from incidents.


What is (the business of) software development?

Alignment of development and operations, but doing so to add value to the business. The investment in software should have ROI (returned value > investment).

Value (increase)

  • Sales/Marketing - attracting new clients to bring in more revenue
  • Client Support - fulfilling obligations to clients and keeping them happy
  • Supplier Integration - interfacing or integrating with partners and suppliers to run the business
  • Internal Automation - improving employee efficiency and happiness

Costs (decrease)

  • Initial Development - planning and building the very first version of the system
  • Operations - that system needs to keep running and those computers need to stay safe
  • Maintenance (Dev) - packages need to be updated and any discovered security issues must be fixed
  • Enhancements (Dev) - running a system invariably offers new info and opportunities for enhancement

How does software deliver value?

A 50%-good solution that people actually have solves more problems and survives longer than a 99%-good solution that nobody has. Shipping is a feature. A really important feature. Your product must have it.

  • Joel Spolsky (co-founder of Stack Overflow)

  • Deltas

    • fundamental unit of software development is a code change
  • Changes

    • value
    • cost
    • risk!
  • "Responsible software development requires risk mitigation in advance."

  • Team dynamic

    • every person is on the team
    • team needs to work together
    • integrating work from multiple people is key (hence continuous integration [CI])

What is development process data flow?

When the people who develop a system are also responsible and accountable for running it properly, you get a different result. Changing developers' incentives this way leads them to do more to improve quality and to shrink the impact of changes, e.g., through smaller code changes.

  • Codebase (dev)
    • Feedback (product)
      • Idea on Backlog (product)
        • Code change (dev)
          • Build (dev)
            • Deployable build (dev/ops = shared responsibility)
              • Running system (operations)
                • Released feature (feature flags, gradually with canaries)

Cost of change is only understood by developers, so product and dev need to work together to assess cost versus value delivered.


Operations

  • setting things up, initially
  • security things
  • deploying new versions of the software/system
  • scaling to meet demand
  • patching infrastructure
  • backing up
  • addressing outages
  • recovering from back up
  • etc.

Dev vs Ops

  • Dev
    • buying a machine
    • judged by features (not by quality)
  • Ops
    • running the machine
    • judged by availability (regardless of system quality)

What is a DevOps Engineer?

DevOps is not a person. DevOps is a way to structure a team.

DevOps team will have shared responsibility for all of

  • developing changes to their system
  • operating their system
  • ensuring quality of their system
  • managing risk (together)

Some misconceptions and job postings include various meanings (or all)

  • just another (newer) name for operations/sysadmin
  • CI/CD
  • a dev that does ops
  • an ops person that does dev - like scripting
  • an ops person that does dev - more than scripting

What is a Site Reliability Engineer (SRE)?

"What happens when a software engineer is tasked with what used to be called operations." - Benjamin Treynor Sloss, founder of Google's SRE team

Developing software to automate tasks all throughout the software development cycle

  • not just ops
  • not just CI/CD
  • definitely also includes quality management

An intentional development risk manager. The true subject of Google's "Professional Cloud DevOps Engineer" certification.

Common Problems

Scale

Problem

  • more users than expected
  • bad actors (e.g. DDoS)
  • bad handling

Solution

  • architect service for scale
  • design scaling into ops, too
  • build in protections

Randomness

Problem

  • bad design/assumptions
  • intermittent failures
  • uncommon events (corner cases)
  • bad failure handling

Solution

  • quality control/assurance
  • code reviews
  • automated testing
  • gradual rollouts

Change (the cause of ~70% of breakage in running systems)

Problem

  • code changes
  • config changes
  • infrastructure changes

Solution

  • automated CI/CD (no manual steps)
  • progressive rollouts (including canaries)
  • timely monitoring
  • quick response (automatic)
  • safe rollbacks (automatic)
  • minimizing impact
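
The progressive-rollout solution above boils down to comparing the canary's error rate against the stable baseline and rolling back automatically when it degrades. A minimal sketch, where the 1.5x tolerance and the traffic/error counts are illustrative assumptions rather than prescribed values:

```python
def canary_is_healthy(canary_errors: int, canary_requests: int,
                      baseline_errors: int, baseline_requests: int,
                      max_relative_degradation: float = 1.5) -> bool:
    """Illustrative policy: the canary's error rate must stay within a
    multiple of the stable baseline's error rate."""
    canary_rate = canary_errors / max(canary_requests, 1)
    baseline_rate = baseline_errors / max(baseline_requests, 1)
    if baseline_rate == 0:
        return canary_rate < 0.001  # fall back to an absolute floor
    return canary_rate <= baseline_rate * max_relative_degradation

# Canary taking ~1% of traffic; numbers are made up
if not canary_is_healthy(12, 1_000, 40, 99_000):
    print("roll back the canary")  # quick, automatic, safe rollback
```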

Tensions (all about tradeoffs)

"100% is always the wrong availability target"

  • use data to make your informed decisions

SRE principles

1.1 Balance change, velocity, & reliability of the service

  • Discover SLIs (availability, latency, etc.)
  • Define SLOs and understand SLAs
  • Agree to consequences of not meeting the error budget
  • Construct feedback loops to decide what to build next

1.2 Manage service life cycle

  • Manage a service (e.g., introduce a new service, deploy it, maintain & retire it)
  • Plan for capacity (e.g., quotas and limits management)

1.3 Ensure healthy communication and collaboration for operations

  • Prevent burnout (e.g., set up automation processes to prevent burnout)
  • Foster a learning culture
  • Foster a culture of blamelessness (team responsibility to understand what happened and what to do differently in future)

CI/CD

2.1 Design CI/CD pipelines

  • Immutable artifacts with Container Registry
  • Artifact repositories with Container Registry
  • Deployment strategies with Cloud Build, Spinnaker
  • Deployment to hybrid & multi-cloud environments with Anthos, Spinnaker, K8s
  • Artifact versioning strategy with Cloud Build, Container Registry
  • CI/CD pipeline triggers with
    • Cloud source repositories
    • Cloud Build GitHub App
    • Cloud Pub/Sub
  • Testing a new version with Spinnaker
  • Configure deployment processes (e.g., approval flows)

2.2 Implement CI/CD pipelines

  • CI with Cloud Build
  • CD with Cloud Build
  • Open source tooling (e.g., Jenkins, Spinnaker, GitLab, Concourse)
  • Audit and tracing of deployments (e.g., CSR, Cloud Build, Cloud Audit Logs)

2.3 Manage configuration and secrets

  • Secure storage methods
  • Secret rotation and config changes

2.4 Manage infrastructure as code (IaC)

  • Terraform / Cloud Deployment Manager
  • Infrastructure code versioning
  • Make infrastructure changes safer
  • Immutable architecture (help prevent drift)

2.5 Deploy CI/CD tooling

  • Centralized tools vs. multiple tools (single vs. multi-tenant)
  • Security of CI/CD tooling

2.6 Managing different development environments (e.g., staging, production, etc.)

  • Decide on number of environments and their purpose
  • Create environments dynamically per feature branch with GKE, Cloud Deployment Manager
    • don't want changes living in a branch for a long time (the opposite of CI)
  • Local development environments with Docker, Cloud Code, Skaffold

2.7 Secure the deployment pipeline

  • Vulnerability analysis with Container Registry
  • Binary Authorization
  • IAM policies per environment

Operations (Ops)

Implementing service monitoring strategies

3.1 Manage application logs

3.2 Manage application metrics with Stackdriver Monitoring

3.3 Manage Stackdriver Monitoring platform

3.4 Manage Stackdriver Logging platform

3.5 Implementing logging and monitoring access controls

Optimizing service performance

4.1 Identify service performance issues

  • cloud trace, profiler, monitoring

4.2 Debug application code

  • debugger (break points in code)

4.3 Optimize resource utilization

  • architecture

Managing service incidents

5.1 Coordinate roles & implement comm. channels during a service incident

  • on call, preventing burnout

5.2 Investigate incident symptoms impacting users with Stackdriver IRM

5.3 Mitigate incident impact on users

5.4 Resolve issues (e.g., Cloud Build, Jenkins)

5.5 Document issue in a postmortem (e.g., 5 Whys - not required but good to know)

Key Pillars of DevOps

  • Reduce organization silos
    • bridge teams together
    • increase communication
    • share company vision
  • Accept failure as normal
    • try to anticipate
    • incidents bound to occur
    • failures help team learn
  • Implement gradual change (opens door to continuous change culture)
    • small updates are better
    • easier to review
    • easier to rollback
  • Leverage tooling & automation
    • reduce manual tasks
    • heart of CI/CD pipelines
    • fosters speed and consistency
  • Measure everything
    • critical gauge of success
    • ci/cd needs full monitoring
    • synthetic, proactive monitoring

Site Reliability Engineering

"What happens when a software engineer is tasked with what used to be called operations." - Ben Treynor Sloss

Why reliability?

  • most important: does the product work
  • reliability is the absence of errors
  • unstable service likely indicates variety of issues
  • must attend to reliability all the time - not just when your hair is on fire

"Class SRE implements DevOps"

  • DevOps is the "what"
  • SRE is the "how"

Reduce organization silos (DevOps pillar)

Share ownership (SRE pillar)

  • developers + operations
  • implement same tooling
  • share same techniques

Accept failure as normal (DevOps pillar)

No-fault post mortems & SLOs (SRE pillar)

  • no two failures the same
  • track incidents (SLIs)
  • map SLIs to objectives (SLOs)

Implement gradual change (DevOps pillar)

Reduce cost of failures (SRE pillar)

  • limited canary rollouts
  • impact fewest users
  • automate everything possible

Leverage tooling & automation (DevOps pillar)

Automate this year's job away (SRE pillar)

  • automation is force multiplier
  • autonomous automation best
  • centralizes mistakes

Measure everything (DevOps pillar)

Measure toil and reliability (SRE pillar)

  • key to SLOs and SLAs
  • reduce toil, up engineering
  • monitor all over time

Goal: make better software, faster

  1. Define availability
  • SLO
  2. Determine level of availability
  • SLI
  3. Detail what happens when availability fails
  • SLA

Service Level Indicators (SLIs)

A carefully defined quantitative measure of some aspect of the level of service that is provided

  • SLIs are metrics over time, specific to user journey such as request/response, data processing, or storage, that show how well a service is doing

Examples:

  • Request latency - how long it takes to return a response
  • Failure rate - fraction of all requests received that fail (unsuccessful requests / all requests)
  • Batch throughput - proportion of time the data processing rate exceeds a threshold

User journey: a sequence of tasks central to the user experience and crucial to the service

Request/response journey:

  • availability - proportion of valid requests served successfully
  • latency - proportion of valid requests served faster than a threshold
  • quality - proportion of valid requests served maintaining quality

Data processing journey:

  • freshness - proportion of valid data updated more recently than a threshold
  • correctness - proportion of valid data producing correct output
  • throughput - proportion of time where the data processing rate is faster than a threshold

4 Golden Signals:

  • Latency - time it takes for the service to fulfill a request
  • Errors - rate at which your service fails
  • Traffic - how much demand is directed at your service
  • Saturation - measure of how close to fully utilized the service's resources are

SLI = ( good events / valid events ) * 100

  • Bad SLI: variance and overlap in metrics prior to and during outages are problematic (jagged line)
  • Good SLI: stable signal with a strong correlation to outage is best (consistent line)
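
The SLI formula above is just a ratio of event counts; a minimal sketch with made-up counts:

```python
def sli_percent(good_events: int, valid_events: int) -> float:
    """SLI = (good events / valid events) * 100."""
    if valid_events == 0:
        return 100.0  # no valid events means nothing has failed yet
    return good_events / valid_events * 100

# e.g., 999,543 successful requests out of 1,000,000 valid requests
print(f"{sli_percent(999_543, 1_000_000):.3f}%")  # 99.954%
```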

Best practices

  • limit number of SLIs (3-5 per user journey, avoid contradictions)
  • reduce complexity (not all metrics make good SLIs, avoid false positives and impact response time)
  • prioritize journeys (select most valuable to users, identify user-centric events)
  • aggregate similar SLIs (collect data over time, turn into a rate/avg/percentile)
  • bucket to distinguish response classes (not all requests are the same; they may come from humans, background jobs, apps, or bots; combine or bucket SLIs)
  • collect data at load balancer (most efficient method, closer to user's experience)

Service Level Objectives (SLOs)

Service Level Objectives specify a target level for the reliability of your service. Need buy-in across the organization.

Why isn't 100% reliability a good objective?

  • closer you get to 100%, more expensive
  • technically complex
  • users don't need 100% to be acceptable
  • leaves room for new features (error budgets)

SLOs are tied to your SLIs

  • measured by SLI
  • can be a single target value or range of values
  • SLI <= SLO

SLI example

  • metrics over time which detail the health of a service
  • site homepage request latency < 300ms over the last 5 minutes at the 95th percentile

SLO example

  • agreed-upon bounds on how often the SLIs must be met
  • the 95th-percentile homepage latency SLI will be met 99.9% of the time over the next year
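
To make the SLI/SLO pairing above concrete, here is a sketch that computes the 95th-percentile latency for one window and checks it against the 300 ms threshold from the example; the sample latencies are fabricated:

```python
import statistics

def latency_sli_met(latencies_ms: list, threshold_ms: float = 300.0) -> bool:
    """True if the 95th-percentile latency in this window is under the threshold."""
    p95 = statistics.quantiles(latencies_ms, n=100)[94]  # 95th percentile cut point
    return p95 < threshold_ms

# One 5-minute window of homepage request latencies (fabricated sample data)
window_ms = [120, 180, 210, 250, 270, 290, 310, 150, 175, 205] * 30
print(latency_sli_met(window_ms))
# The SLO then asks: is this check satisfied in >= 99.9% of windows over the year?
```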

Make SLOs achievable!

  • based on past performance
  • if not historical data, collect some
  • keep in mind: measurement <> user satisfaction

How about Aspirational SLOs?

  • typically higher than achievable
  • set a reasonable target and begin measuring
  • compare user feedback to SLOs

Service Level Agreement (SLAs)

An explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain. Should reliability fail, there are consequences.

SLA characteristics

  • a business-level agreement
  • can be explicit or implicit
  • explicit contracts contain consequences:
    • refund for services paid for
    • service cost reduction on sliding scale
  • may be offered on a per-service basis

Example SLA on GCP service

  • describes SLOs
  • sliding scale

SLIs drive SLOs, which inform SLAs

  • example: a latency SLI might have an SLO of 200 ms and an SLA of 300 ms

SLO

  • internal targets that guide prioritization
  • represents desired user experience
  • missing objective should also have consequences

SLA

  • set level just enough to keep customers
  • incentivizes minimum level of service
  • looser than corresponding objectives

Error Budget

A quantitative measure shared between the product and SRE teams to balance innovation and stability.

  1. Management buys into SLO
  2. SLO used to determine uptime for quarter
  3. Monitoring service measures actual uptime
  4. Calculate difference between SLO and uptime
  5. Push new release if error budget allows

"Risky business" so why risk it?

  • balances innovation and reliability
  • manages release velocity
  • developers oversee own risk
    • will push for more stability and slow down velocity
  • if error budget exceeded:
    • releases temporarily halted
    • system testing and dev expanded
    • performance improved

Error budget = 100% - SLO

  • example: SLO = 99.8%
  • 100% - 99.8% = 0.2% budget
    • 0.002 × 30 days/month × 24 hrs/day × 60 min/hour
    • = 86.4 minutes per month
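
The same arithmetic as a small helper:

```python
def error_budget_minutes_per_month(slo_percent: float, days_per_month: int = 30) -> float:
    """Downtime allowed per month for a time-based SLO."""
    budget_fraction = (100.0 - slo_percent) / 100.0
    return budget_fraction * days_per_month * 24 * 60

print(round(error_budget_minutes_per_month(99.8), 1))  # 86.4 minutes per month
```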

What about global services?

  • time-based error budgets not valid
  • better to define availability in terms of request success rate
  • referred to as aggregate availability
    • Availability = successful requests / total requests
    • example: 99.9% allows ~8.76 hr/yr; 2.16 hr/qtr; 43.2 min/mo; 10.1 min/wk; 1.44 min/day of downtime
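
For a request-based budget, request counts replace wall-clock time; a minimal sketch with illustrative traffic numbers:

```python
def aggregate_availability(successful_requests: int, total_requests: int) -> float:
    """Availability = successful requests / total requests."""
    return successful_requests / total_requests if total_requests else 1.0

def allowed_failed_requests(total_requests: int, slo: float) -> int:
    """Request-based error budget: how many failures the SLO permits."""
    return int(total_requests * (1 - slo))

print(aggregate_availability(9_990_000, 10_000_000))  # 0.999
print(allowed_failed_requests(10_000_000, 0.999))     # 10000 failed requests allowed
```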

What are they good for then? (Piggy bank)

  • releasing new features
  • expected system changes
  • inevitable failure in networks
  • planned downtime
  • risky experiments
  • unforeseen circumstances

Defining and reducing toil

Work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.

  • Manual - includes running a script by hand; the script may save time, but someone still has to run it
  • Repetitive - if a task is repeated multiple times, not just once or twice, then the work is toil
  • Automatable - should the task be done by a machine just as well as by a person, you can consider it toil
  • Tactical - is not proactive or strategy-driven. Rather it's reactive and interrupt-driven, e.g., pager alerts
  • Devoid of enduring value - work that does not change the state, or doesn't add permanent improvement
  • Scales linearly as service grows - tasks that scale up with service size are toil

What is not toil? Overhead, since it is not tied to a production service

  • Email
  • Commuting
  • Expense reports
  • Meetings

Toil reduction benefits

  • increased engineering time
  • higher team morale, lower burnout
  • increased process standardization
  • enhanced team technical skills
  • fewer human error outages
  • shorter incident response times

3 top tips for reduction of toil

  • identify toil
  • estimate time to automate (make sure benefit > cost)
  • measure everything (account for cost of context switching; don't overdo it)
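
The "benefit > cost" check in the second tip can be made explicit; a rough sketch with invented numbers:

```python
def automation_pays_off(manual_minutes_per_run: float, runs_per_month: float,
                        automation_hours: float, horizon_months: int = 12) -> bool:
    """True if the engineering time saved over the horizon exceeds the time
    spent automating (ignores context-switching costs, which also matter)."""
    hours_saved = manual_minutes_per_run * runs_per_month * horizon_months / 60.0
    return hours_saved > automation_hours

# e.g., a 15-minute manual task run 20 times a month vs. 40 hours to automate it
print(automation_pays_off(15, 20, 40))  # True: ~60 hours saved over a year
```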

Generating SRE Metrics

Monitoring

Collecting, processing, aggregating, and displaying real-time quantitative data about a system, such as query counts and types, error counts and types, processing times, and server lifetimes.

Why monitor?

  • analyzing long-term trends (helps setting SLOs)
  • comparing over time or between experiment groups
  • alerting (real time)
  • exposing in dashboards
  • debugging
  • raw input for business analytics

White-box

  • metrics exposed by the system
  • focusing on predicting problems
  • heavy use recommended
  • best for detecting imminent issues

Black-box

  • testing externally visible behavior as a user would see it
  • symptom oriented, active problems
  • moderate use, for critical issues
  • best for paging of incidents

Metrics - numerical measurements representing attributes and events

  • GCP Cloud Monitoring
  • Collects a large number of metrics from every Google Cloud service
  • Provides much less granular information, but in near real time
  • better for alerts and dashboards
  • real-time nature means engineers notified of problems rapidly
  • it's most critical to visualize the data in a dashboard

Logging - append-only record of events

  • GCP Cloud Logging
  • Can contain large volumes of highly-granular data
  • Inherent delay between when an event occurs and when it's visible in logs
  • Logs can be processed with a batch system, interrogated with ad hoc queries, and visualized with dashboards
  • use logs to find root cause of an issue, as the information needed is often not available as a metric
  • for non-time-sensitive reporting, generate detailed reports using logs processing systems
  • logs will nearly always produce more accurate data than metrics
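
As a hedged illustration of the logging side, the google-cloud-logging Python client can write structured (highly granular) entries that are easy to interrogate later with ad hoc queries. The log name and payload below are made up, and client setup details can vary by library version:

```python
from google.cloud import logging as cloud_logging

client = cloud_logging.Client()             # uses Application Default Credentials
logger = client.logger("checkout-service")  # hypothetical log name

# Structured payloads are far easier to query ad hoc later than free-form text.
logger.log_struct(
    {"event": "payment_failed", "order_id": "A-1234", "latency_ms": 872},
    severity="ERROR",
)
```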

Alerting

Alerts give timely awareness to problems in your cloud application so you can resolve the problems quickly.

Set up monitoring

  • conditions are continuously monitored
  • monitoring can track SLOs
  • can look for missing metric
  • can watch thresholds

Track metrics over time

  • track if condition persists for given amount of time
  • time window (due to technical constraints) less than 24 hrs

Notify when condition is passed

  • incident created and displayed
  • alerts can be sent via
    • Email
    • SMS text message
    • App, e.g., Slack
    • Cloud Pub/Sub

Error budget burn rate formula

  • allowed failures = (100% - SLO) × events over a set time
  • example: SLO = 98%, so error budget = 2% (0.02)
    • 12,000 events over 30 days = 400 events/day (400/24 ≈ 16.7 events/hr)
    • 0.02 × 400 = 8 allowed failures per day
    • burn rate compares observed failures to those allowed failures over the same window
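
A small sketch of the budget and burn-rate arithmetic above; the 24 observed failures are an invented figure:

```python
def allowed_failures(slo: float, events: int) -> float:
    """Error budget in events: (1 - SLO) * events over the window."""
    return (1.0 - slo) * events

def burn_rate(observed_failures: int, slo: float, events: int) -> float:
    """How fast the budget is consumed: 1.0 means exactly on budget."""
    budget = allowed_failures(slo, events)
    return observed_failures / budget if budget else float("inf")

# 12,000 events over 30 days at a 98% SLO -> 400 events and 8 allowed failures per day
print(round(allowed_failures(0.98, 400), 1))          # 8.0
print(round(burn_rate(24, slo=0.98, events=400), 1))  # 3.0 -> burning budget 3x too fast
```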

Slow burn alerting policy

  • warns that rate of consumption could exhaust error budget before end of compliance period
  • less urgent than fast-burn condition
  • requires longer lookback period (24 hour max)
  • threshold should be slightly higher than baseline

Fast burn alerting policy

  • warns of a sudden, large change in consumption that, if uncorrected, will exhaust the error budget quickly
  • shorter lookback period (e.g., 1-2 hours or less)
  • set threshold higher than baseline (e.g., 10X; too low results in false positives)
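
A hedged sketch of how the two policies might be combined: the 10x fast-burn multiplier comes from the note above, while the 2x slow-burn threshold and the window names are illustrative assumptions:

```python
def should_page(burn_rate_1h: float, burn_rate_24h: float,
                fast_threshold: float = 10.0, slow_threshold: float = 2.0):
    """Return which alerting policy (if any) fires for the current burn rates."""
    if burn_rate_1h >= fast_threshold:
        return "fast-burn: page immediately"
    if burn_rate_24h >= slow_threshold:
        return "slow-burn: budget will be exhausted before the compliance period ends"
    return None

print(should_page(burn_rate_1h=14.0, burn_rate_24h=1.2))  # fast-burn fires
```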

Establishing an SLO policy

  • select SLO to monitor
  • construct a condition for alerting policy
  • identify notification channel
  • provide documentation
  • create alerting policy

SRE tools

DevTools

  • Kubernetes Engine - managed, production-ready environment for running containerized applications
  • Container Registry - single place for team to securely manage Docker images, used by Kubernetes
  • Cloud Build - service that executes your builds in a series of steps where each step is run in a Docker container
  • Cloud Source Repositories - fully managed private Git repos with integrations for CI, delivery, and deployment
  • Spinnaker for GCP - integrates Spinnaker with other GCP services, allowing you to extend your CI/CD pipeline
  • Cloud Monitoring - visibility into the performance, uptime, and overall health of cloud-powered applications
  • Cloud Logging - allows you to store, search, analyze, monitor, and alert on log data and events from Google Cloud
  • Cloud Debugger - lets you inspect state of a running app in real time, without stopping or slowing it down
  • Cloud Trace - distributed tracing system that collects latency data from apps and displays in console
  • Cloud Profiler - continuously gathers CPU and memory allocation information from your production applications

Incidents

Handling incident response

Good example

  1. the on-caller initiates the incident response protocol and appoints an incident commander
  2. the incident commander works with one operations team and passes control at the end of the day if the issue persists
  3. the communications lead is in the loop from the start and can coordinate public and departmental responses
  4. no freelancing: nobody works the incident outside their assigned role

Management characteristics

  1. separation of responsibilities
  • specific roles should be designated to team members, each with full autonomy in their role
  • roles should include
    • incident commander
    • operational team
    • communication lead
    • planning lead
  2. established command post
  • post could be a physical location or a communication venue, such as a Slack channel
  3. live incident state document
  • shared document that reflects the current state of the incident, updated as necessary and retained for the postmortem
  4. clear, real-time handoff
  • if the day is ending and the issue remains unresolved, an explicit handoff to another incident commander must take place

3 questions to determine an incident

  • is another team needed to resolve it?
  • is the outage visible to users?
  • is it still unresolved after an hour?

"Yes" to any of above is an incident.

Best practices:

  • develop and document procedures
  • prioritize damage and restore service
  • trust team members in specified roles
  • if overwhelmed, get help
  • consider response alternatives
  • practice procedure routinely
  • rotate roles among team members

Managing service lifecycle

Architecture and Design

  • integrate best practices for dev team
  • recommend best infrastructure systems
  • co-design part of service with dev team
  • avoid costly re-designs

Active Development

  • SRE begins productionizing the service
  • planning for capacity
  • adding resources for redundancy
  • planning for spikes and overloads
  • implementing load balancing
  • adding monitoring, alerting and performance tuning

Limited availability

  • measure increasing performance
  • evaluate reliability
  • define SLOs
  • build capacity models
  • establish incident responses, shared with dev team

General availability

  • Production Readiness Review (PRR) passed
  • SRE handle majority of op work
  • incident responses
  • track operational load and SLOs

Deprecation

  • SREs operate existing system
  • support transition with dev team
  • work with dev team on designing new system; adjust staffing accordingly

"SRE principles aim to maximize the engineering velocity of developer teams while keeping products reliable." - SRE Workbook

Postmortem

Agenda

  • get metadata
  • recreate timeline
  • generate report

"No blame"!!!

Production meeting collaboration

  • upcoming production changes (near-term horizon visibility)
  • metrics (review current SLOs)
  • outages (summary of postmortem or update on status)
  • paging events (tactical view of pages and details that followed [valid or not])
  • non-paging events (what events didn't get paged that should have)