"Class SRE implements DevOps" is a specific way to do DevOps.
Job/Role
SRE System
CI/CD and tools to implement
Operations in Google Cloud and tools to implement
Suggested background (pre-reqs)
Certified as either an Associate Cloud Engineer (ACE) or Professional Cloud Architect (PCA, which demands more depth and tradeoff thinking), plus some development or operations experience running systems.
Layered understanding (BOPT)
Business
external forces
Organizational (teams and techniques)
internal forces
Process/Techniques
human considerations
Technology/Tools
nuts and bolts (making things happen)
Job/Role description
Professional Cloud DevOps Engineers are responsible for efficient development operations that can balance service reliability and delivery speed. They are skilled at using Google Cloud Platform to build software delivery pipelines, deploy and monitor services, and manage and learn from incidents.
What is (the business of) software development?
Alignment of development and operations, but doing so to add value to the business. The investment in software should have ROI (returned value > investment).
Value (increase)
Sales/Marketing - attracting new clients to bring in more revenue
Client Support - fulfilling obligations to clients and keeping them happy
Supplier Integration - interfacing or integrating with partners and suppliers to run the business
Internal Automation - improving employee efficiency and happiness
Costs (decrease)
Initial Development - planning and building the very first version of the system
Operations - that system needs to keep running and those computers need to stay safe
Maintenance (Dev) - packages need to be updated and any discovered security issues must be fixed
Enhancements (Dev) - running a system invariably offers new info and opportunities for enhancement
How does software deliver value?
A 50%-good solution that people actually have solves more problems and survives longer than a 99%-good solution that nobody has. Shipping is a feature. A really important feature. Your product must have it.
Joel Spolsky (co-founder of Stack Overflow)
Deltas
fundamental unit of software development is a code change
Changes
value
cost
risk!
"Responsible software development requires risk mitigation in advance."
Team dynamic
every person is on the team
team needs to work together
integrating work from multiple people is key (hence continuous integration [CI])
What is development process data flow?
When the people who develop a system are also responsible and accountable for running it properly, you get a different result. Changing the incentives for developers means they will do more to improve quality and ship smaller code changes, shrinking the impact of any failure.
Implement gradual change (opens door to continuous change culture)
small updates are better
easier to review
easier to rollback
Leverage tooling & automation
reduce manual tasks
heart of CI/CD pipelines
fosters speed and consistency
Measure everything
critical gauge of success
CI/CD needs full monitoring
synthetic, proactive monitoring
Site Reliability Engineering
What happens when a software engineer is tasked with what used to be called operations - Ben Treynor Sloss
Why reliability?
most important: does the product work
reliability is the absence of errors
unstable service likely indicates variety of issues
must attend to reliability all the time - not just when your hair is on fire
"Class SRE implements DevOps"
DevOps is the "what"
SRE is the "how"
Reduce organization silos (DevOps pillar)
Share ownership (SRE pillar)
developers + operations
implement same tooling
share same techniques
Accept failure as normal (DevOps pillar)
No-fault post mortems & SLOs (SRE pillar)
no two failures the same
track incidents (SLIs)
map SLIs to objectives (SLOs)
Implement gradual change (DevOps pillar)
Reduce cost of failures (SRE pillar)
limited canary rollouts
impact fewest users
automate everything possible
Leverage tooling & automation (DevOps pillar)
Automate this year's job away (SRE pillar)
automation is force multiplier
autonomous automation best
centralizes mistakes
Measure everything (DevOps pillar)
Measure toil and reliability (SRE pillar)
key to SLOs and SLAs
reduce toil, up engineering
monitor all over time
Goal: make better software, faster
SLO - define availability
SLI - determine level of availability
SLA - detail what happens when availability fails
Service Level Indicators (SLIs)
A carefully defined quantitative measure of some aspect of the level of service that is provided
SLIs are metrics over time, specific to a user journey such as request/response, data processing, or storage, that show how well a service is doing
Examples:
Request latency - how long it takes to return a response
Failure rate - fraction of all requests received that fail (unsuccessful requests / all requests)
Batch throughput - proportion of time where the data processing rate is greater than a threshold
User journey: sequence of tasks central to the user experience and crucial to the service
Request/response journey:
availability - proportion of valid requests served successfully
latency - proportion of valid requests served faster than a threshold
quality - proportion of valid requests served maintaining quality
Data processing journey:
freshness - proportion of valid data updated more recently than a threshold
correctness - proportion of valid data producing correct output
throughput - proportion of time where the data processing rate is faster than a threshold
4 Golden Signals:
Latency - time it takes for the service to fulfill a request
Errors - rate at which your service fails
Traffic - how much demand is directed at your service
Saturation - measure of how close to fully utilized the service's resources are
SLI = ( good events / valid events ) * 100
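As a quick illustration of the formula above, here is a minimal Python sketch; the event counts are made-up numbers:

```python
def sli(good_events: int, valid_events: int) -> float:
    """SLI as a percentage: (good events / valid events) * 100."""
    if valid_events == 0:
        return 100.0  # no valid events, so nothing was violated
    return good_events / valid_events * 100

# Hypothetical example: 99,942 of 100,000 valid requests were good.
print(f"SLI = {sli(99_942, 100_000):.3f}%")  # SLI = 99.942%
```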
Bad SLI: variance and overlap in metrics prior to and during outages are problematic (jagged line)
Good SLI: stable signal with a strong correlation to outage is best (consistent line)
Best practices
limit number of SLIs (3-5 per user journey, avoid contradictions)
reduce complexity (not all metrics make good SLIs, avoid false positives and impact response time)
prioritize journeys (select most valuable to users, identify user-centric events)
aggregate similar SLIs (collect data over time, turn into a rate/avg/percentile)
bucket to distinguish response classes (not all requests are the same; they may come from humans, background jobs, apps, or bots; combine or bucket SLIs accordingly)
collect data at load balancer (most efficient method, closer to user's experience)
Service Level Objectives (SLOs)
Service Level Objectives specify a target level for the reliability of your service. Need buy-in across the organization.
Why isn't 100% reliability a good objective?
closer you get to 100%, more expensive
technically complex
users don't need 100% to be acceptable
leaves room for new features (error budgets)
SLOs are tied to your SLIs
measured by SLI
can be a single target value or range of values
SLI <= SLO
SLI example
metrics over time which detail the health of a service
site homepage latency requests < 300 ms over last 5 minutes @ 95th percentile
SLO example
agreed-upon bounds on how often SLIs must be met
95th percentile homepage SLI will succeed 99.9% of the time over the next year
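To make the two examples concrete, here is a minimal Python sketch that evaluates the latency SLI per 5-minute window and then measures SLO compliance across windows; the data shapes are assumptions, since real measurements would come from your monitoring system:

```python
import statistics

def latency_sli_met(window_latencies_ms: list[float],
                    threshold_ms: float = 300.0) -> bool:
    """SLI check: 95th-percentile latency of one 5-minute window under 300 ms."""
    # statistics.quantiles with n=100 returns the 1st..99th percentile cut points.
    p95 = statistics.quantiles(window_latencies_ms, n=100)[94]
    return p95 < threshold_ms

def slo_compliance(windows: list[list[float]]) -> float:
    """SLO check: fraction of windows meeting the SLI (target: >= 0.999)."""
    return sum(latency_sli_met(w) for w in windows) / len(windows)
```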
Make SLOs achievable!
based on past performance
if not historical data, collect some
keep in mind: measurement ≠ user satisfaction
How about Aspirational SLOs?
typically higher than achievable
set a reasonable target and begin measuring
compare user feedback to SLOs
Service Level Agreement (SLAs)
An explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain. Should reliability fail, there are consequences.
SLA characteristics
a business-level agreement
can be explicit or implicit
explicit contracts contain consequences:
refund for services paid for
service cost reduction on sliding scale
may be offered on a per-service basis
Example SLA on GCP service
describes SLOs
sliding scale
SLIs drive SLOs, which inform SLAs
latency SLI (set SLO 200ms and SLA 300ms)
SLO
internal targets that guide prioritization
represents desired user experience
missing objective should also have consequences
SLA
set level just enough to keep customers
incentivizes minimum level of service
looser than corresponding objectives
Error Budget
A quantitative measure shared between the product and SRE teams to balance innovation and stability.
Management buys into SLO
SLO used to determine uptime for quarter
Monitoring service measures actual uptime
Calculate difference between SLO and uptime
Push new release if error budget allows
"Risky business" so why risk it?
balances innovation and reliability
manages release velocity
developers oversee own risk
will push for more stability and slow down velocity
if error budget exceeded:
releases temporarily halted
system testing and dev expanded
performance improved
Error budget = 100% - SLO
example: SLO = 99.8%
100% - 99.8% = 0.2% budget
0.002 × 30 days/month × 24 hrs/day × 60 min/hour = 86.4 minutes per month
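The same arithmetic as a tiny Python helper:

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed downtime per compliance period for a time-based SLO."""
    return (1.0 - slo) * days * 24 * 60

print(round(error_budget_minutes(0.998), 1))  # 86.4 minutes per 30-day month
```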
What about global services?
time-based error budgets not valid
better to define availability in terms of request success rate
referred to as aggregate availability
Availability = successful requests / total requests
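And as a one-liner in Python, with hypothetical request counts:

```python
def aggregate_availability(successful: int, total: int) -> float:
    """Request-based availability, the better fit for global services."""
    return successful / total

# Hypothetical: 9,998,000 of 10,000,000 requests succeeded.
print(f"{aggregate_availability(9_998_000, 10_000_000):.2%}")  # 99.98%
```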
Toil
Work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as the service grows.
Manual - extends to include running a script, which, although it saves time, still must be run by hand
Repetitive - if a task is repeated multiple times, not just once or twice, then the work is toil
Automatable - should the task be done by a machine just as well as by a person, you can consider it toil
Tactical - is not proactive or strategy-driven. Rather it's reactive and interrupt-driven, e.g., pager alerts
Devoid of enduring value - work that does not change the state, or doesn't add permanent improvement
Scales linearly as service grows - tasks that scale up with service size are toil
What is not toil? Overhead, as it is not tied to a production service:
Email
Commuting
Expense reports
Meetings
Toil reduction benefits
increased engineering time
higher team morale, lower burnout
increased process standardization
enhanced team technical skills
fewer human error outages
shorter incident response times
3 top tips for reduction of toil
identify toil
estimate time to automate (make sure benefit > cost; see the sketch after this list)
measure everything (account for cost of context switching; don't overdo it)
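To make the benefit > cost check in the second tip concrete, here is a back-of-the-envelope sketch; every number in it is hypothetical:

```python
def automation_pays_off(hours_to_automate: float,
                        minutes_saved_per_run: float,
                        runs_per_month: float,
                        horizon_months: int = 12) -> bool:
    """True if automation saves more engineer-hours than it costs over the horizon."""
    hours_saved = minutes_saved_per_run / 60 * runs_per_month * horizon_months
    return hours_saved > hours_to_automate

# 40 hours to automate a 15-minute task run 30x/month:
# saves 90 hours over a year, so it pays off.
print(automation_pays_off(40, 15, 30))  # True
```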
Generating SRE Metrics
Monitoring
Collecting, processing, aggregating, and displaying real-time quantitative data about a system, such as query counts and types, error counts and types, processing times, and server lifetimes.
Why monitor?
analyzing long-term trends (helps setting SLOs)
comparing over time or between experiment groups
alerting (real time)
exposing in dashboards
debugging
raw input for business analytics
White-box
metrics exposed by the system
focusing on predicting problems
heavy use recommended
best for detecting imminent issues
Black-box
testing externally visible behavior as a user would see it
symptom oriented, active problems
moderate use, for critical issues
best for paging of incidents
Metrics - numerical measurements representing attributes and events
GCP Cloud Monitoring
Collects a large number of metrics from every Google Cloud service
Provides much less granular information than logs, but in near real time
better for alerts and dashboards
real-time nature means engineers notified of problems rapidly
it's most critical to visualize the data in a dashboard
Logging - append-only record of events
GCP Cloud Logging
Can contain large volumes of highly-granular data
Inherent delay between when an event occurs and when it's visible in logs
Logs can be processed with a batch system, interrogated with ad hoc queries, and visualized with dashboards
use logs to find root cause of an issue, as the information needed is often not available as a metric
for non-time-sensitive reporting, generate detailed reports using logs processing systems
logs will nearly always produce more accurate data than metrics
Alerting
Alerts give timely awareness to problems in your cloud application so you can resolve the problems quickly.
Set up monitoring
conditions are continuously monitored
monitoring can track SLOs
can look for missing metric
can watch thresholds
Track metrics over time
track if condition persists for given amount of time
time window (due to technical constraints) must be less than 24 hrs
Notify when condition is passed
incident created and displayed
alerts can be sent via
Email
SMS text message
apps, e.g., Slack
Cloud Pub/Sub
Error budget burn rate formula
(100% - SLO) × (events over set time)
example: SLO = 98%, so error budget = 100% - 98% = 2% = 0.02
12,000 events / 30 days = 400 events/day
0.02 × 400 = 8 allowed failures/day
8 / 24 ≈ 0.33 allowed failures/hour
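A Python version of that burn-rate baseline; the event volume and SLO are the example numbers above:

```python
def allowed_failures_per_hour(slo: float, events: int, days: int) -> float:
    """Baseline failure rate that exactly consumes the error budget."""
    budget = 1.0 - slo                     # e.g. 1 - 0.98 = 0.02
    events_per_hour = events / (days * 24)
    return budget * events_per_hour

# 12,000 events over 30 days at a 98% SLO:
print(round(allowed_failures_per_hour(0.98, 12_000, 30), 2))  # ~0.33 failures/hour
```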
Slow burn alerting policy
warns that rate of consumption could exhaust error budget before end of compliance period
less urgent than fast-burn condition
requires longer lookback period (24 hour max)
threshold should be slightly higher than baseline
Fast burn alerting policy
warns of sudden, large change in consumption that, if uncorrected, will exhaust the error budget quickly
shorter lookback period (e.g., 1-2 hours or less)
set threshold higher than baseline (e.g., 10x; too low results in false positives)
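The two policies differ mainly in lookback window and threshold multiplier; here is a sketch with illustrative values (tune both for your own service):

```python
# Illustrative parameters only, not Google-recommended defaults.
POLICIES = {
    # slow burn: long lookback (24 hr max), threshold slightly above baseline
    "slow_burn": {"lookback_hours": 24, "threshold_multiplier": 1.5},
    # fast burn: short lookback, threshold well above baseline (e.g. 10x)
    "fast_burn": {"lookback_hours": 1, "threshold_multiplier": 10.0},
}

def should_alert(observed_failures_per_hour: float,
                 baseline_failures_per_hour: float,
                 policy: str) -> bool:
    """Fire if the burn rate over the lookback window exceeds the policy threshold."""
    limit = baseline_failures_per_hour * POLICIES[policy]["threshold_multiplier"]
    return observed_failures_per_hour > limit

print(should_alert(4.0, 0.33, "fast_burn"))  # True  (limit 3.3 fails/hr)
print(should_alert(0.4, 0.33, "slow_burn"))  # False (limit ~0.5 fails/hr)
```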
Establishing an SLO policy
select SLO to monitor
construct a condition for alerting policy
identify notification channel
provide documentation
create alerting policy
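A minimal sketch of those five steps using the google-cloud-monitoring Python client (v3). The project, metric filter, threshold, and channel ID are placeholder assumptions, and field names can differ slightly between library versions:

```python
from google.cloud import monitoring_v3

project = "projects/my-project-id"  # hypothetical project
client = monitoring_v3.AlertPolicyServiceClient()

policy = monitoring_v3.AlertPolicy(
    display_name="Homepage latency SLO burn",  # step 1: the SLO to monitor
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.AND,
    conditions=[  # step 2: condition for the alerting policy
        monitoring_v3.AlertPolicy.Condition(
            display_name="Burn rate above baseline",
            condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                # placeholder filter; point it at the metric backing your SLI
                filter='metric.type="loadbalancing.googleapis.com/https/backend_latencies"',
                comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                threshold_value=300.0,      # ms, from the example SLI
                duration={"seconds": 300},  # must hold for 5 minutes
            ),
        )
    ],
    # step 3: notification channel (hypothetical ID)
    notification_channels=[f"{project}/notificationChannels/CHANNEL_ID"],
    # step 4: documentation sent with the alert
    documentation=monitoring_v3.AlertPolicy.Documentation(
        content="Runbook: check recent releases, then the burn-rate dashboard.",
        mime_type="text/markdown",
    ),
)

# step 5: create the alerting policy
created = client.create_alert_policy(name=project, alert_policy=policy)
print(created.name)
```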
SRE tools
DevTools
Kubernetes Engine - managed, production-ready environment for running containerized applications
Container Registry - single place for team to securely manage Docker images, used by Kubernetes
Cloud Build - service that executes your builds in a series of steps where each step is run in a Docker container
Cloud Source Repositories - fully managed private Git repos with integrations for CI, delivery, and deployment
Spinnaker for GCP - integrates Spinnaker with other GCP services, allowing you to extend your CI/CD pipeline
Cloud Monitoring - visibility into the performance, uptime, and overall health of cloud-powered applications
Cloud Logging - allows you to store, search, analyze, monitor, and alert on log data and events from Google Cloud
Cloud Debugger - lets you inspect state of a running app in real time, without stopping or slowing it down
Cloud Trace - distributed tracing system that collects latency data from apps and displays in console
Cloud Profiler - continuously gathers CPU and memory allocation information from your production applications
Incidents
Handling incident response
Good example
initiates protocol and appoints an incident commander
incident commander works with one operations team and passes control at end of day if the issue persists
stakeholders are in the loop from the start, so public and departmental responses can be coordinated
freelancing agents (working outside the protocol) are not wanted
Management characteristics
separation of responsibilities
specific roles should be designated to team members, each with full autonomy in their role
roles should include
incident commander
operational team
communication lead
planning lead
established command post
post could be physical location or a comm venue, such as a Slack Channel
live incident state document
shared document that reflects the current state of the incident, updated as necessary and retained for the postmortem
clear, real-time handoff
if the day is ending and the issue remains unresolved, an explicit handoff to another incident commander must take place
3 questions to determine an incident
is another team needed
outage visible to users
unresolved after an hour
"Yes" to any of above is an incident.
Best practices:
develop and document procedures
prioritize: limit damage and restore service
trust team members in specified roles
if overwhelmed, get help
consider response alternatives
practice procedure routinely
rotate roles among team members
Managing service lifecycle
Architecture and Design
integrate best practices for dev team
recommend best infrastructure systems
co-design part of service with dev team
avoid costly re-designs
Active Development
SRE begins productionizing the service
planning for capacity
adding resources for redundancy
planning for spikes and overloads
implementing load balancing
adding monitoring, alerting and performance tuning
Limited availability
measure performance as usage increases
evaluate reliability
define SLOs
build capacity models
establish incident responses, shared with dev team
General availability
Production Readiness Review (PRR) passed
SREs handle the majority of ops work
incident responses
track operational load and SLOs
Deprecation
SREs operate existing system
support transition with dev team
work with dev team on designing new system; adjust staffing accordingly
"SRE principles aim to maximize the engineering velocity of developer teams while keeping products reliable." - SRE Workbook
Postmortem
Agenda
get metadata
recreate timeline
generate report
"No blame"!!!
Production meeting collaboration
upcoming production changes (near-term horizon visibility)
metrics (review current SLOs)
outages (summary of postmortem or update on status)
paging events (tactical view of pages and details that followed [valid or not])
non-paging events (what events didn't get paged that should have)