## welcome ##

Jason Dixon:

* this monitorama is 2x the size of last year & berlin
* conference buddies: if you see someone with a heart sticker, introduce yourself to them
* everyone give a high five or free hug
* why do this? this isn't a ruby conference
* empathy and culture are important, especially for ops
* between engineers, ops, and management
* and for the community here
* share the love
* sponsors are great bla bla
* breaks and lunch bla bla

## Please, no More Minutes, Milliseconds, Monoliths... or Monitoring Tools! - Adrian Cockcroft ##

http://www.slideshare.net/adriancockcroft/monitorama-please-no-more

* keynote
* formerly of netflix
* graph of enterprise IT cloud adoption
* from left to right: ignore, ignore, ignore, no, no, I said No dammit, oh no, oh fuck
* rest of world = halfway through cloud adoption
* you are here = trying to play catch up

20 years exp:

* 94 "SE Toolkit"
* 98 Sun Perf. Tuning
* 99 Resource Mgmt.
* 00 Capacity Planning for Web Services
* 07 Outstanding Contrib. to Computer Metrics
* 04-08 Capacity Planning Workshops
* 14 Monitorama!

state of the art in 2008:

* cacti, ganglia, nagios, zenoss, mrtg, Wireshark
* low number of machines
* it was subversive to think that open source could replace expensive enterprise tools
* created "SE", a C interpreter which could extract solaris performance information and output it all in a standard format
* created "virtual adrian", a simple rule-based system for automated monitoring of disk, memory, etc. in solaris (to watch systems while he was on vacation)

why no more monitoring tools?

* we have too many
* we need more analysis tools, can we get an analysorama conference?
* rule #1: we spend too much time collecting, storing, and displaying metrics
* if you spend 50% of your time on this, it's too much
* we need more automation, more analysis
* monitoring should not be tacked on, it should be a default

what's wrong with minutes?

* not enough resolution to catch problems
* it takes 5-8 minutes before you start seeing alerts
* if you had second resolution, you could see the difference in 5 seconds
* if your rollbacks are quick, you can revert a bad change in 5 seconds
* compare a 10 second outage to a 10 minute outage
* from continuous delivery we know that small incremental changes are best
* so we need the same from monitoring
* instant detection and rollback within seconds should be a goal
* SaaS tools that do this: VividCortex, boundary
* how does netflix do it? hystrix and turbine, websockets, streaming metrics, 1 second resolution & 15 seconds of history, circuit breakers, pages go to whoever is directly responsible for a specific component or change
* rule #2: metric collection -> display latency should be < human attention span (10s)

what's wrong with milliseconds?

* in a lot of JVM instrumentation, ms is the standard
* the problem with ms is that a lot of datacenter and hardware communication needs nanosecond resolution
* rule #3: validate that your measurement system has enough accuracy and precision
* if there's a difference between something taking X and Y nanoseconds in your system, and all you have are a bunch of 1ms data points, you can't identify the problem

what's wrong with monoliths?

* monolithic monitoring tools are easy to deploy, but when they go down, you have no monitoring
* there needs to be a pool of aggregators, displayers, etc.
* easier to do upgrades, more resilient to downtime
* anything monolithic has performance problems, scalability problems, and SPOFs, and you can't tell the difference between the monitoring system going down vs. the actual system going down
* in-band monitoring: running monitoring on the same process, server, data center, etc. as the system itself
* SaaS monitoring: send to a third party
* both: an outage can't take out both monitoring systems, HA monitoring
* they might not be monitoring exactly the same stuff, but they should have some overlap
* rule #4: monitoring needs to be at least as available & scalable as the underlying system

continuous delivery:

* high rate of change
* new machines being spun up and shut down all the time (in netflix's case)
* short baselines for alert threshold analysis
* ephemeral configuration
* short lifetimes make it hard to aggregate historical data
* hand-tweaked solutions do not work, it would take too much effort

microservices:

* complex flow of requests
* how do you monitor end-to-end when the dependencies and flow of requests are so complex and dynamic?
* Gilt Groupe: went from a handful of services to 450 services over the course of a year
* "death star" microservice pattern: everything is calling everything else in one big tangled graph of dependencies
* how do you visualize this? we need more hierarchy & grouping

closed loop control systems:

* how did netflix do autoscaling?
* on every deploy during peak time, double the number of servers
* using load average, which is not the best metric to use
* lots of overshoots
* new solution: scryer
* predictive autoscaler, FFT-based algorithm, builds a forward-predicted model to set the autoscale level
* scales ahead of time, then corrects as necessary
* using the old method it was hard to do this analysis, because the data was so chunky (from the doubling)

code canaries:

* ramp up the deployment, look for errors; if there are problems it emails the responsible team and stops rolling out the code (a rough sketch of this comparison appears after the five rules below)

monitoring tools for developers:

* most monitoring tools are built for ops / sysadmins (DBA vs. network admin vs. sysadmin vs. storage admin)
* fiefdoms of different teams and tools, different levels of access, hard to collaborate, hard to integrate and extend
* state of the art is to move towards APM, analytics, integrated tools for all teams
* deep linking & embedding, extensible tools
* business transactions, response time, runtime (e.g. JVM) metrics

challenges with dynamic ephemeral cloud apps:

* dedicated hardware: arrives infrequently, disappears infrequently, sticks around for years, unique IPs and MAC addresses
* cloud assets: arrive in bursts, stick around for a few hours, recycle the IPs and MACs of machines that were just shut down!
* in the cloud model, you need to have a historical record of everything that ever happened in your infrastructure (Netflix Edda)

traditional arch:

* business logic
* DB master & slave
* some fabric in between
* storage

new cloud systems:

* business logic
* NoSQL nodes
* cloud object store
* not all hosted cloud services have detailed monitoring / metrics exposed
* you depend on web services to integrate with cloud services
* they span zones & regions, so monitoring now needs to span zones & regions too
* NoSQL introduces new failure modes

5 rules:

* 1. analysis > collection
* 2. key business metric monitoring should be second resolution
* 3. precision and accuracy -> more confidence
* 4. monitoring must be more scalable than the underlying system
* 5. start building distributed, ephemeral cloud native applications
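The canary analysis described above boils down to comparing a small pool of instances running the new code against a matched pool running the old code, and halting the rollout when a key metric regresses. A minimal sketch of that comparison follows; the metric names, the 20% regression cutoff, and the sample values are made-up placeholders, and this is not Netflix's actual canary tooling.

```python
# Toy canary comparison (not Netflix's canary analyzer): compare a pool of
# freshly launched instances running the new code against an equally fresh
# pool running the old code, and stop the rollout on any regression.
from statistics import mean

METRICS = ("error_rate", "p99_ms", "cpu_pct")   # illustrative metric names

def canary_regressions(baseline, canary, max_ratio=1.2):
    """Return the metrics on which the canary pool is >20% worse than baseline."""
    bad = []
    for metric in METRICS:
        base = mean(host[metric] for host in baseline)
        new = mean(host[metric] for host in canary)
        if base > 0 and new / base > max_ratio:
            bad.append(metric)
    return bad

# three old-code and three new-code instances, per the Q&A suggestion below
baseline_pool = [{"error_rate": 0.011, "p99_ms": 180, "cpu_pct": 52},
                 {"error_rate": 0.009, "p99_ms": 175, "cpu_pct": 55},
                 {"error_rate": 0.010, "p99_ms": 190, "cpu_pct": 50}]
canary_pool   = [{"error_rate": 0.031, "p99_ms": 260, "cpu_pct": 58},
                 {"error_rate": 0.027, "p99_ms": 240, "cpu_pct": 61},
                 {"error_rate": 0.035, "p99_ms": 255, "cpu_pct": 57}]

regressions = canary_regressions(baseline_pool, canary_pool)
if regressions:
    print("stop rollout, notify owning team:", regressions)   # error_rate, p99_ms
else:
    print("keep ramping up the deployment")
```

As the Q&A below notes, the baseline pool should also be freshly spun up, so both pools see the same warm-up effects.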
Q&A:

* Q: you mentioned better visualization for microservices, like what?
* A: a user hits the homepage -> what services are hit? there is no arch. diagram anymore; part of the viz. involves seeing which zones and regions are hit, manual tagging & hierarchy of components, owners, etc. it's useful, for instance, to limit to just the services my team owns or depends on, an aspect-oriented view, but it's not a solved problem; most OSS monitoring tools have good backends but less good UIs, cloudweaver looks interesting
* Q: canary system, what types of checks are you running?
* A: error rate, CPU time, response time, jmeter functional tests, business metrics, and you need to do the comparison on freshly spun up nodes (e.g. 3 old vs. 3 new copies of the code on freshly spun up machines)

## Computers are a Sadness, I am the Cure - James Mickens ##

* (this talk was just entertainment, no practical information)
* i'm here to take you on a quest
* everything i'm going to tell you is 100% true
* bla bla
* distributed systems send messages back and forth
* most messages fail because god hates us
* so we send more
* 10 years ago the MapReduce paper was like alien technology
* it was so simple and seductive, you just specified a map and a reduce function, ran it on commodity machines, it was amazing
* that was 10 years ago
* let's stop talking about MapReduce
* say "word count" one more time
* let's also stop talking about "the cloud"
* the problem with all this social cloud stuff is that i hate most people
* there are two kinds of people: people who have actually built cloud software, and others
* others: cloud is great!, 99.9999999%!, everyone is happy, everything is a solved problem!
* real cloud people: it's a nightmare, hardware fails, SLAs are misleading, IO is queued up, packets get sent to a black hole, it's madness
* why does anything happen at all in the cloud?
* it's like an old-timey map with dragons in the middle
* this is why we need monitoring & analysis
* a message of hope: give up
* look at the CAP theorem, you can't have it all
* if your email goes down, then your reaction should be to want to use email less, go do something else
* can't take your test at your MOOC? take it later, your MOOC degree will be just as worthless
* let's be serious though
* some things we do need to care about
* (nosql rant i didn't fully write down; nosql = bane from batman, throw out all the rules and laws, chaos)
* conventional wisdom: america needs more programmers
* reality: we need fewer programmers
* technology is not the future, no more stupid apps, painting is the future, go do that, leave me alone
* if you are a VC who funds this kind of stuff, i hope you become poor
* let's be serious about security
* threat model: mossad or not-mossad
* either you are being attacked by mossad or you're not
* "not attacked by mossad" = where you want to be, just keep using strong passwords and don't click on weird links
* "you are being attacked by mossad" = no defenses, you're going to die
* america's mental model of the CIA, FBI, etc. is that they are a bunch of boy scouts
* in reality: drones, exoskeletons, cable-splicing submarines
* they're not going to send boy scouts, they're not going to fight close-range musket battles, they're going to use their advantage of having access to all the infrastructure you depend on
* how do you defend against that with rocks and pencils and leaves?
* easy attacks are easy
* "Mary" from "Central University" working as a "Recruiter" with an attractive profile picture wants to be my friend on Facebook
* obviously i don't know mary
* BUT WHAT IF I DO KNOW MARY
* most important goal in security: eliminate men as a gender
* possible solution: dude overflow detected -> trigger bear trap and the guy from the SAW movie

summary:

* ozzy osbourne crazy train = cloud computing
* bane = nosql
* bla bla

Q&A:

* Q: can i be your friend on facebook?
* A: there is a background check, and i will wait 2-3 days to show i'm not desperate, but i encourage you to submit an application, i love judging people

## Simple math to get some signal out of your noisy sea of data - Toufic Boubez ##

* i lied! there are no simple tricks
* too good to be true = it probably is
* background:
    * CTO Metafor Software
    * CTO Layer 7 Technologies
    * CTO Saffron Technologies
* let's start with the "Wall of Charts"
* hire a new guy: shove him in front of the wall of charts
* we collect 1000s of metrics, pick 10, and put them in a dashboard
* this is meaningless
* WoC leads to alert fatigue
* alert fatigue is one of the largest problems in ops
* watching WoCs cannot scale
* at some point, you will need a person or a team dedicated to watching the WoCs
* so we need to turn this work over to the machines
* to the rescue: anomaly detection
* definition: detect events or patterns which do not match expectation
* definition for devops: alert when one of our graphs starts looking wonky
* who else is doing anomaly detection?
* manufacturing QC has been doing this for a long time
* measure the diameter, weight, etc. of the flux capacitors and throw the outliers away
* assumptions: normal, gaussian distribution; data is "stationary", it doesn't change much over time
* the "three-sigma rule": 68% of the values lie within 1 std dev of the mean, 95% lie within 2, 99.7% lie within 3
* mark those percentages as the "red lines" on the graphs and take action when a value falls outside of a red line
* if you implement 3-sigma rule alerts in the data center:
    * a. you get alerted all the time, or
    * b. you don't get alerted when there's a real problem
* the assumptions from manufacturing (gaussian, stationary) don't apply to the data center
* static thresholds are ineffective
* if data is moving, we need a moving threshold, that's a smart idea
* the "big idea" of moving averages: the next value should be consistent with the recent trend
* finite window of past values, ignore the whole history
* calculate a predicted value
* "smoothed" version of the time series
* compare squared error rates between smoothed vs. raw data
* now you can compute the 3-sigma values based on that smoothed data (see the sketch below)
* what about spikes, outliers, etc.? windows can be skewed
* ok, now we use a weighted moving average, with less weight on data that is further away
* not good enough, doesn't handle trends -> exponential smoothing
* double exponential smoothing (DES)
* triple exponential smoothing (TES)
* Holt-Winters (seasonal effects)
* result:
    * a. you are woken up a lot less, but still woken up
    * b. it still doesn't catch some problems
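To make the "moving threshold" idea concrete, here is a minimal sketch (not code from the talk) that smooths a series with an exponentially weighted moving average and alerts when a point falls more than three standard deviations of the recent residuals away from the prediction. The alpha, the residual window, and the sample series are arbitrary choices for illustration.

```python
# Minimal sketch of a moving 3-sigma band: smooth the series with an
# exponentially weighted moving average (EWMA) and alert when a new point
# lands more than 3 standard deviations of recent errors from the prediction.
# The alpha and residual window are arbitrary, not values from the talk.
from statistics import stdev

def ewma_three_sigma_alerts(series, alpha=0.3, window=30):
    smoothed = series[0]
    residuals = []                  # recent (actual - predicted) errors
    alerts = []
    for t, value in enumerate(series[1:], start=1):
        predicted = smoothed
        error = value - predicted
        if len(residuals) >= 10 and abs(error) > 3 * stdev(residuals):
            alerts.append((t, value, predicted))
        residuals = (residuals + [error])[-window:]        # finite window, forget old history
        smoothed = alpha * value + (1 - alpha) * smoothed   # update the EWMA
    return alerts

# a gently wiggling series with one obvious spike at t=50
series = [100 + (i % 5) for i in range(100)]
series[50] = 160
print(ewma_three_sigma_alerts(series))   # -> [(50, 160, ...)], only the spike fires
```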
* are we doomed?
* no
* smoothing works on certain kinds of data
* smoothing works when deviations are normally distributed
* there are lots of non-gaussian techniques, we're only going to scratch the surface in this talk
* trick #1: histograms
* (better: kernel densities, but histograms work and are simple)
* if you have a bunch of different time series of the same metric, build a histogram for each series
* start by looking at the distribution of your data, understand what it looks like before you start your analysis
* trick #2: kolmogorov-smirnov test
* it sounds cool and it works
* compares two probability distributions
* requires no assumptions about the underlying distribution
* measures the max distance between two cumulative distributions
* good for comparing day-to-day, week-to-week, seasonal effects
* "are these two series similar or not?"
* KS with windowing
* example: KS for week 1 vs. week 2 and week 2 vs. week 3 (where week 3 is during christmas and we experienced a problem)
* 1 vs. 2: small distance
* 2 vs. 3: huge distance
* the case where the 3-sigma static threshold failed is now extremely clear with KS
* trick #3: diffing / derivatives
* often when your data is not stationary, the derivative is
* e.g. random walks
* most frequently, the first difference is sufficient: dS(t) = S(t+1) - S(t)
* once you have a stationary data set, gaussian techniques work better
* real example: CPU time
* the distribution is totally non-gaussian, very noisy and random looking
* but... take the first difference, and it totally is gaussian!
* you're not doomed if you know your data
* understand the statistical properties of your data
* data center data is typically non-gaussian
* so don't use smoothing
* use histograms, KD, and derivatives instead (a sketch of tricks #2 and #3 follows the Q&A below)

Q&A:

* Q: is your point to make everything gaussian?
* A: no! sorry if i conveyed this message, KS does not involve gaussian, there are lots of good non-gaussian techniques
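A small sketch of tricks #2 and #3 using numpy and scipy; this is not the speaker's code, and the data is synthetic, but it shows the shape of both techniques: a two-sample Kolmogorov-Smirnov test to ask whether two windows of the same metric look alike, and a first difference to turn a non-stationary random walk into something stationary.

```python
# Sketch of tricks #2 and #3 (illustrative, not from the talk): a two-sample
# KS test over weekly windows, and first differencing of a random walk.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# --- trick #2: Kolmogorov-Smirnov with windowing ---------------------------
week1 = rng.normal(loc=100, scale=10, size=7 * 24)   # hourly values, a normal week
week2 = rng.normal(loc=100, scale=10, size=7 * 24)   # another normal week
week3 = rng.normal(loc=140, scale=25, size=7 * 24)   # the "christmas" week with a problem

stat_12, p_12 = ks_2samp(week1, week2)
stat_23, p_23 = ks_2samp(week2, week3)
print(f"week1 vs week2: KS distance {stat_12:.2f}")  # small distance: similar weeks
print(f"week2 vs week3: KS distance {stat_23:.2f}")  # large distance: something changed

# --- trick #3: first difference of a non-stationary series -----------------
random_walk = np.cumsum(rng.normal(size=1000))       # non-stationary, noisy and wandering
first_diff = np.diff(random_walk)                    # dS(t) = S(t+1) - S(t), stationary
print("raw std:   ", np.std(random_walk).round(2))
print("diffed std:", np.std(first_diff).round(2))    # gaussian-friendly again
```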
## The Care and Feeding of Monitoring - Katherine Daniels ##

* a story
* pagerduty tells us our site is down
* so we checked, and it was down
* then... a minute later, it's back
* hmm. ok.
* then... a few minutes later
* down again
* and up again
* this is... The Blip, a randomly occurring outage that fixes itself
* so what's happening?
* 500 rate... nothing
* API errors... nothing
* error rate... nothing
* what are we missing from our monitoring?
* monitor all the things!
* we're missing something, just start randomly adding metrics until we find it
* then you get... this...
* zenoss screenshot that's all red from down checks
* we're trying to find a needle in a haystack and just added more hay
* this is why you don't do a full-body diagnostic scan for medical patients: the more you look for, the more you might find, and they might not all be actual issues
* so, we need to monitor only some of the things...
* first looked at the load balancers, because everything dropped out of the LB at once
* tried provisioning a new ELB, switching availability zones
* looked at access logs
* everything worked the same, still getting the blip
* how about the healthcheck?
* the healthcheck was hitting something called "healthD", a healthcheck service that failed when one or both of two important backend components went down
* and there weren't any logs or monitoring for healthD itself
* looking inside healthD showed that one of the two services, api2, had a problem
* it seems a certain misbehaving user was triggering bad requests
* so we went into api2 and added metrics per response type
* found the response type that stood out
* decreased timeouts from 60 seconds to 5 seconds
* optimized some slow queries
* deleted some old slow / unused API methods
* now the site was back to normal

why didn't we have monitoring for this?

* 1. black boxes, mysteries
* any X-as-a-Service that you depend on (e.g. ELBs) is a black box and needs some special care for monitoring
* 2. technical debt / bad technical decisions
* why did the healthcheck require both services to be up?
* why did we even have two separate APIs?
* long ago someone decided to do a rewrite, but the old system remained
* we can only move forward at this point, we can't shut down either system, so we need to monitor both

what to monitor:

* monitor all services
* monitor responsiveness (network, API, web server)
* system metrics (memory used, CPU used, disk space)
* application metrics (read lock time, write lock time, error rate, API response time)
* don't get into a situation where you have to say "oh yeah that check is red but it's OK, don't worry"
* as someone mentioned earlier, your monitoring needs to scale above your application
* load test your monitoring, make sure it can keep up and responds properly under increased load
* monitoring should not be a silo, it shouldn't be just an ops problem
* monitoring should be built into the application from the beginning
* work with developers
* ask: "what does it mean for this application to work properly? what does it look like when it breaks?"
* monitoring shouldn't be a reactive last-minute thing

## Car Alarms and Smoke Alarms - Dan Slimmon ##

* Sr. Platform Engineer at Exosite, which does internet of things
* we recently made a better mousetrap that texts you when it goes off, so if you have a building full of mousetraps you only need to check the one that was tripped
* we wear many hats in ops
* but data science is becoming a very important hat
* people believe you when you have graphs
* signal-to-noise ratio
* example: plagiarism detection
* let's say we make a system that has a 90% chance of flagging a plagiarized paper
* and a 20% chance of flagging a paper that wasn't plagiarized (a false positive)
* and 30% of kids currently plagiarize

some questions:

* 1. given a random paper, what's the probability you get a negative result?
    * 59%
* 2. what's the probability that the system will catch a plagiarized paper?
    * 90%, duh, we already knew that, why'd i ask you that?
* 3. if you get a positive result, what's the probability the paper really was plagiarized?
    * 65.8%
* this is an unintuitively terrible result
* we originally heard 90% chance
* but now in the real world it's down to 65.8%, that's pretty useless (these numbers are worked out in the sketch below)
* sensitivity and specificity
* sensitivity: % of actual positives that are identified as such
* specificity: % of actual negatives that are identified as such
* high sensitivity: freaks the fuck out when anything might be considered slightly bad
* high specificity: if it says you cheated, sorry, you definitely cheated
* here's the graph if you want to look at it again: http://imgur.com/LkxcxLt.png
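A quick worked check of those plagiarism numbers, assuming the "20%" above is the false-positive rate (the chance of flagging a paper that wasn't plagiarized); the prior, sensitivity, and false-positive rate are the only inputs.

```python
# Worked version of the plagiarism-detector numbers, assuming 90% sensitivity,
# a 20% false-positive rate, and a 30% base rate of plagiarism.
p_plag = 0.30          # prior: fraction of papers that are plagiarized
sensitivity = 0.90     # P(positive | plagiarized)
fp_rate = 0.20         # P(positive | not plagiarized)  (assumed reading of the "20%")

p_negative = p_plag * (1 - sensitivity) + (1 - p_plag) * (1 - fp_rate)
print(f"Q1: P(negative result)        = {p_negative:.0%}")   # 59%

print(f"Q2: P(catch a plagiarist)     = {sensitivity:.0%}")  # 90%, by definition

p_positive = p_plag * sensitivity + (1 - p_plag) * fp_rate
ppv = (p_plag * sensitivity) / p_positive                     # Bayes' rule
print(f"Q3: P(plagiarized | positive) = {ppv:.2%}")          # 65.85%, the ~65.8% from the talk
```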
* how does this relate to ops?
* positive predictive value (PPV) is the probability that, when you get paged, something is actually wrong
* consider your service has 99.9% uptime, and your check is 99% accurate
* that sounds pretty good, right?
* P(TP) = P(actually down) × P(alert | down) = 0.1% × 99% ≈ 0.1%
* P(FP) = P(actually up) × P(alert | up) = 99.9% × 1% ≈ 1%
* PPV = P(TP) / (P(TP) + P(FP)) ≈ 9.1%
* if you get paged, you only have about a 1 in 10 chance that something is actually wrong
* that's horrible
* car alarms
* when you hear a car alarm, is your immediate reaction to run and check to make sure everything is ok?
* the majority of car alarms sounding don't indicate a problem, they go off all the time for no reason
* they have low specificity, high sensitivity
* smoke alarms
* when you hear a smoke alarm in a building, you don't have the same reaction
* you don't sit around and say "do you guys smell smoke? i think i'm just gonna wait here"
* you get out of the building and wait for the fire department to give the OK
* why do we have such noisy checks?
* undetected outages are embarrassing, so we focus on sensitivity
* that's a normal, good reaction to have
* but understand the relation between the alert threshold and PPV
* looser threshold = less alerting, higher PPV, more uninterrupted sleep (but a chance you'll miss a real problem)
* strict threshold = more alerting, lower PPV, more false positives
* sensitivity / specificity don't need to be competing concerns
* instead of a line, you need a surface
* hysteresis is a great way to get these additional degrees of freedom (see the sketch after this section)
* state machines
* time series analysis (like mentioned earlier: smoothing, histograms, derivatives, etc.)
* as your data changes (e.g. your service becomes more or less reliable) or your checks become more reliable
* your sensitivity & specificity will change too, sometimes wildly, so you can't just set it once and forget about it
* a lot of nagios configs conflate the detection vs. identification of a problem
* for example, say you have these 4 checks for your website:
    * 1. apache process count
    * 2. swap usage
    * 3. site responding to HTTP
    * 4. requests per second
* "your alerting should only tell you whether work is getting done"
* if your site is still up, but apache isn't running, that's great news! (haha)
* so cross off #1 and #2
* and #3 and #4 can be combined into one check: if your RPS is good, then it must be responding
* here's a tool that i want: something like nagios that monitors services instead of hosts
* when a service is down, only then do you kick off a bunch of host-level diagnostics
* if the tool was aware of these SNR concepts (specificity, etc.), and had some built-in knobs to tune, that would be even better
* other useful stuff:
    * bischeck
    * see links in slides
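The hysteresis / state-machine idea can be sketched in a few lines; this is illustrative, not a real tool or anything shown in the talk. The idea: require several consecutive bad samples before paging and several consecutive good samples before clearing, so one noisy sample can't flip the alert state.

```python
# Tiny hysteresis sketch (hypothetical, not a real tool): an alert state machine
# that needs N consecutive bad samples to fire and M consecutive good samples to
# clear, instead of paging on every threshold crossing.
class HysteresisAlert:
    def __init__(self, threshold, bad_needed=3, good_needed=5):
        self.threshold = threshold      # e.g. minimum acceptable requests/sec
        self.bad_needed = bad_needed
        self.good_needed = good_needed
        self.bad_streak = 0
        self.good_streak = 0
        self.alerting = False

    def observe(self, requests_per_sec):
        """Feed one sample; returns True while the check is in the alerting state."""
        if requests_per_sec < self.threshold:
            self.bad_streak += 1
            self.good_streak = 0
            if not self.alerting and self.bad_streak >= self.bad_needed:
                self.alerting = True    # page: work has stopped getting done
        else:
            self.good_streak += 1
            self.bad_streak = 0
            if self.alerting and self.good_streak >= self.good_needed:
                self.alerting = False   # recover: sustained healthy traffic
        return self.alerting

check = HysteresisAlert(threshold=100)
samples = [120, 95, 130, 80, 70, 60, 110, 105, 120, 115, 125]
print([check.observe(s) for s in samples])
# one blip (95) doesn't page; three consecutive bad samples (80, 70, 60) do
```

The two streak lengths are exactly the kind of extra knobs mentioned above: they trade sensitivity (how fast you catch a real outage) against specificity (how often a blip wakes you up).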
Q&A:

* Q: is it foolish to tweak these knobs manually? shouldn't this be automated?
* A: i haven't found anything to automate this yet, manually tweaking is the only way i've found so far

## Metrics 2.0 - Dieter Plaetinck ##

* works at vimeo
* video transcoding & storage
* lots of metrics, lots of graphite
* when a user uploads, it first runs a few checks to determine which data center to route your upload to
* graphite is used to make a feedback loop to make sure that kind of automated system is working properly
* but this talk is going to be about problems, mostly with graphite
* a timeseries looks like this: (unixtime, value)
* timeseries are labelled like "mysql.database1.queries_per_second"
* it is difficult to navigate the hierarchies
* it is difficult to find how and why a metric is being generated
* timeseries don't have units, and they don't describe their behavior (e.g. semantics like which time period they cover)
* unclear, inconsistent formats
* metrics are tightly coupled to the source and lack context
* one metric name can have multiple meanings
* complexity = lots of sources, lots of people, multiple aggregators
* it's a time sink
* everything has to be done explicitly, even when this data could be determined implicitly (units, legend, axes, titles, etc.)
* in graphite, different subtrees may contain the same types of data, which makes it hard to compare across the hierarchy
* as you gather more metrics, these problems get worse
* metrics 2.0 tries to solve these problems
* metrics have a self-describing format (a toy illustration appears after the query examples at the end of these notes)
* compare graphite:
    * stats.timers.dfs5.proxy_server.object.GET.200.timing.upper_90
* to metrics 2.0:
    * { server: dfvimeodfsproxy5, http_method: GET, http_code: 200, unit: ms, metric_type: gauge, stat: upper_90, swift_type: object }
* metrics 2.0 allows you to use more characters to label your metrics (e.g. "/" for "Req/s")
* metrics 2.0 allows you to add extra metadata to your metrics
* for example, src/from parameters, so you can track where a metric is being submitted from
* conceptual model -> wire protocol (compatible with graphite/statsd/carbon) -> storage
* metrics20.org
* units are extremely useful:
    * MB/s, Err/d, Req/h, ...
    * B Err Warn Conn Job File Req ...
* we allow you to use SI + IEEE standard units
* easier to learn, more flexible

Carbon-tagger:

* middleware between the old graphite instance and the new metrics 2.0 instance
* adapts the old format to the new format (adding metadata, units, etc.)

Statsdaemon:

* similar to etsy statsd, drop-in compatible
* if you send a bunch of bytes (B) over time, it automatically figures out this is B/s
* if you send a bunch of milliseconds (ms) over time, it automatically calculates percentiles/min/max/mean/etc.

Graph-Explorer:

* dashboard system with a new query syntax

New query syntax:

* proxy-server swift server:regex unit=ms
* automatically does group-by based on metadata
* automatic legends, axes, tagging (these are all manual in graphite)
* stat=upper_90
* from datetime to datetime
* avg over (5M, 1h, 1d, ...)

Some examples:

* Which is slower, PUT or GET?
    * stack ... http_method:(PUT|GET) swift_type=object
* Show http performance per server:
    * http_method:(PUT|GET) group by unit, server
* grab all job stats (note how no timeseries names are explicitly given, this finds all timeseries that have a unit of "Jobs/second"):
    * transcode unit=Job/s avg over
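To illustrate why self-describing, tag-based metrics make this kind of querying easier, here is a toy sketch; it is not the metrics 2.0 wire protocol or Graph-Explorer's implementation, and the series and tag values are invented. With tags, "everything with unit=Job/s" or "group by server" becomes a filter over metadata instead of pattern-matching positions in dotted names.

```python
# Toy illustration of tag-based (metrics 2.0 style) series selection, not the
# real wire protocol or Graph-Explorer: each series is just a dict of tags, so
# queries filter and group on metadata instead of dotted-name positions.
from collections import defaultdict

series = [
    {"server": "dfvimeodfsproxy5", "http_method": "GET", "http_code": "200",
     "unit": "ms", "stat": "upper_90", "swift_type": "object"},
    {"server": "dfvimeodfsproxy6", "http_method": "PUT", "http_code": "201",
     "unit": "ms", "stat": "upper_90", "swift_type": "object"},
    {"server": "transcoder1", "service": "transcode", "unit": "Job/s"},
    {"server": "transcoder2", "service": "transcode", "unit": "Job/s"},
]

def select(metrics, **tags):
    """Return every series whose tags match, e.g. select(series, unit='Job/s')."""
    return [m for m in metrics if all(m.get(k) == v for k, v in tags.items())]

def group_by(metrics, key):
    """Group matching series by a tag, roughly what 'group by server' would do."""
    groups = defaultdict(list)
    for m in metrics:
        groups[m.get(key, "unknown")].append(m)
    return dict(groups)

print(select(series, unit="Job/s"))                   # all job-rate series, no names needed
print(group_by(select(series, unit="ms"), "server"))  # http timings grouped per server
```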