## welcome ##

Jason Dixon:

* this monitorama is 2x the size of last year & berlin
* conference buddies: if you see someone with a heart sticker, introduce yourself to them
* everyone give a high five or free hug
* why do this? this isn't a ruby conference
* empathy and culture are important, especially for ops
* between engineers, ops, and management
* and for the community here
* share the love
* sponsors are great bla bla
* breaks and lunch bla bla

## Please, no More Minutes, Milliseconds, Monoliths... or Monitoring Tools! - Adrian Cockcroft ##

http://www.slideshare.net/adriancockcroft/monitorama-please-no-more

* keynote
* formerly of netflix
* graph of enterprise IT cloud adoption
* from left to right: ignore, ignore, ignore, no, no, I said No dammit, oh no, oh fuck
* rest of world = halfway through cloud adoption
* you are here = trying to play catch up

20 years exp:

* 94 "SE Toolkit"
* 98 Sun Perf. Tuning
* 99 Resource Mgmt.
* 00 Capacity Planning for Web Services
* 07 Outstanding Contrib. to Computer Metrics
* 04-08 Capacity Planning Workshops
* 14 Monitorama!

state of the art in 2008:

* cacti, ganglia, nagios, zenoss, mrtg, Wireshark
* low number of machines
* it was subversive to think that open source could replace expensive enterprise tools
* created "SE", a C interpreter which could extract solaris performance information and output it all in a standard format
* created "virtual adrian", a simple rule-based system for automated monitoring of disk, memory, etc. in solaris (to watch systems while he was on vacation)

why no more monitoring tools?

* we have too many
* we need more analysis tools, can we get an analysorama conference?
* rule #1: we spend too much time collecting, storing, and displaying metrics
* if you spend 50% of your time on this, it's too much
* we need more automation, more analysis
* monitoring should not be tacked on, it should be a default

what's wrong with minutes?

* not enough resolution to catch problems
* it takes 5-8 minutes before you start seeing alerts
* if you had second resolution, you could see the difference in 5 seconds
* if your rollbacks are quick, you can revert a bad change in 5 seconds
* compare a 10 second outage to a 10 minute outage
* from continuous delivery we know that small incremental changes are best
* so we need the same from monitoring
* instant detection and rollback within seconds should be a goal
* SaaS tools that do this: VividCortex, boundary
* how does netflix do it? hystrix and turbine, websockets, streaming metrics, 1 second resolution & 15 seconds of history, circuit breakers, pages go to whoever is directly responsible for a specific component or change
* rule #2: metric collection -> display latency should be < human attention span (10s)

what's wrong with milliseconds?

* in a lot of JVM instrumentation, ms is the standard
* the problem with ms is that a lot of datacenter and hardware communication needs nanosecond resolution
* rule #3: validate that your measurement system has enough accuracy and precision
* if there's a difference between something taking X and Y nanoseconds in your system, and all you have are a bunch of 1ms data points, you can't identify the problem

what's wrong with monoliths?

* monolithic monitoring tools are easy to deploy, but when they go down, you have no monitoring
* there needs to be a pool of aggregators, displayers, etc.
* easier to do upgrades, more resilient to downtime
* anything monolithic has performance problems, scalability problems, and SPOFs, and you can't tell the difference between the monitoring system going down vs. the actual system going down
* in-band monitoring: running monitoring on the same process, server, data center, etc. as the system itself
* SaaS monitoring: send to a third party
* both: an outage can't take out both monitoring systems, HA monitoring
* they might not be monitoring exactly the same stuff, but they should have some overlap
* rule #4: monitoring needs to be at least as available & scalable as the underlying system

continuous delivery:

* high rate of change
* new machines being spun up and shut down all the time (in netflix's case)
* short baselines for alert threshold analysis
* ephemeral configuration
* short lifetimes make it hard to aggregate historical data
* hand-tweaked solutions do not work, it would take too much effort

microservices:

* complex flow of requests
* how do you monitor end-to-end when the dependencies and flow of requests are so complex and dynamic?
* Gilt Groupe: went from a handful of services to 450 services over the course of a year
* "death star" microservice pattern: everything is calling everything else in one big tangled graph of dependencies
* how do you visualize this? we need more hierarchy & grouping

closed loop control systems:

* how did netflix do autoscaling?
* on every deploy during peak time, double the number of servers
* using load average, which is not the best metric to use
* lots of overshoots
* new solution: scryer
* predictive autoscaler, FFT-based algorithm, builds a forward-predicted model to set the autoscale level
* scales ahead of time, then corrects as necessary
* using the old method it was hard to do this analysis, because the data was so chunky (from the doubling)

code canaries:

* ramp up the deployment, look for errors; if there are problems it emails the responsible team and stops rolling out the code (a rough sketch of this comparison appears after the five rules below)

monitoring tools for developers:

* most monitoring tools are built for ops / sysadmins (DBA vs. network admin vs. sysadmin vs. storage admin)
* fiefdoms of different teams and tools, different levels of access, hard to collaborate, hard to integrate and extend
* state of the art is to move towards APM, analytics, integrated tools for all teams
* deep linking & embedding, extensible tools
* business transactions, response time, runtime (e.g. JVM) metrics

challenges with dynamic ephemeral cloud apps:

* dedicated hardware: arrives infrequently, disappears infrequently, sticks around for years, unique IPs and MAC addresses
* cloud assets: arrive in bursts, stick around for a few hours, recycle the IPs and MACs of machines that were just shut down!
* in the cloud model, you need to have a historical record of everything that ever happened in your infrastructure (Netflix Edda)

traditional arch:

* business logic
* DB master & slave
* some fabric in between
* storage

new cloud systems:

* business logic
* NoSQL nodes
* cloud object store
* not all hosted cloud services have detailed monitoring / metrics exposed
* you depend on web services to integrate with cloud services
* they span zones & regions, so monitoring now needs to span zones & regions too
* NoSQL introduces new failure modes

5 rules:

* 1. analysis > collection
* 2. key business metric monitoring should be second resolution
* 3. precision and accuracy -> more confidence
* 4. monitoring must be more scalable than the underlying system
* 5. start building distributed, ephemeral cloud native applications
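The canary analysis described above boils down to comparing a small pool of instances running the new code against a matched pool running the old code, and halting the rollout when a key metric regresses. A minimal sketch of that comparison follows; the metric names, the 20% regression cutoff, and the sample values are made-up placeholders, and this is not Netflix's actual canary tooling.

```python
# Toy canary comparison (not Netflix's canary analyzer): compare a pool of
# freshly launched instances running the new code against an equally fresh
# pool running the old code, and stop the rollout on any regression.
from statistics import mean

METRICS = ("error_rate", "p99_ms", "cpu_pct")   # illustrative metric names

def canary_regressions(baseline, canary, max_ratio=1.2):
    """Return the metrics on which the canary pool is >20% worse than baseline."""
    bad = []
    for metric in METRICS:
        base = mean(host[metric] for host in baseline)
        new = mean(host[metric] for host in canary)
        if base > 0 and new / base > max_ratio:
            bad.append(metric)
    return bad

# three old-code and three new-code instances, per the Q&A suggestion below
baseline_pool = [{"error_rate": 0.011, "p99_ms": 180, "cpu_pct": 52},
                 {"error_rate": 0.009, "p99_ms": 175, "cpu_pct": 55},
                 {"error_rate": 0.010, "p99_ms": 190, "cpu_pct": 50}]
canary_pool   = [{"error_rate": 0.031, "p99_ms": 260, "cpu_pct": 58},
                 {"error_rate": 0.027, "p99_ms": 240, "cpu_pct": 61},
                 {"error_rate": 0.035, "p99_ms": 255, "cpu_pct": 57}]

regressions = canary_regressions(baseline_pool, canary_pool)
if regressions:
    print("stop rollout, notify owning team:", regressions)   # error_rate, p99_ms
else:
    print("keep ramping up the deployment")
```

As the Q&A below notes, the baseline pool should also be freshly spun up, so both pools see the same warm-up effects.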
Q&A:

* Q: you mentioned better visualization for microservices, like what?
* A: a user hits the homepage -> what services are hit? there is no arch. diagram anymore; part of the viz. involves seeing which zones and regions are hit, manual tagging & hierarchy of components, owners, etc. it's useful, for instance, to limit to just the services my team owns or depends on, an aspect-oriented view, but it's not a solved problem; most OSS monitoring tools have good backends but less good UIs, cloudweaver looks interesting
* Q: canary system, what types of checks are you running?
* A: error rate, CPU time, response time, jmeter functional tests, business metrics, and you need to do the comparison on freshly spun up nodes (e.g. 3 old vs. 3 new copies of the code on freshly spun up machines)

## Computers are a Sadness, I am the Cure - James Mickens ##

* (this talk was just entertainment, no practical information)
* i'm here to take you on a quest
* everything i'm going to tell you is 100% true
* bla bla
* distributed systems send messages back and forth
* most messages fail because god hates us
* so we send more
* 10 years ago the MapReduce paper was like alien technology
* it was so simple and seductive, you just specified a map and a reduce function, ran it on commodity machines, it was amazing
* that was 10 years ago
* let's stop talking about MapReduce
* say "word count" one more time
* let's also stop talking about "the cloud"
* the problem with all this social cloud stuff is that i hate most people
* there are two kinds of people: people who have actually built cloud software, and others
* others: cloud is great!, 99.9999999%!, everyone is happy, everything is a solved problem!
* real cloud people: it's a nightmare, hardware fails, SLAs are misleading, IO is queued up, packets get sent to a black hole, it's madness
* why does anything happen at all in the cloud?
* it's like an old-timey map with dragons in the middle
* this is why we need monitoring & analysis
* a message of hope: give up
* look at the CAP theorem, you can't have it all
* if your email goes down, then your reaction should be to want to use email less, go do something else
* can't take your test at your MOOC? take it later, your MOOC degree will be just as worthless
* let's be serious though
* some things we do need to care about
* (nosql rant i didn't fully write down; nosql = bane from batman, throw out all the rules and laws, chaos)
* conventional wisdom: america needs more programmers
* reality: we need fewer programmers
* technology is not the future, no more stupid apps, painting is the future, go do that, leave me alone
* if you are a VC who funds this kind of stuff, i hope you become poor
* let's be serious about security
* threat model: mossad or not-mossad
* either you are being attacked by mossad or you're not
* "not attacked by mossad" = where you want to be, just keep using strong passwords and don't click on weird links
* "you are being attacked by mossad" = no defenses, you're going to die
* america's mental model of the CIA, FBI, etc. is that they are a bunch of boy scouts
* in reality: drones, exoskeletons, cable-splicing submarines
* they're not going to send boy scouts, they're not going to fight close-range musket battles, they're going to use their advantage of having access to all the infrastructure you depend on
* how do you defend against that with rocks and pencils and leaves?
* easy attacks are easy
* "Mary" from "Central University" working as a "Recruiter" with an attractive profile picture wants to be my friend on Facebook
* obviously i don't know mary
* BUT WHAT IF I DO KNOW MARY
* most important goal in security: eliminate men as a gender
* possible solution: dude overflow detected -> trigger bear trap and the guy from the SAW movie

summary:

* ozzy osbourne crazy train = cloud computing
* bane = nosql
* bla bla

Q&A:

* Q: can i be your friend on facebook?
* A: there is a background check, and i will wait 2-3 days to show i'm not desperate, but i encourage you to submit an application, i love judging people

## Simple math to get some signal out of your noisy sea of data - Toufic Boubez ##

* i lied! there are no simple tricks
* too good to be true = it probably is
* background:
    * CTO Metafor Software
    * CTO Layer 7 Technologies
    * CTO Saffron Technologies
* let's start with the "Wall of Charts"
* hire a new guy: shove him in front of the wall of charts
* we collect 1000s of metrics, pick 10, and put them in a dashboard
* this is meaningless
* WoC leads to alert fatigue
* alert fatigue is one of the largest problems in ops
* watching WoCs cannot scale
* at some point, you will need a person or a team dedicated to watching the WoCs
* so we need to turn this work over to the machines
* to the rescue: anomaly detection
* definition: detect events or patterns which do not match expectation
* definition for devops: alert when one of our graphs starts looking wonky
* who else is doing anomaly detection?
* manufacturing QC has been doing this for a long time
* measure the diameter, weight, etc. of the flux capacitors and throw the outliers away
* assumptions: normal, gaussian distribution; data is "stationary", it doesn't change much over time
* the "three-sigma rule": 68% of the values lie within 1 std dev of the mean, 95% lie within 2, 99.7% lie within 3
* mark those percentages as the "red lines" on the graphs and take action when a value falls outside of a red line
* if you implement 3-sigma rule alerts in the data center:
    * a. you get alerted all the time, or
    * b. you don't get alerted when there's a real problem
* the assumptions from manufacturing (gaussian, stationary) don't apply to the data center
* static thresholds are ineffective
* if data is moving, we need a moving threshold, that's a smart idea
* the "big idea" of moving averages: the next value should be consistent with the recent trend
* finite window of past values, ignore the whole history
* calculate a predicted value
* "smoothed" version of the time series
* compare squared error rates between smoothed vs. raw data
* now you can compute the 3-sigma values based on that smoothed data (see the sketch below)
* what about spikes, outliers, etc.? windows can be skewed
* ok, now we use a weighted moving average, with less weight on data that is further away
* not good enough, doesn't handle trends -> exponential smoothing
* double exponential smoothing (DES)
* triple exponential smoothing (TES)
* Holt-Winters (seasonal effects)
* result:
    * a. you are woken up a lot less, but still woken up
    * b. it still doesn't catch some problems
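To make the "moving threshold" idea concrete, here is a minimal sketch (not code from the talk) that smooths a series with an exponentially weighted moving average and alerts when a point falls more than three standard deviations of the recent residuals away from the prediction. The alpha, the residual window, and the sample series are arbitrary choices for illustration.

```python
# Minimal sketch of a moving 3-sigma band: smooth the series with an
# exponentially weighted moving average (EWMA) and alert when a new point
# lands more than 3 standard deviations of recent errors from the prediction.
# The alpha and residual window are arbitrary, not values from the talk.
from statistics import stdev

def ewma_three_sigma_alerts(series, alpha=0.3, window=30):
    smoothed = series[0]
    residuals = []                  # recent (actual - predicted) errors
    alerts = []
    for t, value in enumerate(series[1:], start=1):
        predicted = smoothed
        error = value - predicted
        if len(residuals) >= 10 and abs(error) > 3 * stdev(residuals):
            alerts.append((t, value, predicted))
        residuals = (residuals + [error])[-window:]        # finite window, forget old history
        smoothed = alpha * value + (1 - alpha) * smoothed   # update the EWMA
    return alerts

# a gently wiggling series with one obvious spike at t=50
series = [100 + (i % 5) for i in range(100)]
series[50] = 160
print(ewma_three_sigma_alerts(series))   # -> [(50, 160, ...)], only the spike fires
```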
* are we doomed?
* no
* smoothing works on certain kinds of data
* smoothing works when deviations are normally distributed
* there are lots of non-gaussian techniques, we're only going to scratch the surface in this talk
* trick #1: histograms
* (better: kernel densities, but histograms work and are simple)
* if you have a bunch of different time series of the same metric, build a histogram for each series
* start by looking at the distribution of your data, understand what it looks like before you start your analysis
* trick #2: kolmogorov-smirnov test
* it sounds cool and it works
* compares two probability distributions
* requires no assumptions about the underlying distribution
* measures the max distance between two cumulative distributions
* good for comparing day-to-day, week-to-week, seasonal effects
* "are these two series similar or not?"
* KS with windowing
* example: KS for week 1 vs. week 2 and week 2 vs. week 3 (where week 3 is during christmas and we experienced a problem)
* 1 vs. 2: small distance
* 2 vs. 3: huge distance
* the case where the 3-sigma static threshold failed is now extremely clear with KS
* trick #3: diffing / derivatives
* often when your data is not stationary, the derivative is
* e.g. random walks
* most frequently, the first difference is sufficient: dS(t) = S(t+1) - S(t)
* once you have a stationary data set, gaussian techniques work better
* real example: CPU time
* the distribution is totally non-gaussian, very noisy and random looking
* but... take the first difference, and it totally is gaussian!
* you're not doomed if you know your data
* understand the statistical properties of your data
* data center data is typically non-gaussian
* so don't use smoothing
* use histograms, KD, and derivatives instead (a sketch of tricks #2 and #3 follows the Q&A below)

Q&A:

* Q: is your point to make everything gaussian?
* A: no! sorry if i conveyed this message, KS does not involve gaussian, there are lots of good non-gaussian techniques
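A small sketch of tricks #2 and #3 using numpy and scipy; this is not the speaker's code, and the data is synthetic, but it shows the shape of both techniques: a two-sample Kolmogorov-Smirnov test to ask whether two windows of the same metric look alike, and a first difference to turn a non-stationary random walk into something stationary.

```python
# Sketch of tricks #2 and #3 (illustrative, not from the talk): a two-sample
# KS test over weekly windows, and first differencing of a random walk.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# --- trick #2: Kolmogorov-Smirnov with windowing ---------------------------
week1 = rng.normal(loc=100, scale=10, size=7 * 24)   # hourly values, a normal week
week2 = rng.normal(loc=100, scale=10, size=7 * 24)   # another normal week
week3 = rng.normal(loc=140, scale=25, size=7 * 24)   # the "christmas" week with a problem

stat_12, p_12 = ks_2samp(week1, week2)
stat_23, p_23 = ks_2samp(week2, week3)
print(f"week1 vs week2: KS distance {stat_12:.2f}")  # small distance: similar weeks
print(f"week2 vs week3: KS distance {stat_23:.2f}")  # large distance: something changed

# --- trick #3: first difference of a non-stationary series -----------------
random_walk = np.cumsum(rng.normal(size=1000))       # non-stationary, noisy and wandering
first_diff = np.diff(random_walk)                    # dS(t) = S(t+1) - S(t), stationary
print("raw std:   ", np.std(random_walk).round(2))
print("diffed std:", np.std(first_diff).round(2))    # gaussian-friendly again
```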
## The Care and Feeding of Monitoring - Katherine Daniels ##

* a story
* pagerduty tells us our site is down
* so we checked, and it was down
* then... a minute later, it's back
* hmm. ok.
* then... a few minutes later
* down again
* and up again
* this is... The Blip, a randomly occurring outage that fixes itself
* so what's happening?
* 500 rate... nothing
* API errors... nothing
* error rate... nothing
* what are we missing from our monitoring?
* monitor all the things!
* we're missing something, just start randomly adding metrics until we find it
* then you get... this...
* zenoss screenshot that's all red from down checks
* we're trying to find a needle in a haystack and just added more hay
* this is why you don't do a full-body diagnostic scan for medical patients: the more you look for, the more you might find, and they might not all be actual issues
* so, we need to monitor only some of the things...
* first looked at the load balancers, because everything dropped out of the LB at once
* tried provisioning a new ELB, switching availability zones
* looked at access logs
* everything worked the same, still getting the blip
* how about the healthcheck?
* the healthcheck was hitting something called "healthD", a healthcheck service that failed when one or both of two important backend components went down
* and there weren't any logs or monitoring for healthD itself
* looking inside healthD showed that one of the two services, api2, had a problem
* it seems a certain misbehaving user was triggering bad requests
* so we went into api2 and added metrics per response type
* found the response type that stood out
* decreased timeouts from 60 seconds to 5 seconds
* optimized some slow queries
* deleted some old slow / unused API methods
* now the site was back to normal

why didn't we have monitoring for this?

* 1. black boxes, mysteries
* any X-as-a-Service that you depend on (e.g. ELBs) is a black box and needs some special care for monitoring
* 2. technical debt / bad technical decisions
* why did the healthcheck require both services to be up?
* why did we even have two separate APIs?
* long ago someone decided to do a rewrite, but the old system remained
* we can only move forward at this point, we can't shut down either system, so we need to monitor both

what to monitor:

* monitor all services
* monitor responsiveness (network, API, web server)
* system metrics (memory used, CPU used, disk space)
* application metrics (read lock time, write lock time, error rate, API response time)
* don't get into a situation where you have to say "oh yeah that check is red but it's OK, don't worry"
* as someone mentioned earlier, your monitoring needs to scale above your application
* load test your monitoring, make sure it can keep up and responds properly under increased load
* monitoring should not be a silo, it shouldn't be just an ops problem
* monitoring should be built into the application from the beginning
* work with developers
* ask: "what does it mean for this application to work properly? what does it look like when it breaks?"
* monitoring shouldn't be a reactive last-minute thing

## Car Alarms and Smoke Alarms - Dan Slimmon ##

* Sr. Platform Engineer at Exosite, which does internet of things
* we recently made a better mousetrap that texts you when it goes off, so if you have a building full of mousetraps you only need to check the one that was tripped
* we wear many hats in ops
* but data science is becoming a very important hat
* people believe you when you have graphs
* signal-to-noise ratio
* example: plagiarism detection
* let's say we make a system that has a 90% chance of flagging a plagiarized paper
* and a 20% chance of flagging a paper that wasn't plagiarized (a false positive)
* and 30% of kids currently plagiarize

some questions:

* 1. given a random paper, what's the probability you get a negative result?
    * 59%
* 2. what's the probability that the system will catch a plagiarized paper?
    * 90%, duh, we already knew that, why'd i ask you that?
* 3. if you get a positive result, what's the probability the paper really was plagiarized?
    * 65.8%
* this is an unintuitively terrible result
* we originally heard 90% chance
* but now in the real world it's down to 65.8%, that's pretty useless (these numbers are worked out in the sketch below)
* sensitivity and specificity
* sensitivity: % of actual positives that are identified as such
* specificity: % of actual negatives that are identified as such
* high sensitivity: freaks the fuck out when anything might be considered slightly bad
* high specificity: if it says you cheated, sorry, you definitely cheated
* here's the graph if you want to look at it again: http://imgur.com/LkxcxLt.png
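A quick worked check of those plagiarism numbers, assuming the "20%" above is the false-positive rate (the chance of flagging a paper that wasn't plagiarized); the prior, sensitivity, and false-positive rate are the only inputs.

```python
# Worked version of the plagiarism-detector numbers, assuming 90% sensitivity,
# a 20% false-positive rate, and a 30% base rate of plagiarism.
p_plag = 0.30          # prior: fraction of papers that are plagiarized
sensitivity = 0.90     # P(positive | plagiarized)
fp_rate = 0.20         # P(positive | not plagiarized)  (assumed reading of the "20%")

p_negative = p_plag * (1 - sensitivity) + (1 - p_plag) * (1 - fp_rate)
print(f"Q1: P(negative result)        = {p_negative:.0%}")   # 59%

print(f"Q2: P(catch a plagiarist)     = {sensitivity:.0%}")  # 90%, by definition

p_positive = p_plag * sensitivity + (1 - p_plag) * fp_rate
ppv = (p_plag * sensitivity) / p_positive                     # Bayes' rule
print(f"Q3: P(plagiarized | positive) = {ppv:.2%}")          # 65.85%, the ~65.8% from the talk
```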
* how does this relate to ops?
* positive predictive value (PPV) is the probability that, when you get paged, something is actually wrong
* consider your service has 99.9% uptime, and your check is 99% accurate
* that sounds pretty good, right?
* P(TP) = P(actually down) × P(alert | down) = 0.1% × 99% ≈ 0.1%
* P(FP) = P(actually up) × P(alert | up) = 99.9% × 1% ≈ 1%
* PPV = P(TP) / (P(TP) + P(FP)) ≈ 9.1%
* if you get paged, you only have about a 1 in 10 chance that something is actually wrong
* that's horrible
* car alarms
* when you hear a car alarm, is your immediate reaction to run and check to make sure everything is ok?
* the majority of car alarms sounding don't indicate a problem, they go off all the time for no reason
* they have low specificity, high sensitivity
* smoke alarms
* when you hear a smoke alarm in a building, you don't have the same reaction
* you don't sit around and say "do you guys smell smoke? i think i'm just gonna wait here"
* you get out of the building and wait for the fire department to give the OK
* why do we have such noisy checks?
* undetected outages are embarrassing, so we focus on sensitivity
* that's a normal, good reaction to have
* but understand the relation between the alert threshold and PPV
* looser threshold = less alerting, higher PPV, more uninterrupted sleep (but a chance you'll miss a real problem)
* strict threshold = more alerting, lower PPV, more false positives
* sensitivity / specificity don't need to be competing concerns
* instead of a line, you need a surface
* hysteresis is a great way to get these additional degrees of freedom (see the sketch after this section)
* state machines
* time series analysis (like mentioned earlier: smoothing, histograms, derivatives, etc.)
* as your data changes (e.g. your service becomes more or less reliable) or your checks become more reliable
* your sensitivity & specificity will change too, sometimes wildly, so you can't just set it once and forget about it
* a lot of nagios configs conflate the detection vs. identification of a problem
* for example, say you have these 4 checks for your website:
    * 1. apache process count
    * 2. swap usage
    * 3. site responding to HTTP
    * 4. requests per second
* "your alerting should only tell you whether work is getting done"
* if your site is still up, but apache isn't running, that's great news! (haha)
* so cross off #1 and #2
* and #3 and #4 can be combined into one check: if your RPS is good, then it must be responding
* here's a tool that i want: something like nagios that monitors services instead of hosts
* when a service is down, only then do you kick off a bunch of host-level diagnostics
* if the tool was aware of these SNR concepts (specificity, etc.), and had some built-in knobs to tune, that would be even better
* other useful stuff:
    * bischeck
    * see links in slides
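The hysteresis / state-machine idea can be sketched in a few lines; this is illustrative, not a real tool or anything shown in the talk. The idea: require several consecutive bad samples before paging and several consecutive good samples before clearing, so one noisy sample can't flip the alert state.

```python
# Tiny hysteresis sketch (hypothetical, not a real tool): an alert state machine
# that needs N consecutive bad samples to fire and M consecutive good samples to
# clear, instead of paging on every threshold crossing.
class HysteresisAlert:
    def __init__(self, threshold, bad_needed=3, good_needed=5):
        self.threshold = threshold      # e.g. minimum acceptable requests/sec
        self.bad_needed = bad_needed
        self.good_needed = good_needed
        self.bad_streak = 0
        self.good_streak = 0
        self.alerting = False

    def observe(self, requests_per_sec):
        """Feed one sample; returns True while the check is in the alerting state."""
        if requests_per_sec < self.threshold:
            self.bad_streak += 1
            self.good_streak = 0
            if not self.alerting and self.bad_streak >= self.bad_needed:
                self.alerting = True    # page: work has stopped getting done
        else:
            self.good_streak += 1
            self.bad_streak = 0
            if self.alerting and self.good_streak >= self.good_needed:
                self.alerting = False   # recover: sustained healthy traffic
        return self.alerting

check = HysteresisAlert(threshold=100)
samples = [120, 95, 130, 80, 70, 60, 110, 105, 120, 115, 125]
print([check.observe(s) for s in samples])
# one blip (95) doesn't page; three consecutive bad samples (80, 70, 60) do
```

The two streak lengths are exactly the kind of extra knobs mentioned above: they trade sensitivity (how fast you catch a real outage) against specificity (how often a blip wakes you up).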
Q&A:

* Q: is it foolish to tweak these knobs manually? shouldn't this be automated?
* A: i haven't found anything to automate this yet, manually tweaking is the only way i've found so far

## Metrics 2.0 - Dieter Plaetinck ##

* works at vimeo
* video transcoding & storage
* lots of metrics, lots of graphite
* when a user uploads, it first runs a few checks to determine which data center to route your upload to
* graphite is used to make a feedback loop to make sure that kind of automated system is working properly
* but this talk is going to be about problems, mostly with graphite
* a timeseries looks like this: (unixtime, value)
* timeseries are labelled like "mysql.database1.queries_per_second"
* it is difficult to navigate the hierarchies
* it is difficult to find how and why a metric is being generated
* timeseries don't have units, and they don't describe their behavior (e.g. semantics like which time period they cover)
* unclear, inconsistent formats
* metrics are tightly coupled to the source and lack context
* one metric name can have multiple meanings
* complexity = lots of sources, lots of people, multiple aggregators
* it's a time sink
* everything has to be done explicitly, even when this data could be determined implicitly (units, legend, axes, titles, etc.)
* in graphite, different subtrees may contain the same types of data, which makes it hard to compare across the hierarchy
* as you gather more metrics, these problems get worse
* metrics 2.0 tries to solve these problems
* metrics have a self-describing format (a toy illustration appears after the query examples at the end of these notes)
* compare graphite:
    * stats.timers.dfs5.proxy_server.object.GET.200.timing.upper_90
* to metrics 2.0:
    * { server: dfvimeodfsproxy5, http_method: GET, http_code: 200, unit: ms, metric_type: gauge, stat: upper_90, swift_type: object }
* metrics 2.0 allows you to use more characters to label your metrics (e.g. "/" for "Req/s")
* metrics 2.0 allows you to add extra metadata to your metrics
* for example, src/from parameters, so you can track where a metric is being submitted from
* conceptual model -> wire protocol (compatible with graphite/statsd/carbon) -> storage
* metrics20.org
* units are extremely useful:
    * MB/s, Err/d, Req/h, ...
    * B Err Warn Conn Job File Req ...
* we allow you to use SI + IEEE standard units
* easier to learn, more flexible

Carbon-tagger:

* middleware between the old graphite instance and the new metrics 2.0 instance
* adapts the old format to the new format (adding metadata, units, etc.)

Statsdaemon:

* similar to etsy statsd, drop-in compatible
* if you send a bunch of bytes (B) over time, it automatically figures out this is B/s
* if you send a bunch of milliseconds (ms) over time, it automatically calculates percentiles/min/max/mean/etc.

Graph-Explorer:

* dashboard system with a new query syntax

New query syntax:

* proxy-server swift server:regex unit=ms
* automatically does group-by based on metadata
* automatic legends, axes, tagging (these are all manual in graphite)
* stat=upper_90
* from datetime to datetime
* avg over (5M, 1h, 1d, ...)

Some examples:

* Which is slower, PUT or GET?
    * stack ... http_method:(PUT|GET) swift_type=object
* Show http performance per server:
    * http_method:(PUT|GET) group by unit, server
* grab all job stats (note how no timeseries names are explicitly given, this finds all timeseries that have a unit of "Jobs/second"):
    * transcode unit=Job/s avg over
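To illustrate why self-describing, tag-based metrics make this kind of querying easier, here is a toy sketch; it is not the metrics 2.0 wire protocol or Graph-Explorer's implementation, and the series and tag values are invented. With tags, "everything with unit=Job/s" or "group by server" becomes a filter over metadata instead of pattern-matching positions in dotted names.

```python
# Toy illustration of tag-based (metrics 2.0 style) series selection, not the
# real wire protocol or Graph-Explorer: each series is just a dict of tags, so
# queries filter and group on metadata instead of dotted-name positions.
from collections import defaultdict

series = [
    {"server": "dfvimeodfsproxy5", "http_method": "GET", "http_code": "200",
     "unit": "ms", "stat": "upper_90", "swift_type": "object"},
    {"server": "dfvimeodfsproxy6", "http_method": "PUT", "http_code": "201",
     "unit": "ms", "stat": "upper_90", "swift_type": "object"},
    {"server": "transcoder1", "service": "transcode", "unit": "Job/s"},
    {"server": "transcoder2", "service": "transcode", "unit": "Job/s"},
]

def select(metrics, **tags):
    """Return every series whose tags match, e.g. select(series, unit='Job/s')."""
    return [m for m in metrics if all(m.get(k) == v for k, v in tags.items())]

def group_by(metrics, key):
    """Group matching series by a tag, roughly what 'group by server' would do."""
    groups = defaultdict(list)
    for m in metrics:
        groups[m.get(key, "unknown")].append(m)
    return dict(groups)

print(select(series, unit="Job/s"))                   # all job-rate series, no names needed
print(group_by(select(series, unit="ms"), "server"))  # http timings grouped per server
```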