# Monitorama 2014 notes #

http://monitorama.com/

Best talks day 1:

* **Please, no More Minutes, Milliseconds, Monoliths... or Monitoring Tools!** - Adrian Cockcroft
    * gave 5 good rules for monitoring systems, showed what cloud / microservices monitoring looks like @ Netflix
* **Simple math to get some signal out of your noisy sea of data** - Toufic Boubez
    * explained why static alert thresholds don't work and gave 3 techniques to use instead
* **Car Alarms and Smoke Alarms** - Dan Slimmon
    * how to use sensitivity and specificity in monitoring, some good math (see the worked example after the sponsor plugs below)
* **Metrics 2.0** - Dieter Plaetinck
    * metrics20.org = redesign of graphite that fixes a bunch of stuff, keep an eye on this project
* **StatsG at New York Times** - Eric Buth
    * the first half of the talk on ops philosophy was really interesting, the second half about statsg is not so useful

Best talks day 2:

* **"Auditing all the things": The future of smarter monitoring and detection** - Jen Andre
    * really awesome security talk, lots of good practical steps for us
* **Is There An Echo In Here?: Applying Audio DSP algorithms to monitoring** - Noah Kantrowitz
    * showed how to use audio processing techniques on monitoring data, good math, very interesting (see the low-pass filter sketch after the sponsor plugs below)
* **The Lifecycle of an Outage** - Scott Sanders
    * github's tools & procedures & culture around resolving outages
* **A whirlwind tour of Etsy's monitoring stack** - Daniel Schauenberg
    * practical walkthrough of Etsy's (extensive) monitoring system
* **Web performance observability** - Mike McLane & Joseph Crim
    * not sure we can directly use the tool they made, but it gives a good idea of what a web performance benchmark suite looks like; also see the canary.io lightning talk

Good lightning talks:

* **serverspec + sensu**: interesting approach to testing & monitoring; if you write serverspecs for testing / CI, you can also run them on your production servers and get even better coverage
* **monitoring & inadvertent spam traps**: anecdote from a developer on how developers can use monitoring to solve problems
* **Expanding Context to Facilitate Correlation**: showed 3 open source tools that improve on graphite/nagios web interfaces
* **canary.io**: project from github ops for doing web performance testing, still in the early stages, but looks promising
* **Distributed Operational Responsibility**: some tips from spotify on why ops responsibilities (like monitoring) should be shared with developers

Semi-interesting sponsor plugs:

* **VividCortex**: MySQL performance analysis tool (SaaS) from ex-Percona guys
* **PagerDuty**: we should start using multi-user alerting (new feature, they gave 2 good use-cases)
* **ElasticSearch**: ~70% of the people attending were using ElasticSearch
* **Big Panda**: building a smarter "inbox" for ops (to replace email + jira)
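Quick worked example of the sensitivity / specificity math from the Car Alarms and Smoke Alarms talk. This is my own sketch with made-up numbers and a hypothetical `alert_precision` helper, not code from the talk; it just applies Bayes' rule to show that even a very accurate check produces mostly false alarms when real problems are rare.

```python
# Sketch of the sensitivity/specificity point, with made-up numbers.

def alert_precision(sensitivity, specificity, base_rate):
    """Probability that a firing alert corresponds to a real problem
    (positive predictive value), computed with Bayes' rule."""
    true_positives = sensitivity * base_rate
    false_positives = (1 - specificity) * (1 - base_rate)
    return true_positives / (true_positives + false_positives)

# A check that catches 99% of real problems and stays quiet 99% of the
# time when things are fine still mostly cries wolf if real problems are
# rare (here: 1 check interval in 1000 has a real problem).
print(alert_precision(sensitivity=0.99, specificity=0.99, base_rate=0.001))
# ~0.09 -- roughly 9 out of 10 alerts are false positives
```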
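Same idea for the audio DSP talk: about the simplest DSP primitive you can run over a metric stream is a single-pole low-pass filter (exponential smoothing), which damps spiky noise so trends stand out before you threshold or do anomaly detection. Again my own illustration, not necessarily one of the algorithms the talk covered.

```python
# Single-pole IIR low-pass filter over a metric series:
# y[n] = alpha * x[n] + (1 - alpha) * y[n-1]

def low_pass(samples, alpha=0.2):
    """Exponentially smooth a list of samples; smaller alpha = heavier smoothing."""
    smoothed = []
    y = samples[0]
    for x in samples:
        y = alpha * x + (1 - alpha) * y
        smoothed.append(y)
    return smoothed

noisy_requests_per_sec = [120, 118, 250, 119, 122, 117, 121, 320, 118, 120]
print(low_pass(noisy_requests_per_sec))
# the isolated spikes (250, 320) get damped; a sustained shift would still show up
```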
Recurring themes / big takeaways:

* monitoring must scale ahead of the underlying system
* you need high frequency monitoring: it's not OK to wait minutes for a check result or alert
* collect data on everything with graphite (see the carbon sketch after this list)
* data collection should be a default on everything from the beginning, not a time-consuming / reactive / after-the-fact process
* only alert when work isn't getting done; RAM / swap / CPU / etc. are not something you should directly alert on
* manually watching graphs & dashboards doesn't scale
* start using anomaly detection
* static thresholds do not work for data from the data center, and moving averages are only slightly better; you need to use better math (see the anomaly detection sketch after this list)
* do more analysis, understand your data (scatterplots, histograms, find distributions, correlations, probability & stats, etc.)
* ops should provide self-service data collection / monitoring / alerting for developers
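For the "collect data on everything with graphite" takeaway, a minimal sketch of what pushing one data point looks like using carbon's plaintext line protocol. The host, port, and metric name are placeholders; in practice you'd probably go through statsd or a client library rather than raw sockets.

```python
# Send a single metric to Graphite's carbon daemon using the plaintext
# protocol: "<metric.path> <value> <timestamp>\n" over TCP (default port 2003).
import socket
import time

def send_metric(name, value, host="graphite.example.com", port=2003):
    """Push one data point to carbon; host/port/name here are placeholders."""
    line = "%s %f %d\n" % (name, value, int(time.time()))
    sock = socket.create_connection((host, port), timeout=5)
    try:
        sock.sendall(line.encode("ascii"))
    finally:
        sock.close()

send_metric("app.checkout.orders_completed", 12)
```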
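For the "static thresholds don't work, use better math" takeaway, one small step past a fixed threshold (my own sketch, not necessarily any of the specific techniques the talks recommended): compare the latest point to the recent median, scaled by the median absolute deviation (MAD). It tolerates noisy, non-Gaussian data much better than a fixed "alert above X" line.

```python
# Flag a point as anomalous when it sits far from the recent median,
# scaled by the median absolute deviation (a robust stand-in for stddev).
from statistics import median

def is_anomalous(history, value, threshold=6.0):
    """Return True if `value` is far outside the recent distribution in `history`."""
    med = median(history)
    mad = median(abs(x - med) for x in history) or 1e-9  # avoid division by zero
    return abs(value - med) / mad > threshold

recent_latencies_ms = [102, 98, 110, 95, 105, 99, 108, 101, 97, 104]
print(is_anomalous(recent_latencies_ms, 103))   # False: within normal variation
print(is_anomalous(recent_latencies_ms, 450))   # True: worth alerting on
```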