# Monitorama 2014 notes #

http://monitorama.com/

Best talks day 1:

* **Please, no More Minutes, Milliseconds, Monoliths... or Monitoring Tools!** - Adrian Cockcroft
    * gave 5 good rules for monitoring systems, showed what cloud / microservices monitoring looks like @ Netflix
* **Simple math to get some signal out of your noisy sea of data** - Toufic Boubez
    * explained why static alert thresholds don't work and gave 3 techniques to use instead
* **Car Alarms and Smoke Alarms** - Dan Slimmon
    * how to use sensitivity and specificity in monitoring, some good math (see the worked example after the sponsor plugs below)
* **Metrics 2.0** - Dieter Plaetinck
    * metrics20.org = redesign of graphite that fixes a bunch of stuff, keep an eye on this project
* **StatsG at New York Times** - Eric Buth
    * the first half of the talk on ops philosophy was really interesting, the second half about statsg is not so useful

Best talks day 2:

* **"Auditing all the things": The future of smarter monitoring and detection** - Jen Andre
    * really awesome security talk, lots of good practical steps for us
* **Is There An Echo In Here?: Applying Audio DSP algorithms to monitoring** - Noah Kantrowitz
    * showed how to use audio processing techniques on monitoring data, good math, very interesting (see the low-pass filter sketch after the sponsor plugs below)
* **The Lifecycle of an Outage** - Scott Sanders
    * github's tools & procedures & culture around resolving outages
* **A whirlwind tour of Etsy's monitoring stack** - Daniel Schauenberg
    * practical walkthrough of Etsy's (extensive) monitoring system
* **Web performance observability** - Mike McLane & Joseph Crim
    * not sure we can directly use the tool they made, but it gives a good idea of what a web performance benchmark suite looks like; also see the canary.io lightning talk

Good lightning talks:

* **serverspec + sensu**: interesting approach to testing & monitoring; if you write serverspecs for testing / CI, you can also run them on your production servers and get even better coverage
* **monitoring & inadvertent spam traps**: anecdote from a developer on how developers can use monitoring to solve problems
* **Expanding Context to Facilitate Correlation**: showed 3 open source tools that improve on graphite/nagios web interfaces
* **canary.io**: project from github ops for doing web performance testing, still in the early stages, but looks promising
* **Distributed Operational Responsibility**: some tips from spotify on why ops responsibilities (like monitoring) should be shared with developers

Semi-interesting sponsor plugs:

* **VividCortex**: MySQL performance analysis tool (SaaS) from ex-Percona guys
* **PagerDuty**: we should start using multi-user alerting (new feature, they gave 2 good use-cases)
* **ElasticSearch**: ~70% of the people attending were using ElasticSearch
* **Big Panda**: building a smarter "inbox" for ops (to replace email + jira)
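Quick worked example of the sensitivity / specificity math from the Car Alarms and Smoke Alarms talk. This is my own sketch with made-up numbers and a hypothetical `alert_precision` helper, not code from the talk; it just applies Bayes' rule to show that even a very accurate check produces mostly false alarms when real problems are rare.

```python
# Sketch of the sensitivity/specificity point, with made-up numbers.

def alert_precision(sensitivity, specificity, base_rate):
    """Probability that a firing alert corresponds to a real problem
    (positive predictive value), computed with Bayes' rule."""
    true_positives = sensitivity * base_rate
    false_positives = (1 - specificity) * (1 - base_rate)
    return true_positives / (true_positives + false_positives)

# A check that catches 99% of real problems and stays quiet 99% of the
# time when things are fine still mostly cries wolf if real problems are
# rare (here: 1 check interval in 1000 has a real problem).
print(alert_precision(sensitivity=0.99, specificity=0.99, base_rate=0.001))
# ~0.09 -- roughly 9 out of 10 alerts are false positives
```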
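Same idea for the audio DSP talk: about the simplest DSP primitive you can run over a metric stream is a single-pole low-pass filter (exponential smoothing), which damps spiky noise so trends stand out before you threshold or do anomaly detection. Again my own illustration, not necessarily one of the algorithms the talk covered.

```python
# Single-pole IIR low-pass filter over a metric series:
# y[n] = alpha * x[n] + (1 - alpha) * y[n-1]

def low_pass(samples, alpha=0.2):
    """Exponentially smooth a list of samples; smaller alpha = heavier smoothing."""
    smoothed = []
    y = samples[0]
    for x in samples:
        y = alpha * x + (1 - alpha) * y
        smoothed.append(y)
    return smoothed

noisy_requests_per_sec = [120, 118, 250, 119, 122, 117, 121, 320, 118, 120]
print(low_pass(noisy_requests_per_sec))
# the isolated spikes (250, 320) get damped; a sustained shift would still show up
```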
Recurring themes / big takeaways:

* monitoring must scale ahead of the underlying system
* you need high frequency monitoring: it's not OK to wait minutes for a check result or alert
* collect data on everything with graphite (see the carbon sketch after this list)
* data collection should be a default on everything from the beginning, not a time-consuming / reactive / after-the-fact process
* only alert when work isn't getting done; RAM / swap / CPU / etc. are not something you should directly alert on
* manually watching graphs & dashboards doesn't scale
* start using anomaly detection
* static thresholds do not work for data from the data center, and moving averages are only slightly better; you need to use better math (see the anomaly detection sketch after this list)
* do more analysis, understand your data (scatterplots, histograms, find distributions, correlations, probability & stats, etc.)
* ops should provide self-service data collection / monitoring / alerting for developers
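For the "collect data on everything with graphite" takeaway, a minimal sketch of what pushing one data point looks like using carbon's plaintext line protocol. The host, port, and metric name are placeholders; in practice you'd probably go through statsd or a client library rather than raw sockets.

```python
# Send a single metric to Graphite's carbon daemon using the plaintext
# protocol: "<metric.path> <value> <timestamp>\n" over TCP (default port 2003).
import socket
import time

def send_metric(name, value, host="graphite.example.com", port=2003):
    """Push one data point to carbon; host/port/name here are placeholders."""
    line = "%s %f %d\n" % (name, value, int(time.time()))
    sock = socket.create_connection((host, port), timeout=5)
    try:
        sock.sendall(line.encode("ascii"))
    finally:
        sock.close()

send_metric("app.checkout.orders_completed", 12)
```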
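For the "static thresholds don't work, use better math" takeaway, one small step past a fixed threshold (my own sketch, not necessarily any of the specific techniques the talks recommended): compare the latest point to the recent median, scaled by the median absolute deviation (MAD). It tolerates noisy, non-Gaussian data much better than a fixed "alert above X" line.

```python
# Flag a point as anomalous when it sits far from the recent median,
# scaled by the median absolute deviation (a robust stand-in for stddev).
from statistics import median

def is_anomalous(history, value, threshold=6.0):
    """Return True if `value` is far outside the recent distribution in `history`."""
    med = median(history)
    mad = median(abs(x - med) for x in history) or 1e-9  # avoid division by zero
    return abs(value - med) / mad > threshold

recent_latencies_ms = [102, 98, 110, 95, 105, 99, 108, 101, 97, 104]
print(is_anomalous(recent_latencies_ms, 103))   # False: within normal variation
print(is_anomalous(recent_latencies_ms, 450))   # True: worth alerting on
```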