## Sponsor Plug: New Relic - Chase ##

New Relic browser / front end:
* how fast your pages load
* how fast are your ajax calls?
* JS error tracking

interesting stuff we found:
* error messages get translated, "Syntax error" vs. "Erreur de syntaxe", they get reported differently
* his site had no ajax, but there were a ton of AJAX errors
** what is this stuff?
** the majority are toolbars, malware, etc.
** browser extensions, google translate, etc.
** some are pretty nasty, "Skype click-to-call" got into an infinite loop and triggered tens of thousands of errors

## Sponsor plug: Elastic Search - Rashid ##

* who uses ES? show of hands
* 70% use it vs. 30% don't (hmm... interesting..)
* i'm going to give a workshop on wednesday, so i'll demo a lot more then
* but if anyone has any questions, feel free to ask me now
* Q: why do we need log searching? why elasticsearch?
** A: a graph shows you when something might be wrong, but logs allow you to go back to the original event and see what exactly happened
* Q: what did you have for breakfast?
** A: yogurt, granola, melon
* Q: do you want to buy a musket?
** A: yes, to defend myself from the government
* Q: did you know you can 3d print a musket?
** A: yes, i'm terrified of this
* Q: does ZK cluster discovery work?
** A: not used it, zen (?) discovery works
* Q: can you talk about jepsen and ES?
** A: there's a recent blog post about it, it's a tough subject, distributed is hard, we don't have an answer for everything but we're doing pretty good
* Q: roadmap?
** A: for what?
* Q: kibana?
** A: will talk more on wed, better aggregations / facets, which are useful for turning logs into charts, "top N query" reduced from N queries to 1
* Q: when is ES going to learn how to reindex something something without something?
** A: push harder if you want this feature

## Sponsor plug: Librato - Joe ##

* CTO of librato
* librato is a platform for storing, monitoring, and alerting on custom metrics
* composable monitoring system tailored to you
* in the past that meant building your own solution from scratch with a bunch of OSS
* librato lets you correlate arbitrary time series with each other
* marking events like deploys & config changes
* no proprietary agent, everything works over HTTP
* 80-100 products (middleware, web servers, databases, etc.) know how to speak to librato via opensource plugins
* if you can write to stdout, you can capture that log output and send to librato as metrics
* new features:
** more integrations
** better alerts - tune the sensitivity of alerts using historical data
** better on-call information - associate URLs / documentation with alerts, find all previous occurrences of an alert
** "composite metrics" - custom query language to manipulate raw data, calculate ratios, aggregates (looks like graphite's URL/function interface)

## Sponsor plug: Pagerduty ##

* pagerduty sits between your monitoring systems and your on-call people
* we integrate with everyone
* we send SMS/email to the right person
* we take reliability seriously, full end-to-end tests
** we have 4 android phones in our lab constantly receiving texts to ensure deliverability!

new stuff:
* multi-user alerting
* on-call handoff notifications
* SSO
* outbound webhooks

multi-user alerting:
* we found this is a great way to do onboarding for new ops people
* put the new guy on-call alongside a veteran so they can get trained up in being on-call
* multi-user alerting is also good for higher levels of escalation
* for example if two people sleep through the alert, then set up your third escalation level to alert everyone instead of continuing to retry people one-by-one

handoff notifications:
* notify by email, sms, and push when you go on or off call

outbound webhooks:
* now has integration with slack, hipchat, flowdock, etc.
* live demo of webhook FAILED, kinda awkward... lolz
* oh wait he just yelled from the crowd that it worked (sure it did)

## Sponsor plug: Dataloop.io - David ##

* lots of teams spend a lot of time building monitoring solutions using OSS
* but as soon as you try to get developers or QA to use it, you run into problems
* high learning curve, confusing documentation, difficult interfaces
* we want to un-silo the monitoring tools
* as we move to microservices, traditional monitoring gets more difficult
* we are building the monitoring tool for microservices
* easy to use
* flexibility of nagios / graphite, but with drag & drop
* easy to create alerts
* use existing nagios check scripts
* speaks graphite/statsd/carbon protocol
* create hierarchies with drag & drop
* use tags
* write plugins in any language
* another thing we do besides config is visualization
** nagios, collectd, and statsd all in one place
** create dashboards via drag & drop, resize
** send dashboard reports via email (good for weekly / monthly reports to management teams)
** embeddable widgets
* next, alerting:
** big feature is multiple triggers for alerts
** build context for your alerts
** condition A and condition B and condition C
** e.g. both web performance & service up/down check must trigger before alert goes off
** this decreases alert spam
* actions:
** email / SMS / phone
** send to jira
** trigger event handlers (any language)
* driven by API, command line tool, or github
* launching later this year, beta testing now

## Sponsor plug: Salesforce ##

no-show

## Sponsor plug: Puppet ##

* who doesn't know what puppet is?
* we have commercial & open source offerings
* who's coming to the puppet party tonight?
* it's really hard to get there, left then right
* we're hiring, a lot
* (scrolls through dozens of job listings)
* can everyone from puppet labs stand up?
* (like 20 people stood up)
* come to puppetconf in SF, september 20-24
* all kinds of presenters, lots of topics
* early bird pricing ends this month

## Sponsor plug: pingdom ##

interesting numbers from our customers:
* 14 billion checks per month
* 9.4 million detected outages per month
* 8 million alerts sent per month
* total downtime to 500 million minutes, across 450k customers
* what can we do at pingdom to help with this?
* #1 most requested feature: alert management

new feature: BeepManager
* pingdom.com/beepmanager
* team members can customize their method of contact
* automated escalations
* integrate with other systems (nagios, new relic, rackspace cloud monitoring)
* alert flood protection
* access levels
* alert templates
* most important feature of monitoring system is that it works for your team
* we are committed to making our tool work for your team

## Sponsor plug: Grok - Jared ##

* numenta.com/grok
* we do anomaly detection
* we've heard all about it these two days
* how do we solve it? science
* years of research, we've made some breakthroughs
* automatic & unsupervised machine learning on timeseries data
* open source at numenta.org

first product: grok
* mobile app
* automated model creation & monitoring for AWS instances
* showed some examples
* automatic anomaly detection in CPU load
* they used this to catch someone running manual builds on a build server
* required no setup / training
* free trial: simple to get running, 10 servers, no time limit

## Sponsor plug: Big Panda ##

* we launched our private beta yesterday
* we spend a lot of time tweaking tools, building thousands of alerts
* what do you use to manage your response to issues?
* jira, zendesk, email
* those tools are meant for humans
* they were not built for responding to tons of automatically created incidents, flapping alerts, etc.
* bigpanda is basically jira for ops
* live demo
* home page "OpsBox" shows all alerts
* UI should be very familiar to gmail users
* star alerts, mute alerts
* how do i rise above the noise of alerts?
* shows a timeline of alerts, when did it start warning, when did it reach critical, when did it go back to normal
* (pretty cool looking)
* shows a lot more data in context
* "Changes" view: event log of changes in your infrastructure
* we're already helping people today respond to alerts in a much more intelligent manner

## Sponsor plug: Datadog - Alexei ##

* cofounder and CTO of Datadog
* hosted monitoring service
* easily monitor from 5 to 50,000 hosts
* what have we been working on the past year?
* better graphs
* better visualizations, histograms
* better counts & counters
* heatmaps
* better alerts, more sophisticated alerting
* the ability to embed disturbing images into your dashboards (nicholas cage meme pics)
* more integrations: fastly, google cloud, slack, new relic, 50-60 integrations total
* monitoring is fun!
* who here has learned a lot these past two days? (everyone)
* who here wants to work on monitoring more? (still everyone)
* that's good news because we're hiring ha ha laffs
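
note to self, re: the Dataloop.io "multiple triggers" idea (condition A and condition B and condition C must all be true before the alert goes off): here is a rough sketch of what that logic might look like. This is not Dataloop's actual implementation or API; the `Condition` class, the `should_alert` function, the metric names, and the thresholds are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A snapshot of current metric values, keyed by (hypothetical) metric name.
Metrics = Dict[str, float]


@dataclass
class Condition:
    """One trigger: a name plus a predicate over the current metrics."""
    name: str
    check: Callable[[Metrics], bool]


def should_alert(conditions: List[Condition], metrics: Metrics) -> bool:
    """Fire only when every condition holds (A and B and C).

    An empty condition list never fires, so a misconfigured alert
    cannot page anyone by accident.
    """
    return bool(conditions) and all(c.check(metrics) for c in conditions)


# Hypothetical example: page only if the site is slow AND the service check fails.
conditions = [
    Condition("slow_response", lambda m: m["response_ms"] > 2000),
    Condition("service_down", lambda m: m["service_up"] == 0),
]

slow_only = {"response_ms": 3500.0, "service_up": 1.0}
slow_and_down = {"response_ms": 3500.0, "service_up": 0.0}

print(should_alert(conditions, slow_only))      # slow but still up: no alert
print(should_alert(conditions, slow_and_down))  # slow and down: alert
```

requiring all conditions is what cuts the alert spam: a single noisy check (e.g. one slow response) cannot page anyone on its own.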