@M_richo, when testing and monitoring collide:

* serverspec + sensu
* serverspec = rspec testing framework for server configurations, platform agnostic, 26 resource types
* very fast, example: 266 tests in 2.78 seconds
* when do you want to write serverspecs? when you're writing infrastructure as code, to validate your code
* you can also run your serverspecs on your live servers, why? because it's a quick and cheap way to verify everything is working
* great addition to your monitoring system
* let's put this data into sensu
* first attempt: wow, we have a lot of failures, and i have no idea what's broken
* 1. use rspec's json output format
* 2. sensu has a feature to send check results over a socket
* these two features let you split the checks up: instead of one huge summary check for all servers you get a bunch of separate checks, so it's easy to see what failed (rough sketch below)
* summary:
** write tests for your systems / infrastructure code
** don't duplicate your effort, run your serverspecs on production
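(not from the talk, just an illustration) a rough sketch of the "rspec json output + sensu socket" idea, in python since the talk doesn't prescribe a language. it runs the suite, then submits one check result per example to the local sensu client socket. the port (3030), the rspec json field names, and the status mapping are assumptions; adjust for your setup.

```python
#!/usr/bin/env python3
"""Split a serverspec (rspec) run into one Sensu check result per example."""
import json
import re
import socket
import subprocess

# Map rspec example status to Sensu exit status; anything else (failed) is critical.
STATUS = {"passed": 0, "pending": 1}

# Run the suite and capture the JSON summary. rspec exits non-zero when
# examples fail, which is fine here; we still want its output.
raw = subprocess.run(
    ["rspec", "--format", "json"], capture_output=True, text=True
).stdout
report = json.loads(raw)

for example in report.get("examples", []):
    result = {
        # Sensu check names are restricted, so squash the description.
        "name": re.sub(r"[^\w.-]", "_", example["full_description"])[:250],
        "status": STATUS.get(example["status"], 2),
        "output": "{}: {}".format(example["status"], example["full_description"]),
    }
    # The Sensu client accepts external results as JSON on localhost:3030.
    with socket.create_connection(("localhost", 3030)) as sock:
        sock.sendall(json.dumps(result).encode("utf-8"))
```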
---

@laprice, monitoring postgres performance:

* hardware determines: memory, random_page_cost, tablespaces
* workload determines: query_planner, autovacuum, stats_collector
* what is autovacuum?
* cleans out dead tuples
* reorders pages on disk
* thresholds can be set per table
* one of the primary culprits for "my database is slow and i don't know why"
* highly tunable: workers, nap time, duration, timeout, max age, cost delay, cost limit, etc.
* focus on the tables that need it most (the largest tables)
* track dead tuple count & percentage (>5%)
* main question to answer: are my tables being vacuumed when they should be?
* you can get this info by querying pg_stat_all_tables, see the docs (rough query sketch below)
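(not from the talk) a rough sketch of that pg_stat_all_tables query, wrapped in python/psycopg2 so it could feed a check or a report. the PG_DSN environment variable is my own convention and the 5% threshold is just the rule of thumb mentioned above.

```python
#!/usr/bin/env python3
"""Flag tables whose dead-tuple percentage suggests autovacuum is falling behind."""
import os

import psycopg2

QUERY = """
SELECT schemaname,
       relname,
       n_live_tup,
       n_dead_tup,
       last_autovacuum,
       round(100.0 * n_dead_tup / greatest(n_live_tup + n_dead_tup, 1), 2)
           AS dead_pct
  FROM pg_stat_all_tables
 WHERE n_dead_tup > 0
 ORDER BY dead_pct DESC;
"""

conn = psycopg2.connect(os.environ.get("PG_DSN", "dbname=postgres"))
with conn, conn.cursor() as cur:
    cur.execute(QUERY)
    for schema, table, live, dead, last_av, dead_pct in cur.fetchall():
        # >5% dead tuples was the rule of thumb from the talk.
        if dead_pct > 5:
            print(f"{schema}.{table}: {dead_pct}% dead "
                  f"({dead}/{live + dead}), last autovacuum: {last_av}")
```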
---

@petecheslock, 17th century shipbuilding and your failed software project:

* aka - why your project management sucks
* the Vasa
* grandest ship built by the royal swedish navy
* the most expensive project ever undertaken by the country at the time
* after sailing less than one mile a gust of wind hit the ship, it tipped over, and it sank to the bottom of the sea
* 50 years later they recovered the ship and analyzed what went wrong
* the captain who survived was thrown into jail, he was asked if the crew was drunk, they were not, he was later released
* it tipped because it didn't have enough ballast
* why? it started off as a 108 foot ship
* then was changed to 111 feet (originally wanted 120 feet)
* then they wanted to add another gundeck
* sure, ok, then they needed to scale it up to 135 feet
* (nobody in sweden had even built a ship with two gundecks yet)
* they kept revising the number of guns, size of guns
* rush job
* the king also needed to have a bunch of ornate carvings added, making it more top heavy
* most of the design came from the king's head
* they did a lurch test (30 men running back and forth on the deck, believe it or not), and they had to stop because the ship was about to tip over
* the design changed so many times, they needed to add ballast, but there was no place to add it
* if they did add ballast, the lower gun deck would have been underwater
* so you may be thinking...
* why did they launch if all the tests failed??!
* if they didn't launch on time, the people involved would have been subjected to "the King's disgrace" (execution?)
* to recap:
** schedule pressure
** changing needs
** no specs
** lack of project plan
** excessive innovation
** secondary innovations
** requirement creep
** lack of scientific methods
** ignoring the obvious: launched after failed tests
* the lesson: those who ignore history are doomed to repeat it!

---

@hypertextranch, monitoring & inadvertent spam traps:

* i work at wordpress.com as a developer
* i've never actually seen nagios
* but i've infiltrated your ranks
* we see a lot of spam
* any developer can make their own stats
* memorization < (intuition + investigation)
* how i found a random spammer
* i deployed elasticsearch and checked our monitoring to see if it made things better or worse
* i saw queries stacking up
* only 3 nodes pegged CPU, all other nodes were fine
* if this were a problem in my code, it would have caused a problem on all nodes
* every blog has a main instance and is replicated to two extra machines
* so it seems like this is a problem with a single blog
* some user scripted their blog to pull in articles from the washington post, splice in some affiliate links, and repeat every 30 seconds
* every time a site gets marked as spam by our filter, it causes the articles to be reindexed
* lesson: your devs should look at monitoring because they probably have more intuition about problems
* automated monitoring might not have caught these three bad nodes
* an ops dude would have noticed those three nodes
* but i as a dev was able to intuitively pick up on the problem right away

---

Chess - a reflection of life:

* "Chess is everything: art, science, and sport"
* tournament players lose 10-15 pounds after a tournament, physical and mental stress for 8 hours a day burns calories
* you are the winner even if you lose, you can learn from every match
* the game is egalitarian, the only thing that matters is the moves
* it doesn't matter what your age or gender or race is
* ego is the enemy of learning & growth
* ego is an anchor
* accept that there is more for you to learn, and you will
* chess exemplifies the power of cause and effect
* your moves at the start are directly related to the moves at the end
* time & timing are everything
* a good position fades quickly
* the game is all about patterns
* our brain is built to detect patterns
* "control the center" applies to chess and to life and business
* ran out of time

---

@isaacfinnegan, Expanding Context to Facilitate Correlation:

* basically i want to show off some cool stuff
* "we've got great tools"
* really?
* i have to use 5 different tools to get stuff done, they all have different, crappy interfaces
* github.com/evernote/graphite-web
* templates for graphite
* NagUI: federated nagios interface
* very fast (especially compared to the classic interface)
* bulk viewing, bulk actions
* drag & drop custom views, saved views, share views with your team
* graphite integration
* acknowledge + send to jira
* mobile interface too
* CMDB: pull data from different tools into one view
* nagui + jira + graphite
* i think this is the next step for monitoring tools
* instead of monolithic rewrites, integrate existing tools

---

Feature Knobs & Deploy Knobs:

* feature flags, feature toggles, config flags
* they're awesome!
* >100 deploys a day are awesome!
* deploy dark and turn up slowly for everything
* this leads to a problem though
* over time, we have a million feature flags and it's not clear which ones can be safely turned off/on
* you need a promotion process, cleanup process, which is tough
* use feature knobs wisely...
* what about deploy knobs?
* with a deploy knob, once you turn it up, you can't go back
* this makes them self-cleaning

---

some dude running linux tried to present but couldn't get the display to work

---

@michaelgorsuch, github ops, canary.io:

* scratching an itch via small, composable tools
* measure URL performance & availability
* at high resolution (sub-second)
* multiple vantage points
* based on libcurl (ubiquitous and provides good stats)
* sensord gets a blob of JSON with a list of URLs
* it measures them with libcurl and spits out JSON (rough stand-in sketch below)
* that's cool
* now i have all these sensord instances running around the globe
* what do i do with this json?
* i need to aggregate
* new tool: canaryd
* siphon off the useful info, store it in redis for the past 5 minutes (starting small...)
* exposes the stats via REST API
* even with 5 minutes, that's 1200 measurements
* compare that to nagios's check_http, that would be like 1 measurement per 5 minutes in nagios
* so why not feed this high resolution data into a nagios check?
* what if i want to share this data?
* i want to make this open source, infrastructure independent
* open measuring for an open web
* it "launched" 3 days ago, by that i mean i tweeted a gist
* it's running in DO, but rackspace offered a bunch of servers
* someone already built a dashboard
* github.com/canaryio
* i'm learning go, don't be scared by the code
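(not from the talk) a rough python/pycurl stand-in for the kind of per-URL measurement sensord does. the real tool isn't python and its JSON field names may differ; this just shows the libcurl phase timings the talk is referring to, emitted as one JSON sample per line.

```python
#!/usr/bin/env python3
"""Measure URLs with libcurl (via pycurl) and emit one JSON sample per
measurement on stdout, roughly in the spirit of sensord."""
import io
import json
import sys
import time

import pycurl


def measure(url):
    buf = io.BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEFUNCTION, buf.write)  # body is discarded, we only want timings
    c.setopt(pycurl.FOLLOWLOCATION, True)
    c.setopt(pycurl.TIMEOUT, 10)
    c.perform()
    sample = {
        "url": url,
        "t": time.time(),
        "status_code": c.getinfo(pycurl.RESPONSE_CODE),
        # libcurl's phase timings, all in seconds
        "namelookup_time": c.getinfo(pycurl.NAMELOOKUP_TIME),
        "connect_time": c.getinfo(pycurl.CONNECT_TIME),
        "starttransfer_time": c.getinfo(pycurl.STARTTRANSFER_TIME),
        "total_time": c.getinfo(pycurl.TOTAL_TIME),
    }
    c.close()
    return sample


if __name__ == "__main__":
    urls = sys.argv[1:] or ["https://example.com/"]
    while True:  # sub-second resolution: just keep looping over the targets
        for url in urls:
            print(json.dumps(measure(url)), flush=True)
        time.sleep(0.25)  # interval is illustrative
```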
---

Sergey Fedorov, netflix, Stateful monitoring:

* couldn't present due to technical difficulties

---

Martin Parm, spotify, Distributed Operational Responsibility:

* first person to present using linux!
* give ops responsibility back to developers
* capacity planning
* monitoring
* config mgmt
* instead of doing this for them, we give them the tools to do this
* why do this? doesn't this seem like a bad idea?
* we have so many changes and engineers that we can't do it all with an ops team
* so why not get the right people in front of a project the first time?
* if you break something, you need to fix it, better accountability
* we want the teams to work with the technologies
* how about monitoring?
** devs need training, but not a whole new education, just enough to solve their problems
** devs need autonomy, and will do stupid things (ops does stupid stuff too)
* alerting: metrics & events -> magic monitoring pipeline & alerting rules -> pagerduty alerts
** our alerting stack: ffwd (homegrown stat forwarder), apache kafka, riemann, even more stuff
** we don't need them to learn or touch the internals of that alerting stack
* different abstraction levels
* script hooks, drop a script in a folder
* write your own python script with the riemann library
* write your own rules, provide tools for that
* impact on monitoring?
** more monitoring, better monitoring
** monitoring platform
** more teaching, less babysitting / hand-writing monitoring code

---

Charlie, cofounder of Hosted Graphite, protecting your lizard brain while on-call:

* failures are very stressful at Hosted Graphite, people depend on us for their monitoring
* feedback loop: failures -> more checks -> more alerting -> more docs
* things are getting better, but...
* but failures start training you on a primitive level, that certain things are bad
* you start to learn that your phone is a source of pain and fear
* things were alright until they weren't
* panic, jumpy, stressful
* why is that the reaction?
* you need to be calm to solve the technical problem
* and most outages aren't that serious
* i have to remind myself "it's not that bad"
* but my lizard brain is fucking terrified no matter what
* if you hear an incoming text, and it isn't even your phone, and you jump, then that's not right
* just let people know that you're down, that can relieve some stress
* is that stress symbolic of something else? are you afraid of failing? your company failing?
* what are other on-call people thinking?
* i've heard the same stuff from everyone... big or small company, big or small team, one person or multiple people on-call
* having someone else on-call in front of you is helpful
* turn off all other notifications on your phone
* what can we do better? i want to talk to people about this
* what can companies do to improve mental health of those on-call?
* i'm gonna stand by the door back there and i want to talk to you