@M_richo, when testing and monitoring collide:

* serverspec + sensu
* serverspec = rspec testing framework for server configurations, platform agnostic, 26 resource types
* very fast, example: 266 tests in 2.78 seconds
* when do you want to write serverspecs? when you're writing infrastructure as code, to validate your code
* you can also run your serverspecs on your live servers, why? because it's a quick and cheap way to verify everything is working
* great addition to your monitoring system
* let's put this data into sensu
* first attempt: wow, we have a lot of failures, and i have no idea what's broken
* 1. use rspec's json output format
* 2. sensu has a feature to send check results over a socket
* these two features let you split the checks up: instead of one huge summary check for all servers you get a bunch of separate checks, so it's easy to see what failed (rough sketch below)
* summary:
** write tests for your systems / infrastructure code
** don't duplicate your effort, run your serverspecs on production
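(not from the talk, just an illustration) a rough sketch of the "rspec json output + sensu socket" idea, in python since the talk doesn't prescribe a language. it runs the suite, then submits one check result per example to the local sensu client socket. the port (3030), the rspec json field names, and the status mapping are assumptions; adjust for your setup.

```python
#!/usr/bin/env python3
"""Split a serverspec (rspec) run into one Sensu check result per example."""
import json
import re
import socket
import subprocess

# Map rspec example status to Sensu exit status; anything else (failed) is critical.
STATUS = {"passed": 0, "pending": 1}

# Run the suite and capture the JSON summary. rspec exits non-zero when
# examples fail, which is fine here; we still want its output.
raw = subprocess.run(
    ["rspec", "--format", "json"], capture_output=True, text=True
).stdout
report = json.loads(raw)

for example in report.get("examples", []):
    result = {
        # Sensu check names are restricted, so squash the description.
        "name": re.sub(r"[^\w.-]", "_", example["full_description"])[:250],
        "status": STATUS.get(example["status"], 2),
        "output": "{}: {}".format(example["status"], example["full_description"]),
    }
    # The Sensu client accepts external results as JSON on localhost:3030.
    with socket.create_connection(("localhost", 3030)) as sock:
        sock.sendall(json.dumps(result).encode("utf-8"))
```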
---

@laprice, monitoring postgres performance:

* hardware determines: memory, random_page_cost, tablespaces
* workload determines: query_planner, autovacuum, stats_collector
* what is autovacuum?
* cleans out dead tuples
* reorders pages on disk
* thresholds can be set per table
* one of the primary culprits for "my database is slow and i don't know why"
* highly tunable: workers, nap time, duration, timeout, max age, cost delay, cost limit, etc.
* focus on the tables that need it most (the largest tables)
* track dead tuple count & percentage (>5%)
* main question to answer: are my tables being vacuumed when they should be?
* you can get this info by querying pg_stat_all_tables, see the docs (rough query sketch below)
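(not from the talk) a rough sketch of that pg_stat_all_tables query, wrapped in python/psycopg2 so it could feed a check or a report. the PG_DSN environment variable is my own convention and the 5% threshold is just the rule of thumb mentioned above.

```python
#!/usr/bin/env python3
"""Flag tables whose dead-tuple percentage suggests autovacuum is falling behind."""
import os

import psycopg2

QUERY = """
SELECT schemaname,
       relname,
       n_live_tup,
       n_dead_tup,
       last_autovacuum,
       round(100.0 * n_dead_tup / greatest(n_live_tup + n_dead_tup, 1), 2)
           AS dead_pct
  FROM pg_stat_all_tables
 WHERE n_dead_tup > 0
 ORDER BY dead_pct DESC;
"""

conn = psycopg2.connect(os.environ.get("PG_DSN", "dbname=postgres"))
with conn, conn.cursor() as cur:
    cur.execute(QUERY)
    for schema, table, live, dead, last_av, dead_pct in cur.fetchall():
        # >5% dead tuples was the rule of thumb from the talk.
        if dead_pct > 5:
            print(f"{schema}.{table}: {dead_pct}% dead "
                  f"({dead}/{live + dead}), last autovacuum: {last_av}")
```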
---

@petecheslock, 17th century shipbuilding and your failed software project:

* aka - why your project management sucks
* the Vasa
* grandest ship built by the royal swedish navy
* the most expensive project ever undertaken by the country at the time
* after sailing less than one mile a gust of wind hit the ship, it tipped over, and it sank to the bottom of the sea
* 50 years later they recovered the ship and analyzed what went wrong
* the captain who survived was thrown into jail, he was asked if the crew was drunk, they were not, he was later released
* it tipped because it didn't have enough ballast
* why? it started off as a 108 foot ship
* then was changed to 111 feet (originally wanted 120 feet)
* then they wanted to add another gundeck
* sure, ok, then they needed to scale it up to 135 feet
* (nobody in sweden had even built a ship with two gundecks yet)
* they kept revising the number of guns, size of guns
* rush job
* the king also needed to have a bunch of ornate carvings added, making it more top heavy
* most of the design came from the king's head
* they did a lurch test (30 men running back and forth on the deck, believe it or not), and they had to stop because the ship was about to tip over
* the design changed so many times, they needed to add ballast, but there was no place to add it
* if they did add ballast, the lower gun deck would have been underwater
* so you may be thinking...
* why did they launch if all the tests failed??!
* if they didn't launch on time, the people involved would have been subjected to "the King's disgrace" (execution?)
* to recap:
** schedule pressure
** changing needs
** no specs
** lack of project plan
** excessive innovation
** secondary innovations
** requirement creep
** lack of scientific methods
** ignoring the obvious: launched after failed tests
* the lesson: those who ignore history are doomed to repeat it!

---

@hypertextranch, monitoring & inadvertent spam traps:

* i work at wordpress.com as a developer
* i've never actually seen nagios
* but i've infiltrated your ranks
* we see a lot of spam
* any developer can make their own stats
* memorization < (intuition + investigation)
* how i found a random spammer
* i deployed elasticsearch and checked our monitoring to see if it made things better or worse
* i saw queries stacking up
* only 3 nodes pegged CPU, all other nodes were fine
* if this were a problem in my code, it would have caused a problem on all nodes
* every blog has a main instance and is replicated to two extra machines
* so it seems like this is a problem with a single blog
* some user scripted their blog to pull in articles from the washington post, splice in some affiliate links, and repeat every 30 seconds
* every time a site gets marked as spam by our filter, it causes the articles to be reindexed
* lesson: your devs should look at monitoring because they probably have more intuition about problems
* automated monitoring might not have caught these three bad nodes
* an ops dude would have noticed those three nodes
* but i as a dev was able to intuitively pick up on the problem right away

---

Chess - a reflection of life:

* "Chess is everything: art, science, and sport"
* tournament players lose 10-15 pounds after a tournament, physical and mental stress for 8 hours a day burns calories
* you are the winner even if you lose, you can learn from every match
* the game is egalitarian, the only thing that matters is the moves
* it doesn't matter what your age or gender or race is
* ego is the enemy of learning & growth
* ego is an anchor
* accept that there is more for you to learn, and you will
* chess exemplifies the power of cause and effect
* your moves at the start are directly related to the moves at the end
* time & timing are everything
* a good position fades quickly
* the game is all about patterns
* our brain is built to detect patterns
* "control the center" applies to chess and to life and business
* ran out of time

---

@isaacfinnegan, Expanding Context to Facilitate Correlation:

* basically i want to show off some cool stuff
* "we've got great tools"
* really?
* i have to use 5 different tools to get stuff done, they all have different, crappy interfaces
* github.com/evernote/graphite-web
* templates for graphite
* NagUI: federated nagios interface
* very fast (especially compared to the classic interface)
* bulk viewing, bulk actions
* drag & drop custom views, saved views, share views with your team
* graphite integration
* acknowledge + send to jira
* mobile interface too
* CMDB: pull data from different tools into one view
* nagui + jira + graphite
* i think this is the next step for monitoring tools
* instead of monolithic rewrites, integrate existing tools

---

Feature Knobs & Deploy Knobs:

* feature flags, feature toggles, config flags
* they're awesome!
* >100 deploys a day are awesome!
* deploy dark and turn up slowly for everything
* this leads to a problem though
* over time, we have a million feature flags and it's not clear which ones can be safely turned off/on
* you need a promotion process, cleanup process, which is tough
* use feature knobs wisely...
* what about deploy knobs?
* with a deploy knob, once you turn it up, you can't go back
* this makes them self-cleaning

---

some dude running linux tried to present but couldn't get the display to work

---

@michaelgorsuch, github ops, canary.io:

* scratching an itch via small, composable tools
* measure URL performance & availability
* at high resolution (sub-second)
* multiple vantage points
* based on libcurl (ubiquitous and provides good stats)
* sensord gets a blob of JSON with a list of URLs
* it measures them with libcurl and spits out JSON (rough stand-in sketch below)
* that's cool
* now i have all these sensord instances running around the globe
* what do i do with this json?
* i need to aggregate
* new tool: canaryd
* siphon off the useful info, store it in redis for the past 5 minutes (starting small...)
* exposes the stats via REST API
* even with 5 minutes, that's 1200 measurements
* compare that to nagios's check_http, that would be like 1 measurement per 5 minutes in nagios
* so why not feed this high resolution data into a nagios check?
* what if i want to share this data?
* i want to make this open source, infrastructure independent
* open measuring for an open web
* it "launched" 3 days ago, by that i mean i tweeted a gist
* it's running in DO, but rackspace offered a bunch of servers
* someone already built a dashboard
* github.com/canaryio
* i'm learning go, don't be scared by the code
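(not from the talk) a rough python/pycurl stand-in for the kind of per-URL measurement sensord does. the real tool isn't python and its JSON field names may differ; this just shows the libcurl phase timings the talk is referring to, emitted as one JSON sample per line.

```python
#!/usr/bin/env python3
"""Measure URLs with libcurl (via pycurl) and emit one JSON sample per
measurement on stdout, roughly in the spirit of sensord."""
import io
import json
import sys
import time

import pycurl


def measure(url):
    buf = io.BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEFUNCTION, buf.write)  # body is discarded, we only want timings
    c.setopt(pycurl.FOLLOWLOCATION, True)
    c.setopt(pycurl.TIMEOUT, 10)
    c.perform()
    sample = {
        "url": url,
        "t": time.time(),
        "status_code": c.getinfo(pycurl.RESPONSE_CODE),
        # libcurl's phase timings, all in seconds
        "namelookup_time": c.getinfo(pycurl.NAMELOOKUP_TIME),
        "connect_time": c.getinfo(pycurl.CONNECT_TIME),
        "starttransfer_time": c.getinfo(pycurl.STARTTRANSFER_TIME),
        "total_time": c.getinfo(pycurl.TOTAL_TIME),
    }
    c.close()
    return sample


if __name__ == "__main__":
    urls = sys.argv[1:] or ["https://example.com/"]
    while True:  # sub-second resolution: just keep looping over the targets
        for url in urls:
            print(json.dumps(measure(url)), flush=True)
        time.sleep(0.25)  # interval is illustrative
```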
---

Sergey Fedorov, netflix, Stateful monitoring:

* couldn't present due to technical difficulties

---

Martin Parm, spotify, Distributed Operational Responsibility:

* first person to present using linux!
* give ops responsibility back to developers
* capacity planning
* monitoring
* config mgmt
* instead of doing this for them, we give them the tools to do this
* why do this? doesn't this seem like a bad idea?
* we have so many changes and engineers that we can't do it all with an ops team
* so why not get the right people in front of a project the first time?
* if you break something, you need to fix it, better accountability
* we want the teams to work with the technologies
* how about monitoring?
** devs need training, but not a whole new education, just enough to solve their problems
** devs need autonomy, and will do stupid things (ops does stupid stuff too)
* alerting: metrics & events -> magic monitoring pipeline & alerting rules -> pagerduty alerts
** our alerting stack: ffwd (homegrown stat forwarder), apache kafka, riemann, even more stuff
** we don't need them to learn or touch the internals of that alerting stack
* different abstraction levels
* script hooks, drop a script in a folder
* write your own python script with the riemann library
* write your own rules, provide tools for that
* impact on monitoring?
** more monitoring, better monitoring
** monitoring platform
** more teaching, less babysitting / hand-writing monitoring code

---

Charlie, cofounder of Hosted Graphite, protecting your lizard brain while on-call:

* failures are very stressful at Hosted Graphite, people depend on us for their monitoring
* feedback loop: failures -> more checks -> more alerting -> more docs
* things are getting better, but...
* but failures start training you on a primitive level, that certain things are bad
* you start to learn that your phone is a source of pain and fear
* things were alright until they weren't
* panic, jumpy, stressful
* why is that the reaction?
* you need to be calm to solve the technical problem
* and most outages aren't that serious
* i have to remind myself "it's not that bad"
* but my lizard brain is fucking terrified no matter what
* if you hear an incoming text, and it isn't even your phone, and you jump, then that's not right
* just let people know that you're down, that can relieve some stress
* is that stress symbolic of something else? are you afraid of failing? your company failing?
* what are other on-call people thinking?
* i've heard the same stuff from everyone... big or small company, big or small team, one person or multiple people on-call
* having someone else on-call in front of you is helpful
* turn off all other notifications on your phone
* what can we do better? i want to talk to people about this
* what can companies do to improve mental health of those on-call?
* i'm gonna stand by the door back there and i want to talk to you