## "Auditing all the things": The future of smarter monitoring and detection - Jen Andre ## * founder & programmer at Threatstack * premise: * 1. are you keeping a record of all processes running on your network? * 2. are you keeping a record of all hosts those processes are talking to? * if not, you are not secure * why do you want to know this information? * because you're a tinfoil hat security person * is there a reason to be this paranoid? yes, if you ever get hacked * even if you think you are secure, people are the weak links * should you care if you are hacked? * snapchat for pets: maybe not * big pharmaceutical company: yes * rest of us: it depends, but probably yes * do a risk assessment process to figure out how important this is to you * whenever a company is hacked * they all post the same message * "we got hacked but we found no evidence of really bad stuff. please reset your password as a precaution." * really? * did you look for evidence? or is that wishful thinking * do you even have any evidence? * we don't know what goes on internally * but I do know that forensics after the fact is really hard and really expensive * if you log everything ahead of time by default, this is much easier * the cloud * for security people the cloud limits visibility * old school networking: defined perimeter, harden the outside of your network, DMZs, firewalls, etc. * in the cloud this doesn't apply, there is no well defined perimeter * so you need to do continuous security monitoring * audit everything, instrument everything, keep historical records of everything (sent to a secure place) * continually improve monitoring & detection what to monitor: * systems * authentications, processes, netowrk traffic, kernel modules, file system access * intrusion detection * apps * authentications, DB requests, http logs * "active defense" * services * API calls to SaaS or cloud providers * incident response * do you know who is accessing your S3 buckets? do you have logs of that? monitoring your systems: * start at the host level * processing auditing - linux audit * network flow - libnetfilter_conntrack * login - wtmp/audit/pam_loginuid * keep everything in one 'big data' DB (e.g. elasticsearch) * write scripts to analyze this data The Linux Audit System pros: * powerful * built in to the kernel * relatively low overhead * apt-get install audit * it audits all the things, sort of * syscalls, syscalls by user, logins, etc. * doesn't include network data how does it work? kernel threads doing things -> audit messages -> kernel thread queue -> netlink socket -> userland audit daemon & tools (redhat's auditd, auditctl, etc.) -> /var/log/audit/audit.log configuration: files (watch all modifications to /etc/shadow): -w /etc/shardow -p wa syscalls (watch all kernel module changes): -a always.exit -F arch=ARCH -S init_module -S delete_module -k modules follow executable: -w /sbin/insmod -p x cons: * the logging is very obtuse * some values are a mishmash of strings, decimal integers, hex * lots of manual matching up of cryptic names and values to log lines * it can crash your box * if the auditor is slower than the rate of incoming messages, buffers fill up and stuff starts crashing * enable rate limiting to help prevent this * performance... 
cons:

* the logging is very obtuse
* some values are a mishmash of strings, decimal integers, hex
* lots of manual matching up of cryptic names and values to log lines
* it can crash your box
* if the auditor is slower than the rate of incoming messages, buffers fill up and stuff starts crashing
* enable rate limiting to help prevent this
* performance...
* one alternative is to connect directly to the auditing socket and write your own listener
* for example, we wrote a listener that emits JSON instead of the obtuse text logs
* we also wrote a luajit listener that can do super fast filtering, transformation, and alerts
* libevent + filtering + state machine parser
* reduced CPU usage from 120% to 10%, greatly increased throughput

logins:

* wtmp / "last" command
* fairly easy to parse and turn into json
* auditd also records login info
* you can configure SSH to emit login events to audit
* what about tracking "sudo su -"? how do i track those commands when anyone can become root?
* use pam_loginuid
* this adds a session ID to every audit event so you can track everything from the user login -> running commands as root

network traffic:

* src/dst ips
* src/dst ports & protocol type
* use the netfilter & conntrack systems
* netfilter = used by iptables
* conntrack = tracks connections
* turn this on: sysctl nf_conntrack_acct
* the conntrack tool will show you raw packets and byte counts, very ugly
* use libnetfilter_conntrack to emit JSON
* it's hard to directly tie a process to conntrack data
* but you can correlate using port numbers

putting it all together:

* someone logs in
* you can view all the commands they run (as their user or as root)
* you can view all their network connections
* and all this information is stored in a database that can be queried or accessed through a web interface

bonus: detection

* so i am collecting all this information now, how can i use it for detection?
* most attacks typically aren't very sophisticated
* many attacks use valid credentials (obtained through weak human targets, social engineering, malware)

what to look for:

* "is this user running commands they shouldn't be?"
* "why is a user running gcc?"
* "why is a user account running a command that only root or a system user should run?"
* "where are my users connecting from?" (china? eastern europe?)
* "what are my users connecting to?" (again, any outlying places like china, eastern europe)
* you can create simple rules for these (see the sketch below)
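As a hedged sketch of what one of those simple rules could look like once the process events are in Elasticsearch; the index name, field names (auid, comm), and command list are assumptions for illustration, not details from the talk:

```python
from elasticsearch import Elasticsearch   # elasticsearch-py 8.x style client

SUSPICIOUS = ["gcc", "cc", "wget", "curl", "insmod", "nc"]   # illustrative list

def suspicious_commands(es, index="audit-events"):
    """Return events where a non-root account ran a command ordinary users rarely need."""
    result = es.search(index=index, size=100, query={
        "bool": {
            "must": [{"terms": {"comm": SUSPICIOUS}}],
            "must_not": [{"term": {"auid": 0}}],     # auid 0 = root's login uid
        }
    })
    return [hit["_source"] for hit in result["hits"]["hits"]]

if __name__ == "__main__":
    for event in suspicious_commands(Elasticsearch("http://localhost:9200")):
        print("suspicious:", event.get("auid"), event.get("comm"))
```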
Q&A:

* Q: something about conntrack
* A: capturing raw data is very large, you need to filter; another option is to have a NAT box / router that all machines connect through and track everything there
* Q: are you saying it's ever OK to be hacked?
* A: no, but your response is different depending on what industry you're in, e.g. in the medical industry you must respond within a certain number of days and disclose the information in a certain way according to the law; hacking is only going to be more common, everyone will eventually be hacked
* Q: something about standards, are there any tools to help achieve standard compliance?
* A: (she lost her voice and couldn't continue)

## Is There An Echo In Here?: Applying Audio DSP algorithms to monitoring - Noah Kantrowitz ##

* math ahead!
* metrics have value @ a certain time
* we can put them into graphs, we look at them all day every day
* but you can also put this data into a .wav file
* have you ever seen a visualizer / EQ?
* it looks kinda like our graphs
* but they have a frequency domain
* value over time vs. value over frequency
* x axis frequency: 0Hz -> 20kHz
* y axis decibel value: +0dB -> +50dB
* you can use the fourier transform to turn (time, value) data into frequency data
* (gave the formal definition)
* sine wave
* add multiple sine waves together
* add some noise
* and this starts to look like one of our graphs in systems land
* you can convert this graph to frequency space to get the underlying components
* this reveals new information
* instead of the mathy formal definition of FT (with integrals and infinity signs, which computers are bad at)
* we use the DFT and DTFT, discrete fourier transforms
* one problem with this is that we have to do an O(N^2) calculation on the entire data set
* there is an algorithm called the Fast Fourier Transform
* which is O(N log N) instead of O(N^2)
* an IFT does the opposite process, it turns frequency data back into time series data

low-pass filter:

* say we have a series with a threshold
* and it's constantly flapping, in nagios terms
* use FFT to convert to frequency, run a low-pass filter, use IFT to get back to time series
* then apply your threshold
* this gets rid of the noise
* e.g. it allows you to catch longer term rampups instead of short term blips
* there are also high-pass filters (keep only the high frequencies) and band-pass filters (delete everything outside of a range)

windowing:

* chops off data that you aren't concerned with
* rectangular window function - very simple to implement
* need to be careful of spectral leakage when using a small window size
* which gives you "mushy" peaks, less clear signal
* triangular window function - better, but not perfect, also easy to implement
* blackman-harris window function - best result

how do you do this?

* NumPy is the one-stop shop, all of these functions are built-in (a short sketch follows below)
* FFTW for C
* go-dsp for Go
* nothing in ruby, there isn't much scientific / numeric software for ruby
* go forth and find the signals!
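The Q&A notes that no NumPy code was shown, so here is a minimal, hedged sketch of the low-pass recipe above (FFT, drop the high-frequency bins, inverse FFT, then threshold); the series, cutoff, and threshold are made up:

```python
import numpy as np

def low_pass(values, cutoff_bins):
    """FFT -> zero out the high-frequency bins -> inverse FFT back to the time domain."""
    spectrum = np.fft.rfft(values)
    spectrum[cutoff_bins:] = 0                    # drop everything above the cutoff
    return np.fft.irfft(spectrum, n=len(values))

# made-up series: a slow daily cycle (per-minute samples) plus high-frequency noise
t = np.arange(1440)
series = np.sin(2 * np.pi * t / 1440) + 0.3 * np.random.randn(len(t))
smoothed = low_pass(series, cutoff_bins=10)
alerts = smoothed > 0.8                           # apply the threshold after filtering, not before
```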
bonus content:

* discrete cosine transform (DCT)
* how most audio/video compression works
* this is why MP3 files are smaller than WAV files
* WAV stores all the raw sample data
* MP3 stores the DCT, which is much smaller to store, then uses the inverse transform to decompress
* someone, please write a metrics database that uses DCT!
* wavelets
* next generation compression systems (e.g. H264)
* someone should build something using this too
* ???
* (something i missed)
* hysteresis
* use input to predict output
* control theory
* goes hand in hand with signal analysis
* signal analysis gives you tools to analyze data, but control theory gives you tools to act on the data
* for example autoscaling
* PID control loops

Q&A:

* Q: can you demo some of the numpy code?
* A: sorry, no, it's too much to get into right now
* Q: any monitoring tools using these techniques?
* A: no! I don't know of any, nagios flap detection is a poor reinvention of the most basic form of signal analysis, but it sucks, there's a thousand years of research on this subject and nobody is reading it or implementing it!
* Q: is our data amenable to this approach? is our data really all built out of sine waves?
* A: most of the data we look at has periodic components, at the very least you have a daily cycle; and there are a lot more cycles, e.g. timeouts, response times, user activity, etc. all contribute to periodic rhythms
* Q: is your code on github?
* A: no, it's all homegrown hacky python code, not releasable yet
* Q: if we added FFT to graphite would that solve a bunch of problems?
* A: yea that'd be helpful, but it would be better in a streaming system like riemann
* Q: something about high frequency data
* A: it's the same problem as audio, audio needs to be sampled, you might need to do the same thing with your data, sample it
* Q: how do you deal with noise in data? what about the colored noises?
* A: haven't run into this much, i'm using data i know to be periodic

## A Melange of Methods for Manipulating Monitored Data - Dr Neil J. Gunther ##

* http://en.wikipedia.org/wiki/Neil_J._Gunther
* author of many books, teaches classes, workshops
* The Practical Performance Analyst
* no more plane crash analogies? (monitorama berlin joke)
* too bad, it's a useful one
* asiana flight 214
* report found that asiana pilots are too focused on instrumentation
* they didn't do basics like... look out the window
* monitoring is not about pretty pictures / graphs / tools / fancy math
* it's all about the data
* what story is the data trying to tell you?
* you need to have a consistent interpretation of data, across all the data
* how do we converge on consistency? i'll show some examples

The Greatest Scatter Plot

* (shows strip charts of metric1 and metric2)
* if we were good at looking at data the stock market would be a solved problem
* is there a relation between metric1 and metric2?
* put both sets of data into a scatter plot
* does it show anything interesting? a trend in any direction?
* linear regression
* Least Squares Fit
* LSQ fit and R^2 value (what percent of the data matches up with the model?)
* are we done now? no, this is just the beginning
* is a linear fit the best choice?
* what is the meaning of the slope?
* are you comfortable extrapolating this model into the future?
* the most important scatter plot in history
* 1929
* Edwin Hubble's plot of the distance of stars from us & their velocity
* what does the slope mean? v/r, Hubble's constant
* from this slope we can calculate the age of the universe!
* one small problem: hubble's calculation of the age of the universe (2B years) was lower than the age of the earth (3-5B)
* how did the earth get here before the universe?
* what could he do?
* (answers from the crowd: "look out the window", "fudge the data")
* well, the earth is not stationary, so he compensated for earth's velocity
* and... the data got worse!
* nonetheless, he published the data
* some thought he was crazy, it's obvious something is not right
* 70 years later, Hubble is now vindicated
* Hubble's plot was a tiny area of what we can now see
* telescopes weren't good enough in Hubble's time
* the data was wrong, but his model was correct
* lesson: treating data as divine is a sin
* i am fond of saying that all data is wrong

irregular time series:

* regular samples: like a metronome, every time has a value
* irregular samples: missing data
* you use the arithmetic mean on regular series
* you use the harmonic mean on irregular series
* with unequal intervals you need to scale the mean based on how long the intervals are between data points
* use HM on aggregate monitored data when the following apply:
* R - rate metric (y axis)
* A - something i didn't catch
* T - something i didn't catch
* E - something i didn't catch
* this doesn't come up too often in our systems

Power Laws and the Law of Words:

* Zipf's law
* plot the frequency of words in the english language
* words like "the" are many many magnitudes higher than more exotic words
* what function describes this data?
* it's hard to say from looking at the graph
* the trick is to use logarithmic axes
* check if a linear regression works on the data with logarithmic axes
* power laws imply persistent correlations that need to be explained
* what is the explanation in Zipf's case?
* the rules of english grammar require certain words to be more frequent than others
* example: DB query times
* rank by time (histogram)
* put on loglog axes
* hmm, this data looks weird now, it's not linear
* it has three different behaviors
* 1st part: power law decay
* 2nd part: exponential decay
* 3rd part: exponential decay
* is that enough?
* no, we must determine why each of those correlations fits
* example: in Australia all businesses were required to register an ABN number for tax purposes, with a hard deadline
* very similar to the healthcare.gov problems
* at the 11th hour, people rushed to finish, and the system crashed
* could that peak have been predicted?
* yes, it's complicated, but a power law can do this
* lesson: rank data by frequency (histogram) and try using log / loglog axes (see the sketch after this list)
* you can use this technique to predict spikes in noisy data
* this allows you to see a strong correlation; the explanation is more difficult
* conclusion: aim for consistency
* learn to listen to your data
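A minimal NumPy sketch of that lesson: rank the data and check how well a straight line fits on log-log axes. The counts are made up, and the residual spread is only a rough plausibility check, not a rigorous power-law fit:

```python
import numpy as np

def loglog_slope(values):
    """Rank data by magnitude and fit a line on log-log axes; a good linear fit
    suggests power-law behavior (the slope is the exponent)."""
    ranked = np.sort(np.asarray(values, dtype=float))[::-1]   # descending, like a frequency rank
    ranks = np.arange(1, len(ranked) + 1)
    slope, intercept = np.polyfit(np.log10(ranks), np.log10(ranked), 1)
    residual = np.log10(ranked) - (slope * np.log10(ranks) + intercept)
    return slope, residual.std()        # small residual spread ~ plausibly linear on loglog axes

# illustrative word-frequency-like counts, roughly proportional to 1/rank
counts = [1000, 480, 333, 260, 195, 170, 140, 120, 111, 98]
print(loglog_slope(counts))
```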
Q&A:

* Q: have you seen people fudging data in the operations world?
* A: physicists are notorious for this, i haven't seen it as much in the operations world, i have been guilty of ignoring or overlooking strange noises or inconsistencies; also, be careful of making really complicated models (unless you know what you're doing); at some point you may feel a conviction about your model like Hubble did, and Hubble was correct in the end; the important question for science is "how do I convince myself this model is true?", use this approach when making your models; look at Einstein's first 5 papers, everything is written in a way that anyone can understand, using very broad statements, then he gradually narrows down and paints you into a corner of accepting his claim, and these were outrageous claims at the time; as simple as possible but no simpler; and this is now a rambling answer but it was fun to give
* Q: hubble's estimate was wrong because his data wasn't accurate, it seems in our world that our measurements are very accurate, does that change our approach?
* A: so, do we need to do something differently from Hubble? i'm fond of saying that all measurements are wrong, you don't have his exact problem, but you should never trust the data, you can have a completely accurate measurement of the wrong thing (relays an anecdote about LHC measurements that were accurate to 6-sigma, but a 50 cent connector was not attached properly, so the data was super accurate garbage that was misleading people)
* Q: a comment - we can measure time accurately in computing, but most data in operations is very inaccurate and noisy
* Q: another comment - i'm struggling with eventual consistency of the cloud, as such you have to deal with eventual consistency, even in your monitoring
* A: sure, that's a different concept, but yes, if you're using a distributed system, the "consistency" of your models will have to take these distributed computing problems into account
* Q: in your last example with the power laws, you found the peak after the fact, does it work ahead of time?
* A: yes, you can construct a power law prediction, it's not always correct, but it's another tool, requires more math
* Q: would human behavior play into your prediction? i.e. you're counting on people to wait to the last minute?
* A: no, i might point to human behavior as the explanation, but the prediction does not depend on that fact

## The Final Crontab - Selena Deckelmann ##

* works at Mozilla on the Socorro team
* Socorro is a crash reporting system
* about:crashes
* click on a crash there and it takes you to socorro's web interface
* crash reports from users are fun to read (shows some funny quotes and http://lqbs.fr/suchcomments/)
* (showed some diagrams of the system architecture)
* postgres is central to the system
* it's the main architectural element
* background tasks are also important

so, what is the final crontab?

*/5 * * * * socorro /usr/bin/crontabber

* our old cron jobs had no tests
* but they were so critical to our systems
* everything was special shell scripts
* jobs would kick off postgres stored procedures that would break if run twice and are very hard to debug
* email from cron
* everyone has this problem
* worst month: 22k emails sent from cron
* crontabber saved us from a lot of these problems
* cron emails are a security blanket that we no longer need
* use nagios/sentry instead
* what's cron good for? it runs jobs on a predictable schedule

how socorro uses cron:

* reports
* postgres materialized views
* status logging
* jobs that don't fit into a queue system because of dependencies, complexity, etc.
* github.com/mozilla/crontabber
* pip install crontabber

here's what our jobs look like:

socorro.cron.jobs.matviews.ProductionVersionsCronApp|1d|02:00
...dozens of lines like this...

* everything is a python class with a run method (see the schematic sketch below)
* shared code (e.g. transactions, setup, teardown) is shared across jobs using decorators
* jobs have a frequency ("1d") and start time ("02:00"), and the job code contains metadata like dependencies
* uses configman (github.com/mozilla/configman) for parsing command line args vs. config files
* github.com/mozilla/socorro/blob/master/config/crontabber.ini-dist
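As a schematic illustration of "a python class with a run method", the sketch below shows the general shape of such a job; the base class, attribute names, and SQL are stand-ins, not crontabber's actual API (see github.com/mozilla/crontabber for the real thing):

```python
# Schematic sketch only -- crontabber's real base classes, decorators, and config
# hooks live at github.com/mozilla/crontabber and differ in the details.
class BaseCronApp:
    """Stand-in for the framework base class (shared setup/teardown, retries, state)."""
    def main(self, connection):
        self.run(connection)

class ProductVersionsCronApp(BaseCronApp):
    app_name = "product-versions"        # referenced from the config line ("...CronApp|1d|02:00")
    depends_on = ("raw-crash-import",)   # hypothetical dependency; this job waits for it to succeed

    def run(self, connection):
        # the body is plain python, so it's easy to unit test with a fake connection
        connection.execute("SELECT update_product_versions()")
```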
what do i like about this system?

* no more shell scripts, that's the main thing, huge improvement
* easier to write & test
* automatic retries on failure
* jobs wait on their dependencies to run (including when a dependency fails)
* dependencies are documented in the code, which automatically builds a visualization of job flow
* automated nagios alerts, including sending triggered exceptions to IRC, no more email alerts
* configurable number of failures before CRITICAL
* unit test framework for jobs

problems:

* configs are a bit complex
* one-off runs aren't simple (stored procedures are designed to only run once per day)
* no parallel execution yet, jobs are run linearly in dependency order, one possible solution:

*/5 * * * * crontabber --conf=/etc/cron1.ini
*/5 * * * * crontabber --conf=/etc/cron2.ini
*/5 * * * * crontabber --conf=/etc/cron3.ini

* yea... we're not going there again :)
* depends on python 2.6 or higher and postgres 9.2 or higher

Q&A:

* Q: no question but just want to say that it looks awesome
* A: thanks!
* Q: have you had problems with circular dependencies?
* A: not sure, we only have 4 levels of dependencies, so i don't think we've run into that yet
* Q: how is the JSON postgres performance?
* A: awesome, document size per row is tiny, main write DB is 1.5TB, half of that is probably JSON, way faster than hadoop, 1 hour for a hadoop query -> 10 minutes for the same query in postgres
* Q: you're trying to get rid of shell scripts, did you rewrite them in python or wrap them in python?
* A: rewrite in python, bash is OK to start, but gets too crufty
* Q: did you look at pgAgent? (job scheduling agent for postgres)
* A: no, we didn't look at that
* Q: can it do cross-node dependencies?
* A: what do you mean?
* Q: like if a job on machineA depends on a job on machineB?
* A: no... right now it only runs on one machine
* Q: is there a reason you didn't look into marathon or chronos for distributed cron?
* A: we didn't need a distributed tool, crontabber is more about the framework for jobs, and all these jobs seemed pretty critical to the product so we wrote our own system to handle them
* Q: do you handle timeouts & stuck jobs?
* A: timeouts are built into the jobs themselves when necessary
* Q: how do you determine what jobs are currently running? any visualization?
* A: no visualization, but that info is in the crontabber logs

## This One Weird Time-Series Math Trick - Baron Schwartz ##

* more math...
* this was going to be about math, but other people already covered it!
* works at VividCortex - New Relic for the database
* formerly worked at Percona
* author of: High Performance MySQL & Web Operations
* "anomalies" vs. "typical data"
* anomaly = not typical

my worldview:

* monitoring tools are not enough
* monitoring = healthchecks, metrics, graphs
* we need performance management
* work-getting-done is the top priority
* we need more than recipes or functions to grab and apply, we need to know the right techniques to use
* fault detection = work is not getting done, true/false
* anomaly detection = something is not normal, uses probability & statistics
* just because something is anomalous doesn't mean it's bad

what is the holy grail?

* determine normal behavior
* predict how metrics "should" behave
* quantify deviations from prediction
* do useful stuff with that data
* at 1 second resolution, your systems are anomalous all the time
* that holy grail is very practical, too practical for this talk
* sometimes i want to do something fun
* like use fun math
* high level math is difficult to do at scale, it's better suited to academic papers
* timeseries metrics are not always best displayed in strip charts
* how many of you know these statistical / probability methods? (shows big list of methods)
* how many of you have used the Kolmogorov-Smirnov test? (mentioned in Toufic's talk)
* how many of you know these descriptive statistics methods? (wikipedia page on descriptive stats)
* i don't know any of these
* but basic statistics is good for quite a bit
* learn the simplest, most effective approaches first
* advanced stuff is there if you need it
* you don't need a PhD to do this
* spectrum of metrics analysis: turd polishing <-------- sweet spot --------> lily gilding
* anomaly detection
* anomaly -> deviation -> forecast/prediction -> central tendency/trend -> characterization of historical data
* these are all separate problems with different techniques
* dumb systems don't produce good results
* if a system is getting work done, it's not faulty, no matter what a fancy technique says

control charts

* draw lines for 3 sigmas
* is the process within normal limits?
* control charts assume a stationary mean
* most data is not normally distributed
* lots of problems at smaller time scales

first idea: moving averages

* gives us a moving control chart
* somewhat expensive to compute
* current values are influenced by values in the past
* a spike in data causes an inverse spike in the sigma values once that spike drops out of the window

exponential moving averages

* more biased to recent history
* cheaper to compute, only need to remember one value at each step and apply a decay factor
* EWMA is a form of a low-pass filter
* we can do the same thing we did earlier and make EWMA control charts (a short sketch follows below)
* which is a little better than moving average control charts or plain control charts
* one place where EWMA falls down is trends
* the EWMA lags behind the actual trend

double exponential smoothing

* tries to solve the lagging by adding a prediction
* once you do this, the alpha and beta factors become very sensitive
* it's easy to way undershoot or overshoot the trend
* holt-winters forecasting
* DES plus seasonal indexes
* more complex, slow to train, previous anomalies start getting built into the predictions
* MACD - moving average convergence-divergence
* comes from the finance world
* finance is probably the most advanced application of these techniques, look there for inspiration
* seems to be the most accurate
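A minimal sketch of the EWMA control-chart idea from the exponential moving averages section above; alpha, the sigma multiplier, the warm-up length, and the sample points are arbitrary illustrative choices:

```python
def ewma_control_chart(values, alpha=0.3, nsigma=3.0, warmup=3):
    """Yield (value, lower, upper) bands from an exponentially weighted mean and variance."""
    mean, var = float(values[0]), 0.0
    for i, x in enumerate(values):
        band = nsigma * var ** 0.5
        if i >= warmup:                     # let the mean/variance settle before emitting bands
            yield x, mean - band, mean + band
        diff = x - mean
        mean += alpha * diff                                # exponentially weighted mean
        var = (1 - alpha) * (var + alpha * diff * diff)     # exponentially weighted variance

points = [52, 48, 50, 51, 49, 53, 47, 90, 50, 52]
for x, lo, hi in ewma_control_chart(points):
    if not lo <= x <= hi:
        print("out of control:", x)          # flags the 90, then the bands widen again
```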
Q&A:

* Q: what happens when you subtract current timeseries data from the previous week's data?
* A: yea, i've tried that sort of thing, this is similar to holt-winters; what happens if you had an outage last week? then you will be predicting an outage next week; also, is a week the right period? should you combine weekly/daily/hourly? should you use multiple "seasons" (i.e. if using weekly data, use 3 weeks in the past)?

## The Lifecycle of an Outage - Scott Sanders ##

* operations at github
* tools + process = confidence
* take any business metric and multiply it by your downtime
* while you have downtime, you have no registrations, no revenue, etc.
* human error is not random, it is systematically connected to people, tools, tasks, and the operating environment

triggers:

* detection & notification of a problem, get a human involved
* alert fatigue is real
* people tune out notifications
* human fatigue is also a problem
* if you are paged in the middle of the night
* keep shifts as short as possible, right now github has 24 hour shifts
* simplify overrides and give them out freely
* be persistent, don't page every 15 minutes, page every 60 seconds until a problem is ack'ed
* escalate quickly, don't let a dead battery cause your downtime to go on longer
* be loud
* create handoff reports for every on-call shift, spot trends
* github has a chat command called "handoff" which generates a report & graphs of all incidents during an on-call shift

initial response:

* establish command & identify severity, quickly
* graphs are a great way to determine severity
* chat bots are a great way to signal to both systems & teammates what is happening during an incident

github's monitoring stack:

* graphite, 175k updates/sec
* collectd (system level metrics), 1200 metrics per host
* statsd (app level metrics), 4 million events/sec
* and... sFlow, SNMP, HTTP, etc.
* logging: scrolls, splunk, syslog-ng
* 1TB of logs indexed per day
* special purpose monitoring directly covers business concerns
* we don't consider a tool production ready until we can interact with it via chat
* because that interface fits our culture
* you should do the same for your culture
* accept the processes that emerge and adapt your tools to augment those processes
* don't force your team into processes

corrective action

* collective knowledge & feedback loops
* real example: last year, github was hit by a string of DDOS attacks

hubot: nagios critical - ddos detected via splunk search (this also generates a github issue with the check result and a link to the DDoS-mitigation.md playbook)
tmm1: oh?
tmm1: /arbor graph -1h @application
hubot:
tmm1: /pager me incoming ddos
tmm1: ...more steps to determine what's happening...
other people join in
jssjr: going to enable protection now
jssjr: /shields enable w.x.y.z/24
hubot: please respond with the magic word, today's word is knight
jssjr: /shields enable w.x.y.z/24 knight
jssjr: /graph me -1h @network.border.cp1.in
hubot:

* playbooks are awesome
* they allow you to distribute knowledge
* as you come across a new problem or missing knowledge, add more to your documentation
* tools make software less horrible
* nobody should have to know everything about your entire infrastructure
* make things safe for your less experienced engineers

create issues for postmortems

* dedicate a repository for postmortems, for github this private repo is: github/availability
* identify problems
* involve many people
* propose solutions
* some incidents require a public postmortem to be released the same day
* but the private postmortem can stay open for weeks, to make sure we got it right and are completely satisfied the issue is fixed
* this is how we close the loop on outages and make progress towards prevention
* for example, some improvements for DDoS are: automatic mitigation, better monitoring, etc.
* study the lifecycle of your outages
* tools are complementary to your process, not the other way around
* communication is the cornerstone of incident management
* tools & process enable confidence
* never stop iterating

Q&A:

* Q: do you have problems with availability of your tools during outages?
* A: absolutely, for example we keep the playbooks off-site and on-site to make sure they're always available
* Q: you mentioned a huge graphite instance, what backend are you using? i don't think whisper would work?
* A: we are using whisper
* Q: tell us about the "shields up" command, what does it do? does it get logged somewhere?
* A: well, our chat is logged, that gives us the timeline
* Q: if you're fixing an outage and you need to clone something from github, what do you do?
* A: ha ha, well, we work very hard to make sure that doesn't happen

## A whirlwind tour of Etsy's monitoring stack - Daniel Schauenberg ##

* software engineer on the infrastructure team @ etsy
* 25 million members
* 18 million items listed
* 60 million monthly visitors
* 1.5 billion page views per month
* all with a single monolithic PHP app
* master-master mysql
* we have some smaller services in java
* and the image service is not in PHP
* we deploy a lot
* the actual number doesn't matter much
* what matters is how comfortable are you deploying a change right now?
* when you start at etsy the first thing you do is deploy the site (team section)
* and then you watch the graphs
* what are in the graphs?
ganglia: * system level metrics, everything specific to a node (requests per second, jobs queued, CPU, memory, etc.) * one instance per DC/environment * 220k RRD files * fully configured through chef roles * automatically runs all files in a certain directory to generate these stats StatsD: * single instance, one server * traffic mostly comes from 70 web servers & 24 API servers * heavily sampled (10%) * graphite as backend graphite: * application level metrics (not system level) * 2 machines: 96G RAM, 20 cores, 7.3T SSD RAID 10 * 500k metrics per minute * mirrored master/master setup * sharded setup, 7 relays running per box, replicating data to the other server * the sharded setup also helps isolate problems (when something blows up, only one of the two servers is affected) * things to monitor when running graphite: * disk writes, disk reads, # of keys being written, # of values being written, cache vs. relay stats * don't monitor graphite with graphite * we monitor graphite with ganglia syslog-ng: * web, search, gearman, photos, nagios, network, vpn * 1.2GB of logs written / minute * fully configured via chef roles (to determine which log files to send for a node) * rule ordering is important * syslog boxes also run a web frontend called supergrep which is a node.js app that basically runs "tail -f *.log | grep ..." over the web * syslog boxes also run etsy/logster * extracts metrics from log files * written in python * runs once per minute via cron splunk: * supergrep only shows the last ~1 minute of data, how about longer? * splunk indexes all your log files * easy & powerful search syntax * saved searches * glorified grep logstash: * experiment to replace splunk * easier to integrate with * easy to set up in dev environment (can't do this with splunk) * can logstash give our developers more insight while they are developing? eventinator: * tracks all events in the infrastructure * chef runs & changes * DNS changes * network changes * deploys * server provisioning and decommissioning (we use dedicated hardware, no cloud) * 12 million events in the last 2 years * originally stored in one mysql table, now using elasticsearch (free search) chef: * everything is configured with chef * same cookbooks in dev & prod * every node runs chef every 10 minutes * tons of custom knife plugins & handlers * we use spork for our workflow, which notifies IRC of changes / promotions, also kicks off a CI build * mentioned git repo vs. chef server being out of sync * "knife node lastrun web0200.ny4.etsy.com" * 120 recipes successfully run in 20 seconds * there's also a handler for failures, chef failures are automatically sent to a pastebin and posted in chat nagios: * raise your hand if you have a strong feeling about nagios (everyone raised their hand) * raise your other hand if that feeling is love (only a few people) * well, too bad for most of you, computers don't care about your emotions * nagios works really well for us * 2 instances per DC/environment * we use nagdash to aggregate results across all instances, our main view of the world * interact via IRC, set downtime, see check results * used to have a manual deploy process (ssh into box, etc.) * why do that? we have a good way to test & deploy software * now they have a real deployment process, real CI process * feels just like working on the web app, that's a good thing nagios herald: * adds context to nagios alerts * what are the first 5 things you do when you get paged? 
* you already have your phone in your hand, wouldn't it be great to get this information in the alert? * now our alert emails contain graphs, tables, output of shell commands, alert thresholds, alert frequency (# of times alert has been triggered in the past 7 days) * this is awesome, on-call is so much better now ops weekly: * we have weekly rotations * at the end of your shift, you are given a survey * you have to specify which alerts were actionable, which were ignorable * # of pages during sleep vs. awake time * amount of time kept awake by alerts * can also scrape data from fitbit to get actual sleep times * and these results are discussed at the weekly ops meeting summary: * use a set of trusted tools * enhance tools when they come up short * keep trying new things * write your own tools where applicable See our blog, github, and other talks for more detail. Q&A: * Q: how do you feel about kale? * A: kale is our anomaly detection stack, it's still an experiment, we're trying to figure out how and where to use it, it was recently broken by a graphite upgrade * Q: how self-service is your nagios setup? do you provide tools for devs to build monitoring? * A: not very self-service, still need to write your own checks & configs, but every team has an ops person, and all those people are excited about writing checks that make developers lives better * Q: elaborate on logstash & elasticsearch? * A: right now it's an experiment, also using kibana, side-by-side with splunk, what parts of splunk work better in logstash? how useful is it for developers in their dev environment? those are the main points * Q: how many syslog servers? do you split the logs between multiple hosts for performance reasons? * A: two, and I think they both get the same data for redundancy purposes ## Wiff: The Wayfair Network Sniffer - Dan Rowe ## * wayfair.com * leads the infrastructure tools team at Wayfair * two sub-teams: internal tools (customers are employees) and dev tools (customers are engineers) * wayfair is an online retailer * 7 million products * 16 million visitors per month * a lot of these kind of presentations someone presents a homegrown tool and everyone is like * "why did you do it that way? why didn't you use X?" * i'm going to try to cover those questions ahead of time our setup: * active/active DC setup * main sites -> loadbalancer -> PHP web server farm * java / ASP.net for other stuff logging overview: * syslog, app log, network traffic, commits * logstash * elasticsearch * kibana, dashboards, graphite, zabbix, ad hoc querying & alerting what is wiff? * out of band traffic sniffer and analyzer * wireshark as a service * packet processing pipeline * feed in packets -> process -> output -> report / analyze -> profit how do you feed in the packets? * wireshark / NIC level * pcap files (ring buffer or tcpdump files) * rabbit mq * once you feed in the packets, configure which protocols, ports, etc. 
you are interested in * currently HTTP, HTTPS (needs private keys to decrypt, take care not to log the request/response bodies anywhere..), and TCP are supported * showed a typical HTTP processing workflow (big diagram) * reporters output the data somewhere * JSON, elasticsearch, rabbitmq * wiff is the beginning of the pipeline * we have some example kibana queries to get started with * once it's in elasticsearch it's up to you to do the analysis * alerting: doesn't exist yet, want to build an alerting system for ES pessimism: * if we already have web server logs and application logs, why do we need this? * this is just another vantage point to gather this data * it's a companion tool * where does it fit? * you tell me, it can track both inbound & outbound traffic * it can spot problems before the request hits a given layer * what if your LB or webserver is misconfigured? * what if the request never reaches where you expect it to reach? * what if your server segfaults? * can spot problems that don't show up in logs * real world example: Set-Cookie was being specified multiple times per response, but their logging was only showing it as set once * because it's out of band, it doesn't matter if it crashes, it doens't matter if it goes down * it doesn't require you to make changes to your application * very little performance overhead * (i think all of these arguments apply to using plain old tcpdump?) * MOAWSL: mother of all web server logs * we have this layer that aggregates all web requests in a single log file, standard format * but if you didn't have this layer, wiff could be used to do that other benefits: * runs on windows * can be used to watch network traffic of proprietary / third party software * packet RTT * obtain network timing information * call frequency (how often is this web API getting called?) * showed screenshots of command line tool & kibana dashboard todo: * improve SSL decryption performance (do it in the background) * better reporting notes: * needs some monitoring * watch for dropped packets, un-stitchable requests * no support for SPDY or websockets * YMMV, it works for us, not used by anyone else yet github.com/wayfair/wiff Q&A: * Q: do you instrument wiff before & after the load balancer? to track requests through the system? * A: uhh we can see the source/destination and track them that way, but that isn't done automatically * Q: anything on the roadmap for SIP traffic? * A: no, but we have a big call center, i can see it being useful there * Q: what is the throughput? * A: we have 10G NICs, it's only using ~1G in testing, depends on tcpdump buffer settings and how much your NIC can handle ## Web performance observability - Mike McLane & Joseph Crim ## * work at Godaddy * we went full prezi, so bring some dramamine * measure performance * is it good enough? * if not, look for bottlenecks * how are people using our hosting? * setting up blogs, PHP apps * what are the common use cases? * know your customer * so... lots of PHP benchmarks * wordpress, joomla, drupal * response time is very important for your customers and their customers * people leave and/or complain when things are slow * imagine loading screens in video games, nobody likes loading screens * google has shown that page load time has a direct impact on how likely a person is to make a purchase * google ranks your site based on the load time webrockit: * webrockit is our performance testing stack * how long does page load time take in a real browser? 
* data collected has to be real, match up with real users' experience * it needs to be understandable by our internal users * webrockit uses headless browsers to calculate page load time * time to first byte * number of assets * time to complete loading assets * 100 different stats related to page load time why not use a commercial offering? * too expensive for the amount of traffic we want to pump through * data resolution wasn't good enough * didn't include all the stats we wanted * we wanted to feed data into graphite * no commercial offering gave us all the features we wanted how about open source? * similar to commercial offerings * we looked at: casperjs, selenium, watir, ghost.py * none of them had all the parts we wanted * so we decided to build our own and open source it * working prototype in 3 days * using phantomjs, wraps headless webkit with an API * and it was spot on with how real browsers work, gave accurate measurements * the API lets you do some cool stuff like overriding which IP to use for host * and exposes all the internal timing / metrics in the browser example: * let's say we want to benchmark changes across changes in our app * let's use a standard LAMP stack, running wordpress, using stock versions of everything * no optimization ahead of time * let's point webrockit at it * start by focusing on time to first byte * test #1: enable compression * this made time to first byte slightly worse * that's useful to know * test #2: switch from modphp to fastcgi + phpfpm * no speed change, but more stable looking graphs * test #3: enable APC * APC is an opcode cache for PHP, so source doesn't need to be compiled for each request * gave a great improvement in response time * test #4: upgrade package versions * php 5.3 to 5.5, apache 2.2 to 2.4, fastcgi -> modproxyfcgi * another good improvement The end result is that we had a nice workflow for testing and iterating on performance changes. how does webrockit work? * we decided to use sensu * which is normally used for monitoring * but had all the basic pieces we needed for building a performance testing suite * we wanted the design to be API-first, REST API * written in jruby & sinatra (jruby = easier deployment) * users Riak for main source of truth, storing results * the data structures used are really simple, would be easy to port to other data stores * checksync API, webrockit API -> write checks to disk for sensu * all metrics go into graphite web UI: * uses rails * set up a poller, e.g.: AWS east & west, digital ocean, internal network, etc. * then set up a check: name, run interval, which poller to use, URL, ip address override (to skip DNS lookup) * you can view a queue of all the jobs, each job has some debugging info in case there's a problem * wait for the job to run for a while then you can view results * graphite dashboards (high level overview of a few metrics) * cubism graphs (condensed strip charts, very easy to spotcheck) * explorer view (drill down into those 100 different finegrained metrics, add multiple targets to a graph to visualize better) future: * virtualization * introduce packet loss / traffic shaping / bandwidth limits / TCP level network tweaks * better analysis (see all the previous talks on math & anomaly detection) * heatmaps * events & errors (200 expected and now it's 404 or 301, page size drastically changed, etc.) * better dashboards, what is the state of the art? 
can we use those systems or feed into them?
* better debian support (we're a RH/centos/fedora shop)
* real configuration management (we are both a puppet & chef shop, which drew applause from the crowd; they are using bash scripts to install everything right now)

sound interesting?

* twitter.com/webrockit
* webrockit.io
* https://github.com/WebRockit