A/B Testing

Null-hypothesis tests

In null-hypothesis testing, the p-value quantifies the statistical significance of evidence: it is the probability of seeing a result at least as extreme as the one observed, assuming the null hypothesis is true. [2]
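As a rough illustration of how such a p-value can be computed for conversion data, here is a minimal sketch of a two-sided, two-proportion z-test. The function name and arguments are made up for this note, and the normal approximation it relies on is an assumption that only holds reasonably well for larger samples.

```python
import math


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for H0: both variants share one conversion rate.

    Hypothetical helper for these notes; uses the normal approximation.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B are identical.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        # No variation at all (every visitor converted, or none did).
        return 1.0
    z = (p_b - p_a) / std_err
    # Two-sided tail probability of a standard normal: P(|Z| >= |z|).
    return math.erfc(abs(z) / math.sqrt(2))


# Example: 100/1000 conversions for A versus 150/1000 for B.
print(two_proportion_p_value(100, 1000, 150, 1000))
```

A small p-value says the observed difference would be unusual if A and B really converted at the same rate. It does not say how likely it is that B is better.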

A common mistake is to run multiple null-hypothesis tests as the data come in and to stop the test early on the first significant result. [1]

If you run experiments, the best way to avoid repeated significance testing errors is to not test significance repeatedly. Decide on a sample size in advance and wait until the experiment is over before you start believing the “chance of beating original” figures that the A/B testing software gives you. [3]

Sample Size Calculator
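For deciding on a sample size in advance, a calculator like the one linked above typically uses something close to the standard normal-approximation formula for comparing two proportions. The sketch below assumes that formula; the function name, the absolute-lift parameter, and the defaults (5% significance, 80% power) are choices made for this note, not taken from any of the references.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, min_detectable_lift,
                            alpha=0.05, power=0.8):
    """Approximate visitors needed per variant (hypothetical helper).

    min_detectable_lift is an absolute difference, e.g. 0.02 for
    "10% -> 12%", not a relative percentage.
    """
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_lift
    # Critical values for a two-sided test and for the desired power.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


# Example: baseline conversion of 10%, hoping to detect a move to 12%.
print(sample_size_per_variant(0.10, 0.02))
```

Halving the lift you want to detect roughly quadruples the required sample, which is why small improvements need long experiments.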

Issues with the null-hypothesis method: [4]

Even if preliminary evidence says that one version is terrible, we will keep losing conversions until we hit an arbitrary threshold.
If we hit that threshold without having reached statistical proof, we cannot continue the experiment.
Naive attempts to fix these problems by running the same statistical test multiple times lead to far more mistakes than we are willing to accept.
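The last point can be checked with a small simulation. The sketch below runs A/A tests, where both variants share the same true conversion rate and every "significant" result is therefore a false positive. It compares peeking every few visitors against testing once at the end. All names, the 10% conversion rate, and the look schedule are assumptions chosen for illustration.

```python
import math
import random


def is_significant(conv_a, conv_b, n, z_crit=1.96):
    """Two-proportion z-test with equal arm sizes n (about alpha = 0.05)."""
    pooled = (conv_a + conv_b) / (2 * n)
    std_err = math.sqrt(pooled * (1 - pooled) * 2 / n)
    if std_err == 0:
        return False
    return abs(conv_b - conv_a) / n / std_err > z_crit


def false_positive_rates(n_sims=400, n_per_arm=1000, look_every=100,
                         rate=0.1, seed=0):
    """Return (rate with peeking, rate with one final test) for A/A tests."""
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_sims):
        conv_a = conv_b = 0
        ever_significant = False
        for i in range(1, n_per_arm + 1):
            # Both arms convert at the same true rate: no real difference.
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            # Peek at regular intervals, as an impatient tester would.
            if i % look_every == 0 and is_significant(conv_a, conv_b, i):
                ever_significant = True
        peeking_hits += ever_significant
        fixed_hits += is_significant(conv_a, conv_b, n_per_arm)
    return peeking_hits / n_sims, fixed_hits / n_sims


peeking, fixed = false_positive_rates()
print(f"false positives with peeking: {peeking:.3f}")
print(f"false positives with one final test: {fixed:.3f}")
```

Because the final look is one of the peeks, the peeking rate can never be lower than the single-test rate, and in practice it tends to come out several times higher.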

A/B Split Test Significance Calculator

Bayesian A/B testing

Bayesian A/B testing is an alternative to Student's t-test (t-distributions) and other p-value-based methods, which require large sample sizes.

Unlike with Student's t-test, you can stop the test early if there is a clear winner, or run it for longer if you need more samples. While this is generally true, "A/B Testing with Limited Data" shows a workaround.

bayesian_ab_test.py

priors: represent what we believe before we run the test
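For conversion rates, a common way to encode such a belief is a Beta distribution, which is convenient because observing conversions simply adds to its two parameters. The sketch below assumes a Beta prior and a uniform Beta(1, 1) starting point. These are illustrative choices for this note and are not necessarily what bayesian_ab_test.py does.

```python
def update_beta(prior, conversions, visitors):
    """Return the Beta posterior after observing binomial data.

    prior is an (alpha, beta) pair; hypothetical helper for these notes.
    """
    alpha, beta = prior
    return (alpha + conversions, beta + visitors - conversions)


def beta_mean(params):
    """Expected conversion rate under a Beta(alpha, beta) belief."""
    alpha, beta = params
    return alpha / (alpha + beta)


# Beta(1, 1) is uniform: every conversion rate is considered equally likely.
uniform_prior = (1, 1)
posterior = update_beta(uniform_prior, conversions=12, visitors=100)
print(posterior, beta_mean(posterior))

# Updating in two batches gives the same belief as updating all at once.
later = update_beta(posterior, conversions=8, visitors=100)
print(later == update_beta(uniform_prior, conversions=20, visitors=200))
```

A more opinionated prior, for example Beta(10, 90) for "probably around 10%", behaves like extra visitors already observed before the test begins.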

Advantages of Bayesian Testing

Results are easier to interpret; p-values are confusing. Try to follow "A/B Testing with Limited Data" without your brain melting.
"measuring the probability at time t that B is better than A (or vice versa). You can look at the data, check if the test is finished, and stop the test early if the result is highly conclusive." [5]
You can use your current posteriors as new priors for what is essentially the start of a new test without any major interruptions in your development flow. [5] This is probably the worst thing you can do with traditional hypothesis testing.
A Bayesian A/B test achieves the same lift as the standard procedure, but typically uses fewer data points. [5]
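The "probability that B is better than A" mentioned above can be estimated by drawing samples from each variant's posterior and counting how often B's sampled rate is higher. This is a minimal Monte Carlo sketch assuming Beta posteriors with a uniform prior. The function name, the number of draws, and the example counts are invented for illustration and are not taken from [5].

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, prior=(1, 1),
                   draws=20000, seed=0):
    """Estimate P(rate_B > rate_A) from Beta posteriors by sampling."""
    rng = random.Random(seed)
    a0, b0 = prior
    wins = 0
    for _ in range(draws):
        # One plausible conversion rate for each variant, given the data.
        rate_a = rng.betavariate(a0 + conv_a, b0 + n_a - conv_a)
        rate_b = rng.betavariate(a0 + conv_b, b0 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Example: 100/1000 conversions for A versus 150/1000 for B.
print(prob_b_beats_a(100, 1000, 150, 1000))
```

What counts as "highly conclusive" is a judgment call. A threshold such as 0.95 is one possible choice, and it trades the speed of a decision against the risk of picking the wrong variant.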

TODO Re-learn Calculus

[1]	http://ewulczyn.github.io/How_Naive_AB_Testing_Goes_Wrong/

[2]	http://en.wikipedia.org/wiki/P-value

[3]	http://www.evanmiller.org/how-not-to-run-an-ab-test.html

[4]	http://elem.com/~btilly/ab-testing-multiple-looks/part1-rigorous.html

[5]	http://www.bayesianwitch.com/blog/2014/bayesian_ab_test.html

tomleo/ab-testing-notes.rst
