The p-value is used in the context of null hypothesis testing in order to quantify the idea of statistical significance of evidence. [2]
A common mistake is to run multiple null hypothesis tests as the data are coming in and decide to stop the test early on the first significant result. [1]
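The effect of stopping on the first significant result can be seen in a small simulation. The sketch below is illustrative only and assumes things the sources do not specify: a pooled two-proportion z-test, a 10% conversion rate in both arms (so every "significant" result is a false positive), and a peek every 100 visitors per arm. The function names are made up for this note.

```python
import math
import random


def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or all conversions): no evidence either way
    z = (conv_b / n_b - conv_a / n_a) / se
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rates(n_experiments=400, n_per_arm=1000, peek_every=100,
                         rate=0.1, alpha=0.05, seed=0):
    """Simulate A/A tests (no real difference) and return two rates:
    (rate of experiments declared significant at ANY peek,
     rate declared significant at the single final look)."""
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_experiments):
        conv_a = conv_b = 0
        stopped_early = False
        for n in range(1, n_per_arm + 1):
            # Each visitor converts with the same probability in both arms.
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if n % peek_every == 0 and not stopped_early:
                if z_test_p_value(conv_a, n, conv_b, n) < alpha:
                    stopped_early = True  # a peeker would stop and ship here
        peeking_hits += stopped_early
        fixed_hits += z_test_p_value(conv_a, n_per_arm, conv_b, n_per_arm) < alpha
    return peeking_hits / n_experiments, fixed_hits / n_experiments


peeking, fixed = false_positive_rates()
print(f"false positives with peeking: {peeking:.1%}, with one final look: {fixed:.1%}")
```

Because the final look is also one of the peeks, the peeking rate can never be lower than the fixed-sample rate; in practice it tends to be several times higher.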
If you run experiments: the best way to avoid repeated significance testing errors is to not test significance repeatedly. Decide on a sample size in advance and wait until the experiment is over before you start believing the “chance of beating original” figures that the A/B testing software gives you. [3]
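Deciding on a sample size in advance needs a baseline conversion rate, the smallest lift worth detecting, a significance level and a desired power. A minimal sketch follows, assuming the textbook normal-approximation formula for comparing two proportions; the cited articles may use a different calculator, and `sample_size_per_arm` is a name invented for this note.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p_base, p_variant, alpha=0.05, power=0.8):
    """Approximate visitors needed in EACH arm of a two-sided test
    to detect the difference between p_base and p_variant."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    p_bar = (p_base + p_variant) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p_base * (1 - p_base) + p_variant * (1 - p_variant))
    ) ** 2
    return math.ceil(numerator / (p_variant - p_base) ** 2)


# Example: baseline 10% conversion, hoping to detect a lift to 12%.
print(sample_size_per_arm(0.10, 0.12))
```

Smaller lifts need far more data: halving the detectable difference roughly quadruples the required sample size.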
Issues with the null-hypothesis method: [4]
- Even if preliminary evidence says that one version is terrible, we will keep losing conversions until we hit an arbitrary threshold.
- If we hit that threshold without having reached statistical proof, we cannot continue the experiment.
- Naive attempts to fix the former problems by using the same statistical test multiple times lead to our making far more mistakes than we are willing to accept.
A/B Split Test Significance Calculator
Bayesian A/B testing is an alternative to Student's t-test (t-distributions) and other p-value-based tests, which require large sample sizes.
- Unlike the Student's t-test, you can stop the test early if there is a clear winner, or run it for longer if you need more samples. While this is generally true, "A/B Testing with Limited Data" shows a workaround.
- Priors: represent what we believe before we run the test.
- Results are easier to interpret; p-values are confusing. Try to follow "A/B Testing with Limited Data" without your brain melting.
- "measuring the probability at time t that B is better than A (or vice versa). You can look at the data, check if the test is finished, and stop the test early if the result is highly conclusive." [5]
- You can use your current posteriors as new priors for what is essentially the start of a new test without any major interruptions in your development flow. [5] This is probably the worst thing you can do with traditional hypothesis testing.
- A Bayesian A/B test achieves the same lift as the standard procedure, but typically uses fewer data points. [5]
- Easy Evaluation of Decision Rules in Bayesian A/B testing
- A Formula for Bayesian A/B Testing
- Asymptotics of Evan Miller's Bayesian A/B formula
- Finish reading http://ewulczyn.github.io/How_Naive_AB_Testing_Goes_Wrong/
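The Bayesian points above (priors, the probability that B is better than A, and reusing posteriors as new priors) can be sketched in a few lines. This is an illustration under assumptions not taken from the sources: conversions are modelled as Bernoulli trials with a Beta prior, a uniform Beta(1, 1) prior is used, the probability is estimated by Monte Carlo sampling rather than a closed-form formula, and the counts are made up.

```python
import random


def update(prior, conversions, trials):
    """Beta(a, b) prior plus binomial data gives a Beta posterior."""
    a, b = prior
    return (a + conversions, b + trials - conversions)


def prob_b_beats_a(post_a, post_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under the two posteriors."""
    rng = random.Random(seed)
    wins = sum(
        rng.betavariate(*post_b) > rng.betavariate(*post_a)
        for _ in range(draws)
    )
    return wins / draws


prior = (1, 1)  # uniform prior: no belief about the conversion rate (assumption)

# Hypothetical data: A converts 40/500, B converts 60/500.
post_a = update(prior, 40, 500)
post_b = update(prior, 60, 500)
print(f"P(B better than A) = {prob_b_beats_a(post_a, post_b):.3f}")

# Using a posterior as the next prior: updating in two batches gives the
# same posterior as updating once with all of the data.
first_batch = update(prior, 25, 250)
both_batches = update(first_batch, 35, 250)
print(both_batches == post_b)
```

A decision rule would then compare the estimated probability against a threshold chosen up front; the threshold itself is a judgment call that this sketch does not settle.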
| [1] | http://ewulczyn.github.io/How_Naive_AB_Testing_Goes_Wrong/ |
| [2] | http://en.wikipedia.org/wiki/P-value |
| [3] | http://www.evanmiller.org/how-not-to-run-an-ab-test.html |
| [4] | http://elem.com/~btilly/ab-testing-multiple-looks/part1-rigorous.html |
| [5] | http://www.bayesianwitch.com/blog/2014/bayesian_ab_test.html |