
@tomleo
Last active March 23, 2022 10:25

A/B Testing

Null-hypothesis tests

The p-value is used in the context of null hypothesis testing in order to quantify the idea of statistical significance of evidence. [2]
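As a rough illustration of what such a p-value looks like for conversion data, here is a minimal sketch of a pooled two-proportion z-test. The function name and the example counts are made up for illustration, and the normal approximation is an assumption that only holds for reasonably large samples; none of this comes from the references.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for H0: both variants share one conversion rate.

    Hypothetical helper: pooled two-proportion z-test (normal approximation).
    """
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # No variation at all (everyone or no one converted): no evidence either way.
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Made-up counts: 100/1000 conversions for A, 130/1000 for B.
p = two_proportion_p_value(100, 1000, 130, 1000)
print(f"p-value: {p:.4f}")
```

A small p-value says the observed difference would be unusual if A and B truly converted at the same rate. It does not say how likely it is that B is better than A.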

A common mistake is to run multiple null hypothesis tests as the data are coming in and decide to stop the test early on the first significant result. [1]

If you run experiments: the best way to avoid repeated significance testing errors is to not test significance repeatedly. Decide on a sample size in advance and wait until the experiment is over before you start believing the “chance of beating original” figures that the A/B testing software gives you. [3]
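To make "decide on a sample size in advance" concrete, here is a minimal sketch using a standard normal-approximation formula for comparing two proportions. The function name, the default significance level (5%) and power (80%), and the example rates are assumptions for illustration; online calculators may use slightly different formulas and give somewhat different numbers.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, min_effect, alpha=0.05, power=0.8):
    """Visitors needed in EACH variant to detect an absolute lift of min_effect.

    Hypothetical helper: two-sided test, normal approximation, unpooled variance.
    """
    p1 = baseline
    p2 = baseline + min_effect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / min_effect ** 2)


# Made-up scenario: 10% baseline conversion, want to detect a lift to 12%.
n = sample_size_per_variant(baseline=0.10, min_effect=0.02)
print(f"visitors needed per variant: {n}")
```

The required sample grows quickly as the effect you want to detect shrinks, since the minimum effect appears squared in the denominator.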

Sample Size Calculator

Issues with null-hypothesis method: [4]

  • Even if preliminary evidence says that one version is terrible, we will keep losing conversions until we hit an arbitrary threshold.
  • If we hit that threshold without having reached statistical proof, we cannot continue the experiment.
  • Naive attempts to fix the former problems by using the same statistical test multiple times lead to our making far more mistakes than we are willing to accept.
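The last point can be checked with a small simulation. The sketch below runs A/A tests (both variants have the same true conversion rate, so every "significant" result is a false positive) and compares stopping at the first significant peek against testing once at the end. The run counts, peek interval, conversion rate, and seed are arbitrary assumptions, and the z-test helper is the same normal approximation as before, not code from [4].

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided pooled two-proportion z-test (normal approximation)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa_tests(runs=500, visitors=1000, peek_every=100,
                      rate=0.2, alpha=0.05, seed=0):
    """Return (false-positive rate with peeking, false-positive rate with one final test)."""
    rng = random.Random(seed)
    peeking_hits = 0
    final_hits = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        significant_at_some_peek = False
        for n in range(1, visitors + 1):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            # A peeker would stop and declare a winner at the first p < alpha.
            if n % peek_every == 0 and p_value(conv_a, n, conv_b, n) < alpha:
                significant_at_some_peek = True
        peeking_hits += significant_at_some_peek
        final_hits += p_value(conv_a, visitors, conv_b, visitors) < alpha
    return peeking_hits / runs, final_hits / runs


peeking_rate, fixed_rate = simulate_aa_tests()
print(f"false positives when peeking every 100 visitors: {peeking_rate:.1%}")
print(f"false positives with a single test at the end:   {fixed_rate:.1%}")
```

Because the final look is also one of the peeks, the peeking error rate can never be lower than the single-test error rate, and in practice it tends to be several times higher.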

A/B Split Test Significance Calculator

Bayesian A/B testing

Bayesian A/B testing is an alternative to Student's t-test (based on the t-distribution) and other p-value-based tests, which require large sample sizes.

  • Unlike with Student's t-test, you can stop the test early if there is a clear winner, or run it for longer if you need more samples. While that limitation of the t-test generally holds, A/B Testing with Limited Data shows a workaround.

bayesian_ab_test.py

priors
represent what we believe before we run the test
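
For conversion rates, a common way to write down a prior is a Beta distribution, because observing conversions updates it with simple addition. The sketch below assumes that choice; the class name and the uniform Beta(1, 1) starting point are illustrative assumptions, not necessarily what bayesian_ab_test.py does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BetaBelief:
    """Belief about a conversion rate, stored as a Beta(alpha, beta) distribution."""
    alpha: float
    beta: float

    def mean(self):
        return self.alpha / (self.alpha + self.beta)

    def update(self, conversions, visitors):
        # Conjugate update: add successes to alpha and failures to beta.
        return BetaBelief(self.alpha + conversions,
                          self.beta + visitors - conversions)


prior = BetaBelief(1, 1)             # uniform: no opinion before the test
posterior = prior.update(30, 200)    # made-up data: 30 conversions from 200 visitors
print(prior.mean(), posterior.mean())
```

A stronger prior, for example Beta(10, 90) for "we expect roughly 10%", pulls the posterior toward that expectation until enough data accumulates to outweigh it.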

Advantages of Bayesian Testing

  1. Results are easier to interpret; p-values are confusing. Try to follow A/B Testing with Limited Data without your brain melting.
  2. "measuring the probability at time t that B is better than A (or vice versa). You can look at the data, check if the test is finished, and stop the test early if the result is highly conclusive." [5]
  3. You can use your current posteriors as new priors for what is essentially the start of a new test without any major interruptions in your development flow. [5] With traditional hypothesis testing, carrying results over like this is probably the worst thing you can do.
  4. A Bayesian A/B test achieves the same lift as the standard procedure, but typically uses fewer data points. [5]
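
Points 2 and 3 can be sketched together. The code below estimates the probability that B is better than A by sampling from Beta posteriors, and feeds each batch's posterior back in as the prior for the next batch. The helper names, the batch data, the uniform starting prior, and the 95% stopping threshold are all assumptions for illustration; this is not necessarily the stopping rule used in [5].

```python
import random


def update(belief, conversions, visitors):
    """Beta(alpha, beta) belief as a tuple; add successes and failures."""
    alpha, beta = belief
    return (alpha + conversions, beta + visitors - conversions)


def prob_b_beats_a(belief_a, belief_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under the two beliefs."""
    rng = random.Random(seed)
    wins = sum(
        rng.betavariate(*belief_b) > rng.betavariate(*belief_a)
        for _ in range(draws)
    )
    return wins / draws


# Made-up batches of (conversions_a, visitors_a, conversions_b, visitors_b).
batches = [(10, 100, 14, 100), (11, 100, 19, 100), (9, 100, 21, 100)]

belief_a = belief_b = (1, 1)  # uniform priors (an assumption)
for conv_a, n_a, conv_b, n_b in batches:
    # Yesterday's posterior is today's prior.
    belief_a = update(belief_a, conv_a, n_a)
    belief_b = update(belief_b, conv_b, n_b)
    p = prob_b_beats_a(belief_a, belief_b)
    print(f"P(B > A) after {belief_a[0] + belief_a[1] - 2} visitors per variant: {p:.3f}")
    if p > 0.95 or p < 0.05:
        break  # illustrative threshold for "highly conclusive"
```

Updating batch by batch gives the same posterior as updating once with all the data pooled, which is why carrying posteriors forward does not interrupt the test.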

TODO Re-learn Calculus

[1] http://ewulczyn.github.io/How_Naive_AB_Testing_Goes_Wrong/
[2] http://en.wikipedia.org/wiki/P-value
[3] http://www.evanmiller.org/how-not-to-run-an-ab-test.html
[4] http://elem.com/~btilly/ab-testing-multiple-looks/part1-rigorous.html
[5] http://www.bayesianwitch.com/blog/2014/bayesian_ab_test.html