Read my latest post on Towards Data Science, in which I discuss how the data-generating process affects AB test performance and how to understand the nuances of this process through running Monte Carlo simulations.
Running an AB test on a properly randomized sample guarantees that the results are unbiased and, if the sample is large enough, precise. What “large enough” means depends on the situation. What other variables affect the outcome? For example, suppose you are running an AB test of how a landing page design affects a customer’s propensity to buy. If the design does have an effect, but it is very small compared to the effect of the customer’s income (which you don’t observe), you will need a much larger sample to detect the true impact than if the design were the sole driver of customers’ purchases.
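To make this concrete, here is a toy sketch of that situation. The data-generating process, effect sizes, and variable names below are my own illustrative assumptions, not numbers from the post: a small landing-page effect gets buried in noise from unobserved income unless the sample is large.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_purchases(n, effect=0.1, income_sd=5.0):
    """Hypothetical data-generating process: the landing page adds a
    small `effect` to purchases, while unobserved income adds noise
    with standard deviation `income_sd`."""
    treated = rng.integers(0, 2, size=n)          # random assignment
    income = rng.normal(0.0, income_sd, size=n)   # unobserved driver
    purchases = 10.0 + effect * treated + income
    return treated, purchases

# Difference-in-means estimate at two sample sizes: the noise from
# income swamps the small treatment effect unless n is large.
for n in (1_000, 100_000):
    t, y = simulate_purchases(n)
    est = y[t == 1].mean() - y[t == 0].mean()
    print(f"n={n:>7}: estimated effect = {est:+.3f} (true = +0.100)")
```

At the smaller sample size the estimate can easily land far from the true effect, or even flip sign; at the larger one it settles near +0.1.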
More often than not, we observe some features that likely impact the outcome of interest (and hence AB test performance) and some that don’t. On top of that, many factors we do not observe at all may affect the outcome we are measuring in the AB test.
Obviously, we can’t observe what we don’t observe, but running Monte Carlo simulations can help. Here’s an analogy: imagine you’re planning a road trip. You consider factors like traffic, weather, and construction. A Monte Carlo simulation would be like taking hundreds of virtual road trips, each with slightly different conditions, to see which routes will most likely get you there on time. Then, you could look at the distribution of travel times across all those conditions to check whether your plan gets you to your destination on time in, say, 95% of the scenarios.
You can apply this kind of simulation technique to your AB test, where the road trip plan is the AB test design, and the different traffic, weather, and construction conditions are the different observed and unobserved features that drive the data-generating process. If you make reasonable assumptions about the distribution of those features, a Monte Carlo simulation is a decision tool that can help you determine if your test design is reasonable.
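The same idea can be sketched in code. The simulation below is a minimal example under assumptions I am making up for illustration (a small treatment effect plus unobserved income noise, as above, and a two-sided z-test at the 5% level): run many virtual AB tests and record how often the test design actually detects the effect.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def run_one_test(n, effect=0.1, income_sd=5.0):
    """One virtual AB test under an assumed data-generating process:
    a small landing-page effect plus unobserved income noise.
    Returns the two-sided p-value of a difference-in-means z-test."""
    treated = rng.integers(0, 2, size=n)
    income = rng.normal(0.0, income_sd, size=n)
    y = 10.0 + effect * treated + income
    a, b = y[treated == 1], y[treated == 0]
    se = math.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    z = (a.mean() - b.mean()) / se
    # Two-sided p-value from the normal approximation.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def monte_carlo_power(n, n_sims=500, alpha=0.05):
    """Fraction of simulated tests that detect the effect at level alpha:
    the Monte Carlo estimate of the design's power."""
    return sum(run_one_test(n) < alpha for _ in range(n_sims)) / n_sims

# How often does each test design "arrive on time"?
for n in (5_000, 50_000):
    print(f"n={n:>6}: estimated power = {monte_carlo_power(n):.2f}")
```

If the estimated power at your planned sample size is well below your target (say, 80%), the simulation is telling you the test design is not reasonable under those assumptions, before you spend any traffic on it.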