A/B Testing
A/B testing is an example of a completely randomized trial and is considered the standard, basic design for randomized controlled experiments. This makes it a popular approach for running experiments and determining whether experience "A" or "B" works better. Because users are assigned at random, it helps establish causation rather than mere correlation in observed data.
It is a simple and effective way of testing two versions of something to figure out which performs better: users are randomly split between two variants, A and B, and a statistical test then determines which one is better. A/B testing is mainly used to experiment with new ideas, for example when testing Machine Learning models, and with actions to improve existing ones. The test produces statistics for each variant that can then be compared. In A/B testing we control how users are allocated to the different groups, which is why it is regarded as a gold-standard method for understanding causal relationships.
A/B testing is one of the most commonly used methods for causal interpretation.
In many cases, getting data for the entire population is very hard. The solution is to take a sample (e.g. selecting 1,000 from 100,000 people) and do statistical analysis on the sample data to predict how the remaining 99,000 people are going to behave. This process of selecting a small portion of a larger population is called sampling; a randomly drawn sample can represent the overall population. Example: if 10% of 1,000 people buy a product and we select a random sample of 10 people, we would expect about 1 of them to buy it. The ratio holds on average, although any single small sample can deviate from it, and larger samples give more reliable estimates.
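The sampling idea above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population is synthetic (100,000 people, exactly 10% of whom buy), and the helper name `sample_purchase_rate` is made up for this example.

```python
import random


def sample_purchase_rate(population, sample_size, rng):
    """Estimate the purchase rate from a simple random sample."""
    sample = rng.sample(population, sample_size)  # sampling without replacement
    return sum(sample) / sample_size


# Hypothetical population: 1 = buys the product, 0 = does not (10% buyers).
population = [1] * 10_000 + [0] * 90_000

rng = random.Random(0)  # fixed seed so the run is repeatable
small_estimate = sample_purchase_rate(population, 10, rng)
large_estimate = sample_purchase_rate(population, 1_000, rng)

print(f"true rate:            {sum(population) / len(population):.3f}")
print(f"estimate from n=10:   {small_estimate:.3f}")
print(f"estimate from n=1000: {large_estimate:.3f}")
```

A sample of 10 can only produce estimates in steps of 0.1, so it can easily miss the true rate, while the sample of 1,000 will usually land close to 10%.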
An A/B test allows us to construct a hypothesis and understand why certain things perform better than others.
Hypothesis Testing: A statistical hypothesis test is a method of statistical inference. Commonly, two statistical data sets are compared, or a data set obtained by sampling is compared against a synthetic data set from an idealized model. A hypothesis is proposed for the statistical relationship between the two data sets, and this is compared as an alternative to an idealized null hypothesis that proposes no relationship between two data sets.
Hypothesis Testing Steps:
Generate Hypothesis: Null Hypothesis (the new feature has no effect; this is the baseline assumption that there is no relationship between the two data sets) and Alternate Hypothesis (the new feature has an effect).
Gather relevant data
Test statistics (statistical significance, p-value)
P-Value: the probability of getting results at least as extreme as the ones we observed if the null hypothesis were true.
If the p-value is 0.0216, which is less than 0.05, we reject the null hypothesis: if the new feature truly had no effect, results at least this extreme would occur only 2.16% of the time through random noise. The p-value alone only shows that the conditions and the outcome are associated (for example, a relation between new ads and more people coming in); a causal reading rests on the random assignment of users, not on the p-value itself.
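As a rough illustration of how such a p-value might be computed, here is a sketch of a two-sided two-proportion z-test. The choice of test, the function name `two_proportion_z_test`, and the conversion counts are all assumptions made for this example; the source does not say which test it used.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    Returns (z, p_value). Relies on the normal approximation, so it is
    only reasonable when both groups are fairly large.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Probability of a result at least this extreme if the null were true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 200 of 2,000 users convert on A, 250 of 2,000 on B.
z, p_value = two_proportion_z_test(200, 2000, 250, 2000)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null hypothesis at the 5% level.")
else:
    print("Not enough evidence to reject the null hypothesis.")
```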
An A/B test starts by deciding what hypothesis we want to test. Example: examine the size of the subscribe button. The metric here is how many people click on the button. To run the test we show different buttons to different people and then conclude which button size produced the higher CTR (Click-Through Rate). Here it is very important to understand that other factors, such as screen size or mobile versus laptop version, might influence the clicks on the button and therefore the result of our analysis. Randomization minimizes the influence of these factors. If the groups are not similar with respect to them, we can consider dividing the population by screen size (conditioning on the confounder). In such cases, we usually need to identify the effects and state our assumptions explicitly in the process of causal effect estimation; otherwise we can reach wrong conclusions.
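The following sketch shows one possible way to randomize users into variants and then compare CTR within each device type. Hash-based assignment is an assumption chosen for this example, not something the text prescribes, and the names `assign_variant`, `ctr_by_stratum`, and the experiment label `button_size` are hypothetical.

```python
import hashlib
from collections import defaultdict


def assign_variant(user_id, experiment="button_size"):
    """Assign a user to 'A' or 'B' from a hash of the user id.

    The split is effectively random across users but stable per user,
    so the same person always sees the same button.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"


def ctr_by_stratum(events):
    """Compute CTR per (device, variant) from (user_id, device, clicked) rows."""
    counts = defaultdict(lambda: [0, 0])  # key -> [clicks, impressions]
    for user_id, device, clicked in events:
        key = (device, assign_variant(user_id))
        counts[key][0] += int(clicked)
        counts[key][1] += 1
    return {key: clicks / shown for key, (clicks, shown) in counts.items()}


# Hypothetical impression log.
events = [
    ("u1", "mobile", True),
    ("u2", "mobile", False),
    ("u3", "laptop", True),
    ("u4", "laptop", False),
    ("u5", "mobile", True),
]
for (device, variant), ctr in sorted(ctr_by_stratum(events).items()):
    print(f"{device:7s} variant {variant}: CTR = {ctr:.2f}")
```

Comparing A and B within each device type is a simple form of conditioning on the confounder: a difference in device mix between the groups can no longer masquerade as an effect of the button size.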
Note: Assumptions are an important part of modeling in Causal Inference. If we disregard them before modeling, the results can be misleading!
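Multivariate testing: A/B testing can be extended to more than two variants, so we can end up doing A/B/C/D testing (often called A/B/n testing). Multivariate testing goes a step further and varies several elements at once to test their combinations.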
Companies use this method for testing everything from website designs to offers and product descriptions. A/B testing is an important way to gather both quantitative and qualitative user insights and use them to understand potential customers and optimize the conversion funnel based on data.
Discovery A/B tested the components of their video player to engage with their TV show 'super fan.' The result? A 6% increase in video engagement.
ComScore A/B tested logos and testimonials to increase social proof on a product landing page and increased leads generated by 69%.
Secret Escapes tested variations of their mobile signup pages, doubling conversion rates and increasing lifetime value.
Self-Selection Bias: When we run an A/B test, we plan to test on a random sample of people that represents the whole population. But is that really true? The test only runs on the people who are present (online, in the case of an e-commerce website). We have to be sure that the effect we are testing is really the one we are looking for.
It is very difficult to have a perfect, clean, unbiased A/B test when we are increasing the scale of people or the target audience.
Looking at too many metrics (risk of spurious correlations)
Ending the A/B test too soon. A test usually needs a lot of running time (3-4 weeks).
Long-term A/B tests also have disadvantages: they are hard to run in complete isolation from other interventions or new features, which reduces the accuracy of the results.
Multi-split testing
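The risk of looking at too many metrics, listed above, can be made concrete with a little arithmetic. The sketch below assumes the metrics are independent and uses a Bonferroni correction, which is one common remedy and is not named in the text; the function names and p-values are made up for illustration.

```python
def family_wise_error_rate(num_metrics, alpha=0.05):
    """Chance of at least one false positive across independent metrics."""
    return 1 - (1 - alpha) ** num_metrics


def bonferroni_significant(p_values, alpha=0.05):
    """Flag metrics that stay significant after a Bonferroni correction."""
    threshold = alpha / len(p_values)  # stricter cut-off per metric
    return {name: p <= threshold for name, p in p_values.items()}


# With 20 metrics, a spurious "win" on at least one of them becomes likely.
print(f"false-positive risk with 20 metrics: {family_wise_error_rate(20):.2f}")

# Hypothetical p-values for three metrics from one test.
p_values = {"ctr": 0.004, "signups": 0.03, "time_on_page": 0.2}
print(bonferroni_significant(p_values))
```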
To make sure the sample is representative, the test should run on every day of the week (a longer duration)
Avoid a small sample size by calculating the minimum sample size required before the test
Make sure the test is not stopped too soon
Retesting is important, to remove the possibility of contradictory results.
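For the minimum sample size mentioned above, one common approximation for comparing two conversion rates is sketched here. The formula (normal approximation, two-sided test), the default 5% significance level and 80% power, and the function name `min_sample_size_per_group` are assumptions for this example rather than values taken from the text.

```python
from math import ceil
from statistics import NormalDist


def min_sample_size_per_group(p_baseline, p_expected, alpha=0.05, power=0.8):
    """Approximate users needed per variant to detect p_baseline -> p_expected."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    variance = p_baseline * (1 - p_baseline) + p_expected * (1 - p_expected)
    effect = p_expected - p_baseline
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Hypothetical goal: detect a lift in conversion rate from 10% to 12%.
n_needed = min_sample_size_per_group(0.10, 0.12)
print(f"users needed per variant: {n_needed}")
```

Smaller expected effects need many more users, which is one reason a test that is stopped early can be inconclusive or misleading.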
A/B testing measures the effect of the treatment on the population that selects itself into the A/B test.
https://towardsdatascience.com/data-science-you-need-to-know-a-b-testing-f2f12aff619a
what AB Testing is actually measuring
https://hbr.org/2017/06/a-refresher-on-ab-testing
https://www.invespcro.com/blog/ab-testing-14-sampling-issues-that-can-ruin-your-test/