Simpson's Paradox
Last updated
Simpson's Paradox is a common phenomenon in probability and statistics. It occurs when data that actually consists of several groups is aggregated: a trend that appears within each group separately can weaken, vanish, or even reverse in the combined data. The aggregated trend is therefore not necessarily causal.
It is important to understand and resolve Simpson's Paradox because the data we see is not all the data that there is. The paradox shows how easy it is to reach a wrong conclusion when relying only on intuition about data and on statistical methods unaided by causal reasoning. It marks the limits of purely statistical methods and shows why causality is necessary to avoid these paradoxical conclusions.
Examples:
Biased Admissions: A study of admissions at UC Berkeley showed an apparent bias towards men, and the difference was so large that it was unlikely to be due to chance. The study later showed that women tended to apply to more competitive departments with lower acceptance rates, whereas men tended to apply to less competitive departments with higher acceptance rates.
Simpson's Paradox on Reddit: The average comment length decreased over time. What should be done? If we split the data by when users joined the platform, we see that the average comment length increases over time within each group. The overall decrease simply reflects new users being added to the platform.
COVID-19 Dataset: According to research results, within each age group whites make up a lower percentage of deaths than of cases. But when we aggregate all of the ages, whites have a higher fatality rate. The reason is simple: the white population is older. According to U.S. census data (not shown here), 9 percent of the white population in the United States is over age 75.
The COVID-19 dataset shows that such problems are serious, and it raises a fundamental question: does the data reflect reality?
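Solution: One solution is stratification. Conditioning on the relevant variable (the department in the admissions example, the age group in the COVID-19 example) and comparing within each group resolves the problem here.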
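To make stratification concrete, here is a minimal sketch in Python. The counts are invented for illustration and are not the real Berkeley figures; the department labels and the helper names (rate, aggregated_rates, stratified_rates) are likewise hypothetical. It computes acceptance rates once on the pooled data and once within each department.

```python
# Hypothetical admissions counts, invented for illustration only.
# Each record: (department, gender, applicants, admitted)
records = [
    ("less competitive", "men",   800, 480),  # 60% admitted
    ("less competitive", "women", 200, 140),  # 70% admitted
    ("more competitive", "men",   200,  40),  # 20% admitted
    ("more competitive", "women", 800, 200),  # 25% admitted
]


def rate(rows):
    """Acceptance rate over a list of records."""
    applicants = sum(r[2] for r in rows)
    admitted = sum(r[3] for r in rows)
    return admitted / applicants


def aggregated_rates(records):
    """Acceptance rate per gender, ignoring the department."""
    genders = sorted({r[1] for r in records})
    return {g: rate([r for r in records if r[1] == g]) for g in genders}


def stratified_rates(records):
    """Acceptance rate per (department, gender), i.e. conditioned on department."""
    result = {}
    for dept in sorted({r[0] for r in records}):
        for gender in sorted({r[1] for r in records}):
            rows = [r for r in records if r[0] == dept and r[1] == gender]
            result[(dept, gender)] = rate(rows)
    return result


for gender, value in aggregated_rates(records).items():
    print(f"overall, {gender}: {value:.2f}")
for (dept, gender), value in stratified_rates(records).items():
    print(f"{dept}, {gender}: {value:.2f}")
```

With these made-up numbers, men have the higher acceptance rate in the pooled data, while women have the higher rate inside every department. The reversal comes entirely from which departments each group applies to.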
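The paradox is resolved once the statistical modeling properly accounts for the causal relations in the data. We have to think causally: a causal model for the data needs to be built. By considering the data-generating process, represented as a directed acyclic graph (DAG), and applying the causal model, we can resolve Simpson's Paradox. The causal model tells us which variables to condition on, and therefore whether the aggregated or the stratified result is the one to trust.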
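As a rough sketch of what applying the causal model can look like, suppose the assumed DAG is Z -> X, Z -> Y, X -> Y, so that Z is a confounder of the effect of X on Y. Under that assumption, the backdoor adjustment formula gives P(Y | do(X = x)) as the sum over z of P(Y | X = x, Z = z) * P(Z = z). The counts, variable names, and function names below are hypothetical and chosen only for illustration. If Z were instead a mediator or a collider, this adjustment would not be appropriate.

```python
# Hypothetical counts, invented for illustration only.
# Assumed DAG: Z -> X, Z -> Y, X -> Y (Z confounds the effect of X on Y).
# Key: (z, x); value: (number of units, number of successes)
counts = {
    ("low", "treated"):    (100,  90),
    ("low", "untreated"):  (400, 320),
    ("high", "treated"):   (400, 200),
    ("high", "untreated"): (100,  40),
}


def naive_rate(counts, x):
    """P(Y = success | X = x), computed on the pooled data."""
    units = sum(n for (_, xi), (n, _) in counts.items() if xi == x)
    successes = sum(s for (_, xi), (_, s) in counts.items() if xi == x)
    return successes / units


def adjusted_rate(counts, x):
    """Backdoor adjustment: sum over z of P(Y | X = x, Z = z) * P(Z = z)."""
    total = sum(n for n, _ in counts.values())
    result = 0.0
    for z in {z for z, _ in counts}:
        n_z = sum(n for (zi, _), (n, _) in counts.items() if zi == z)
        n_zx, s_zx = counts[(z, x)]
        result += (s_zx / n_zx) * (n_z / total)
    return result


for x in ("treated", "untreated"):
    print(f"{x}: naive={naive_rate(counts, x):.2f}, "
          f"adjusted={adjusted_rate(counts, x):.2f}")
```

In this made-up example, the pooled comparison makes the treated group look worse, while the adjusted comparison agrees with the within-stratum results and favors treatment. Which of the two numbers answers the causal question depends on the assumed DAG, not on the data alone.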
Conclusion:
Aggregated and disaggregated data can show different results, which leads to confusion in understanding, analyzing, and inferring from the data.
Interrogating the data and using correctly phrased causal queries is really important.
Unless the data is analyzed carefully, it is hard to understand and resolve such problems. Hence, causality is important for making sense of this data.
Resources: