Four Common Statistical Misconceptions You Should Avoid
The Base Rate Fallacy That Finds Too Many Terrorists
Here's how the base rate fallacy works: say a company's workforce is 25% women and 75% men. From the outside, this looks like a hiring process biased toward male candidates. We assume this because in the United States, the gender distribution is roughly equal. However, that conclusion ignores the pool of applicants. If only 10% of applicants were women, then a higher percentage of the women who applied were selected than of the men who applied.
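To make the arithmetic concrete, here is a quick sketch with hypothetical numbers (1,000 applicants and 100 hires are assumptions for illustration, not figures from the article):

```python
# Hypothetical numbers: 1,000 applicants, 10% of whom are women,
# and 100 hires, 25% of whom are women.
applicants_women, applicants_men = 100, 900
hires_women, hires_men = 25, 75

rate_women = hires_women / applicants_women  # fraction of female applicants hired
rate_men = hires_men / applicants_men        # fraction of male applicants hired

print(f"Women selected at {rate_women:.1%}, men at {rate_men:.1%}")
```

Despite a 75% male workforce, women here are hired at triple the rate of men, because the base rate of female applicants is so low.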
Another common example involves the mythical terrorist-spotting device. Imagine a box with a 99% success rate at positively identifying a terrorist and a 99% chance of correctly identifying a non-terrorist. One might assume that if, out of a population of 1 million people, 100 of whom are terrorists, the box flags a person as a terrorist, there is a 99% chance it's correct. In reality, the figure is a lot closer to 1%. The reason is that the box falsely rings for 1% of non-terrorists (9,999 people) while correctly ringing for 99% of real terrorists (99 people).
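The calculation behind that "closer to 1%" claim is just Bayes' theorem applied to the numbers above:

```python
population = 1_000_000
terrorists = 100
sensitivity = 0.99   # P(box rings | terrorist)
specificity = 0.99   # P(box stays quiet | non-terrorist)

true_positives = sensitivity * terrorists                        # 99 real terrorists flagged
false_positives = (1 - specificity) * (population - terrorists)  # 9,999 innocents flagged

# Of everyone the box flags, what fraction are actually terrorists?
p_terrorist_given_ring = true_positives / (true_positives + false_positives)
print(f"P(terrorist | box rings) = {p_terrorist_given_ring:.2%}")  # about 0.98%
```

The false alarms from the huge non-terrorist base rate (9,999 people) swamp the 99 genuine hits, which is the whole point of the fallacy.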
Extrapolation That Leads to Polygamy
Extrapolation is a favorite of anyone anticipating economic trends or predicting the future. "This thing happened over a set period of time, ergo it will continue to happen." Except that might not be true. When analyzing past trends, we have to keep in mind that the factors that produced those trends are subject to change.
Take, for example, smartphone market share predictions. Back in 2009, Gartner predicted that by 2012, Symbian would be the top smartphone operating system worldwide with 39% of the market, while Android would have only 14.5%. Windows Mobile, meanwhile, would be beating BlackBerry, sitting just behind the iPhone. Needless to say, this wasn't even remotely the case.
So, why was Gartner so far off? Because extrapolation doesn't account for changing circumstances. Microsoft killed off Windows Mobile in favor of Windows Phone, a platform that Nokia adopted instead of Symbian. In one big move, the entire prediction was rendered not only incorrect, but completely impossible. Things always change, which is why nearly all predictions based on statistical trends should reasonably be followed with the phrase "...assuming nothing changes."
Correlation That Doesn't Always Imply Causation (But Might)
Avoiding the ""Correlation doesn't imply causation"" fallacy is an old favorite. So old, in fact, that it comes with its own Latin adage: cum hoc ergo proptor hoc. However, the counterpoint to this that often gets overlooked is that correlation raises questions about causation. Or, to quote xkcd (again): ""Correlation doesn't imply causation, but it does waglge its eyebrows suggestively and gesture furtively while mouthing 'look over there'.""
Consider one highly controversial example from the Missouri University of Science and Technology that found certain types of internet usage correlated to depression. Users suffering from depression were found to check email more often, watch more videos, or indulge in more file-sharing.
The initial assumption made by many readers was that the study claimed internet usage led to depression. The mantra "Correlation does not imply causation!" could be invoked to argue the study is wrong, but that throws the baby out with the bathwater. When there is no direct explanation for why one thing correlates with another, further study, not outright dismissal, is warranted.
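One reason correlation deserves follow-up rather than dismissal is that a hidden common cause can make two behaviors correlate even though neither causes the other. This toy simulation (invented data, not the study's) lets a latent "depression" score drive both email checking and video watching:

```python
import random

random.seed(0)

# Hypothetical model: a hidden score influences both behaviors.
# Neither behavior causes the other, yet they still correlate.
n = 10_000
hidden = [random.gauss(0, 1) for _ in range(n)]
email_checks = [h + random.gauss(0, 1) for h in hidden]
video_hours = [h + random.gauss(0, 1) for h in hidden]

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from scratch."""
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(email_checks, video_hours)
print(f"r = {r:.2f}")  # substantial correlation with zero direct causation
```

Observing that correlation in the wild tells you something is going on; figuring out whether it's causation, reverse causation, or a confounder is exactly what further study is for.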
The Simpson's Paradox That Both Raises and Reduces Wages
Simpson's paradox is one that bends the mind, but it's really just complex math. The short version is that sometimes when you examine data in sub-groups, you can see one trend, but see a completely opposite trend when you view that same data in aggregate. For example, the median wage, adjusted for inflation, has risen in the United States since 2000. However, the median wage actually fell for every sub-group of workers.
The consequence of this paradox is that occasionally, if you're looking at data in combined form, you may be led to a contradictory conclusion than if you looked at it in parts. One famous example, based on a real study, found that treatment A for kidney stones was more successful at treating both large and small stones when the groups were viewed separately, but treatment B was more successful when both groups were combined.
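The figures commonly quoted from that 1986 kidney stone study make the flip easy to verify; treat the exact numbers below as illustrative rather than authoritative:

```python
# Success counts commonly quoted from the kidney stone study:
# (successes, total patients) per treatment and stone size.
treatments = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

for name, groups in treatments.items():
    for size, (ok, total) in groups.items():
        print(f"Treatment {name}, {size} stones: {ok / total:.0%}")
    ok_all = sum(ok for ok, _ in groups.values())
    total_all = sum(t for _, t in groups.values())
    print(f"Treatment {name}, combined: {ok_all / total_all:.0%}")
```

Treatment A wins in each sub-group (93% vs 87% on small stones, 73% vs 69% on large), yet B wins overall (83% vs 78%), because A was disproportionately assigned the harder large-stone cases.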
Unfortunately, this makes decisions based on data subject to Simpson's paradox more complex. On the one hand, if you know the size of a kidney stone, treatment A is obviously preferable. On the other hand, once dividing data into sub-groups can yield different results, the data can be cut up to show almost anything you want.
The best course of action with Simpson's paradox (and, in fact, with any statistical data) is to use the information to refer back to the story behind the data. Statistics are heavily math-based, but they're used to analyze real-world scenarios and situations. Separated from reality, statistics are of limited value. Relying on numbers as an unbiased representation of reality is comforting, but without tying them to real-life people and situations, the information borders on worthless.