Sunday, April 18, 2010

The dark side of Statistically Significant

According to Wikipedia:

In statistics, a result is called statistically significant if it is unlikely to have occurred by chance.

The use of the word significance in statistics is different from the standard one, which suggests that something is important or meaningful. For example, a study that included tens of thousands of participants might be able to say with very great confidence that people of one state are more intelligent than people of another state by 1/20 of an IQ point. This result would be statistically significant, but the difference is small enough to be utterly unimportant. Many researchers urge that tests of significance should always be accompanied by effect-size statistics, which approximate the size and thus the practical importance of the difference.

Statistical significance is something I come across every day in my line of work and, to be honest, it is the most abused and misunderstood term I deal with. Some people assume that once a result comes out statistically significant, all their questions are answered and their problems are solved.

An article by Tom Siegfried in Science News raises interesting points about the statistical significance test and the assumptions behind it. In the second part below, I have replaced the clinical trials example from the original article with a channel effectiveness example.

1. The Hunger Hypothesis

The amount of evidence required to accept that an event is unlikely to have arisen by chance is known as the significance level or critical p-value. In other words, a p-value of .05 means that, if the null hypothesis is true (there is no real effect), there is only a 5% chance of obtaining the observed (or a more extreme) result by chance.
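To make that definition concrete, here is a minimal simulation sketch. It assumes a two-sided, pooled two-proportion z-test and a hypothetical "A/A" setup in which both groups share the same true response rate, so the null hypothesis is true by construction. The function names, sample sizes and the 10% response rate are all illustrative assumptions, not figures from the article.

```python
import random
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(x1, n1, x2, n2):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (x1 / n1 - x2 / n2) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(n_sims=1000, n=500, rate=0.10, alpha=0.05, seed=0):
    """Share of simulated A/A tests that come out 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Both groups respond at the same true rate: the null is true.
        a = sum(rng.random() < rate for _ in range(n))
        b = sum(rng.random() < rate for _ in range(n))
        if two_proportion_p_value(a, n, b, n) < alpha:
            hits += 1
    return hits / n_sims


print(false_positive_rate())
```

The printed share should land near 0.05: even when nothing real is going on, roughly one test in twenty crosses the threshold.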

So does this mean that you are 95% certain that the observed difference between groups, or sets of samples, is real and could not have arisen by chance? No. It is incorrect to transpose that finding into a 95 percent probability that the null hypothesis is false. “The P value is calculated under the assumption that the null hypothesis is true,” writes biostatistician Steven Goodman. “It therefore cannot simultaneously be a probability that the null hypothesis is false.”

That interpretation commits an egregious logical error (technical term: “transposed conditional”): confusing the odds of getting a result (if a hypothesis is true) with the odds favoring the hypothesis if you observe that result.

Consider this simplified example. Suppose a certain dog is known to bark constantly when hungry. But when well-fed, the dog barks less than 5 percent of the time. So if you assume for the null hypothesis that the dog is not hungry, the probability of observing the dog barking (given that hypothesis) is less than 5 percent. If you then actually do observe the dog barking, what is the likelihood that the null hypothesis is incorrect and the dog is in fact hungry?

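That probability cannot be computed with the information given. The dog barks 100% of the time when hungry, and less than 5% of the time when not hungry. To compute the likelihood of hunger, you would also need to know how often the dog is hungry in the first place, and the barking alone does not tell you that. A well-fed dog may seldom bark, but observing the rare bark does not imply that the dog is hungry.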
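A small sketch using Bayes' rule shows why the answer depends on something the barking alone cannot supply. The prior probabilities of hunger below are made-up assumptions, and "less than 5 percent" is treated as exactly 5 percent for simplicity. The function name posterior_hungry is just an illustrative label.

```python
def posterior_hungry(prior_hungry, p_bark_if_hungry=1.0, p_bark_if_fed=0.05):
    """Bayes' rule: P(hungry | bark) for an assumed prior P(hungry)."""
    bark_and_hungry = prior_hungry * p_bark_if_hungry
    bark_and_fed = (1 - prior_hungry) * p_bark_if_fed
    return bark_and_hungry / (bark_and_hungry + bark_and_fed)


# Hypothetical priors: how often the dog is hungry before we hear anything.
for prior in (0.01, 0.10, 0.50):
    post = posterior_hungry(prior)
    print(f"P(hungry) = {prior:.2f} -> P(hungry | bark) = {post:.2f}")
```

With these assumed inputs, the same bark supports anything from about a 17% to about a 95% chance of hunger. The 5% figure on its own does not pin it down.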

2. Statistical significance is not always statistically significant.

The effectiveness of communication channels (Telemarketing, Emails or Direct Mails) is usually tested by comparing the results from Test & Control groups.

Using significance tests, the channel’s effect (response rate, purchase, etc.) on the Test group is pronounced to be greater than that on the Control group by an amount unlikely to occur by chance.

The standard threshold in most significance tests is 5%: a result expected to occur by chance less than 5% of the time is considered “statistically significant.” So if Email drives a higher response in the Test group than in the Control (the non-Emailed group) by an amount that would be expected by chance only 4% of the time, it would be concluded that the Email campaign really worked.

Now suppose Direct Mail also delivered similar results – Test group having a higher response than the Control group, but by an amount that would be expected by chance 6% of the time. In that case, conventional analysis would say that such an effect lacked statistical significance and that there was insufficient evidence to conclude that Direct Mail worked.

If the two channels were instead tested directly against each other (one group getting Emails and another similar group receiving Direct Mails) rather than separately against Control groups, the difference between the performance of Email and Direct Mail might very well NOT be statistically significant.

“Comparisons of the sort, ‘X is statistically significant but Y is not,’ can be misleading,” statisticians Andrew Gelman of Columbia University and Hal Stern of the University of California, Irvine, noted in an article discussing this issue in 2006 in the American Statistician. “Students and practitioners [should] be made more aware that the difference between ‘significant’ and ‘not significant’ is not itself statistically significant.”
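The point can be illustrated with a rough sketch. All the counts below are invented for illustration: four hypothetical groups of 10,000 customers each, with both Control groups responding at 2%. The test used is a two-sided, pooled two-proportion z-test, which is an assumption and not necessarily what any particular campaign tool applies.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(x1, n1, x2, n2):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


n = 10_000  # hypothetical size of every group
email_test, email_control = 245, 200  # responders (made-up counts)
dm_test, dm_control = 236, 200        # responders (made-up counts)

p_email = two_proportion_p_value(email_test, n, email_control, n)
p_dm = two_proportion_p_value(dm_test, n, dm_control, n)
p_head_to_head = two_proportion_p_value(email_test, n, dm_test, n)

print(f"Email vs Control:       p = {p_email:.3f}")
print(f"Direct Mail vs Control: p = {p_dm:.3f}")
print(f"Email vs Direct Mail:   p = {p_head_to_head:.3f}")
```

With these made-up counts, Email clears the 5% threshold and Direct Mail does not, yet the head-to-head comparison between the two channels is nowhere near significant.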

One possible explanation: the Control group for Email may be doing a lot worse than the Control group for Direct Mail. The difference in response rates between Test & Control thus becomes larger for Email, and a statistically significant result is observed.

Signing off, with a quote from William W. Watt:
"Do not put your faith in what statistics say until you have carefully considered what they do not say."