The world of science has long extolled the virtue of the scientific method. Yet there is a lack of clarity on what this method is. I find that many people conflate statistical hypothesis testing with the scientific method, leading to unfortunate consequences. This confusion is so widespread that Wikipedia has also fallen into it. Perhaps this problem is due to the oldest mistake in the book: following a process without giving adequate thought to that process’s underlying values. Therefore, we must achieve clarity on the values of science that we want our scientific methods to embody. Here are the values of science for me, which help provide desiderata for a scientific method:
- Truth: Telling the whole truth is a hallmark of the scientific temper. This manifests in many ways. One of them is an admission of uncertainty. We don’t know many things and are comfortable revealing how much we know and how much we don’t. Another is attempting to tell the story like it is, without attempting to distort our findings for motives other than discovery of knowledge.
- Learning: Scientists value learning. The desiderata of scientific learning that I can think of are:
  - What we learn today should not be invalidated tomorrow in its original context: In other words, when situations are discovered where my theory does not work, those situations call for an extension of the theory without nullifying it. My theory should become a specialized theory, and would continue to hold in the context in which it was proposed. An example of this is Newton’s laws of motion. We know they don’t help explain atomic motion. And yet, they continue to explain the motion of large bodies. One way to violate this desideratum is for a theory to fail not only on phenomena outside the investigator’s original context, but also in a replication of that original context. When this happens, the theory was not scientific to begin with, and the methods followed may be pseudo-scientific.
  - We should learn based on what we know: When we are fairly certain of something, we should not learn much from a few experiments. When we are fairly uncertain of something, we should learn much from a few experiments. Imagine if the opposite were true. Suppose we were told that someone is a psychic, and we started off believing that such things are not true. If we nonetheless held a totally open mind, then at the first experiment where someone demonstrated psychic ability, we would have to admit a good chance that people could be psychic. However, if we held on to our belief that people cannot be psychic, then one experiment would have minimal impact on it, and only after several experiments showing conclusive evidence would we want to start changing our belief. Honoring this desideratum protects us from random noise in experiments. The reader is referred to Feynman’s excellent mind-reader example in This Unscientific Age to see how a scientist ought to investigate such phenomena.
  - We should always leave room for doubt: While being certain about repeated results is important, or no knowledge creation can happen, it is just as important to leave a little room for doubt, without which we would not be able to learn anything new. We note that this is quite different from believing we know nothing about the universe we live in, which some scientists claim to believe to further science.
- Practical: The scientific method must be practical, in that it must give us results when applied to any object of inquiry. If our method does not generate learning on its object, then it is likely not very scientific.
These values are largely inspired by Feynman’s lecture on The Value of Science, which I’ve found to be one of the clearest articulations on the subject. The reader should continue reading only if the desiderata provided are deemed acceptable.
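To make the "learn based on what we know" desideratum concrete, here is a minimal sketch of the psychic example. It assumes Bayes' rule as the way belief gets updated, which the list above does not spell out, and every number in it is hypothetical: a real psychic is assumed to pass a test 90% of the time, a non-psychic to pass by luck 10% of the time, and the two observers start with beliefs of 0.5 and one in a million.

```python
def update(prior: float, p_pass_if_psychic: float, p_pass_if_not: float) -> float:
    """Bayes' rule: belief that psychic ability is real after one passed test."""
    numerator = prior * p_pass_if_psychic
    return numerator / (numerator + (1.0 - prior) * p_pass_if_not)


def belief_after(prior: float, n_passes: int,
                 p_pass_if_psychic: float = 0.9,   # hypothetical pass rate if the ability is real
                 p_pass_if_not: float = 0.1) -> float:  # hypothetical pass rate by luck
    """Belief after n_passes consecutive passed tests, starting from prior."""
    belief = prior
    for _ in range(n_passes):
        belief = update(belief, p_pass_if_psychic, p_pass_if_not)
    return belief


# Two observers who see exactly the same evidence.
for label, prior in [("open mind", 0.5), ("skeptic", 1e-6)]:
    beliefs = [round(belief_after(prior, n), 6) for n in (0, 1, 3, 6, 10)]
    print(f"{label:>9}: {beliefs}")
```

Under these assumed numbers, the open-minded observer moves to a belief of about 0.9 after a single passed test, while the skeptic barely moves after one test and only comes around after a long run of passes. The same evidence teaches each observer a different amount, depending on what they already knew.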
Statistical Hypothesis Testing
I will proceed to explain how statistical hypothesis testing does not hold up to the desiderata of science.
The Problem with Statistical Significance
Starting with truth, I find it problematic that hypothesis testing relies on the notion of statistical significance, which inflates the results of an experiment through the word “significant.” Determining whether something is significant is in fact a pursuit outside the realm of science: it involves a value judgment in a decision context, of which the method of hypothesis testing makes no mention. It is the duty of scientists to use terms that clarify and not deceive the reader.
All the classical statistician means by statistical significance is that “the chance of seeing a result like the one we’re seeing being caused entirely by chance is less than 5%.” Now there is something blatantly wrong about using a term in its own definition – we usually do that for things that we do not quite understand. Even if we correct it to simply say “the chance of seeing such a result is below 5%,” there is still a gaping hole, exposed by the question: whose chance is it? Probability does not exist in the world – it is a construct we’ve created to aid our thinking, and hence, any chance that we talk about is “our” chance – a representation of our belief. We cannot sever probability from the individual who comes up with it. This is also important in order to honor the desiderata of learning – we learn based on what we know. Therefore, clarity on who is learning is central. A simple correction would fix this, such as, “Given what ‘I’ know, the chance of seeing a result like this is less than 5%.”
This is more than a semantic quibble. The worldview of the hypothesis tester does not allow existing knowledge to be incorporated. The implication of being a hypothesis tester in a worldview consistent with our desiderata is that we believe we know nothing about our universe and we are absolutely sure about this assertion. Suppose for a moment that like the White Queen (in “Through the Looking Glass, and What Alice Found There,” who could believe six impossible things before breakfast), we can also believe this.
We now come to a critical issue – why is 5% ok for statistical significance? My professor would often ask in class how many of us were willing to step out of the room if we knew that there was a crazy guy with a gun outside the building and we had a 5% chance of being shot. He would then share that he’d do his best to stay in the building until this problem was resolved, as a 5% chance of dying was a very high risk to take. By claiming statistical significance on a result (especially a medical one) and accepting a 5% margin of error, we are imposing our own acceptance of risk on other people without verifying that they are indeed ok with such a level of risk. Deciding what constitutes acceptable risk for others without their knowledge is unethical – it involves deception and possibly harm. Unfortunately, statistics classes rarely make clear the ethical implications of using this method in medical research.
The next problem in statistical significance calculations is that of sample size. Different schools of science have different standards. In the field of Psychology, it used to be the case that 30 samples were enough to claim statistical significance. In the field of Biology, it turns out to be 6. What is the science behind it? I didn’t understand how this works and am still looking for clarity on this. Why not 35 samples? Why not 25 samples? The statisticians now start citing confidence intervals to justify these numbers, so we’ll go on to confidence intervals next.
The Problem with Confidence Intervals
Using the same desiderata as before, we find that confidence intervals, like statistical significance, are another example of a term used in a way that is not consistent with its popular English meaning. Given how many students of statistics misunderstand confidence intervals, I would argue that it is unethical for professors of statistics to continue using the term. People often take confidence intervals to be about confidence, and start to think that a 95% confidence interval implies that there is a 95% chance of finding the quantity they are after in the given interval. This is not true, as the professor of statistics will be quick to tell us, and the key lies in religiously chanting, “If we were to construct these intervals many times (read thousands), then 95% of the time, the quantity of interest would lie in these intervals,” as opposed to, “the chance that the quantity of interest is in this interval is 95%.” Confused? So am I. But this is just the beginning of the religiosity.
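The "construct these intervals many times" chant can be acted out in a few lines. This is only a sketch under assumptions of my own choosing: the data are normal, the spread sigma is treated as known, the interval is the textbook z-interval for a mean, and the true mean, sample size, and number of repetitions are all hypothetical.

```python
import random


def covers(true_mean: float, sigma: float, n: int, rng: random.Random,
           z: float = 1.96) -> bool:
    """Draw one sample, build a 95% z-interval for the mean, and report
    whether that particular interval contains true_mean."""
    sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
    centre = sum(sample) / n
    half_width = z * sigma / n ** 0.5  # sigma treated as known (an assumption)
    return centre - half_width <= true_mean <= centre + half_width


def coverage(trials: int = 10_000, true_mean: float = 10.0,
             sigma: float = 2.0, n: int = 25, seed: int = 0) -> float:
    """Fraction of repeated experiments whose interval contains true_mean."""
    rng = random.Random(seed)
    hits = sum(covers(true_mean, sigma, n, rng) for _ in range(trials))
    return hits / trials


print(f"fraction of intervals containing the true mean: {coverage():.3f}")
```

The printed fraction should land near 0.95. Note what the 95% describes: the long-run behavior of the procedure over thousands of repetitions. Any single interval inside `covers` either contains the true mean or it does not, and the simulation can only check that because it knows `true_mean`, which a real experimenter never does.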
Let’s assume we are clear about the definition of confidence intervals the way the classical statisticians want us to be. Ok – can I use the 95% chance as a probability to make decisions? No. Because it is not a probability! Ok, ok – looks like I wasn’t clear on this the first time. Now that I’ve constructed a confidence interval, and I observe the result of an experiment, how do I update the interval? I can’t, comes the answer. That would violate the principles of science. Huh? It turns out that the classical statisticians believe that updating the interval amounts to tampering. We need to throw away our present data, and do the experiment all over again in order to construct a new confidence interval. This squarely violates our desiderata on learning, and also the desideratum of practicality. There are many experiments that cannot be repeated all the time, and if we are to throw away scarce results and not utilize them for learning, we are not being practical.
Broader Issues with Statistical Hypothesis Testing
The ground on which this method stands requires us to believe that the “true distribution” of a quantity of interest is a normal distribution, and check if we are able to reject this hypothesis. However, we know of many things to which we’d hesitate to ascribe a normal distribution. The statistician counters by invoking the central limit theorem, which says that the mean of a “sufficiently large” number of independent random variables will be approximately normally distributed. So, by taking the mean of our quantity of interest from different readings, each considered random and independent, we are not doing anything wrong by treating it as a normal distribution.
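The statistician's central limit theorem argument can be checked numerically. The sketch below is an illustration under my own assumptions: the individual readings follow an exponential distribution (heavily skewed, so clearly not normal), the readings are independent by construction, and skewness is used as a rough yardstick for how far a distribution is from the symmetric normal shape, which has a skewness of 0.

```python
import random


def skewness(values: list[float]) -> float:
    """Sample skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def skew_of_sample_means(sample_size: int, trials: int = 20_000,
                         seed: int = 0) -> float:
    """Skewness of the distribution of means of independent exponential readings."""
    rng = random.Random(seed)
    means = [
        sum(rng.expovariate(1.0) for _ in range(sample_size)) / sample_size
        for _ in range(trials)
    ]
    return skewness(means)


for size in (1, 5, 30):
    print(f"readings per mean = {size:>2}: skewness = {skew_of_sample_means(size):.2f}")
```

The skewness of the means should shrink toward 0 as more readings go into each mean, which is the theorem at work. The sketch gets this result only because every reading is drawn independently from the same distribution.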
The problem with this line of reasoning is twofold. First, we often have reason to believe that the phenomena we see may be related, violating our independence assumption. Second, the underlying notion of a “true distribution” from which these phenomena have been generated (and to which all error measures are directed) assumes a “God distribution.” Where did such a notion come from? Like a religious creed, the classical statistician treats such a question as taboo. Some scholars, trying to circumvent these questions, get into the exercise of discovering “nature’s distribution,” which is even more problematic for implying that nature actually has a distribution, when it really is us who are characterizing our state of information with a distribution.
The impact of using statistical hypothesis testing
One big impact of using this method has been the elevation of random noise to scientific knowledge, as a direct consequence of the method’s refusal to incorporate current knowledge (because we learn the most from a single study when we start with total ignorance, instead of requiring more evidence as we’d like to when we already have some knowledge to the contrary). One can thus see how the field of nutrition flip-flops by suddenly finding new enemies to our health, only to retract the declaration a decade later. Finding new challenges to existing understanding is always welcome. Accepting results from a single study, however, to throw away our current knowledge is reckless. A matter of great concern is a curious effect that can be noticed in the medical field: studies of the past cannot be replicated to give similar results in the present (to see this, do a Google Scholar search on failure to replicate; also read the Newsweek article Why Almost Everything You Hear About Medicine is Wrong and Jonah Lehrer’s The Truth Wears Off). This violates the desideratum that new knowledge should not invalidate the results of older experiments in their own context, and leads us to wonder about the scientific validity of the method of statistical hypothesis testing used in all these studies.
Ioannidis and Lehrer provide an explanation for this failure, which is summarized in John Cook’s blog post:
Suppose you have 1,000 totally ineffective drugs to test. About 1 out of every 20 trials will produce a p-value of 0.05 or smaller by chance, so about 50 trials out of the 1,000 will have a “significant” result, and only those studies will publish their results. The error rate in the lab was indeed 5%, but the error rate in the literature coming out of the lab is 100 percent!
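The arithmetic in the quoted passage can be simulated. This is a rough sketch with assumptions of my own: each trial compares a treated and a control group of 50 patients, outcomes are normal with a known spread of 1, the drug has exactly zero effect, and the test is a two-sided z-test. None of these details come from the quoted post.

```python
import math
import random


def null_trial_p_value(rng: random.Random, n_per_arm: int = 50) -> float:
    """Two-sided z-test p-value for one trial of a drug that does nothing."""
    treated = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]  # same distribution: no effect
    diff = sum(treated) / n_per_arm - sum(control) / n_per_arm
    z = diff / math.sqrt(2.0 / n_per_arm)        # known unit variance in both arms
    return math.erfc(abs(z) / math.sqrt(2.0))    # P(|Z| >= |z|) for a standard normal


def run_lab(n_drugs: int = 1000, alpha: float = 0.05, seed: int = 0) -> tuple[int, int]:
    """Test n_drugs ineffective drugs; return (number 'significant', number tested)."""
    rng = random.Random(seed)
    p_values = [null_trial_p_value(rng) for _ in range(n_drugs)]
    published = [p for p in p_values if p <= alpha]  # only these get written up
    return len(published), n_drugs


published, tested = run_lab()
print(f"{published} of {tested} ineffective drugs reached p <= 0.05")
print("every one of those published findings is false, since no drug works")
```

The count should come out in the neighborhood of 50, matching the quote's 1 in 20. The lab's error rate is near 5%, but a reader who sees only the published trials sees nothing except errors.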
Given our discussion so far, the discerning reader may note that while it would be a good idea for our frequentist scientists to start reporting all their results that fail to reach significance, the 5% error rate conclusion is still based on a fundamental misunderstanding of probability that prevents us from using the results of such experiments in rational decision-making. We would do well to examine the morality of our decision to ignore existing scientific knowledge knowing that our learning will be artificially amplified due to the deficiency of our method (we learn the most from a single experiment when we assume total ignorance), having a possibly devastating effect on those who will be impacted by our research (especially in the medical field).
Explaining the advance of science
The astute reader will at this point raise the objection that if statistical hypothesis testing is so bad, then how come science has advanced so much using this method? There are many discoveries, admittedly, that have been reported using this method. To examine this argument, we must fall back on that sage advice, “correlation is not causation,” and examine how hypotheses get made. As the hypothesis tester will remind us, the method of hypothesis testing considers the creation of hypotheses to be outside the scope of the method. Hypotheses can be pulled out of a hat, but in the real world, they are the result of people spending lots of time examining a phenomenon and noticing patterns. Since time immemorial, science has advanced without any notion of statistical hypothesis testing, arguably because of the powers of observation that we humans find in ourselves. I will argue that it is this power of observation, more than anything else, that has led to the advance of science. When good observation has been coupled with the use of the statistical method of hypothesis testing, the latter has become unfortunately conflated with the former. To improve in science, therefore, scientists have had to master their own minds, not with obfuscatory statistical techniques but by slowing down and making a ruthless commitment to seeing things as they are and not as they want them to be.
One might then ask how statistical hypothesis testing got started. It had to do with farming, where there was plentiful data. Having lots of data has one interesting side-effect, as my professor would say – “even the village idiot can tell you what’s going on without these methods.” Just like the conflation of good observation with the merits of the method, plentiful data hid the defects of the method, for when we have lots of data, it does not matter what prior belief we pick – we will eventually get to the right conclusions. The problem arises when we have scarce data, which is where statistical hypothesis testing gets used the most.
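The claim that plentiful data washes out the choice of prior belief can be shown with simple arithmetic. The sketch below assumes a model the post does not name: a success rate with a Beta prior, updated on counts of successes and failures. The two priors and the observed counts are hypothetical.

```python
def posterior_mean(prior_a: float, prior_b: float,
                   successes: int, failures: int) -> float:
    """Mean of the Beta(prior_a + successes, prior_b + failures) posterior."""
    return (prior_a + successes) / (prior_a + prior_b + successes + failures)


# Two people who start out disagreeing strongly about the success rate.
PRIORS = {
    "optimist": (8.0, 2.0),   # prior mean 0.8
    "pessimist": (2.0, 8.0),  # prior mean 0.2
}


def disagreement(successes: int, failures: int) -> float:
    """Gap between the two posterior means after seeing the same data."""
    means = [posterior_mean(a, b, successes, failures) for a, b in PRIORS.values()]
    return abs(means[0] - means[1])


print("gap with  5 observations:    ", round(disagreement(3, 2), 4))
print("gap with 10,000 observations:", round(disagreement(6000, 4000), 4))
```

With five observations the two still disagree substantially, so the prior matters a great deal. With ten thousand observations at the same success rate, the gap all but vanishes, and it no longer matters where either of them started.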
Due to all these issues, I am inclined to conclude that statistical hypothesis testing fails on multiple levels in its consistency with the desiderata of a scientific method. In plainspeak, statistical hypothesis testing is problematic as a scientific method, notwithstanding its wide usage in many scientific pursuits. In future posts, I will examine an older (and suddenly newer) method of thinking that transcends the difficulties of statistical hypothesis testing.