“Remember, statistical testing is even in vogue in particle physics, because no experiment is without experimental errors, and error estimation requires statistics. Decision Theory cannot base itself on Black and White Information. It is always a tinge of grey!!! Does this consideration make any difference?”

True (that SHT is in vogue in particle physics). That does not imply that frequentist statistics makes sense. Decision Analysis is largely based on Bayesian probability theory, so it is designed from the ground up to deal with uncertainty (or many shades of grey).

It makes our consideration of probability even more important. I might add that my teacher might be Prof. Ron Howard, but his teacher in Bayesian foundations was the great physicist, E. T. Jaynes, who wrote (but did not finish in his lifetime) the seminal book, Probability Theory: The Logic of Science. His notes have spread far and wide, and the book was finally completed by a student.

The problem is that most frequentist physicists nowadays actually do not understand probability theory. If you ask them what it means, they will likely say something like the frequency to be discovered in nature. The notion that two people observing the same event might have different probabilities does not exist in their world, and if admitted, brings down the entire frequentist science.

In the Bayesian worldview, the key question is, “What did you believe before you did the experiment?” You can only learn based on what you knew, and what you believed you could possibly learn. Error cannot be properly defined without the prior probability and the likelihood. To expand a little bit, I can believe that there is a 40% chance of rain tomorrow. Then, I can consider a rain detector, and characterize its sensitivity (chance of true positive) and specificity (chance of true negative). By doing so, I have a complete model for the detector’s accuracy — given the experiment’s result, I know how to change the probability of “Rain.” Can you explain how to update the probability of Rain based on experimental results using the frequentist method (p-tests, etc.)?
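The rain-detector update described above can be sketched in a few lines. This is only an illustration: the 40% prior comes from the text, while the sensitivity and specificity values (0.90 and 0.80) and the function name are made up for the example.

```python
def update_rain_probability(prior, sensitivity, specificity, detector_says_rain):
    """Apply Bayes' rule to a binary rain detector.

    sensitivity = Pr(detector says rain | rain)
    specificity = Pr(detector says no rain | no rain)
    """
    if detector_says_rain:
        likelihood_rain = sensitivity
        likelihood_dry = 1.0 - specificity
    else:
        likelihood_rain = 1.0 - sensitivity
        likelihood_dry = specificity
    numerator = likelihood_rain * prior
    evidence = numerator + likelihood_dry * (1.0 - prior)
    return numerator / evidence


prior = 0.40        # my belief in rain before consulting the detector (from the text)
sensitivity = 0.90  # assumed for illustration
specificity = 0.80  # assumed for illustration

posterior_if_positive = update_rain_probability(prior, sensitivity, specificity, True)
posterior_if_negative = update_rain_probability(prior, sensitivity, specificity, False)
print(f"Pr(rain | detector says rain)    = {posterior_if_positive:.3f}")
print(f"Pr(rain | detector says no rain) = {posterior_if_negative:.3f}")
```

With these assumed numbers, a positive reading moves the 40% belief up and a negative reading moves it down; a different prior would give different posteriors from the same detector.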

The frequentist physicist’s probability is not a probability — it is something weird that cannot be used to make decisions, handle uncertainty logically or support the updating of our beliefs. There is no true probability distribution in the universe. Probability is an expression of our belief/knowledge, existing purely in our heads, and owned by the individual. Practically, this approach is behind almost everything useful in our universe — like the spam filters processing this email. Engineers have long embraced the Bayesian model as it produces great results. Children learn in this way (see this TED talk) — constantly trying different hypotheses and updating their prior.

If by statistics, people mean quantitative measurement, I don’t have any issues with the use of quantitative analysis (like computing mean, variance, etc.). If by statistics, people include the Bayesian approach of being logical about uncertainty, I am thrilled with statistics. But if, by statistics, people include the frequentist approach of claiming statistical significance, using confidence intervals, etc., then I have several philosophical and practical concerns, as outlined in my paper.


**Abstract**

Many have documented the difficulty of using the current paradigm of Randomized Controlled Trials (RCTs) to test and validate the effectiveness of alternative medical systems such as Ayurveda. This paper critiques the applicability of RCTs for all clinical knowledge-seeking endeavors, of which Ayurveda research is a part. This is done by examining statistical hypothesis testing, the underlying foundation of RCTs, from a practical and philosophical perspective. In the philosophical critique, the two main worldviews of probability are that of the Bayesian and the frequentist. The frequentist worldview is a special case of the Bayesian worldview requiring the unrealistic assumptions of knowing nothing about the universe and believing that all observations are unrelated to each other. Many have claimed that the first belief is necessary for science, and this claim is debunked by comparing variations in learning with different prior beliefs. Moving beyond the Bayesian and frequentist worldviews, the notion of hypothesis testing itself is challenged on the grounds that a hypothesis is an unclear distinction, and assigning a probability on an unclear distinction is an exercise that does not lead to clarity of action. This critique is of the theory itself and not any particular application of statistical hypothesis testing. A decision-making frame is proposed as a way of both addressing this critique and transcending ideological debates on probability. An example of a Bayesian decision-making approach is shown as an alternative to statistical hypothesis testing, utilizing data from a past clinical trial that studied the effect of Aspirin on heart attacks in a sample population of doctors. As a big reason for the prevalence of RCTs in academia is legislation requiring it, the ethics of legislating the use of statistical methods for clinical research is also examined.


Aspirin Study Model referred to in the paper (uses the Beta Distribution)


In recent years, exposés have shown pharmaceutical companies to be deeply embedded in the culture of medical research, biasing results heavily. Pharmaceutical companies love it when studies support the drug they want to sell and are not happy with studies that indicate otherwise. The New York Times carried an article reporting findings that surgical interventions for breast cancer may have been too hasty, and that many women have had painful treatments they didn’t need. Surgeons love it when someone publishes a new procedure that they may perform on a patient, and get extremely uncomfortable when any research implies that no procedure may help and that the surgeon is better off doing nothing. This conflict of interest in medicine is already common knowledge and ethics committees are doing their best to address it.

However, the problem that Anjali, Imran and Pavan are up against runs much deeper. It can be traced to an unknown time and place, when a hero named Karthik, enamored by his own strength and prowess, challenged his quiet and humble brother, Ganesh, to a game he thought would help establish his superiority. Karthik bet that he could circumambulate the world quicker than Ganesh could, and the judge would be their parents. No sooner had Ganesh accepted the challenge than Karthik jumped onto his vehicle and zoomed off, setting a new record for going around the world three times. Confident that this new record could not be beaten, he waited for Ganesh to do his thing. Ganesh, being pot-bellied, got up slowly, quietly went around his parents three times, and declared, “I’m done.” The stunned Karthik asked him, “What do you mean?” To which Ganesh responded, “My parents are my world, and I have just circumambulated my world.”

India being India, Karthik and Ganesh are now treated as gods, with lots of followers. While Ganesh’s simple demonstration set the record straight for Karthik, unfortunately, the same is not true for their followers in the scientific realm. Karthik is the god of a group of scientists who are labeled “frequentists,” who ban the notion of subjectivity or beliefs, and limit their existential understanding to data in the external world. Ganesh is the god of a much smaller group of scientists who are labeled “Bayesians,” who readily incorporate beliefs into their existential understanding and acknowledge that learning does not happen in a vacuum, but in the context of what we already believe. The Bayesian worldview allows for the fact that two people, coming from entirely different belief systems, may look at the same event and draw entirely different inferences. The Bayesians and the frequentists are not at war. No, that war was won a long time back, and the conclusion was not just that frequentism is a failed philosophy, but that frequentism isn’t even a philosophy but a problematic and narrow method that misguides more than it guides. And yet, strangely, the followers of Karthik continue practicing it, in major fields that touch us very intimately.

To get a little deeper, we shall examine the main method in the frequentist’s arsenal called statistical hypothesis testing, which assumes the following: we know nothing about our universe, and all events that we see are unrelated to each other. Under these strong assumptions, when we aggregate enough trials of what we shall call “random” events, the average of a quantity of interest will be distributed in a bell-shaped curve (called a “normal” distribution). This curve allows us to calculate the chance of seeing a result below a certain level. We then conduct our experiment and gather our data, and based on the sample average and given our expectation of a bell-shaped curve, we check if the chance of getting the average is below 5%. If so, we then claim that the result is statistically significant at the 95% level, which is the standard for most journal papers. By now, many readers might be fooled into thinking all of this sounds rational and sophisticated. An example by applied mathematician John D. Cook illustrates the fundamental problem. By the 95% logic, out of 1000 pre-publication studies of a treatment that is ineffective, we would expect 50 to show an erroneous result of effectiveness being significant, and 950 to not show any significance. The culture of clinical studies is such that journals will not publish results that are not significant. Thus, only the 50 erroneous results will see the light of day and random noise will be elevated to scientific truth. As the standard increases, matters get worse. At the 99% level, only 10 effective results will trump 990 ineffective results and be even more trusted as scientific fact.
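The publication-filter arithmetic above can be checked with a small simulation. This is a sketch under a stated assumption: for a treatment with no effect, the p-value of a continuous test statistic is uniformly distributed between 0 and 1, so each study is modeled as a single uniform draw. The function name and the seed are arbitrary.

```python
import random


def simulate_publication_filter(n_studies=1000, alpha=0.05, seed=0):
    """Simulate studies of an ineffective treatment and a journal that only
    publishes 'significant' results.

    Assumption: under the null hypothesis each study's p-value is a
    uniform draw on [0, 1).
    """
    rng = random.Random(seed)
    p_values = [rng.random() for _ in range(n_studies)]
    published = sum(1 for p in p_values if p < alpha)
    unpublished = n_studies - published
    return published, unpublished


published_95, unpublished_95 = simulate_publication_filter(alpha=0.05)
published_99, unpublished_99 = simulate_publication_filter(alpha=0.01)

# Every published study here claims an effect for a treatment that has none.
print(f"95% level: {published_95} published, {unpublished_95} in the file drawer")
print(f"99% level: {published_99} published, {unpublished_99} in the file drawer")
```

The counts fluctuate around 50 and 10 from run to run, but whatever reaches print in this toy setup is noise by construction.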

The reader may wonder why this problem cannot be solved by also accepting papers that do not claim statistical significance. While that will help, the issue is more fundamental – we live in a world where we know something about everything. To repudiate our existing knowledge and feign ignorance is pseudoscientific and amounts to pretending to be in a position where our mind is the most open, while ignoring that we learn the most when we know nothing. Consider someone who claims to be able to read your mind. If you take the traditional scientific approach and believe you know nothing at all on this topic, then a few positive trials should start to swing your judgment quickly toward believing the mind-reader. However, if you start with a skeptical position, then a few positive trials will not be enough – you will need many consecutive positive results before you start changing your position. In other words, we learn depending on where we stand. A position of total ignorance is also a position, one that results in great importance being given to the results of fewer studies, and should thus only be chosen if we truly have no prior knowledge. The underlying value here is truth – we must start with the whole truth about our prior position if we are interested in arriving at practical results of our scientific inquiry that we can trust.
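The mind-reader example can be made concrete with a two-hypothesis sketch. All the numbers are assumptions chosen for illustration: a genuine mind-reader is taken to be always right, a guesser is taken to be right one time in ten, and the two priors (0.5 for "total ignorance", one in a million for the skeptic) are arbitrary.

```python
def belief_after_hits(prior, consecutive_hits, p_hit_if_guessing=0.1):
    """Posterior probability that the claimant is a genuine mind-reader
    after a run of consecutive correct readings.

    Assumptions: a genuine mind-reader is always correct; a guesser is
    correct with probability p_hit_if_guessing on each trial.
    """
    likelihood_genuine = 1.0
    likelihood_guessing = p_hit_if_guessing ** consecutive_hits
    numerator = likelihood_genuine * prior
    return numerator / (numerator + likelihood_guessing * (1.0 - prior))


open_minded_prior = 0.5   # "I know nothing at all on this topic"
skeptical_prior = 1e-6    # strong prior disbelief

for hits in (1, 3, 10):
    open_belief = belief_after_hits(open_minded_prior, hits)
    skeptic_belief = belief_after_hits(skeptical_prior, hits)
    print(f"{hits:2d} hits: open-minded {open_belief:.4f}, skeptic {skeptic_belief:.6f}")
```

Under these assumptions three hits nearly convince the observer who started from ignorance, while the skeptic needs a much longer run before shifting position.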

In January 2011, Newsweek carried an article titled “Why Almost Everything You Hear About Medicine Is Wrong,” ringing the first death knell in the popular media of clinical medical science as we’ve known it for the last four decades. The article focused on the work of Prof. John Ioannidis (currently at Stanford University), who published a paper titled “Why Most Published Research Findings Are False,” where he shows analytically, using a Bayesian approach, how the positive predictive value of a study (the chance that a theory is true given that a study says it’s true) decreases based on a variety of practical factors that are usually ignored by the scientific community, leading to the amplification of random noise as scientific theory. He also emphasizes what my professor would say in a folksy way, “when you have lots of data, even the village idiot can tell you what’s going on.” There is no value added by statistical hypothesis testing when the sample size is large. The whole promise of statistics was to aid our intuition when the sample size is small, and this is where statistical hypothesis testing fails spectacularly, by forcing unrealistic assumptions that result in misleading conclusions.
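To make the idea of positive predictive value concrete, here is a minimal sketch using plain Bayes' rule. It is a simplification and not the full model in the Ioannidis paper, which also accounts for bias and multiple competing teams. The power (0.8) and significance threshold (0.05) are assumed values, and the function name is hypothetical.

```python
def positive_predictive_value(prior, power=0.8, alpha=0.05):
    """Pr(theory is true | study reports a significant result).

    prior: belief, before the study, that the tested relationship is true
    power: Pr(significant result | relationship is true)   (assumed)
    alpha: Pr(significant result | relationship is false)  (assumed)
    """
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    return true_positives / (true_positives + false_positives)


for prior in (0.5, 0.1, 0.01):
    ppv = positive_predictive_value(prior)
    print(f"prior = {prior:<5} ->  PPV = {ppv:.3f}")
```

The same "significant" result carries very different weight depending on how plausible the relationship was before the experiment was run.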

In dramatic style, Prof. Ioannidis published a paper in the Journal of the American Medical Association (JAMA) titled “Contradicted and Initially Stronger Effects in Highly Cited Clinical Research,” which showed that 32% of the most highly-cited studies in the field had exaggerated results. In other words, the high priests of the religion were wrong roughly one time in three. What hope then for the rest of us? He makes many practical recommendations, some of which include taking a holistic approach, looking at the totality of evidence in the entire community, registering all studies being conducted and committing to share the results no matter what they are, avoiding pharmaceutical company-driven research on good side-effects of existing drugs (these studies have been particularly prone to amplifying random noise as positive results due to vested interests) and giving up the obsession with statistical significance. He further encourages the Bayesian approach with the following recommendation: “Before running an experiment, investigators should consider what they believe the chances are that they are testing a true rather than a non-true relationship.” This is in essence a strong affirmation of the Bayesian approach: telling the whole truth about where we stand right now, which will help determine how we learn. Although Prof. Ioannidis is not very optimistic about these major changes coming in easily, it is incumbent upon the scientists of our time to reflect on the implosion of medical science and ask some hard questions.

One important question that Ioannidis, and the scientific community at large, does not explicitly tackle is: how do we come up with a hypothesis? To push us to the limits of our knowledge, I find the “astrologer question” useful. If an Indian medical astrologer tells me that I have an inflamed liver, and on performing an ultrasound, I find that to be the case, what hypothesis should I form? I have the experience of just myself, in one situation in which the astrologer’s prediction proved uncannily and unexpectedly correct. Suppose I form a hypothesis test by checking the accuracy of all medical predictions by this astrologer, and test how many turn out to be correct. Suppose further that a majority are not accurate. Should I reject astrology altogether? Or should I then try to discover what might explain my experience? Some methods like ethnography from the social sciences allow induction of hypotheses from the best available data. The notion of hypothesis induction is taboo in the world of classical statistics which treats the use of data to form hypotheses as tampering. As we attempt a transition from being followers of Karthik to becoming followers of Ganesh, the question of hypothesis formation merits some pause. Being a Bayesian may be necessary for practical scientific thinking, but it may not be sufficient.

As legendary scholar Abraham Maslow (of Maslow’s Hierarchy of Needs fame) points out in his book, The Psychology of Science, the current methods of the “scientific orthodoxy” are the methods of physics and astronomy that were born of the Industrial Revolution. During that time, a mechanistic worldview was imposed upon every pursuit, from education to business to the sciences. While this may have yielded some results in the inanimate world, reducing humans to chemicals, mechanistic particles or single numbers is problematic and unlikely to yield practical insights of a holistic being that is far more than the sum of its parts. Methods in medicine that maintain a holistic approach in regard to understanding humans and their needs have been around in the East, but are yet to be accepted by Western scientific orthodoxy.

All of this can be very disturbing for Anjali, Imran and Pavan to accept, notwithstanding the acceptance of most of the criticism by the scientific community that is largely invisible to the public. They may rightfully question our challenge of Karthik’s religion, and, ignoring the evidence already cited, ask for more real-world evidence that Karthik’s religion is a false one, and that Ganesh’s religion might be a better one. To answer this, we need to take a journey into other fields that have used similar methods. A primary example is statistical finance, whose mess we are still cleaning up wherever the methods were used, after a stupendous collapse that was starkly predicted by the gutsy options trader Nassim Taleb, author of Fooled by Randomness and The Black Swan. In an article in the New Yorker, Jonah Lehrer covers, among other things, the crisis in the field of psychology, which has been a heavy user of statistical hypothesis testing. Theories in that field have shown a “decline effect,” where once the theory is established, it becomes harder and harder to replicate, ultimately becoming next to impossible. Not quite an example just yet, but one to watch out for would be climate science, where we can expect to see an implosion in the next few years as they use a lot of statistical hypothesis testing.

On testing out Ganesh’s religion in medical science, although much work remains to be done, Prof. Stephen Schneider, a rare Bayesian climatologist from Stanford University, found himself facing off with the medical establishment when he developed a rare form of cancer with a stark prognosis. Using his understanding of the Bayesian philosophy, he fought the system and prevented himself from becoming another statistical casualty. In the process, he helped develop a new protocol for cancer treatment, survived eight years longer than his doctors thought he would, and left behind a book called “The Patient from Hell,” which points out the fallacies of evidence-based medicine (a fancier name for classical statistical science), advocating instead for a Bayesian outlook. The main argument is that an individual is not a statistical average – therefore, when trying treatments, we should not be dogmatic about requiring statistical data (especially when there are none for rare illnesses) – we must be willing to combine our beliefs with contextual observations to make sensible inferences. The book is remarkable in that it is written not just for doctors and patients facing dreadful diseases like cancer, but for what Prof. Schneider called “patient advocates,” perhaps labeling a new profession that applies the Bayesian philosophy to find practical treatment solutions for each individual patient, customized to their context.

Even after all this evidence, Anjali desperately clings on to IVF for hope. A pioneer of the technique, Dr. Sami David, said in a CBS interview that IVF has “gone amok,” with countless women who have treatable causes of infertility opting instead for IVF, because it is pushed aggressively by the clinic. People on both sides of the table are either not interested or don’t have the time to get to the root cause of the problem. Dr. Geeta Nargund, head of reproductive medicine at St. George’s hospital, said in a Daily Mail interview, “Women are going around from clinic to clinic and receiving different doses of these drugs but there is no sound scientific evidence to show that it will help improve their chances of conceiving.” Imran swallows hard when a little online research reveals a number of studies that have failed to show any link between vitamins and longer life or better health. While absence of evidence is not evidence of absence, one wonders what the science was that led the vitamin manufacturers to start their business (was it statistical hypothesis testing?). Pavan meanwhile still refuses to question his nutritionist’s source of information on ghee, although Ayurvedic practitioners consider it therapeutic when used in moderation, and studies that have vilified ghee have involved statistical hypothesis testing.

Still not convinced, the trio ask, if clinical medical science has been using erroneous methods, how have we made important advances in our understanding using these methods? A better question would be whether there exists a third factor that explains both scientific progress and the results of statistical hypothesis testing. Perhaps that factor is simply the quality of observation of the phenomenon under study by the scientist; long before statistical methods, scientists have been keen observers of nature, discovering astounding truths by committed and powerful observation.

**Note:** *While Anjali, Imran and Pavan are fictional characters, their attitudes are not and represent conversations I’ve had with different people over the years.*

**Acknowledgments:** *Thanks to Dr. Thomas Seyller, Michael Silberman and Francisco Ramos-Stierle for helpful feedback during the development of this piece.*


To quote Cook,

Here’s an example that shows how p-values can be misleading. Suppose you have 1,000 totally ineffective drugs to test. About 1 out of every 20 trials will produce a p-value of 0.05 or smaller by chance, so about 50 trials out of the 1,000 will have a “significant” result, and only those studies will publish their results. The error rate in the lab was indeed 5%, but the error rate in the literature coming out of the lab is 100 percent!

I may point out to the discerning reader that even if our frequentist scientists started reporting all their non-significant results, the 5% error rate conclusion is still based on a fundamental misunderstanding of probability that prevents us from using the results of such experiments in rational decision-making.


**The problem (source: The Cartoon Guide to Statistics, Pg 160)**

Harvard conducted a study that attempted to look at the effectiveness of aspirin in reducing heart attacks.

22,071 subjects (volunteer doctors) were randomly assigned to two groups. One group was given a placebo, while the other was given aspirin. They were observed for five years, and here are the results of that observation:

| | Attack | No Attack | n | Attack Rate |
|---|---|---|---|---|
| Placebo | 239 | 10,795 | 11,034 (n1) | 0.0217 (p1) |
| Aspirin | 139 | 10,898 | 11,037 (n2) | 0.0126 (p2) |

Members of the Placebo group were 1.72 times likelier to suffer a heart attack than those in the aspirin group.

Up to this point, we have no objections to the study and how its results have been reported. The harm begins from this point on.

**Null Hypothesis (H0):** Aspirin had no effect: p1-p2 = 0

**Alternate Hypothesis (H1):** Aspirin does reduce the heart attack rate: p1>p2

p1-p2 = 0.0091

Standard Error (SE):

$$SE(p_1 - p_2) = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} = 0.001746$$

The test statistic Z is: (p1-p2)/SE(p1-p2)

Zobs (observed) = 0.0091/0.001746 = 5.193 (computed from the unrounded difference, 0.009066)

p-value = Pr(Z>=Zobs) = Pr(Z>=5.2) = 0.000000103

In other words, “if the null hypothesis were true, the probability of observing an effect this large is one in ten million – very strong evidence against H0!!” The difference is statistically significant (at the 95% level).
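For readers who want to reproduce the numbers above, here is a short sketch of the calculation. It assumes the unpooled standard error for a difference of two proportions (which matches the 0.001746 reported above) and a one-sided normal tail probability; the function name is made up for the example.

```python
import math


def two_proportion_z_test(attacks_1, n_1, attacks_2, n_2):
    """One-sided z test for H1: p1 > p2, using the unpooled standard error."""
    p_1 = attacks_1 / n_1
    p_2 = attacks_2 / n_2
    se = math.sqrt(p_1 * (1 - p_1) / n_1 + p_2 * (1 - p_2) / n_2)
    z = (p_1 - p_2) / se
    # Upper-tail probability of a standard normal: Pr(Z >= z)
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return se, z, p_value


# Placebo: 239 attacks out of 11,034; aspirin: 139 attacks out of 11,037
se, z_obs, p_value = two_proportion_z_test(239, 11034, 139, 11037)
print(f"SE      = {se:.6f}")
print(f"Z_obs   = {z_obs:.3f}")
print(f"p-value = {p_value:.3g}")
```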

**Critiquing Results**

The statistical significance test makes us believe that the effect we saw is extremely significant, or, very rare to see by chance. The first question we want to ask is: what is the maximum number of heart attacks in the aspirin dataset that will give us statistical significance? The following graph gives us the answer.

So, if the results had looked as follows:

| | Attack | No Attack | n | Attack Rate |
|---|---|---|---|---|
| Placebo | 239 | 10,795 | 11,034 (n1) | 0.0217 (p1) |
| Aspirin | 204 | 10,833 | 11,037 (n2) | 0.0185 (p2) |

the difference would have been statistically significant, and we would have rejected the null hypothesis. I can’t help scratching my head saying, 204 looks awfully close to 239 – what the heck is significant about it?
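The threshold of 204 can be checked by brute force. The sketch below assumes the same one-sided z test with an unpooled standard error and a 5% cutoff, holds the placebo group and both sample sizes fixed, and scans the number of aspirin-group heart attacks; the helper names are hypothetical.

```python
import math


def one_sided_p_value(attacks_1, n_1, attacks_2, n_2):
    """Pr(Z >= z_obs) for H1: p1 > p2, with an unpooled standard error."""
    p_1 = attacks_1 / n_1
    p_2 = attacks_2 / n_2
    se = math.sqrt(p_1 * (1 - p_1) / n_1 + p_2 * (1 - p_2) / n_2)
    z = (p_1 - p_2) / se
    return 0.5 * math.erfc(z / math.sqrt(2))


def max_significant_attacks(alpha=0.05):
    """Largest aspirin-group attack count that is still 'significant'."""
    largest = None
    for attacks in range(0, 240):
        if one_sided_p_value(239, 11034, attacks, 11037) < alpha:
            largest = attacks
    return largest


threshold = max_significant_attacks()
print(f"Largest 'significant' aspirin attack count: {threshold}")
```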

Next, we note that both sample sizes are roughly the same. If we set them equal and run a sensitivity analysis of the p-value against sample size, we find the following:

What this is telling me is that starting from 500 samples, if I see the same spread of data (239 vs 139) irrespective of sample size, it will always be a statistically significant result. Now I’m really scratching my head. If I see the same results in a million samples each, I wouldn’t be able to infer much from their difference, and yet, the methods are telling me to treat the difference as significant.
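The sensitivity run described above can be sketched as follows. It assumes the counts stay fixed at 239 and 139 while a common sample size n varies, and it uses the same one-sided z test with an unpooled standard error; the particular list of sample sizes is arbitrary.

```python
import math


def z_and_p(attacks_1, attacks_2, n):
    """z statistic and one-sided p-value when both groups have size n."""
    p_1 = attacks_1 / n
    p_2 = attacks_2 / n
    se = math.sqrt(p_1 * (1 - p_1) / n + p_2 * (1 - p_2) / n)
    z = (p_1 - p_2) / se
    return z, 0.5 * math.erfc(z / math.sqrt(2))


sample_sizes = [500, 1_000, 11_034, 100_000, 1_000_000]
results = {n: z_and_p(239, 139, n) for n in sample_sizes}

for n, (z, p) in results.items():
    print(f"n = {n:>9,}: z = {z:.3f}, p-value = {p:.2e}")
```

With the counts held fixed, the z statistic settles just above 5 as n grows, so the verdict of "significant" never changes.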

**Bayesian Insights**

To get more insights, we will jump to the Bayesian worldview, within which the frequentist worldview results from a very specific assumption – I know nothing about the universe and am absolutely sure about my ignorance. I will now pick a Beta distribution for its versatility in modeling my prior beliefs, and I will start with a uniform prior, or a Beta(1,1). This means that I believe that the distribution of the long run fraction of heart attacks for each of the cases (placebo and aspirin) could be just about anything. Then, as I see the results, I will update the beta distribution by adding the number of heart attacks and the total number of trials to each beta parameter. This gives me a posterior distribution of what I ought to believe on the distribution of heart attacks given whether I take the placebo or the aspirin, as shown in the figure below.
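Here is a minimal sketch of the update just described. One assumption needs flagging: the code uses the common Beta(α, β) parameterization, in which α grows by the number of heart attacks and β grows by the number of subjects without one. The wording above ("the total number of trials") suggests the linked model may instead track successes and total trials, so the parameter values may be labeled differently there.

```python
import math


def beta_update(prior_alpha, prior_beta, attacks, n):
    """Conjugate update of a Beta prior with binomial data.

    Assumed parameterization: alpha counts attacks, beta counts non-attacks.
    """
    return prior_alpha + attacks, prior_beta + (n - attacks)


def beta_mean_sd(alpha, beta):
    """Mean and standard deviation of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total ** 2 * (total + 1))
    return mean, math.sqrt(variance)


# Uniform Beta(1, 1) prior for both groups, updated with the trial data
placebo_posterior = beta_update(1, 1, 239, 11034)
aspirin_posterior = beta_update(1, 1, 139, 11037)

for name, params in (("placebo", placebo_posterior), ("aspirin", aspirin_posterior)):
    mean, sd = beta_mean_sd(*params)
    print(f"{name}: Beta{params}, mean = {mean:.4f}, sd = {sd:.4f}")
```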

The two distributions look awfully close. Thankfully, we shall entertain no notions of statistical significance and report this as is. We have to be careful about our tendency to exaggerate. For instance, if I wanted to make the difference look big, I could change the horizontal scale, like so:

Now, being a Bayesian allows me to examine my own beliefs. I certainly don’t believe that in the given sample, I could see just about any fraction of heart-attacks. Rather, I’d start by expressing a prior that was somewhat like so: Beta(5,25) for placebos, and Beta(2.5,25) for Aspirin cases.

The bigger spread represents more uncertainty. After expressing such a prior, I can start the experiment and update my beliefs. After the 5-year period, I’d find:

Compared to the distribution we got by believing a uniform-prior, we find this to be almost indistinguishable. This demonstrates the insight: “When you have lots of data, it does not really matter what your starting belief is.”
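The claim that the prior hardly matters at this sample size can be checked numerically. In this sketch, Beta(5,25) and Beta(2.5,25) are read in the common (α, β) parameterization, which is an assumption; under a successes-and-trials reading the prior means would differ slightly, but the comparison works the same way.

```python
def posterior_mean(prior_alpha, prior_beta, attacks, n):
    """Posterior mean of the attack rate under a Beta prior.

    Assumed parameterization: alpha counts attacks, beta counts non-attacks.
    """
    alpha = prior_alpha + attacks
    beta = prior_beta + (n - attacks)
    return alpha / (alpha + beta)


data = {"placebo": (239, 11034), "aspirin": (139, 11037)}
uniform_prior = {"placebo": (1, 1), "aspirin": (1, 1)}
informed_prior = {"placebo": (5, 25), "aspirin": (2.5, 25)}

shifts = {}
for group, (attacks, n) in data.items():
    mean_uniform = posterior_mean(*uniform_prior[group], attacks, n)
    mean_informed = posterior_mean(*informed_prior[group], attacks, n)
    shifts[group] = abs(mean_informed - mean_uniform)
    print(f"{group}: uniform prior -> {mean_uniform:.5f}, "
          f"informed prior -> {mean_informed:.5f}, shift = {shifts[group]:.5f}")
```

With roughly eleven thousand observations per group, swapping priors moves the posterior mean only in the fourth decimal place.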

By completely abandoning all talk of statistical significance, and seeing the results on this scale as so close to each other, we end up focusing on the most important questions, such as:

- Are doctors more likely to take good care of their health such that the effect of aspirin is minimal?
- How were the doctors in the sample taking aspirin? How did we control for its usage across doctors?
- What is the X-factor, knowing which, we’d consider aspirin consumption and heart attacks to be irrelevant?


**Truth:** Telling the whole truth is a hallmark of the scientific temper. This manifests in many ways. One of them is an admission of uncertainty. We don’t know many things and are comfortable revealing how much we know and how much we don’t. Another is attempting to tell the story like it is, without attempting to distort our findings for motives other than discovery of knowledge.

**Learning:** Scientists value learning. The three desiderata of scientific learning that I can think of are:

- **What we learn today should not be invalidated tomorrow in its original context:** In other words, when situations are discovered where my theory does not work, those situations call for an extension of the theory without nullifying it. My theory should become a specialized theory, and would continue to hold in the context in which it was proposed. An example of this is Newton’s laws of motion. We know they don’t help explain atomic motion. And yet, they continue to explain the motion of large bodies. One way to violate this desideratum is when a theory not only fails to explain phenomena outside the investigator’s original context, but also fails in a replication of the investigator’s original context. When this happens, the theory was not scientific to begin with, and the methods followed may be pseudo-scientific.
- **We should learn based on what we know:** When we are fairly certain of something, we should not learn much from a few experiments. When we are fairly uncertain of something, we should learn much from a few experiments. Imagine if the opposite were true. Suppose we were told that someone is a psychic. If we held a totally open mind, then at the first instance of an experiment where someone demonstrated psychic ability, we would have to admit a good chance that people could be psychic. However, if we started off believing that people cannot be psychic, then one experiment would have minimal impact on our belief, and only after several experiments showing conclusive evidence would we want to start changing our belief. Honoring this desideratum protects us from random noise in experiments. The reader is referred to Feynman’s excellent mind-reader example in This Unscientific Age to see how a scientist ought to investigate such phenomena.
- **We should always leave room for doubt:** While being certain about repeated results is important, or no knowledge creation can happen, it is just as important to leave a little room for doubt, without which we would not be able to learn anything new. We note that this is quite different from believing we know nothing about the universe we live in, which some scientists claim to believe to further science.

**Practical:** The scientific method must be practical, in that it must give us results when applied to any object of inquiry. If our method does not generate learning on its object, then it is likely not very scientific.

These values are largely inspired by Feynman’s lecture on The Value of Science, which I’ve found to be one of the clearest articulations on the subject. The reader should continue reading only if the desiderata provided are deemed acceptable.

I will proceed to explain how statistical hypothesis testing does not hold up to the desiderata of science.

**The Problem with Statistical Significance**

Starting with truth, I find it problematic that hypothesis testing relies on notions of statistical significance that amplify the results of an experiment due to the word “significant,” when in fact, determining whether something is significant is a pursuit that is outside the realm of science, involving value judgment in a decision context, of which, the method of hypothesis testing makes no mention. It is the duty of scientists to use terms that clarify and not deceive the reader.

All the classical statistician means with statistical significance is that “the chance of seeing such a result like the one we’re seeing being caused entirely by chance is less than 5%.” Now there is something blatantly wrong about using a term in its own definition – we usually do that for things that we do not quite understand. Correcting it to simply say “the chance of seeing such a result is below 5%,” there is still a gaping hole, exposed by the question: whose chance is it? Probability does not exist in the world – it is a construct we’ve created to aid our thinking, and hence, any chance that we talk about is “our” chance – a representation of our belief. We cannot sever probability from the individual who comes up with it. This is also important in order to honor the desiderata of learning – we learn based on what we know. Therefore, clarity on who is learning is central. A simple correction would fix this, such as, “Given what “I” know, the chance of seeing a result like this is less than 5%.”

This is more than a semantic quibble. The worldview of the hypothesis tester does not allow existing knowledge to be incorporated. The implication of being a hypothesis tester in a worldview consistent with our desiderata is that we believe we know nothing about our universe and we are absolutely sure about this assertion. Suppose for a moment that like the White Queen (in “Through the Looking Glass, and What Alice Found There,” who could believe six impossible things before breakfast), we can also believe this.

We now come to a critical issue – why is 5% ok for statistical significance? My professor would often ask in class how many of us were willing to step out of the room if we knew that there was a crazy guy with a gun outside the building and we had a 5% chance of being shot. He would then share that he’d do his best to stay in the building until this problem was resolved, as a 5% chance of dying was a very high risk to take. By claiming statistical significance on a result (especially a medical one) and accepting a 5% margin of error, we are imposing our own tolerance for risk on other people without verifying that they are indeed ok with such a level of risk. Deciding what constitutes acceptable risk for others without their knowledge is unethical – it involves deception and possibly harm. Unfortunately, statistics classes rarely make clear the ethical implications of using this method in medical research.

The next problem in statistical significance calculations is that of sample size. Different schools of science have different standards. In the field of Psychology, it used to be the case that 30 samples were enough to claim statistical significance. In the field of Biology, it turns out to be 6. What is the science behind it? I didn’t understand how this works and am still looking for clarity on this. Why not 35 samples? Why not 25 samples? The statisticians now start citing confidence intervals to justify these numbers, so we’ll go on to confidence intervals next.

**The Problem with Confidence Intervals**

Using the same desiderata as before, we find that confidence intervals, like statistical significance, are another example of a term used in a way that is not consistent with its popular English meaning. Given how many students of statistics misunderstand confidence intervals, I would argue that it is unethical for professors of statistics to continue using the term. People often take confidence intervals to be about confidence, and start to think that a 95% confidence interval implies that there is a 95% chance of finding the quantity they are after in the given interval. This is not true, as the professor of statistics will be quick to tell us, and the key lies in religiously chanting, “If we were to construct these intervals many times (read thousands), then 95% of the time, the quantity of interest would lie in these intervals,” as opposed to, “the chance that the quantity of interest is in this interval is 95%.” Confused? So am I. But this is just the beginning of the religiosity.
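The “many times” reading can at least be checked by simulation. The following is a minimal sketch in Python, assuming normally distributed measurements with a known standard deviation (so that a simple z-interval applies). The true mean, the standard deviation, and the sample size are arbitrary illustrative values, not taken from any study.

```python
import random
import statistics

def coverage_rate(true_mean=10.0, sigma=2.0, n=30, trials=5000, seed=0):
    """Fraction of 95% z-intervals (known sigma) that contain true_mean."""
    rng = random.Random(seed)
    z = 1.96  # two-sided 95% critical value of the standard normal
    half_width = z * sigma / n ** 0.5
    hits = 0
    for _ in range(trials):
        # One "repetition of the experiment": draw n fresh measurements.
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        m = statistics.fmean(sample)
        if m - half_width <= true_mean <= m + half_width:
            hits += 1
    return hits / trials

rate = coverage_rate()
print(f"fraction of intervals covering the true mean: {rate:.3f}")
```

The fraction comes out close to 0.95, but it describes the procedure over thousands of hypothetical repetitions. Nothing in the calculation says anything about the one interval we actually hold in our hands, which is the distinction the paragraph above is wrestling with.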

Let’s assume we are clear about the definition of confidence intervals the way the classical statisticians want us to be. Ok – can I use the 95% chance as a probability to make decisions? No. Because it is not a probability! Ok, ok – looks like I wasn’t clear on this the first time. Now that I’ve constructed a confidence interval, and I observe the result of an experiment, how do I update the interval? I can’t, comes the answer. That would violate the principles of science. Huh? It turns out that the classical statisticians believe that updating the interval amounts to tampering. We need to throw away our present data, and do the experiment all over again in order to construct a new confidence interval. This squarely violates our desideratum on learning, and also the desideratum of practicality. There are many experiments that cannot be repeated all the time, and if we are to throw away scarce results and not utilize them for learning, we are not being practical.

**Broader Issues with Statistical Hypothesis Testing**

The ground on which this method stands requires us to believe that the “true distribution” of a quantity of interest is a normal distribution, and check if we are able to reject this hypothesis. However, we know of many things to which we’d hesitate to ascribe a normal distribution. The statistician counters by invoking the central limit theorem, which says that the mean of a “sufficiently large” number of independent random variables with finite variance will be approximately normally distributed. So, by taking the mean of our quantity of interest from different readings, each considered random and independent, we are not doing anything wrong by treating it as a normal distribution.
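To see what the statistician’s appeal to the central limit theorem buys, here is a small Python sketch. It assumes readings drawn independently from an exponential distribution, which is my choice of a strongly skewed example and not something the method prescribes, and it measures how lopsided the distribution of the sample mean is as the number of readings grows. A skewness of zero indicates a symmetric shape such as the normal.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: the mean cubed z-score (0 for a symmetric shape)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])

def skew_of_sample_means(n, reps=5000, seed=3):
    """Skewness of the means of `reps` samples, each of n exponential draws."""
    rng = random.Random(seed)
    means = [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(reps)
    ]
    return skewness(means)

for n in (1, 5, 50):
    print(f"n = {n:>2}: skewness of the sample mean = {skew_of_sample_means(n):.2f}")
```

The skewness shrinks toward zero as n grows (in theory it is 2/√n for this distribution), so the mean does look more normal with more readings. Note that the sketch builds in exactly the independence that the theorem requires; it says nothing about what happens when the readings are related.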

The problem with this line of reasoning is twofold. First, we often have reason to believe that the phenomena we see may be related, violating our independence assumption. Second, the underlying notion of a “true distribution” from which these phenomena have been generated (and to which all error measures are directed) assumes a “God distribution.” Where did such a notion come from? Like a religious creed, the classical statistician treats such a question as taboo. Some scholars, trying to circumvent these questions, get into the exercise of discovering “nature’s distribution,” which is even more problematic for implying that nature actually has a distribution, when it really is us who are characterizing our state of information with a distribution.

**The impact of using statistical hypothesis testing**

One big impact of using this method has been the elevation of random noise to scientific knowledge, as a direct consequence of the method’s refusal to incorporate current knowledge (because we learn the most from a single study when we start with total ignorance, instead of requiring more evidence as we’d like to when we already have some knowledge to the contrary). One can thus see how the field of nutrition flip-flops by suddenly finding new enemies to our health, only to retract the declaration a decade later. Finding new challenges to existing understanding is always welcome. Accepting results from a single study, however, to throw away our current knowledge is reckless. A matter of great concern is that, in the medical field, a curious effect can be noticed. Studies of the past cannot be replicated to give similar results in the present (to see this, do a Google Scholar search on failure to replicate; also read the Newsweek article Why Almost Everything You Hear About Medicine is Wrong; Jonah Lehrer’s The Truth Wears Off). This violates the desideratum that new knowledge should not invalidate the results of older experiments in the same context, and leads us to wonder about the scientific validity of the method of statistical hypothesis testing used in all these studies.

Ioannidis and Lehrer provide an explanation for this failure, which is summarized in John Cook’s blog post:

Suppose you have 1,000 totally ineffective drugs to test. About 1 out of every 20 trials will produce a p-value of 0.05 or smaller by chance, so about 50 trials out of the 1,000 will have a “significant” result, and only those studies will publish their results. The error rate in the lab was indeed 5%, but the error rate in the literature coming out of the lab is 100 percent!
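The arithmetic in this example is easy to reproduce. Below is a minimal Python sketch, assuming each trial compares a treated and a control group of 50 patients whose outcomes come from the same standard normal distribution (so every drug is ineffective by construction), and that significance is judged with a two-sided z-test at the 5% level. The group size and the test are my assumptions for illustration; the quoted example does not specify them.

```python
import math
import random
import statistics

def simulate_null_trials(n_drugs=1000, n_per_arm=50, alpha=0.05, seed=1):
    """Count 'significant' trials when no drug has any real effect."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_drugs):
        # Both arms are drawn from the same distribution: the drug does nothing.
        treated = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
        diff = statistics.fmean(treated) - statistics.fmean(control)
        z = diff / math.sqrt(2.0 / n_per_arm)  # known unit variance in each arm
        p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p-value
        if p_value <= alpha:
            significant += 1
    return significant

published = simulate_null_trials()
print(f"'significant' trials out of 1000 ineffective drugs: {published}")
print("share of published findings that are false: 100%")
```

The count lands in the neighborhood of 50, with the exact figure depending on the seed. Since only those trials are published and none of the drugs works, every published finding is wrong.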

Given our discussion so far, the discerning reader may note that while it would be a good idea for our frequentist scientists to start reporting all their non-significant results, the 5% error rate conclusion is still based on a fundamental misunderstanding of probability that prevents us from using the results of such experiments in rational decision-making. We would do well to examine the morality of our decision to ignore existing scientific knowledge knowing that our learning will be artificially amplified due to the deficiency of our method (we learn the most from a single experiment when we assume total ignorance), having a possibly devastating effect on those who will be impacted by our research (especially in the medical field).

**Explaining the advance of science**

The astute reader will at this point raise the objection that if statistical hypothesis testing is so bad, then how come science has advanced so much using this method? There are many discoveries, admittedly, that have been reported using this method. To examine this argument, we must fall back on that sage advice, “correlation is not causation,” and examine how hypotheses get made. As the hypothesis tester will remind us, the method of hypothesis testing considers the creation of hypotheses to be outside the scope of the method. Hypotheses can be pulled out of a hat, but in the real world, they are the result of people spending lots of time examining a phenomenon and noticing patterns. Since time immemorial, science has advanced without any notion of statistical hypothesis testing, arguably because of the powers of observation that we humans find in ourselves. I will argue that it is this power of observation, more than anything else, that has led to the advance of science. When good observation has been coupled with the use of the statistical method of hypothesis testing, the latter has become unfortunately conflated with the former. To improve in science, therefore, scientists have had to master their own minds, not with obfuscatory statistical techniques but by slowing down and making a ruthless commitment to seeing things as they are and not as they want them to be.

One might then ask how statistical hypothesis testing got started. It had to do with farming, where there was plentiful data. Having lots of data has one interesting side-effect, as my professor would say – “even the village idiot can tell you what’s going on without these methods.” Just like the conflation of good observation with the merits of the method, plentiful data hid the defects of the method, for when we have lots of data, it does not matter what prior belief we pick – we will eventually get to the right conclusions. The problem arises when we have scarce data, which is where statistical hypothesis testing gets used the most.
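The claim that plentiful data makes the choice of prior irrelevant can be illustrated with a short Python sketch. It assumes a yes/no outcome with an unknown rate, and two observers who start from opposite Beta priors: an optimist with Beta(8, 2) and a pessimist with Beta(2, 8). The priors, the true rate of 0.3, and the sample sizes are hypothetical values chosen only for illustration.

```python
import random

def posterior_mean(prior_a, prior_b, successes, failures):
    """Mean of the Beta posterior after observing binomial data."""
    return (prior_a + successes) / (prior_a + prior_b + successes + failures)

def compare_priors(true_rate=0.3, n=10, seed=2):
    """Posterior means for an optimist and a pessimist seeing the same data."""
    rng = random.Random(seed)
    successes = sum(rng.random() < true_rate for _ in range(n))
    failures = n - successes
    optimist = posterior_mean(8, 2, successes, failures)   # prior mean 0.8
    pessimist = posterior_mean(2, 8, successes, failures)  # prior mean 0.2
    return optimist, pessimist

for n in (10, 100, 10000):
    opt, pes = compare_priors(n=n)
    print(f"n = {n:>5}: optimist {opt:.3f}, pessimist {pes:.3f}, gap {opt - pes:.4f}")
```

With these priors the gap between the two observers is exactly 6 / (10 + n), so it is 0.3 after ten observations and under 0.001 after ten thousand. With scarce data the prior matters a great deal; with plentiful data almost any starting belief ends up in the same place.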

**Conclusion**

Due to all these issues, I am inclined to conclude that statistical hypothesis testing fails on multiple levels in its consistency with the desiderata of a scientific method. In plainspeak, statistical hypothesis testing is problematic as a scientific method, notwithstanding its wide usage in many scientific pursuits. In future posts, I will examine an older (and suddenly newer) method of thinking that transcends the difficulties of statistical hypothesis testing.
