Having examined the issues with statistical hypothesis testing, we shall now use a large-sample experiment to illustrate how we might report results, both with and without statistical hypothesis testing.

**The problem** (source: *The Cartoon Guide to Statistics*, p. 160)

Harvard conducted a study of the effectiveness of aspirin in reducing heart attacks.

22,071 subjects (volunteer doctors) were randomly assigned to two groups. One group was given a placebo, while the other was given aspirin. They were observed for five years, and here are the results of that observation:

| | Attack | No Attack | n | Attack Rate |
|---|---|---|---|---|
| Placebo | 239 | 10,795 | 11,034 (n1) | 0.0217 (p1) |
| Aspirin | 139 | 10,898 | 11,037 (n2) | 0.0126 (p2) |

Members of the Placebo group were 1.72 times likelier to suffer a heart attack than those in the aspirin group.

Up to this point, we have no objections to the study or to how its results have been reported. The harm begins from this point on.

## Statistical Hypothesis Testing Results

**Null Hypothesis (H0):** Aspirin has no effect: p1 - p2 = 0

**Alternate Hypothesis (H1):** Aspirin does reduce the heart attack rate: p1>p2

p1-p2 = 0.0091

Standard Error (SE):

SE(p1-p2) = sqrt( p1(1-p1)/n1 + p2(1-p2)/n2 ) = 0.001746

The test statistic Z is: (p1-p2)/SE(p1-p2)

Zobs (observed) = 0.0091/0.001746 = 5.193

p-value = Pr(Z>=Zobs) = Pr(Z>=5.2) = 0.000000103

In other words, “if the null hypothesis were true, the probability of observing an effect this large is about one in ten million – very strong evidence against H0!!” The difference is statistically significant (well beyond the conventional 5% level).
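The calculation above is easy to reproduce. Here is a minimal sketch using only the Python standard library; the one-sided tail probability Pr(Z >= z) is computed via the complementary error function:

```python
from math import sqrt, erfc

def one_sided_z_test(attacks1, n1, attacks2, n2):
    """Two-proportion z-test with an unpooled standard error,
    returning the z statistic and the one-sided p-value."""
    p1, p2 = attacks1 / n1, attacks2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p1 - p2) / se
    p_value = 0.5 * erfc(z / sqrt(2))  # Pr(Z >= z) for a standard normal
    return z, p_value

z, p = one_sided_z_test(239, 11_034, 139, 11_037)
print(f"Z = {z:.3f}, p-value = {p:.3g}")  # Z ≈ 5.19, p ≈ 1e-7
```

The tiny discrepancy against the hand calculation above comes from rounding p1, p2, and the SE to a few decimal places there; the code carries full precision.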

**Critiquing the Results**

The statistical significance test leads us to believe that the effect we saw is extremely significant – that is, very rare to see by chance. The first question we want to ask is: what is the maximum number of heart attacks in the aspirin group that would still give us statistical significance? The following graph gives us the answer.

So, if the results had looked as follows:

| | Attack | No Attack | n | Attack Rate |
|---|---|---|---|---|
| Placebo | 239 | 10,795 | 11,034 (n1) | 0.0217 (p1) |
| Aspirin | 204 | 10,833 | 11,037 (n2) | 0.0185 (p2) |

the difference would still have been statistically significant, and we would have rejected the null hypothesis. I can’t help scratching my head: 204 looks awfully close to 239 – what the heck is significant about it?
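That threshold can be found by a direct scan. The sketch below holds the placebo arm and the aspirin group size fixed and reuses the same unpooled-SE one-sided test as before:

```python
from math import sqrt, erfc

def one_sided_p(a1, n1, a2, n2):
    """One-sided p-value of a two-proportion z-test (unpooled SE)."""
    p1, p2 = a1 / n1, a2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return 0.5 * erfc((p1 - p2) / se / sqrt(2))

# Largest aspirin attack count (out of n2 = 11,037) that still comes
# out "significant" at the one-sided 5% level.
threshold = max(a for a in range(139, 239)
                if one_sided_p(239, 11_034, a, 11_037) < 0.05)
print(threshold)  # 204
```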

Next, we note that the two sample sizes are roughly equal. If we make them exactly equal and do a sensitivity analysis of the p-value to sample size, we find the following:

What this tells me is that starting from about 500 subjects per group, if I see the same spread of data (239 vs. 139 attacks) irrespective of sample size, the result will always be statistically significant. Now I’m really scratching my head. If I saw those same counts in a million subjects per group, I wouldn’t be able to infer much from their difference, and yet the method tells me to treat it as significant.
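This insensitivity to sample size is also easy to check directly. In the sketch below (same unpooled-SE test), the attack counts are fixed at 239 vs. 139 while the common group size n grows:

```python
from math import sqrt, erfc

def one_sided_p(a1, n1, a2, n2):
    """One-sided p-value of a two-proportion z-test (unpooled SE)."""
    p1, p2 = a1 / n1, a2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return 0.5 * erfc((p1 - p2) / se / sqrt(2))

# Same attack counts (239 vs 139), equal group sizes, growing n.
for n in (500, 5_000, 50_000, 1_000_000):
    print(n, one_sided_p(239, n, 139, n))
```

The p-value stays tiny at every n: once the event is rare relative to n, the z statistic is roughly 100 / sqrt(239 + 139) ≈ 5.1 regardless of sample size, because both the difference in rates and the standard error shrink in proportion.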

**Bayesian Insights**

To get more insight, we will jump to the Bayesian worldview, within which the frequentist worldview corresponds to a very specific assumption – “I know nothing about the universe, and I am absolutely sure of my ignorance.” I will pick a Beta distribution for its versatility in modeling my prior beliefs, and I will start with a uniform prior, Beta(1,1). This means I believe that the long-run fraction of heart attacks for each of the cases (placebo and aspirin) could be just about anything. Then, as I see the results, I update the Beta distribution by adding the number of heart attacks to its first parameter and the number of attack-free subjects to its second. This gives me a posterior distribution over the heart-attack rate given whether I take the placebo or the aspirin, as shown in the figure below.
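The conjugate update is simple enough to do by hand; a minimal sketch, with the posterior means landing almost exactly on the observed rates:

```python
# Beta(1, 1) uniform prior for each group; conjugate update:
#   posterior = Beta(alpha + attacks, beta + no_attacks)
placebo = (1 + 239, 1 + 10_795)   # Beta(240, 10796)
aspirin = (1 + 139, 1 + 10_898)   # Beta(140, 10899)

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

print(beta_mean(*placebo))  # ≈ 0.0217, matching p1
print(beta_mean(*aspirin))  # ≈ 0.0127, matching p2
```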

The two distributions look awfully close. Thankfully, we shall entertain no notions of statistical significance and report this as is. We have to be careful about our tendency to exaggerate. For instance, if I wanted to make the difference look big, I could change the horizontal scale, like so:

Now, being a Bayesian allows me to examine my own beliefs. I certainly don’t believe that, in the given sample, I could see just about any fraction of heart attacks. Rather, I’d start by expressing a prior somewhat like this: Beta(5,25) for the placebo group, and Beta(2.5,25) for the aspirin group.

The bigger spread represents more uncertainty. After expressing such a prior, I can start the experiment and update my beliefs. After the 5-year period, I’d find:

Compared to the distributions we got with the uniform prior, these are almost indistinguishable. This demonstrates a key insight: when you have lots of data, it does not really matter what your starting belief is.
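This prior-insensitivity is also easy to verify numerically. A sketch comparing posterior means under the uniform prior and the informative priors above:

```python
def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

def posterior(prior, attacks, no_attacks):
    """Conjugate Beta update: add attacks to alpha, non-attacks to beta."""
    a, b = prior
    return (a + attacks, b + no_attacks)

for name, informative, attacks, no_attacks in [
        ("placebo", (5.0, 25.0), 239, 10_795),
        ("aspirin", (2.5, 25.0), 139, 10_898)]:
    m_uniform = beta_mean(*posterior((1, 1), attacks, no_attacks))
    m_inform = beta_mean(*posterior(informative, attacks, no_attacks))
    print(name, m_uniform, m_inform)  # differ only around the 4th decimal
```

With roughly 11,000 observations per group, a prior worth about 30 pseudo-observations barely moves the posterior.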

By completely abandoning all talk of statistical significance, and seeing how close the results are to each other on this scale, we end up focusing on the most important questions, such as:

- Are doctors more likely to take good care of their health such that the effect of aspirin is minimal?
- How were the doctors in the sample taking aspirin? How did we control for its usage across doctors?
- What is the X-factor, knowing which we would consider the link between aspirin consumption and heart attacks to be irrelevant?