I authored an article in 2011 titled “A Critique of Statistical Hypothesis Testing (SHT) in Clinical Research.” The paper highlights philosophical and practical problems that make SHT less than scientific and more than problematic. I shared it recently with a scientist friend, who wrote back:

“Remember, statistical testing is even in vogue in particle physics, because no experiment is without experimental errors, and error estimation requires statistics. Decision Theory cannot base itself on Black and White Information. It is always a tinge of grey!!! Does this consideration make any difference?”

True (that SHT is in vogue in particle physics). That does not imply that frequentist statistics makes sense. Decision Analysis is largely based on Bayesian probability theory, so it is designed from the ground up to deal with uncertainty (or many shades of grey).

It makes our consideration of probability even more important. I might add that my teacher may be Prof. Ron Howard, but his teacher in Bayesian foundations was the great physicist E. T. Jaynes, who wrote (but did not finish in his lifetime) the seminal book, Probability Theory: The Logic of Science. His notes spread far and wide, and the book was finally completed by a student, G. Larry Bretthorst.

The problem is that most frequentist physicists nowadays do not actually understand probability theory. If you ask them what a probability means, they will likely say something like “the frequency to be discovered in nature.” The notion that two people observing the same event might have different probabilities does not exist in their world, and admitting it would bring down the entire edifice of frequentist science.

In the Bayesian worldview, the key question is, “What did you believe before you did the experiment?” You can only learn based on what you knew, and what you believed you could possibly learn. Error cannot be properly defined without the prior probability and the likelihood. To expand a little bit, I can believe that there is a 40% chance of rain tomorrow. Then, I can consider a rain detector, and characterize its sensitivity (chance of true positive) and specificity (chance of true negative). By doing so, I have a complete model for the detector’s accuracy — given the experiment’s result, I know how to change the probability of “Rain.” Can you explain how to update the probability of Rain based on experimental results using the frequentist method (p-tests, etc.)?
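To make the update concrete, here is a minimal sketch in Python. The 40% prior is the number from the text; the 90% sensitivity and 80% specificity are assumed purely for illustration.

```python
# Bayesian update of P(Rain) from one detector reading.
# The 0.40 prior is from the text; sensitivity 0.90 and
# specificity 0.80 are assumed numbers for illustration.

def update_rain_probability(prior, sensitivity, specificity, detector_says_rain):
    """Posterior P(Rain) after a single detector reading, via Bayes' rule."""
    if detector_says_rain:
        likelihood_rain = sensitivity          # P(+ | Rain)
        likelihood_no_rain = 1 - specificity   # P(+ | No Rain), a false positive
    else:
        likelihood_rain = 1 - sensitivity      # P(- | Rain), a false negative
        likelihood_no_rain = specificity       # P(- | No Rain)
    joint_rain = prior * likelihood_rain
    evidence = joint_rain + (1 - prior) * likelihood_no_rain
    return joint_rain / evidence

posterior = update_rain_probability(0.40, 0.90, 0.80, detector_says_rain=True)
print(round(posterior, 3))  # 0.75: a positive reading lifts P(Rain) from 0.40 to 0.75
```

Note that the sensitivity and specificity (equivalently, the false-positive and false-negative rates) alone are not enough; the prior is what the evidence acts upon.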

The frequentist physicist’s probability is not a probability — it is something weird that cannot be used to make decisions, handle uncertainty logically or support the updating of our beliefs. There is no true probability distribution in the universe. Probability is an expression of our belief/knowledge, existing purely in our heads, and owned by the individual. Practically, this approach is behind almost everything useful in our universe — like the spam filters processing this email. Engineers have long embraced the Bayesian model as it produces great results. Children learn in this way (see this TED talk) — constantly trying different hypotheses and updating their prior.

If, by statistics, people mean quantitative measurement, I have no issues with the use of quantitative analysis (computing means, variances, etc.). If, by statistics, people include the Bayesian approach of being logical about uncertainty, I am thrilled with statistics. But if, by statistics, people include the frequentist approach of claiming statistical significance, using confidence intervals, etc., then I have several philosophical and practical concerns, as outlined in my paper.

You wrote:

“The frequentist physicist’s probability is not a probability – it is something weird that cannot be used to make decisions, handle uncertainty logically or support the updating of our beliefs.”

Ronald Fisher developed his approach to aid Cambridge geneticists breeding plants etc. who wished to find out how far apart particular genes were on the same chromosome.

They studied the observed rates of crossing in order to map the genes they had identified in linear order down each chromosome. The chance of a ‘cross’ is proportional to the distance between the genes.

Nothing weird about the way they calculated their results, or the accuracy with which they were able to create genetic mappings.

Ronald Fisher developed his approach to aid Cambridge geneticists breeding plants etc. who wished to find out how far apart particular genes were on the same chromosome.

You cannot answer these questions unless you have a prior belief on how far apart they are. The frequentist position is actually a special case of the Bayesian one, in which you assume you know nothing about the question at hand. That is a very strong belief to hold, and it greatly changes your position based on what you see. The problem comes when it is not consistent with what you truly believe, and you end up with results that violate your common sense.

Just because Fisher came up with it does not make it sensible. You have to engage with the specific criticism I have put in my paper of the concepts that are prevalent in Fisher statistics.

Another paper that argues a similar point starts with the following:

The plain fact is that 70 years ago Ronald Fisher gave scientists a mathematical machine for turning baloney into breakthroughs, and flukes into funding. It is time to pull the plug.

You wrote: “Then, I can consider a rain detector, and characterize its sensitivity (chance of true positive) and specificity (chance of true negative). By doing so, I have a complete model for the detector’s accuracy – given the experiment’s result, I know how to change the probability of ‘Rain.'”

I thought that False positives and False negatives are necessary to generate a full theory of a detector.

Also I am under the impression that the ‘probability of rain’ is an estimate based on past experience. i.e. I know from the records over the last N = 100 years the number of 20th Octobers, n, that it rained, which I call the percentage chance, p = n, that it will rain on this 20th October (an algorithm which can be appropriately complexified to allow for the fact that the monsoon was late this year, what the jetstream is doing, El Nino etc.). My observation on Monday updates n to n’ = n + 1 or n, depending on whether it did or did not rain, and calculates p’ = 100n’/101.

Once an experiment has been made, does it not get added into the ‘sum total of past experience’ in this way?

I thought that False positives and False negatives are necessary to generate a full theory of a detector.

Yes, and I still have not understood how you can use false positives and false negatives without combining them with a prior probability in order to build a learning model.

Also I am under the impression that the ‘probability of rain’ is an estimate based on past experience. i.e. I know from the records over the last N = 100 years the number of 20th Octobers, n, that it rained, which I call the percentage chance, p = n, that it will rain on this 20th October (an algorithm which can be appropriately complexified to allow for the fact that the monsoon was late this year, what the jetstream is doing, El Nino etc.). My observation on Monday updates n to n’ = n + 1 or n, depending on whether it did or did not rain, and calculates p’ = 100n’/101.

Past experience is only one of the factors you may consider. Your own common sense can sometimes prevail over past experience. For instance, drawing on Nassim Taleb’s famous example, the statistical turkey believes, for 1,000 days, that it is more and more likely to get its next meal, based on ever-stronger past data. The 1,001st day is Thanksgiving :).

Your decision to rely exclusively on past data is your prerogative. But it should be acknowledged as such, for no one forced you to do so. The Bayesian position simply asks you to acknowledge that there is no objective probability. A probability always has an originator – you. So, it behooves us to sign off on our probability statements by saying, “My probability for rain is…”
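For contrast, the counting rule above (p’ = 100n’/101) is essentially what a Bayesian obtains under one particular flat prior: Laplace’s rule of succession. A minimal sketch in Python, where the Beta(a, b) parameters make the prior explicit; the rainy-day counts below are illustrative, not from any record.

```python
# A Beta(a, b) prior on the chance of rain makes the prior belief explicit.
# With a = b = 1 (a flat prior), the posterior mean reduces to Laplace's
# rule of succession, (n + 1) / (N + 2) -- close to the counting rule
# p' = 100 n' / 101 for large N.

def posterior_mean(rainy, total, a=1.0, b=1.0):
    """Posterior mean of P(rain) under a Beta(a, b) prior,
    after observing `rainy` rainy days out of `total`."""
    return (a + rainy) / (a + b + total)

# 41 rainy 20th Octobers out of 101 observed years (illustrative numbers):
print(round(posterior_mean(41, 101), 3))            # 0.408 with a flat prior
print(round(posterior_mean(41, 101, a=8, b=2), 3))  # 0.441 with a rain-leaning prior
```

Relying only on the counts is therefore not prior-free; it is a choice of one specific prior, which you are free to make, but should own.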

Once an experiment has been made, does it not get added into the ‘sum total of past experience’ in this way?

You can choose to have such an updating model if that is what you believe. This is in total contrast to Google News, whose algorithm assigns a much higher probability to people being interested in what has happened recently. The sum total of past experience only makes sense under a very strong assumption – that every observation is irrelevant to every other observation (which some people call “independence”). In the real world, that is almost never true.
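A toy sketch of the difference: equal weighting treats every past observation as equally relevant, while a recency-weighted estimate, in the spirit of the Google News example, lets recent observations dominate. The decay factor here is an arbitrary assumption.

```python
# Equal weighting embodies the "sum total of past experience";
# geometric decay (factor 0.9, an arbitrary choice) weights recent
# observations more heavily, as recency-favoring algorithms do.

def equal_weight_estimate(observations):
    return sum(observations) / len(observations)

def recency_weighted_estimate(observations, decay=0.9):
    # The most recent observation gets weight 1; each step back decays by `decay`.
    n = len(observations)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    total = sum(w * x for w, x in zip(weights, observations))
    return total / sum(weights)

obs = [0] * 8 + [1, 1]  # eight dry days, then two recent rainy days
print(equal_weight_estimate(obs))                # 0.2
print(round(recency_weighted_estimate(obs), 3))  # 0.292
```

Neither weighting is objectively correct; each encodes a different belief about which past observations are relevant to tomorrow.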

You wrote:

“if, by statistics, you include the frequentist approach of claiming statistical significance, using confidence intervals, etc., then I have several philosophical and practical concerns”

I have had to ‘rap the knuckles’ of one of my friends (jocularly and kindly, of course) when she stated:

‘if p < 0.05, it is true / correct, and if p > 0.05, then it is false / wrong’

But I do not think you could actually be referring to anything so naive.

When I am faced by an experimental result, what weight should I put on it?

I am not infrequently faced with a table of results from an experiment where p < 0.05 / 0.01 etc. in a couple of lines, but 0.06 or 0.07 in several others. I often apply a sign test to a Table of results and show that the overall result is highly significant, even if none of the individual results were.

I suspect that you are doing something similar. I personally regard the p = 0.05 cut-off as meaningless. It is arbitrary, chosen only as a rule of thumb for scientists’ convenience. AND I emphasize precisely that in my lectures on experimental methods.

If I take every experimental result into account no matter what the p value, but with a weighting factor depending on its p value, then I get something much more informative (in my opinion, at any rate).

How does that relate to what you believe in?

But I do not think you could actually be referring to anything so naive.

Nope, I am not. I don’t use p-statistics at all, and I am deeply suspicious of papers that report them. If they simply didn’t add any value, I’d be less harsh, but the problem is they mislead us into reading more into the results than we should. Why not simply report the results without problematic crutches?

When I am faced by an experimental result, what weight should I put on it?

It depends on what you believed before you did the experiment. If you had a strong belief that you would not see the result you are seeing, then you’d want to see multiple instances of the result before you change your mind. But if your belief was not so strong to begin with, you may not need as many results to swing your mind. This is an important desideratum of a mathematical reasoning system that, unfortunately, the frequentist model does not meet.
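A rough sketch of that asymmetry, with assumed likelihoods: a skeptic (low prior) needs several replications of a positive result to reach roughly the confidence an open-minded observer (moderate prior) reaches after one.

```python
# Repeated Bayesian updating: how many positive results it takes to
# change a mind depends on the prior. The likelihoods (0.8 if the
# effect is real, 0.1 if not) are assumed purely for illustration.

def posterior_after_results(prior, n_results, p_if_real=0.8, p_if_not=0.1):
    """P(effect is real) after n independent positive results."""
    p = prior
    for _ in range(n_results):
        joint = p * p_if_real
        p = joint / (joint + (1 - p) * p_if_not)
    return p

skeptic = posterior_after_results(prior=0.01, n_results=3)
open_minded = posterior_after_results(prior=0.30, n_results=1)
print(round(skeptic, 3), round(open_minded, 3))  # ~0.838 and ~0.774
```

The same evidence carries a different weight for each observer; that is the desired behavior, not a defect.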

I am not infrequently faced with a table of results from an experiment where p < 0.05 / 0.01 etc. in a couple of lines, but 0.06 or 0.07 in several others. I often apply a sign test to a Table of results and show that the overall result is highly significant, even if none of the individual results were.

Significance is a term that should not be used. My professor jokingly says one should use mouthwash after using that term. 🙂 The problem is you don’t really mean significance. You mean, “the chance of seeing these results happen randomly is less than 5%/1%.”

Let’s parse that sentence again, removing the sleight of word in the term “randomness,” replacing it with its synonym, “chance.”

“The chance of seeing these results happen by chance is less than 5%/1%.”

This is circular logic and, to put it mildly, a meaningless statement. Without addressing this fundamental contradiction, there is no point going any further.

How should we be applying “weights” to our experimental results, and can we see an example illustrating the problems with significance? Take a look at this post.