Today’s post, in a way, is at arXiv: The Crisis Of Evidence: Why Probability And Statistics Cannot Discover Cause. Here’s the abstract (the official one has two typos, meaning my enemies are gaining in power and scope!):
The branch of mathematics concerning numerical descriptions … More models are only useful at explaining the uncertainty of what we do not know, and should never be used to say what we already know. Probability and statistical models are useless at discerning cause. Classical statistical procedures, in both their frequentist and Bayesian implementations, falsely imply they can speak about cause. No hypothesis test, or Bayes factor, should ever be used again. Even assuming we know the cause or partial cause for some set of observations, reporting via relative risk exaggerates the certainty we have in the future, often by a lot. This over-certainty is made much worse when parametric and not predictive methods are used. Unfortunately, predictive methods are rarely used; and even when they are, cause must still be an assumption, meaning (again) certainty in our scientific pronouncements is too high.
I use PM2.5 (particulate matter 2.5 microns or smaller, i.e. dust) as a running example, since it is one of the EPA’s favorite things to regulate. I’ll be giving a version of this paper at this weekend’s Doctors for Disaster Preparedness conference in LA. I’ll concentrate more on the PM2.5 angle there, naturally, but I will and must hit the primary focus, which is that probability cannot discover cause.
“But, Briggs, isn’t all of statistics designed around discovering what causes what? Isn’t that what hypothesis tests and Bayes factors do?”
This is true: this is what people think statistics can do. And they are wrong. We bring knowledge of cause to data, we don’t get cause from data. Not directly. Understanding cause is something that is above or beyond any set of data. To understand that, you’ll have to read the paper, a mere 21 pages. If you expect that you have understood my argument by only considering what is in this post, you will be wrong.
An out-of-context-ish quotation (in the low or no PM2.5 group of 1,000 people, 5 got cancer of the albondigas; in the “some” or high PM2.5 group of 1,000 people, 15 did):
There is no indication in the data that high levels of PM2.5 cause cancer of the albondigas. If high levels did cause cancer, then why didn’t every one of the 1,000 folks in the high group develop it? Think about that question. If high PM2.5 really is a cause (and recall we’re supposing every individual in the high group had the same exposure), then it should have made each person sick. Unless it was prevented from doing so by some other thing or things. And that is the most we can believe. High PM2.5 cannot be a complete cause: it may be necessary, but it cannot be sufficient. And it needn’t be a cause at all. **The data we have is perfectly consistent with some other thing or things, unmeasured by us, causing every case of cancer. And this is so even if all 1,000 individuals in the high group had cancer.**
**This is true for every hypothesis test; that is, every set of data. The proposed mechanism is either always an efficient cause, though it sometimes may be blocked or missing some “key” (other secondary causes or catalysts), or it is never a cause. There is no in-between.**
Always-or-never a cause is tautological, meaning there is no information added to the problem by saying the proposed mechanism might be a cause.
From that we deduce that a proposed cause, absent knowledge of essence (to be described in a moment), said or believed to be a cause based on some function of the data, is always a prejudice, conceit, or guess. **Because our knowledge is only that the proposed cause might be always (albeit possibly sometimes blocked) or never an efficient cause, and this is tautological, we cannot find a probability the proposed cause is a cause.**
**Consider also that the cause of the cancer could not have been high PM2.5 in the low group, because, of course, the 5 people there who developed cancer were not exposed to high PM2.5 as a possible cause. Therefore, their cause or causes must have been different if high PM2.5 is a cause. But since we don’t know if high PM2.5 is a cause, we cannot know whether whatever caused the cancers in the low group didn’t also cause the cancers in the high group. Recall that there may have been as many as 20 different causes. Once again we have concluded that nothing in the plain observations is of any help in deciding what is or isn’t a cause.**
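For readers who like to see the arithmetic, here is a minimal sketch of the point in the quotation. It uses only the counts from the example (5 of 1,000 and 15 of 1,000). Everything else is an assumption made for illustration: the `hidden_factor` field, the `build_group` helper, and the rule that cancer occurs if and only if the hidden factor is present are hypothetical and are not taken from the paper. The sketch builds a world in which PM2.5 causes nothing at all, and the usual relative-risk summary comes out exactly as it would from the observed counts.

```python
from fractions import Fraction


def relative_risk(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Ratio of the observed case fraction in the exposed group to the unexposed group."""
    return Fraction(cases_exposed, n_exposed) / Fraction(cases_unexposed, n_unexposed)


def build_group(n, pm25_high, n_with_hidden_factor):
    """Hypothetical world: cancer occurs if and only if an unmeasured factor is present.

    PM2.5 exposure is recorded for each person but plays no causal role here.
    """
    people = []
    for i in range(n):
        hidden = i < n_with_hidden_factor
        people.append({"pm25_high": pm25_high, "hidden_factor": hidden, "cancer": hidden})
    return people


# Counts from the example in the post: 5 of 1,000 (low) and 15 of 1,000 (high).
low = build_group(1000, pm25_high=False, n_with_hidden_factor=5)
high = build_group(1000, pm25_high=True, n_with_hidden_factor=15)

cases_low = sum(p["cancer"] for p in low)
cases_high = sum(p["cancer"] for p in high)

rr = relative_risk(cases_high, len(high), cases_low, len(low))
risk_difference = Fraction(cases_high, len(high)) - Fraction(cases_low, len(low))

print("relative risk:", rr)
print("risk difference:", float(risk_difference))
print("total cases, each possibly with its own cause:", cases_low + cases_high)
```

The relative risk is 3 and the risk difference is 0.01, the same summaries anyone would compute from the raw table, yet by construction high PM2.5 caused none of the 20 cases. The table alone cannot distinguish this world from one in which PM2.5 is a cause that is usually blocked.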
Let’s start with the truth!