The Lab Leak Theory

A Bayesian Approach

David Friedman

Mar 11, 2023

There are a few facts about the origin of Covid that almost everyone agrees on:

It first appeared in Wuhan.

It is derived from a bat virus.

The Wuhan Institute of Virology (WIV) was doing research on bat viruses.

Wuhan also contained a wet market where live animals were sold.

There are other claims that people disagree on:

The earliest cases were almost all associated with the wet market.

The details of the virus imply alteration by gain of function manipulations.

The details of the virus imply zoonotic origin.

Probably more that I haven’t followed.

To base an opinion on the origin of Covid on the second set of claims I would have to know quite a lot about virology and about the detailed evidence on what happened in China: how many people had, or may have had, Covid, how early, and where. I do not know those things. Neither, I suspect, does anyone in Congress or any of the reporters writing news stories on the topic. For our opinions on those claims, all of us depend on second-hand information, arguments, and conclusions, either provided directly by the small number of people who have the relevant expertise or transmitted from those experts to us via the media.

Evaluating second-hand information is hard, especially second-hand information from people who have a conclusion they want you to reach. I learned that lesson more than sixty years ago when I was a high school senior visiting colleges.

Yale was having a presentation on the subject of the House Un-American Activities Committee. It started with “Operation Abolition,” a movie made by the committee that convincingly argued, with lots of visual evidence, that the campaign to abolish the committee was a communist plot run by known communists. That was followed by “Operation Correction,” a movie made by the Northern California ACLU convincingly demonstrating multiple falsehoods in the first movie. That was followed by written material rebutting the second movie, and then by written material rebutting that. It was a convincing demonstration of how easy it is to be persuaded by a selective presentation of arguments and evidence.

One solution to that problem is to look at both sides of the argument and decide for yourself which to believe. Doing that would be difficult for the controversy over the evidence of who had or did what when, most of which ultimately came from the Chinese government, which was in a position to filter what evidence got out.[1] It would be still more difficult to evaluate the genetic arguments unless I was willing to put much more time and effort into educating myself in virology than I am.

The other solution is to find experts I can trust or reporters I can trust to evaluate the credentials of experts. Unfortunately, most of the experts have good reasons to claim that the source was not a lab leak and so cannot be trusted. Back in March of 2020, the Lancet published a “Statement in support of the scientists, public health professionals, and medical professionals of China combatting COVID-19” which rejected the lab leak theory in strong terms:

We stand together to strongly condemn conspiracy theories suggesting that COVID-19 does not have a natural origin. Scientists from multiple countries have published and analysed genomes of the causative agent, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), and they overwhelmingly conclude that this coronavirus originated in wildlife, as have so many other emerging pathogens. This is further supported by a letter from the presidents of the US National Academies of Science, Engineering, and Medicine and by the scientific communities they represent. Conspiracy theories do nothing but create fear, rumours, and prejudice that jeopardise our global collaboration in the fight against this virus.
…
We declare no competing interests.

There was no mention of the fact that Peter Daszak, one of the authors, was the head of a nonprofit that had funded bat virus research at the Wuhan Institute with money from the National Institutes of Health.

Fauci and the people working with him do not wish to believe, or have others believe, that the research they were funding was responsible for a pandemic that killed millions of people. Other virologists don’t want to offend the people who hand out government research grants, don’t want to believe that the sort of research they do was responsible for the pandemic, don’t want other people to believe it. There are surely some experts who either do not have an incentive to misrepresent the facts or are honest enough to tell the truth even if doing so is against their interest, but how am I to recognize them? Faced with lots of argument by credentialed authorities claiming to show evidence for or against zoonotic origin, how can I tell which I should believe?

The problem is made harder by the fact that the controversy over the origin of Covid has become a political issue. For quite a lot of people, including most of the reporters one might rely on to make sense of the controversy among experts, taking the wrong side on the issue of Covid origin would feel like treason against their side of the political spectrum. Not only does this mean that lots of people have an incentive to claim the balance of the evidence is against a lab leak even if it isn’t true, it also means that lots of other people have an incentive to claim the evidence is for a lab leak even if that isn’t true.

If you cannot trust the experts, cannot trust the media, and cannot evaluate the arguments over disputed facts for yourself, the best you can do is to use the undisputed facts. That is what I have tried to do. I do not know much about virology or about who in China can be trusted to tell the truth, but I do know the basic principles of statistics.

Classical vs Bayesian Statistics

The confidence measures produced by classical statistics are often interpreted as telling us how likely a hypothesis is to be true, but they don’t. The fact that the evidence confirms a hypothesis at a significance level of .05 does not mean that there is only one chance in twenty that the hypothesis is false; it means that if the hypothesis is false in the particular way defined by the null hypothesis, there is only one chance in twenty that the evidence in favor of the hypothesis would be at least as good as it is.

To see the difference, imagine I pull a coin out of my pocket and, without examining it, flip it three times. My hypothesis is that it is a double headed coin. It comes up heads all three times. If it is a fair coin, my null hypothesis, there is only one chance in eight of that result. It does not follow that the odds are now eight to one that the coin is double headed.

To estimate the probability that your hypothesis is true you use Bayesian statistics. Start with a prior probability, your initial estimate of how likely it is. Use Bayes’ Theorem plus your evidence to convert the prior probability into a posterior probability. For example …

Suppose you know that one coin in a million is double headed.[2] Before you do the experiment, the probability that the coin you took out of your pocket is a fair coin is .999999, the probability that it is double headed is .000001. You flipped it three times, got three heads. The joint probability that you would draw a fair coin and it would come up heads three times is .999999/8. The joint probability that you would draw a double headed coin and it would come up heads three times is .000001, since coming up heads is a sure thing. You know that one of those things happened (I am assuming that all coins are either fair or double headed), so after doing the experiment the two probabilities, now posterior, sum to one. You also know their ratio, which is the ratio of the two joint probabilities. That gives you two equations in two unknowns. Solving them gives you a probability for the double headed hypothesis of .000008, for the fair coin hypothesis of .999992.
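The coin calculation can be checked with a few lines of Python. This is only a sketch of the arithmetic described above; the function name and its parameters are mine, chosen for illustration, and it assumes, as the text does, that every coin is either fair or double headed.

```python
from fractions import Fraction


def posterior_double_headed(prior_double, heads_in_a_row):
    """Posterior probability that the coin is double headed after seeing only heads.

    Assumes every coin is either fair or double headed.
    """
    prior_fair = 1 - prior_double
    # Joint probability: draw a fair coin AND see this many heads in a row.
    joint_fair = prior_fair * Fraction(1, 2) ** heads_in_a_row
    # Joint probability: draw a double headed coin AND see only heads (a sure thing).
    joint_double = prior_double
    # Rescale so that the two posterior probabilities sum to one.
    return joint_double / (joint_fair + joint_double)


prior = Fraction(1, 1_000_000)
post = posterior_double_headed(prior, 3)
print(f"P(double headed | 3 heads) = {float(post):.6f}")
print(f"P(fair | 3 heads)          = {float(1 - post):.6f}")
```

With the same one-in-a-million prior, it takes about twenty heads in a row before the double headed hypothesis becomes more likely than not, since two to the twentieth power is a little over a million.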

The same approach can use the agreed-on facts to estimate the probability that Covid originated as a lab leak. Assume the source was either a lab leak or animal-to-human transmission in a wet market that sold live animals. We then need three numbers: the prior probability of a lab leak, the probability that Covid would have first appeared in Wuhan if the origin was a lab leak, and the probability that it would have first appeared in Wuhan if the origin was zoonotic.

I have no basis for the prior probability so will initially set it to .5. When I finish it should be obvious how, given your prior probability and your estimates of the facts the conditional probabilities depend on, you can calculate the posterior probability implied by the agreed-on facts.

The Conditional Probability of Wuhan if it was a Lab Leak

I see three relevant possibilities: a lab leak from a lab in Wuhan, probably the WIV, that first appeared in Wuhan; a lab leak from a lab in Wuhan that first appeared outside Wuhan; and a lab leak from a lab outside of Wuhan. The early cases seem to have been almost all in Wuhan so, although it is possible that a lab leak from a lab in Wuhan would have led to initial cases somewhere else, perhaps carried by a lab worker leaving the WIV for a job elsewhere, it does not seem likely. I will therefore ignore the second possibility.

How many other labs were there, in China or elsewhere, that were working with bat viruses in ways that might plausibly have produced and leaked the Covid virus? I initially thought that the fact that the WIV was a BSL-4 level lab meant that such research was unlikely to be done in any less secure facility. There is only one other BSL-4 facility in China and it was not working on bat viruses, but there are a total of 42 in the world at present, so all I needed to do was to find out how many of them were working with bat viruses.

I then came across an article giving a detailed history of the bat virus research. It reported that the work done at the WIV was based on earlier research on bat viruses done by Ralph Baric at the University of North Carolina in a BSL-3+ lab. When the work was continued and extended by Zhengli Shi at the WIV she was working, at least as late as 2016, in a BSL-2 lab. According to an email from Shi, “Since bat viruses like WIV1 haven’t been confirmed to cause disease in human beings, her biosafety committee recommended BSL-2 for engineering them and testing them and BSL-3 for any animal experiments.”

How many labs are there, in China or elsewhere, that are potential candidates as the source of Covid? Baric’s lab at UNC qualifies and the WIV does. Online respondents have offered figures ranging from 2 or 3 to “many … doing research on coronaviruses.” It isn’t clear if any lab other than Baric’s or Shi’s was doing research that modified bat viruses in ways that could have produced Covid. I will assume for the moment that if I could only find references to two there are probably no more than ten, and will report my conclusion both on the assumption of ten and as a formula with the number of candidate labs as one of the variables.

The Conditional Probability of Wuhan if it Originated in a Wet Market

As of 2018, wet markets remain the most prevalent food outlet in urban regions of China despite the rise of supermarket chains since the 1990s. (Wikipedia article)

The alternative hypothesis is that it originated in a wet market where wild animals were sold. The conditional probability that, if such a market was the source, Covid would have first appeared in Wuhan is, absent much more detailed information about conditions in different wet markets than I have, the ratio of the number of wet markets in Wuhan that sold wild animals (1) to the number in the world.

Wet markets are common in China and elsewhere; I have visited ones in China, Russia, Spain, Italy and Baltimore. In China they are no longer supposed to sell wild animals but sometimes, as at Wuhan, do. One person responding to my query online reported that in Papua New Guinea, where he was, there were probably hundreds of markets selling live animals. Other online references reported live animal markets in Indonesia, the Philippines, India, Thailand, and Vietnam. A conservative estimate would be a thousand worldwide, and I will use that.

The Posterior Probability of a Lab Leak

PPL: Prior probability of a lab leak

PPW = 1-PPL: Prior probability of wet market origin

p(Wuhan: LL): Probability of first appearance in Wuhan if source was a lab leak

p(Wuhan: WM): Probability of first appearance in Wuhan if source was a wet market

Probability that origin was a lab leak and Covid appears first in Wuhan: PPL[p(Wuhan: LL)]

Probability that origin was a wet market and Covid appears first in Wuhan: PPW[p(Wuhan: WM)]

The two probabilities above are both prior probabilities. We know that Covid appeared first in Wuhan and are assuming the origin was either a lab leak or the wet market, so the sum of the corresponding posterior probabilities is one. Scaling the priors up to make that the case gives us the corresponding posterior probabilities:

Probability that origin was lab leak:

PPL[p(Wuhan: LL)]/{PPL[p(Wuhan: LL)] + PPW[p(Wuhan: WM)]}
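The same formula can be restated in odds form, which makes it easier to see how each assumption moves the answer. This is a sketch under the assumptions already made in the text: each candidate lab and each candidate market is treated as equally likely to be the source, so that p(Wuhan: LL) is one over the number of candidate labs and p(Wuhan: WM) is one over the number of candidate markets. The symbols N_L and N_M for those two counts are introduced here for illustration and do not appear in the original.

```latex
\[
P(\text{LL} \mid \text{Wuhan})
  = \frac{PPL \cdot p(\text{Wuhan} : \text{LL})}
         {PPL \cdot p(\text{Wuhan} : \text{LL}) + PPW \cdot p(\text{Wuhan} : \text{WM})}
\]
\[
\frac{P(\text{LL} \mid \text{Wuhan})}{P(\text{WM} \mid \text{Wuhan})}
  = \frac{PPL}{PPW} \cdot \frac{p(\text{Wuhan} : \text{LL})}{p(\text{Wuhan} : \text{WM})}
  = \frac{PPL}{PPW} \cdot \frac{N_M}{N_L}
\]
```

Read this way, the posterior odds of a lab leak are the prior odds multiplied by the ratio of candidate markets to candidate labs.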

My tentative numbers, based on the above arguments, are:

PPL=PPW=.5

p(Wuhan: LL) = .1 (ten possible lab sources)

p(Wuhan: WM) = .001 (a thousand markets selling live animals)

Probability that origin was a lab leak:

(.5)(.1)/{(.5)(.1)+(.5)(.001)} = .05/.0505 = .99

So if my tentative numbers are correct, it was almost certainly a lab leak.

How sensitive is the result to the assumptions? If the prior is only a 10% chance of a lab leak, we have:

Probability that origin was lab leak:

(.1)(.1)/{(.1)(.1)+(.9)(.001)}=.01/.0109 = .92

Still probably a lab leak, but a significant chance of zoonotic origin.

If the prior is a 10% chance of a lab leak and there are a hundred labs it might have leaked from, we have:

Probability that origin was lab leak:

(.1)(.01)/{(.1)(.01)+(.9)(.001)}=.001/.0019 = .53

Under those assumptions it’s a coin flip.
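The three scenarios above can be reproduced with a short Python sketch. The function name and parameters are mine, not part of the original argument, and the code assumes, as the text does, that every candidate lab and every candidate market is equally likely to be the source.

```python
def posterior_lab_leak(prior_lab_leak, n_labs, n_markets):
    """Posterior probability of a lab leak, given that Covid first appeared in Wuhan.

    Assumes the origin was either a lab leak or a wet market, and that each
    candidate lab (or market) is equally likely to have been the source.
    """
    p_wuhan_given_ll = 1 / n_labs      # p(Wuhan: LL)
    p_wuhan_given_wm = 1 / n_markets   # p(Wuhan: WM)
    joint_ll = prior_lab_leak * p_wuhan_given_ll
    joint_wm = (1 - prior_lab_leak) * p_wuhan_given_wm
    # Rescale the two joint probabilities so that they sum to one.
    return joint_ll / (joint_ll + joint_wm)


scenarios = [
    ("prior .5, 10 labs", 0.5, 10),
    ("prior .1, 10 labs", 0.1, 10),
    ("prior .1, 100 labs", 0.1, 100),
]
for label, prior, n_labs in scenarios:
    post = posterior_lab_leak(prior, n_labs, n_markets=1000)
    print(f"{label}: {post:.2f}")
```

Run as written, this prints .99, .92 and .53 for the three cases, matching the figures worked out by hand.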

Note, however, that the more labs there are, especially BSL-2 and BSL-3 labs doing work with bat viruses that could have produced Covid, the higher your prior probability of a lab leak happening should be.

You can decide for yourself what you think realistic numbers are, plug them into the formula, and calculate your own estimate of the probability that the origin was a lab leak.
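One way to explore your own numbers is to ask, for a given prior, how many candidate labs there would have to be before the posterior drops to even odds. Setting the two joint probabilities equal gives a break-even number of labs equal to the prior odds of a lab leak times the number of candidate markets. The sketch below is illustrative only; the function name is mine, and it keeps the assumption that all candidate labs and markets are equally likely sources.

```python
def break_even_labs(prior_lab_leak, n_markets):
    """Number of candidate labs at which the posterior for a lab leak is exactly .5.

    Derived from PPL / n_labs = PPW / n_markets, with PPW = 1 - PPL.
    """
    prior_odds = prior_lab_leak / (1 - prior_lab_leak)
    return prior_odds * n_markets


for prior in (0.5, 0.1, 0.01):
    labs = break_even_labs(prior, 1000)
    print(f"prior {prior}: even odds at about {labs:.0f} candidate labs")
```

With a prior of .1 and a thousand markets the break-even point is about 111 labs, which is why the hundred-lab case above comes out close to a coin flip.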

[1] U.S. intelligence claims that three researchers from the WIV got sick enough to require hospital care in November of 2019. Some U.S. officials say the sick researchers were involved in coronavirus research. The information has not been confirmed by official Chinese sources. Should I believe it? U.S. intelligence sources may have their own biases.

[2] According to what I could learn from a quick web search, there are no genuine double headed U.S. coins, since there is no way the mint could produce them. There are, however, double headed coins that have been produced by machining down two coins to half thickness and fusing the halves together. The U.S. produces about two billion quarters a year, so if I assume that enterprising con men produce two thousand double headed quarters a year and eventually release them into the general pool, that implies a prior of one in a million. Probably too high.

DinoNerd

In cases like this, I tend to simply say "I don't know", and leave it at that.

I think that's better than coming up with any kind of number, which is then all too likely to take on a life of its own, and persist in spite of any additional relevant data that may appear.

Fortunately, I don't have any decisions to make that would benefit from being partly based on this answer.

Nuño Sempere

Mar 12, 2023

See also: <https://www.rootclaim.com/analysis/What-is-the-source-of-COVID-19-SARS-CoV-2> for a similar analysis
