An Intuitive Explanation of Bayes' Theorem (excerpt)

Author: Eliezer Yudkowsky. Link to original: http://yudkowsky.net/rational/bayes (English).

So why is it that some people are so [i]excited[/i] about Bayes' Theorem?

"Do you believe that a nuclear war will occur in the next 20 years? If no, why not?" Since I wanted to use some common answers to this question to make a point about rationality, I went ahead and asked the above question in an IRC channel, #philosophy on EFNet.

One EFNetter who answered replied "No" to the above question, but added that he believed biological warfare would wipe out "99.4%" of humanity within the next ten years. I then asked whether he believed 100% was a possibility. "No," he said. "Why not?", I asked. "Because I'm an optimist," he said. (Roanoke of #philosophy on EFNet wishes to be credited with this statement, even having been warned that it will not be cast in a complimentary light. Good for him!) Another person who answered the above question said that he didn't expect a nuclear war for 100 years, because "All of the players involved in decisions regarding nuclear war are not interested right now." "But why extend that out for 100 years?", I asked. "Pure hope," was his reply.

What is it [i]exactly[/i] that makes these thoughts "irrational" - a poor way of arriving at truth? There are a number of intuitive replies that can be given to this; for example: "It is not rational to believe things only because they are comforting." Of course it is equally irrational to believe things only because they are [i]discomforting;[/i] the second error is less common, but equally irrational. Other intuitive arguments include the idea that "Whether or not you happen to be an optimist has nothing to do with whether biological warfare wipes out the human species", or "Pure hope is not evidence about nuclear war because it is not an observation about nuclear war."

There is also a mathematical reply that is precise, exact, and contains all the intuitions as special cases. This mathematical reply is known as Bayes' Theorem.

For example, the reply "Whether or not you happen to be an optimist has nothing to do with whether biological warfare wipes out the human species" can be translated into the statement:

p(you are currently an optimist | biological war occurs within ten years and wipes out humanity) =
p(you are currently an optimist | biological war occurs within ten years and does not wipe out humanity)

Since the two probabilities for [mono]p(X|A)[/mono] and [mono]p(X|~A)[/mono] are equal, Bayes' Theorem says that [mono]p(A|X) = p(A)[/mono]; as we have earlier seen, when the two conditional probabilities are equal, the revised probability equals the prior probability. If X and A are unconnected - statistically independent - then finding that X is true cannot be evidence that A is true; observing X does not update our probability for A; saying "X" is not an argument for A.
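
To make the point concrete, here is a minimal sketch in Python; the numbers are invented purely for illustration, since the text does not supply any.

[mono]
# When p(X|A) = p(X|~A), Bayes' Theorem returns the prior unchanged:
# observing X is no evidence about A.

def posterior(prior, p_x_given_a, p_x_given_not_a):
    """p(A|X) by Bayes' Theorem."""
    p_x = p_x_given_a * prior + p_x_given_not_a * (1 - prior)
    return p_x_given_a * prior / p_x

p_wipeout = 0.30   # p(A): some prior for "biological war wipes out humanity"
p_optimist = 0.60  # p(X|A) = p(X|~A): being an optimist is independent of A

print(round(posterior(p_wipeout, p_optimist, p_optimist), 3))  # 0.3 - exactly the prior
[/mono]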

But suppose you are arguing with someone who is verbally clever and who says something like, "Ah, but since I'm an optimist, I'll have renewed hope for tomorrow, work a little harder at my dead-end job, pump up the global economy a little, eventually, through the trickle-down effect, sending a few dollars into the pocket of the researcher who ultimately finds a way to stop biological warfare - so you see, the two events are related after all, and I can use one as valid evidence about the other." In one sense, this is correct - [i]any[/i] correlation, no matter how weak, is fair prey for Bayes' Theorem; [i]but[/i] Bayes' Theorem distinguishes between weak and strong evidence. That is, Bayes' Theorem not only tells us what is and isn't evidence, it also describes the [i]strength[/i] of evidence. Bayes' Theorem not only tells us [i]when[/i] to revise our probabilities, but [i]how much[/i] to revise our probabilities. A correlation between hope and biological warfare may exist, but it's a lot weaker than the speaker wants it to be; he is revising his probabilities much too far.
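
The odds form of Bayes' Theorem makes this easy to see: posterior odds equal prior odds times the likelihood ratio [mono]p(X|A)/p(X|~A)[/mono]. A sketch with invented numbers, contrasting a likelihood ratio barely above 1 (the trickle-down-optimism story at its most charitable) with a genuinely strong piece of evidence:

[mono]
def update_odds(prior_prob, likelihood_ratio):
    """Convert to odds, multiply by the likelihood ratio, convert back."""
    prior_odds = prior_prob / (1 - prior_prob)
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

prior = 0.30
print(round(update_odds(prior, 1.0001), 5))  # 0.30002 - negligible revision
print(round(update_odds(prior, 20.0), 3))    # 0.896   - strong evidence, large revision
[/mono]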

Let's say you're a woman who's just undergone a mammography. Previously, you figured that you had a very small chance of having breast cancer; we'll suppose that you read the statistics somewhere and so you know the chance is 1%. When the positive mammography comes in, your estimated chance should now shift to 7.8%. There is no room to say something like, "Oh, well, a positive mammography isn't definite evidence, some healthy women get positive mammographies too. I don't want to despair too early, and I'm not going to revise my probability until more evidence comes in. Why? Because I'm an optimist." And there is similarly no room for saying, "Well, a positive mammography may not be definite evidence, but I'm going to assume the worst until I find otherwise. Why? Because I'm a pessimist." Your revised probability should go to 7.8%, no more, no less.
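
Here is the arithmetic behind the 7.8% figure. The 1% prior is given above; the 80% true-positive rate and 9.6% false-positive rate are the figures used earlier in the full essay, restated here as assumptions:

[mono]
p_cancer = 0.01              # p(A): prior frequency of breast cancer (stated above)
p_pos_given_cancer = 0.80    # p(X|A): positive mammography given cancer (assumed)
p_pos_given_healthy = 0.096  # p(X|~A): positive mammography given no cancer (assumed)

p_pos = p_pos_given_cancer * p_cancer + p_pos_given_healthy * (1 - p_cancer)
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos

print(round(p_cancer_given_pos, 3))  # 0.078 -> about 7.8%, no more, no less
[/mono]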

Bayes' Theorem describes what makes something "evidence" and how much evidence it is. Statistical models are judged by comparison to the [i]Bayesian method[/i] because, in statistics, the Bayesian method is as good as it gets - the Bayesian method defines the maximum amount of mileage you can get out of a given piece of evidence, in the same way that thermodynamics defines the maximum amount of work you can get out of a temperature differential. This is why you hear cognitive scientists talking about [i]Bayesian reasoners[/i]. In cognitive science, [i]Bayesian reasoner[/i] is the technically precise codeword that we use to mean [i]rational mind.[/i]

There are also a number of general heuristics about human reasoning that you can learn from looking at Bayes' Theorem.

For example, in many discussions of Bayes' Theorem, you may hear cognitive psychologists saying that people [i]do not take prior frequencies sufficiently into account,[/i] meaning that when people approach a problem where there's some evidence X indicating that condition A might hold true, they tend to judge A's likelihood solely by how well the evidence X seems to match A, without taking into account the prior frequency of A. If you think, for example, that in the mammography example above, the woman's chance of having breast cancer is in the range of 70%-80%, then this kind of reasoning is insensitive to the prior frequency given in the problem; it doesn't notice whether 1% of women or 10% of women start out having breast cancer. "Pay more attention to the prior frequency!" is one of the many things that humans need to bear in mind to partially compensate for our built-in inadequacies.
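
A sketch of what taking the prior seriously looks like: hold the likelihoods fixed (the same assumed 80% and 9.6% figures as above) and vary only the prior frequency. The evidence is identical, yet the answer changes dramatically:

[mono]
def posterior(prior, p_x_given_a, p_x_given_not_a):
    """p(A|X) by Bayes' Theorem."""
    p_x = p_x_given_a * prior + p_x_given_not_a * (1 - prior)
    return p_x_given_a * prior / p_x

for prior in (0.01, 0.10):
    print(prior, "->", round(posterior(prior, 0.80, 0.096), 3))
# 0.01 -> 0.078  (7.8%)
# 0.1  -> 0.481  (about 48%, still nowhere near 70%-80%)
[/mono]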

A related error is to pay too much attention to p(X|A) and not enough to p(X|~A) when determining how much evidence X is for A. The degree to which a result X is [i]evidence for A[/i] depends, not only on the strength of the statement [i]we'd expect to see result X if A were true,[/i] but also on the strength of the statement [i]we [b]wouldn't[/b] expect to see result X if A weren't true.[/i] For example, if it is raining, this very strongly implies the grass is wet [mono]- p(wetgrass|rain) ~ 1 [/mono]- but seeing that the grass is wet doesn't necessarily mean that it has just rained; perhaps the sprinkler was turned on, or you're looking at the early morning dew. Since [mono]p(wetgrass|~rain)[/mono] is substantially greater than zero, [mono]p(rain|wetgrass)[/mono] is substantially less than one. On the other hand, if the grass was [i]never[/i] wet when it wasn't raining, then knowing that the grass was wet would [i]always[/i] show that it was raining, [mono]p(rain|wetgrass) ~ 1[/mono], even if [mono]p(wetgrass|rain) = 50%[/mono]; that is, even if the grass only got wet 50% of the times it rained. Evidence is always the result of the [i]differential[/i] between the two conditional probabilities. [i]Strong[/i] evidence is not the product of a very high probability that A leads to X, but the product of a very [i]low[/i] probability that [i]not-A[/i] could have led to X.
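
A sketch of the wet-grass case with invented numbers, showing that the posterior is driven by the differential between the two conditional probabilities, not by [mono]p(wetgrass|rain)[/mono] alone:

[mono]
def posterior(prior, p_x_given_a, p_x_given_not_a):
    """p(A|X) by Bayes' Theorem."""
    p_x = p_x_given_a * prior + p_x_given_not_a * (1 - prior)
    return p_x_given_a * prior / p_x

p_rain = 0.20

# Rain almost always wets the grass, but sprinklers and dew wet it too:
print(round(posterior(p_rain, 0.99, 0.30), 2))  # 0.45 - wet grass is weak evidence

# Grass is never wet without rain, even though rain only wets it half the time:
print(round(posterior(p_rain, 0.50, 0.00), 2))  # 1.0 - wet grass proves rain
[/mono]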

The [i]Bayesian revolution in the sciences[/i] is fueled, not only by more and more cognitive scientists suddenly noticing that mental phenomena have Bayesian structure in them; not only by scientists in every field learning to judge their statistical methods by comparison with the Bayesian method; but also by the idea that [i]science itself is a special case of Bayes' Theorem; experimental evidence is Bayesian evidence.[/i] The Bayesian revolutionaries hold that when you perform an experiment and get evidence that "confirms" or "disconfirms" your theory, this confirmation and disconfirmation is governed by the Bayesian rules. For example, you have to take into account, not only whether your theory predicts the phenomenon, but whether other possible explanations also predict the phenomenon. Previously, the most popular philosophy of science was probably Karl Popper's [i]falsificationism[/i] - this is the old philosophy that the Bayesian revolution is currently dethroning. Karl Popper's idea that theories can be definitely falsified, but never definitely confirmed, is yet another special case of the Bayesian rules; if [mono]p(X|A) ~ 1[/mono] - if the theory makes a definite prediction - then observing ~X very strongly falsifies A. On the other hand, if [mono]p(X|A) ~ 1[/mono], and we observe X, this doesn't definitely confirm the theory; there might be some other condition B such that [mono]p(X|B) ~ 1[/mono], in which case observing X doesn't favor A over B. For observing X to definitely confirm A, we would have to know, not that [mono]p(X|A) ~ 1[/mono], but that [mono]p(X|~A) ~ 0[/mono], which is something that we can't know because we can't range over all possible alternative explanations. For example, when Einstein's theory of General Relativity toppled Newton's incredibly well-confirmed theory of gravity, it turned out that all of Newton's predictions were just a special case of Einstein's predictions.

You can even formalize Popper's philosophy mathematically. The likelihood ratio for X, [mono]p(X|A)/p(X|~A)[/mono], determines how much observing X slides the probability for A; the likelihood ratio is what says [i]how strong[/i] X is as evidence. Well, in your theory A, you can predict X with probability 1, if you like; but you can't control the denominator of the likelihood ratio, [mono]p(X|~A)[/mono] - there will always be some alternative theories that also predict X, and while we go with the simplest theory that fits the current evidence, you may someday encounter some evidence that an alternative theory predicts but your theory does not. That's the hidden gotcha that toppled Newton's theory of gravity. So there's a limit on how much mileage you can get from successful predictions; there's a limit on how high the likelihood ratio goes for [i]confirmatory[/i] evidence.
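
A sketch of the hidden gotcha, with invented numbers: you can set [mono]p(X|A) = 1[/mono] by building the prediction into your theory, but the denominator [mono]p(X|~A)[/mono] is not under your control, and it caps the likelihood ratio and therefore the size of the update:

[mono]
def update_odds(prior_prob, likelihood_ratio):
    """Posterior probability from prior odds times the likelihood ratio."""
    prior_odds = prior_prob / (1 - prior_prob)
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

p_x_given_a = 1.0      # your theory predicts X with certainty
p_x_given_not_a = 0.2  # but some rival theories predict X as well (assumed)
lr = p_x_given_a / p_x_given_not_a  # likelihood ratio of 5, no matter how sure you are

print(round(update_odds(0.50, lr), 3))  # 0.833 - confirmation, but only so much of it
[/mono]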

On the other hand, if you encounter some piece of evidence Y that is definitely [i]not[/i] predicted by your theory, this is [i]enormously[/i] strong evidence [i]against[/i] your theory. If [mono]p(Y|A)[/mono] is infinitesimal, then the likelihood ratio will also be infinitesimal. For example, if [mono]p(Y|A)[/mono] is 0.0001%, and [mono]p(Y|~A)[/mono] is 1%, then the likelihood ratio [mono]p(Y|A)/p(Y|~A)[/mono] will be 1:10000. -40 decibels of evidence! Or flipping the likelihood ratio, if [mono]p(Y|A)[/mono] is [i]very small,[/i] then [mono]p(Y|~A)/p(Y|A)[/mono] will be [i]very large,[/i] meaning that observing Y greatly favors ~A over A. Falsification is much stronger than confirmation. This is a consequence of the earlier point that [i]very strong[/i] evidence is not the product of a very high probability that A leads to X, but the product of a very [i]low[/i] probability that [i]not-A[/i] could have led to X. This is the precise Bayesian rule that underlies the heuristic value of Popper's falsificationism.
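
The decibel figure can be checked directly; decibels of evidence are ten times the base-10 logarithm of the likelihood ratio:

[mono]
import math

p_y_given_a = 0.000001  # 0.0001%: the theory all but forbids Y
p_y_given_not_a = 0.01  # 1%

likelihood_ratio = p_y_given_a / p_y_given_not_a   # 0.0001, i.e. 1:10000
print(round(10 * math.log10(likelihood_ratio), 1)) # -40.0 decibels of evidence
[/mono]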

Similarly, Popper's dictum that an idea must be falsifiable can be interpreted as a manifestation of the Bayesian conservation-of-probability rule; if a result X is positive evidence for the theory, then the result ~X would have disconfirmed the theory to some extent. If you try to interpret both X and ~X as "confirming" the theory, the Bayesian rules say this is impossible! To increase the probability of a theory you [i]must[/i] expose it to tests that can potentially decrease its probability; this is not just a rule for detecting would-be cheaters in the social process of science, but a consequence of Bayesian probability theory. On the other hand, Popper's idea that there is [i]only[/i] falsification and [i]no such thing[/i] as confirmation turns out to be incorrect. Bayes' Theorem shows that falsification is [i]very strong[/i] evidence compared to confirmation, but falsification is still probabilistic in nature; it is not governed by fundamentally different rules from confirmation, as Popper argued.
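
A sketch of the conservation rule, with invented numbers: the prior is a weighted average of the two possible posteriors, [mono]p(A) = p(X)p(A|X) + p(~X)p(A|~X)[/mono], so if observing X pushes the probability up, observing ~X must push it down:

[mono]
p_a = 0.40
p_x_given_a = 0.70
p_x_given_not_a = 0.20

p_x = p_x_given_a * p_a + p_x_given_not_a * (1 - p_a)
p_a_given_x = p_x_given_a * p_a / p_x                  # posterior if X is observed
p_a_given_not_x = (1 - p_x_given_a) * p_a / (1 - p_x)  # posterior if ~X is observed

print(round(p_a_given_x, 3))      # 0.7 - X confirms A...
print(round(p_a_given_not_x, 3))  # 0.2 - ...so ~X must disconfirm A
print(round(p_x * p_a_given_x + (1 - p_x) * p_a_given_not_x, 3))  # 0.4 - the prior, conserved
[/mono]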

So we find that many phenomena in the cognitive sciences, plus the statistical methods used by scientists, plus the scientific method itself, are all turning out to be special cases of Bayes' Theorem. Hence the Bayesian revolution.