An Intuitive Explanation of Bayes' Theorem
for the curious and bewildered;
an excruciatingly gentle introduction.
Your friends and colleagues are talking about something called "Bayes' Theorem" or "Bayes' Rule", or something called Bayesian reasoning. They sound really enthusiastic about it, too, so you google and find a webpage about Bayes' Theorem and...
It's this equation. That's all. Just one equation. The page you found gives a definition of it, but it doesn't say what it is, or why it's useful, or why your friends would be interested in it. It looks like this random statistics thing.
So you came here. Maybe you don't understand what the equation says. Maybe you understand it in theory, but every time you try to apply it in practice you get mixed up trying to remember the difference between p(a|x) and p(x|a), and whether p(a)*p(x|a) belongs in the numerator or the denominator. Maybe you see the theorem, and you understand the theorem, and you can use the theorem, but you can't understand why your friends and/or research colleagues seem to think it's the secret of the universe. Maybe your friends are all wearing Bayes' Theorem T-shirts, and you're feeling left out. Maybe you're a girl looking for a boyfriend, but the boy you're interested in refuses to date anyone who "isn't Bayesian". What matters is that Bayes is cool, and if you don't know Bayes, you aren't cool.
Why does a mathematical concept generate this strange enthusiasm in its students? What is the so-called Bayesian Revolution now sweeping through the sciences, which claims to subsume even the experimental method itself as a special case? What is the secret that the adherents of Bayes know? What is the light that they have seen?
Soon you will know. Soon you will be one of us.
While there are a few existing online explanations of Bayes' Theorem, my experience with trying to introduce people to Bayesian reasoning is that the existing online explanations are too abstract. Bayesian reasoning is very ''counterintuitive''. People do not employ Bayesian reasoning intuitively, find it very difficult to learn Bayesian reasoning when tutored, and rapidly forget Bayesian methods once the tutoring is over. This holds equally true for novice students and highly trained professionals in a field. Bayesian reasoning is apparently one of those things which, like quantum mechanics or the Wason Selection Test, is inherently difficult for humans to grasp with our built-in mental faculties.
Or so they claim. Here you will find an attempt to offer an ''intuitive'' explanation of Bayesian reasoning - an excruciatingly gentle introduction that invokes all the human ways of grasping numbers, from natural frequencies to spatial visualization. The intent is to convey, not abstract rules for manipulating numbers, but what the numbers mean, and why the rules are what they are (and cannot possibly be anything else). When you are finished reading this page, you will see Bayesian problems in your dreams.
And let's begin.
Here's a story problem about a situation that doctors often encounter:
1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?
What do you think the answer is? If you haven't encountered this kind of problem before, please take a moment to come up with your own answer before continuing.
Next, suppose I told you that most doctors get the same wrong answer on this problem - usually, only around 15% of doctors get it right. ("Really? 15%? Is that a real number, or an urban legend based on an Internet poll?" It's a real number. See Casscells, Schoenberger, and Graboys 1978; Eddy 1982; Gigerenzer and Hoffrage 1995; and many other studies. It's a surprising result which is easy to replicate, so it's been extensively replicated.)
On the story problem above, most doctors estimate the probability to be between 70% and 80%, which is wildly incorrect.
Here's an alternate version of the problem on which doctors fare somewhat better:
10 out of 1000 women at age forty who participate in routine screening have breast cancer. 800 out of 1000 women with breast cancer will get positive mammographies. 96 out of 1000 women without breast cancer will also get positive mammographies. If 1000 women in this age group undergo a routine screening, about what fraction of women with positive mammographies will actually have breast cancer?
And finally, here's the problem on which doctors fare best of all, with 46% - nearly half - arriving at the correct answer:
100 out of 10,000 women at age forty who participate in routine screening have breast cancer. 80 of every 100 women with breast cancer will get a positive mammography. 950 out of 9,900 women without breast cancer will also get a positive mammography. If 10,000 women in this age group undergo a routine screening, about what fraction of women with positive mammographies will actually have breast cancer?
The correct answer is 7.8%, obtained as follows: Out of 10,000 women, 100 have breast cancer; 80 of those 100 have positive mammographies. From the same 10,000 women, 9,900 will not have breast cancer and of those 9,900 women, 950 will also get positive mammographies. This makes the total number of women with positive mammographies 950+80 or 1,030. Of those 1,030 women with positive mammographies, 80 will have cancer. Expressed as a proportion, this is 80/1,030 or 0.07767 or 7.8%.
To put it another way, before the mammography screening, the 10,000 women can be divided into two groups:
- Group 1: 100 women ''with'' breast cancer.
- Group 2: 9,900 women ''without'' breast cancer.
Summing these two groups gives a total of 10,000 patients, confirming that none have been lost in the math. After the mammography, the women can be divided into four groups:
- Group A: 80 women ''with'' breast cancer, and a ''positive'' mammography.
- Group B: 20 women ''with'' breast cancer, and a ''negative'' mammography.
- Group C: 950 women ''without'' breast cancer, and a ''positive'' mammography.
- Group D: 8,950 women ''without'' breast cancer, and a ''negative'' mammography.
As you can check, the sum of all four groups is still 10,000. The sum of groups A and B, the groups with breast cancer, corresponds to group 1; and the sum of groups C and D, the groups without breast cancer, corresponds to group 2; so administering a mammography does not actually ''change'' the number of women with breast cancer. The proportion of the cancer patients (A + B) within the complete set of patients (A + B + C + D) is the same as the 1% prior chance that a woman has cancer: (80 + 20) / (80 + 20 + 950 + 8950) = 100 / 10000 = 1%.
The proportion of cancer patients with positive results, within the group of ''all'' patients with positive results, is the proportion of (A) within (A + C): 80 / (80 + 950) = 80 / 1030 = 7.8%. If you administer a mammography to 10,000 patients, then out of the 1030 with positive mammographies, 80 of those positive-mammography patients will have cancer. This is the correct answer, the answer a doctor should give a positive-mammography patient if she asks about the chance she has breast cancer; if thirteen patients ask this question, roughly 1 out of those 13 will have cancer.
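If you would rather let a computer do the bookkeeping, the four-group split above can be sketched in a few lines of Python. This is only an illustration: the function name `split_into_groups` and its parameter names are made up for this sketch, and the counts are the ones from the problem statement.

```python
def split_into_groups(with_cancer, without_cancer, true_positives, false_positives):
    """Split screened women into the four groups A, B, C, D (as counts)."""
    return {
        "A": true_positives,                    # cancer, positive mammography
        "B": with_cancer - true_positives,      # cancer, negative mammography
        "C": false_positives,                   # no cancer, positive mammography
        "D": without_cancer - false_positives,  # no cancer, negative mammography
    }


groups = split_into_groups(with_cancer=100, without_cancer=9_900,
                           true_positives=80, false_positives=950)

# Nobody is lost in the math: the four groups still add up to 10,000.
total = sum(groups.values())

# Prior: cancer patients (A + B) within the complete set of patients.
prior = (groups["A"] + groups["B"]) / total

# Posterior: cancer patients with a positive result (A) within all positives (A + C).
posterior = groups["A"] / (groups["A"] + groups["C"])

print(f"total = {total}, prior = {prior:.1%}, posterior = {posterior:.1%}")
# prints: total = 10000, prior = 1.0%, posterior = 7.8%
```

Working with whole counts rather than percentages mirrors the version of the problem on which doctors do best.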
The most common mistake is to ignore the original fraction of women with breast cancer, and the fraction of women without breast cancer who receive false positives, and focus only on the fraction of women with breast cancer who get positive results. For example, the vast majority of doctors in these studies seem to have thought that if around 80% of women with breast cancer have positive mammographies, then the probability of a woman with a positive mammography having breast cancer must be around 80%.
Figuring out the final answer always requires ''all three'' pieces of information - the percentage of women with breast cancer, the percentage of women without breast cancer who receive false positives, and the percentage of women with breast cancer who receive (correct) positives.
To see that the final answer always depends on the original fraction of women with breast cancer, consider an alternate universe in which only one woman out of a million has breast cancer. Even if mammography in this world detects breast cancer in 8 out of 10 cases, while returning a false positive on a woman without breast cancer in only 1 out of 10 cases, there will still be roughly 125,000 false positives for every real case of cancer detected. The original probability that a woman has cancer is so extremely low that, although a positive result on the mammography does ''increase'' the estimated probability, the probability isn't increased to certainty or even "a noticeable chance"; the probability goes from 1:1,000,000 to roughly 1:125,000.
Similarly, in an alternate universe where only one out of a million women does not have breast cancer, a positive result on the patient's mammography obviously doesn't mean that she has an 80% chance of having breast cancer! If this were the case her estimated probability of having cancer would have been revised drastically ''downward'' after she got a ''positive'' result on her mammography - an 80% chance of having cancer is a lot less than 99.9999%! If you administer mammographies to ten million women in this world, around eight million women with breast cancer will get correct positive results, while one woman without breast cancer will get false positive results. Thus, if you got a positive mammography in this alternate universe, your chance of having cancer would go from 99.9999% up to 99.999987%. That is, your chance of being healthy would go from 1:1,000,000 down to 1:8,000,000.
These two extreme examples help demonstrate that the mammography result doesn't replace your old information about the patient's chance of having cancer; the mammography ''slides'' the estimated probability in the direction of the result. A positive result slides the original probability upward; a negative result slides the probability downward. For example, in the original problem where 1% of the women have cancer, 80% of women with cancer get positive mammographies, and 9.6% of women without cancer get positive mammographies, a positive result on the mammography ''slides'' the 1% chance upward to 7.8%.
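The "sliding" can be sketched as a small Python function. Treat this as an illustration under stated assumptions rather than a definitive implementation: the name `update` and its parameter names are invented here, and the only numbers used are the three from the original problem.

```python
def update(prior, p_result_if_cancer, p_result_if_healthy):
    """Chance of cancer after seeing a test result.

    prior               -- chance of cancer before the test
    p_result_if_cancer  -- chance of this result for a woman with cancer
    p_result_if_healthy -- chance of this result for a woman without cancer
    """
    with_cancer = prior * p_result_if_cancer
    without_cancer = (1 - prior) * p_result_if_healthy
    return with_cancer / (with_cancer + without_cancer)


prior = 0.01
after_positive = update(prior, 0.80, 0.096)          # positive mammography
after_negative = update(prior, 1 - 0.80, 1 - 0.096)  # negative mammography

print(f"prior {prior:.2%} -> positive {after_positive:.2%}, negative {after_negative:.2%}")
```

With these inputs a positive result slides the 1% chance up to about 7.8%, and a negative result slides it down to roughly 0.2%. The second figure is not quoted in the text; it simply falls out of the same three numbers.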
Most people encountering problems of this type for the first time carry out the mental operation of ''replacing'' the original 1% probability with the 80% probability that a woman with cancer gets a positive mammography. It may seem like a good idea, but it just doesn't work. "The probability that a woman with a positive mammography has breast cancer" is not at all the same thing as "the probability that a woman with breast cancer has a positive mammography"; they are as unlike as apples and cheese. Finding the final answer, "the probability that a woman with a positive mammography has breast cancer", uses all three pieces of problem information - "the prior probability that a woman has breast cancer", "the probability that a woman with breast cancer gets a positive mammography", and "the probability that a woman without breast cancer gets a positive mammography".
Q. What is the Bayesian Conspiracy?
A. The Bayesian Conspiracy is a multinational, interdisciplinary, and shadowy group of scientists that controls publication, grants, tenure, and the illicit traffic in grad students. The best way to be accepted into the Bayesian Conspiracy is to join the Campus Crusade for Bayes in high school or college, and gradually work your way up to the inner circles. It is rumored that at the upper levels of the Bayesian Conspiracy exist nine silent figures known only as the Bayes Council.
To see that the final answer always depends on the chance that a woman ''without'' breast cancer gets a positive mammography, consider an alternate test, mammography+. Like the original test, mammography+ returns positive for 80% of women with breast cancer. However, mammography+ returns a positive result for only one out of a million women without breast cancer - mammography+ has the same rate of false negatives, but a vastly lower rate of false positives. Suppose a patient receives a positive mammography+. What is the chance that this patient has breast cancer? Under the new test, it is a virtual certainty - 99.988%, i.e., a 1 in 8082 chance of being healthy.
Remember, at this point, that neither mammography nor mammography+ actually ''changes'' the number of women who have breast cancer. It may seem like "There is a virtual certainty you have breast cancer" is a terrible thing to say, causing much distress and despair; that the more hopeful verdict of the previous mammography test - a 7.8% chance of having breast cancer - was much to be preferred. This comes under the heading of "Don't shoot the messenger". The number of women who really do have cancer stays exactly the same between the two cases. Only the accuracy with which we ''detect'' cancer changes. Under the previous mammography test, 80 women with cancer (who ''already'' had cancer, before the mammography) are first told that they have a 7.8% chance of having cancer, creating X amount of uncertainty and fear, after which more detailed tests will inform them that they definitely do have breast cancer. The old mammography test also involves informing 950 women ''without'' breast cancer that they have a 7.8% chance of having cancer, thus creating twelve times as much additional fear and uncertainty. The new test, mammography+, does ''not'' give 950 women false positives, and the 80 women with cancer are told the same facts they would have learned eventually, only earlier and without an intervening period of uncertainty. Mammography+ is thus a better test in terms of its total emotional impact on patients, as well as being more accurate. Regardless of its emotional impact, it remains a fact that a patient with positive mammography+ has a 99.988% chance of having breast cancer.
Of course, that mammography+ does ''not'' give 950 healthy women false positives means that all 80 of the patients with positive mammography+ will be patients with breast cancer. Thus, if you have a positive mammography+, your chance of having cancer is a virtual certainty. It is ''because'' mammography+ does not generate as many false positives (and needless emotional stress), that the (much smaller) group of patients who ''do'' get positive results will be composed almost entirely of genuine cancer patients (who have bad news coming to them regardless of when it arrives).
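For readers who want to check the mammography+ figures, here is a short Python sketch. The variable names are made up for this illustration; the inputs are the 1% prior, the 80% hit rate, and the one-in-a-million false positive rate given above.

```python
prior = 0.01             # 1% of screened women have breast cancer
p_pos_if_cancer = 0.80   # same hit rate as the original mammography
p_pos_if_healthy = 1e-6  # one false positive per million healthy women

true_pos = prior * p_pos_if_cancer          # share of all women: cancer and positive
false_pos = (1 - prior) * p_pos_if_healthy  # share of all women: healthy and positive

p_cancer = true_pos / (true_pos + false_pos)
p_healthy = false_pos / (true_pos + false_pos)

print(f"chance of cancer: {p_cancer:.3%}")                    # 99.988%
print(f"chance of being healthy: 1 in {1 / p_healthy:.0f}")   # 1 in 8082
```

Only the false positive rate differs from the original test; the prior and the hit rate are untouched, yet the answer moves from 7.8% to a virtual certainty.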
Similarly, let's suppose that we have a ''less'' discriminating test, mammography*, that still has a 20% rate of false negatives, as in the original case. However, mammography* has an 80% rate of false positives. In other words, a patient ''without'' breast cancer has an 80% chance of getting a false positive result on her mammography* test. If we suppose the same 1% prior probability that a patient presenting herself for screening has breast cancer, what is the chance that a patient with positive mammography* has cancer?
- Group 1: 100 patients with breast cancer.
- Group 2: 9,900 patients without breast cancer.
After mammography* screening:
- Group A: 80 patients with breast cancer and a "positive" mammography*.
- Group B: 20 patients with breast cancer and a "negative" mammography*.
- Group C: 7,920 patients without breast cancer and a "positive" mammography*.
- Group D: 1,980 patients without breast cancer and a "negative" mammography*.
The result works out to 80 / 8,000, or 0.01. This is exactly the same as the 1% prior probability that a patient has breast cancer! A "positive" result on mammography* doesn't change the probability that a woman has breast cancer at all. You can similarly verify that a "negative" mammography* also counts for nothing. And in fact it ''must'' be this way, because if mammography* has an 80% hit rate for patients with breast cancer, and also an 80% rate of false positives for patients without breast cancer, then mammography* is completely ''uncorrelated'' with breast cancer. There's no reason to call one result "positive" and one result "negative"; in fact, there's no reason to call the test a "mammography". You can throw away your expensive mammography* equipment and replace it with a random number generator that outputs a red light 80% of the time and a green light 20% of the time; the results will be the same. Furthermore, there's no reason to call the red light a "positive" result or the green light a "negative" result. You could have a green light 80% of the time and a red light 20% of the time, or a blue light 80% of the time and a purple light 20% of the time, and it would all have the same bearing on whether the patient has breast cancer: i.e., no bearing whatsoever.
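Here is a quick Python check of both claims, the "positive" case and the "negative" case, using the four mammography* group counts from the example. The dictionary layout is just one convenient way to hold the counts, not anything prescribed by the problem.

```python
# Counts from the mammography* example (10,000 patients, 1% prior).
groups = {"A": 80, "B": 20, "C": 7_920, "D": 1_980}

prior = (groups["A"] + groups["B"]) / sum(groups.values())  # 100 / 10,000
after_positive = groups["A"] / (groups["A"] + groups["C"])  # 80 / 8,000
after_negative = groups["B"] / (groups["B"] + groups["D"])  # 20 / 2,000

print(prior, after_positive, after_negative)
```

All three numbers come out as 0.01, so neither result of mammography* moves the probability at all.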
We can show algebraically that this must hold whenever the chance of a true positive and the chance of a false positive are the same. Start with the same two groups:
- Group 1: 100 patients with breast cancer.
- Group 2: 9,900 patients without breast cancer.
Now consider a test where the probability of a true positive and the probability of a false positive are the same number M (in the example above, M=80% or M = 0.8):
- Group A: 100*M patients with breast cancer and a "positive" result.
- Group B: 100*(1 - M) patients with breast cancer and a "negative" result.
- Group C: 9,900*M patients without breast cancer and a "positive" result.
- Group D: 9,900*(1 - M) patients without breast cancer and a "negative" result.
The proportion of patients with breast cancer, within the group of patients with a "positive" result, then equals 100*M / (100*M + 9900*M) = 100 / (100 + 9900) = 1%. This holds true regardless of whether M is 80%, 30%, 50%, or 100%. If we have a mammography* test that returns "positive" results for 90% of patients with breast cancer and returns "positive" results for 90% of patients without breast cancer, the proportion of "positive"-testing patients who have breast cancer will still equal the original proportion of patients with breast cancer, i.e., 1%.
You can run through the same algebra, replacing the prior proportion of patients with breast cancer with an arbitrary percentage P:
- Group 1: Within some number of patients, a fraction P have breast cancer.
- Group 2: Within some number of patients, a fraction (1 - P) do not have breast cancer.
After a "cancer test" that returns "positive" for a fraction M of patients with breast cancer, and also returns "positive" for the same fraction M of patients without cancer:
- Group A: P*M patients have breast cancer and a "positive" result.
- Group B: P*(1 - M) patients have breast cancer and a "negative" result.
- Group C: (1 - P)*M patients have no breast cancer and a "positive" result.
- Group D: (1 - P)*(1 - M) patients have no breast cancer and a "negative" result.
The chance that a patient with a "positive" result has breast cancer is then the proportion of group A within the combined group A + C, or P*M / [P*M + (1 - P)*M], which, cancelling the common factor M from the numerator and denominator, is P / [P + (1 - P)] or P / 1 or just P. If the rate of false positives is the same as the rate of true positives, you always have the same probability after the test as when you started.
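If algebra isn't convincing on its own, the cancellation can be spot-checked numerically. The sketch below uses Python's standard `fractions.Fraction` so that the arithmetic is exact; the function name `posterior` and the particular values of P and M are arbitrary choices for illustration. M must be greater than zero, since a test that never returns "positive" gives no positive group to divide by.

```python
from fractions import Fraction


def posterior(prior, p_pos_if_cancer, p_pos_if_healthy):
    """Proportion of group A within the combined group A + C."""
    group_a = prior * p_pos_if_cancer
    group_c = (1 - prior) * p_pos_if_healthy
    return group_a / (group_a + group_c)


# Exact fractions avoid floating-point rounding, so equality can be tested directly.
priors = [Fraction(1, 100), Fraction(1, 4), Fraction(1, 2), Fraction(9, 10)]
rates = [Fraction(3, 10), Fraction(1, 2), Fraction(8, 10), Fraction(1)]

# Same rate M for true positives and false positives: the posterior equals the prior.
unchanged = all(posterior(p, m, m) == p for p in priors for m in rates)
print(unchanged)  # True
```

A finite grid of values is not a proof, of course; the proof is the cancellation of M above. The check merely confirms that the algebra was carried out correctly.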
Which is common sense. Take, for example, the "test" of flipping a coin; if the coin comes up heads, does it tell you anything about whether a patient has breast cancer? No; the coin has a 50% chance of coming up heads if the patient has breast cancer, and also a 50% chance of coming up heads if the patient does not have breast cancer. Therefore there is no reason to call either heads or tails a "positive" result. It's not the probability being "50/50" that makes the coin a bad test; it's that the two probabilities, for "cancer patient turns up heads" and "healthy patient turns up heads", are the same. If the coin was slightly biased, so that it had a 60% chance of coming up heads, it still wouldn't be a cancer test - what makes a coin a poor test is not that it has a 50/50 chance of coming up heads if the patient has cancer, but that it also has a 50/50 chance of coming up heads if the patient does not have cancer. You can even use a test that comes up "positive" for cancer patients 100% of the time, and still not learn anything. An example of such a test is "Add 2 + 2 and see if the answer is 4." This test returns positive 100% of the time for patients with breast cancer. It also returns positive 100% of the time for patients without breast cancer. So you learn nothing.
The original proportion of patients with breast cancer is known as the prior probability. The chance that a patient with breast cancer gets a positive mammography, and the chance that a patient without breast cancer gets a positive mammography, are known as the two conditional probabilities. Collectively, this initial information is known as the priors. The final answer - the estimated probability that a patient has breast cancer, given that we know she has a positive result on her mammography - is known as the revised probability or the posterior probability. What we've just shown is that if the two conditional probabilities are equal, the posterior probability equals the prior probability.
Fun Fact!
Q. How can I find the priors for a problem?
A. Many commonly used priors are listed in the Handbook of Chemistry and Physics.
Q. Where do priors originally come from?
A. Never ask that question.
Q. Uh huh. Then where do scientists get their priors?
A. Priors for scientific problems are established by annual vote of the AAAS. In recent years the vote has become fractious and controversial, with widespread acrimony, factional polarization, and several outright assassinations. This may be a front for infighting within the Bayes Council, or it may be that the disputants have too much spare time. No one is really sure.
Q. I see. And where does everyone else get their priors?
A. They download their priors from Kazaa.
Q. What if the priors I want aren't available on Kazaa?
A. There's a small, cluttered antique shop in a back alley of San Francisco's Chinatown. Don't ask about the bronze rat.
Actually, priors are true or false just like the final answer - they reflect reality and can be judged by comparing them against reality. For example, if you think that 920 out of 10,000 women in a sample have breast cancer, and the actual number is 100 out of 10,000, then your priors are wrong. For our particular problem, the priors might have been established by three studies - a study on the case histories of women with breast cancer to see how many of them tested positive on a mammography, a study on women without breast cancer to see how many of them test positive on a mammography, and an epidemiological study on the prevalence of breast cancer in some specific demographic.
Suppose that a barrel contains many small plastic eggs. Some eggs are painted red and some are painted blue. 40% of the eggs in the barrel contain pearls, and 60% contain nothing. 30% of eggs containing pearls are painted blue, and 10% of eggs containing nothing are painted blue. What is the probability that a blue egg contains a pearl? For this example the arithmetic is simple enough that you may be able to do it in your head, and I would suggest trying to do so.
A more compact way of specifying the problem:
p(pearl) = 40%
p(blue|pearl) = 30%
p(blue|~pearl) = 10%
p(pearl|blue) = ?
"~" is shorthand for "not", so ~pearl reads "not pearl".
blue|pearl is shorthand for "blue given pearl" or "the probability that an egg is painted blue, given that the egg contains a pearl". One thing that's confusing about this notation is that the order of implication is read right-to-left, as in Hebrew or Arabic. blue|pearl means "blue<-pearl", the degree to which pearl-ness implies blue-ness, not the degree to which blue-ness implies pearl-ness. This is confusing, but it's unfortunately the standard notation in probability theory.
Readers familiar with quantum mechanics will have already encountered this peculiarity; in quantum mechanics, for example, <d|c><c|b><b|a> reads as "the probability that a particle at A goes to B, then to C, ending up at D". To follow the particle, you move your eyes from right to left. Reading from left to right, "|" means "given"; reading from right to left, "|" means "implies" or "leads to". Thus, moving your eyes from left to right, blue|pearl reads "blue given pearl" or "the probability that an egg is painted blue, given that the egg contains a pearl". Moving your eyes from right to left, blue|pearl reads "pearl implies blue" or "the probability that an egg containing a pearl is painted blue".
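Returning to the egg problem stated in the compact notation above, here is one way to work out p(pearl|blue) in Python. It is a sketch with variable names of my own choosing; `p_blue_given_empty` stands for p(blue|~pearl). Exact fractions are used so that the answer prints as a fraction rather than a rounded decimal.

```python
from fractions import Fraction

p_pearl = Fraction(40, 100)             # p(pearl)
p_blue_given_pearl = Fraction(30, 100)  # p(blue|pearl)
p_blue_given_empty = Fraction(10, 100)  # p(blue|~pearl)

blue_with_pearl = p_pearl * p_blue_given_pearl       # 12% of all eggs
blue_and_empty = (1 - p_pearl) * p_blue_given_empty  # 6% of all eggs

# Blue eggs with pearls, within the group of all blue eggs.
p_pearl_given_blue = blue_with_pearl / (blue_with_pearl + blue_and_empty)
print(p_pearl_given_blue)  # 2/3
```

In other words, 12% of all eggs are blue with a pearl and 6% are blue and empty, so two out of every three blue eggs contain a pearl, or about 67%.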