Concise Summary
Most doctors get mammography probability problems wildly wrong — estimating 70-80% where the true answer is under 8%. Yudkowsky traces this to a single systematic error: people replace the prior probability with the hit rate of the test, rather than correctly updating the prior. Walking through cancer-screening numbers with spatial visualizations and natural-frequency framings, the essay builds up Bayes's Theorem from first principles. By the end, the formula P(A|X) = [P(X|A)·P(A)] / [P(X|A)·P(A) + P(X|¬A)·P(¬A)] should feel obvious rather than mysterious — a description of how evidence slides prior probabilities toward or away from a hypothesis.
Infographic
Prior probability is never optional
Even a highly accurate test produces mostly false positives when the base rate of disease is low. The mammogram slides the prior; it does not replace it.
Natural frequencies beat percentages
Restating the problem in natural frequencies ('100 out of 10,000 women', '80 of every 100 women with breast cancer') roughly triples the correct-answer rate, from about 15% to 46%, because it makes the subgroups concrete and countable.
Likelihood ratios compress evidence
The ratio P(X|A)/P(X|¬A) captures how much a positive result shifts belief; multiple independent tests combine by multiplying their likelihood ratios or adding their decibel values.
Science is Bayes in disguise
Experimental confirmation and Popperian falsification are both special cases of Bayesian updating; falsification is just extremely strong disconfirmatory evidence.
Mind on one side, reality on the other
Bayes's Theorem puts rational inference, P(A|X), on the left and physical causality, P(X|A), on the right; the equation literally links how we reason to how the world works.
Detailed Summary
Yudkowsky opens with a confession of method: there are already explanations of Bayes's Theorem online, but they are too abstract. Bayesian reasoning is "counterintuitive" — not just for novices, but for trained professionals — and existing treatments fail to convey what the numbers mean, only how to manipulate them. The essay promises an "excruciatingly gentle" introduction built around spatial visualization and natural frequencies.
The mammography problem
The canonical example: 1% of women at age forty have breast cancer. 80% of women with breast cancer get positive mammograms. 9.6% of women without breast cancer also get positive mammograms. Given a positive mammogram, what is the probability of cancer?
Most doctors answer around 70–80%. The true answer is 7.8%. The essay walks through the arithmetic twice: first with 10,000 women divided into groups A, B, C, D, then algebraically. The key move is to show all 1,030 positive-mammogram cases (80 cancerous + 950 non-cancerous) at once, so that it becomes visible how small a share the 80 true positives are. Restating the problem in natural frequencies ("100 out of 10,000 women", "80 of every 100 women with breast cancer") raises the correct-answer rate from about 15% of doctors to about 46%; the intermediate "10 out of 1,000 women" phrasing helps only somewhat.
Prior, conditional, posterior
The essay names its components precisely:
- Prior probability: the original fraction with cancer (1%)
- Conditional probabilities: P(positive|cancer) and P(positive|¬cancer); together with the prior probability, these make up what the essay calls "the priors"
- Posterior probability: P(cancer|positive) — the revised probability after seeing the test result
The mammogram slides the prior; it does not replace it. An alternate universe with a 1-in-a-million base rate would yield ~100,000 false positives for every real cancer detected even with a very accurate test.
Degrees of freedom and the full probability landscape
A substantial middle section shows that P(A|X), P(A,X), and P(X|A) are three different things, and that the four groups {A, B, C, D} have exactly three degrees of freedom among them once normalised to probabilities. This is why you always need exactly three pieces of independent information to specify a Bayesian problem — the prior plus two conditional probabilities have three degrees of freedom.
Likelihood ratios and decibels
The likelihood ratio P(X|A)/P(X|¬A) summarises how much a positive result shifts belief. Multiple independent tests can be combined by multiplying their likelihood ratios or, equivalently, by summing their decibel values, where each decibel value is 10 × log₁₀ of the ratio (a measure of evidence suggested by E. T. Jaynes). A prior of 1:99 odds, updated by three tests with likelihood ratios of 8.33, 18.0, and 3.5, yields odds of 3150:594 ≈ 84% probability.
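The odds arithmetic in this paragraph can be checked with a short sketch. It assumes the ratio quoted as 8.33 is exactly 25/3 (the value that reproduces the 3150:594 odds), and the helper name `decibels` is illustrative rather than anything from the essay.

```python
import math

def decibels(likelihood_ratio):
    """Evidence in decibels: 10 * log10 of a likelihood (or odds) ratio."""
    return 10 * math.log10(likelihood_ratio)

prior_odds = 1 / 99                      # 1:99 odds in favour of the hypothesis
likelihood_ratios = [25 / 3, 18.0, 3.5]  # assumption: 8.33 is taken as 25/3

# Route 1: multiply the prior odds by every likelihood ratio.
posterior_odds = prior_odds * math.prod(likelihood_ratios)
posterior_probability = posterior_odds / (1 + posterior_odds)

# Route 2: add decibels, then convert the total back to odds.
total_db = decibels(prior_odds) + sum(decibels(r) for r in likelihood_ratios)
odds_from_db = 10 ** (total_db / 10)

print(round(posterior_probability, 3))  # 0.841
```

Both routes give the same posterior odds, which is the sense in which adding decibels is equivalent to multiplying ratios.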
Science as Bayes's Theorem
The essay closes with the broader claim: Popper's falsificationism is a special case of Bayesian inference. A definite prediction P(X|A) ≈ 1 makes ¬X enormously disconfirming (high likelihood ratio for ¬A). But confirmation is always limited because you cannot control P(X|¬A) — there will always be alternative theories that also predict X. Falsification is asymmetrically stronger than confirmation, which is precisely why Popper's heuristic has value. And the scientific method itself — experimental evidence confirming or disconfirming theories — is just Bayesian updating at scale.
"Rational inference on the left end, physical causality on the right end; an equation with mind on one side and reality on the other."
FAQ
Why do most doctors get the mammography problem so wrong?
They commit the base-rate neglect error: they focus only on the 80% true-positive rate and ignore both the 1% prior prevalence of cancer and the 9.6% false-positive rate. This is cognitively natural, since the test's hit rate feels like the most salient number, but mathematically it confuses P(positive|cancer) with P(cancer|positive), two very different quantities.
What is the intuition behind needing all three pieces of information?
Think of it in terms of groups: you need to know (a) how many people are in the 'has disease' group to start with, (b) what fraction of that group tests positive, and (c) what fraction of the much larger 'no disease' group also tests positive. False positives from the large group can overwhelm true positives from the small group, so any one or two of these is insufficient.
What is a likelihood ratio and why does it matter?
The likelihood ratio is P(X|A) / P(X|¬A): how much more likely the evidence X is under hypothesis A than under its negation. It compactly describes how much a piece of evidence moves your belief. Multiple independent tests multiply their likelihood ratios, which is why adding decibel scores (log-scaled) is convenient for mental arithmetic with several tests.
How is Popper's falsificationism a special case of Bayes's Theorem?
If a theory A predicts X with near-certainty (P(X|A) ≈ 1), then observing ¬X delivers an enormous likelihood ratio in favor of ¬A, and the theory is strongly falsified. Confirmation is weaker because you cannot guarantee that no alternative theory also predicts X. So falsification is not a separate logical category from confirmation; it is just Bayesian updating with a very extreme likelihood ratio.
Why does rephrasing probabilities as natural frequencies help so much?
Natural frequencies embed the prior probability into the conditional counts. Saying '12 out of 40 eggs containing pearls are blue', alongside '40 out of 100 eggs contain pearls', carries the prior inside the counts themselves. This matches how we naturally sample and count objects, which is likely why it evokes correct reasoning more reliably than abstract percentages.
What does 'mind on one side, reality on the other' mean?
The left side of Bayes's Theorem, P(A|X), represents an inferential step: from observation X to updated belief about A. The right side is built from the prior P(A) and the conditional terms P(X|A) and P(X|¬A), which run in the causal direction: how facts about the world (having cancer) physically produce observations (a positive test). Bayes's Theorem is the bridge between these two directions, from physical causality to rational inference.
In-depth Analysis · Pros & Cons
Written with an explicit pedagogical mission (to make Bayesian reasoning feel obvious rather than abstruse), this essay is arguably the most influential single piece of rationalist writing on the internet. It succeeds by grounding abstract probability theory in concrete counting problems, and by naming the cognitive error (base-rate neglect) rather than just correcting the arithmetic.
- Pedagogical scaffolding is masterful: Yudkowsky presents the same problem three ways (percentages, frequencies, natural frequencies), progressively reducing cognitive load, and explicitly reports how performance changes at each step. This is rare intellectual honesty about what works pedagogically.
- The degrees-of-freedom analysis is underrated: The middle section showing that any Bayesian problem has exactly three degrees of freedom is mathematically clean and practically useful. It tells you precisely what information you can and cannot infer from a given dataset.
- The Popper section is genuinely illuminating: Explaining why falsification is asymmetrically stronger than confirmation as a consequence of likelihood-ratio arithmetic, not as a philosophical axiom, demystifies a long-standing debate in philosophy of science.
- The decibel insight is elegant: Introducing Jaynes's suggestion to measure evidence in decibels, so that multiple independent tests simply add, gives readers a practical mental tool for rough Bayesian bookkeeping.
- The cancer-screening numbers are outdated: The specific figures used (1% prevalence, 80% sensitivity, 9.6% false-positive rate) are stylized teaching numbers, not current clinical ones. Modern mammography sensitivity and specificity differ, and the 'only 15% of doctors get it right' claim has been contested in subsequent literature.
- Independence assumption goes unexamined: The section on combining multiple tests multiplies likelihood ratios, which is valid only if the tests are statistically independent. Real diagnostic tests are often correlated. The essay names this assumption once but does not stress how badly the calculation breaks if it fails.
- The priors section ducks the hard question: The humorous 'where do priors come from?' exchange avoids the hardest practical question in Bayesian reasoning. Choosing and justifying priors for real-world problems, especially with sparse data or strong theoretical disagreements, is genuinely difficult and deserves more than a joke.
- Conflates Bayesian updating with rational agency: The claim that 'Bayesian reasoner' is 'the technically precise code word for rational mind' is contested even among Bayesians. Issues like computational tractability, the reference-class problem, and the status of non-probabilistic uncertainty are glossed over as if Bayes solves everything.
The essay achieves its stated purpose so well that it has shaped how a generation thinks about evidence and inference. Its weaknesses are mostly sins of omission: hard problems (priors, independence, computational limits) that it gestures at but does not resolve. Read it to build the core intuition; supplement it with decision theory and statistics for everything else.
Original Text
(Note: The author now considers this explanation obsoleted by the Bayes' Rule Guide.)
[Editor’s Note: This is an abridgement of the original version of this essay, which contained many interactive elements.]
Your friends and colleagues are talking about something called “Bayes’s Theorem” or “Bayes’s Rule,” or something called Bayesian reasoning. They sound really enthusiastic about it, too, so you google and find a web page about Bayes’s Theorem and . . .
It’s this equation. That’s all. Just one equation. The page you found gives a definition of it, but it doesn’t say what it is, or why it’s useful, or why your friends would be interested in it. It looks like this random statistics thing.
Why does a mathematical concept generate this strange enthusiasm in its students? What is the so-called Bayesian Revolution now sweeping through the sciences, which claims to subsume even the experimental method itself as a special case? What is the secret that the adherents of Bayes know? What is the light that they have seen?
Soon you will know. Soon you will be one of us.
While there are a few existing online explanations of Bayes’s Theorem, my experience with trying to introduce people to Bayesian reasoning is that the existing online explanations are too abstract. Bayesian reasoning is very counterintuitive. People do not employ Bayesian reasoning intuitively, find it very difficult to learn Bayesian reasoning when tutored, and rapidly forget Bayesian methods once the tutoring is over. This holds equally true for novice students and highly trained professionals in a field. Bayesian reasoning is apparently one of those things which, like quantum mechanics or the Wason Selection Test, is inherently difficult for humans to grasp with our built-in mental faculties.
Or so they claim. Here you will find an attempt to offer an intuitive explanation of Bayesian reasoning—an excruciatingly gentle introduction that invokes all the human ways of grasping numbers, from natural frequencies to spatial visualization. The intent is to convey, not abstract rules for manipulating numbers, but what the numbers mean, and why the rules are what they are (and cannot possibly be anything else). When you are finished reading this, you will see Bayesian problems in your dreams.
And let’s begin.
Here’s a story problem about a situation that doctors often encounter:
1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammograms. 9.6% of women without breast cancer will also get positive mammograms. A woman in this age group had a positive mammogram in a routine screening. What is the probability that she actually has breast cancer?
What do you think the answer is? If you haven’t encountered this kind of problem before, please take a moment to come up with your own answer before continuing.
Next, suppose I told you that most doctors get the same wrong answer on this problem; usually, only around 15% of doctors get it right. (“Really? 15%? Is that a real number, or an urban legend based on an Internet poll?” It’s a real number. See Casscells, Schoenberger, and Graboys 1978;[1] Eddy 1982;[2] Gigerenzer and Hoffrage 1995;[3] and many other studies. It’s a surprising result which is easy to replicate, so it’s been extensively replicated.)
On the story problem above, most doctors estimate the probability to be between 70% and 80%, which is wildly incorrect.
Here’s an alternate version of the problem on which doctors fare somewhat better:
10 out of 1,000 women at age forty who participate in routine screening have breast cancer. 800 out of 1,000 women with breast cancer will get positive mammograms. 96 out of 1,000 women without breast cancer will also get positive mammograms. If 1,000 women in this age group undergo a routine screening, about what fraction of women with positive mammograms will actually have breast cancer?
And finally, here’s the problem on which doctors fare best of all, with 46% (nearly half) arriving at the correct answer:
100 out of 10,000 women at age forty who participate in routine screening have breast cancer. 80 of every 100 women with breast cancer will get a positive mammogram. 950 out of 9,900 women without breast cancer will also get a positive mammogram. If 10,000 women in this age group undergo a routine screening, about what fraction of women with positive mammograms will actually have breast cancer?
The correct answer is 7.8%, obtained as follows: Out of 10,000 women, 100 have breast cancer; 80 of those 100 have positive mammograms. From the same 10,000 women, 9,900 will not have breast cancer and of those 9,900 women, 950 will also get positive mammograms. This makes the total number of women with positive mammograms 950 + 80 or 1,030. Of those 1,030 women with positive mammograms, 80 will have cancer. Expressed as a proportion, this is 80/1,030 or 0.07767 or 7.8%.
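As a minimal sketch, the same counting argument can be written out directly. The variable names are illustrative; the numbers are the ones given in the problem.

```python
# Counting version of the mammography problem, using the figures above.
total_women = 10_000
with_cancer = 100
without_cancer = total_women - with_cancer         # 9,900

true_positives = 80     # women with cancer and a positive mammogram
false_positives = 950   # women without cancer and a positive mammogram

all_positives = true_positives + false_positives   # 1,030
posterior = true_positives / all_positives

print(f"{posterior:.4f}")  # 0.0777
```

The division is by the number of positive mammograms, not by the number of women with cancer, which is exactly the step most respondents skip.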
To put it another way, before the mammography, the 10,000 women can be divided into two groups:
- Group 1: 100 women with breast cancer.
- Group 2: 9,900 women without breast cancer.
Summing these two groups gives a total of 10,000 patients, confirming that none have been lost in the math. After the mammography, the women can be divided into four groups:
- Group A: 80 women with breast cancer and a positive mammogram.
- Group B: 20 women with breast cancer and a negative mammogram.
- Group C: 950 women without breast cancer and a positive mammogram.
- Group D: 8,950 women without breast cancer and a negative mammogram.
The sum of groups A and B, the groups with breast cancer, corresponds to group 1; and the sum of groups C and D, the groups without breast cancer, corresponds to group 2. If you administer a mammography to 10,000 patients, then out of the 1,030 with positive mammograms, eighty of those positive-mammogram patients will have cancer. This is the correct answer, the answer a doctor should give a positive-mammogram patient if she asks about the chance she has breast cancer; if thirteen patients ask this question, roughly one out of those thirteen will have cancer.
The most common mistake is to ignore the original fraction of women with breast cancer, and the fraction of women without breast cancer who receive false positives, and focus only on the fraction of women with breast cancer who get positive results. For example, the vast majority of doctors in these studies seem to have thought that if around 80% of women with breast cancer have positive mammograms, then the probability of a woman with a positive mammogram having breast cancer must be around 80%.
Figuring out the final answer always requires all three pieces of information—the percentage of women with breast cancer, the percentage of women without breast cancer who receive false positives, and the percentage of women with breast cancer who receive (correct) positives.
The original proportion of patients with breast cancer is known as the prior probability. The chance that a patient with breast cancer gets a positive mammogram, and the chance that a patient without breast cancer gets a positive mammogram, are known as the two conditional probabilities. Collectively, this initial information is known as the priors. The final answer—the estimated probability that a patient has breast cancer, given that we know she has a positive result on her mammogram—is known as the revised probability or the posterior probability. What we’ve just seen is that the posterior probability depends in part on the prior probability.
To see that the final answer always depends on the original fraction of women with breast cancer, consider an alternate universe in which only one woman out of a million has breast cancer. Even if mammography in this world detects breast cancer in 8 out of 10 cases, while returning a false positive on a woman without breast cancer in only 1 out of 10 cases, there will still be a hundred thousand false positives for every real case of cancer detected. The original probability that a woman has cancer is so extremely low that, although a positive result on the mammogram does increase the estimated probability, the probability isn’t increased to certainty or even “a noticeable chance”; the probability goes from 1:1,000,000 to 1:100,000.
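A quick sketch of the alternate-universe numbers, under the figures stated above (one case per million, 8 in 10 detected, 1 in 10 false positives). The text's "hundred thousand" and "1:100,000" are round figures; the unrounded arithmetic lands at roughly 125,000 false positives per detected case, the same order of magnitude.

```python
# Alternate universe: one woman in a million has breast cancer.
population = 1_000_000
with_cancer = 1
without_cancer = population - with_cancer

hit_rate = 0.8             # cancer detected in 8 out of 10 cases
false_positive_rate = 0.1  # false positive in 1 out of 10 healthy women

expected_true_positives = with_cancer * hit_rate
expected_false_positives = without_cancer * false_positive_rate

# False positives for every real case detected.
false_per_true = expected_false_positives / expected_true_positives

# Probability of cancer given a positive result.
posterior = expected_true_positives / (
    expected_true_positives + expected_false_positives
)

print(round(false_per_true), posterior)
```

The positive result multiplies the odds by 8, yet the posterior remains far below any clinically noticeable chance because the prior was so small.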
What this demonstrates is that the mammogram result doesn’t replace your old information about the patient’s chance of having cancer; the mammogram slides the estimated probability in the direction of the result. A positive result slides the original probability upward; a negative result slides the probability downward. For example, in the original problem where 1% of the women have cancer, 80% of women with cancer get positive mammograms, and 9.6% of women without cancer get positive mammograms, a positive result on the mammogram slides the 1% chance upward to 7.8%.
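The "sliding" can be sketched as a small function that takes the three pieces of information and a test result. The function name `posterior` and its parameter names are illustrative, not taken from the essay.

```python
def posterior(prior, p_pos_given_true, p_pos_given_false, positive=True):
    """Probability of the condition after seeing one test result.

    prior             -- P(condition) before the test
    p_pos_given_true  -- P(positive | condition)
    p_pos_given_false -- P(positive | no condition)
    positive          -- True for a positive result, False for a negative one
    """
    if positive:
        like_true, like_false = p_pos_given_true, p_pos_given_false
    else:
        like_true, like_false = 1 - p_pos_given_true, 1 - p_pos_given_false
    numerator = like_true * prior
    return numerator / (numerator + like_false * (1 - prior))

after_positive = posterior(0.01, 0.80, 0.096)                  # slides upward
after_negative = posterior(0.01, 0.80, 0.096, positive=False)  # slides downward

print(f"{after_positive:.3f} {after_negative:.4f}")  # 0.078 0.0022
```

With the same test, a positive result moves the 1% prior up to about 7.8% and a negative result moves it down to about 0.2%; neither result discards the prior.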
Most people encountering problems of this type for the first time carry out the mental operation of replacing the original 1% probability with the 80% probability that a woman with cancer gets a positive mammogram. It may seem like a good idea, but it just doesn’t work. “The probability that a woman with a positive mammogram has breast cancer” is not at all the same thing as “the probability that a woman with breast cancer has a positive mammogram”; they are as unlike as apples and cheese.
Q. Why did the Bayesian reasoner cross the road?
A. You need more information to answer this question.
Suppose that a barrel contains many small plastic eggs. Some eggs are painted red and some are painted blue. 40% of the eggs in the barrel contain pearls, and 60% contain nothing. 30% of eggs containing pearls are painted blue, and 10% of eggs containing nothing are painted blue. What is the probability that a blue egg contains a pearl? For this example the arithmetic is simple enough that you may be able to do it in your head, and I would suggest trying to do so.
A more compact way of specifying the problem:
P(pearl) = 40%
P(blue|pearl) = 30%
P(blue|¬pearl) = 10%
P(pearl|blue) = ?
The symbol “¬” is shorthand for “not,” so ¬pearl reads “not pearl.”
The notation P(blue|pearl) is shorthand for “the probability of blue given pearl” or “the probability that an egg is painted blue, given that the egg contains a pearl.” The item on the right side is what you already know or the premise, and the item on the left side is the implication or conclusion. If we have P(blue|pearl) = 30%, and we already know that some egg contains a pearl, then we can conclude there is a 30% chance that the egg is painted blue. Thus, the final fact we’re looking for (“the chance that a blue egg contains a pearl” or “the probability that an egg contains a pearl, if we know the egg is painted blue”) reads P(pearl|blue).
40% of the eggs contain pearls, and 60% of the eggs contain nothing. 30% of the eggs containing pearls are painted blue, so 12% of the eggs altogether contain pearls and are painted blue. 10% of the eggs containing nothing are painted blue, so altogether 6% of the eggs contain nothing and are painted blue. A total of 18% of the eggs are painted blue, and a total of 12% of the eggs are painted blue and contain pearls, so the chance a blue egg contains a pearl is 12/18 or 2/3 or around 67%.
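The pearl-egg arithmetic can be reproduced exactly with rational numbers. This is only a sketch of the calculation just described; the variable names are illustrative.

```python
from fractions import Fraction

p_pearl = Fraction(40, 100)
p_blue_given_pearl = Fraction(30, 100)
p_blue_given_empty = Fraction(10, 100)

# Share of all eggs that are blue AND contain a pearl (12%).
p_blue_and_pearl = p_blue_given_pearl * p_pearl
# Share of all eggs that are blue AND contain nothing (6%).
p_blue_and_empty = p_blue_given_empty * (1 - p_pearl)
# Share of all eggs that are blue (18%).
p_blue = p_blue_and_pearl + p_blue_and_empty

p_pearl_given_blue = p_blue_and_pearl / p_blue
print(p_pearl_given_blue)  # 2/3
```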
As before, we can see the necessity of all three pieces of information by considering extreme cases. In a (large) barrel in which only one egg out of a thousand contains a pearl, knowing that an egg is painted blue slides the probability from 0.1% to 0.3% (instead of sliding the probability from 40% to 67%). Similarly, if 999 out of 1,000 eggs contain pearls, knowing that an egg is blue slides the probability from 99.9% to 99.966%; the probability that the egg does not contain a pearl goes from 1/1,000 to around 1/3,000.
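The extreme cases can be checked by holding the two conditional probabilities fixed and varying only the prior. The function name `pearl_given_blue` is illustrative.

```python
from fractions import Fraction

def pearl_given_blue(p_pearl):
    """P(pearl | blue) with P(blue|pearl) = 30% and P(blue|no pearl) = 10%."""
    blue_and_pearl = Fraction(3, 10) * p_pearl
    blue_and_empty = Fraction(1, 10) * (1 - p_pearl)
    return blue_and_pearl / (blue_and_pearl + blue_and_empty)

for prior in (Fraction(1, 1000), Fraction(40, 100), Fraction(999, 1000)):
    post = pearl_given_blue(prior)
    print(f"prior {float(prior):.3%} -> posterior {float(post):.3%}")
```

The evidence is the same blue paint in every case; only the starting point differs, and so does the place the probability ends up.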
On the pearl-egg problem, most respondents unfamiliar with Bayesian reasoning would probably respond that the probability a blue egg contains a pearl is 30%, or perhaps 20% (the 30% chance of a true positive minus the 10% chance of a false positive). Even if this mental operation seems like a good idea at the time, it makes no sense in terms of the question asked. It’s like the experiment in which you ask a second-grader: “If eighteen people get on a bus, and then seven more people get on the bus, how old is the bus driver?” Many second-graders will respond: “Twenty-five.” They understand when they’re being prompted to carry out a particular mental procedure, but they haven’t quite connected the procedure to reality. Similarly, to find the probability that a woman with a positive mammogram has breast cancer, it makes no sense whatsoever to replace the original probability that the woman has cancer with the probability that a woman with breast cancer gets a positive mammogram. Neither can you subtract the probability of a false positive from the probability of the true positive. These operations are as wildly irrelevant as adding the number of people on the bus to find the age of the bus driver.
A study by Gigerenzer and Hoffrage in 1995 showed that some ways of phrasing story problems are much more evocative of correct Bayesian reasoning.[4] The least evocative phrasing used probabilities. A slightly more evocative phrasing used frequencies instead of probabilities; the problem remained the same, but instead of saying that 1% of women had breast cancer, one would say that 1 out of 100 women had breast cancer, that 80 out of 100 women with breast cancer would get a positive mammogram, and so on. Why did a higher proportion of subjects display Bayesian reasoning on this problem? Probably because saying “1 out of 100 women” encourages you to concretely visualize X women with cancer, leading you to visualize X women with cancer and a positive mammogram, etc.
The most effective presentation found so far is what’s known as natural frequencies—saying that 40 out of 100 eggs contain pearls, 12 out of 40 eggs containing pearls are painted blue, and 6 out of 60 eggs containing nothing are painted blue. A natural frequencies presentation is one in which the information about the prior probability is included in presenting the conditional probabilities. If you were just learning about the eggs’ conditional probabilities through natural experimentation, you would—in the course of cracking open a hundred eggs—crack open around 40 eggs containing pearls, of which 12 eggs would be painted blue, while cracking open 60 eggs containing nothing, of which about 6 would be painted blue. In the course of learning the conditional probabilities, you’d see examples of blue eggs containing pearls about twice as often as you saw examples of blue eggs containing nothing.
Unfortunately, while natural frequencies are a step in the right direction, they probably won’t be enough. When problems are presented in natural frequencies, the proportion of people using Bayesian reasoning rises to around half. A big improvement, but not big enough when you’re talking about real doctors and real patients.
Q. How can I find the priors for a problem?
A. Many commonly used priors are listed in the Handbook of Chemistry and Physics.
Q. Where do priors originally come from?
A. Never ask that question.
Q. Uh huh. Then where do scientists get their priors?
A. Priors for scientific problems are established by annual vote of the AAAS. In recent years the vote has become fractious and controversial, with widespread acrimony, factional polarization, and several outright assassinations. This may be a front for infighting within the Bayes Council, or it may be that the disputants have too much spare time. No one is really sure.
Q. I see. And where does everyone else get their priors?
A. They download their priors from Kazaa.
Q. What if the priors I want aren’t available on Kazaa?
A. There’s a small, cluttered antique shop in a back alley of San Francisco’s Chinatown. Don’t ask about the bronze rat.
Actually, priors are true or false just like the final answer—they reflect reality and can be judged by comparing them against reality. For example, if you think that 920 out of 10,000 women in a sample have breast cancer, and the actual number is 100 out of 10,000, then your priors are wrong. For our particular problem, the priors might have been established by three studies—a study on the case histories of women with breast cancer to see how many of them get a positive mammogram, a study on women without breast cancer to see how many of them get a positive mammogram, and an epidemiological study on the prevalence of breast cancer in some specific demographic.
The probability P (A, B) is the same as P (B, A), but P (A|B) is not the same thing as P (B|A), and P (A, B) is completely different from P (A|B). It’s a common confusion to mix up some or all of these quantities.
To get acquainted with all the relationships between them, we’ll play “follow the degrees of freedom.” For example, the two quantities P (cancer) and P (¬cancer) have one degree of freedom between them, because of the general law P (A) + P (¬A) = 1. If you know that P (¬cancer) = 0.99, you can obtain P (cancer) = 1 − P (¬cancer) = 0.01.
The quantities P (positive|cancer) and P (¬positive|cancer) also have only one degree of freedom between them; either a woman with breast cancer gets a positive mammogram or she doesn’t. On the other hand, P (positive|cancer) and P (positive|¬cancer) have two degrees of freedom. You can have a mammography that returns positive for 80% of cancer patients and 9.6% of healthy patients, or that returns positive for 70% of cancer patients and 2% of healthy patients, or even a health test that returns “positive” for 30% of cancer patients and 92% of healthy patients. The two quantities, the output of the mammography for cancer patients and the output of the mammography for healthy patients, are in mathematical terms independent; one cannot be obtained from the other in any way, and so they have two degrees of freedom between them.
What about P(positive, cancer), P(positive|cancer), and P(cancer)? Here we have three quantities; how many degrees of freedom are there? In this case the equation that must hold is
P(positive, cancer) = P(positive|cancer) × P(cancer).
This equality reduces the degrees of freedom by one. If we know the fraction of patients with cancer, and the chance that a cancer patient has a positive mammogram, we can deduce the fraction of patients who have breast cancer and a positive mammogram by multiplying.
Similarly, if we know the number of patients with breast cancer and positive mammograms, and also the number of patients with breast cancer, we can estimate the chance that a woman with breast cancer gets a positive mammogram by dividing: P (positive|cancer) = P (positive, cancer)/P (cancer). In fact, this is exactly how such medical diagnostic tests are calibrated; you do a study on 8,520 women with breast cancer and see that there are 6,816 (or thereabouts) women with breast cancer and positive mammograms, then divide 6,816 by 8,520 to find that 80% of women with breast cancer had positive mammograms. (Incidentally, if you accidentally divide 8,520 by 6,816 instead of the other way around, your calculations will start doing strange things, such as insisting that 125% of women with breast cancer and positive mammograms have breast cancer. This is a common mistake in carrying out Bayesian arithmetic, in my experience.) And finally, if you know P (positive, cancer) and P (positive|cancer), you can deduce how many cancer patients there must have been originally. There are two degrees of freedom shared out among the three quantities; if we know any two, we can deduce the third.
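If you want to watch the three quantities trade places, here is a minimal Python sketch. The function names are invented for this illustration; the numbers are the calibration study quoted above and the 1% prior used throughout the essay.

```python
def joint_from_conditional(p_pos_given_cancer, p_cancer):
    """P(positive, cancer) = P(positive|cancer) * P(cancer)."""
    return p_pos_given_cancer * p_cancer


def conditional_from_joint(p_pos_and_cancer, p_cancer):
    """P(positive|cancer) = P(positive, cancer) / P(cancer)."""
    return p_pos_and_cancer / p_cancer


def prior_from_joint(p_pos_and_cancer, p_pos_given_cancer):
    """P(cancer) = P(positive, cancer) / P(positive|cancer)."""
    return p_pos_and_cancer / p_pos_given_cancer


# Calibration study from the text: 6,816 positives among 8,520 cancer patients.
hit_rate = 6816 / 8520      # the 80% hit rate
# Dividing the wrong way round produces the impossible "125%".
wrong_way = 8520 / 6816

p_cancer = 0.01
p_joint = joint_from_conditional(hit_rate, p_cancer)

print(f"hit rate            = {hit_rate:.0%}")
print(f"wrong-way division  = {wrong_way:.0%}")
print(f"P(positive, cancer) = {p_joint:.3%}")
```

Any two of the three values recover the third, which is all that "two degrees of freedom" means here.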
How about P(positive), P(positive, cancer), and P(positive, ¬cancer)? Again there are only two degrees of freedom among these three variables. The equation occupying the extra degree of freedom is
P(positive) = P(positive, cancer) + P(positive, ¬cancer).
This is how P (positive) is computed to begin with; we figure out the number of women with breast cancer who have positive mammograms, and the number of women without breast cancer who have positive mammograms, then add them together to get the total number of women with positive mammograms. It would be very strange to go out and conduct a study to determine the number of women with positive mammograms— just that one number and nothing else—but in theory you could do so. And if you then conducted another study and found the number of those women who had positive mammograms and breast cancer, you would also know the number of women with positive mammograms and no breast cancer—either a woman with a positive mammogram has breast cancer or she doesn’t. In general, P (A, B) + P (A, ¬B) = P (A). Symmetrically, P (A, B) + P (¬A, B) = P (B).
What about P (positive, cancer), P (positive, ¬cancer), P (¬positive, cancer), and P (¬positive, ¬cancer)? You might at first be tempted to think that there are only two degrees of freedom for these four quantities—that you can, for example, get P (positive, ¬cancer) by multiplying P (positive) × P(¬cancer), and thus that all four quantities can be found given only the two quantities P(positive) and P(cancer). This is not the case! P (positive, ¬cancer) = P (positive) × P (¬cancer) only if the two probabilities are statistically independent—if the chance that a woman has breast cancer has no bearing on whether she has a positive mammogram. This amounts to requiring that the two conditional probabilities be equal to each other—a requirement which would eliminate one degree of freedom. If you remember that these four quantities are the groups A, B, C, and D, you can look over those four groups and realize that, in theory, you can put any number of people into the four groups. If you start with a group of 80 women with breast cancer and positive mammograms, there’s no reason why you can’t add another group of 500 women with breast cancer and negative mammograms, followed by a group of 3 women without breast cancer and negative mammograms, and so on. So now it seems like the four quantities have four degrees of freedom. And they would, except that in expressing them as probabilities, we need to normalize them to fractions of the complete group, which adds the constraint that P (positive, cancer) + P (positive, ¬cancer) + P (¬positive, cancer) + P (¬positive, ¬cancer) = 1. This equation takes up one degree of freedom, leaving three degrees of freedom among the four quantities. If you specify the fractions of women in groups A, B, and D, you can deduce the fraction of women in group C.
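The independence trap is easy to see with numbers. The sketch below compares the true P(positive, ¬cancer) with the tempting product P(positive) × P(¬cancer), first for the four groups of the 10,000-woman example (A = 80, B = 20, C = 950, D = 8,950), then for a made-up table in which the test ignores cancer entirely. The helper name and the second table are assumptions introduced for illustration only.

```python
def joint_vs_product(a, b, c, d):
    """Return (actual, naive) values of P(positive, not-cancer) for a 2x2 table.

    a: cancer & positive      b: cancer & negative
    c: no cancer & positive   d: no cancer & negative
    """
    total = a + b + c + d
    p_positive = (a + c) / total
    p_no_cancer = (c + d) / total
    actual = c / total                 # read straight off the table
    naive = p_positive * p_no_cancer   # valid only under independence
    return actual, naive


# The mammography groups from the essay.
actual, naive = joint_vs_product(80, 20, 950, 8950)

# Hypothetical test that returns positive for 10% of everyone,
# cancer or not, so the two conditional probabilities are equal.
ind_actual, ind_naive = joint_vs_product(10, 90, 990, 8910)

print(f"mammography: actual {actual:.5f} vs product {naive:.5f}")
print(f"useless test: actual {ind_actual:.5f} vs product {ind_naive:.5f}")
```

Only in the second table, where the test tells you nothing, does the shortcut give the right answer.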
Given the four groups A, B, C, and D, it is very straightforward to compute everything else:
P(cancer) = (A + B) / (A + B + C + D)
P(¬positive|cancer) = B / (A + B)
and so on. Since {A, B, C, D} contains three degrees of freedom, it follows that the entire set of probabilities relating cancer rates to test results contains only three degrees of freedom. Remember that in our problems we always needed three pieces of information—the prior probability and the two conditional probabilities—which, indeed, have three degrees of freedom among them. Actually, for Bayesian problems, any three quantities with three degrees of freedom between them should logically specify the entire problem.
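Here is one way to sketch that bookkeeping in Python, using the group sizes from the 10,000-woman example. The variable and function names are illustrative, not part of the original problem; the second half goes the other way and rebuilds the four group fractions from the prior and the two conditional probabilities.

```python
# Counts for the four groups in the 10,000-woman example.
A = 80      # cancer, positive mammogram
B = 20      # cancer, negative mammogram
C = 950     # no cancer, positive mammogram
D = 8950    # no cancer, negative mammogram
total = A + B + C + D

# Everything else follows from the four counts.
p_cancer = (A + B) / total
p_positive = (A + C) / total
p_neg_given_cancer = B / (A + B)
p_pos_given_cancer = A / (A + B)
p_pos_given_no_cancer = C / (C + D)
p_cancer_given_pos = A / (A + C)


def groups_from_three(prior, hit_rate, false_pos_rate):
    """Rebuild the four group fractions from three independent numbers."""
    a = prior * hit_rate
    b = prior * (1 - hit_rate)
    c = (1 - prior) * false_pos_rate
    d = (1 - prior) * (1 - false_pos_rate)
    return a, b, c, d


rebuilt = groups_from_three(p_cancer, p_pos_given_cancer, p_pos_given_no_cancer)
print(f"P(cancer|positive) = {p_cancer_given_pos:.1%}")
print("rebuilt fractions:", [round(x, 4) for x in rebuilt])
```

The round trip works because both descriptions carry exactly three degrees of freedom.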
The probability that a test gives a true positive divided by the probability that a test gives a false positive is known as the likelihood ratio of that test. The likelihood ratio for a positive result summarizes how much a positive result will slide the prior probability. Does the likelihood ratio of a medical test then sum up everything there is to know about the usefulness of the test?
No, it does not! The likelihood ratio sums up everything there is to know about the meaning of a positive result on the medical test, but the meaning of a negative result on the test is not specified, nor is the frequency with which the test is useful. For example, a mammography with a hit rate of 80% for patients with breast cancer and a false positive rate of 9.6% for healthy patients has the same likelihood ratio as a test with an 8% hit rate and a false positive rate of 0.96%. Although these two tests have the same likelihood ratio, the first test is more useful in every way—it detects disease more often, and a negative result is stronger evidence of health.
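To put numbers on that comparison, the sketch below computes the likelihood ratio of a positive result and, as an addition not named in the text, the corresponding ratio for a negative result, P(negative|cancer)/P(negative|¬cancer). It then applies each to the 1% prior in odds form. All function names are invented for this example.

```python
def likelihood_ratios(hit_rate, false_positive_rate):
    """Return (ratio for a positive result, ratio for a negative result)."""
    positive = hit_rate / false_positive_rate
    negative = (1 - hit_rate) / (1 - false_positive_rate)
    return positive, negative


def posterior(prior, ratio):
    """Update a prior probability by a likelihood ratio, working in odds."""
    odds = prior / (1 - prior) * ratio
    return odds / (1 + odds)


PRIOR = 0.01
strong_pos, strong_neg = likelihood_ratios(0.80, 0.096)   # the mammography
weak_pos, weak_neg = likelihood_ratios(0.08, 0.0096)      # same ratio, 8% hit rate

for name, pos, neg in [("80% / 9.6%", strong_pos, strong_neg),
                       ("8% / 0.96%", weak_pos, weak_neg)]:
    print(f"{name}: positive ratio {pos:.2f}, "
          f"P(cancer|positive) {posterior(PRIOR, pos):.2%}, "
          f"P(cancer|negative) {posterior(PRIOR, neg):.2%}")
```

Both tests move a positive patient to the same place, but a negative result on the first test cuts the probability to roughly a fifth of the prior, while a negative on the second barely moves it.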
Suppose that you apply two tests for breast cancer in succession—say, a standard mammogram and also some other test which is independent of mammography. Since I don’t know of any such test that is independent of mammography, I’ll invent one for the purpose of this problem, and call it the Tams-Braylor Division Test, which checks to see if any cells are dividing more rapidly than other cells. We’ll suppose that the Tams-Braylor gives a true positive for 90% of patients with breast cancer, and gives a false positive for 5% of patients without cancer. Let’s say the prior prevalence of breast cancer is 1%. If a patient gets a positive result on her mammogram and her Tams-Braylor, what is the revised probability she has breast cancer?
One way to solve this problem would be to take the revised probability for a positive mammogram, which we already calculated as 7.8%, and plug that into the Tams-Braylor test as the new prior probability. If we do this, we find that the result comes out to 60%.
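A minimal sketch of that chained calculation, assuming the two tests really are independent as the problem stipulates. The `update` helper is a name made up here; it is just the usual revision step applied twice, with the first posterior fed in as the second prior.

```python
def update(prior, p_pos_given_cancer, p_pos_given_healthy):
    """Revised probability of cancer after one positive test result."""
    numerator = p_pos_given_cancer * prior
    return numerator / (numerator + p_pos_given_healthy * (1 - prior))


# Step 1: positive mammogram (80% hit rate, 9.6% false positive rate).
after_mammogram = update(0.01, 0.80, 0.096)

# Step 2: positive Tams-Braylor (90% / 5%), using step 1 as the new prior.
after_both = update(after_mammogram, 0.90, 0.05)

print(f"after mammogram:    {after_mammogram:.1%}")
print(f"after Tams-Braylor: {after_both:.1%}")
```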
Suppose that the prior prevalence of breast cancer in a demographic is 1%. Suppose that we, as doctors, have a repertoire of three independent tests for breast cancer. Our first test, test A, a mammography, has a likelihood ratio of 80%/9.6% = 8.33. The second test, test B, has a likelihood ratio of 18.0 (for example, from 90% versus 5%); and the third test, test C, has a likelihood ratio of 3.5 (which could be from 70% versus 20%, or from 35% versus 10%; it makes no difference). Suppose a patient gets a positive result on all three tests. What is the probability the patient has breast cancer?
Here’s a fun trick for simplifying the bookkeeping. If the prior prevalence of breast cancer in a demographic is 1%, then 1 out of 100 women has breast cancer, and 99 out of 100 women do not have breast cancer. So if we rewrite the probability of 1% as odds, the odds are 1:99.
And the likelihood ratios of the three tests A, B, and C are:
8.33 : 1 = 25 : 3
18.0 : 1 = 18 : 1
3.5 : 1 = 7 : 2 .
The odds for women with breast cancer who score positive on all three tests, versus women without breast cancer who score positive on all three tests, will equal:
1 × 25 × 18 × 7 : 99 × 3 × 1 × 2 = 3150 : 594.
To recover the probability from the odds, we just write:
3150/(3150 + 594) ≈ 84%.
This always works regardless of how the odds ratios are written; i.e., 8.33:1 is just the same as 25:3 or 75:9. It doesn’t matter in what order the tests are administered, or in what order the results are computed. The proof is left as an exercise for the reader.
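It is not a proof, but a quick numerical check of the claim can be sketched with exact integer arithmetic. The helper names below are assumptions for illustration; `math.prod` needs Python 3.8 or later.

```python
from fractions import Fraction
from math import prod  # available from Python 3.8

prior_odds = (1, 99)
likelihood_ratios = [(25, 3), (18, 1), (7, 2)]   # tests A, B, C


def combine(prior, ratios):
    """Multiply the 'for' sides together and the 'against' sides together."""
    for_side = prior[0] * prod(r[0] for r in ratios)
    against_side = prior[1] * prod(r[1] for r in ratios)
    return for_side, against_side


def odds_to_probability(for_side, against_side):
    """Convert odds for:against into an exact probability."""
    return Fraction(for_side, for_side + against_side)


posterior_odds = combine(prior_odds, likelihood_ratios)
posterior = odds_to_probability(*posterior_odds)

# Same tests in reverse order, and test A written as 75:9 instead of 25:3.
reordered = combine(prior_odds, list(reversed(likelihood_ratios)))
rescaled = odds_to_probability(*combine(prior_odds, [(75, 9), (18, 1), (7, 2)]))

print(posterior_odds, float(posterior))
```

Order cannot matter because multiplication commutes, and rescaling cannot matter because a common factor on both sides of the odds cancels when you convert back to a probability.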
E. T. Jaynes, in “Probability Theory, with Applications in Science and Engineering,” suggests that credibility and evidence should be measured in decibels.\[5\]
Decibels?
Decibels are used for measuring exponential differences of intensity. For example, if the sound from an automobile horn carries 10,000 times as much energy (per square meter per second) as the sound from an alarm clock, the automobile horn would be 40 decibels louder. The sound of a bird singing might carry 1,000 times less energy than an alarm clock, and hence would be 30 decibels softer. To get the number of decibels, you take the logarithm base 10 and multiply by 10:
decibels = 10 log_10(intensity)
intensity = 10^(decibels/10).
Suppose we start with a prior probability of 1% that a woman has breast cancer, corresponding to odds of 1:99. And then we administer three tests of likelihood ratios 25:3, 18:1, and 7:2. You could multiply those numbers ... or you could just add their logarithms:
10 log_10(1/99) ≈ −20
10 log_10(25/3) ≈ 9
10 log_10(18/1) ≈ 13
10 log_10(7/2) ≈ 5.
It starts out as fairly unlikely that a woman has breast cancer—our credibility level is at −20 decibels. Then three test results come in, corresponding to 9, 13, and 5 decibels of evidence. This raises the credibility level by a total of 27 decibels, meaning that the prior credibility of −20 decibels goes to a posterior credibility of 7 decibels. So the odds go from 1:99 to 5:1, and the probability goes from 1% to around 83%.
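The same bookkeeping in a short Python sketch, with helper names made up for this example. It also shows where the 83% comes from: rounding each term to whole decibels gives 7 decibels and odds of about 5:1, while carrying full precision reproduces the 3150:594 odds and the 84% found by direct multiplication.

```python
import math


def to_decibels(ratio):
    """Decibels of evidence carried by an odds or likelihood ratio."""
    return 10 * math.log10(ratio)


def from_decibels(db):
    """Inverse of to_decibels: recover the ratio."""
    return 10 ** (db / 10)


def odds_to_probability(odds):
    return odds / (1 + odds)


prior_db = to_decibels(1 / 99)
evidence_db = [to_decibels(r) for r in (25 / 3, 18 / 1, 7 / 2)]

# Full precision: adding logarithms is the same as multiplying ratios.
posterior_db = prior_db + sum(evidence_db)
posterior_odds = from_decibels(posterior_db)
posterior_prob = odds_to_probability(posterior_odds)

# Whole-decibel version, as in the text: -20 + 9 + 13 + 5 = 7.
rounded_db = round(prior_db) + sum(round(x) for x in evidence_db)
rounded_prob = odds_to_probability(from_decibels(rounded_db))

print(f"exact:   {posterior_db:.2f} dB -> {posterior_prob:.1%}")
print(f"rounded: {rounded_db} dB -> {rounded_prob:.1%}")
```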
You are a mechanic for gizmos. When a gizmo stops working, it is due to a blocked hose 30% of the time. If a gizmo’s hose is blocked, there is a 45% probability that prodding the gizmo will produce sparks. If a gizmo’s hose is unblocked, there is only a 5% chance that prodding the gizmo will produce sparks. A customer brings you a malfunctioning gizmo. You prod the gizmo and find that it produces sparks. What is the probability that a spark-producing gizmo has a blocked hose?
What is the sequence of arithmetical operations that you performed to solve this problem?
(45% × 30%)/(45% × 30% + 5% × 70%)
Similarly, to find the chance that a woman with a positive mammogram has breast cancer, we computed:
\[P(positive|cancer) × P(cancer)\] / \[P(positive|cancer) × P(cancer) + P(positive|¬cancer) × P(¬cancer)\]
which is
P(positive, cancer) / \[P(positive, cancer) + P(positive, ¬cancer)\]
which is
P(positive, cancer) / P(positive)
which is
P(cancer|positive).
The fully general form of this calculation is known as Bayes’s Theorem or Bayes’s Rule.
Bayes’s Theorem:
P(A|X) = \[P(X|A) × P(A)\] / \[P(X|A) × P(A) + P(X|¬A) × P(¬A) \]
When there is some phenomenon A that we want to investigate, and an observation X that is evidence about A—for example, in the previous example, A is breast cancer and X is a positive mammogram—Bayes’s Theorem tells us how we should update our probability of A, given the new evidence X.
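As a sketch, the theorem fits in one small Python function. The name `bayes` and its parameter names are choices made for this illustration; the intermediate variables mirror the chain of equalities above, from the two joint probabilities to P(X) to P(A|X).

```python
def bayes(prior, p_x_given_a, p_x_given_not_a):
    """Return P(A|X) given P(A), P(X|A) and P(X|not-A)."""
    p_x_and_a = p_x_given_a * prior                  # P(X, A)
    p_x_and_not_a = p_x_given_not_a * (1 - prior)    # P(X, not-A)
    p_x = p_x_and_a + p_x_and_not_a                  # P(X)
    return p_x_and_a / p_x                           # P(A|X)


# The gizmo problem: A = blocked hose, X = sparks.
gizmo = bayes(prior=0.30, p_x_given_a=0.45, p_x_given_not_a=0.05)

# The mammography problem: A = breast cancer, X = positive mammogram.
mammogram = bayes(prior=0.01, p_x_given_a=0.80, p_x_given_not_a=0.096)

print(f"P(blocked hose|sparks) = {gizmo:.1%}")
print(f"P(cancer|positive)     = {mammogram:.1%}")
```

The gizmo answer works out to 0.135/0.17, a little over 79%.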
By this point, Bayes’s Theorem may seem blatantly obvious or even tautological, rather than exciting and new. If so, this introduction has entirely succeeded in its purpose.
Bayes’s Theorem describes what makes something “evidence” and how much evidence it is. Statistical models are judged by comparison to the Bayesian method because, in statistics, the Bayesian method is as good as it gets—the Bayesian method defines the maximum amount of mileage you can get out of a given piece of evidence, in the same way that thermodynamics defines the maximum amount of work you can get out of a temperature differential. This is why you hear cognitive scientists talking about Bayesian reasoners. In cognitive science, Bayesian reasoner is the technically precise code word that we use to mean rational mind.
There are also a number of general heuristics about human reasoning that you can learn from looking at Bayes’s Theorem.
For example, in many discussions of Bayes’s Theorem, you may hear cognitive psychologists saying that people do not take prior frequencies sufficiently into account, meaning that when people approach a problem where there’s some evidence X indicating that condition A might hold true, they tend to judge A’s likelihood solely by how well the evidence X seems to match A, without taking into account the prior frequency of A. If you think, for example, that under the mammography example, the woman’s chance of having breast cancer is in the range of 70%–80%, then this kind of reasoning is insensitive to the prior frequency given in the problem; it doesn’t notice whether 1% of women or 10% of women start out having breast cancer. “Pay more attention to the prior frequency!” is one of the many things that humans need to bear in mind to partially compensate for our built-in inadequacies.
A related error is to pay too much attention to P (X |A) and not enough to P(X|¬A) when determining how much evidence X is for A. The degree to which a result X is evidence for A depends not only on the strength of the statement we’d expect to see result X if A were true, but also on the strength of the statement we wouldn’t expect to see result X if A weren’t true. For example, if it is raining, this very strongly implies the grass is wet—P (wetgrass|rain) ≈ 1— but seeing that the grass is wet doesn’t necessarily mean that it has just rained; perhaps the sprinkler was turned on, or you’re looking at the early morning dew. Since P (wetgrass|¬rain) is substantially greater than zero, P (rain|wetgrass) is substantially less than one. On the other hand, if the grass was never wet when it wasn’t raining, then knowing that the grass was wet would always show that it was raining, P (rain|wetgrass) ≈ 1, even if P (wetgrass|rain) = 50%; that is, even if the grass only got wet 50% of the times it rained. Evidence is always the result of the differential between the two conditional probabilities. Strong evidence is not the product of a very high probability that A leads to X, but the product of a very low probability that not-A could have led to X.
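The wet-grass example can be made concrete, though only by inventing numbers the text does not give. In the sketch below, the 20% prior for rain and the 30% chance of wet grass without rain are pure assumptions chosen for illustration; the two P(wetgrass|rain) values are the ones discussed above.

```python
def posterior(prior, p_x_given_a, p_x_given_not_a):
    """P(A|X) from the prior and the two conditional probabilities."""
    numerator = p_x_given_a * prior
    return numerator / (numerator + p_x_given_not_a * (1 - prior))


P_RAIN = 0.20   # assumed prior, not from the essay

# World 1: rain always wets the grass, but sprinklers and dew exist.
# The 0.30 is a hypothetical value for P(wetgrass|not-rain).
with_sprinklers = posterior(P_RAIN, p_x_given_a=1.0, p_x_given_not_a=0.30)

# World 2: grass gets wet in only half of all rains, but never otherwise.
never_wet_otherwise = posterior(P_RAIN, p_x_given_a=0.5, p_x_given_not_a=0.0)

print(f"P(rain|wetgrass), sprinklers exist:   {with_sprinklers:.0%}")
print(f"P(rain|wetgrass), no other wet cause: {never_wet_otherwise:.0%}")
```

Under these assumed numbers the "certain" implication from rain to wet grass yields a posterior below one half, while the 50% implication yields certainty, because only the second world drives P(wetgrass|¬rain) to zero.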
The Bayesian revolution in the sciences is fueled, not only by more and more cognitive scientists suddenly noticing that mental phenomena have Bayesian structure in them; not only by scientists in every field learning to judge their statistical methods by comparison with the Bayesian method; but also by the idea that science itself is a special case of Bayes’s Theorem; experimental evidence is Bayesian evidence. The Bayesian revolutionaries hold that when you perform an experiment and get evidence that “confirms” or “disconfirms” your theory, this confirmation and disconfirmation is governed by the Bayesian rules. For example, you have to take into account not only whether your theory predicts the phenomenon, but whether other possible explanations also predict the phenomenon.
Previously, the most popular philosophy of science was probably Karl Popper’s falsificationism—this is the old philosophy that the Bayesian revolution is currently dethroning. Karl Popper’s idea that theories can be definitely falsified, but never definitely confirmed, is yet another special case of the Bayesian rules; if P(X|A) ≈ 1—if the theory makes a definite prediction—then observing ¬X very strongly falsifies A. On the other hand, if P(X|A) ≈ 1, and we observe X, this doesn’t definitely confirm the theory; there might be some other condition B such that P (X|B) ≈ 1, in which case observing X doesn’t favor A over B. For observing X to definitely confirm A, we would have to know, not that P(X|A) ≈ 1, but that P(X|¬A) ≈ 0, which is something that we can’t know because we can’t range over all possible alternative explanations. For example, when Einstein’s theory of General Relativity toppled Newton’s incredibly well-confirmed theory of gravity, it turned out that all of Newton’s predictions were just a special case of Einstein’s predictions.
You can even formalize Popper’s philosophy mathematically. The likelihood ratio for X, the quantity P(X|A)/P(X|¬A), determines how much observing X slides the probability for A; the likelihood ratio is what says how strong X is as evidence. Well, in your theory A, you can predict X with probability 1, if you like; but you can’t control the denominator of the likelihood ratio, P(X|¬A)—there will always be some alternative theories that also predict X, and while we go with the simplest theory that fits the current evidence, you may someday encounter some evidence that an alternative theory predicts but your theory does not. That’s the hidden gotcha that toppled Newton’s theory of gravity. So there’s a limit on how much mileage you can get from successful predictions; there’s a limit on how high the likelihood ratio goes for confirmatory evidence.
On the other hand, if you encounter some piece of evidence Y that is definitely not predicted by your theory, this is enormously strong evidence against your theory. If P (Y |A) is infinitesimal, then the likelihood ratio will also be infinitesimal. For example, if P (Y |A) is 0.0001%, and P (Y |¬A) is 1%, then the likelihood ratio P (Y |A)/P (Y |¬A) will be 1:10,000. That’s −40 decibels of evidence! Or, flipping the likelihood ratio, if P (Y |A) is very small, then P (Y |¬A)/P (Y |A) will be very large, meaning that observing Y greatly favors ¬A over A. Falsification is much stronger than confirmation. This is a consequence of the earlier point that very strong evidence is not the product of a very high probability that A leads to X, but the product of a very low probability that not-A could have led to X. This is the precise Bayesian rule that underlies the heuristic value of Popper’s falsificationism.
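The asymmetry shows up clearly in decibels. In this sketch the falsifying case uses the numbers from the paragraph above; the confirming cases assume, purely for illustration, that rival theories predict the same observation 50% or 10% of the time. The function name is likewise invented here.

```python
import math


def evidence_db(p_given_a, p_given_not_a):
    """Decibels of evidence for A from an observation, via its likelihood ratio."""
    return 10 * math.log10(p_given_a / p_given_not_a)


# Falsification: A gives Y a probability of 0.0001%, the alternatives give 1%.
falsify = evidence_db(0.000001, 0.01)

# Confirmation: A predicts X with certainty, but so do many rivals.
# The 0.5 and 0.1 values for P(X|not-A) are hypothetical.
confirm_weak = evidence_db(1.0, 0.5)
confirm_better = evidence_db(1.0, 0.1)

print(f"falsifying observation: {falsify:+.1f} dB")
print(f"confirming observation: {confirm_weak:+.1f} dB or {confirm_better:+.1f} dB")
```

A perfect prediction can never contribute more than the denominator allows, whereas a single badly failed prediction costs 40 decibels at a stroke.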
Similarly, Popper’s dictum that an idea must be falsifiable can be interpreted as a manifestation of the Bayesian conservation-of-probability rule; if a result X is positive evidence for the theory, then the result ¬X would have disconfirmed the theory to some extent. If you try to interpret both X and ¬X as “confirming” the theory, the Bayesian rules say this is impossible! To increase the probability of a theory you must expose it to tests that can potentially decrease its probability; this is not just a rule for detecting would-be cheaters in the social process of science, but a consequence of Bayesian probability theory. On the other hand, Popper’s idea that there is only falsification and no such thing as confirmation turns out to be incorrect. Bayes’s Theorem shows that falsification is very strong evidence compared to confirmation, but falsification is still probabilistic in nature; it is not governed by fundamentally different rules from confirmation, as Popper argued.
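One way to see the conservation rule at work is to check that the posterior, averaged over the two possible test outcomes and weighted by how likely each outcome is, comes back to the prior. The sketch below does this for the mammography numbers; the helper name is an assumption for illustration.

```python
def posterior(prior, p_e_given_a, p_e_given_not_a):
    """P(A|E) from the prior and the two conditional probabilities."""
    numerator = p_e_given_a * prior
    return numerator / (numerator + p_e_given_not_a * (1 - prior))


prior = 0.01
hit_rate, false_positive_rate = 0.80, 0.096

# Probability of seeing a positive result at all.
p_positive = hit_rate * prior + false_positive_rate * (1 - prior)

after_positive = posterior(prior, hit_rate, false_positive_rate)
after_negative = posterior(prior, 1 - hit_rate, 1 - false_positive_rate)

# Expected posterior before the test is run.
expected = after_positive * p_positive + after_negative * (1 - p_positive)

print(f"after positive: {after_positive:.2%}, after negative: {after_negative:.2%}")
print(f"expected posterior: {expected:.2%} (prior was {prior:.2%})")
```

Since the weighted average must equal the prior, a result that can raise the probability has to be paired with one that lowers it; both outcomes cannot count as confirmation.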
So we find that many phenomena in the cognitive sciences, plus the statistical methods used by scientists, plus the scientific method itself, are all turning out to be special cases of Bayes’s Theorem. Hence the Bayesian revolution.
Having introduced Bayes’s Theorem explicitly, we can explicitly discuss its components.
P(A|X) = \[P(X|A) × P(A)\] / \[ P(X|A) × P(A) + P(X|¬A) × P(¬A) \]
We’ll start with P(A|X). If you ever find yourself getting confused about what’s A and what’s X in Bayes’s Theorem, start with P(A|X) on the left side of the equation; that’s the simplest part to interpret. In P(A|X), A is the thing we want to know about. X is how we’re observing it; X is the evidence we’re using to make inferences about A. Remember that for every expression P(Q|P), we want to know about the probability for Q given P, the degree to which P implies Q—a more sensible notation, which it is now too late to adopt, would be P (Q ← P ).
P (Q|P ) is closely related to P (Q, P ), but they are not identical. Expressed as a probability or a fraction, P (Q, P ) is the proportion of things that have property Q and property P among all things; e.g., the proportion of “women with breast cancer and a positive mammogram” within the group of all women. If the total number of women is 10,000, and 80 women have breast cancer and a positive mammogram, then P (Q, P ) is 80/10,000 = 0.8%. You might say that the absolute quantity, 80, is being normalized to a probability relative to the group of all women. Or to make it clearer, suppose that there’s a group of 641 women with breast cancer and a positive mammogram within a total sample group of 89,031 women. Six hundred and forty-one is the absolute quantity. If you pick out a random woman from the entire sample, then the probability you’ll pick a woman with breast cancer and a positive mammogram is P (Q, P ), or 0.72% (in this example).
On the other hand, P (Q|P ) is the proportion of things that have property Q and property P among all things that have P ; e.g., the proportion of women with breast cancer and a positive mammogram within the group of all women with positive mammograms. If there are 641 women with breast cancer and positive mammograms, 7,915 women with positive mammograms, and 89,031 women, then P (Q, P ) is the probability of getting one of those 641 women if you’re picking at random from the entire group of 89,031, while P (Q|P ) is the probability of getting one of those 641 women if you’re picking at random from the smaller group of 7,915.
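The two proportions differ only in which group you divide by, as a quick sketch with the counts above shows. The variable names are illustrative.

```python
both = 641          # breast cancer AND positive mammogram
positives = 7915    # all women with positive mammograms
everyone = 89031    # the entire sample

p_joint = both / everyone          # P(Q, P): pick from the whole sample
p_conditional = both / positives   # P(Q|P): pick from the positives only
p_given = positives / everyone     # P(P)

print(f"P(Q, P) = {p_joint:.2%}")
print(f"P(Q|P)  = {p_conditional:.1%}")
# Dividing the joint probability by P(P) gives the conditional probability.
print(f"P(Q, P) / P(P) = {p_joint / p_given:.1%}")
```

Same 641 women in the numerator both times; the conditional probability is about eleven times larger only because the denominator shrank from 89,031 to 7,915.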
In a sense, P(Q|P) really means P(Q, P|P), but specifying the extra P all the time would be redundant. You already know it has property P, so the property you’re investigating is Q, even though you’re looking at the size of group (Q, P) within group P, not the size of group Q within group P (which would be nonsense). This is what it means to take the property on the right-hand side as given; it means you know you’re working only within the group of things that have property P. When you constrict your focus of attention to see only this smaller group, many other probabilities change. If you’re taking P as given, then P(Q, P) equals just P(Q), at least relative to the group P. The old P(Q), the frequency of “things that have property Q within the entire sample,” is revised to the new frequency of “things that have property Q within the subsample of things that have property P.” If P is given, if P is our entire world, then looking for (Q, P) is the same as looking for just Q.
If you constrict your focus of attention to only the population of eggs that are painted blue, then suddenly “the probability that an egg contains a pearl” becomes a different number; this proportion is different for the population of blue eggs than for the population of all eggs. The given, the property that constricts our focus of attention, is always on the right side of P(Q|P); the P becomes our world, the entire thing we see, and so the P on the right side of the “given” bar always has probability 1. That is what it means to take P as given. So P(Q|P) means “If P has probability 1, what is the probability of Q?” or “If we constrict our attention to only things or events where P is true, what is the probability of Q?” The statement Q, on the other side of the given, is not certain; its probability may be 10% or 90% or any other number. So when you use Bayes’s Theorem, and you write the part on the left side as P(A|X) (how to update the probability of A after seeing X, the new probability of A given that we know X, the degree to which X implies A), you can tell that X is always the observation or the evidence, and A is the property being investigated, the thing you want to know about.
The right side of Bayes’s Theorem is derived from the left side through these steps:
P(A|X) = P(A|X)
P(A|X) = P(X, A) / P(X)
P(A|X) = P(X, A) / \[P(X, A) + P(X, ¬A)\]
P(A|X) = \[P(X|A) × P(A)\] / \[P(X|A) × P(A) + P(X|¬A) × P(¬A)\].
Once the derivation is finished, all the implications on the right side of the equation are of the form P(X|A) or P(X|¬A), while the implication on the left side is P(A|X). The symmetry arises because the elementary causal relations are generally implications from facts to observations, e.g., from breast cancer to positive mammogram. The elementary steps in reasoning are generally implications from observations to facts, e.g., from a positive mammogram to breast cancer. The left side of Bayes’s Theorem is an elementary inferential step from the observation of positive mammogram to the conclusion of an increased probability of breast cancer. Implication is written right-to-left, so we write P (cancer|positive) on the left side of the equation. The right side of Bayes’s Theorem describes the elementary causal steps—for example, from breast cancer to a positive mammogram—and so the implications on the right side of Bayes’s Theorem take the form P (positive|cancer) or P (positive|¬cancer).
And that’s Bayes’s Theorem. Rational inference on the left end, physical causality on the right end; an equation with mind on one side and reality on the other. Remember how the scientific method turned out to be a special case of Bayes’s Theorem? If you wanted to put it poetically, you could say that Bayes’s Theorem binds reasoning into the physical universe.
Okay, we’re done.
Reverend Bayes says:

You are now an initiate of the Bayesian Conspiracy.
1\. Ward Casscells, Arno Schoenberger, and Thomas Graboys, “Interpretation by Physicians of Clinical Laboratory Results,” New England Journal of Medicine 299 (1978): 999–1001.
2\. David M. Eddy, “Probabilistic Reasoning in Clinical Medicine: Problems and Opportunities,” in Judgment under Uncertainty: Heuristics and Biases, ed. Daniel Kahneman, Paul Slovic, and Amos Tversky (Cambridge University Press, 1982).
3\. Gerd Gigerenzer and Ulrich Hoffrage, “How to Improve Bayesian Reasoning without Instruction: Frequency Formats,” Psychological Review 102 (1995): 684–704.
4\. Ibid.
5\. Edwin T. Jaynes, “Probability Theory, with Applications in Science and Engineering,” Unpublished manuscript (1974).
The first publication of this post is here.
(注:作者现在认为此解释已被 贝叶斯法则指南 所取代。)
[编者按:这是本文原版的删节版,原版包含许多互动元素。]
你的朋友和同事在谈论一个叫"贝叶斯定理"或"贝叶斯法则"的东西,或者某种叫作贝叶斯推理的东西。他们谈论时似乎非常热情,于是你去谷歌搜索,找到了一个关于贝叶斯定理的网页,然后……
就是这个方程式。就这些。只有一个方程式。你找到的页面给出了它的定义,但没有说明它是什么,为什么有用,或者为什么你的朋友们会对它感兴趣。它看起来像是某个随机的统计学东西。
为什么一个数学概念会在学习它的人中激起这种奇特的热情?所谓正在席卷科学界的"贝叶斯革命"是什么,它宣称甚至把实验方法本身都纳入为一个特殊情况?贝叶斯信徒所掌握的秘密是什么?他们所见到的光是什么?
很快你就会知道。很快你就会成为我们中的一员。
虽然网上已经有一些关于贝叶斯定理的解释,但根据我向人们介绍贝叶斯推理的经验,现有的网络解释太抽象了。贝叶斯推理非常反直觉。人们不会凭直觉运用贝叶斯推理,在接受辅导时发现贝叶斯推理非常难学,一旦辅导结束又会迅速忘记贝叶斯方法。这对初学者和某一领域经过专业培训的专业人员来说同样成立。贝叶斯推理显然是那种对人类来说天生难以用内置心智能力来掌握的东西,就像量子力学或韦森选择测验一样。
或者他们是这么说的。在这里,你会找到一个试图提供贝叶斯推理直觉化解释的尝试——一个异常温和的入门,调动了人类理解数字的所有方式,从自然频率到空间可视化。其意图是传达数字的含义,以及规则为何如此(且不可能是其他样子),而非传达操纵数字的抽象规则。当你读完这篇文章时,你将在梦中看见贝叶斯问题。
让我们开始吧。
这里有一个关于医生经常遇到的情景的故事题:
在参与常规筛查的 40 岁女性中,有 1% 患有乳腺癌。80% 患有乳腺癌的女性会得到阳性乳腺摄影结果。9.6% 没有患乳腺癌的女性也会得到阳性乳腺摄影结果。这个年龄组中的一名女性在常规筛查中得到了阳性乳腺摄影结果。她实际上患有乳腺癌的概率是多少?
你认为答案是什么?如果你以前没有遇到过这种问题,请在继续之前花点时间给出自己的答案。
接下来,假设我告诉你,大多数医生在这道题上得出了相同的错误答案——通常只有约 15% 的医生答对。("真的吗?15%?这是一个真实的数字,还是基于网络投票的都市传说?"这是一个真实的数字。参见 Casscells、Schoenberger 和 Graboys 1978;\[1\] Eddy 1982;\[2\] Gigerenzer 和 Hoffrage 1995;\[3\] 以及许多其他研究。这是一个令人惊讶的结果,很容易被复现,因此已被广泛复现。)
对于上面的故事题,大多数医生估计概率在 70% 到 80% 之间,这完全是错误的。
以下是一个换了说法的版本,医生的表现稍好一些:
在参与常规筛查的 1,000 名 40 岁女性中,有 10 名患有乳腺癌。1,000 名患有乳腺癌的女性中有 800 名会得到阳性乳腺摄影结果。1,000 名没有患乳腺癌的女性中有 96 名也会得到阳性乳腺摄影结果。如果这个年龄组中的 1,000 名女性接受常规筛查,大约有多少比例的阳性乳腺摄影女性实际上患有乳腺癌?
最后,这是医生表现最好的版本,有 46%——近半数——得出了正确答案:
在参与常规筛查的 10,000 名 40 岁女性中,有 100 名患有乳腺癌。每 100 名患有乳腺癌的女性中有 80 名会得到阳性乳腺摄影结果。9,900 名没有患乳腺癌的女性中有 950 名也会得到阳性乳腺摄影结果。如果这个年龄组中的 10,000 名女性接受常规筛查,大约有多少比例的阳性乳腺摄影女性实际上患有乳腺癌?
正确答案是 7.8%,计算过程如下:在 10,000 名女性中,有 100 名患有乳腺癌;这 100 名中有 80 名乳腺摄影结果为阳性。同样是这 10,000 名女性,9,900 名没有患乳腺癌,其中 950 名也会得到阳性乳腺摄影结果。这使得阳性乳腺摄影结果的女性总人数为 950 + 80 = 1,030。在这 1,030 名阳性乳腺摄影女性中,有 80 名患有癌症。以比例表示,即 80/1,030 = 0.07767,约为 7.8%。
换句话说,在乳腺摄影之前,10,000 名女性可以分为两组:
- 第一组:100 名患有乳腺癌的女性。
- 第二组:9,900 名未患有乳腺癌的女性。
这两组之和为 10,000 名患者,确认在计算中没有遗漏任何人。乳腺摄影之后,这些女性可以分为四组:
- A 组:80 名患有乳腺癌且乳腺摄影结果为阳性的女性。
- B 组:20 名患有乳腺癌且乳腺摄影结果为阴性的女性。
- C 组:950 名未患有乳腺癌且乳腺摄影结果为阳性的女性。
- D 组:8,950 名未患有乳腺癌且乳腺摄影结果为阴性的女性。
A 组和 B 组(患有乳腺癌的组)之和对应第一组;C 组和 D 组(未患有乳腺癌的组)之和对应第二组。如果对 10,000 名患者进行乳腺摄影,那么在 1,030 名阳性乳腺摄影患者中,有 80 名阳性乳腺摄影患者患有癌症。这是正确答案,是医生应该告诉阳性乳腺摄影患者的答案,如果她询问自己患乳腺癌的可能性;如果十三名患者提出这个问题,大约有一名患有癌症。
最常见的错误是忽略了女性患乳腺癌的原始比例,以及未患乳腺癌的女性中收到假阳性结果的比例,而只关注患有乳腺癌的女性中得到阳性结果的比例。例如,这些研究中的绝大多数医生似乎都认为,如果约 80% 患有乳腺癌的女性有阳性乳腺摄影结果,那么阳性乳腺摄影女性患乳腺癌的概率一定约为 80%。
求得最终答案始终需要全部三条信息——患有乳腺癌的女性百分比、未患乳腺癌的女性中收到假阳性结果的百分比,以及患有乳腺癌的女性中收到(正确)阳性结果的百分比。
患者患乳腺癌的原始比例称为先验概率。患乳腺癌的患者得到阳性乳腺摄影结果的概率,以及未患乳腺癌的患者得到阳性乳腺摄影结果的概率,称为两个条件概率。这些初始信息统称为先验。最终答案——已知患者乳腺摄影结果为阳性时,她患乳腺癌的估计概率——称为修正概率或后验概率。我们刚刚看到,后验概率在一定程度上取决于先验概率。
为了看清最终答案始终取决于女性患乳腺癌的原始比例,请考虑另一个宇宙,其中每百万名女性中只有一名患有乳腺癌。即使这个世界中的乳腺摄影能在 10 例癌症中检测出 8 例,同时对未患乳腺癌的女性只有 1/10 的假阳性率,每检测到一例真正的癌症,仍然会有十万个假阳性。女性患癌的原始概率极其低,以至于虽然乳腺摄影的阳性结果确实增加了估计概率,但概率并未增加到确定或"明显的机会";概率从 1:1,000,000 增加到 1:100,000。
这表明,乳腺摄影结果并不会取代你对患者患癌概率的旧信息;乳腺摄影会推动估计概率朝结果的方向移动。阳性结果将原始概率向上推动;阴性结果将概率向下推动。例如,在原始问题中,1% 的女性患有癌症,80% 患癌女性的乳腺摄影结果为阳性,9.6% 未患癌女性的乳腺摄影结果为阳性,乳腺摄影的阳性结果将 1% 的概率推动向上至 7.8%。
大多数第一次遇到此类问题的人会进行这样的心理操作:用患有癌症的女性得到阳性乳腺摄影结果的 80% 概率取代原始的 1% 概率。这看起来似乎是个好主意,但根本行不通。"阳性乳腺摄影的女性患乳腺癌的概率"和"患有乳腺癌的女性得到阳性乳腺摄影结果的概率"根本不是同一回事;它们就像苹果和奶酪一样不同。
问:贝叶斯推理者为什么要过马路?
答:你需要更多信息才能回答这个问题。
假设一个桶里装着许多小塑料蛋。有些蛋被涂成红色,有些被涂成蓝色。桶中 40% 的蛋含有珍珠,60% 什么都没有。含有珍珠的蛋中有 30% 被涂成蓝色,什么都没有的蛋中有 10% 被涂成蓝色。一个蓝色蛋含有珍珠的概率是多少?对于这个例子,算术足够简单,你可能能在脑中完成,我建议你尝试这样做。
更简洁地表述这个问题:
P(珍珠)= 40%
P(蓝色|珍珠)= 30%
P(蓝色|¬珍珠)= 10%
P(珍珠|蓝色)= ?
符号"¬"是"非"的简写,因此¬珍珠读作"没有珍珠"。
符号 P(蓝色|珍珠)是"已知珍珠时蓝色的概率"或"已知蛋含有珍珠时蛋被涂成蓝色的概率"的简写。右侧的项是你已经知道的或前提,左侧的项是含义或结论。如果我们有 P(蓝色|珍珠)= 30%,并且我们已经知道某个蛋含有珍珠,那么我们可以得出结论,这个蛋被涂成蓝色的概率为 30%。因此,我们寻找的最终事实——"蓝色蛋含有珍珠的概率"或"如果我们知道蛋被涂成蓝色,蛋含有珍珠的概率"——读作 P(珍珠|蓝色)。
40% 的蛋含有珍珠,60% 的蛋什么都没有。含有珍珠的蛋中有 30% 被涂成蓝色,因此总共有 12% 的蛋含有珍珠且被涂成蓝色。什么都没有的蛋中有 10% 被涂成蓝色,因此总共有 6% 的蛋什么都没有且被涂成蓝色。总共有 18% 的蛋被涂成蓝色,总共有 12% 的蛋被涂成蓝色且含有珍珠,因此蓝色蛋含有珍珠的概率为 12/18 = 2/3,约为 67%。
和之前一样,我们可以通过考虑极端情况来看出三条信息都是必要的。在一个(大)桶中,每千个蛋中只有一个含有珍珠,知道一个蛋被涂成蓝色会将概率从 0.1% 推高到 0.3%(而不是从 40% 推高到 67%)。类似地,如果 1,000 个蛋中有 999 个含有珍珠,知道一个蛋是蓝色的会将概率从 99.9% 推高到 99.966%;蛋不含珍珠的概率从 1/1,000 降至约 1/3,000。
对于珍珠蛋问题,大多数不熟悉贝叶斯推理的受访者可能会回答,蓝色蛋含有珍珠的概率为 30%,或者可能是 20%(30% 的真阳性概率减去 10% 的假阳性概率)。即使这种心理操作当时看起来是个好主意,但就所问问题而言它毫无意义。这就像一个实验,你问一个二年级学生:"如果 18 人上了公共汽车,然后又有 7 人上了公共汽车,司机多大了?"许多二年级学生会回答:"25 岁。"他们知道自己被提示进行某种心理程序,但还没有完全把程序与现实联系起来。类似地,要找到阳性乳腺摄影女性患乳腺癌的概率,用患有乳腺癌的女性得到阳性乳腺摄影结果的概率取代女性患癌的原始概率,根本毫无意义。你也不能用真阳性概率减去假阳性概率。这些操作就像用公共汽车上的人数来求司机年龄一样毫无关联。
Gigerenzer 和 Hoffrage 在 1995 年的一项研究表明,某些表述故事题的方式更容易引发正确的贝叶斯推理。\[4\] 最不容易引发正确推理的表述使用概率。稍微更容易引发正确推理的表述使用频率而不是概率;问题保持不变,但不是说 1% 的女性患有乳腺癌,而是说每 100 名女性中有 1 名患有乳腺癌,100 名患乳腺癌的女性中有 80 名会得到阳性乳腺摄影结果,等等。为什么更高比例的受试者在这个问题上表现出贝叶斯推理?可能是因为说"每 100 名女性中有 1 名"会鼓励你具体地想象 X 名患癌女性,从而想象到 X 名患癌且乳腺摄影结果为阳性的女性,等等。
迄今为止发现的最有效的表述方式是所谓的自然频率——说 100 个蛋中有 40 个含有珍珠,含有珍珠的 40 个蛋中有 12 个被涂成蓝色,什么都没有的 60 个蛋中有 6 个被涂成蓝色。自然频率的表述是指在呈现条件概率时包含了先验概率的信息。如果你是通过自然实验来了解蛋的条件概率,你会在敲开一百个蛋的过程中——敲开约 40 个含有珍珠的蛋,其中 12 个被涂成蓝色,同时敲开 60 个什么都没有的蛋,其中约 6 个被涂成蓝色。在了解条件概率的过程中,你看到含有珍珠的蓝色蛋的例子,大约是看到不含珍珠的蓝色蛋的例子的两倍。
不幸的是,虽然自然频率是朝正确方向迈出的一步,但它可能还不够。当问题以自然频率呈现时,使用贝叶斯推理的人的比例上升到约一半。这是一个很大的改进,但当我们谈论真实的医生和真实的患者时,还不够大。
问:如何找到一个问题的先验?
答:许多常用先验列在《化学与物理手册》中。
**问:先验最初从哪里来?**
答:永远不要问这个问题。
问:嗯嗯。那么科学家从哪里得到他们的先验?
答:科学问题的先验由美国科学促进会每年投票决定。近年来,投票变得激烈而有争议,充满广泛的怨恨、派系极化,甚至发生了几起公然的暗杀事件。这可能是贝叶斯委员会内部争权的幌子,也可能是争论者有太多空闲时间。没有人真的确定。
问:我明白了。那么其他人从哪里得到他们的先验?
答:他们从 Kazaa 下载先验。
问:如果我想要的先验在 Kazaa 上找不到怎么办?
答:旧金山唐人街小巷里有一家陈旧杂乱的古董店。不要问那只铜老鼠的事。
实际上,先验就像最终答案一样是真是假——它们反映现实,可以通过与现实比较来判断。例如,如果你认为样本中 10,000 名女性中有 920 名患有乳腺癌,而实际数字是 10,000 名中有 100 名,那么你的先验是错误的。对于我们特定的问题,先验可能是通过三项研究建立的——一项关于患乳腺癌女性病史的研究,以了解她们中有多少人得到了阳性乳腺摄影结果;一项关于未患乳腺癌女性的研究,以了解她们中有多少人得到了阳性乳腺摄影结果;以及一项关于某特定人群中乳腺癌患病率的流行病学研究。
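The probability P(A, B) is the same as P(B, A), but P(A|B) is not the same thing as P(B|A), and P(A, B) is completely different from P(A|B). It's a common confusion to mix up some or all of these quantities.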
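To get acquainted with all the relationships between them, we'll play "follow the degrees of freedom." For example, the two quantities P(cancer) and P(¬cancer) have one degree of freedom between them, because of the general law P(A) + P(¬A) = 1. If you know that P(¬cancer) = 0.99, you can obtain P(cancer) = 1 − P(¬cancer) = 0.01.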
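The quantities P(positive|cancer) and P(¬positive|cancer) also have only one degree of freedom between them; either a woman with breast cancer gets a positive mammography or she doesn't. On the other hand, P(positive|cancer) and P(positive|¬cancer) have two degrees of freedom. You can have a mammography test that returns positive for 80% of cancer patients and 9.6% of healthy patients, or that returns positive for 70% of cancer patients and 2% of healthy patients, or even a health test that returns "positive" for 30% of cancer patients and 92% of healthy patients. The two quantities, the output of the mammography test for cancer patients and the output of the mammography test for healthy patients, are in mathematical terms independent; one cannot be obtained from the other in any way, and so they have two degrees of freedom between them.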
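What about P(positive, cancer), P(positive|cancer), and P(cancer)? Here we have three quantities; how many degrees of freedom are there? In this case the equation that must hold is
P(positive, cancer) = P(positive|cancer) × P(cancer).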
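This equality reduces the degrees of freedom by one. If we know the fraction of patients with cancer, and the chance that a cancer patient has a positive mammography, we can deduce the fraction of patients who have breast cancer and a positive mammography by multiplying.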
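Similarly, if we know the number of patients with breast cancer and positive mammographies, and also the number of patients with breast cancer, we can estimate the chance that a woman with breast cancer gets a positive mammography by dividing: P(positive|cancer) = P(positive, cancer)/P(cancer). In fact, this is exactly how such medical diagnostic tests are calibrated; you do a study on 8,520 women with breast cancer and see that there are 6,816 (or thereabouts) women with breast cancer and positive mammographies, then divide 6,816 by 8,520 to find that 80% of women with breast cancer had positive mammographies. (Incidentally, if you accidentally divide 8,520 by 6,816 instead of the other way around, your calculations will start doing strange things, such as insisting that 125% of women with breast cancer and positive mammographies have breast cancer. This is a common mistake in carrying out Bayesian arithmetic, in my experience.) And finally, if you know P(positive, cancer) and P(positive|cancer), you can deduce how many cancer patients there must have been originally. There are two degrees of freedom shared out among the three quantities; if we know any two, we can deduce the third.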
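How about P(positive), P(positive, cancer), and P(positive, ¬cancer)? Again there are only two degrees of freedom among these three variables. The equation occupying the extra degree of freedom is
P(positive) = P(positive, cancer) + P(positive, ¬cancer).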
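This is how P(positive) is computed to begin with; we figure out the number of women with breast cancer who have positive mammographies, and the number of women without breast cancer who have positive mammographies, then add them together to get the total number of women with positive mammographies. It would be very strange to go out and conduct a study to determine the number of women with positive mammographies (just that one number and nothing else), but in theory you could do so. And if you then conducted another study and found the number of those women who had positive mammographies and breast cancer, you would also know the number of women with positive mammographies and no breast cancer, since a woman with a positive mammography either has breast cancer or doesn't. In general, P(A, B) + P(A, ¬B) = P(A). Symmetrically, P(A, B) + P(¬A, B) = P(B).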
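What about P(positive, cancer), P(positive, ¬cancer), P(¬positive, cancer), and P(¬positive, ¬cancer)? You might at first be tempted to think that there are only two degrees of freedom for these four quantities: that you can, for example, get P(positive, ¬cancer) by multiplying P(positive) × P(¬cancer), and thus that all four quantities can be found given only the two quantities P(positive) and P(cancer). This is not the case! P(positive, ¬cancer) = P(positive) × P(¬cancer) only if the two probabilities are statistically independent, that is, if the chance that a woman has breast cancer has no bearing on whether she has a positive mammography. This amounts to requiring that the two conditional probabilities be equal to each other, a requirement which would eliminate one degree of freedom. If you remember that these four quantities are the groups A, B, C, and D, you can look over those four groups and realize that, in theory, you can put any number of people into the four groups. If you start with a group of 80 women with breast cancer and positive mammographies, there's no reason why you can't add another group of 500 women with breast cancer and negative mammographies, followed by a group of 3 women without breast cancer and negative mammographies, and so on. So now it seems like the four quantities have four degrees of freedom. And they would, except that in expressing them as probabilities, we need to normalize them to fractions of the complete group, which adds the constraint that P(positive, cancer) + P(positive, ¬cancer) + P(¬positive, cancer) + P(¬positive, ¬cancer) = 1. This equation takes up one degree of freedom, leaving three degrees of freedom among the four quantities. If you specify the fractions of women in groups A, B, and D, you can deduce the fraction of women in group C.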
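Given the four groups A, B, C, and D, it is very straightforward to compute everything else:
P(cancer) = (A + B)/(A + B + C + D)
P(¬positive|cancer) = B/(A + B)
and so on. Since {A, B, C, D} contains three degrees of freedom, it follows that the entire set of probabilities relating cancer rates to test results contains only three degrees of freedom. Remember that in our problems we always needed three pieces of information (the prior probability and the two conditional probabilities), which, indeed, have three degrees of freedom among them. Actually, for Bayesian problems, any three quantities with three degrees of freedom between them should logically specify the entire problem.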
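As a rough illustration of "computing everything else" from the four groups, here is a sketch with concrete counts. Two things are assumptions: the group labels are taken to mean A = cancer and positive, B = cancer and negative, C = no cancer and positive, D = no cancer and negative (consistent with the two formulas above), and the counts are the rounded figures per 10,000 women implied by a 1% prior, an 80% hit rate, and a 9.6% false-positive rate.

```python
from fractions import Fraction

# Assumed group sizes per 10,000 women (rounded; see the lead-in).
A, B, C, D = 80, 20, 950, 8950
total = A + B + C + D

p_cancer = Fraction(A + B, total)
p_positive_given_cancer = Fraction(A, A + B)
p_not_positive_given_cancer = Fraction(B, A + B)
p_positive_given_healthy = Fraction(C, C + D)
p_positive = Fraction(A + C, total)
p_positive_and_cancer = Fraction(A, total)
p_cancer_given_positive = Fraction(A, A + C)

print(float(p_cancer))                 # 0.01
print(float(p_positive_given_cancer))  # 0.8
```

The product rule P(positive, cancer) = P(positive|cancer) × P(cancer) and the sum rule for P(positive) both fall out of the counts automatically, which is one way to see that the four groups carry all the information.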
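The probability that a test gives a true positive divided by the probability that a test gives a false positive is known as the likelihood ratio of that test. The likelihood ratio for a positive result summarizes how much a positive result will slide the prior probability. Does the likelihood ratio of a medical test then sum up everything there is to know about the usefulness of the test?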
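No, it does not! The likelihood ratio sums up everything there is to know about the meaning of a positive result on the medical test, but the meaning of a negative result on the test is not specified, nor is the frequency with which the test is useful. For example, a mammography with a hit rate of 80% for patients with breast cancer and a false-positive rate of 9.6% for healthy patients has the same likelihood ratio as a test with an 8% hit rate and a false-positive rate of 0.96%. Although these two tests have the same likelihood ratio, the first test is more useful in every way: it detects disease more often, and a negative result is stronger evidence of health.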
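Suppose that you apply two tests for breast cancer in succession: say, a standard mammography and also some other test which is independent of mammography. Since I don't know of any such test that is independent of mammography, I'll invent one for the purpose of this problem, and call it the Tams-Braylor Division Test, which checks to see if any cells are dividing more rapidly than other cells. We'll suppose that the Tams-Braylor gives a true positive for 90% of patients with breast cancer, and gives a false positive for 5% of patients without cancer. Let's say the prior prevalence of breast cancer is 1%. If a patient gets a positive result on her mammography and her Tams-Braylor, what is the revised probability she has breast cancer?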
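One way to solve this problem would be to take the revised probability for a positive mammography, which we already calculated as 7.8%, and plug that into the Tams-Braylor test as the new prior probability. If we do this, we find that the result comes out to 60%.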
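The "posterior becomes the new prior" procedure can be sketched in a few lines. The helper name `update` is made up for this illustration, and the calculation relies on the stated assumption that the two tests are independent given the patient's true condition.

```python
from fractions import Fraction

def update(prior, hit_rate, false_positive_rate):
    """Return the probability of the condition after one positive result."""
    true_positive = prior * hit_rate
    false_positive = (1 - prior) * false_positive_rate
    return true_positive / (true_positive + false_positive)

prior = Fraction(1, 100)

# Step 1: positive mammography (80% hit rate, 9.6% false-positive rate).
after_mammography = update(prior, Fraction(80, 100), Fraction(96, 1000))

# Step 2: positive Tams-Braylor (90% hit rate, 5% false-positive rate),
# using the result of step 1 as the new prior.
after_both = update(after_mammography, Fraction(90, 100), Fraction(5, 100))

print(round(float(after_mammography), 3))  # 0.078
print(round(float(after_both), 3))         # 0.602
```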
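Suppose that the prior prevalence of breast cancer in a demographic is 1%. Suppose that we, as doctors, have a repertoire of three independent tests for breast cancer. Our first test, test A, a mammography, has a likelihood ratio of 80%/9.6% = 8.33. The second test, test B, has a likelihood ratio of 18.0 (for example, from 90% versus 5%); and the third test, test C, has a likelihood ratio of 3.5 (which could be from 70% versus 20%, or from 35% versus 10%; it makes no difference). Suppose a patient gets a positive result on all three tests. What is the probability the patient has breast cancer?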
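Here's a fun trick for simplifying the bookkeeping. If the prior prevalence of breast cancer in a demographic is 1%, then 1 out of 100 women have breast cancer, and 99 out of 100 women do not have breast cancer. So if we rewrite the probability of 1% as an odds ratio, the odds are 1:99.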
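And the likelihood ratios of the three tests A, B, and C are:
8.33 : 1 = 25 : 3
18.0 : 1 = 18 : 1
3.5 : 1 = 7 : 2.
The odds for women with breast cancer who score positive on all three tests, versus women without breast cancer who score positive on all three tests, will equal:
1 × 25 × 18 × 7 : 99 × 3 × 1 × 2 = 3150 : 594.
To recover the probability from the odds, we just write:
3150/(3150 + 594) = 84%.
This always works regardless of how the odds ratios are written; i.e., 8.33:1 is just the same as 25:3 or 75:9. It doesn't matter in what order the tests are administered, or in what order the results are computed. The proof is left as an exercise for the reader.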
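The odds bookkeeping above is short enough to check mechanically. In this sketch the odds are stored as (for, against) integer pairs, a representation chosen purely for illustration; `math.prod` requires Python 3.8 or later.

```python
from fractions import Fraction
from math import prod

prior_odds = (1, 99)                          # 1% prevalence written as odds
likelihood_ratios = [(25, 3), (18, 1), (7, 2)]  # tests A, B, C

# Multiply the "for" sides together and the "against" sides together.
in_favor = prior_odds[0] * prod(n for n, _ in likelihood_ratios)
against = prior_odds[1] * prod(d for _, d in likelihood_ratios)

# Convert odds back to a probability.
probability = Fraction(in_favor, in_favor + against)

# Because multiplication commutes, reordering the tests changes nothing.
reordered = list(reversed(likelihood_ratios))
in_favor_reordered = prior_odds[0] * prod(n for n, _ in reordered)

print(in_favor, against)              # 3150 594
print(round(float(probability), 2))   # 0.84
```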
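E. T. Jaynes, in Probability Theory With Applications in Science and Engineering, suggests that credibility and evidence should be measured in decibels.\[5\]
Decibels?
Decibels are used for measuring exponential differences of intensity. For example, if the sound from an automobile horn carries 10,000 times as much energy (per square meter per second) as the sound from an alarm clock, the automobile horn would be 40 decibels louder. The sound of a bird singing might carry 1,000 times less energy than an alarm clock, and hence would be 30 decibels softer. To get the number of decibels, you take the logarithm base 10 and multiply by 10:
decibels = 10 log₁₀(intensity)
intensity = 10^(decibels/10).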
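Suppose we start with a prior probability of 1% that a woman has breast cancer, corresponding to an odds ratio of 1:99. And then we administer three tests of likelihood ratios 25:3, 18:1, and 7:2. You could multiply those numbers... or you could just add their logarithms:
10 log₁₀(1/99) ≈ −20
10 log₁₀(25/3) ≈ 9
10 log₁₀(18/1) ≈ 13
10 log₁₀(7/2) ≈ 5.
It starts out as fairly unlikely that a woman has breast cancer: our credibility level is at −20 decibels. Then three test results come in, corresponding to 9, 13, and 5 decibels of evidence. This raises the credibility level by a total of 27 decibels, meaning that the prior credibility of −20 decibels goes to a posterior credibility of 7 decibels. So the odds go from 1:99 to 5:1, and the probability goes from 1% to around 83%.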
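The decibel bookkeeping can be sketched as follows. The function names are invented for this example. Note that the text rounds each term to a whole decibel, which is why it lands on 5:1 and about 83%, while the unrounded sum reproduces the 84% obtained by multiplying odds directly.

```python
import math

def decibels(odds):
    """Convert an odds or likelihood ratio to decibels."""
    return 10 * math.log10(odds)

def odds_from_decibels(db):
    """Invert the conversion above."""
    return 10 ** (db / 10)

prior_db = decibels(1 / 99)                                  # about -20
evidence_db = [decibels(25 / 3), decibels(18), decibels(7 / 2)]  # about 9, 13, 5

# Adding logarithms replaces multiplying ratios.
posterior_db = prior_db + sum(evidence_db)
posterior_odds = odds_from_decibels(posterior_db)
posterior_probability = posterior_odds / (1 + posterior_odds)

print([round(db) for db in evidence_db])  # [9, 13, 5]
print(round(posterior_db))                # 7
```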
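You are a mechanic for gizmos. When a gizmo stops working, it is due to a blocked hose 30% of the time. If a gizmo's hose is blocked, there is a 45% probability that prodding the gizmo will produce sparks. If a gizmo's hose is unblocked, there is only a 5% chance that prodding the gizmo will produce sparks. A customer brings you a malfunctioning gizmo. You prod the gizmo and find that it produces sparks. What is the probability that a spark-producing gizmo has a blocked hose?
What is the sequence of arithmetical operations that you performed to solve this problem?
(45% × 30%)/(45% × 30% + 5% × 70%)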
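Similarly, to find the chance that a woman with a positive mammography has breast cancer, we computed:
\[P(positive|cancer) × P(cancer)\] / \[P(positive|cancer) × P(cancer) + P(positive|¬cancer) × P(¬cancer)\]
which is
P(positive, cancer) / \[P(positive, cancer) + P(positive, ¬cancer)\]
which is
P(positive, cancer) / P(positive)
which is
P(cancer|positive).
The fully general form of this calculation is known as Bayes's Theorem or Bayes's Rule.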
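Bayes's Theorem:
P(A|X) = \[P(X|A) × P(A)\] / \[P(X|A) × P(A) + P(X|¬A) × P(¬A)\]
When there is some phenomenon A that we want to investigate, and an observation X that is evidence about A (for example, in the previous example, A is breast cancer and X is a positive mammography), Bayes's Theorem tells us how we should update our probability of A, given the new evidence X.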
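The theorem translates almost symbol for symbol into a small function. This is a minimal sketch for the two-hypothesis case (A versus ¬A) only; the function and parameter names are made up for the example, and it is applied to the gizmo and mammography numbers given earlier.

```python
from fractions import Fraction

def bayes(p_x_given_a, p_a, p_x_given_not_a):
    """P(A|X) for a binary hypothesis A and an observation X."""
    numerator = p_x_given_a * p_a
    denominator = numerator + p_x_given_not_a * (1 - p_a)
    return numerator / denominator

# Gizmo problem: A = blocked hose, X = sparks.
blocked_given_sparks = bayes(Fraction(45, 100), Fraction(30, 100), Fraction(5, 100))

# Mammography problem: A = breast cancer, X = positive mammography.
cancer_given_positive = bayes(Fraction(80, 100), Fraction(1, 100), Fraction(96, 1000))

print(round(float(blocked_given_sparks), 3))   # 0.794
print(round(float(cancer_given_positive), 3))  # 0.078
```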
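By this point, Bayes's Theorem may seem blatantly obvious or even tautological, rather than exciting and new. If so, this introduction has entirely succeeded in its purpose.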
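Bayes's Theorem describes what makes something "evidence" and how much evidence it is. Statistical models are judged by comparison to the Bayesian method because, in statistics, the Bayesian method is as good as it gets: the Bayesian method defines the maximum amount of mileage you can get out of a given piece of evidence, in the same way that thermodynamics defines the maximum amount of work you can get out of a temperature differential. This is why you hear cognitive scientists talking about Bayesian reasoners. In cognitive science, Bayesian reasoner is the technically precise code word that we use to mean rational mind.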
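There are also a number of general heuristics about human reasoning that you can learn from looking at Bayes's Theorem.
For example, in many discussions of Bayes's Theorem, you may hear cognitive psychologists saying that people do not take prior frequencies sufficiently into account, meaning that when people approach a problem where there's some evidence X indicating that condition A might hold true, they tend to judge A's likelihood solely by how well the evidence X seems to match A, without taking into account the prior frequency of A. If you think, for example, that under the mammography example, the woman's chance of having breast cancer is in the range of 70%–80%, then this kind of reasoning is insensitive to the prior frequency given in the problem; it doesn't notice whether 1% of women or 10% of women start out having breast cancer. "Pay more attention to the prior frequency!" is one of the many things that humans need to bear in mind to partially compensate for our built-in inadequacies.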
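A related error is to pay too much attention to P(X|A) and not enough to P(X|¬A) when determining how much evidence X is for A. The degree to which a result X is evidence for A depends not only on the strength of the statement we'd expect to see result X if A were true, but also on the strength of the statement we wouldn't expect to see result X if A weren't true. For example, if it is raining, this very strongly implies the grass is wet (P(wetgrass|rain) ≈ 1), but seeing that the grass is wet doesn't necessarily mean that it has just rained; perhaps the sprinkler was turned on, or you're looking at the early morning dew. Since P(wetgrass|¬rain) is substantially greater than zero, P(rain|wetgrass) is substantially less than one. On the other hand, if the grass was never wet when it wasn't raining, then knowing that the grass was wet would always show that it was raining, P(rain|wetgrass) ≈ 1, even if P(wetgrass|rain) = 50%; that is, even if the grass only got wet 50% of the times it rained. Evidence is always a result of the differential between the two conditional probabilities. Strong evidence is not the product of a very high probability that A leads to X, but the product of a very low probability that not-A could have led to X.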
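The wet-grass contrast can be made concrete with numbers. The text only fixes P(wetgrass|rain) ≈ 1 in the first scenario and P(wetgrass|rain) = 50% with P(wetgrass|¬rain) = 0 in the second; the 20% prior for rain and the 30% chance of wet grass without rain below are hypothetical values picked purely for illustration.

```python
from fractions import Fraction

def bayes(p_x_given_a, p_a, p_x_given_not_a):
    """P(A|X) for a binary hypothesis A and an observation X."""
    numerator = p_x_given_a * p_a
    return numerator / (numerator + p_x_given_not_a * (1 - p_a))

p_rain = Fraction(20, 100)  # assumed prior, not from the text

# Scenario 1: rain almost always wets the grass, but sprinklers and dew
# also wet it fairly often (30% is an assumed figure).
rain_given_wet_sprinklers = bayes(Fraction(1), p_rain, Fraction(30, 100))

# Scenario 2: rain wets the grass only half the time, but nothing else
# ever does, so wet grass settles the question.
rain_given_wet_no_sprinklers = bayes(Fraction(50, 100), p_rain, Fraction(0))

print(rain_given_wet_sprinklers)     # 5/11
print(rain_given_wet_no_sprinklers)  # 1
```

Under these assumed numbers, the weaker link from rain to wet grass yields the stronger conclusion, because what matters is how rarely the evidence shows up when the hypothesis is false.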
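The Bayesian revolution in the sciences is fueled, not only by more and more cognitive scientists suddenly noticing that mental phenomena have Bayesian structure in them; not only by scientists in every field learning to judge their statistical methods by comparison with the Bayesian method; but also by the idea that science itself is a special case of Bayes's Theorem; experimental evidence is Bayesian evidence. The Bayesian revolutionaries hold that when you perform an experiment and get evidence that "confirms" or "disconfirms" your theory, this confirmation and disconfirmation is governed by the Bayesian rules. For example, you have to take into account not only whether your theory predicts the phenomenon, but whether other possible explanations also predict the phenomenon.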
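Previously, the most popular philosophy of science was probably Karl Popper's falsificationism; this is the old philosophy that the Bayesian revolution is currently dethroning. Karl Popper's idea that theories can be definitely falsified, but never definitely confirmed, is yet another special case of the Bayesian rules; if P(X|A) ≈ 1 (if the theory makes a definite prediction), then observing ¬X very strongly falsifies A. On the other hand, if P(X|A) ≈ 1, and we observe X, this doesn't definitely confirm the theory; there might be some other condition B such that P(X|B) ≈ 1, in which case observing X doesn't favor A over B. For observing X to definitely confirm A, we would have to know, not that P(X|A) ≈ 1, but that P(X|¬A) ≈ 0, which is something that we can't know because we can't range over all possible alternative explanations. For example, when Einstein's theory of General Relativity toppled Newton's incredibly well-confirmed theory of gravity, it turned out that all of Newton's predictions were just a special case of Einstein's predictions.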
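You can even formalize Popper's philosophy mathematically. The likelihood ratio for X, the quantity P(X|A)/P(X|¬A), determines how much observing X slides the probability for A; the likelihood ratio is what says how strong X is as evidence. Well, in your theory A, you can predict X with probability 1, if you like; but you can't control the denominator of the likelihood ratio, P(X|¬A). There will always be some alternative theories that also predict X, and while we go with the simplest theory that fits the current evidence, you may someday encounter some evidence that an alternative theory predicts but your theory does not. That's the hidden gotcha that toppled Newton's theory of gravity. So there's a limit on how much mileage you can get from successful predictions; there's a limit on how high the likelihood ratio goes for confirmatory evidence.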
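On the other hand, if you encounter some piece of evidence Y that is definitely not predicted by your theory, this is enormously strong evidence against your theory. If P(Y|A) is infinitesimal, then the likelihood ratio will also be infinitesimal. For example, if P(Y|A) is 0.0001%, and P(Y|¬A) is 1%, then the likelihood ratio P(Y|A)/P(Y|¬A) will be 1:10,000. That's −40 decibels of evidence! Or, flipping the likelihood ratio, if P(Y|A) is very small, then P(Y|¬A)/P(Y|A) will be very large, meaning that observing Y greatly favors ¬A over A. Falsification is much stronger than confirmation. This is a consequence of the earlier point that very strong evidence is not the product of a very high probability that A leads to X, but the product of a very low probability that not-A could have led to X. This is the precise Bayesian rule that underlies the heuristic value of Popper's falsificationism.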
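The asymmetry between falsification and confirmation shows up directly when both are put on the decibel scale. In this sketch the falsifying numbers (0.0001% versus 1%) come from the text, while the confirming case, where the theory predicts X with certainty and the alternatives predict it half the time, is an assumed figure chosen only for contrast.

```python
import math

def evidence_decibels(p_given_a, p_given_not_a):
    """Decibels of evidence for A from an observation with these likelihoods."""
    return 10 * math.log10(p_given_a / p_given_not_a)

# Falsifying observation Y: P(Y|A) = 0.0001%, P(Y|not-A) = 1%.
falsifying = evidence_decibels(0.000001, 0.01)

# Confirming observation X: P(X|A) = 1, P(X|not-A) = 0.5 (assumed value).
confirming = evidence_decibels(1.0, 0.5)

print(round(falsifying))     # -40
print(round(confirming, 1))  # 3.0
```

Even a perfect prediction earns only as many decibels as the alternatives allow, while a single failed prediction can cost dozens.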
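Similarly, Popper's dictum that an idea must be falsifiable can be interpreted as a manifestation of the Bayesian conservation-of-probability rule; if a result X is positive evidence for the theory, then the result ¬X would have disconfirmed the theory to some extent. If you try to interpret both X and ¬X as "confirming" the theory, the Bayesian rules say this is impossible! To increase the probability of a theory you must expose it to tests that can potentially decrease its probability; this is not just a rule for detecting would-be cheaters in the social process of science, but a consequence of Bayesian probability theory. On the other hand, Popper's idea that there is only falsification and no such thing as confirmation turns out to be incorrect. Bayes's Theorem shows that falsification is very strong evidence compared to confirmation, but falsification is still probabilistic in nature; it is not governed by fundamentally different rules from confirmation, as Popper argued.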
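The conservation-of-probability rule can be checked numerically: averaged over the possible outcomes, weighted by how likely each outcome is, the posterior must equal the prior. The sketch below uses the mammography numbers (1% prior, 80% hit rate, 9.6% false-positive rate) as a stand-in for "theory A" and "result X"; the variable names are illustrative only.

```python
from fractions import Fraction

p_a = Fraction(1, 100)              # prior P(A)
p_x_given_a = Fraction(80, 100)     # P(X|A)
p_x_given_not_a = Fraction(96, 1000)  # P(X|not-A)

# Probability of seeing X at all.
p_x = p_x_given_a * p_a + p_x_given_not_a * (1 - p_a)

# Posterior after seeing X, and after seeing not-X.
p_a_given_x = p_x_given_a * p_a / p_x
p_a_given_not_x = (1 - p_x_given_a) * p_a / (1 - p_x)

# Expected posterior, weighting each outcome by its probability.
expected_posterior = p_a_given_x * p_x + p_a_given_not_x * (1 - p_x)

print(expected_posterior == p_a)            # True
print(p_a_given_not_x < p_a < p_a_given_x)  # True
```

Since the weighted average is pinned to the prior, X can only raise the probability of A if ¬X would have lowered it.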
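So we find that many phenomena in the cognitive sciences, plus the statistical methods used by scientists, plus the scientific method itself, are all turning out to be special cases of Bayes's Theorem. Hence the Bayesian revolution.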
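Having introduced Bayes's Theorem explicitly, we can explicitly discuss its components.
P(A|X) = \[P(X|A) × P(A)\] / \[P(X|A) × P(A) + P(X|¬A) × P(¬A)\]
We'll start with P(A|X). If you ever find yourself getting confused about what's A and what's X in Bayes's Theorem, start with P(A|X) on the left side of the equation; that's the simplest part to interpret. In P(A|X), A is the thing we want to know about. X is how we're observing it; X is the evidence we're using to make inferences about A. Remember that for every expression P(Q|P), we want to know about the probability for Q given P, the degree to which P implies Q. A more sensible notation, which it is now too late to adopt, would be P(Q ← P).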
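P(Q|P) is closely related to P(Q, P), but they are not identical. Expressed as a probability or a fraction, P(Q, P) is the proportion of things that have property Q and property P among all things; for example, the proportion of "women with breast cancer and a positive mammography" within the group of all women. If the total number of women is 10,000, and 80 women have breast cancer and a positive mammography, then P(Q, P) is 80/10,000 = 0.8%. You might say that the absolute quantity, 80, is being normalized to a probability relative to the group of all women. Or to make it clearer, suppose that there's a group of 641 women with breast cancer and a positive mammography within a total sample group of 89,031 women. Six hundred and forty-one is the absolute quantity. If you pick out a random woman from the entire sample, then the probability you'll pick a woman with breast cancer and a positive mammography is P(Q, P), or 0.72% (in this example).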
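On the other hand, P(Q|P) is the proportion of things that have property Q and property P among all things that have P; for example, the proportion of "women with breast cancer and a positive mammography" within the group of all women with positive mammographies. If there are 641 women with breast cancer and positive mammographies, 7,915 women with positive mammographies, and 89,031 women, then P(Q, P) is the probability of getting one of those 641 women if you're picking at random from the entire group of 89,031, while P(Q|P) is the probability of getting one of those 641 women if you're picking at random from the smaller group of 7,915.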
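The difference between the joint and the conditional probability is only a difference in which group you divide by, as this small sketch with the counts from the text shows (the variable names are invented for the example).

```python
cancer_and_positive = 641   # women with breast cancer and a positive mammography
positive = 7_915            # all women with a positive mammography
everyone = 89_031           # the whole sample

# Joint probability: normalize against the whole sample.
p_joint = cancer_and_positive / everyone

# Conditional probability: normalize against the positive group only.
p_conditional = cancer_and_positive / positive

# The two are linked by P(Q|P) = P(Q, P) / P(P).
p_positive = positive / everyone

print(round(p_joint, 4))        # 0.0072
print(round(p_conditional, 3))  # 0.081
```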
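In a sense, P(Q|P) really means P(Q, P|P), but specifying the extra P all the time would be redundant. You already know it has property P, so the property you're investigating is Q, even though you're looking at the size of group (Q, P) within group P, not the size of group Q within group P (which would be nonsense). This is what it means to take the property on the right-hand side as given; it means you know you're working only within the group of things that have property P. When you constrict your focus of attention to see only this smaller group, many other probabilities change. If you're taking P as given, then P(Q, P) equals just P(Q), at least relative to the group P. The old P(Q), the frequency of "things that have property Q within the entire sample," is revised to the new frequency of "things that have property Q within the subsample of things that have property P." If P is given, if P is our entire world, then looking for (Q, P) is the same as looking for just Q.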
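If you constrict your focus of attention to only the population of eggs that are painted blue, then suddenly "the probability that an egg contains a pearl" becomes a different number; this proportion is different for the population of blue eggs than the population of all eggs. The given, the property that constricts our focus of attention, is always on the right side of P(Q|P); the P becomes our world, the entire thing we see, and on the other side of the "given" P always has probability 1. That is what it means to take P as given. So P(Q|P) means "If P has probability 1, what is the probability of Q?" or "If we constrict our attention to only things or events where P is true, what is the probability of Q?" The statement Q, on the other side of the given, is not certain; its probability may be 10% or 90% or any other number. So when you use Bayes's Theorem, and you write the part on the left side as P(A|X) (how to update the probability of A after seeing X, the new probability of A given that we know X, the degree to which X implies A), you can tell that X is always the observation or the evidence, and A is the property being investigated, the thing you want to know about.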
The right side of Bayes's Theorem is derived from the left side through these steps:
P(A|X) = P(A|X)
P(A|X) = P(X, A) / P(X)
P(A|X) = P(X, A) / \[P(X, A) + P(X, ¬A)\]
P(A|X) = \[P(X|A) × P(A)\] / \[P(X|A) × P(A) + P(X|¬A) × P(¬A)\].
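Once the derivation is finished, all the implications on the right side of the equation are of the form P(X|A) or P(X|¬A), while the implication on the left side is P(A|X). This symmetry arises because the elementary causal relations are generally implications from facts to observations, for example from breast cancer to a positive mammography. The elementary steps in reasoning are generally implications from observations to facts, for example from a positive mammography to breast cancer. The left side of Bayes's Theorem is an elementary inferential step from the observation of a positive mammography to the conclusion of an increased probability of breast cancer. Implication is written right-to-left, so we write P(cancer|positive) on the left side of the equation. The right side of Bayes's Theorem describes the elementary causal steps (for example, from breast cancer to a positive mammography), and so the implications on the right side of Bayes's Theorem take the form P(positive|cancer) or P(positive|¬cancer).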
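And that's Bayes's Theorem. Rational inference on the left end, physical causality on the right end; an equation with mind on one side and reality on the other. Remember how the scientific method turned out to be a special case of Bayes's Theorem? If you wanted to put it poetically, you could say that Bayes's Theorem binds reasoning into the physical universe.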
Okay, we're done.
Reverend Bayes says:

You are now an initiate of the Bayesian Conspiracy.
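1\. Ward Casscells, Arno Schoenberger, and Thomas Graboys, "Interpretation by Physicians of Clinical Laboratory Results," New England Journal of Medicine 299 (1978): 999–1001.
2\. David M. Eddy, "Probabilistic Reasoning in Clinical Medicine: Problems and Opportunities," in Judgment Under Uncertainty: Heuristics and Biases, ed. Daniel Kahneman, Paul Slovic, and Amos Tversky (Cambridge University Press, 1982).
3\. Gerd Gigerenzer and Ulrich Hoffrage, "How to Improve Bayesian Reasoning without Instruction: Frequency Formats," Psychological Review 102 (1995): 684–704.
4\. Ibid.
5\. Edwin T. Jaynes, "Probability Theory, with Applications in Science and Engineering," unpublished manuscript (1974).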
This essay was first published here.