P(Wet Ground | Rain) = ~100%
If it's raining, the ground is almost certainly wet.
P(Rain | Wet Ground) = ???
If the ground is wet, is it raining? Not necessarily! Could be sprinklers, spilled water, morning dew...
These probabilities are DIFFERENT, even though they involve the same two things (rain and wet ground).
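To see the difference in numbers, here is a minimal Python sketch. The base rates (rain on 10% of days, wet ground on 15% of rain-free days from sprinklers, spills, or dew) are made-up figures purely for illustration:

```python
# Minimal sketch: the two conditional probabilities differ.
# The base rates below are made-up illustrative numbers, not data.

p_rain = 0.10              # assume it rains on 10% of days
p_wet_given_rain = 0.99    # rain almost always wets the ground
p_wet_given_dry = 0.15     # sprinklers, spills, dew on rain-free days

# Bayes' Theorem: P(Rain | Wet) = P(Wet | Rain) * P(Rain) / P(Wet)
p_wet = p_wet_given_rain * p_rain + p_wet_given_dry * (1 - p_rain)
p_rain_given_wet = p_wet_given_rain * p_rain / p_wet

print(f"P(Wet | Rain) = {p_wet_given_rain:.0%}")   # 99%
print(f"P(Rain | Wet) = {p_rain_given_wet:.0%}")   # ~42% -- very different!
```

Knowing that rain almost always wets the ground tells you very little, on its own, about how often wet ground means rain.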
What Each Approach Believes
| | Frequentist | Bayesian |
| --- | --- | --- |
| What is probability? | Long-run frequency of events | Degree of belief/confidence |
| Can hypotheses have probabilities? | NO — a hypothesis is either true or false | YES — we have degrees of confidence |
| What question do we answer? | P(data \| hypothesis) — the p-value | P(hypothesis \| data) — what we want! |
| Do we use prior knowledge? | NO — only this experiment's data | YES — we update prior beliefs with data |
| Main tool | Null hypothesis significance testing | Bayes' Theorem |
Example 1: The Medical Test
This is the classic example that shows why P(D|H) ≠ P(H|D) — and why confusing them can be dangerous.
📋 The Scenario
You get tested for a rare disease. The test comes back POSITIVE.
Here's what you know about the test:
Sensitivity: 99% — If you HAVE the disease, test is positive 99% of the time
Specificity: 95% — If you DON'T have the disease, test is negative 95% of the time (5% false positive)
Disease prevalence: 1 in 1000 people have it (0.1%)
Question: What's the probability you actually have the disease?
🔴 What Many People (and Doctors!) Think
"The test is 99% accurate, so there's a 99% chance I have the disease!"
Or: "Only 5% false positive rate, so 95% chance I have it!"
This is WRONG. This confuses P(Positive | Disease) with P(Disease | Positive).
🟢 The Bayesian Answer: Let's Calculate P(Disease | Positive)
Imagine 100,000 people get tested:
Of 100,000 people:
• 100 actually have the disease (0.1%)
• 99,900 don't have the disease
Test results for the 100 WITH disease:
• 99 test positive (99% sensitivity) ✓ True positives
• 1 tests negative (false negative)
Test results for the 99,900 WITHOUT disease:
• 94,905 test negative (95% specificity)
• 4,995 test positive (5% false positive rate)
Total positive tests: 99 + 4,995 = 5,094
Of those 5,094 positive tests:
• Only 99 actually have the disease
• 4,995 are false positives!
P(Disease | Positive) = 99 / 5,094 = 1.9%
99% → 1.9%
THE SHOCKING RESULT:
Even with a "99% accurate" test coming back positive, there's only a 1.9% chance you actually have the disease!
Why? Because the disease is so rare (1 in 1000) that even a small false positive rate (5%) creates more false positives than true positives.
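The same head-count arithmetic, written as a short Bayes' Theorem calculation in Python using the sensitivity, specificity, and prevalence given above:

```python
# Bayes' Theorem applied to the medical-test numbers above.

prevalence = 0.001        # 1 in 1000 have the disease
sensitivity = 0.99        # P(positive | disease)
specificity = 0.95        # P(negative | no disease)
false_positive_rate = 1 - specificity

# P(positive) = true positives + false positives (law of total probability)
p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)

# P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
p_disease_given_positive = sensitivity * prevalence / p_positive

print(f"P(Disease | Positive) = {p_disease_given_positive:.1%}")  # about 1.9%
```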
Why This Matters
The frequentist gives you P(Positive | Disease) = 99% — the sensitivity.
But what you WANT to know is P(Disease | Positive) = 1.9% — your actual risk.
Doctors who don't understand this difference cause unnecessary panic, unnecessary follow-up tests, and sometimes unnecessary treatment.
The Bayesian key insight: The prior probability (how common is the disease?) matters enormously. You can't just look at test accuracy — you have to factor in how likely you were to have the disease BEFORE the test.
Example 2: Is the Coin Fair?
This example shows how the two approaches think about the same problem completely differently.
🪙 The Scenario
Someone hands you a coin. You flip it 10 times and get 8 heads.
Question: Is this coin fair (50/50) or biased?
🔴 Frequentist Approach
1. Set up the null hypothesis: "The coin is fair (P(heads) = 0.5)."
2. Ask: "IF the coin is fair, what's the probability of getting 8 or more heads in 10 flips?" This is P(Data | Hypothesis) — the p-value calculation.
3. Calculate: P(≥8 heads | fair coin) = 0.055 = 5.5%
4. Conclude: p = 0.055 > 0.05, so we "fail to reject the null hypothesis." Translation: "We can't say the coin is biased" (barely!)
Frequentist says: "There's a 5.5% chance of seeing this result if the coin is fair. Since that's above our arbitrary 5% threshold, we can't conclude it's biased."
What frequentist CANNOT say: "What's the probability the coin is biased?" — That question is meaningless to a frequentist because the coin either IS or ISN'T biased. It's not probabilistic.
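For reference, the frequentist number in step 3 is just a binomial tail probability; a minimal sketch:

```python
# One-sided p-value: P(8 or more heads in 10 flips | fair coin)
from math import comb

n, observed = 10, 8
p_value = sum(comb(n, k) for k in range(observed, n + 1)) / 2**n
print(f"p = {p_value:.4f}")  # 0.0547, i.e. about 5.5%
```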
🟢 Bayesian Approach
1. Start with a prior belief: "Before seeing any data, what do I believe about this coin?" Maybe: 90% chance it's fair (most coins are), 10% chance it's biased.
2. Observe the data: 8 heads in 10 flips.
3. Update the belief using Bayes' Theorem: how much more likely is this data if the coin is biased vs. fair?
4. Calculate the posterior: given the data, maybe now 70% fair, 30% biased (exact numbers depend on what "biased" means and on your prior).
Bayesian says: "After seeing 8 heads, my confidence that the coin is fair dropped from 90% to 70%. I now think there's a 30% chance it's biased. If I flip more and keep getting lots of heads, that probability will increase."
What Bayesian CAN say: "There's a 30% probability the coin is biased." This is P(Hypothesis | Data) — exactly what we want to know!
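Here is a minimal sketch of that update. The article deliberately leaves "biased" undefined, so the choice of P(heads) = 0.65 for the biased hypothesis below is an assumption made purely to get concrete numbers:

```python
# Discrete Bayesian update for the coin, assuming "biased" means P(heads) = 0.65.
from math import comb

def binom_likelihood(p_heads, heads=8, flips=10):
    """P(observing `heads` heads in `flips` flips | a coin with P(heads) = p_heads)."""
    return comb(flips, heads) * p_heads**heads * (1 - p_heads)**(flips - heads)

prior_fair, prior_biased = 0.90, 0.10          # belief before any flips
like_fair = binom_likelihood(0.50)             # P(data | fair)
like_biased = binom_likelihood(0.65)           # P(data | biased); 0.65 is an assumption

evidence = prior_fair * like_fair + prior_biased * like_biased
posterior_fair = prior_fair * like_fair / evidence

print(f"P(fair | 8 heads)   = {posterior_fair:.0%}")        # roughly 69%
print(f"P(biased | 8 heads) = {1 - posterior_fair:.0%}")    # roughly 31%
```

With that assumption, the posterior lands near the 70% fair / 30% biased split quoted above; a different definition of "biased" or a different prior shifts the numbers.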
The Key Differences
| | Frequentist | Bayesian |
| --- | --- | --- |
| Question asked | How likely is this data if the coin is fair? | How likely is the coin biased, given this data? |
| Uses prior info? | No — only uses this experiment | Yes — incorporates what we knew before |
| Can update with more flips? | Needs a new experiment and a new p-value | Naturally — each flip updates the probability |
| Answer format | "Reject" or "fail to reject" the null | "X% probability the hypothesis is true" |
Context matters to Bayesians:
If this coin came from a magic shop known for trick coins, your prior might be 50% biased.
If it came from the US Mint, your prior might be 99.9% fair.
Same data (8 heads) → different conclusions based on reasonable prior knowledge.
Frequentists ignore this context entirely.
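A quick sketch of that context-dependence, reusing the same two-hypothesis model (and the same assumption that "biased" means P(heads) = 0.65):

```python
# Same 8-heads-in-10 data, different priors ("biased" again assumed to mean P(heads) = 0.65).
from math import comb

def posterior_fair(prior_fair, heads=8, flips=10, p_biased=0.65):
    """P(coin is fair | data) for a two-hypothesis model: fair (0.5) vs. biased (p_biased)."""
    like = lambda p: comb(flips, heads) * p**heads * (1 - p)**(flips - heads)
    num = prior_fair * like(0.5)
    return num / (num + (1 - prior_fair) * like(p_biased))

print(f"Magic-shop coin (prior 50% fair):  {posterior_fair(0.50):.1%} fair after the data")
print(f"US Mint coin (prior 99.9% fair):   {posterior_fair(0.999):.1%} fair after the data")
```

Same 8 heads, wildly different conclusions: roughly 20% fair for the magic-shop coin versus about 99.6% fair for the Mint coin.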
Example 3: The Drug Trial
This shows how the p-value approach leads researchers and the public astray.
💊 The Scenario
A pharmaceutical company tests a new drug for headaches.
100 people get the drug, 100 get placebo
Drug group: 60% report headache relief
Placebo group: 50% report headache relief
Difference: 10 percentage points
Statistical test gives: p = 0.04
🔴 Frequentist Interpretation
What p = 0.04 actually means:
"IF the drug has NO effect (null hypothesis), there's only a 4% chance of seeing a difference this large or larger by random chance."
Since p < 0.05, we "reject the null" and conclude:
"The drug has a statistically significant effect!"
What people THINK this means:
❌ "There's a 96% chance the drug works"
❌ "There's only a 4% chance this result is a fluke"
❌ "The drug is 96% likely to be effective"
ALL WRONG! P-value does NOT tell you probability the drug works.
🟢 Bayesian Interpretation
A Bayesian would ask: "What's P(Drug Works | This Data)?"
To answer this, we need to consider:
1. Prior probability: Before this trial, what was our belief? Most drug candidates fail. Maybe 10% of drugs that reach trials actually work.
2. Likelihood of the data under each hypothesis: How likely is 60% vs. 50% relief if the drug works? If it doesn't?
3. Update using Bayes' Theorem: calculate the posterior probability.
Depending on priors and assumptions, the actual probability the drug works might be something like 30-50% — much lower than the intuitive "96%" people assume from p = 0.04!
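A rough sketch of that update. Rather than model the trial in detail, it caps the likelihood ratio at the most a p-value of 0.04 can support under the -e·p·ln(p) calibration (Sellke, Bayarri & Berger, 2001) and then sweeps the prior; both choices are illustrative assumptions, not features of the actual trial:

```python
# Rough Bayesian reading of the trial. The likelihood ratio is capped at the most
# a p-value of 0.04 can support under the -e*p*ln(p) calibration (Sellke, Bayarri
# & Berger, 2001); the priors swept below are illustrative assumptions.
from math import e, log

p_value = 0.04
max_lr = 1 / (-e * p_value * log(p_value))   # about 2.9 to 1 in favor of "drug works"

for prior in (0.10, 0.20, 0.30):             # P(drug works) before seeing the trial
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * max_lr
    posterior = posterior_odds / (1 + posterior_odds)
    print(f"prior P(works) = {prior:.0%}  ->  posterior at most {posterior:.0%}")
```

Even giving the data their most favorable reading, realistic priors keep the posterior in roughly the 25-55% range, consistent with the 30-50% ballpark above and nowhere near the "96%" people read into p = 0.04.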
Why This Matters for Broken Science:
When you hear "Study finds drug X works (p < 0.05)", your brain thinks "95%+ chance it works."
Reality: Depending on prior probability and study design, actual chance might be 30%, 50%, maybe even less.
This is why 50-90% of findings don't replicate — p < 0.05 doesn't mean what people think it means!
The Replication Problem Explained
🔬 Why Do "Significant" Findings Fail to Replicate?
Imagine 1000 researchers each testing a different potential drug. Assume:
Only 10% of the drugs actually work (100 drugs)
90% don't work (900 drugs)
Studies have 80% power to detect a real effect
Using p < 0.05 threshold
Results:
100 drugs that actually work:
• 80 get p < 0.05 (true positives) ✓
• 20 get p > 0.05 (false negatives)
900 drugs that don't work:
• 855 get p > 0.05 (true negatives)
• 45 get p < 0.05 (false positives!)
Total "significant" findings: 80 + 45 = 125 Of those, actually true: 80/125 = 64%
So even with good study design, 36% of "significant" findings are FALSE!
If the prior is worse (only 5% of drugs work), or studies are lower powered, or researchers p-hack, the false discovery rate gets even higher — explaining the replication crisis.
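The same arithmetic as a small reusable sketch (the function name and the second set of parameters are illustrative):

```python
# Share of "significant" findings that are actually true, reproducing the
# 1000-drug arithmetic above.

def true_discovery_rate(prior, power, alpha):
    """P(effect is real | p < alpha), given a base rate of true effects, power, and alpha."""
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)

print(f"{true_discovery_rate(prior=0.10, power=0.80, alpha=0.05):.0%}")  # 64%, as above
print(f"{true_discovery_rate(prior=0.05, power=0.50, alpha=0.05):.0%}")  # about 34%
```

With a 5% base rate of true effects and 50% power, barely a third of "significant" findings are real.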
Why P-Values Mislead
Now you can see exactly why the p-value approach creates broken science.
What P-Value Actually Is
p-value = P(data this extreme or more | null hypothesis true)
"IF there's no effect, what's the probability of seeing results this extreme?"
What P-Value Is NOT
❌ P-value is NOT:
The probability that the null hypothesis is true
The probability that the finding is a fluke
The probability that you'll get the same result if you replicate
The probability that your hypothesis is correct
The "strength" of the evidence
Common P-Value Fallacies
Fallacy 1: "p = 0.03 means 97% chance the effect is real"
Wrong. P(Data | No Effect) ≠ P(No Effect | Data)
The p-value tells you about the data assuming no effect. It tells you nothing about the probability of there being an effect given the data.
Fallacy 2: "p = 0.01 is stronger evidence than p = 0.04"
Not necessarily. The p-value doesn't measure the strength of evidence. A tiny effect in a huge sample can give p = 0.001 while being practically meaningless. A large effect in a small sample might give p = 0.06 while being potentially important. (A numeric sketch of this appears after Fallacy 4.)
Fallacy 3: "p > 0.05 means there's no effect"
Wrong. "Absence of evidence is not evidence of absence." A non-significant p-value might just mean your sample was too small to detect a real effect.
Fallacy 4: "Two studies with p = 0.03 and p = 0.06 found opposite results"
Wrong. They might have found very similar effect sizes! The difference between "significant" and "not significant" is often not itself statistically significant. The p = 0.05 cutoff is arbitrary.
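To make Fallacies 2 and 3 concrete, here is a minimal sketch using a two-proportion z-test. The group sizes and relief rates are invented purely for illustration:

```python
# Fallacies 2 & 3 in numbers: a two-proportion z-test on made-up group sizes and rates.
from math import sqrt, erfc

def two_sided_p(p1, p2, n):
    """Two-sided p-value for a difference in proportions, n subjects per group."""
    pooled = (p1 + p2) / 2
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    z = abs(p1 - p2) / se
    return erfc(z / sqrt(2))

# Tiny, practically meaningless effect, huge sample: "highly significant"
print(f"51% vs 50%, n = 100,000 per group: p = {two_sided_p(0.51, 0.50, 100_000):.6f}")
# Large, potentially important effect, small sample: "not significant"
print(f"70% vs 50%, n = 20 per group:      p = {two_sided_p(0.70, 0.50, 20):.3f}")
```

The first result is "highly significant" but clinically trivial; the second is "not significant" despite a 20-point difference, simply because 20 people per group gives the test very little power.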
How P-Hacking Works
Because p < 0.05 is the goal (for publication), researchers consciously or unconsciously manipulate analyses to achieve it:
P-Hacking Techniques:
Optional stopping: Keep collecting data until p < 0.05
Multiple comparisons: Test 20 things, report the one that "works"
Outlier removal: Remove data points that hurt your p-value
Subgroup analysis: "It didn't work overall, but in left-handed women over 40..."
Outcome switching: Original outcome didn't work? Try a different one!
Analytical flexibility: Try different statistical tests until one gives p < 0.05
With enough flexibility, you can get p < 0.05 for almost anything — even if there's no real effect. This is why so much "significant" research doesn't replicate.
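One of these techniques is easy to demonstrate. The simulation below sketches the "multiple comparisons" trick: 20 outcomes are tested on pure-noise data where no effect exists, yet most simulated studies still find at least one p < 0.05. (The group sizes and the choice of a plain t-test are arbitrary illustrative assumptions.)

```python
# Simulation: test 20 outcomes on pure-noise data and see how often at least one
# clears p < 0.05. Analytically it's 1 - 0.95**20, i.e. about 64%.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_studies, n_outcomes, n_per_group = 2_000, 20, 50
false_findings = 0

for _ in range(n_studies):
    # Drug and placebo groups drawn from the SAME distribution: no real effect anywhere.
    drug = rng.normal(size=(n_outcomes, n_per_group))
    placebo = rng.normal(size=(n_outcomes, n_per_group))
    p_values = [ttest_ind(drug[i], placebo[i]).pvalue for i in range(n_outcomes)]
    if min(p_values) < 0.05:
        false_findings += 1

print(f"Studies reporting at least one 'significant' result: {false_findings / n_studies:.0%}")
# Expect roughly 64%, even though every single effect is zero.
```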
The Bayesian Alternative:
Instead of asking "is p < 0.05?", Bayesians ask:
"Given all the evidence (including prior knowledge), what's our rational degree of belief that this hypothesis is true?"
This directly answers what we actually want to know, incorporates context, and isn't as easily gamed.
Summary: The Two Approaches
🔴 FREQUENTIST
Probability = frequency in repeated trials
Hypotheses are true or false (not probabilistic)
Asks: P(Data | Hypothesis)
Tool: p-values, significance testing
Ignores prior knowledge
Binary output: "significant" or "not"
vs
🟢 BAYESIAN
Probability = degree of belief
Hypotheses have probabilities (our confidence)
Asks: P(Hypothesis | Data)
Tool: Bayes' Theorem
Incorporates prior knowledge
Continuous output: probability updates
Key Takeaways
1. P(D|H) ≠ P(H|D)
The probability of the data given a hypothesis is NOT the probability of the hypothesis given the data. Confusing these is the heart of broken science.
2. P-values don't tell you what you think
p = 0.03 does NOT mean "97% chance the effect is real." It means "if there's no effect, there's a 3% chance of data this extreme."
3. Prior probability matters
A positive test for a rare disease might still mean low probability of having the disease. Context matters. Frequentists ignore this.
4. Bayesian asks what we actually want to know
"What's the probability this hypothesis is true given all the evidence?" — That's P(H|D), and only Bayesian thinking can answer it.
How This Broke Science
The frequentist approach became dominant in academic research because:
It gives a simple yes/no answer (p < 0.05 or not)
It doesn't require specifying prior beliefs (seems "objective")
It's computationally simpler (mattered before computers)
It became institutionalized — journals require it
But this created a system optimized for p-values, not truth. The result: replication crisis.
The Glassman Insight:
Real science — physics, chemistry, engineering — works because it's judged by predictive power, not p-values.
Does the bridge stand? Does the rocket fly? Does the drug actually help patients?
That's P(H|D) thinking — does reality match our predictions? — even if they don't call it Bayesian.
The "ologies" (psychology, sociology, much of medicine) got stuck on P(D|H) and p-values, and that's why they don't replicate.