Curriculum Module 2 of 8

Frequentist vs Bayesian Statistics

Clear examples that make the difference click

The Core Difference: Two Different Questions

Frequentist and Bayesian statistics ask fundamentally different questions. Understanding this is the key to everything.

🔴 FREQUENTIST

P(Data | Hypothesis)

"If my hypothesis is TRUE, how likely is this data?"

🟢 BAYESIAN

P(Hypothesis | Data)

"Given this data, how likely is my hypothesis TRUE?"

THE KEY INSIGHT:

These are NOT the same question and they give DIFFERENT answers!

P(Rain | Clouds) ≠ P(Clouds | Rain)
P(Symptoms | Disease) ≠ P(Disease | Symptoms)
P(Data | No Effect) ≠ P(No Effect | Data)

A Simple Intuition

🌧️ Weather Example

P(Wet Ground | Rain) = ~100%
If it's raining, the ground is almost certainly wet.

P(Rain | Wet Ground) = ???
If the ground is wet, is it raining? Not necessarily! Could be sprinklers, spilled water, morning dew...

These probabilities are DIFFERENT, even though they involve the same two things (rain and wet ground).

What Each Approach Believes

  • What is probability? Frequentist: long-run frequency of events. Bayesian: degree of belief/confidence.
  • Can hypotheses have probabilities? Frequentist: NO — a hypothesis is either true or false. Bayesian: YES — we have degrees of confidence.
  • What question do we answer? Frequentist: P(data | hypothesis), the p-value. Bayesian: P(hypothesis | data), what we actually want!
  • Do we use prior knowledge? Frequentist: NO — only this experiment's data. Bayesian: YES — we update prior beliefs with data.
  • What's the main tool? Frequentist: null hypothesis significance testing. Bayesian: Bayes' Theorem.
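
That last row's Bayes' Theorem is worth writing out, because every Bayesian calculation below is an application of it:

P(Hypothesis | Data) = P(Data | Hypothesis) × P(Hypothesis) / P(Data)

Here P(Hypothesis) is the prior (your belief before seeing the data) and P(Data) is the overall probability of the data under all hypotheses being considered.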

Example 1: The Medical Test

This is the classic example that shows why P(D|H) ≠ P(H|D) — and why confusing them can be dangerous.

📋 The Scenario

You get tested for a rare disease. The test comes back POSITIVE.

Here's what you know about the test:

  • Sensitivity: 99% — If you HAVE the disease, test is positive 99% of the time
  • Specificity: 95% — If you DON'T have the disease, test is negative 95% of the time (5% false positive)
  • Disease prevalence: 1 in 1000 people have it (0.1%)

Question: What's the probability you actually have the disease?

🔴 What Many People (and Doctors!) Think

"The test is 99% accurate, so there's a 99% chance I have the disease!"

Or: "Only 5% false positive rate, so 95% chance I have it!"

This is WRONG. This confuses P(Positive | Disease) with P(Disease | Positive).

🟢 The Bayesian Answer: Let's Calculate P(Disease | Positive)

Imagine 100,000 people get tested:

Of 100,000 people:
• 100 actually have the disease (0.1%)
• 99,900 don't have the disease

Test results for the 100 WITH disease:
• 99 test positive (99% sensitivity) ✓ True positives
• 1 tests negative (false negative)

Test results for the 99,900 WITHOUT disease:
• 94,905 test negative (95% specificity)
• 4,995 test positive (5% false positive rate)

Total positive tests: 99 + 4,995 = 5,094

Of those 5,094 positive tests:
• Only 99 actually have the disease
• 4,995 are false positives!

P(Disease | Positive) = 99 / 5,094 ≈ 1.9%

99% → 1.9%
THE SHOCKING RESULT:

Even with a "99% accurate" test coming back positive, there's only a 1.9% chance you actually have the disease!

Why? Because the disease is so rare (1 in 1000) that even a small false positive rate (5%) creates more false positives than true positives.
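
The same calculation in a minimal Python sketch (the variable names are just for illustration):

sensitivity = 0.99      # P(positive | disease)
specificity = 0.95      # P(negative | no disease)
prevalence  = 0.001     # P(disease) = 1 in 1000

# Overall probability of a positive test: true positives + false positives
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Bayes' Theorem: P(disease | positive)
p_disease_given_positive = sensitivity * prevalence / p_positive
print(round(p_disease_given_positive, 3))   # 0.019, i.e. about 1.9%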

Why This Matters

The frequentist gives you P(Positive | Disease) = 99% — the sensitivity.

But what you WANT to know is P(Disease | Positive) = 1.9% — your actual risk.

Doctors who don't understand this difference cause unnecessary panic, unnecessary follow-up tests, and sometimes unnecessary treatment.

The Bayesian key insight: The prior probability (how common is the disease?) matters enormously. You can't just look at test accuracy — you have to factor in how likely you were to have the disease BEFORE the test.

Example 2: Is the Coin Fair?

This example shows how the two approaches think about the same problem completely differently.

🪙 The Scenario

Someone hands you a coin. You flip it 10 times and get 8 heads.

Question: Is this coin fair (50/50) or biased?

🔴 Frequentist Approach

1. Set up the null hypothesis: "The coin is fair (P(heads) = 0.5)."

2. Ask: "IF the coin is fair, what's the probability of getting 8 or more heads in 10 flips?"
   This is P(Data | Hypothesis) — the p-value calculation.

3. Calculate: P(≥8 heads | fair coin) = 0.055 = 5.5%

4. Conclude: p = 0.055 > 0.05, so we "fail to reject the null hypothesis."
   Translation: "We can't say the coin is biased" (barely!)

Frequentist says: "There's a 5.5% chance of seeing this result if the coin is fair. Since that's above our arbitrary 5% threshold, we can't conclude it's biased."

What frequentist CANNOT say: "What's the probability the coin is biased?" — That question is meaningless to a frequentist because the coin either IS or ISN'T biased. It's not probabilistic.
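
The p-value step in a short Python sketch (assuming scipy is available):

from scipy.stats import binom

# P(8 or more heads in 10 flips | fair coin) -- the one-sided p-value
p_value = binom.sf(7, 10, 0.5)   # sf(7) = P(X >= 8)
print(round(p_value, 3))         # 0.055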

🟢 Bayesian Approach

1. Start with a prior belief: "Before seeing any data, what do I believe about this coin?"
   Maybe: 90% chance it's fair (most coins are), 10% chance it's biased.

2. Observe the data: 8 heads in 10 flips.

3. Update the belief using Bayes' Theorem:
   How much more likely is this data if the coin is biased vs fair?

4. Calculate the posterior: given the data, maybe now 70% fair, 30% biased.
   (Exact numbers depend on what "biased" means and on your prior; see the sketch below.)

Bayesian says: "After seeing 8 heads, my confidence that the coin is fair dropped from 90% to 70%. I now think there's a 30% chance it's biased. If I flip more and keep getting lots of heads, that probability will increase."

What Bayesian CAN say: "There's a 30% probability the coin is biased." This is P(Hypothesis | Data) — exactly what we want to know!
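
A minimal sketch of that update, assuming scipy and taking "biased" to mean P(heads) = 0.75 (that choice is illustrative, not from the scenario):

from scipy.stats import binom

# Assumed prior: 90% fair, 10% biased, where "biased" means P(heads) = 0.75
prior_fair, prior_biased = 0.90, 0.10

# Likelihood of 8 heads in 10 flips under each hypothesis
like_fair   = binom.pmf(8, 10, 0.50)
like_biased = binom.pmf(8, 10, 0.75)

# Bayes' Theorem: posterior is proportional to likelihood x prior
posterior_fair = (like_fair * prior_fair) / (like_fair * prior_fair + like_biased * prior_biased)
print(round(posterior_fair, 2))   # 0.58 fair, so 0.42 biased under this model

With this particular model the posterior lands near 60% fair rather than the 70% quoted above; as noted, the exact number depends on how "biased" is modeled. Changing prior_fair shows how the same 8 heads leads to different conclusions under different priors.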

The Key Differences

  • Question asked: the frequentist asks how likely this data is if the coin is fair; the Bayesian asks how likely the coin is biased given this data.
  • Uses prior info? Frequentist: no, only this experiment. Bayesian: yes, it incorporates what we knew before.
  • Can it update with more flips? Frequentist: needs a new experiment and a new p-value. Bayesian: naturally, each flip updates the probability.
  • Answer format: the frequentist reports "reject" or "fail to reject" the null; the Bayesian reports "X% probability the hypothesis is true."
Context matters to Bayesians:

If this coin came from a magic shop known for trick coins, your prior might be 50% biased.
If it came from the US Mint, your prior might be 99.9% fair.

Same data (8 heads) → different conclusions based on reasonable prior knowledge.
Frequentists ignore this context entirely.

Example 3: The Drug Trial

This shows how the p-value approach leads researchers and the public astray.

💊 The Scenario

A pharmaceutical company tests a new drug for headaches.

  • 200 people get the drug, 200 get placebo
  • Drug group: 60% report headache relief
  • Placebo group: 50% report headache relief
  • Difference: 10 percentage points
  • Statistical test gives: p = 0.04
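
As a quick sanity check on the stated p-value, here is a standard two-proportion z-test in Python (a sketch; the scenario doesn't say which test the company actually used):

from math import sqrt
from scipy.stats import norm

relief_drug, n_drug       = 120, 200   # 60% relief
relief_placebo, n_placebo = 100, 200   # 50% relief

p1, p2 = relief_drug / n_drug, relief_placebo / n_placebo
pooled = (relief_drug + relief_placebo) / (n_drug + n_placebo)

# Two-proportion z-test under the null hypothesis of no difference
z = (p1 - p2) / sqrt(pooled * (1 - pooled) * (1 / n_drug + 1 / n_placebo))
p_value = 2 * norm.sf(z)   # two-sided p-value
print(round(p_value, 3))   # about 0.04
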
🔴 Frequentist Interpretation

What p = 0.04 actually means:

"IF the drug has NO effect (null hypothesis), there's only a 4% chance of seeing a difference this large or larger by random chance."

Since p < 0.05, we "reject the null" and conclude:

"The drug has a statistically significant effect!"

What people THINK this means:

  • ❌ "There's a 96% chance the drug works"
  • ❌ "There's only a 4% chance this result is a fluke"
  • ❌ "The drug is 96% likely to be effective"

ALL WRONG! P-value does NOT tell you probability the drug works.

🟢 Bayesian Interpretation

A Bayesian would ask: "What's P(Drug Works | This Data)?"

To answer this, we need to consider:

1. Prior probability: before this trial, what was our belief?
   Most drug candidates fail. Maybe 10% of drugs that reach trials actually work.

2. Likelihood of the data under each hypothesis:
   How likely is 60% vs 50% relief if the drug works? If it doesn't work?

3. Update using Bayes' Theorem:
   Calculate the posterior probability.

Depending on priors and assumptions, the actual probability the drug works might be something like 30-50% — much lower than the intuitive "96%" people assume from p = 0.04!
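
One rough way to put a number on this is the Sellke-Bayarri-Berger bound, which converts a p-value into an upper limit on how much the data can favor the drug. A sketch under that assumption, using the 10% prior suggested above:

from math import e, log

p = 0.04
prior = 0.10                             # assumed: ~10% of drugs at this stage actually work

# Upper bound on the Bayes factor implied by a p-value (valid for p < 1/e)
max_bayes_factor = 1 / (-e * p * log(p))             # about 2.9 for p = 0.04

prior_odds = prior / (1 - prior)
posterior_odds = prior_odds * max_bayes_factor
posterior = posterior_odds / (1 + posterior_odds)
print(round(posterior, 2))               # 0.24 at most -- nowhere near 0.96

With a more generous prior (say 20-25%), the same bound lands in the 40-50% range, consistent with the ballpark above and still far below the intuitive "96%."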

Why This Matters for Broken Science:

When you hear "Study finds drug X works (p < 0.05)", your brain thinks "95%+ chance it works."

Reality: Depending on prior probability and study design, actual chance might be 30%, 50%, maybe even less.

This is why 50-90% of findings don't replicate — p < 0.05 doesn't mean what people think it means!

The Replication Problem Explained

🔬 Why Do "Significant" Findings Fail to Replicate?

Imagine 1000 researchers each testing a different potential drug. Assume:

  • Only 10% of the drugs actually work (100 drugs)
  • 90% don't work (900 drugs)
  • Studies have 80% power to detect a real effect
  • Using p < 0.05 threshold
Results:

100 drugs that actually work:
• 80 get p < 0.05 (true positives) ✓
• 20 get p > 0.05 (false negatives)

900 drugs that don't work:
• 855 get p > 0.05 (true negatives)
• 45 get p < 0.05 (false positives!)

Total "significant" findings: 80 + 45 = 125
Of those, actually true: 80/125 = 64%

So even with good study design, 36% of "significant" findings are FALSE!

If the prior is worse (only 5% of drugs work), or studies are lower powered, or researchers p-hack, the false discovery rate gets even higher — explaining the replication crisis.
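
The arithmetic above boils down to a small formula for how many "significant" findings are actually true. A sketch in Python, using the prior, power, and alpha listed above:

def ppv(prior, power=0.80, alpha=0.05):
    """Fraction of p < alpha findings that reflect a real effect."""
    true_positives  = prior * power
    false_positives = (1 - prior) * alpha
    return true_positives / (true_positives + false_positives)

print(round(ppv(0.10), 2))              # 0.64 -- the 64% from the example above
print(round(ppv(0.05), 2))              # 0.46 -- rarer true effects, worse odds
print(round(ppv(0.10, power=0.30), 2))  # 0.4  -- underpowered studies, worse still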

Why P-Values Mislead

Now you can see exactly why the p-value approach creates broken science.

What P-Value Actually Is

p-value = P(data this extreme or more | null hypothesis true)

"IF there's no effect, what's the probability of seeing results this extreme?"

What P-Value Is NOT

❌ P-value is NOT:

  • The probability that the null hypothesis is true
  • The probability that the finding is a fluke
  • The probability that you'll get the same result if you replicate
  • The probability that your hypothesis is correct
  • The "strength" of the evidence

Common P-Value Fallacies

Fallacy 1: "p = 0.03 means 97% chance the effect is real"

Wrong. P(Data | No Effect) ≠ P(No Effect | Data)

The p-value tells you about the data assuming no effect. It tells you nothing about the probability of there being an effect given the data.

Fallacy 2: "p = 0.01 is stronger evidence than p = 0.04"

Not necessarily. The p-value doesn't measure strength of evidence. A tiny effect in a huge sample can give p = 0.001 while being practically meaningless. A large effect in a small sample might give p = 0.06 while being potentially important.
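
A quick illustration of that first point, assuming scipy: a half-percentage-point difference, practically meaningless for a headache drug, becomes "highly significant" once the sample is large enough.

from math import sqrt
from scipy.stats import norm

# 50.5% vs 50.0% relief -- a trivial difference -- with a million people per group
p1, p2, n = 0.505, 0.500, 1_000_000
pooled = (p1 + p2) / 2
z = (p1 - p2) / sqrt(pooled * (1 - pooled) * (2 / n))
p_value = 2 * norm.sf(z)
print(p_value)   # roughly 1.5e-12: tiny p-value, trivial effect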

Fallacy 3: "p > 0.05 means there's no effect"

Wrong. "Absence of evidence is not evidence of absence." A non-significant p-value might just mean your sample was too small to detect a real effect.

Fallacy 4: "Two studies with p = 0.03 and p = 0.06 found opposite results"

Wrong. They might have found very similar effect sizes! The difference between "significant" and "not significant" is often not itself statistically significant. The p = 0.05 cutoff is arbitrary.

How P-Hacking Works

Because p < 0.05 is the goal (for publication), researchers consciously or unconsciously manipulate analyses to achieve it:

P-Hacking Techniques:
  • Optional stopping: Keep collecting data until p < 0.05
  • Multiple comparisons: Test 20 things, report the one that "works"
  • Outlier removal: Remove data points that hurt your p-value
  • Subgroup analysis: "It didn't work overall, but in left-handed women over 40..."
  • Outcome switching: Original outcome didn't work? Try a different one!
  • Analytical flexibility: Try different statistical tests until one gives p < 0.05

With enough flexibility, you can get p < 0.05 for almost anything — even if there's no real effect. This is why so much "significant" research doesn't replicate.
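
A small simulation of just one of these tricks, multiple comparisons, assuming numpy and scipy: measure 20 unrelated outcomes on two groups drawn from the same distribution (so there is no real effect anywhere) and ask how often at least one outcome comes out "significant."

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_experiments, hits = 1000, 0

for _ in range(n_experiments):
    found_something = False
    for _ in range(20):                    # 20 outcomes, all pure noise
        a = rng.normal(size=100)
        b = rng.normal(size=100)
        if ttest_ind(a, b).pvalue < 0.05:
            found_something = True
    hits += found_something

print(hits / n_experiments)   # about 0.64, i.e. 1 - 0.95**20: a "discovery" in ~2 of every 3 tries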

The Bayesian Alternative:

Instead of asking "is p < 0.05?", Bayesians ask:
"Given all the evidence (including prior knowledge), what's our rational degree of belief that this hypothesis is true?"

This directly answers what we actually want to know, incorporates context, and isn't as easily gamed.

Summary: The Two Approaches

🔴 FREQUENTIST

  • Probability = frequency in repeated trials
  • Hypotheses are true or false (not probabilistic)
  • Asks: P(Data | Hypothesis)
  • Tool: p-values, significance testing
  • Ignores prior knowledge
  • Binary output: "significant" or "not"
vs

🟢 BAYESIAN

  • Probability = degree of belief
  • Hypotheses have probabilities (our confidence)
  • Asks: P(Hypothesis | Data)
  • Tool: Bayes' Theorem
  • Incorporates prior knowledge
  • Continuous output: probability updates

Key Takeaways

1. P(D|H) ≠ P(H|D)
The probability of the data given a hypothesis is NOT the probability of the hypothesis given the data. Confusing these is the heart of broken science.
2. P-values don't tell you what you think
p = 0.03 does NOT mean "97% chance the effect is real." It means "if there's no effect, there's a 3% chance of data this extreme."
3. Prior probability matters
A positive test for a rare disease might still mean low probability of having the disease. Context matters. Frequentists ignore this.
4. Bayesian asks what we actually want to know
"What's the probability this hypothesis is true given all the evidence?" — That's P(H|D), and only Bayesian thinking can answer it.

How This Broke Science

The frequentist approach became dominant in academic research. But this created a system optimized for p-values, not truth. The result: the replication crisis.

The Glassman Insight:

Real science — physics, chemistry, engineering — works because it's judged by predictive power, not p-values.

Does the bridge stand? Does the rocket fly? Does the drug actually help patients?

That's P(H|D) thinking — does reality match our predictions? — even if they don't call it Bayesian.

The "ologies" (psychology, sociology, much of medicine) got stuck on P(D|H) and p-values, and that's why they don't replicate.