Naïve Bayes

Same students. This time we use probability — not rules, not distance.

1 · The Idea
2 · Bayes Theorem
3 · Count Probs
4 · Walkthrough
5 · Try It!

🧠 The Big Idea

Naïve Bayes asks: "Given what I know about this student, which outcome is more probable — Pass or Fail?"


It calculates two probabilities — one for Pass, one for Fail — and picks the higher one. That's the prediction.


Unlike KNN (which measures distance) or a Decision Tree (which learns rules), Naïve Bayes works entirely with probabilities counted from the data.

☂️ Analogy — The Weather Doctor

A doctor sees you and asks: "Does it hurt? Any fever? Cough?"


She doesn't measure how far you are from other patients. She doesn't follow a flowchart. She thinks:


"Given this combination of symptoms, what is the probability of flu vs cold? Which is higher?"


She picks the most probable diagnosis. That's Naïve Bayes — a probability doctor.

📊 Same 5 Students — We Add Categories

Naïve Bayes works best with categories. We convert our numbers into buckets:

Student   Hours      Attendance   Marks      Result
A         Low (2)    Low (60)     Low (45)   Fail
B         High (5)   High (75)    High (60)  Pass
C         High (8)   High (90)    High (80)  Pass
D         Low (1)    Low (50)     Low (35)   Fail
E         High (6)   High (80)    High (70)  Pass

Rule: Hours ≥ 4 = High, else Low  |  Attendance ≥ 70 = High, else Low  |  Marks ≥ 55 = High, else Low
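
If you want to reproduce the bucketing yourself, here is a minimal Python sketch of that rule (the function and key names are just illustrative, not part of the dataset):

def bucketize(hours, attendance, marks):
    # Thresholds from the rule above: Hours ≥ 4, Attendance ≥ 70, Marks ≥ 55
    return {
        "hours": "High" if hours >= 4 else "Low",
        "attendance": "High" if attendance >= 70 else "Low",
        "marks": "High" if marks >= 55 else "Low",
    }

print(bucketize(2, 60, 45))   # Student A → all Low
print(bucketize(5, 75, 60))   # Student B → all High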

⚔️ Three Algorithms — One Dataset

🌳 Decision Tree

  • Builds if-else rules
  • Uses entropy + info gain
  • Picks the best split

🔵 KNN

  • Stores all data
  • Measures distance
  • Majority vote of neighbors

🎲 Naïve Bayes

  • Counts probabilities
  • Uses Bayes' theorem
  • Picks highest probability

📐 Bayes' Theorem

The core formula. Don't panic — we'll break it down word by word.

P(Class | Features) = P(Features | Class) × P(Class)
                      ───────────────────────────────
                                P(Features)

In plain English:

"Probability of this class, GIVEN these features"
    = (How often these features appear in this class × How common this class is)
      ÷ How common these features are overall

We calculate this for Pass and for Fail. Whichever is bigger — that's the answer.


Good news: Since P(Features) is the same for both, we can skip it! We just compare the numerators.

🔤 Each Term Explained

P(Class) · the Prior
  How common is Pass or Fail overall? e.g. 3/5 = 60% Pass.

P(Feature | Class) · the Likelihood
  Among Pass students, how many had High Hours? e.g. 3/3 = 100%.

P(Class | Features) · the Posterior
  What we want! The probability of Pass GIVEN the student's features.

🤔 Why "Naïve"?

Because it makes a naïve assumption: all features are independent of each other.


In reality, students who study more also tend to have higher marks. The two are correlated. But Naïve Bayes ignores that and treats each feature as if it has nothing to do with the others.


This lets us multiply the probabilities together:

P(Hours=High AND Att=High AND Marks=High | Pass)
    = P(Hours=High | Pass) × P(Att=High | Pass) × P(Marks=High | Pass)

Surprisingly, this naïve simplification still works very well in practice — especially for text classification like spam detection.
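
In code, that naïve assumption is literally just multiplying the per-feature likelihoods together. A minimal sketch (the 0.80 values are the smoothed Pass likelihoods computed later in the walkthrough):

def joint_likelihood(per_feature_probs):
    # Naïve assumption: joint likelihood = product of per-feature likelihoods
    result = 1.0
    for p in per_feature_probs:
        result *= p
    return result

# P(Hours=High | Pass) × P(Att=High | Pass) × P(Marks=High | Pass)
print(joint_likelihood([0.80, 0.80, 0.80]))   # ≈ 0.512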

📊 Step 1 — Count the Classes (Prior Probability)

From our 5 students, how many Passed and Failed?

Pass: 3/5 (B, C, E) → P(Pass) = 0.60
Fail: 2/5 (A, D)    → P(Fail) = 0.40

These are the prior probabilities — our starting belief before seeing any features.
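
Counting the priors takes a couple of lines. A minimal sketch, assuming we just have the list of outcomes for students A to E:

results = ["Fail", "Pass", "Pass", "Fail", "Pass"]   # students A–E

p_pass = results.count("Pass") / len(results)   # 3/5 = 0.6
p_fail = results.count("Fail") / len(results)   # 2/5 = 0.4
print(p_pass, p_fail)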

📊 Step 2 — Count Feature Likelihoods

For each feature value, we count: how often does it appear within each class?


Among the 3 Pass students (B, C, E):

Feature      Value   Count     Probability
Hours        High    3 of 3    3/3 = 1.00
Hours        Low     0 of 3    0/3 = 0.00 *
Attendance   High    3 of 3    3/3 = 1.00
Attendance   Low     0 of 3    0/3 = 0.00 *
Marks        High    3 of 3    3/3 = 1.00
Marks        Low     0 of 3    0/3 = 0.00 *

Among the 2 Fail students (A, D):

Feature      Value   Count     Probability
Hours        High    0 of 2    0/2 = 0.00 *
Hours        Low     2 of 2    2/2 = 1.00
Attendance   High    0 of 2    0/2 = 0.00 *
Attendance   Low     2 of 2    2/2 = 1.00
Marks        High    0 of 2    0/2 = 0.00 *
Marks        Low     2 of 2    2/2 = 1.00

* Zero probabilities would kill the entire calculation (multiplying by 0). Fix: use Laplace Smoothing — add 1 to every count. See Tab 4.

💊 Laplace Smoothing — Fixing Zeros

When a feature value never appeared in a class, P = 0, which zeros out the whole calculation. The fix: add 1 to every count and add the number of possible values to the denominator.

Without smoothing:
  P(Hours=Low  | Pass) = 0/3 = 0              ← dangerous!

With Laplace (+1):
  P(Hours=Low  | Pass) = (0+1)/(3+2) = 1/5 = 0.20
  P(Hours=High | Pass) = (3+1)/(3+2) = 4/5 = 0.80
  P(Hours=Low  | Fail) = (2+1)/(2+2) = 3/4 = 0.75
  P(Hours=High | Fail) = (0+1)/(2+2) = 1/4 = 0.25

(+2 in the denominator because Hours has 2 possible values: High, Low)
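
The same fix as a tiny helper, handy for checking the numbers above (the function name and the default of 2 values per feature are assumptions for this example):

def smoothed_prob(count, class_size, n_values=2):
    # Laplace smoothing: +1 to the count, +n_values to the denominator
    return (count + 1) / (class_size + n_values)

print(smoothed_prob(0, 3))   # P(Hours=Low  | Pass) = 1/5 = 0.20
print(smoothed_prob(3, 3))   # P(Hours=High | Pass) = 4/5 = 0.80
print(smoothed_prob(2, 2))   # P(Hours=Low  | Fail) = 3/4 = 0.75
print(smoothed_prob(0, 2))   # P(Hours=High | Fail) = 1/4 = 0.25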

🚶 Full Walkthrough — Predict Student X

New student X = Hours: High (4), Attendance: High (70), Marks: High (55)


We calculate the Naïve Bayes score for Pass and for Fail, then compare.

Step 1 · Prior probabilities from training data

📦 Prior Probabilities

Total students = 5
Pass students  = 3 → P(Pass) = 3/5 = 0.60
Fail students  = 2 → P(Fail) = 2/5 = 0.40
Step 2 · Likelihood of each feature (with Laplace smoothing)

🔢 Feature Likelihoods

Student X has: Hours=High, Attendance=High, Marks=High

Given PASS

P(Hours=High | Pass) = (3+1)/(3+2) = 4/5 = 0.80
P(Att=High   | Pass) = (3+1)/(3+2) = 4/5 = 0.80
P(Marks=High | Pass) = (3+1)/(3+2) = 4/5 = 0.80

Given FAIL

P(Hours=High | Fail) = (0+1)/(2+2) = 1/4 = 0.25
P(Att=High   | Fail) = (0+1)/(2+2) = 1/4 = 0.25
P(Marks=High | Fail) = (0+1)/(2+2) = 1/4 = 0.25
Step 3 · Multiply everything together

✖️ Naïve Bayes Score

Multiply all likelihoods × prior. Compare the two scores.

Score for PASS

P(Pass) × P(H|Pass) × P(A|Pass) × P(M|Pass)
  = 0.60 × 0.80 × 0.80 × 0.80
  = 0.3072

Score for FAIL

P(Fail) × P(H|Fail) × P(A|Fail) × P(M|Fail)
  = 0.40 × 0.25 × 0.25 × 0.25
  = 0.00625

✅ PASS wins! (0.3072 > 0.00625)

Pass score is ~49× larger → Student X is predicted: PASS
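
The same arithmetic in a few lines of Python, using the values from the steps above:

score_pass = 0.60 * 0.80 * 0.80 * 0.80   # prior × three likelihoods ≈ 0.3072
score_fail = 0.40 * 0.25 * 0.25 * 0.25   # prior × three likelihoods = 0.00625

print("PASS" if score_pass > score_fail else "FAIL")   # PASS
print(score_pass / score_fail)                          # ≈ 49× larger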

Step 4 (optional) · Convert the scores to real probabilities

📊 Normalize to Get Real Probabilities

To get actual probabilities (not just scores), divide each by the total:

Total = 0.3072 + 0.00625 = 0.31345

P(Pass | X) = 0.3072  / 0.31345 ≈ 0.980 → 98%
P(Fail | X) = 0.00625 / 0.31345 ≈ 0.020 → 2%

Prediction: PASS with 98% confidence ✅
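
Normalizing is just a divide-by-the-total; a quick sketch to verify the percentages:

score_pass, score_fail = 0.3072, 0.00625

total = score_pass + score_fail            # 0.31345
print(round(score_pass / total, 3))        # 0.98  → P(Pass | X)
print(round(score_fail / total, 3))        # 0.02  → P(Fail | X)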

🧮 As Code

def naive_bayes(new_student, data):
    classes = ["Pass", "Fail"]
    scores = {}
    for cls in classes:
        # All training students that belong to this class
        subset = [s for s in data if s["result"] == cls]

        # Prior: how common is this class?
        prior = len(subset) / len(data)

        # Likelihood: for each feature, count matches with Laplace smoothing
        likelihood = 1.0
        for feature in new_student:
            count = sum(1 for s in subset if s[feature] == new_student[feature])
            n_vals = 2   # High or Low (number of possible values)
            likelihood *= (count + 1) / (len(subset) + n_vals)

        scores[cls] = prior * likelihood

    return max(scores, key=scores.get)   # class with the highest score
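
And a quick usage sketch with the five training students (the dict keys "hours", "attendance", "marks" and "result" are naming assumptions made so the function above runs as-is):

data = [
    {"hours": "Low",  "attendance": "Low",  "marks": "Low",  "result": "Fail"},  # A
    {"hours": "High", "attendance": "High", "marks": "High", "result": "Pass"},  # B
    {"hours": "High", "attendance": "High", "marks": "High", "result": "Pass"},  # C
    {"hours": "Low",  "attendance": "Low",  "marks": "Low",  "result": "Fail"},  # D
    {"hours": "High", "attendance": "High", "marks": "High", "result": "Pass"},  # E
]

student_x = {"hours": "High", "attendance": "High", "marks": "High"}
print(naive_bayes(student_x, data))   # → Pass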

🔮 Try Naïve Bayes — Predict a New Student

Adjust the sliders. The algorithm shows you both probability scores in real time.

[Interactive demo: sliders for Hours (4 hrs), Attendance (70%), Marks (55), with live probability bars for P(Pass) and P(Fail)]

🧠 Key Takeaways

🔁 All Three Algorithms — Same Student X

🌳 Decision Tree

  • Checked Attendance ≥ 70? → YES
  • Checked Hours ≥ 4? → YES
  • → PASS

🔵 KNN (K=3)

  • Nearest: B (Pass), A (Fail), E (Pass)
  • Vote: 2 Pass, 1 Fail
  • → PASS

🎲 Naïve Bayes

  • Score Pass: 0.307
  • Score Fail: 0.006
  • → PASS (98%)

All three algorithms agree on Student X. They'll disagree more on edge cases — that's where algorithm choice matters.