Naïve Bayes

Same students. This time we use probability — not rules, not distance.

1 · The Idea
2 · Bayes Theorem
3 · Count Probs
4 · Walkthrough
5 · Try It!

🧠 The Big Idea

Naïve Bayes asks: "Given what I know about this student, which outcome is more probable — Pass or Fail?"


It calculates two probabilities — one for Pass, one for Fail — and picks the higher one. That's the prediction.


Unlike KNN (which measures distance) or a Decision Tree (which learns rules), Naïve Bayes works entirely with probabilities counted from the data.

☂️ Analogy — The Weather Doctor

A doctor sees you and asks: "Does it hurt? Any fever? Cough?"


She doesn't measure how far you are from other patients. She doesn't follow a flowchart. She thinks:


"Given this combination of symptoms, what is the probability of flu vs cold? Which is higher?"


She picks the most probable diagnosis. That's Naïve Bayes — a probability doctor.

📊 Same 5 Students — We Add Categories

Naïve Bayes works best with categories. We convert our numbers into buckets:

Student   Hours      Attendance   Marks      Result
A         Low (2)    Low (60)     Low (45)   Fail
B         High (5)   High (75)    High (60)  Pass
C         High (8)   High (90)    High (80)  Pass
D         Low (1)    Low (50)     Low (35)   Fail
E         High (6)   High (80)    High (70)  Pass

Rule: Hours ≥ 4 = High, else Low  |  Attendance ≥ 70 = High, else Low  |  Marks ≥ 55 = High, else Low
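
If you want to reproduce the bucketing yourself, here is a minimal Python sketch of that rule (the function and key names are just illustrative, not part of the dataset):

def bucketize(hours, attendance, marks):
    # Thresholds from the rule above: Hours ≥ 4, Attendance ≥ 70, Marks ≥ 55
    return {
        "hours": "High" if hours >= 4 else "Low",
        "attendance": "High" if attendance >= 70 else "Low",
        "marks": "High" if marks >= 55 else "Low",
    }

print(bucketize(2, 60, 45))   # Student A → all Low
print(bucketize(5, 75, 60))   # Student B → all High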

⚔️ Three Algorithms — One Dataset

🌳 Decision Tree

  • Builds if-else rules
  • Uses entropy + info gain
  • Picks the best split

🔵 KNN

  • Stores all data
  • Measures distance
  • Majority vote of neighbors

🎲 Naïve Bayes

  • Counts probabilities
  • Uses Bayes' theorem
  • Picks highest probability

📐 Bayes' Theorem

The core formula. Don't panic — we'll break it down word by word.

P(Class | Features) = P(Features | Class) × P(Class)
                      ───────────────────────────────
                                P(Features)

In plain English:

"Probability of this class, GIVEN these features"
    = (How often these features appear in this class × How common this class is)
      ÷ How common these features are overall

We calculate this for Pass and for Fail. Whichever is bigger — that's the answer.


Good news: Since P(Features) is the same for both, we can skip it! We just compare the numerators.

🔤 Each Term Explained

P(Class) · the Prior
  How common is Pass or Fail overall? e.g. 3/5 = 60% Pass.

P(Feature | Class) · the Likelihood
  Among Pass students, how many had High Hours? e.g. 3/3 = 100%.

P(Class | Features) · the Posterior
  What we want! The probability of Pass GIVEN the student's features.

🤔 Why "Naïve"?

Because it makes a naïve assumption: all features are independent of each other.


In reality, students who study more also tend to have higher marks. The two are correlated. But Naïve Bayes ignores that and treats each feature as if it has nothing to do with the others.


This lets us multiply the probabilities together:

P(Hours=High AND Att=High AND Marks=High | Pass)
    = P(Hours=High | Pass) × P(Att=High | Pass) × P(Marks=High | Pass)

Surprisingly, this naïve simplification still works very well in practice — especially for text classification like spam detection.
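
In code, that naïve assumption is literally just multiplying the per-feature likelihoods together. A minimal sketch (the 0.80 values are the smoothed Pass likelihoods computed later in the walkthrough):

def joint_likelihood(per_feature_probs):
    # Naïve assumption: joint likelihood = product of per-feature likelihoods
    result = 1.0
    for p in per_feature_probs:
        result *= p
    return result

# P(Hours=High | Pass) × P(Att=High | Pass) × P(Marks=High | Pass)
print(joint_likelihood([0.80, 0.80, 0.80]))   # ≈ 0.512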

📊 Step 1 — Count the Classes (Prior Probability)

From our 5 students, how many Passed and Failed?

Pass: 3/5 (B, C, E) → P(Pass) = 0.60
Fail: 2/5 (A, D)    → P(Fail) = 0.40

These are the prior probabilities — our starting belief before seeing any features.
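
Counting the priors takes a couple of lines. A minimal sketch, assuming we just have the list of outcomes for students A to E:

results = ["Fail", "Pass", "Pass", "Fail", "Pass"]   # students A–E

p_pass = results.count("Pass") / len(results)   # 3/5 = 0.6
p_fail = results.count("Fail") / len(results)   # 2/5 = 0.4
print(p_pass, p_fail)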

📊 Step 2 — Count Feature Likelihoods

For each feature value, we count: how often does it appear within each class?


Among the 3 Pass students (B, C, E):

Feature      Value   Count     Probability
Hours        High    3 of 3    3/3 = 1.00
Hours        Low     0 of 3    0/3 = 0.00 *
Attendance   High    3 of 3    3/3 = 1.00
Attendance   Low     0 of 3    0/3 = 0.00 *
Marks        High    3 of 3    3/3 = 1.00
Marks        Low     0 of 3    0/3 = 0.00 *

Among the 2 Fail students (A, D):

Feature      Value   Count     Probability
Hours        High    0 of 2    0/2 = 0.00 *
Hours        Low     2 of 2    2/2 = 1.00
Attendance   High    0 of 2    0/2 = 0.00 *
Attendance   Low     2 of 2    2/2 = 1.00
Marks        High    0 of 2    0/2 = 0.00 *
Marks        Low     2 of 2    2/2 = 1.00

* Zero probabilities would kill the entire calculation (multiplying by 0). Fix: use Laplace Smoothing — add 1 to every count. See Tab 4.

💊 Laplace Smoothing — Fixing Zeros

When a feature value never appeared in a class, P = 0, which zeros out the whole calculation. The fix: add 1 to every count and add the number of possible values to the denominator.

Without smoothing:
  P(Hours=Low  | Pass) = 0/3 = 0              ← dangerous!

With Laplace (+1):
  P(Hours=Low  | Pass) = (0+1)/(3+2) = 1/5 = 0.20
  P(Hours=High | Pass) = (3+1)/(3+2) = 4/5 = 0.80
  P(Hours=Low  | Fail) = (2+1)/(2+2) = 3/4 = 0.75
  P(Hours=High | Fail) = (0+1)/(2+2) = 1/4 = 0.25

(+2 in the denominator because Hours has 2 possible values: High, Low)
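
The same fix as a tiny helper, handy for checking the numbers above (the function name and the default of 2 values per feature are assumptions for this example):

def smoothed_prob(count, class_size, n_values=2):
    # Laplace smoothing: +1 to the count, +n_values to the denominator
    return (count + 1) / (class_size + n_values)

print(smoothed_prob(0, 3))   # P(Hours=Low  | Pass) = 1/5 = 0.20
print(smoothed_prob(3, 3))   # P(Hours=High | Pass) = 4/5 = 0.80
print(smoothed_prob(2, 2))   # P(Hours=Low  | Fail) = 3/4 = 0.75
print(smoothed_prob(0, 2))   # P(Hours=High | Fail) = 1/4 = 0.25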

🚶 Full Walkthrough — Predict Student X

New student X = Hours: High (4), Attendance: High (70), Marks: High (55)


We calculate the Naïve Bayes score for Pass and for Fail, then compare.

Step 1 · Prior probabilities from training data

📦 Prior Probabilities

Total students = 5
Pass students  = 3 → P(Pass) = 3/5 = 0.60
Fail students  = 2 → P(Fail) = 2/5 = 0.40
Step 2 · Likelihood of each feature (with Laplace smoothing)

🔢 Feature Likelihoods

Student X has: Hours=High, Attendance=High, Marks=High

Given PASS

P(Hours=High | Pass) = (3+1)/(3+2) = 4/5 = 0.80
P(Att=High   | Pass) = (3+1)/(3+2) = 4/5 = 0.80
P(Marks=High | Pass) = (3+1)/(3+2) = 4/5 = 0.80

Given FAIL

P(Hours=High | Fail) = (0+1)/(2+2) = 1/4 = 0.25
P(Att=High   | Fail) = (0+1)/(2+2) = 1/4 = 0.25
P(Marks=High | Fail) = (0+1)/(2+2) = 1/4 = 0.25
Step 3 · Multiply everything together

✖️ Naïve Bayes Score

Multiply all likelihoods × prior. Compare the two scores.

Score for PASS

P(Pass) × P(H|Pass) × P(A|Pass) × P(M|Pass)
  = 0.60 × 0.80 × 0.80 × 0.80
  = 0.3072

Score for FAIL

P(Fail) × P(H|Fail) × P(A|Fail) × P(M|Fail)
  = 0.40 × 0.25 × 0.25 × 0.25
  = 0.00625

✅ PASS wins! (0.3072 > 0.00625)

Pass score is ~49× larger → Student X is predicted: PASS
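
The same arithmetic in a few lines of Python, using the values from the steps above:

score_pass = 0.60 * 0.80 * 0.80 * 0.80   # prior × three likelihoods ≈ 0.3072
score_fail = 0.40 * 0.25 * 0.25 * 0.25   # prior × three likelihoods = 0.00625

print("PASS" if score_pass > score_fail else "FAIL")   # PASS
print(score_pass / score_fail)                          # ≈ 49× larger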

Step 4 (optional) · Convert the scores to real probabilities

📊 Normalize to Get Real Probabilities

To get actual probabilities (not just scores), divide each by the total:

Total = 0.3072 + 0.00625 = 0.31345

P(Pass | X) = 0.3072  / 0.31345 ≈ 0.980 → 98%
P(Fail | X) = 0.00625 / 0.31345 ≈ 0.020 → 2%

Prediction: PASS with 98% confidence ✅
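
Normalizing is just a divide-by-the-total; a quick sketch to verify the percentages:

score_pass, score_fail = 0.3072, 0.00625

total = score_pass + score_fail            # 0.31345
print(round(score_pass / total, 3))        # 0.98  → P(Pass | X)
print(round(score_fail / total, 3))        # 0.02  → P(Fail | X)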

🧮 As Code

def naive_bayes(new_student, data):
    classes = ["Pass", "Fail"]
    scores = {}
    for cls in classes:
        # All training students that belong to this class
        subset = [s for s in data if s["result"] == cls]

        # Prior: how common is this class?
        prior = len(subset) / len(data)

        # Likelihood: for each feature, count matches with Laplace smoothing
        likelihood = 1.0
        for feature in new_student:
            count = sum(1 for s in subset if s[feature] == new_student[feature])
            n_vals = 2   # High or Low (number of possible values)
            likelihood *= (count + 1) / (len(subset) + n_vals)

        scores[cls] = prior * likelihood

    return max(scores, key=scores.get)   # class with the highest score
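
And a quick usage sketch with the five training students (the dict keys "hours", "attendance", "marks" and "result" are naming assumptions made so the function above runs as-is):

data = [
    {"hours": "Low",  "attendance": "Low",  "marks": "Low",  "result": "Fail"},  # A
    {"hours": "High", "attendance": "High", "marks": "High", "result": "Pass"},  # B
    {"hours": "High", "attendance": "High", "marks": "High", "result": "Pass"},  # C
    {"hours": "Low",  "attendance": "Low",  "marks": "Low",  "result": "Fail"},  # D
    {"hours": "High", "attendance": "High", "marks": "High", "result": "Pass"},  # E
]

student_x = {"hours": "High", "attendance": "High", "marks": "High"}
print(naive_bayes(student_x, data))   # → Pass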

🔮 Try Naïve Bayes — Predict a New Student

Adjust the sliders. The algorithm shows you both probability scores in real time.

[Interactive demo: sliders for Hours (4 hrs), Attendance (70%), Marks (55), with live probability bars for P(Pass) and P(Fail)]

🧠 Key Takeaways

🔁 All Three Algorithms — Same Student X

🌳 Decision Tree

  • Checked Attendance ≥ 70? → YES
  • Checked Hours ≥ 4? → YES
  • → PASS

🔵 KNN (K=3)

  • Nearest: B (Pass), A (Fail), E (Pass)
  • Vote: 2 Pass, 1 Fail
  • → PASS

🎲 Naïve Bayes

  • Score Pass: 0.307
  • Score Fail: 0.006
  • → PASS (98%)

All three algorithms agree on Student X. They'll disagree more on edge cases — that's where algorithm choice matters.