Same students. This time we use probability — not rules, not distance.
Naïve Bayes asks: "Given what I know about this student, which outcome is more probable — Pass or Fail?"
It calculates two probabilities — one for Pass, one for Fail — and picks the higher one. That's the prediction.
Unlike KNN (which measures distances) or a Decision Tree (which learns rules), Naïve Bayes works entirely with probability math learned from the data.
A doctor sees you and asks: "Does it hurt? Any fever? Cough?"
She doesn't measure how far you are from other patients. She doesn't follow a flowchart. She thinks:
"Given this combination of symptoms, what is the probability of flu vs cold? Which is higher?"
She picks the most probable diagnosis. That's Naïve Bayes — a probability doctor.
Naïve Bayes works best with categories. We convert our numbers into buckets:
| Student | Hours | Attendance | Marks | Result |
|---|---|---|---|---|
| A | Low (2) | Low (60) | Low (45) | Fail |
| B | High (5) | High (75) | High (60) | Pass |
| C | High (8) | High (90) | High (80) | Pass |
| D | Low (1) | Low (50) | Low (35) | Fail |
| E | High (6) | High (80) | High (70) | Pass |
Rule: Hours ≥ 4 = High, else Low | Attendance ≥ 70 = High, else Low | Marks ≥ 55 = High, else Low
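To make this concrete, here's the bucketing rule as a small Python sketch (the `bucket` function and `students` list are our names, not from the lesson; the thresholds are exactly the ones above):

```python
# Convert raw numbers into High/Low buckets using the thresholds above.
def bucket(hours, attendance, marks):
    return (
        "High" if hours >= 4 else "Low",
        "High" if attendance >= 70 else "Low",
        "High" if marks >= 55 else "Low",
    )

# The five students from the table: (name, hours, attendance, marks, result).
students = [
    ("A", 2, 60, 45, "Fail"),
    ("B", 5, 75, 60, "Pass"),
    ("C", 8, 90, 80, "Pass"),
    ("D", 1, 50, 35, "Fail"),
    ("E", 6, 80, 70, "Pass"),
]

for name, h, a, m, result in students:
    print(name, bucket(h, a, m), result)  # e.g. A ('Low', 'Low', 'Low') Fail
```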
The core formula is Bayes' theorem. Don't panic — we'll break it down word by word:

P(Class | Features) = [ P(Features | Class) × P(Class) ] / P(Features)

We calculate this for Pass and for Fail. Whichever is bigger — that's the answer.
Good news: since P(Features) is the same denominator for both classes, we can skip it! We just compare the numerators:

Score(Class) = P(Features | Class) × P(Class)
Why is it called "naïve"? Because it makes a naïve assumption: all features are independent of each other.
In reality, students who study more also tend to have higher marks. The two are correlated. But Naïve Bayes ignores that and treats each feature as if it has nothing to do with the others.
This lets us multiply the per-feature probabilities together:

P(Features | Class) ≈ P(Hours | Class) × P(Attendance | Class) × P(Marks | Class)
Surprisingly, this naïve simplification still works very well in practice — especially for text classification like spam detection.
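Here's the naïve step in code — a sketch, where `naive_likelihood` is our name and the three example values happen to be the smoothed Pass likelihoods we'll derive in Tab 4:

```python
from math import prod

# The naïve assumption in code: the likelihood of a whole feature
# combination is just the product of the per-feature likelihoods.
def naive_likelihood(per_feature_probs):
    return prod(per_feature_probs)

print(naive_likelihood([0.8, 0.8, 0.8]))  # ≈ 0.512
```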
From our 5 students, how many Passed and Failed? Three Passed (B, C, E) and two Failed (A, D):

P(Pass) = 3/5 = 0.6 | P(Fail) = 2/5 = 0.4

These are the prior probabilities — our starting belief before seeing any features.
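A quick check in Python (a sketch; the `results` list just restates the Result column):

```python
from collections import Counter

results = ["Fail", "Pass", "Pass", "Fail", "Pass"]  # students A–E
priors = {cls: n / len(results) for cls, n in Counter(results).items()}
print(priors)  # {'Fail': 0.4, 'Pass': 0.6}
```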
For each feature value, we count: how often does it appear within each class?
Among the 3 Pass students (B, C, E):
| Feature | Value | Count | Probability |
|---|---|---|---|
| Hours | High | 3 of 3 | 3/3 = 1.00 |
| Hours | Low | 0 of 3 | 0/3 = 0.00 * |
| Attendance | High | 3 of 3 | 3/3 = 1.00 |
| Attendance | Low | 0 of 3 | 0/3 = 0.00 * |
| Marks | High | 3 of 3 | 3/3 = 1.00 |
| Marks | Low | 0 of 3 | 0/3 = 0.00 * |
Among the 2 Fail students (A, D):
| Feature | Value | Count | Probability |
|---|---|---|---|
| Hours | High | 0 of 2 | 0/2 = 0.00 * |
| Hours | Low | 2 of 2 | 2/2 = 1.00 |
| Attendance | High | 0 of 2 | 0/2 = 0.00 * |
| Attendance | Low | 2 of 2 | 2/2 = 1.00 |
| Marks | High | 0 of 2 | 0/2 = 0.00 * |
| Marks | Low | 2 of 2 | 2/2 = 1.00 |
* Zero probabilities would kill the entire calculation (multiplying by 0). Fix: use Laplace Smoothing — add 1 to every count. See Tab 4.
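The same counting in Python — a sketch that rebuilds both tables above (the variable names are ours):

```python
from collections import defaultdict

# Bucketed features (Hours, Attendance, Marks) and results for students A–E.
data = [
    (("Low", "Low", "Low"), "Fail"),
    (("High", "High", "High"), "Pass"),
    (("High", "High", "High"), "Pass"),
    (("Low", "Low", "Low"), "Fail"),
    (("High", "High", "High"), "Pass"),
]
features = ["Hours", "Attendance", "Marks"]

counts = defaultdict(int)  # (class, feature, value) -> count
totals = defaultdict(int)  # class -> number of students
for values, cls in data:
    totals[cls] += 1
    for feat, val in zip(features, values):
        counts[(cls, feat, val)] += 1

for cls in ("Pass", "Fail"):
    for feat in features:
        for val in ("High", "Low"):
            p = counts[(cls, feat, val)] / totals[cls]
            print(f"P({feat}={val} | {cls}) = {p:.2f}")
```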
When a feature value never appears in a class, P = 0, which zeros out the whole product. The fix: add 1 to every count and add the number of possible values to the denominator. With two buckets (High/Low):

P = (count + 1) / (class total + 2)

So P(Hours=High | Fail) goes from 0/2 = 0.00 to (0+1)/(2+2) = 0.25, and P(Hours=High | Pass) goes from 3/3 = 1.00 to (3+1)/(3+2) = 0.80.
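In code, the fix is one small helper (a sketch; `smoothed` is our name):

```python
def smoothed(count, class_total, n_values=2):
    """Laplace smoothing: add 1 to the count and add the number of
    possible values (High/Low -> 2) to the denominator."""
    return (count + 1) / (class_total + n_values)

print(smoothed(0, 2))  # P(Hours=High | Fail): 0/2 = 0.00 becomes 1/4 = 0.25
print(smoothed(3, 3))  # P(Hours=High | Pass): 3/3 = 1.00 becomes 4/5 = 0.80
```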
New student X = Hours: High (4), Attendance: High (70), Marks: High (55)
We calculate the Naïve Bayes score for Pass and for Fail, then compare.
Student X has Hours=High, Attendance=High, Marks=High. Multiply all likelihoods × prior (using the smoothed values from Tab 4, so the zeros don't wipe out the Fail score), then compare:

Score(Pass) = 3/5 × 4/5 × 4/5 × 4/5 = 0.6 × 0.8 × 0.8 × 0.8 ≈ 0.3072
Score(Fail) = 2/5 × 1/4 × 1/4 × 1/4 = 0.4 × 0.25 × 0.25 × 0.25 = 0.00625

Pass wins. To get actual probabilities (not just scores), divide each score by their total:

P(Pass | X) = 0.3072 / (0.3072 + 0.00625) ≈ 0.98
P(Fail | X) = 0.00625 / (0.3072 + 0.00625) ≈ 0.02
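Putting every step together — a self-contained sketch (names ours) that scores Student X with the smoothed likelihoods and then normalizes:

```python
# Priors from the 5 students.
priors = {"Pass": 3 / 5, "Fail": 2 / 5}

# Raw counts of feature = High within each class, and class sizes.
high_counts = {"Pass": {"Hours": 3, "Attendance": 3, "Marks": 3},
               "Fail": {"Hours": 0, "Attendance": 0, "Marks": 0}}
class_sizes = {"Pass": 3, "Fail": 2}

def smoothed(count, class_total, n_values=2):
    return (count + 1) / (class_total + n_values)

# Student X: Hours=High, Attendance=High, Marks=High.
scores = {}
for cls in ("Pass", "Fail"):
    score = priors[cls]
    for feat in ("Hours", "Attendance", "Marks"):
        score *= smoothed(high_counts[cls][feat], class_sizes[cls])
    scores[cls] = score

total = sum(scores.values())
for cls, score in scores.items():
    print(f"{cls}: score = {score:.5f}, probability = {score / total:.3f}")
# Pass: score = 0.30720, probability = 0.980
# Fail: score = 0.00625, probability = 0.020
```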
Adjust the sliders. The algorithm shows you both probability scores in real time.
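And if you'd rather let a library do the counting, scikit-learn's CategoricalNB implements exactly this, with the add-1 (Laplace) smoothing on by default via alpha=1.0. A sketch, assuming we encode Low as 0 and High as 1:

```python
from sklearn.naive_bayes import CategoricalNB

# Students A–E; columns are Hours, Attendance, Marks (Low=0, High=1).
X = [[0, 0, 0], [1, 1, 1], [1, 1, 1], [0, 0, 0], [1, 1, 1]]
y = ["Fail", "Pass", "Pass", "Fail", "Pass"]

model = CategoricalNB()  # alpha=1.0 -> the add-1 smoothing from Tab 4
model.fit(X, y)

print(model.predict([[1, 1, 1]]))        # ['Pass'] for Student X
print(model.predict_proba([[1, 1, 1]]))  # roughly [[0.02, 0.98]] for (Fail, Pass)
```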
All three algorithms agree on Student X. They'll disagree more on edge cases — that's where algorithm choice matters.