Machine Learning Fundamentals

Data Splitting, Bias-Variance Tradeoff, and Regularization

Dr. Dhaval Patel • 2025

What We'll Learn Today

We'll explore three interconnected ML concepts using a house price prediction example throughout:

  • Data Splitting Strategies: Why and how we split data into train/validation/test sets
  • Bias-Variance Tradeoff: The fundamental tension in machine learning models
  • Regularization Techniques: L1/L2 methods to control overfitting
  • Practical Implementation: Real-world application and code examples
🎯 Goal: Understand how these concepts work together to build robust ML models

Our Running Example: Flat Price Prediction in Mumbai

🏠 Predicting Flat Prices in Mumbai Metropolitan Region

Dataset: 20,000 flats across Mumbai, Thane, and Navi Mumbai

| Feature | Example Value | Type |
| --- | --- | --- |
| Carpet Area (sq ft) | 850 sq ft | Continuous |
| BHK Configuration | 2 BHK | Discrete |
| Metro Distance (km) | 0.8 km | Continuous |
| Building Age (years) | 8 years | Continuous |
| Area (Locality) | Bandra West | Categorical |
| Price | ₹1.2 Crore | Target |
Why This Example Works: Mumbai real estate is complex with factors like metro connectivity, locality premium, and building amenities. Perfect for demonstrating overfitting risks - a model might memorize "Flat in Bandra = ₹1.2Cr" instead of learning the underlying patterns.

Part 1: Data Splitting

The Foundation of Reliable ML

Why We Can't Use All Data for Training

Imagine: You're studying for IIT-JEE using previous year questions. If the actual exam contains the exact same questions you practiced with, how would you know if you truly understand concepts or just memorized answers?

The same problem exists in machine learning: if we test our model on the same data we trained it on, we can't tell if it's truly learned patterns or just memorized the training examples.

In our Mumbai flat price example:

  • Training on all 20,000 flats and testing on the same flats would give misleadingly perfect results
  • The model might memorize "Flat #1234 in Bandra with 850 sq ft = ₹1.2 Crore" rather than learning the relationship between area and locality
  • When new flats come to market, the memorized model would fail catastrophically

Training / Validation / Test Split

🏠 Splitting the 20,000 Mumbai Flats

  • Training Set: 12,000 flats (60%), used to learn patterns
  • Validation Set: 4,000 flats (20%), used to tune parameters
  • Test Set: 4,000 flats (20%), used for final evaluation

Key Rules

  • Never let the model see test data during development
  • Validation set guides model selection and hyperparameter tuning
  • Test set gives unbiased estimate of real-world performance
Critical: Test set is used exactly once - at the very end to report final results!
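As a minimal sketch, the 60/20/20 split above can be done in two passes with scikit-learn's `train_test_split`; the random arrays here are stand-ins for the real flat features and prices:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 20,000-flat dataset (features X, prices y)
X = np.random.rand(20000, 5)
y = np.random.rand(20000)

# First hold out 20% as the untouchable test set...
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# ...then split the remaining 80% into 75% train / 25% validation,
# which works out to 60% / 20% of the full dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_dev, y_dev, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 12000 4000 4000
```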

Cross-Validation: Making the Most of Limited Data

5-Fold Cross-Validation

The training data is divided into five equal folds; each fold takes one turn as the validation set while the remaining four folds are used for training, so the model is evaluated five times on five different held-out slices.

Cross-Validation Scores

| Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Average |
| --- | --- | --- | --- | --- | --- |
| ₹8.1L | ₹8.5L | ₹7.8L | ₹8.9L | ₹7.2L | ₹8.1L |
Benefits: More robust evaluation, better use of data, reduces variance in performance estimates. Each flat gets to be in the validation set exactly once, and in the training set four times.
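One way to sketch this in code is with scikit-learn's `KFold` and `cross_val_score`; the synthetic data below stands in for the flat dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.random((1000, 4))  # synthetic features
y = X @ np.array([3.0, 1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 1000)

# 5 folds: each sample is validated on exactly once, trained on 4 times
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y,
                         cv=cv, scoring="neg_mean_absolute_error")

# One score per fold; average them for the overall estimate
print([round(-s, 3) for s in scores], round(-scores.mean(), 3))
```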

Part 2: Bias-Variance Tradeoff

The Core Tension in Machine Learning

Bias vs Variance: The Fundamental Tradeoff

🎯 Bias (Underfitting)

Definition: Error from overly simplistic assumptions

In our Mumbai flat example:

  • Model: "All flats cost ₹80 Lakh"
  • Ignores carpet area, BHK, locality, metro distance
  • Consistently wrong but predictably so
High Bias = Model too simple to capture true patterns

📊 Variance (Overfitting)

Definition: Error from sensitivity to small changes in training data

In our Mumbai flat example:

  • Model memorizes every training flat
  • Perfect on training data
  • Wildly wrong on new flats
High Variance = Model too complex, learns noise instead of signal

The Mathematical Framework

The total error of any machine learning model can be decomposed into three components:

Total Error = Bias² + Variance + Irreducible Error

For our Mumbai flat price model:

  • Bias²: How far off our model's average prediction is from the true flat price
  • Variance: How much our predictions change when we retrain on different flat data
  • Irreducible Error: Random noise we can't eliminate (market fluctuations, measurement errors, sudden policy changes)
🎯 The Art of ML: Finding the sweet spot where Bias² + Variance is minimized

Why it's a tradeoff: Typically, reducing bias increases variance and vice versa. More complex models have lower bias but higher variance.
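The decomposition can be demonstrated empirically: refit a model many times on fresh noisy samples and measure how its predictions at a fixed point spread out. This sketch uses a made-up linear "true price" curve, with polynomial degree as the complexity knob:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_price(x):
    # Hypothetical "true" relationship: price (in lakh) vs normalized size
    return 50 + 100 * x

def bias_variance(degree, n_trials=200, n=30, x0=0.9):
    """Refit a degree-`degree` polynomial on fresh noisy samples each
    trial and measure bias^2 and variance of its prediction at x0."""
    preds = []
    for _ in range(n_trials):
        x = rng.random(n)
        y = true_price(x) + rng.normal(0, 10, n)  # irreducible noise
        coef = np.polyfit(x, y, degree)
        preds.append(np.polyval(coef, x0))
    preds = np.array(preds)
    return (preds.mean() - true_price(x0)) ** 2, preds.var()

b0, v0 = bias_variance(0)  # constant prediction: high bias
b1, v1 = bias_variance(1)  # matches the true line: low bias, low variance
b9, v9 = bias_variance(9)  # wiggly polynomial: high variance
print(f"bias^2:   {b0:.1f}  {b1:.1f}  {b9:.1f}")
print(f"variance: {v0:.1f}  {v1:.1f}  {v9:.1f}")
```

The constant model misses in the same direction every trial (large bias², small variance); the degree-9 model's predictions swing wildly between trials (small bias², large variance).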

Bias-Variance in Action: Mumbai Flat Price Models

🎯 Target Practice Analogy

Imagine predicting the price of identical 2BHK flats in Bandra West (True Price: ₹1.2 Crore)

High Bias Model: always predicts ₹80L. Consistent but wrong.

High Variance Model: predictions range from ₹50L to ₹3 Cr. Unpredictable.

Balanced Model: predictions range from ₹1.1 Cr to ₹1.3 Cr. ✅ Accurate & consistent.

🎯 Target Practice Lesson:

  • High Bias: Always misses in the same direction (systematic error)
  • High Variance: Shots all over the place (inconsistent)
  • Balanced: Shots clustered around the bullseye (accurate & precise)

Performance Comparison

| Model | Training Error | Test Error |
| --- | --- | --- |
| High Bias | ₹15 Lakh | ₹14.8 Lakh |
| Balanced | ₹7 Lakh | ₹7.6 Lakh |
| High Variance | ₹1,000 | ₹19 Lakh |
Notice: The balanced model hits closest to the target consistently!

Part 3: Regularization

Controlling Model Complexity

What is Regularization?

Regularization is like adding a "complexity penalty" to prevent overfitting. Think of it as a speed limit for your model's complexity.

🎯 Goal: Find models that perform well on training data WITHOUT becoming too complex

In our Mumbai flat price example: Instead of letting our model create wild, complex relationships between carpet area and locality, regularization keeps it reasonable and generalizable.

The regularized objective becomes:

Minimize: Training Error + λ × Complexity Penalty
  • Training Error: How well the model fits the training flat data
  • Complexity Penalty: Punishment for overly complex models
  • λ (lambda): Hyperparameter controlling the tradeoff strength

L1 (Lasso) vs L2 (Ridge) Regularization

L1 Regularization (Lasso)

Penalty = λ × Σ|wᵢ|

Effect: Encourages sparsity - sets some weights to exactly zero

In our Mumbai flat model:

  • Might eliminate "building age" feature entirely
  • Performs automatic feature selection
  • Creates simpler, more interpretable models
Best when: You have many features and want automatic feature selection

L2 Regularization (Ridge)

Penalty = λ × Σwᵢ²

Effect: Shrinks all weights toward zero but doesn't eliminate features

In our Mumbai flat model:

  • Keeps all features but reduces their impact
  • Prevents any single feature from dominating
  • More stable when features are correlated
Best when: All features are potentially useful and you want to keep them all
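The contrast can be seen directly with scikit-learn's `Lasso` and `Ridge` classes (λ is called `alpha` there). The data below is synthetic: five features, of which only the first three truly affect the target:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
w_true = np.array([5.0, 3.0, -2.0, 0.0, 0.0])  # last two features irrelevant
y = X @ w_true + rng.normal(0, 1.0, 500)

lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=10.0).fit(X, y)  # L2 penalty

print("L1:", np.round(lasso.coef_, 2))  # irrelevant weights driven to exactly 0
print("L2:", np.round(ridge.coef_, 2))  # all weights shrunk, none exactly 0
```

With these (assumed) settings, Lasso zeroes out the two irrelevant features entirely, while Ridge keeps small nonzero weights on everything, mirroring the Building Age / Parking behavior in the table below.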

Regularization Effect on Our Mumbai Flat Price Model

Model Weights Comparison

| Feature | No Regularization | L1 (λ=0.1) | L2 (λ=0.1) |
| --- | --- | --- | --- |
| Carpet Area | 5,750 | 4,900 | 5,365 |
| BHK | 8,42,000 | 6,60,000 | 7,76,000 |
| Metro Distance | -1,25,000 | -98,000 | -1,15,000 |
| Building Age | -15,420 | 0 | -11,200 |
| Parking | 2,34,000 | 0 | 1,89,000 |

Key Observations

  • L1 eliminated Building Age and Parking features completely (weights = 0)
  • L2 shrank all weights but kept all features
  • Both methods reduced the magnitude of coefficients
Result: Both regularized models generalize better to new Mumbai flats!
Performance:
No Reg: ₹19L test error
L1: ₹8.4L test error
L2: ₹7.6L test error

Choosing the Right λ (Lambda) Value

Effect of λ on Model Behavior

| λ | Behavior |
| --- | --- |
| 0 (No Reg) | ⚠️ Overfitting: memorizes training data |
| 0.01 | ⚡ Some overfitting: still learning noise |
| 0.1 | ✅ Optimal: balanced complexity (training error ₹5.6L, validation error ₹8.2L) |
| 1.0 | 📉 Underfitting: too simple |
| 10.0 | 🚫 Severe underfitting: ignores data patterns |
How to Choose λ: Use cross-validation on the training set to find the λ that minimizes validation error. This is exactly why we need that separate validation set - to tune hyperparameters like λ without touching our test set!
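A sketch of that tuning loop with scikit-learn's `GridSearchCV` (synthetic data; Ridge's `alpha` parameter plays the role of λ):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 6))  # synthetic training features
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 1.0, 400)

# Sweep candidate λ values using 5-fold CV on the training data only;
# the test set is never shown to this search.
lambdas = [0.01, 0.1, 1.0, 10.0]
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": lambdas},
    cv=KFold(n_splits=5, shuffle=True, random_state=2),
    scoring="neg_mean_absolute_error",
).fit(X, y)

print("best λ:", search.best_params_["alpha"])
```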

How Everything Connects: The Complete Pipeline

Let's see how all three concepts work together in our Mumbai flat price prediction:

🔄 Complete ML Pipeline

🏠 Raw Data (20,000 flats) → 📊 Split Data (60-20-20 rule) → ⚖️ Train Models (balance bias and variance) → 🛡️ Regularize (tune λ) → ✅ Final Model (test & deploy)

  • ❌ Without these techniques: ₹19L test error (overfitted model)
  • ✅ With these techniques: ₹7.6L test error (optimized model)

🎉 60% Error Reduction!

From ₹19 Lakh average error to ₹7.6 Lakh - much more reliable Mumbai flat price predictions!
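Put together, the pipeline is only a few lines. This sketch uses synthetic data, so the numbers differ from the slide's, but the workflow is the same: split first, tune λ by cross-validation on the development data, and touch the test set exactly once:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 5))  # synthetic stand-in for the flat data
y = X @ np.array([4.0, 2.0, -1.0, 0.0, 0.0]) + rng.normal(0, 1.0, 2000)

# 1. Split before any modelling: the test set stays untouched until the end
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, random_state=3)

# 2-3. Tune λ by cross-validation on the development data only
model = GridSearchCV(
    Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5).fit(X_dev, y_dev)

# 4. One final, unbiased measurement on the held-out test set
mae = mean_absolute_error(y_test, model.predict(X_test))
print("final test MAE:", round(mae, 3))
```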

See Our Model in Action: Predicting a Bandra Flat

🏠 Flat Price Prediction Demo

Input features: 950 sq ft, 2 BHK, 0.5 km to metro, Bandra West, 5 years old, parking: yes

🎯 Model Predictions

| Model | Prediction | Verdict |
| --- | --- | --- |
| No Regularization | ₹2.8 Cr | Overconfident |
| L2 Regularized | ₹1.45 Cr | ✅ Most Reliable |
| L1 Regularized | ₹1.52 Cr | Good Alternative |

Why Our Model Works

  • Proper Data Splitting: Trained on 12k flats, validated carefully
  • Bias-Variance Balance: Neither too simple nor too complex
  • L2 Regularization: Prevents overfitting to training quirks
  • Cross-Validation: Robust parameter selection
Market Reality: A 950 sq ft 2BHK in Bandra West near metro typically costs ₹1.4-1.6 Cr - our model's prediction is spot on! 🎯

Key Takeaways & Best Practices

🎯 Data Splitting

  • Always split data before any model development
  • Use cross-validation to make the most of your data
  • Test set is sacred - touch it only once for final evaluation

⚖️ Bias-Variance Tradeoff

  • Simple models: High bias, low variance (underfitting)
  • Complex models: Low bias, high variance (overfitting)
  • Sweet spot: Balance both for optimal test performance

🛡️ Regularization

  • L1 (Lasso): Feature selection, sparse models
  • L2 (Ridge): Shrinks all weights, handles correlated features well
  • Always tune λ using cross-validation on training data
Remember: These aren't separate techniques - they work together to build robust, generalizable machine learning models!

Questions?

Ready to apply these concepts to your own ML projects!

Next: Practice implementing these techniques with different datasets
