Machine Learning Fundamentals

Data Splitting, Bias-Variance Tradeoff, and Regularization

Dr. Dhaval Patel • 2025

What We'll Learn Today

We'll explore three interconnected ML concepts using a house price prediction example throughout:

  • Data Splitting Strategies: Why and how we split data into train/validation/test sets
  • Bias-Variance Tradeoff: The fundamental tension in machine learning models
  • Regularization Techniques: L1/L2 methods to control overfitting
  • Practical Implementation: Real-world application and code examples
🎯 Goal: Understand how these concepts work together to build robust ML models

Our Running Example: Flat Price Prediction in Mumbai

🏠 Predicting Flat Prices in Mumbai Metropolitan Region

Dataset: 20,000 flats across Mumbai, Thane, and Navi Mumbai

| Feature | Example Value | Type |
| --- | --- | --- |
| Carpet Area (sq ft) | 850 sq ft | Continuous |
| BHK Configuration | 2 BHK | Discrete |
| Metro Distance (km) | 0.8 km | Continuous |
| Building Age (years) | 8 years | Continuous |
| Area (Locality) | Bandra West | Categorical |
| Price | ₹1.2 Crore | Target |
Why This Example Works: Mumbai real estate is complex with factors like metro connectivity, locality premium, and building amenities. Perfect for demonstrating overfitting risks - a model might memorize "Flat in Bandra = ₹1.2Cr" instead of learning the underlying patterns.

Part 1: Data Splitting

The Foundation of Reliable ML

Why We Can't Use All Data for Training

Imagine: You're studying for IIT-JEE using previous year questions. If the actual exam contains the exact same questions you practiced with, how would you know if you truly understand concepts or just memorized answers?

The same problem exists in machine learning: if we test our model on the same data we trained it on, we can't tell if it's truly learned patterns or just memorized the training examples.

In our Mumbai flat price example:

  • Training on all 20,000 flats and testing on the same flats would give misleadingly perfect results
  • The model might memorize "Flat #1234 in Bandra with 850 sq ft = ₹1.2 Crore" rather than learning the relationship between area and locality
  • When new flats come to market, the memorized model would fail catastrophically

Training / Validation / Test Split

🏠 Splitting the 20,000 Mumbai Flats

  • Training Set: 12,000 flats (60%), used to learn patterns
  • Validation Set: 4,000 flats (20%), used to tune parameters
  • Test Set: 4,000 flats (20%), used for final evaluation

Key Rules

  • Never let the model see test data during development
  • Validation set guides model selection and hyperparameter tuning
  • Test set gives unbiased estimate of real-world performance
Critical: Test set is used exactly once - at the very end to report final results!
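As a minimal sketch, the 60/20/20 split above can be done in two passes with scikit-learn's `train_test_split`; the random arrays here are stand-ins for the real flat features and prices:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 20,000-flat dataset (features X, prices y)
X = np.random.rand(20000, 5)
y = np.random.rand(20000)

# First hold out 20% as the untouchable test set...
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# ...then split the remaining 80% into 75% train / 25% validation,
# which works out to 60% / 20% of the full dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_dev, y_dev, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 12000 4000 4000
```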

Cross-Validation: Making the Most of Limited Data

5-Fold Cross-Validation

The training data is divided into five equal folds; each fold takes one turn as the validation set while the remaining four folds are used for training, so the model is evaluated five times on five different held-out slices.

Cross-Validation Scores

| Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Average |
| --- | --- | --- | --- | --- | --- |
| ₹8.1L | ₹8.5L | ₹7.8L | ₹8.9L | ₹7.2L | ₹8.1L |
Benefits: More robust evaluation, better use of data, reduces variance in performance estimates. Each flat gets to be in the validation set exactly once, and in the training set four times.
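One way to sketch this in code is with scikit-learn's `KFold` and `cross_val_score`; the synthetic data below stands in for the flat dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.random((1000, 4))  # synthetic features
y = X @ np.array([3.0, 1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 1000)

# 5 folds: each sample is validated on exactly once, trained on 4 times
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y,
                         cv=cv, scoring="neg_mean_absolute_error")

# One score per fold; average them for the overall estimate
print([round(-s, 3) for s in scores], round(-scores.mean(), 3))
```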

Part 2: Bias-Variance Tradeoff

The Core Tension in Machine Learning

Bias vs Variance: The Fundamental Tradeoff

🎯 Bias (Underfitting)

Definition: Error from overly simplistic assumptions

In our Mumbai flat example:

  • Model: "All flats cost ₹80 Lakh"
  • Ignores carpet area, BHK, locality, metro distance
  • Consistently wrong but predictably so
High Bias = Model too simple to capture true patterns

📊 Variance (Overfitting)

Definition: Error from sensitivity to small changes in training data

In our Mumbai flat example:

  • Model memorizes every training flat
  • Perfect on training data
  • Wildly wrong on new flats
High Variance = Model too complex, learns noise instead of signal

The Mathematical Framework

The total error of any machine learning model can be decomposed into three components:

Total Error = Bias² + Variance + Irreducible Error

For our Mumbai flat price model:

  • Bias²: How far off our model's average prediction is from the true flat price
  • Variance: How much our predictions change when we retrain on different flat data
  • Irreducible Error: Random noise we can't eliminate (market fluctuations, measurement errors, sudden policy changes)
🎯 The Art of ML: Finding the sweet spot where Bias² + Variance is minimized

Why it's a tradeoff: Typically, reducing bias increases variance and vice versa. More complex models have lower bias but higher variance.
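The decomposition can be demonstrated empirically: refit a model many times on fresh noisy samples and measure how its predictions at a fixed point spread out. This sketch uses a made-up linear "true price" curve, with polynomial degree as the complexity knob:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_price(x):
    # Hypothetical "true" relationship: price (in lakh) vs normalized size
    return 50 + 100 * x

def bias_variance(degree, n_trials=200, n=30, x0=0.9):
    """Refit a degree-`degree` polynomial on fresh noisy samples each
    trial and measure bias^2 and variance of its prediction at x0."""
    preds = []
    for _ in range(n_trials):
        x = rng.random(n)
        y = true_price(x) + rng.normal(0, 10, n)  # irreducible noise
        coef = np.polyfit(x, y, degree)
        preds.append(np.polyval(coef, x0))
    preds = np.array(preds)
    return (preds.mean() - true_price(x0)) ** 2, preds.var()

b0, v0 = bias_variance(0)  # constant prediction: high bias
b1, v1 = bias_variance(1)  # matches the true line: low bias, low variance
b9, v9 = bias_variance(9)  # wiggly polynomial: high variance
print(f"bias^2:   {b0:.1f}  {b1:.1f}  {b9:.1f}")
print(f"variance: {v0:.1f}  {v1:.1f}  {v9:.1f}")
```

The constant model misses in the same direction every trial (large bias², small variance); the degree-9 model's predictions swing wildly between trials (small bias², large variance).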

Bias-Variance in Action: Mumbai Flat Price Models

🎯 Target Practice Analogy

Imagine predicting the price of identical 2BHK flats in Bandra West (True Price: ₹1.2 Crore)

High Bias Model: always predicts ₹80L. Consistent but wrong.

High Variance Model: predictions range from ₹50L to ₹3 Cr. Unpredictable.

Balanced Model: predictions range from ₹1.1 Cr to ₹1.3 Cr. ✅ Accurate & consistent.

🎯 Target Practice Lesson:

  • High Bias: Always misses in the same direction (systematic error)
  • High Variance: Shots all over the place (inconsistent)
  • Balanced: Shots clustered around the bullseye (accurate & precise)

Performance Comparison

| Model | Training Error | Test Error |
| --- | --- | --- |
| High Bias | ₹15 Lakh | ₹14.8 Lakh |
| Balanced | ₹7 Lakh | ₹7.6 Lakh |
| High Variance | ₹1,000 | ₹19 Lakh |
Notice: The balanced model hits closest to the target consistently!

Part 3: Regularization

Controlling Model Complexity

What is Regularization?

Regularization is like adding a "complexity penalty" to prevent overfitting. Think of it as a speed limit for your model's complexity.

🎯 Goal: Find models that perform well on training data WITHOUT becoming too complex

In our Mumbai flat price example: Instead of letting our model create wild, complex relationships between carpet area and locality, regularization keeps it reasonable and generalizable.

The regularized objective becomes:

Minimize: Training Error + λ × Complexity Penalty
  • Training Error: How well the model fits the training flat data
  • Complexity Penalty: Punishment for overly complex models
  • λ (lambda): Hyperparameter controlling the tradeoff strength

L1 (Lasso) vs L2 (Ridge) Regularization

L1 Regularization (Lasso)

Penalty = λ × Σ|wᵢ|

Effect: Encourages sparsity - sets some weights to exactly zero

In our Mumbai flat model:

  • Might eliminate "building age" feature entirely
  • Performs automatic feature selection
  • Creates simpler, more interpretable models
Best when: You have many features and want automatic feature selection

L2 Regularization (Ridge)

Penalty = λ × Σwᵢ²

Effect: Shrinks all weights toward zero but doesn't eliminate features

In our Mumbai flat model:

  • Keeps all features but reduces their impact
  • Prevents any single feature from dominating
  • More stable when features are correlated
Best when: All features are potentially useful and you want to keep them all
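The contrast can be seen directly with scikit-learn's `Lasso` and `Ridge` classes (λ is called `alpha` there). The data below is synthetic: five features, of which only the first three truly affect the target:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
w_true = np.array([5.0, 3.0, -2.0, 0.0, 0.0])  # last two features irrelevant
y = X @ w_true + rng.normal(0, 1.0, 500)

lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=10.0).fit(X, y)  # L2 penalty

print("L1:", np.round(lasso.coef_, 2))  # irrelevant weights driven to exactly 0
print("L2:", np.round(ridge.coef_, 2))  # all weights shrunk, none exactly 0
```

With these (assumed) settings, Lasso zeroes out the two irrelevant features entirely, while Ridge keeps small nonzero weights on everything, mirroring the Building Age / Parking behavior in the table below.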

Regularization Effect on Our Mumbai Flat Price Model

Model Weights Comparison

| Feature | No Regularization | L1 (λ=0.1) | L2 (λ=0.1) |
| --- | --- | --- | --- |
| Carpet Area | 5,750 | 4,900 | 5,365 |
| BHK | 8,42,000 | 6,60,000 | 7,76,000 |
| Metro Distance | -1,25,000 | -98,000 | -1,15,000 |
| Building Age | -15,420 | 0 | -11,200 |
| Parking | 2,34,000 | 0 | 1,89,000 |

Key Observations

  • L1 eliminated Building Age and Parking features completely (weights = 0)
  • L2 shrank all weights but kept all features
  • Both methods reduced the magnitude of coefficients
Result: Both regularized models generalize better to new Mumbai flats!
Performance:
No Reg: ₹19L test error
L1: ₹8.4L test error
L2: ₹7.6L test error

Choosing the Right λ (Lambda) Value

Effect of λ on Model Behavior

| λ | Behavior |
| --- | --- |
| 0 (No Reg) | ⚠️ Overfitting: memorizes training data |
| 0.01 | ⚡ Some overfitting: still learning noise |
| 0.1 | ✅ Optimal: balanced complexity (training error ₹5.6L, validation error ₹8.2L) |
| 1.0 | 📉 Underfitting: too simple |
| 10.0 | 🚫 Severe underfitting: ignores data patterns |
How to Choose λ: Use cross-validation on the training set to find the λ that minimizes validation error. This is exactly why we need that separate validation set - to tune hyperparameters like λ without touching our test set!
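A sketch of that tuning loop with scikit-learn's `GridSearchCV` (synthetic data; Ridge's `alpha` parameter plays the role of λ):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 6))  # synthetic training features
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 1.0, 400)

# Sweep candidate λ values using 5-fold CV on the training data only;
# the test set is never shown to this search.
lambdas = [0.01, 0.1, 1.0, 10.0]
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": lambdas},
    cv=KFold(n_splits=5, shuffle=True, random_state=2),
    scoring="neg_mean_absolute_error",
).fit(X, y)

print("best λ:", search.best_params_["alpha"])
```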

How Everything Connects: The Complete Pipeline

Let's see how all three concepts work together in our Mumbai flat price prediction:

🔄 Complete ML Pipeline

🏠 Raw Data (20,000 flats) → 📊 Split Data (60-20-20 rule) → ⚖️ Train Models (balance bias and variance) → 🛡️ Regularize (tune λ) → ✅ Final Model (test & deploy)

  • ❌ Without these techniques: ₹19L test error (overfitted model)
  • ✅ With these techniques: ₹7.6L test error (optimized model)

🎉 60% Error Reduction!

From ₹19 Lakh average error to ₹7.6 Lakh - much more reliable Mumbai flat price predictions!
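Put together, the pipeline is only a few lines. This sketch uses synthetic data, so the numbers differ from the slide's, but the workflow is the same: split first, tune λ by cross-validation on the development data, and touch the test set exactly once:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 5))  # synthetic stand-in for the flat data
y = X @ np.array([4.0, 2.0, -1.0, 0.0, 0.0]) + rng.normal(0, 1.0, 2000)

# 1. Split before any modelling: the test set stays untouched until the end
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, random_state=3)

# 2-3. Tune λ by cross-validation on the development data only
model = GridSearchCV(
    Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5).fit(X_dev, y_dev)

# 4. One final, unbiased measurement on the held-out test set
mae = mean_absolute_error(y_test, model.predict(X_test))
print("final test MAE:", round(mae, 3))
```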

See Our Model in Action: Predicting a Bandra Flat

🏠 Flat Price Prediction Demo

Input features: 950 sq ft, 2 BHK, 0.5 km to metro, Bandra West, 5 years old, parking: yes

🎯 Model Predictions

| Model | Prediction | Verdict |
| --- | --- | --- |
| No Regularization | ₹2.8 Cr | Overconfident |
| L2 Regularized | ₹1.45 Cr | ✅ Most Reliable |
| L1 Regularized | ₹1.52 Cr | Good Alternative |

Why Our Model Works

  • Proper Data Splitting: Trained on 12k flats, validated carefully
  • Bias-Variance Balance: Neither too simple nor too complex
  • L2 Regularization: Prevents overfitting to training quirks
  • Cross-Validation: Robust parameter selection
Market Reality: A 950 sq ft 2BHK in Bandra West near metro typically costs ₹1.4-1.6 Cr - our model's prediction is spot on! 🎯

Key Takeaways & Best Practices

🎯 Data Splitting

  • Always split data before any model development
  • Use cross-validation to make the most of your data
  • Test set is sacred - touch it only once for final evaluation

⚖️ Bias-Variance Tradeoff

  • Simple models: High bias, low variance (underfitting)
  • Complex models: Low bias, high variance (overfitting)
  • Sweet spot: Balance both for optimal test performance

🛡️ Regularization

  • L1 (Lasso): Feature selection, sparse models
  • L2 (Ridge): Shrinks all weights, handles correlated features well
  • Always tune λ using cross-validation on training data
Remember: These aren't separate techniques - they work together to build robust, generalizable machine learning models!

Questions?

Ready to apply these concepts to your own ML projects!

Next: Practice implementing these techniques with different datasets
