🤖 Natural Language Processing

From Basic Text Processing to Large Language Models

GTU 31711105 AI NLP Course • Complete Journey • 2024

📚 Introduction to NLP

Understanding Human-Computer Language Bridge

🧠 What is Natural Language Processing?

NLP is a branch of AI that helps computers understand, interpret, and generate human language in valuable ways.

"I love this amazing movie but the ending was disappointing."
🤔 Human: Understands mixed emotions, context, and sentiment
🤖 Computer: Sees only characters and words without meaning
  • Challenge: Bridge the gap between human language and computer understanding
  • Goal: Make computers understand context, sentiment, and meaning
  • Impact: Enable natural human-computer interaction

🌍 Real-World NLP Applications

💬
Chatbots
Customer service, virtual assistants like Siri, Alexa
🌐
Translation
Google Translate breaking language barriers
😊
Sentiment
Analyzing opinions in social media, reviews
🔍
Search
Google understanding your search queries
📧
Email
Spam detection, smart categorization
📰
Summarization
Auto-generating article summaries

🔄 NLU vs NLG: Two Sides of NLP

📥 NLU

Natural Language Understanding

Direction: Human Language → Computer Understanding

Example:
Input: "I love this amazing movie"
Output: Intent=opinion, Sentiment=positive, Entity=movie
  • Speech Recognition
  • Intent Detection
  • Entity Extraction
  • Sentiment Analysis

📤 NLG

Natural Language Generation

Direction: Computer Data → Human Language

Example:
Input: Weather_data={temp:25, condition:'sunny'}
Output: "It's a sunny 25°C day in Mumbai!"
  • Report Generation
  • Chatbot Responses
  • Story Writing
  • Code Documentation

🔧 Unit 1: NLP Fundamentals

Text Processing Pipeline

🧹 Text Preprocessing

Original: "I love this amazing movie but the ending was disappointing."

⬇️ Cleaning Process ⬇️

Processed: "i love this amazing movie but the ending was disappointing"
Why Preprocessing? Raw text is messy! We need to clean and standardize before computers can work with it effectively.

Steps:
1. Lowercase: "I Love" → "i love"
2. Remove Punctuation: "movie!" → "movie"
3. Remove Extra Spaces: "this   amazing" → "this amazing"
4. Handle Contractions: "don't" → "do not"
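The steps above can be sketched in a few lines of Python (the contraction map here is a tiny illustrative subset; real systems use a much fuller one):

```python
import re

def preprocess(text):
    """Minimal cleaning sketch: lowercase, expand contractions,
    strip punctuation, collapse extra whitespace."""
    text = text.lower()
    # Expand a few common contractions before punctuation removal
    contractions = {"don't": "do not", "can't": "cannot", "it's": "it is"}
    for short, full in contractions.items():
        text = text.replace(short, full)
    text = re.sub(r"[^\w\s]", "", text)       # remove punctuation
    text = re.sub(r"\s+", " ", text).strip()  # collapse extra spaces
    return text

print(preprocess("I love this amazing movie but the ending was disappointing."))
# → "i love this amazing movie but the ending was disappointing"
```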

✂️ Tokenization: Breaking Text Apart

Sentence Tokenization:
Text → Sentences

Word Tokenization:
"I love this amazing movie"

["I", "love", "this", "amazing", "movie"]

Key Concepts

  • Token: Individual unit (word, sentence)
  • Sentence Split: Break on periods, exclamation marks
  • Word Split: Break on spaces, punctuation
  • Subword: Modern approach for unknown words
Challenge: How to handle "don't", "state-of-the-art", URLs?
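A regex-based sketch shows the idea; production tokenizers (NLTK, spaCy) handle far more edge cases, but even this naive version keeps "don't" and "state-of-the-art" as single tokens:

```python
import re

def sentence_tokenize(text):
    """Naive sentence split: break after ., !, ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def word_tokenize(text):
    """Word split that keeps internal apostrophes and hyphens together."""
    return re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", text)

print(word_tokenize("I love this amazing movie"))
# → ['I', 'love', 'this', 'amazing', 'movie']
print(word_tokenize("don't miss this state-of-the-art film"))
# → ["don't", 'miss', 'this', 'state-of-the-art', 'film']
```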

🔗 N-grams: Capturing Word Sequences

N-grams capture sequences of N consecutive words to understand context and patterns.

"I love this amazing movie but the ending was disappointing"
Unigrams (1-gram): ["I", "love", "this", "amazing", "movie", "but", "the", "ending", "was", "disappointing"]

Bigrams (2-gram): ["I love", "love this", "this amazing", "amazing movie", "movie but", "but the", "the ending", "ending was", "was disappointing"]

Trigrams (3-gram): ["I love this", "love this amazing", "this amazing movie", "amazing movie but", "movie but the", "but the ending", "the ending was", "ending was disappointing"]
  • Language Models: Predict next word based on previous N-1 words
  • Applications: Autocomplete, spell checking, speech recognition
  • Trade-off: Higher N = more context but more sparsity
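Extracting n-grams is a one-line sliding window over the token list:

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I love this amazing movie".split()
print(ngrams(tokens, 1))  # unigrams: ['I', 'love', 'this', 'amazing', 'movie']
print(ngrams(tokens, 2))  # bigrams:  ['I love', 'love this', 'this amazing', 'amazing movie']
print(ngrams(tokens, 3))  # trigrams: ['I love this', 'love this amazing', 'this amazing movie']
```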

📝 Unit 2: Morphological & Syntactic Analysis

Understanding Word Structure and Grammar

🔬 Morphological Analysis

Word Structure

Breaking words into meaningful parts (morphemes).

Example: "disappointing"
• Root: "appoint" (assign)
• Prefix: "dis-" (negation)
• Suffix: "-ing" (ongoing action)
  • Helps understand word meaning
  • Reduces vocabulary size
  • Handles unknown words

Applications

Stemming:
"disappointing" → "disappoint"
"amazing" → "amaz"
"loves" → "love"

Lemmatization:
"disappointing" → "disappoint"
"amazing" → "amazing"
"better" → "good"
  • Stemming: Crude suffix removal
  • Lemmatization: Dictionary-based reduction
  • Goal: Group related words together
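A toy suffix-stripping stemmer and a dictionary-lookup lemmatizer make the contrast concrete (real stemmers like NLTK's PorterStemmer apply many ordered rules, and real lemmatizers use a full dictionary; `LEMMA_TABLE` here is a tiny made-up sample):

```python
def crude_stem(word):
    """Toy stemmer: strip one common suffix, leaving at least 3 characters."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# A lemmatizer maps to dictionary forms instead, handling irregular words
LEMMA_TABLE = {"better": "good", "was": "be", "amazing": "amazing"}

def lemmatize(word):
    return LEMMA_TABLE.get(word, crude_stem(word))

print(crude_stem("disappointing"))  # → disappoint
print(crude_stem("amazing"))        # → amaz  (crude!)
print(lemmatize("better"))          # → good  (irregular form handled)
```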

🏷️ Part-of-Speech (POS) Tagging

"I love this amazing movie but the ending was disappointing."

I/PRP love/VBP this/DT amazing/JJ movie/NN
but/CC the/DT ending/NN was/VBD disappointing/VBG
POS Tag Meanings:
PRP = Personal Pronoun (I, he, she)
VBP = Verb, present tense (love, like)
DT = Determiner (the, this, a)
JJ = Adjective (amazing, good)
NN = Noun, singular (movie, ending)
CC = Coordinating Conjunction (but, and)
VBD = Verb, past tense (was, had)
VBG = Verb, gerund (disappointing, running)
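The tagging above can be mimicked with a toy lookup tagger (the `LEXICON` dict is hand-written for this one sentence; real taggers such as `nltk.pos_tag` are trained on large annotated corpora and handle ambiguity from context):

```python
# Hand-written toy lexicon covering only our example sentence
LEXICON = {
    "i": "PRP", "love": "VBP", "this": "DT", "amazing": "JJ",
    "movie": "NN", "but": "CC", "the": "DT", "ending": "NN",
    "was": "VBD", "disappointing": "VBG",
}

def pos_tag(tokens):
    """Tag each token by lexicon lookup, defaulting unknowns to NN."""
    return [(tok, LEXICON.get(tok.lower(), "NN")) for tok in tokens]

tagged = pos_tag("I love this amazing movie".split())
print(" ".join(f"{w}/{t}" for w, t in tagged))
# → I/PRP love/VBP this/DT amazing/JJ movie/NN
```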

🌳 Syntactic Parsing

Constituency Parse Tree:

S
├── NP (I)
└── VP
    ├── VBP (love)
    └── NP
        ├── DT (this)
        ├── JJ (amazing)
        └── NN (movie)

Parsing Types

  • Constituency: Phrase structure (NP, VP)
  • Dependency: Word relationships
  • Goal: Understand grammatical structure
Dependency Example:
love → root
I → nsubj(love)
movie → dobj(love)
amazing → amod(movie)

🧠 Unit 3: Semantic Analysis

Understanding Meaning and Context

🤔 Word Sense Disambiguation

Multiple Meanings

Words can have different meanings in different contexts.

Word: "love"
• Romantic affection
• Strong preference
• Enthusiasm for activity

Challenge: Which meaning is intended?

Context Solution

Use surrounding words to determine correct meaning.

"I love this amazing movie"

Context clues:
• "movie" (entertainment)
• "amazing" (quality)

Conclusion: Preference meaning
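The context-clue idea is the heart of the classic Lesk algorithm: pick the sense whose definition overlaps the context most. This sketch uses made-up glosses for illustration (a real implementation, e.g. `nltk.wsd.lesk`, uses WordNet glosses):

```python
def lesk_sense(word_senses, context_words):
    """Simplified Lesk: choose the sense whose gloss shares
    the most words with the surrounding context."""
    context = set(w.lower() for w in context_words)

    def overlap(gloss):
        return len(set(gloss.lower().split()) & context)

    return max(word_senses, key=lambda sense: overlap(word_senses[sense]))

# Hypothetical glosses for "love", written for this illustration only
senses = {
    "romantic": "deep romantic affection for a person",
    "preference": "strong liking or enjoyment of a movie film or activity",
}
context = ["I", "this", "amazing", "movie"]
print(lesk_sense(senses, context))  # → preference ("movie" overlaps the gloss)
```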

🎯 Named Entity Recognition (NER)

Identify and classify named entities (people, places, organizations) in text.

"I love Marvel's amazing movie but Avengers ending was disappointing."
Entity Recognition:
I → O (Outside)
love → O (Outside)
Marvel → B-ORG (Organization - Beginning)
's → O (Outside)
amazing → O (Outside)
movie → O (Outside)
but → O (Outside)
Avengers → B-MISC (Movie title - Beginning)
ending → O (Outside)
was → O (Outside)
disappointing → O (Outside)
  • PERSON: People's names (Tom Hanks, Amitabh Bachchan)
  • ORGANIZATION: Companies (Marvel, Disney, Google)
  • LOCATION: Places (Mumbai, India, Hollywood)
  • MISC: Movies, books, products (Avengers, iPhone)
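Once a model has produced BIO tags, turning them back into entity spans is simple bookkeeping (B-X begins a span, I-X continues it, O is outside):

```python
def extract_entities(tagged_tokens):
    """Collect (entity_text, entity_type) spans from BIO-tagged tokens."""
    entities, current, etype = [], [], None
    for token, tag in tagged_tokens:
        if tag.startswith("B-"):          # a new entity begins
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [token], tag[2:]
        elif tag.startswith("I-") and current:  # current entity continues
            current.append(token)
        else:                              # outside any entity
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [], None
    if current:
        entities.append((" ".join(current), etype))
    return entities

tagged = [("I", "O"), ("love", "O"), ("Marvel", "B-ORG"), ("'s", "O"),
          ("amazing", "O"), ("movie", "O"), ("but", "O"),
          ("Avengers", "B-MISC"), ("ending", "O")]
print(extract_entities(tagged))  # → [('Marvel', 'ORG'), ('Avengers', 'MISC')]
```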

🚀 Unit 4: NLP Applications

Real-World Problem Solving

😊 Sentiment Analysis in Action

Input: "I love this amazing movie but the ending was disappointing."

Analysis Process:
✅ Positive: "love" (+2), "amazing" (+2)
❌ Negative: "disappointing" (-2)
⚠️ Contrast: "but" (indicates mixed sentiment)

Result: Mixed/Neutral (Slightly Positive)
Advanced Considerations:
Aspect-based: Positive about movie, negative about ending
Context: "but" indicates the negative part may be more important
Intensity: "amazing" is stronger than "good"
Real conclusion: Mixed sentiment with contrasting aspects
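The word-score tallying above is exactly what a simple lexicon-based scorer does; the weights here mirror the slide's illustrative values (real lexicons like VADER are much larger and handle negation and intensity):

```python
# Tiny sentiment lexicon; weights match the slide's illustrative values
SENTIMENT_LEXICON = {"love": 2, "amazing": 2, "good": 1,
                     "disappointing": -2, "bad": -1}

def sentiment_score(tokens):
    """Sum the lexicon scores of all tokens (unknown words score 0)."""
    return sum(SENTIMENT_LEXICON.get(tok.lower(), 0) for tok in tokens)

tokens = "I love this amazing movie but the ending was disappointing".split()
print(sentiment_score(tokens))  # → 2  (+2 +2 -2: mixed, slightly positive)
```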

🌐 Machine Translation

Translation Process

English: "I love this amazing movie"

Word-by-word:
I → मैं (main)
love → प्यार (pyaar) / पसंद (pasand)
this → यह (yah)
amazing → अद्भुत (adbhut)
movie → फिल्म (film)

Result: "मैं इस अद्भुत फिल्म को पसंद करता हूं"
  • Word order changes (SVO → SOV)
  • Cultural context matters
  • Multiple possible translations

Translation Challenges

Ambiguity Example:
"love" in movie context:
• प्यार (romantic love) ❌
• पसंद (like/prefer) ✅
  • Context: Same word, different meanings
  • Grammar: Different sentence structures
  • Culture: Concepts that don't translate directly
  • Idioms: "Break a leg" ≠ "पैर तोड़ना"

📊 Unit 5: Statistical Methods

Traditional Machine Learning Approaches

🔗 Hidden Markov Models (HMM)

POS Tagging with HMM:

Hidden States: [PRP, VB, DT, JJ, NN]
Observations: [I, love, this, amazing, movie]

Probabilities:
P(VB|PRP) = 0.8
P(love|VB) = 0.1
P(DT|VB) = 0.3
P(this|DT) = 0.8

HMM Components

  • States: Hidden (POS tags)
  • Observations: Visible (words)
  • Transitions: P(tag₂|tag₁)
  • Emissions: P(word|tag)
Viterbi Algorithm
Finds most likely sequence of hidden states
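The Viterbi algorithm can be written in a few lines of dynamic programming. The two-state model below uses made-up probabilities just to show the mechanics:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state sequence for an observation sequence."""
    # V[t][s] = (best probability of a path ending in state s at time t, backpointer)
    V = [{s: (start_p[s] * emit_p[s].get(obs[0], 0.0), None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p][0] * trans_p[p].get(s, 0.0) * emit_p[s].get(obs[t], 0.0), p)
                for p in states
            )
            V[t][s] = (prob, prev)
    # Trace back from the best final state
    last = max(V[-1], key=lambda s: V[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))

# Toy model: two tags, two words, illustrative probabilities
states = ["PRP", "VB"]
start_p = {"PRP": 0.7, "VB": 0.3}
trans_p = {"PRP": {"PRP": 0.2, "VB": 0.8}, "VB": {"PRP": 0.6, "VB": 0.4}}
emit_p = {"PRP": {"I": 0.9, "love": 0.1}, "VB": {"I": 0.05, "love": 0.95}}
print(viterbi(["I", "love"], states, start_p, trans_p, emit_p))  # → ['PRP', 'VB']
```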

🧠 Neural NLP Revolution

Beyond Traditional Methods

🔢 Word Embeddings: Words as Vectors

Converting Words to Numbers:

"love" → [0.2, 0.8, -0.1, 0.5, 0.3]
"amazing" → [0.3, 0.7, 0.0, 0.4, 0.2]
"disappointing" → [-0.1, 0.2, 0.8, -0.3, 0.1]

Mathematical Magic:
king - man + woman ≈ queen
Why Embeddings? Traditional one-hot encoding can't capture relationships. Embeddings place similar words closer in vector space.

Popular Methods:
Word2Vec: Skip-gram and CBOW
GloVe: Global Vectors for Word Representation
FastText: Subword information
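Similarity between embeddings is usually measured with cosine similarity. Using the illustrative 5-dimensional vectors from the slide (real embeddings have hundreds of learned dimensions; these numbers are made up):

```python
import math

def cosine(u, v):
    """Cosine similarity: 1 for parallel vectors, ~0 for unrelated ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vec = {
    "love":          [0.2, 0.8, -0.1, 0.5, 0.3],
    "amazing":       [0.3, 0.7, 0.0, 0.4, 0.2],
    "disappointing": [-0.1, 0.2, 0.8, -0.3, 0.1],
}
print(round(cosine(vec["love"], vec["amazing"]), 3))        # high: similar words
print(round(cosine(vec["love"], vec["disappointing"]), 3))  # near zero: dissimilar
```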

🔄 RNNs and LSTMs: Sequential Processing

The Problem

Traditional methods process words independently, losing sequence information.

"I love this amazing movie"
Traditional:
Each word processed separately
No memory of previous words
Context lost

The Solution

RNN Processing:
h₁ = RNN("I", h₀)
h₂ = RNN("love", h₁)
h₃ = RNN("this", h₂)
h₄ = RNN("amazing", h₃)
h₅ = RNN("movie", h₄)

Final state h₅ contains information about the entire sequence!
  • RNN: Remembers previous words
  • LSTM: Better long-term memory
  • Result: Context-aware processing
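A scalar version of the recurrence makes the idea visible: each step mixes the current input with the previous hidden state. Real RNN cells use weight matrices and vector states, and the input values below are made-up stand-ins for word features:

```python
import math

def rnn_step(x, h, w_x=0.5, w_h=0.9):
    """One scalar RNN step: new hidden state blends input x with previous state h."""
    return math.tanh(w_x * x + w_h * h)

# Made-up scalar features standing in for "I love this amazing movie"
inputs = [0.2, 0.8, 0.1, 0.7, 0.5]
h = 0.0  # h₀: empty memory
for x in inputs:
    h = rnn_step(x, h)  # h carries information about all words seen so far
print(round(h, 3))  # final state summarizes the whole sequence
```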

⚡ The Transformer Era

"Attention is All You Need"

👁️ Attention Mechanism: Focus on What Matters

Self-Attention for "love" in our sentence:

"love" pays attention to:
• "I" (subject): 0.8
• "love" (self): 0.1
• "this": 0.2
• "amazing": 0.6
• "movie" (object): 0.9
• "but": 0.4
• "ending": 0.3
• "disappointing": 0.2
Key Innovation: Unlike RNNs that process sequentially, attention allows every word to "look at" every other word simultaneously.

Benefits:
• Parallelizable (faster training)
• Better long-range dependencies
• Interpretable (can see what model focuses on)
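The scores above are raw attention scores; in a real transformer they are passed through a softmax so the weights are positive and sum to 1. A minimal sketch using those illustrative scores:

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative raw attention scores for the query word "love"
keys   = ["I", "love", "this", "amazing", "movie", "but", "ending", "disappointing"]
scores = [0.8,  0.1,    0.2,    0.6,       0.9,     0.4,   0.3,      0.2]
weights = softmax(scores)
for k, w in sorted(zip(keys, weights), key=lambda p: -p[1]):
    print(f"{k:>14}: {w:.3f}")
# Weights sum to 1; "movie" and "I" receive the most attention
```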

🤖 BERT vs GPT: Two Transformer Giants

BERT

Bidirectional Encoder

Strength: Understanding context from both directions

Task: Fill in the blank
"I love this ___ movie"
BERT looks at both left and right context to predict "amazing"
  • Pre-trained on masked language modeling
  • Great for classification, QA
  • Bidirectional understanding

GPT

Autoregressive Decoder

Strength: Generating natural text continuation

Task: Continue the text
"I love this amazing movie but"
GPT generates: "the pacing could be better in some scenes"
  • Pre-trained on next word prediction
  • Excellent for text generation
  • Left-to-right processing

🦣 Large Language Models

The ChatGPT Revolution

🔥 What Makes an LLM "Large"?

175B
Parameters
GPT-3 has 175 billion parameters (connections between neurons)
570GB
Training Data
Massive text corpora from books, web, articles
$4.6M
Training Cost
Estimated cost to train GPT-3 from scratch
🚀
Emergent
Abilities that emerge at scale: reasoning, coding, creativity

🏞️ The Modern LLM Landscape

Today's landscape features powerful models from different organizations, each with unique strengths.

LLM Family Tree:

GPT Series (OpenAI):
├── GPT-3.5 → ChatGPT
├── GPT-4 → Advanced reasoning, multimodal
└── GPT-4 Turbo → Faster, cheaper

Google Family:
├── BERT → Understanding tasks
├── T5 → Text-to-text transfer transformer
├── PaLM → Pathways Language Model
└── Gemini → Multimodal competitor to GPT-4

Others:
├── Claude (Anthropic) → Constitutional AI, safety-focused
├── LLaMA (Meta) → Open research model
└── Mixtral (Mistral) → Mixture of Experts
  • ChatGPT Moment: Made AI accessible to everyone
  • Multimodal: Text + images + code understanding
  • Reasoning: Chain-of-thought, step-by-step problem solving

🛠️ LLM Applications Today

Prompt:
"Analyze this movie review: 'I love this amazing movie but the ending was disappointing.'"

LLM Response:
"This review expresses mixed sentiment. The reviewer has a positive overall opinion of the movie, using strong positive words like 'love' and 'amazing.' However, the conjunction 'but' signals a contrasting viewpoint about the ending, which they found 'disappointing.' This suggests the movie was well-executed overall but had a weak conclusion that affected their satisfaction."

Current Applications

  • Code Generation: GitHub Copilot, ChatGPT coding
  • Content Creation: Articles, emails, marketing copy
  • Customer Service: Advanced chatbots
  • Education: Tutoring, explanation, homework help
RAG Systems:
Retrieval Augmented Generation combines LLMs with knowledge bases for accurate, up-to-date information.

🔮 Future of NLP & LLMs

What's Next?

🚀 Cutting-Edge Trends

2024

Multimodal AI

GPT-4V, DALL-E integration. AI that understands text, images, audio, and video together. Creating rich, multimodal experiences.

2025

AI Agents

LLMs that can use tools, browse internet, execute code, and take actions. Moving from chatbots to autonomous assistants.

2026

Mixture of Experts

Scaling models efficiently using specialized sub-models. Better performance with lower computational cost.

2027+

Artificial General Intelligence?

AI systems that match or exceed human performance across all cognitive tasks. The ultimate goal of AI research.

💼 Career Opportunities in NLP/AI

💻
Prompt Engineer
Design effective prompts for LLMs. New field with high demand.
🔬
ML Engineer
Build and deploy NLP models in production systems.
📊
AI Researcher
Advance the field with new algorithms and techniques.
🏢
AI Product Manager
Lead AI product development and strategy.
🛡️
AI Safety Engineer
Ensure AI systems are safe, aligned, and beneficial.
💰
High Salaries
₹15-50 LPA in India, $150k-500k+ globally for AI roles.

🎯 Skills to Master

Technical Skills

  • Programming: Python, PyTorch, TensorFlow
  • NLP Libraries: NLTK, spaCy, Hugging Face
  • Cloud Platforms: AWS, GCP, Azure
  • APIs: OpenAI, Anthropic, Google AI
Hot Skill: Prompt Engineering
Learning to communicate effectively with AI systems

Soft Skills

  • Problem Solving: Breaking down complex tasks
  • Communication: Explaining AI to non-technical users
  • Ethics: Understanding AI bias and fairness
  • Continuous Learning: Field evolves rapidly
Remember: AI amplifies human capabilities but doesn't replace human judgment and creativity

🎓 Journey Summary

We've traveled from basic text processing to the frontiers of artificial intelligence!

"I love this amazing movie but the ending was disappointing."
  • Traditional NLP: Tokenization, POS tagging, parsing, statistical methods
  • Neural Revolution: Word embeddings, RNNs, LSTMs, attention mechanisms
  • Transformer Era: BERT for understanding, GPT for generation
  • LLM Explosion: ChatGPT, multimodal AI, emergent abilities
  • Future: AI agents, AGI, new career opportunities
Key Insight: Our simple movie review sentence now receives human-like analysis from AI systems that understand context, sentiment, and nuance - a journey from rule-based parsing to intelligent comprehension.
Your Next Steps: Start building, experimenting, and creating with these powerful tools. The future of human-AI collaboration is in your hands!

🙏 Thank You!

Questions & Discussion

GTU 31711105 AI NLP Course • Journey Complete! 🚀