Generative AI Course
RLHF: Reinforcement Learning from Human Feedback
Instruction fine-tuning teaches a model how to follow instructions.
RLHF teaches the model how to behave.
This step is responsible for making modern assistants feel safe, helpful, and aligned with human expectations.
The Problem RLHF Solves
After instruction tuning, models may still:
- Give overconfident wrong answers
- Respond in unsafe or biased ways
- Ignore social or ethical context
These issues cannot be fixed with raw text data alone.
Why Human Feedback Is Necessary
Some qualities are difficult to encode mathematically:
- Helpfulness
- Politeness
- Safety
- Trustworthiness
Humans provide judgment signals that raw text data alone cannot.
High-Level RLHF Pipeline
RLHF is not a single algorithm.
It is a multi-stage pipeline:
- Collect human preference data
- Train a reward model
- Optimize the base model using reinforcement learning
Each stage refines behavior further.
Step 1: Human Preference Collection
Humans are shown multiple model responses to the same prompt.
They rank responses based on quality.
Prompt: "Explain recursion"
Response A: Clear explanation with example
Response B: Vague and confusing explanation
Human choice: A > B
These rankings form the foundation of RLHF.
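The ranking above can be turned into pairwise training examples. Here is a minimal sketch in plain Python; the `rankings_to_pairs` helper and the example strings are illustrative, not from a real labeling pipeline:

```python
def rankings_to_pairs(prompt, ranked_responses):
    """Turn a human ranking (best response first) into
    (prompt, chosen, rejected) preference pairs."""
    pairs = []
    for i, chosen in enumerate(ranked_responses):
        for rejected in ranked_responses[i + 1:]:
            pairs.append({"prompt": prompt,
                          "chosen": chosen,
                          "rejected": rejected})
    return pairs

pairs = rankings_to_pairs(
    "Explain recursion",
    ["Clear explanation with example",   # ranked best
     "Vague and confusing explanation"], # ranked worst
)
print(pairs[0]["chosen"])  # the preferred response
```

With more than two responses, every ranked pair becomes a training example, which is why rankings are a data-efficient way to collect preferences.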
Step 2: Training the Reward Model
A separate neural network learns to predict human preferences.
This model assigns a reward score to outputs.
reward = reward_model(prompt, response)
Higher scores represent more human-aligned responses.
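A common way to train the reward model on preference pairs is a Bradley-Terry style pairwise loss: the chosen response should score higher than the rejected one. A minimal sketch in plain Python (`pairwise_loss` is an illustrative name; real implementations operate on batched neural-network outputs):

```python
import math

def pairwise_loss(reward_chosen, reward_rejected):
    """Bradley-Terry style loss: -log sigmoid(r_chosen - r_rejected).
    The loss shrinks as the chosen response outscores the rejected one."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A wider margin between chosen and rejected means a smaller loss
assert pairwise_loss(2.0, 0.0) < pairwise_loss(0.5, 0.0)
```

Minimizing this loss pushes the reward model to reproduce the human rankings, without ever needing an absolute "quality score" from labelers.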
What the Reward Model Learns
The reward model implicitly learns:
- Which answers are clearer
- Which tone feels appropriate
- Which content is unsafe
It becomes a proxy for human judgment.
Step 3: Reinforcement Learning Optimization
The base language model is optimized to maximize reward scores.
Proximal Policy Optimization (PPO) is commonly used.
# Maximize reward while staying close to the reference model
loss = -reward + beta * kl_penalty
optimizer.zero_grad()
loss.backward()
optimizer.step()
The KL penalty prevents the model from drifting too far from the original distribution.
Why KL Penalty Matters
Without constraints, the model may exploit flaws in the reward model, a failure known as reward hacking.
KL regularization keeps the model's language natural and its training stable.
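The penalty can be made concrete with a toy example. This is a minimal sketch in plain Python; `kl_divergence` and `ppo_objective` are hypothetical helpers, `beta` is the penalty coefficient, and real implementations compute the KL per token over whole sequences:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between two discrete token distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def ppo_objective(reward, policy_probs, ref_probs, beta=0.1):
    """Reward minus a KL penalty anchoring the policy
    to the original (reference) model's distribution."""
    return reward - beta * kl_divergence(policy_probs, ref_probs)

# If the policy has not drifted, there is no penalty at all
same = [0.5, 0.5]
print(ppo_objective(1.0, same, same))  # full reward, zero penalty
```

The further the policy drifts from the reference distribution, the larger the KL term grows, so a high reward only pays off if the text still looks like natural language.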
How RLHF Changes Model Behavior
After RLHF, models:
- Refuse unsafe requests
- Explain uncertainty
- Respond more politely
These behaviors are learned, not hardcoded.
Real-World Use of RLHF
- Chat assistants
- Customer-facing AI tools
- Enterprise copilots
Almost every deployed LLM goes through some form of RLHF or a related preference-tuning method.
Limitations of RLHF
RLHF introduces challenges:
- Human bias in feedback
- High labeling cost
- Reward hacking risks
Alignment is an ongoing process, not a one-time fix.
How Learners Should Practice RLHF Concepts
Hands-on learning focuses on:
- Comparing ranked responses
- Analyzing why one answer is better
- Understanding reward trade-offs
Even without training models, evaluation skills matter.
Practice
What signal drives RLHF?
Which model predicts human judgment?
What does RLHF primarily improve?
Quick Quiz
Which algorithm is commonly used in RLHF?
What does RLHF primarily focus on?
What prevents extreme model behavior during RLHF?
Recap: RLHF aligns model behavior with human values using feedback-driven reinforcement learning.
Next up: LoRA — efficient fine-tuning without retraining entire models.