Generative AI Course
Quantization: Making Large Language Models Smaller and Faster
Modern Large Language Models are powerful but expensive to run.
Even after efficient fine-tuning methods like LoRA, inference cost remains a major bottleneck.
Quantization reduces this cost, often without retraining the model.
The Core Problem Quantization Solves
LLMs store billions of parameters as floating-point numbers.
This leads to:
- High GPU memory usage
- Slow inference latency
- High infrastructure cost
Quantization reduces the precision of these numbers.
What Quantization Means in Practice
Quantization converts model weights from high-precision formats to lower-precision formats.
Typical transitions include:
- FP32 → FP16
- FP16 → INT8
- INT8 → INT4
Lower precision means less memory and faster computation.
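As a back-of-the-envelope illustration, here is the memory needed just to store the weights of a hypothetical 7-billion-parameter model at each precision (activations, KV cache, and runtime overhead are ignored):

```python
# Rough weight-storage cost for a 7-billion-parameter model.
# These are illustrative numbers, not measurements of a real deployment.
PARAMS = 7_000_000_000

BYTES_PER_PARAM = {
    "FP32": 4.0,
    "FP16": 2.0,
    "INT8": 1.0,
    "INT4": 0.5,  # two 4-bit values packed per byte
}

for fmt, nbytes in BYTES_PER_PARAM.items():
    gb = PARAMS * nbytes / 1e9
    print(f"{fmt}: {gb:.1f} GB")
# FP32: 28.0 GB, FP16: 14.0 GB, INT8: 7.0 GB, INT4: 3.5 GB
```

Going from FP32 to INT8 cuts weight memory by 4x; INT4 cuts it by 8x.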
Why Precision Can Be Reduced
Neural networks are tolerant of small amounts of numerical noise.
Most parameters do not require full precision to function correctly.
Quantization exploits this redundancy.
Thinking Before Applying Quantization
Before quantizing a model, engineers decide:
- Is inference latency critical?
- Is slight accuracy loss acceptable?
- What hardware will be used?
Quantization is a deployment decision, not a training decision.
Basic Weight Quantization Concept
At a high level, quantization maps floating-point values into discrete buckets.
quantized_weight = clamp(round(weight / scale) + zero_point, q_min, q_max)
The scale and zero-point map the floating-point range onto the integer range, preserving relative relationships between values; clamping keeps results within the representable integer bounds.
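The mapping above can be written out in a few lines of NumPy. This is a minimal sketch of asymmetric (affine) quantization; the `quantize`/`dequantize` function names are illustrative, not a library API:

```python
import numpy as np

def quantize(weights, num_bits=8):
    """Asymmetric affine quantization: map floats onto unsigned integers."""
    qmin, qmax = 0, 2**num_bits - 1          # 0..255 for 8 bits
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / (qmax - qmin)  # float width of one integer step
    zero_point = int(np.round(-w_min / scale))  # integer code for float 0.0
    q = np.clip(np.round(weights / scale) + zero_point, qmin, qmax)
    return q.astype(np.uint8), scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float values from the integer codes."""
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = quantize(w)
w_hat = dequantize(q, scale, zp)
print(np.abs(w - w_hat).max())  # reconstruction error, roughly scale / 2
```

The round trip is lossy: each weight moves by at most about half an integer step, which is exactly the "small numerical noise" networks tolerate.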
What Happens During Inference
During inference:
- Quantized weights are loaded
- Integer arithmetic is used
- Outputs are de-quantized when needed
On hardware with fast integer kernels, this can substantially improve throughput.
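The inference flow above can be sketched with NumPy. This is a simplified per-tensor version using symmetric quantization (no zero-point); real runtimes fuse these steps into optimized GPU/CPU kernels:

```python
import numpy as np

def quantize_sym(x, num_bits=8):
    """Symmetric quantization: float 0.0 maps to integer 0."""
    qmax = 2 ** (num_bits - 1) - 1              # 127 for INT8
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

x = np.random.randn(2, 8).astype(np.float32)    # activations
w = np.random.randn(8, 4).astype(np.float32)    # weights
y_ref = x @ w                                    # float reference

qx, sx = quantize_sym(x)
qw, sw = quantize_sym(w)
acc = qx.astype(np.int32) @ qw.astype(np.int32)  # pure integer matmul
y = acc.astype(np.float32) * (sx * sw)           # de-quantize the output

print(np.abs(y - y_ref).max())  # small, but nonzero, error
```

Note the accumulator is int32: products of int8 values overflow int8, so integer kernels widen before summing and only de-quantize at the end.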
Post-Training Quantization Example
This example shows quantizing a pretrained model without retraining.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Requires the bitsandbytes package and a CUDA GPU
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
The model weights are loaded directly in 8-bit format.
No gradient updates occur.
Quantization-Aware vs Post-Training
There are two main approaches:
- Post-training quantization
- Quantization-aware training
Post-training is simpler and widely used in production.
Quantization-aware training provides higher accuracy but requires retraining.
INT8 vs INT4 Trade-Offs
INT8 quantization:
- Minimal accuracy loss
- Good hardware support
INT4 quantization:
- Maximum memory savings
- Higher risk of degradation
Choice depends on use case.
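The degradation risk can be seen directly by round-tripping the same weights through both formats. A minimal sketch using symmetric quantization (the helper below is illustrative, not a library API):

```python
import numpy as np

def round_trip_error(x, num_bits):
    """Quantize symmetrically to num_bits, de-quantize, measure mean error."""
    qmax = 2 ** (num_bits - 1) - 1   # 127 for INT8, 7 for INT4
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return np.abs(x - q * scale).mean()

rng = np.random.default_rng(0)
w = rng.normal(size=100_000).astype(np.float32)

print("INT8 mean error:", round_trip_error(w, 8))
print("INT4 mean error:", round_trip_error(w, 4))
# INT4's coarser grid produces far larger error than INT8
```

With only 16 representable levels, INT4's integer step is much wider, so production INT4 schemes typically add tricks such as per-group scales to contain the damage.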
Real-World Systems Using Quantization
- Edge AI applications
- Mobile inference
- High-throughput chat systems
Many large-scale deployments rely on quantization in some form.
Common Pitfalls
- Quantizing sensitive layers incorrectly
- Ignoring hardware compatibility
- Assuming zero accuracy impact
Testing is mandatory.
How Learners Should Practice Quantization
Effective practice includes:
- Comparing FP16 vs INT8 inference speed
- Measuring memory usage
- Evaluating output quality differences
This builds deployment intuition.
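A minimal harness for these practice steps, using NumPy stand-ins: real speed gains require hardware integer kernels, so only memory and output quality are measured here.

```python
import numpy as np

rng = np.random.default_rng(1)
w32 = rng.normal(size=(512, 512)).astype(np.float32)

# Memory: FP32 vs FP16 vs simulated INT8 storage
w16 = w32.astype(np.float16)
scale = np.abs(w32).max() / 127
w8 = np.clip(np.round(w32 / scale), -127, 127).astype(np.int8)
print("FP32:", w32.nbytes, "bytes")
print("FP16:", w16.nbytes, "bytes")  # half the size
print("INT8:", w8.nbytes, "bytes")   # a quarter of the size

# Quality: compare a matrix product against the FP32 reference
x = rng.normal(size=(1, 512)).astype(np.float32)
y_ref = x @ w32
y_int8 = x @ (w8.astype(np.float32) * scale)  # weight-only de-quantization
rel_err = np.abs(y_int8 - y_ref).max() / np.abs(y_ref).max()
print("INT8 relative output error:", rel_err)
```

Repeating this on a real model (e.g. with a quantized checkpoint and a held-out evaluation set) turns these toy measurements into genuine deployment evidence.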
Practice
What does quantization primarily reduce?
Quantization mainly impacts which stage?
What resource is most saved by quantization?
Quick Quiz
Which quantization format balances speed and accuracy?
Quantization decisions are mainly made during?
Quantization always involves a trade-off between which two factors?
Recap: Quantization reduces model size and inference cost by lowering numerical precision.
Next up: Function Calling — connecting LLMs to real code and tools.