Quantization
Plain Explanation
Deep models are heavy: they store and move many floating‑point numbers, which inflates memory, bandwidth, and compute. Quantization addresses this by using fewer bits per number and by targeting hardware that executes low‑precision arithmetic efficiently, improving throughput and reducing memory traffic and storage. Like printing a photo with a limited palette, you map rich values onto a fixed grid.

Choose a range [xmin, xmax] and a bit‑width B (giving 2^B levels), compute a scale and a zero‑point, and round each real value to the nearest integer index. Uniform quantization sets the scale Δ = (xmax − xmin)/(2^B − 1) and forms x_q = round(x/Δ + z), where the zero‑point z is 0 for symmetric (signed) ranges and z = round(−xmin/Δ) for asymmetric (unsigned) ranges. Rounding introduces quantization error and sometimes bias; if ranges are mis‑set or the bit‑width is too low, outputs can shift.

Engineers calibrate ranges on a small dataset and pick a granularity: per‑tensor (one range) or per‑channel (separate ranges) to fit uneven distributions. They also choose symmetric vs asymmetric: symmetric maps zero exactly and keeps kernels simple, while asymmetric can better cover skewed activations (e.g., after ReLU) but adds overhead in matmuls to correct for the zero‑point. When accuracy is sensitive, teams may retrain with simulated quantization (QAT) or keep select layers at higher precision (mixed precision).
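A minimal NumPy sketch of this uniform scheme, assuming an 8‑bit unsigned (asymmetric) grid; the function names are illustrative rather than taken from any particular library:

```python
import numpy as np

def asymmetric_qparams(xmin, xmax, bits=8):
    """Scale and zero-point for an unsigned (asymmetric) B-bit grid."""
    xmin, xmax = min(xmin, 0.0), max(xmax, 0.0)   # the grid must contain real zero
    scale = (xmax - xmin) / (2 ** bits - 1)       # Δ = (xmax − xmin) / (2^B − 1)
    zero_point = int(round(-xmin / scale))        # integer index that maps back to 0.0
    return scale, zero_point

def quantize(x, scale, zero_point, bits=8):
    """x_q = clip(round(x/Δ + z), 0, 2^B − 1)."""
    q = np.round(x / scale + zero_point)
    return np.clip(q, 0, 2 ** bits - 1).astype(np.uint8)

def dequantize(q, scale, zero_point):
    """Map integer indices back to reals; the residual is the quantization error."""
    return (q.astype(np.float32) - zero_point) * scale

# Calibrate the range on a small sample, then round values onto the grid.
acts = np.random.randn(1024).astype(np.float32) * 0.5 + 0.2
scale, zp = asymmetric_qparams(float(acts.min()), float(acts.max()))
q = quantize(acts, scale, zp)
max_err = np.abs(dequantize(q, scale, zp) - acts).max()   # bounded by about Δ/2
```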
Examples & Analogies
- LLM on laptop GPUs (INT8 vs FP16): Some benchmarks report reduced inference time and memory for INT8 PTQ on models like GPT‑2, LLaMA‑2‑7B‑Chat, and Qwen1.5‑1.8B‑Chat on RTX 4070/4080 Laptop GPUs, with quality impacts varying by task; structured code often proves more sensitive than explanatory text.
- On‑device document OCR on a mobile NPU: A compact vision model is quantized to INT8 so it fits memory and runs on integer‑optimized neural units under tight power budgets; activation ranges are calibrated on a small sample (a calibration sketch follows these examples).
- CPU‑only batch scoring with a small language model: A GPT‑2–class model quantized to INT8 reduces memory and bandwidth for offline jobs, accepting minor fluency declines while preserving topical relevance for coarse filtering.
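A sketch of range calibration at the two granularities mentioned earlier, assuming NumPy and a small batch of activations shaped (samples, channels); function and variable names are illustrative.

```python
import numpy as np

def calibrate_ranges(calib_acts, per_channel=False):
    """Collect [xmin, xmax] from a small calibration sample.
    per_channel=False -> one range for the whole tensor; True -> one per channel."""
    axis = 0 if per_channel else None
    return calib_acts.min(axis=axis), calib_acts.max(axis=axis)

def scale_and_zero_point(xmin, xmax, bits=8):
    """Asymmetric (unsigned) parameters; works elementwise for per-channel ranges."""
    scale = (xmax - xmin) / (2 ** bits - 1)
    zero_point = np.round(-xmin / scale)
    return scale, zero_point

# A few hundred activations stand in for the calibration set; one channel is skewed.
calib = np.abs(np.random.randn(512, 64)).astype(np.float32)   # e.g. post-ReLU
calib[:, 0] *= 20.0                                            # uneven distribution

s_t, z_t = scale_and_zero_point(*calibrate_ranges(calib))                    # per-tensor
s_c, z_c = scale_and_zero_point(*calibrate_ranges(calib, per_channel=True))  # per-channel
# The single per-tensor scale is dominated by the widest channel, wasting levels on
# the narrow ones; per-channel scales fit each distribution separately.
```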
At a Glance
| | PTQ (Post‑Training) | QAT (Quantization‑Aware Training) | Mixed‑precision PTQ |
|---|---|---|---|
| Training need | No retraining; uses calibration | Retraining simulates quantization | No retraining; selective higher precision |
| Scale/offset setup | Calibrate ranges on a subset | Learned/adjusted during training | Calibrate; exempt some ops from low‑bit |
| Accuracy at low bits | More fragile below INT8 | Preserved better at very low bits | Improves where needed |
| Engineering cost | Low | Higher (training cost) | Medium (policy complexity) |
| Acceleration headroom | High with wide int‑kernel use | Same at inference; extra training | Lower (some ops stay high‑precision) |
| Where used | Quick deployment, pilots | When quality must match float | Production compromise |
PTQ is the fastest lever, QAT buys back accuracy at engineering cost, and mixed precision trades some speed for stability by keeping the touchy parts high‑precision.
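The "retraining simulates quantization" row is commonly implemented as a fake‑quantize step in the forward pass with a straight‑through estimator (STE) in the backward pass. A minimal PyTorch‑style sketch under that assumption, not any specific framework's API:

```python
import torch

def fake_quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Quantize-dequantize in the forward pass, so the rounding error the model
    will see at inference is already present during training."""
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax)
    x_hat = (q - zero_point) * scale
    # Straight-through estimator: gradients flow as if rounding were the identity.
    return x + (x_hat - x).detach()

# During QAT, weights/activations pass through fake_quantize in forward();
# the optimizer then adapts the float weights to the quantization grid.
w = torch.randn(16, 16, requires_grad=True)
loss = fake_quantize(w, scale=0.05, zero_point=0).pow(2).sum()
loss.backward()                      # w.grad is defined thanks to the STE
```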
Where and Why It Matters
- LLM inference on laptop GPUs: Some reports show lower inference time and memory for INT8 PTQ vs FP16 on the models and hardware above, with sensitivity varying by task and model; code generation tends to be more fragile than explanatory text.
- Structured reporting via quantization cards: “Quantization cards” make signedness, granularity, ops covered, mixed‑precision constraints, and calibration details explicit for comparability.
- Asymmetric trade‑offs in production: Asymmetric ranges can fit skewed activations, but they add zero‑point correction terms to integer matmuls; these are especially costly when the weights themselves are asymmetric (see the expansion sketched after this list).
- Mixed precision as a deployment knob: Excluding sensitive ops (e.g., residuals) from quantization preserves accuracy but caps integer‑only speedups.
- Hardware fit mindset: Gains depend on target support for low‑precision arithmetic and chosen design trade‑offs; measure on the deployment device.
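To see where the correction terms in the asymmetric bullet come from, expand one output element of an integer matmul over dequantized values; the shapes, scales, and zero‑points below are illustrative. A NumPy sketch:

```python
import numpy as np

# One output element of an integer matmul with asymmetric operands.
K = 64                                                        # reduction length
x_q = np.random.randint(0, 256, size=K).astype(np.int64)     # UINT8 activations
w_q = np.random.randint(-128, 128, size=K).astype(np.int64)  # INT8 weights
z_x, z_w = 17, 3                                              # zero-points
s_x, s_w = 0.02, 0.01                                         # scales

# Real-valued dot product over dequantized values.
reference = s_x * s_w * float(np.dot(x_q - z_x, w_q - z_w))

# Expanded form: the cheap integer accumulation plus three correction terms.
core       = float(np.dot(x_q, w_q))     # the INT8 x INT8 accumulation hardware runs fast
corr_act   = z_w * float(x_q.sum())      # needs a running sum of activations per row
corr_wt    = z_x * float(w_q.sum())      # can be precomputed when weights are static
corr_const = K * z_x * z_w
expanded   = s_x * s_w * (core - corr_act - corr_wt + corr_const)

assert np.isclose(reference, expanded)
# With symmetric weights (z_w = 0) the activation-sum term vanishes, which is why
# asymmetric weights are called out as especially costly.
```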
Common Misconceptions
- ❌ Myth: Quantization is harmless rounding. → ✅ Reality: Coarse grids add error/bias; small systematic shifts can matter.
- ❌ Myth: Asymmetric is always better. → ✅ Reality: It may fit activations better, but zero‑point corrections add overhead—especially for weights.
- ❌ Myth: One benchmark tells the story. → ✅ Reality: Results hinge on granularity, op coverage, mixed precision, calibration, and hardware; report these explicitly.
How It Sounds in Conversation
- "A/B INT8 PTQ vs FP16 on the RTX 4080 Laptop GPU; log throughput, latency, and a small quality set for text vs code."
- "Compare per‑channel vs per‑tensor calibration and post the accuracy deltas at fixed batch size."
- "Run symmetric vs asymmetric activations and record extra zero‑point correction time in matmuls."
- "If PTQ misses quality, scope a short QAT run; keep sensitive layers mixed precision as fallback."
- "Publish a one‑page quantization card with bit‑widths, ops covered, calibration size, and exact hardware."
References
- Confounding Tradeoffs for Neural Network Quantization
Defines uniform quantization math and details symmetric/asymmetric and per‑tensor/per‑channel trade‑offs.
- Quantization without Tears
Positions quantization as standard compression/acceleration; discusses method complexity and robustness.
- Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview
Survey of PTQ vs QAT and methods for large models; context on efficiency and accuracy challenges.
- Quantized Transformers in Practice: Benchmarking Full- and Low-Precision LLMs
INT8 PTQ vs FP16 on GPT‑2, LLaMA‑2‑7B‑Chat, Qwen1.5‑1.8B on RTX 4070/4080; efficiency/quality trade‑offs.
- Quantization in AI: Techniques, Benefits, Trade-offs & Modern Architectures
Overview of PTQ and QAT, with deployment context and typical usage environments.
- The Complete Guide to LLM Quantization with vLLM - Jarvis Labs
Hands‑on deployment and benchmarking notes; practical flags and throughput/latency testing guidance.