Vol.01 · No.10 CS · AI · Infra May 13, 2026

AI Glossary

Infra & Hardware · Deep Learning

Quantization

Plain Explanation

Deep models are heavy: they store many floating‑point numbers, which inflate memory, bandwidth, and compute. Quantization addresses this by using fewer bits per number and relying on hardware that executes low‑precision arithmetic efficiently, improving throughput and reducing memory traffic and storage. Like printing a photo with a limited palette, you map rich values onto a fixed grid.

Concretely: choose a range [xmin, xmax] and a bit‑width B (giving 2^B levels), compute a scale and a zero‑point, and round each real value to the nearest integer index. Uniform quantization sets the scale as Δ = (xmax − xmin)/(2^B − 1) and forms x_q = round(x/Δ + b), where the offset b (the zero‑point) depends on whether the mapping is symmetric (signed integers, b = 0) or asymmetric (unsigned integers).

Rounding introduces quantization error and sometimes bias; if ranges are mis‑set or the bit‑width is too low, outputs can shift. Engineers calibrate ranges on a small dataset and pick a granularity: per‑tensor (one range for the whole tensor) or per‑channel (a separate range per channel) to fit uneven distributions. They also choose between symmetric and asymmetric mappings: symmetric represents zero exactly and keeps kernels simple, while asymmetric can better cover skewed activations (e.g., after ReLU) but adds overhead in matmuls to correct for the zero‑point. When accuracy is sensitive, teams may retrain with simulated quantization (quantization‑aware training, QAT) or keep select layers at higher precision (mixed precision).
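The uniform scheme above fits in a few lines. This is a minimal NumPy sketch, not any particular library's API; the function names are illustrative:

```python
import numpy as np

def quantize_uniform(x, bits=8, symmetric=True):
    """Uniform quantization: pick a range, derive scale/zero-point, round."""
    if symmetric:
        # Symmetric: range centered on zero, so real 0 maps exactly to integer 0.
        xmax = np.max(np.abs(x))
        qmax = 2 ** (bits - 1) - 1              # e.g. 127 for INT8
        scale = xmax / qmax
        zero_point = 0
        q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    else:
        # Asymmetric: full [xmin, xmax] mapped onto unsigned [0, 2^B - 1].
        xmin, xmax = np.min(x), np.max(x)
        levels = 2 ** bits - 1
        scale = (xmax - xmin) / levels
        zero_point = int(round(-xmin / scale))
        q = np.clip(np.round(x / scale) + zero_point, 0, levels).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map integer indices back to (approximate) real values."""
    return (q.astype(np.float32) - zero_point) * scale

x = np.random.randn(1000).astype(np.float32)
q, s, zp = quantize_uniform(x, bits=8, symmetric=True)
x_hat = dequantize(q, s, zp)
err = np.max(np.abs(x - x_hat))  # for in-range values, bounded by scale/2
```

Note the trade‑off the prose describes: the symmetric branch skips the zero‑point entirely (b = 0), while the asymmetric branch spends its levels covering exactly [xmin, xmax] at the cost of carrying a zero‑point through later arithmetic.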

Examples & Analogies

  • LLM on laptop GPUs (INT8 vs FP16): Some benchmarks report reduced inference time and memory for INT8 PTQ on models like GPT‑2, LLaMA‑2‑7B‑Chat, and Qwen1.5‑1.8B‑Chat on RTX 4070/4080 Laptop GPUs, with quality impacts varying by task; structured code often proves more sensitive than explanatory text.
  • On‑device document OCR on a mobile NPU: A compact vision model is quantized to INT8 so it fits memory and runs on integer‑optimized neural units under tight power budgets; activation ranges are calibrated on a small sample.
  • CPU‑only batch scoring with a small language model: A GPT‑2–class model quantized to INT8 reduces memory and bandwidth for offline jobs, accepting minor fluency declines while preserving topical relevance for coarse filtering.
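The calibration‑granularity choice mentioned in the OCR example (per‑tensor vs per‑channel) can be seen on a toy weight matrix. This sketch is contrived (one channel is scaled 10× to make the effect visible) and does not reflect any specific toolchain:

```python
import numpy as np

np.random.seed(0)
W = np.random.randn(4, 64).astype(np.float32)
W[0] *= 10.0  # one output channel with a much wider range than the others

qmax = 127  # symmetric INT8

# Per-tensor: a single scale must cover the widest channel,
# wasting resolution on the narrow ones.
scale_t = np.abs(W).max() / qmax
err_t = np.abs(W - np.round(W / scale_t) * scale_t).mean()

# Per-channel: one scale per output row fits uneven distributions.
scale_c = np.abs(W).max(axis=1, keepdims=True) / qmax
err_c = np.abs(W - np.round(W / scale_c) * scale_c).mean()

# On skewed weights like these, err_c comes out well below err_t.
```

The cost of per‑channel granularity is bookkeeping (a vector of scales instead of one scalar), which most integer kernels absorb easily for weights.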

At a Glance

| Aspect | PTQ (Post‑Training) | QAT (Quantization‑Aware Training) | Mixed‑precision PTQ |
| --- | --- | --- | --- |
| Training need | No retraining; uses calibration | Retraining simulates quantization | No retraining; selective higher precision |
| Scale/offset setup | Calibrate ranges on a subset | Learned/adjusted during training | Calibrate; exempt some ops from low‑bit |
| Accuracy at low bits | More fragile below INT8 | Better preserved at very low bits | Improves where needed |
| Engineering cost | Low | Higher (training cost) | Medium (policy complexity) |
| Acceleration headroom | High with wide int‑kernel use | Same at inference; extra training cost | Lower (some ops stay high‑precision) |
| Where used | Quick deployment, pilots | When quality must match float | Production compromise |

PTQ is the fastest lever, QAT buys back accuracy at engineering cost, and mixed precision trades some speed for stability by keeping the touchy parts high‑precision.
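The "simulated quantization" that QAT relies on is commonly implemented as a fake‑quant op: quantize, then immediately dequantize, so the float forward pass sees the rounding error that inference will incur. A minimal sketch (no specific framework's API; gradient handling via the straight‑through estimator is only noted in the comment):

```python
import numpy as np

def fake_quant(x, scale, qmin=-128, qmax=127):
    """Quantize then dequantize in one step. During QAT, the forward pass
    uses this so the loss reflects rounding error, while backprop typically
    treats round() as identity (the straight-through estimator)."""
    q = np.clip(np.round(x / scale), qmin, qmax)
    return q * scale

# A training loop would wrap weights/activations with fake_quant; at export
# time the same scale produces real integer tensors.
x = np.linspace(-1, 1, 9).astype(np.float32)
y = fake_quant(x, scale=1 / 127)
```

Because the model trains against its own quantization noise, it can recover accuracy that plain PTQ loses at very low bit‑widths, which is exactly the trade‑off in the table above.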

Where and Why It Matters

  • LLM inference on laptop GPUs: Some reports show lower inference time and memory for INT8 PTQ vs FP16 on the above models and hardware, with task/model sensitivity differences; code generation tends to be more fragile.
  • Structured reporting via quantization cards: “Quantization cards” make signedness, granularity, ops covered, mixed‑precision constraints, and calibration details explicit for comparability.
  • Asymmetric trade‑offs in production: Asymmetric ranges can fit skewed activations but add correction terms—especially costly when weights are asymmetric.
  • Mixed precision as a deployment knob: Excluding sensitive ops (e.g., residuals) from quantization preserves accuracy but caps integer‑only speedups.
  • Hardware fit mindset: Gains depend on target support for low‑precision arithmetic and chosen design trade‑offs; measure on the deployment device.
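The zero‑point correction behind the asymmetric trade‑off above can be made concrete: with an activation zero‑point z_a, the integer dot product picks up an extra term. A toy NumPy sketch, assuming symmetric weights (weight zero‑point 0); the variable names are ours:

```python
import numpy as np

np.random.seed(0)
a = np.random.rand(8).astype(np.float32)   # non-negative, post-ReLU-like
w = np.random.randn(8).astype(np.float32)

# Asymmetric activations: [amin, amax] -> [0, 255]
amin, amax = float(a.min()), float(a.max())
s_a = (amax - amin) / 255
z_a = int(round(-amin / s_a))
q_a = np.clip(np.round(a / s_a) + z_a, 0, 255).astype(np.int32)

# Symmetric weights: zero-point 0
s_w = float(np.abs(w).max()) / 127
q_w = np.round(w / s_w).astype(np.int32)

# a·w ≈ s_a*s_w * ( q_a·q_w  -  z_a * sum(q_w) )
#                  ^ integer dot  ^ correction; precomputable from weights
approx = s_a * s_w * (int(q_a @ q_w) - z_a * int(q_w.sum()))
exact = float(a @ w)
# With asymmetric weights as well, a data-dependent cross term
# z_w * sum(q_a) would also appear, which is why asymmetric weights
# are the costlier case called out above.
```

The correction for symmetric weights is a per‑row constant, so kernels can fold it into the bias; asymmetric weights force per‑call work, matching the "especially costly" caveat.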

Common Misconceptions

  • ❌ Myth: Quantization is harmless rounding. → ✅ Reality: Coarse grids add error/bias; small systematic shifts can matter.
  • ❌ Myth: Asymmetric is always better. → ✅ Reality: It may fit activations better, but zero‑point corrections add overhead—especially for weights.
  • ❌ Myth: One benchmark tells the story. → ✅ Reality: Results hinge on granularity, op coverage, mixed precision, calibration, and hardware; report these explicitly.

How It Sounds in Conversation

  • "A/B INT8 PTQ vs FP16 on the RTX 4080 Laptop GPU; log throughput, latency, and a small quality set for text vs code."
  • "Compare per‑channel vs per‑tensor calibration and post the accuracy deltas at fixed batch size."
  • "Run symmetric vs asymmetric activations and record extra zero‑point correction time in matmuls."
  • "If PTQ misses quality, scope a short QAT run; keep sensitive layers mixed precision as fallback."
  • "Publish a one‑page quantization card with bit‑widths, ops covered, calibration size, and exact hardware."
