AI Inference
Plain Explanation
AI inference is the stage where a trained model is actually used. When a user asks a chatbot a question, the system tokenizes the input, runs the model, and streams generated tokens back. The model is not being retrained during that request; it is using fixed weights to compute an output.
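To make this concrete, here is a minimal sketch of one inference request in Python, assuming the Hugging Face transformers library and a small GPT-2 checkpoint; both are illustrative choices, not requirements:

```python
# A minimal sketch of a single inference request (illustrative model choice).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()  # inference mode: the weights stay fixed

# Tokenize the input, run the model, decode the generated tokens.
inputs = tokenizer("What is AI inference?", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```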
Examples & Analogies
If training is teaching a chef, inference is dinner service. A classifier labeling a new image, a recommender ranking products, a speech model transcribing audio, and an LLM generating an answer are all inference workloads.
At a Glance
| Dimension | Training | Inference |
|---|---|---|
| Goal | Learn model weights | Produce outputs for new inputs |
| Main cost | Data, training time, accelerators | Per-request compute, GPU memory, traffic volume |
| Weight updates | Yes | No (weights stay fixed) |
| Common metrics | loss, accuracy, validation score | TTFT, tokens/s, p95 latency, cost/request |
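As a worked example of the inference-side metrics, the snippet below computes p95 latency from a list of request timings and derives TTFT and decode throughput. Every number is made up for illustration:

```python
import statistics

# Illustrative per-request latencies for one service, in milliseconds.
request_latencies_ms = [120, 135, 150, 180, 210, 240, 300, 320, 400, 950]
p95 = statistics.quantiles(request_latencies_ms, n=100)[94]  # 95th percentile
print(f"p95 latency: {p95:.0f} ms")

# TTFT (time to first token) and decode throughput for one streamed response.
ttft_s = 0.25    # prompt prefill until the first token arrives (assumed)
total_s = 2.25   # time until the end of the stream (assumed)
new_tokens = 100
decode_tps = new_tokens / (total_s - ttft_s)
print(f"TTFT: {ttft_s * 1000:.0f} ms, decode throughput: {decode_tps:.0f} tokens/s")
```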
Where and Why It Matters
Inference is where an AI product spends money repeatedly: training is paid for up front, but the model runs again on every user request, so the cost scales with traffic. That makes GPU memory, batching, KV caching, quantization, autoscaling, and serving runtimes central to product economics.
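A back-of-envelope sketch shows why this matters. Every number below is an assumed placeholder, not a real price or benchmark:

```python
# Back-of-envelope inference economics; all figures are illustrative assumptions.
gpu_cost_per_hour = 2.00   # assumed accelerator rental price, USD
throughput_tps = 1000      # assumed aggregate tokens/s with batching
tokens_per_request = 500   # assumed prompt + generated tokens

tokens_per_hour = throughput_tps * 3600
cost_per_token = gpu_cost_per_hour / tokens_per_hour
cost_per_request = cost_per_token * tokens_per_request

requests_per_day = 1_000_000
print(f"cost/request: ${cost_per_request:.6f}")
print(f"daily cost at {requests_per_day:,} requests: "
      f"${cost_per_request * requests_per_day:,.2f}")
```

Under these assumptions, a million requests a day costs a few hundred dollars, and doubling batched throughput or halving tokens per request halves that bill, which is why the techniques listed above matter.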
Common Misconceptions
- Myth: Inference is just one forward pass.
- Reality: Production inference includes batching, caching, streaming, routing, fallbacks, and monitoring; a toy batching loop appears after this list.
- Myth: A larger model is always the better inference choice.
- Reality: Latency and cost constraints can make smaller, distilled, or quantized models better.
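To illustrate just the batching piece, here is a toy dynamic-batching loop. It is a sketch of a hypothetical queue-based server, not any particular serving runtime's implementation:

```python
# Toy dynamic batching: collect waiting requests into one forward pass.
import queue
import time

MAX_BATCH = 8
MAX_WAIT_S = 0.01  # wait briefly so more requests can join the batch

requests: "queue.Queue[str]" = queue.Queue()

def run_model(batch):
    # Stand-in for the real forward pass over a batch of prompts.
    return [f"output for {prompt!r}" for prompt in batch]

def serve_forever():
    while True:
        batch = [requests.get()]  # block until at least one request arrives
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break
        for output in run_model(batch):  # one forward pass serves several users
            print(output)
```

Real serving runtimes such as vLLM and Text Generation Inference layer continuous batching, KV-cache management, and token streaming on top of this basic idea.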
How It Sounds in Conversation
- "The model quality is fine, but p95 latency is above the product target."
- "Prefill looks acceptable; decode tokens per second is the bottleneck."
- "At scale, inference cost repeats on every user request."
References
- What Is AI Inference?
  Defines inference as running a trained model on new inputs to produce outputs.
- What is AI Inference?
  Explains AI inference and contrasts it with training.
- Text Generation Inference Documentation
  Shows production-oriented LLM inference serving concepts.
- vLLM Documentation
  Documents an LLM serving runtime focused on throughput and latency.