Inference
Plain Explanation
Inference is the stage where a trained model applies what it has learned to a new input and produces an output. Classifying a cat photo, answering a user question, and predicting sentiment are all inference. For LLMs, the prompt is tokenized, the model processes the resulting context, and output tokens are predicted one at a time, with each new token appended to the context before the next is predicted.
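The token-by-token loop can be sketched in a few lines of Python. This is a toy illustration, not how any particular library implements generation: `VOCAB`, `tokenize`, `toy_model`, and `generate` are hypothetical names, the "model" is a hand-written scoring function standing in for a trained network, and greedy selection is only one of several possible decoding strategies.

```python
from typing import Callable, Dict, List

# Hypothetical toy vocabulary; real tokenizers use subword units.
VOCAB: List[str] = ["<eos>", "the", "cat", "sat", "down"]
TOKEN_ID: Dict[str, int] = {tok: i for i, tok in enumerate(VOCAB)}


def tokenize(text: str) -> List[int]:
    """Map whitespace-separated words to token ids (toy tokenizer)."""
    return [TOKEN_ID[word] for word in text.split()]


def toy_model(context: List[int]) -> List[float]:
    """Stand-in for a trained model: return one score per vocabulary entry.

    It favours the token that follows the last context token in VOCAB
    order, wrapping around to <eos> after the final entry.
    """
    next_id = (context[-1] + 1) % len(VOCAB)
    return [1.0 if i == next_id else 0.0 for i in range(len(VOCAB))]


def generate(
    prompt: str,
    model: Callable[[List[int]], List[float]],
    max_new_tokens: int = 8,
) -> str:
    """Greedy decoding: pick the highest-scoring token, append, repeat."""
    context = tokenize(prompt)
    output: List[int] = []
    for _ in range(max_new_tokens):
        scores = model(context)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if VOCAB[next_id] == "<eos>":
            break  # the model signalled the end of the sequence
        output.append(next_id)
        context.append(next_id)  # the new token becomes part of the context
    return " ".join(VOCAB[i] for i in output)


print(generate("the cat", toy_model))
```

Because each step feeds the previous output back in as context, generation time grows with the number of output tokens, which is why `max_new_tokens` acts as a budget.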
Examples & Analogies
- Exam room: training is studying; inference is solving the problem in front of you.
- Support chatbot: the model receives a customer question and generates a response.
- Image classification: the model receives an image and returns a label plus confidence.
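For the image classification case, "a label plus confidence" commonly means taking the model's raw scores (logits), converting them to probabilities with softmax, and reporting the top entry. The sketch below assumes that setup; `classify`, the label names, and the logit values are made up for illustration, and a real model would produce the logits from the image.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits: Sequence[float], labels: Sequence[str]) -> Tuple[str, float]:
    """Return the most likely label and its softmax probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Hypothetical logits, as if produced by a model for one image.
label, confidence = classify([2.0, 0.5, 0.1], ["cat", "dog", "bird"])
print(f"{label} ({confidence:.2f})")
```

The softmax probability is a convenient confidence score, but it is not guaranteed to be well calibrated.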
At a Glance
| Stage | What happens | Main metrics |
|---|---|---|
| Training | Learn model parameters | loss, eval score, training cost |
| Fine-tuning | Adapt to a task or domain | validation score, overfitting |
| Inference | Compute outputs for new inputs | latency, throughput, accuracy |
| Model Serving | Operate inference as a service | uptime, scaling, cost |
Where and Why It Matters
Most of what users experience in an AI product is inference. Even a well-trained model makes for a poor product if inference is slow, expensive, or unreliable. In LLM systems, teams typically track time to first token (TTFT), generated tokens per second, concurrent requests, GPU memory use, and timeout rates.
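Time to first token and tokens per second can both be measured from the client side of any streaming response. The following is a minimal sketch under assumptions: `fake_token_stream` is a hypothetical stand-in for a real streaming API, and `measure_stream` is an illustrative helper rather than part of any serving framework.

```python
import time
from typing import Dict, Iterable, Iterator


def fake_token_stream(n_tokens: int, delay_s: float = 0.001) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM response."""
    for i in range(n_tokens):
        time.sleep(delay_s)  # simulate per-token generation time
        yield f"tok{i}"


def measure_stream(stream: Iterable[str]) -> Dict[str, float]:
    """Consume a token stream and report basic latency metrics."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        count += 1
    end = time.perf_counter()
    if first_token_at is None:
        raise ValueError("stream produced no tokens")

    decode_time = end - first_token_at
    # Decode rate: tokens after the first, over the time spent producing them.
    rate = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return {
        "ttft_s": first_token_at - start,
        "tokens": count,
        "tokens_per_s": rate,
        "total_s": end - start,
    }


metrics = measure_stream(fake_token_stream(20))
print(metrics)
```

Separating the first token from the rest matters because the two phases behave differently: the wait for the first token includes processing the whole prompt, while the decode rate reflects steady-state generation.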
Common Misconceptions
- “Inference is just one model call” → tokenizer, batching, cache, scheduler, and postprocessing are part of the runtime path.
- “Good training results guarantee good inference” → serving setup, context length, quantization, and batch size affect quality, latency, and cost.
- “Only latency matters” → throughput, cost, reliability, and output quality also matter.
- “A bigger GPU is always faster” → memory bandwidth, batch shape, KV cache, and framework optimizations matter too.
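The KV cache point can be made concrete with back-of-the-envelope arithmetic. For a standard transformer decoder, each token stores one key and one value vector per layer and per KV head. The sketch below uses that estimate; the model shape and memory budget are hypothetical, and the calculation ignores model weights, activations, and allocator overhead.

```python
def kv_cache_bytes_per_token(
    n_layers: int, n_kv_heads: int, head_dim: int, bytes_per_elem: int = 2
) -> int:
    """Bytes of KV cache one token occupies (keys + values, all layers)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_concurrent_sequences(
    budget_bytes: int, seq_len: int, bytes_per_token: int
) -> int:
    """How many full-length sequences fit in the KV-cache memory budget."""
    return budget_bytes // (seq_len * bytes_per_token)


# Hypothetical model shape: 32 layers, 32 KV heads of dimension 128, fp16.
per_token = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)
print(per_token / 1024**2, "MiB per token")  # 0.5 MiB per token

# Assumed budget: 16 GiB left for the cache, 4096-token sequences.
print(max_concurrent_sequences(16 * 1024**3, 4096, per_token))  # 8
```

Under these assumptions, each 4096-token request needs 2 GiB of cache, so memory caps concurrency regardless of how much compute is available.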
How It Sounds in Conversation
- “Accuracy is fine; P95 inference latency is the bottleneck.”
- “Decode time grows with output length, so we need an output token budget.”
- “Larger batches improve throughput but can hurt per-request latency.”
- “KV-cache memory is limiting concurrent requests.”
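Two of these statements are easy to make concrete: what "P95 latency" means, and why larger batches trade latency for throughput. The sketch below uses the nearest-rank percentile definition and a deliberately simple linear cost model. The function names and the `fixed_ms` and `per_item_ms` values are assumptions for illustration, not measurements from any real system.

```python
import math
from typing import Sequence


def percentile_nearest_rank(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest value with at least p% at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def batch_latency_ms(
    batch_size: int, fixed_ms: float = 20.0, per_item_ms: float = 2.0
) -> float:
    """Assumed cost model: fixed overhead plus a linear per-request cost."""
    return fixed_ms + per_item_ms * batch_size


def throughput_rps(batch_size: int) -> float:
    """Requests completed per second when batches run back to back."""
    return batch_size / batch_latency_ms(batch_size) * 1000.0


latencies_ms = list(range(1, 101))  # hypothetical samples: 1 ms to 100 ms
print(percentile_nearest_rank(latencies_ms, 95))  # 95

for size in (1, 8, 32):
    print(size, batch_latency_ms(size), round(throughput_rps(size), 1))
```

In this model, every request in a batch waits for the whole batch, so latency rises with batch size while the fixed overhead is amortized across more requests and throughput improves.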
References
- torch.inference_mode
Official docs explaining inference runtime mode, gradient tracking, and the distinction from model.eval().
- Text Generation Inference
Official docs for LLM inference serving, streaming, batching, tensor parallelism, and metrics.
- OpenAI-Compatible Server
Explains vLLM HTTP inference serving and OpenAI-compatible API server workflows.
- Inference Protocols and APIs
Official docs for inference requests, input/output tensors, HTTP/gRPC endpoints, and async responses.
- KServe
Documents Kubernetes-based scalable inference, autoscaling, and GPU serving context.