Inference

Plain Explanation

Inference is the stage where a trained model applies what it has learned to a new input and produces an output. Classifying a cat photo, answering a user question, and predicting sentiment are all inference. For LLMs, the prompt is tokenized, the model processes that context, and output tokens are predicted one at a time, each conditioned on the prompt plus the tokens generated so far.
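The token-by-token loop described above can be sketched in a few lines. This is a toy illustration, not a real model: the vocabulary, `tokenize`, and `toy_next_token_scores` are hypothetical stand-ins for a learned tokenizer and a model forward pass, and greedy selection is only one of several possible decoding strategies.

```python
from typing import Dict, List

# Hypothetical toy vocabulary; a real tokenizer is learned from data.
VOCAB: Dict[str, int] = {"<eos>": 0, "the": 1, "cat": 2, "sat": 3, "down": 4}
ID_TO_TOKEN: Dict[int, str] = {i: t for t, i in VOCAB.items()}


def tokenize(text: str) -> List[int]:
    """Map whitespace-separated words to token ids (toy tokenizer)."""
    return [VOCAB[word] for word in text.lower().split()]


def toy_next_token_scores(context: List[int]) -> List[float]:
    """Stand-in for a model forward pass: one score per vocabulary entry.

    This toy rule simply favors the id that follows the last token,
    wrapping around to <eos>.
    """
    target = (context[-1] + 1) % len(VOCAB)
    return [1.0 if i == target else 0.0 for i in range(len(VOCAB))]


def generate(prompt: str, max_new_tokens: int = 8) -> str:
    """Greedy autoregressive decoding: pick the top token, append, repeat."""
    context = tokenize(prompt)
    output: List[str] = []
    for _ in range(max_new_tokens):
        scores = toy_next_token_scores(context)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if next_id == VOCAB["<eos>"]:
            break  # the model signaled that the output is complete
        context.append(next_id)  # the new token becomes part of the context
        output.append(ID_TO_TOKEN[next_id])
    return " ".join(output)


print(generate("the cat"))
```

Because every new token is appended to the context before the next prediction, generation cost grows with output length, which is why output token budgets come up in practice.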

Examples & Analogies

Exam room: training is studying; inference is solving the problem in front of you.
Support chatbot: the model receives a customer question and generates a response.
Image classification: the model receives an image and returns a label plus confidence.
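For the image classification example, one common way to turn raw model scores into "a label plus confidence" is to apply softmax and report the top class. The sketch below assumes the model has already produced one score (logit) per class; the logits and label names are made up for illustration, and a softmax probability is only a rough proxy for how reliable a prediction is.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits: Sequence[float], labels: Sequence[str]) -> Tuple[str, float]:
    """Return the highest-probability label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Hypothetical logits for a single image.
label, confidence = classify([2.0, 0.5, 0.1], ["cat", "dog", "bird"])
print(label, round(confidence, 3))
```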

At a Glance

Stage	What happens	Main metrics
Training	Learn model parameters	loss, eval score, training cost
Fine-tuning	Adapt to a task or domain	validation score, overfitting
Inference	Compute outputs for new inputs	latency, throughput, accuracy
Model Serving	Operate inference as a service	uptime, scaling, cost

Where and Why It Matters

Most of what users experience in an AI product is inference. Even a well-trained model makes for a poor product if inference is slow, expensive, or unreliable. In LLM systems, teams track time to first token, generated tokens per second, concurrent requests, GPU memory, and timeouts.
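Two of these metrics, time to first token and generated tokens per second, can be derived from token arrival timestamps on a streaming response. The following is a minimal sketch under stated assumptions: `streaming_metrics` is a hypothetical helper, the timestamps are invented, and decode speed is measured over the tokens after the first one (definitions vary between tools).

```python
from typing import Dict, Sequence


def streaming_metrics(request_start: float, token_times: Sequence[float]) -> Dict[str, float]:
    """Compute latency metrics from the arrival time (in seconds) of each token."""
    if not token_times:
        raise ValueError("no tokens were generated")
    ttft = token_times[0] - request_start            # time to first token
    decode_window = token_times[-1] - token_times[0]  # time spent after the first token
    tokens_after_first = len(token_times) - 1
    tokens_per_s = tokens_after_first / decode_window if decode_window > 0 else 0.0
    return {
        "ttft_s": ttft,
        "tokens_per_s": tokens_per_s,
        "total_s": token_times[-1] - request_start,
    }


# Hypothetical trace: first token after 0.5 s, then one token every 0.1 s.
print(streaming_metrics(0.0, [0.5, 0.6, 0.7, 0.8, 0.9]))
```

In a real client the timestamps would come from something like `time.perf_counter()` recorded as each streamed chunk arrives.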

Common Misconceptions

“Inference is just one model call” → tokenizer, batching, cache, scheduler, and postprocessing are part of the runtime path.
“Good training results guarantee good inference” → serving setup, context length, quantization, and batch size affect quality and cost.
“Only latency matters” → throughput, cost, reliability, and output quality also matter.
“A bigger GPU is always faster” → memory bandwidth, batch shape, KV cache, and framework optimizations matter too.
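The role of batch size in these trade-offs can be illustrated with a toy cost model. The sketch assumes each batch pays a fixed overhead plus a per-request cost; the linear form and the default numbers are assumptions chosen for illustration, not measurements of any real system.

```python
def batch_latency_s(batch_size: int, fixed_s: float = 0.020, per_item_s: float = 0.002) -> float:
    """Time to finish one batch under an assumed linear cost model."""
    return fixed_s + per_item_s * batch_size


def batch_throughput(batch_size: int, fixed_s: float = 0.020, per_item_s: float = 0.002) -> float:
    """Requests completed per second when batches run back to back."""
    return batch_size / batch_latency_s(batch_size, fixed_s, per_item_s)


for size in (1, 8, 32):
    latency_ms = batch_latency_s(size) * 1000
    print(f"batch={size:>2}  latency={latency_ms:.0f} ms  throughput={batch_throughput(size):.0f} req/s")
```

Under this model, larger batches amortize the fixed overhead, so throughput rises, while every request in the batch waits for the whole batch, so per-request latency rises too.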

How It Sounds in Conversation

“Accuracy is fine; P95 inference latency is the bottleneck.”
“Decode time grows with output length, so we need an output token budget.”
“Larger batches improve throughput but can hurt per-request latency.”
“KV-cache memory is limiting concurrent requests.”
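The last remark can be made concrete with a back-of-the-envelope estimate. For a standard transformer, each cached token stores one key and one value vector per layer per KV head. The model shape and memory budget below are hypothetical, and the estimate ignores weights, activations, and allocator overhead, so treat it as a rough upper bound rather than a capacity plan.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV cache per token; the factor 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_concurrent_requests(kv_budget_bytes: int, tokens_per_request: int,
                            bytes_per_token: int) -> int:
    """How many full-length requests fit in the memory set aside for KV cache."""
    return kv_budget_bytes // (tokens_per_request * bytes_per_token)


# Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, 16-bit values.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
budget = 16 * 1024**3  # assume 16 GiB is left for KV cache after loading weights
print(per_token // 1024, "KiB per token")
print(max_concurrent_requests(budget, tokens_per_request=4096, bytes_per_token=per_token),
      "concurrent 4096-token requests")
```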

