AI Inference
Plain Explanation
AI inference is the stage where a trained model is actually used. When a user asks a chatbot a question, the system tokenizes the input, runs the model, and streams generated tokens back. The model is not being retrained during that request; it is using fixed weights to compute an output.
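To make this concrete, here is a minimal sketch of one inference request in Python, assuming the Hugging Face transformers library and a small GPT-2 checkpoint; both are illustrative choices, not requirements:

```python
# A minimal sketch of a single inference request (illustrative model choice).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()  # inference mode: the weights stay fixed

# Tokenize the input, run the model, decode the generated tokens.
inputs = tokenizer("What is AI inference?", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```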
Examples & Analogies
If training is teaching a chef, inference is dinner service. A classifier labeling a new image, a recommender ranking products, a speech model transcribing audio, and an LLM generating an answer are all inference workloads.
At a Glance
| Dimension | Training | Inference |
|---|---|---|
| Goal | Learn model weights | Produce outputs for new inputs |
| Main cost | Data, training time, accelerators | Per-request compute, GPU memory, traffic volume |
| Weight updates | Yes | No (weights stay fixed) |
| Common metrics | loss, accuracy, validation score | TTFT, tokens/s, p95 latency, cost/request |
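As a worked example of the inference-side metrics, the snippet below computes p95 latency from a list of request timings and derives TTFT and decode throughput. Every number is made up for illustration:

```python
import statistics

# Illustrative per-request latencies for one service, in milliseconds.
request_latencies_ms = [120, 135, 150, 180, 210, 240, 300, 320, 400, 950]
p95 = statistics.quantiles(request_latencies_ms, n=100)[94]  # 95th percentile
print(f"p95 latency: {p95:.0f} ms")

# TTFT (time to first token) and decode throughput for one streamed response.
ttft_s = 0.25    # prompt prefill until the first token arrives (assumed)
total_s = 2.25   # time until the end of the stream (assumed)
new_tokens = 100
decode_tps = new_tokens / (total_s - ttft_s)
print(f"TTFT: {ttft_s * 1000:.0f} ms, decode throughput: {decode_tps:.0f} tokens/s")
```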
Where and Why It Matters
Inference is where an AI product spends money repeatedly: training is paid for up front, but the model runs again on every user request, so the cost scales with traffic. That makes GPU memory, batching, KV caching, quantization, autoscaling, and serving runtimes central to product economics.
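A back-of-envelope sketch shows why this matters. Every number below is an assumed placeholder, not a real price or benchmark:

```python
# Back-of-envelope inference economics; all figures are illustrative assumptions.
gpu_cost_per_hour = 2.00   # assumed accelerator rental price, USD
throughput_tps = 1000      # assumed aggregate tokens/s with batching
tokens_per_request = 500   # assumed prompt + generated tokens

tokens_per_hour = throughput_tps * 3600
cost_per_token = gpu_cost_per_hour / tokens_per_hour
cost_per_request = cost_per_token * tokens_per_request

requests_per_day = 1_000_000
print(f"cost/request: ${cost_per_request:.6f}")
print(f"daily cost at {requests_per_day:,} requests: "
      f"${cost_per_request * requests_per_day:,.2f}")
```

Under these assumptions, a million requests a day costs a few hundred dollars, and doubling batched throughput or halving tokens per request halves that bill, which is why the techniques listed above matter.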
Common Misconceptions
- Myth: Inference is just one forward pass.
- Reality: Production inference includes batching, caching, streaming, routing, fallbacks, and monitoring; a toy batching loop appears after this list.
- Myth: A larger model is always the better inference choice.
- Reality: Latency and cost constraints can make smaller, distilled, or quantized models better.
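To illustrate just the batching piece, here is a toy dynamic-batching loop. It is a sketch of a hypothetical queue-based server, not any particular serving runtime's implementation:

```python
# Toy dynamic batching: collect waiting requests into one forward pass.
import queue
import time

MAX_BATCH = 8
MAX_WAIT_S = 0.01  # wait briefly so more requests can join the batch

requests: "queue.Queue[str]" = queue.Queue()

def run_model(batch):
    # Stand-in for the real forward pass over a batch of prompts.
    return [f"output for {prompt!r}" for prompt in batch]

def serve_forever():
    while True:
        batch = [requests.get()]  # block until at least one request arrives
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break
        for output in run_model(batch):  # one forward pass serves several users
            print(output)
```

Real serving runtimes such as vLLM and Text Generation Inference layer continuous batching, KV-cache management, and token streaming on top of this basic idea.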
How It Sounds in Conversation
- "The model quality is fine, but p95 latency is above the product target."
- "Prefill looks acceptable; decode tokens per second is the bottleneck."
- "At scale, inference cost repeats on every user request."
References
- What Is AI Inference?
  Defines inference as running a trained model on new inputs to produce outputs.
- What is AI Inference?
  Explains AI inference and contrasts it with training.
- Text Generation Inference Documentation
  Shows production-oriented LLM inference serving concepts.
- vLLM Documentation
  Documents an LLM serving runtime focused on throughput and latency.