Vol.01 · No.10 · CS · AI · Infra · May 13, 2026

AI Glossary

Infra & Hardware · LLM & Generative AI

AI Inference


Plain Explanation

AI inference is the stage where a trained model is actually used. When a user asks a chatbot a question, the system tokenizes the input, runs the model, and streams generated tokens back. The model is not being retrained during that request; it is using fixed weights to compute an output.
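As a minimal sketch of what a single inference request can look like in code, here is a small Hugging Face Transformers example; the checkpoint name and generation settings are illustrative, not a recommendation:

```python
# Minimal sketch of LLM inference: fixed weights, forward-only, streamed output.
# The checkpoint ("gpt2") and max_new_tokens are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_name = "gpt2"  # any causal LM checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()  # inference mode: weights stay fixed, no gradient updates

prompt = "Explain AI inference in one sentence:"
inputs = tokenizer(prompt, return_tensors="pt")

streamer = TextStreamer(tokenizer, skip_prompt=True)  # print tokens as they decode
with torch.no_grad():  # no backward pass: this is a forward-only workload
    model.generate(**inputs, max_new_tokens=40, streamer=streamer)
```

The key point is the absence of any optimizer step: the weights loaded from the checkpoint are used as-is to turn the tokenized prompt into streamed output tokens.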

Examples & Analogies

If training is teaching a chef, inference is dinner service. A classifier labeling a new image, a recommender ranking products, a speech model transcribing audio, and an LLM generating an answer are all inference workloads.

At a Glance

Dimension      | Training                          | Inference
Goal           | Learn model weights               | Produce outputs for new inputs
Main cost      | Data, training time, accelerators | Latency, throughput, memory, requests
Weight updates | Usually yes                       | Usually no
Common metrics | Loss, accuracy, validation score  | TTFT, tokens/s, p95 latency, cost/request
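As a rough illustration of the inference-side metrics in the table, the sketch below computes TTFT, decode tokens per second, and a crude p95 latency from per-request timestamps; the log format is made up for the example:

```python
# Computing common inference metrics from hypothetical request logs.
import statistics

requests = [
    {"start": 0.00, "first_token": 0.18, "end": 1.40, "output_tokens": 64},
    {"start": 0.00, "first_token": 0.25, "end": 2.10, "output_tokens": 96},
    {"start": 0.00, "first_token": 0.21, "end": 1.75, "output_tokens": 80},
]

ttft = [r["first_token"] - r["start"] for r in requests]          # time to first token
decode_tps = [r["output_tokens"] / (r["end"] - r["first_token"])  # decode tokens/s
              for r in requests]
latency = sorted(r["end"] - r["start"] for r in requests)
p95 = latency[max(0, int(0.95 * len(latency)) - 1)]               # crude p95 on a tiny sample

print(f"mean TTFT: {statistics.mean(ttft):.2f}s")
print(f"mean decode speed: {statistics.mean(decode_tps):.1f} tok/s")
print(f"p95 latency: {p95:.2f}s")
```

In production these numbers usually come from a metrics system rather than in-process lists, but the arithmetic is the same.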

Where and Why It Matters

Inference is where AI products repeatedly spend money. As traffic grows, the same model must run again and again for users. That makes GPU memory, batching, KV cache, quantization, autoscaling, and serving runtimes central to product economics.
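To see why GPU memory in particular becomes a product-economics question, a back-of-the-envelope KV-cache estimate helps; the model shape below (32 layers, 8 KV heads, head dimension 128, fp16) is an assumed example, not a specific model:

```python
# Back-of-the-envelope KV-cache sizing for one request.
# Layer count, KV heads, head dim, and precision are illustrative assumptions.
layers, kv_heads, head_dim = 32, 8, 128
bytes_per_value = 2          # fp16
context_tokens = 8192

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # K and V
kv_bytes_total = kv_bytes_per_token * context_tokens

print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")          # 128 KiB
print(f"KV cache at {context_tokens} tokens: {kv_bytes_total / 1024**3:.2f} GiB")  # 1.00 GiB
```

At these assumed numbers, a hundred concurrent long-context requests would occupy on the order of 100 GiB of accelerator memory before the weights themselves are counted, which is why batching, cache management, and quantization show up in serving discussions.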

Common Misconceptions

  • Myth: Inference is just one forward pass.
  • Reality: Production inference includes batching, caching, streaming, routing, fallbacks, and monitoring.
  • Myth: A larger model is always the better inference choice.
  • Reality: Latency and cost constraints can make smaller, distilled, or quantized models better.
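The second point can be made concrete with a simple cost comparison; all throughput and price numbers below are placeholders to show the arithmetic, not measurements:

```python
# Hypothetical cost-per-million-output-tokens comparison for two deployments.
# GPU prices and throughput figures are illustrative placeholders, not benchmarks.
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

big_model = cost_per_million_tokens(gpu_hourly_usd=4.00, tokens_per_second=300)
small_model = cost_per_million_tokens(gpu_hourly_usd=1.50, tokens_per_second=900)

print(f"large model: ${big_model:.2f} per 1M tokens")   # $3.70
print(f"small model: ${small_model:.2f} per 1M tokens") # $0.46
```

Whether the smaller model is acceptable then becomes a quality question, which is exactly the trade-off the "bigger is always better" myth glosses over.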

How It Sounds in Conversation

  • "The model quality is fine, but p95 latency is above the product target."
  • "Prefill looks acceptable; decode tokens per second is the bottleneck."
  • "At scale, inference cost repeats on every user request."
