Vol.01 · No.10 CS · AI · Infra May 30, 2026

AI Glossary

LLM & Generative AI · Infra & Hardware · Deep Learning

Inference

Plain Explanation

Inference is the stage where a trained model applies what it has learned to a new input and produces an output. Classifying a cat photo, answering a user question, and predicting sentiment are all inference. For LLMs, the prompt is tokenized, the model processes the context, and output tokens are generated one at a time, each conditioned on the prompt and the tokens produced so far.
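The token-by-token loop can be sketched in a few lines. This is a toy illustration only: the vocabulary, the whitespace tokenizer, and the lookup-table "model" are all hypothetical stand-ins for a learned tokenizer and a neural network.

```python
# Toy autoregressive decoding loop. Everything here is hypothetical:
# a real LLM uses a learned subword tokenizer and a neural network.
VOCAB = ["<eos>", "the", "cat", "sat", "down"]
TOKEN_ID = {tok: i for i, tok in enumerate(VOCAB)}

# Stand-in "model": maps the last token id to the next token id.
NEXT_TOKEN = {1: 2, 2: 3, 3: 4, 4: 0}  # the -> cat -> sat -> down -> <eos>


def tokenize(prompt: str) -> list[int]:
    """Split on whitespace and map each word to its token id."""
    return [TOKEN_ID[word] for word in prompt.split()]


def predict_next(context: list[int]) -> int:
    """Return the next token id given the context so far."""
    return NEXT_TOKEN.get(context[-1], TOKEN_ID["<eos>"])


def generate(prompt: str, max_new_tokens: int = 8) -> str:
    context = tokenize(prompt)
    output = []
    for _ in range(max_new_tokens):
        next_id = predict_next(context)
        if next_id == TOKEN_ID["<eos>"]:
            break  # the model signalled end of sequence
        context.append(next_id)  # the new token joins the context
        output.append(VOCAB[next_id])
    return " ".join(output)


print(generate("the cat"))  # sat down
```

The point of the sketch is the loop shape: each step reads the whole context, emits one token, and feeds it back in, which is why generation time grows with output length.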

Examples & Analogies

  • Exam room: training is studying; inference is solving the problem in front of you.
  • Support chatbot: the model receives a customer question and generates a response.
  • Image classification: the model receives an image and returns a label plus confidence.
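For the image classification case, a "label plus confidence" is commonly derived by applying softmax to the model's raw scores (logits) and taking the largest probability. The sketch below assumes the logits are already computed; the label names and numbers are made up, and a softmax probability is not necessarily a well-calibrated confidence.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, labels):
    """Return the most likely label and its softmax probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Hypothetical logits for one image over three classes.
label, confidence = classify([2.0, 0.5, 0.1], ["cat", "dog", "bird"])
print(label, round(confidence, 3))
```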

At a Glance

Stage         | What happens                   | Main metrics
Training      | Learn model parameters         | loss, eval score, training cost
Fine-tuning   | Adapt to a task or domain      | validation score, overfitting
Inference     | Compute outputs for new inputs | latency, throughput, accuracy
Model Serving | Operate inference as a service | uptime, scaling, cost

Where and Why It Matters

Most of what users experience in an AI product is inference. Even a well-trained model makes for a poor product if inference is slow, expensive, or unreliable. In LLM systems, teams track time to first token (TTFT), generated tokens per second, concurrent requests, GPU memory use, and timeouts.
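Two of these metrics, time to first token and generated tokens per second, can be computed from per-token arrival timestamps. The following is a minimal sketch under that assumption; the `RequestTrace` type and the sample timestamps are hypothetical, and real serving stacks usually expose these numbers through their own metrics endpoints.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Hypothetical record of one streamed request (times in seconds)."""
    sent_at: float
    token_times: list[float]  # arrival time of each generated token


def time_to_first_token(trace: RequestTrace) -> float:
    """Delay between sending the request and receiving the first token."""
    return trace.token_times[0] - trace.sent_at


def decode_tokens_per_second(trace: RequestTrace) -> float:
    """Generation rate after the first token has arrived."""
    if len(trace.token_times) < 2:
        return 0.0
    span = trace.token_times[-1] - trace.token_times[0]
    if span <= 0:
        return 0.0
    return (len(trace.token_times) - 1) / span


trace = RequestTrace(sent_at=0.0, token_times=[0.5, 0.75, 1.0, 1.25, 1.5])
print(time_to_first_token(trace))       # 0.5
print(decode_tokens_per_second(trace))  # 4.0
```

Measuring the decode rate from the first token onward keeps prompt processing time out of the throughput figure, so the two numbers can be tuned separately.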

Common Misconceptions

  • “Inference is just one model call” → tokenizer, batching, cache, scheduler, and postprocessing are part of the runtime path.
  • “Good training results guarantee good inference” → serving setup, context length, quantization, and batch size affect quality and cost.
  • “Only latency matters” → throughput, cost, reliability, and output quality also matter.
  • “A bigger GPU is always faster” → memory bandwidth, batch shape, KV cache, and framework optimizations matter too.
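The memory bandwidth point can be made concrete with a back-of-the-envelope estimate. At batch size 1, each decode step of a dense model has to read the weights from GPU memory at least once, so weight size divided by memory bandwidth gives a rough lower bound on per-token time. The sketch below assumes that simplification; the model size and the two accelerators are hypothetical, and compute, KV-cache reads, and kernel overheads are ignored.

```python
def min_decode_ms_per_token(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Rough lower bound: time to stream all weights once, in milliseconds."""
    return weight_bytes / bandwidth_bytes_per_s * 1000.0


WEIGHT_BYTES = 14e9  # hypothetical: 7B parameters at 2 bytes each

# Hypothetical accelerators described only by memory bandwidth (bytes/second).
gpus = {"gpu_a": 1.0e12, "gpu_b": 2.0e12}

for name, bandwidth in gpus.items():
    ms = min_decode_ms_per_token(WEIGHT_BYTES, bandwidth)
    print(f"{name}: at least {ms:.1f} ms per token")
```

Under these assumptions, doubling bandwidth halves the bound regardless of how much extra compute the larger device has, which is why raw FLOPS alone does not predict decode speed.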

How It Sounds in Conversation

  • “Accuracy is fine; P95 inference latency is the bottleneck.”
  • “Decode time grows with output length, so we need an output token budget.”
  • “Larger batches improve throughput but can hurt per-request latency.”
  • “KV-cache memory is limiting concurrent requests.”
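To see why KV-cache memory caps concurrency, it helps to estimate its size. For a standard transformer, each token stores one key and one value vector per layer and per KV head. The sketch below uses that formula with a hypothetical model configuration and memory budget; it ignores weights, activations, fragmentation, and paged allocation schemes.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_value: int) -> int:
    """Bytes of KV cache needed for one token of context."""
    # Factor of 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_concurrent_requests(kv_budget_bytes: int, tokens_per_request: int,
                            per_token_bytes: int) -> int:
    """How many full-length requests fit in the KV-cache budget."""
    return kv_budget_bytes // (tokens_per_request * per_token_bytes)


# Hypothetical config: 32 layers, 8 KV heads, head size 128, 2-byte values.
per_token = kv_bytes_per_token(32, 8, 128, 2)
print(per_token // 1024, "KiB per token")  # 128 KiB per token

# Hypothetical budget: 16 GiB for the cache, 4096 tokens per request.
print(max_concurrent_requests(16 * 1024**3, 4096, per_token))  # 32
```

Halving the context length or the bytes per value roughly doubles the number of requests that fit, which is the trade-off behind context limits and KV-cache quantization.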

References

  • Docs
    torch.inference_mode

    Official docs explaining inference runtime mode, gradient tracking, and the distinction from model.eval().

  • Docs
    Text Generation Inference

    Official docs for LLM inference serving, streaming, batching, tensor parallelism, and metrics.

  • Docs
    OpenAI-Compatible Server

    Explains vLLM HTTP inference serving and OpenAI-compatible API server workflows.

  • Docs
    Inference Protocols and APIs

    Official docs for inference requests, input/output tensors, HTTP/gRPC endpoints, and async responses.

  • Docs
    KServe

    Documents Kubernetes-based scalable inference, autoscaling, and GPU serving context.
