Infra & Hardware
GPU, cloud, MLOps, deployment, optimization
28 terms
AI Inference
AI 추론
AI inference is the runtime phase in which a trained model with fixed weights processes new inputs to produce prediction…
Batch Inference
배치 추론
Batch inference is an offline prediction method that aggregates a large, fixed set of inputs and generates outputs in bu…
Bedrock
베드록
Amazon Bedrock is a fully managed AWS service that provides secure, enterprise-grade access to multiple foundation model…
CUDA
CUDA
CUDA is NVIDIA's programming model and runtime platform for running massively parallel computation on NVIDIA GPUs.
cuDNN
cuDNN
cuDNN is a GPU-accelerated library from NVIDIA that provides highly optimized implementations of core deep learning oper…
edge deployment
에지 배포
Edge deployment means running AI models or apps close to where data is created — for example on factory lines, inside re…
GPU
그래픽 처리 장치
A GPU is an accelerator that executes uniform, matrix-heavy computations at high throughput via massive parallel threads…
GPU cluster
GPU 클러스터
A GPU cluster is a system where multiple GPUs (Graphics Processing Units) are networked together to function as a single…
Inference
추론
Inference is the execution phase where a trained model receives new inputs and computes predictions, classifications, or…
inference cost
추론 비용
Inference cost is the operational compute-and-infrastructure expense incurred each time a deployed LLM tokenizes a promp…
inference latency
추론 지연 시간
Inference latency is the time an AI model takes to process an input and return an output. It typically ref…
Inference Scaling
추론 스케일링
Inference scaling is the set of techniques that expand and adapt model serving capacity to meet latency and cost objecti…
KV Cache
KV 캐시
A KV cache is the inference-time memory structure that stores previously computed attention key/value tensors in an auto…
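As a rough illustration of the idea, the sketch below keeps the keys and values of earlier positions in a small cache so that each decoding step only adds one new key/value pair. It is a toy single-head example in plain Python; `KVCache`, `attend`, and `decode_step` are hypothetical names, and the query, key, and value vectors are assumed to be already projected.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for one query over all given positions."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


class KVCache:
    """Hypothetical container holding the keys/values of already-processed tokens."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)


def decode_step(cache, query, key, value):
    """Store the new token's key/value, then attend over everything cached so far."""
    cache.append(key, value)
    return attend(query, cache.keys, cache.values)


# Toy usage: each token's vector doubles as its query, key, and value (an assumption).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = KVCache()
outputs = [decode_step(cache, t, t, t) for t in tokens]
```

Each step reuses the cached entries instead of recomputing keys and values for the whole prefix, which is the saving a KV cache is meant to provide; the cost is memory that grows with sequence length.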
KV Offloading
KV 오프로딩
KV offloading is an inference technique that tiers the self-attention Key/Value cache from GPU memory to CPU RAM or stor…
Model parallelism
모델 병렬 처리
Model parallelism is a distributed technique that enables training or serving neural networks too large for a single GPU…
Model Router
모델 라우터
A model router is an orchestration layer that selects which model should handle a request based on difficulty, modality,…
Model Serving
모델 서빙
Model serving is the operational system that deploys a trained model behind APIs, batch jobs, or streaming endpoints and…
NVIDIA
엔비디아
NVIDIA provides an end-to-end AI software stack—NVIDIA AI Enterprise—spanning deployment microservices (NIM) and develop…
Observability
관측 가능성
Observability is the operational capability to infer a distributed system’s internal state by collecting traces, metrics…
on-device AI
온디바이스 AI
On-device AI means running artificial intelligence directly on your own device—like a phone, laptop, or tablet—instead o…
OpenTelemetry
오픈텔레메트리
OpenTelemetry is a vendor-neutral open-source observability framework that instruments applications via language-specifi…
PagedAttention
페이지드 어텐션
PagedAttention is an LLM-serving memory technique that partitions the attention key–value (KV) cache into fixed-size pag…
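The page bookkeeping can be sketched as follows. This is a minimal illustration of mapping each sequence's tokens onto fixed-size pages through a per-sequence page table; it does not implement the attention kernel itself, and `PagedKVAllocator` and its methods are hypothetical names, not the API of any serving engine.

```python
class PagedKVAllocator:
    """Toy allocator: maps each sequence's token slots onto fixed-size pages."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))  # physical page ids not in use
        self.page_tables = {}  # sequence id -> list of physical page ids
        self.lengths = {}      # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; returns (physical page, offset)."""
        table = self.page_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.page_size:
            # Every page owned by this sequence is full: take a new one on demand.
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            table.append(self.free_pages.pop())
        self.lengths[seq_id] = length + 1
        return table[length // self.page_size], length % self.page_size

    def free(self, seq_id):
        """Return all pages of a finished sequence to the free pool."""
        self.free_pages.extend(self.page_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


# Toy usage: 4 pages of 2 token slots each; sequence "a" stores 3 tokens.
alloc = PagedKVAllocator(num_pages=4, page_size=2)
slots = [alloc.append_token("a") for _ in range(3)]
```

Because pages are handed out only when a sequence actually needs them, at most one partially filled page per sequence is wasted in this sketch.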
Quantization
양자화
Quantization is a neural‑network compression‑and‑acceleration technique that represents weights and activations with low…
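One simple instance of this idea is symmetric per-tensor int8 quantization, sketched below in plain Python. The scheme and the function names `quantize_int8` and `dequantize` are assumptions chosen for illustration; production toolchains use many other variants (per-channel scales, zero points, lower bit widths).

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # one float shared by the tensor
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers and the shared scale."""
    return [q * scale for q in quantized]


# Toy usage with made-up weights.
weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

The restored values differ from the originals by at most half a quantization step (`scale / 2`), which is the accuracy traded for the smaller integer representation.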
real-time inference
실시간 추론
Real-time inference is a serving paradigm that exposes a trained model as an API to execute and respond immediately upon…
SLM
소형 언어 모델
A Small Language Model (SLM) is a language model that performs natural-language understanding or generation with a small…
Speculative Decoding
추측적 디코딩
Speculative decoding is an inference acceleration method where a smaller drafter proposes multiple candidate tokens and …
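A minimal sketch of one draft-and-verify round is shown below, assuming greedy decoding and toy deterministic models. `speculative_step`, `toy_target`, and `toy_draft` are hypothetical names; the target is called once per position here for clarity, whereas real systems typically score all drafted positions in a single forward pass and may use probabilistic acceptance rather than exact matching.

```python
def speculative_step(target_next, draft_next, prefix, k):
    """One round: the drafter proposes k tokens, the target keeps the agreeing prefix."""
    # 1) The drafter proposes k tokens autoregressively.
    proposal = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2) The target checks each drafted token against its own greedy choice.
    accepted = []
    ctx = list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # first mismatch: use the target's token, stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3) All drafts accepted: the target contributes one extra token.
    accepted.append(target_next(ctx))
    return accepted


def toy_target(ctx):
    """Toy 'large' model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx):
    """Toy 'small' model: agrees with the target except right after a 3."""
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


tokens = speculative_step(toy_target, toy_draft, prefix=[0], k=4)
```

Under greedy decoding the accepted tokens are exactly what the target alone would have produced, so the drafter affects speed but not the output.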
Trainium
트레이늄
Trainium is a custom AI training chip from Amazon Web Services (AWS), designed to train large deep learning models, especially …
vLLM
vLLM
vLLM is an open-source LLM serving engine that boosts throughput by managing the attention KV cache with PagedAttention—…