
Infra & Hardware

GPU, cloud, MLOps, deployment, optimization

28 terms

Infra & Hardware, LLM & Generative AI
AI Inference
AI 추론
AI inference is the runtime phase in which a trained model with fixed weights processes new inputs to produce prediction…
Infra & Hardware, LLM & Generative AI, Data Engineering
Batch Inference
배치 추론
Batch inference is an offline prediction method that aggregates a large, fixed set of inputs and generates outputs in bu…
Products & Platforms, Infra & Hardware, LLM & Generative AI
Bedrock
베드록
Amazon Bedrock is a fully managed AWS service that provides secure, enterprise-grade access to multiple foundation model…
Infra & Hardware, Deep Learning
CUDA
CUDA
CUDA is NVIDIA's programming model and runtime platform for running massively parallel computation on NVIDIA GPUs.
Infra & Hardware, Products & Platforms
cuDNN
cuDNN
cuDNN is a GPU-accelerated library from NVIDIA that provides highly optimized implementations of core deep learning oper…
Infra & Hardware, LLM & Generative AI
edge deployment
에지 배포
Edge deployment means running AI models or apps close to where data is created — for example on factory lines, inside re…
Infra & Hardware, Deep Learning, ML Fundamentals
GPU
그래픽 처리 장치
A GPU is an accelerator that executes uniform, matrix-heavy computations at high throughput via massive parallel threads…
Infra & Hardware
GPU cluster
GPU 클러스터
A GPU cluster is a system where multiple GPUs (Graphics Processing Units) are networked together to function as a single…
LLM & Generative AI, Infra & Hardware, Deep Learning
Inference
추론
Inference is the execution phase where a trained model receives new inputs and computes predictions, classifications, or…
LLM & Generative AI, Infra & Hardware, Products & Platforms
inference cost
추론 비용
Inference cost is the operational compute-and-infrastructure expense incurred each time a deployed LLM tokenizes a promp…
Infra & Hardware, LLM & Generative AI
inference latency
추론 지연 시간
Inference latency is the time an AI model takes to process an input and return an output. It typically ref…
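As a rough illustration of measuring the time from input to output, the sketch below times repeated calls to a model function and reports a few latency statistics. The names `fake_model` and `measure_latency_ms` are hypothetical stand-ins, not part of any library; a real measurement would call an actual model or serving endpoint and would likely also discard warm-up runs.

```python
import statistics
import time


def fake_model(n):
    # Hypothetical stand-in for a real model call (assumption for illustration).
    return sum(i * i for i in range(n))


def measure_latency_ms(fn, arg, runs=50):
    """Call fn(arg) `runs` times and summarize wall-clock latency in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(arg)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Simple nearest-rank style index for the 95th percentile.
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "p50_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "max_ms": samples[-1],
    }


stats = measure_latency_ms(fake_model, 10_000)
print(stats)
```

Reporting percentiles rather than a single average is a common choice, since a few slow requests can dominate the user experience.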
Infra & Hardware
Inference Scaling
추론 스케일링
Inference scaling is the set of techniques that expand and adapt model serving capacity to meet latency and cost objecti…
Infra & Hardware, LLM & Generative AI
KV Cache
KV 캐시
A KV cache is the inference-time memory structure that stores previously computed attention key/value tensors in an auto…
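To make the idea of storing previously computed key/value tensors concrete, here is a minimal single-head sketch in plain Python. Everything in it is an illustrative assumption: `project` is a hypothetical stand-in for learned key/value projections, and `KVCache` is a toy class, not the data structure of any particular serving engine. The point is only that each step appends one new key/value pair and reuses the rest, while producing the same outputs as recomputing everything.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention of one query over all stored keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(a * b for a, b in zip(query, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def project(x, factor):
    # Hypothetical stand-in for a learned K or V projection (assumption).
    return [factor * xi for xi in x]


class KVCache:
    """Toy cache: keeps the keys and values of every token seen so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# With a cache: one new K/V projection per step, earlier ones are reused.
cache = KVCache()
cached_outputs = []
for x in tokens:
    cache.append(project(x, 0.5), project(x, 2.0))
    cached_outputs.append(attend(x, cache.keys, cache.values))

# Without a cache: every step re-projects the entire prefix.
recomputed = []
for t in range(len(tokens)):
    keys = [project(x, 0.5) for x in tokens[: t + 1]]
    values = [project(x, 2.0) for x in tokens[: t + 1]]
    recomputed.append(attend(tokens[t], keys, values))

print(cached_outputs == recomputed)
```

The trade-off this sketch hints at is memory for compute: the cache grows by one key/value pair per generated token.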
Infra & Hardware, LLM & Generative AI
KV Offloading
KV 오프로딩
KV offloading is an inference technique that tiers the self-attention Key/Value cache from GPU memory to CPU RAM or stor…
Infra & Hardware, LLM & Generative AI
Model parallelism
모델 병렬 처리
Model parallelism is a distributed technique that enables training or serving neural networks too large for a single GPU…
LLM & Generative AI, Infra & Hardware
Model Router
모델 라우터
A model router is an orchestration layer that selects which model should handle a request based on difficulty, modality,…
Infra & Hardware, LLM & Generative AI, Products & Platforms
Model Serving
모델 서빙
Model serving is the operational system that deploys a trained model behind APIs, batch jobs, or streaming endpoints and…
Products & Platforms, LLM & Generative AI, Infra & Hardware
NVIDIA
엔비디아
NVIDIA provides an end-to-end AI software stack—NVIDIA AI Enterprise—spanning deployment microservices (NIM) and develop…
Infra & Hardware
Observability
관측 가능성
Observability is the operational capability to infer a distributed system’s internal state by collecting traces, metrics…
Infra & Hardware, LLM & Generative AI
on-device AI
온디바이스 AI
On-device AI means running artificial intelligence directly on your own device—like a phone, laptop, or tablet—instead o…
Infra & Hardware
OpenTelemetry
오픈텔레메트리
OpenTelemetry is a vendor-neutral open-source observability framework that instruments applications via language-specifi…
Infra & Hardware, LLM & Generative AI
PagedAttention
페이지드 어텐션
PagedAttention is an LLM-serving memory technique that partitions the attention key–value (KV) cache into fixed-size pag…
Infra & Hardware, Deep Learning
Quantization
양자화
Quantization is a neural‑network compression‑and‑acceleration technique that represents weights and activations with low…
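The entry above is truncated, so the specific scheme shown here is an assumption chosen for illustration: symmetric, per-tensor 8-bit integer quantization of a small list of weights. The helper names `quantize_int8` and `dequantize` are hypothetical. The sketch maps floats to integers in [-127, 127] with a single scale factor, then maps them back to show the rounding error that quantization introduces.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to integers in [-127, 127] (assumed scheme)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate floating-point values."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The worst-case error per value is about half of one quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Production toolchains typically use finer-grained scales (per channel or per group) and calibration data; this sketch only shows the basic round-and-rescale idea.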
Infra & Hardware, LLM & Generative AI
real-time inference
실시간 추론
Real-time inference is a serving paradigm that exposes a trained model as an API to execute and respond immediately upon…
LLM & Generative AI, Infra & Hardware
SLM
소형 언어 모델
A Small Language Model (SLM) is a language model that performs natural-language understanding or generation with a small…
LLM & Generative AI, Infra & Hardware
Speculative Decoding
추측적 디코딩
Speculative decoding is an inference acceleration method where a smaller drafter proposes multiple candidate tokens and …
Infra & Hardware
Trainium
트레이늄
Trainium is a custom AI training chip from Amazon Web Services (AWS), designed to train large deep learning models, especially …
Infra & Hardware, LLM & Generative AI
vLLM
vLLM
vLLM is an open-source LLM serving engine that boosts throughput by managing the attention KV cache with PagedAttention—…