Infra & Hardware

GPU, cloud, MLOps, deployment, optimization

28 terms

Infra & Hardware LLM & Generative AI

AI inference is the runtime phase in which a trained model with fixed weights processes new inputs to produce prediction…

Infra & Hardware LLM & Generative AI Data Engineering

Batch Inference

Batch inference is an offline prediction method that aggregates a large, fixed set of inputs and generates outputs in bu…

Products & Platforms Infra & Hardware LLM & Generative AI

Amazon Bedrock is a fully managed AWS service that provides secure, enterprise-grade access to multiple foundation model…

Infra & Hardware Deep Learning

CUDA is NVIDIA's programming model and runtime platform for running massively parallel computation on NVIDIA GPUs.

Infra & Hardware Products & Platforms

cuDNN is a GPU-accelerated library from NVIDIA that provides highly optimized implementations of core deep learning oper…

Infra & Hardware LLM & Generative AI

Edge deployment

Edge deployment means running AI models or apps close to where data is created — for example on factory lines, inside re…

Infra & Hardware Deep Learning ML Fundamentals

Graphics processing unit (GPU)

A GPU is an accelerator that executes uniform, matrix-heavy computations at high throughput via massive parallel threads…

Infra & Hardware

GPU cluster

A GPU cluster is a system where multiple GPUs (Graphics Processing Units) are networked together to function as a single…

LLM & Generative AI Infra & Hardware Deep Learning

Inference is the execution phase where a trained model receives new inputs and computes predictions, classifications, or…

LLM & Generative AI Infra & Hardware Products & Platforms

Inference cost is the operational compute-and-infrastructure expense incurred each time a deployed LLM tokenizes a promp…

Infra & Hardware LLM & Generative AI

Inference latency

Inference latency is the actual time it takes for an AI model to process an input and return an output. It typically ref…

Infra & Hardware

Inference Scaling

Inference scaling is the set of techniques that expand and adapt model serving capacity to meet latency and cost objecti…

Infra & Hardware LLM & Generative AI

A KV cache is the inference-time memory structure that stores previously computed attention key/value tensors in an auto…

Infra & Hardware LLM & Generative AI

KV offloading

KV offloading is an inference technique that tiers the self-attention Key/Value cache from GPU memory to CPU RAM or stor…

Infra & Hardware LLM & Generative AI

Model parallelism

Model parallelism is a distributed technique that enables training or serving neural networks too large for a single GPU…

LLM & Generative AI Infra & Hardware

Model router

A model router is an orchestration layer that selects which model should handle a request based on difficulty, modality,…

Infra & Hardware LLM & Generative AI Products & Platforms

Model serving is the operational system that deploys a trained model behind APIs, batch jobs, or streaming endpoints and…

Products & Platforms LLM & Generative AI Infra & Hardware

NVIDIA provides an end-to-end AI software stack—NVIDIA AI Enterprise—spanning deployment microservices (NIM) and develop…

Infra & Hardware

Observability

Observability is the operational capability to infer a distributed system’s internal state by collecting traces, metrics…

Infra & Hardware LLM & Generative AI

On-device AI

On-device AI means running artificial intelligence directly on your own device—like a phone, laptop, or tablet—instead o…

Infra & Hardware

OpenTelemetry

OpenTelemetry is a vendor-neutral open-source observability framework that instruments applications via language-specifi…

Infra & Hardware LLM & Generative AI

PagedAttention

PagedAttention is an LLM-serving memory technique that partitions the attention key–value (KV) cache into fixed-size pag…

Infra & Hardware Deep Learning

Quantization is a neural-network compression-and-acceleration technique that represents weights and activations with low…

Infra & Hardware LLM & Generative AI

Real-time inference

Real-time inference is a serving paradigm that exposes a trained model as an API to execute and respond immediately upon…

LLM & Generative AI Infra & Hardware

Small language model (SLM)

A Small Language Model (SLM) is a language model that performs natural-language understanding or generation with a small…

LLM & Generative AI Infra & Hardware

Speculative Decoding

Speculative decoding is an inference acceleration method where a smaller drafter proposes multiple candidate tokens and …

Infra & Hardware

Trainium is a custom AI training chip from Amazon Web Services (AWS), designed to train large deep learning models, especially …

Infra & Hardware LLM & Generative AI

vLLM is an open-source LLM serving engine that boosts throughput by managing the attention KV cache with PagedAttention—…