Vol.01 · No.10 CS · AI · Infra April 11, 2026

AI Glossary

Infra & Hardware · Deep Learning

CUDA


Plain Explanation

Traditional CPU programs execute tasks largely one after another, which becomes a bottleneck when an application must crunch millions of similar math operations. CUDA addresses this by letting developers offload those parallel-friendly tasks to a GPU, where thousands of lightweight threads run simultaneously. For workloads that fit the parallel pattern, that turns long wait times into much shorter ones.

Think of a busy kitchen: a single cook (CPU) makes dishes one by one, but a large team (GPU) can prepare many dishes at once. CUDA is the kitchen’s system—tickets (work) go into specific lines, prep and cooking can happen in parallel, and the pass stays busy instead of idle. When done right, you’re plating the next order while the previous one is still simmering.

Concretely, CUDA exposes streams that map to device work queues: operations in the same stream execute in order, while independent streams can run concurrently. Asynchronous memory copies use dedicated DMA engines (copy engines), so data transfers can overlap with kernel execution; devices with one copy engine can overlap one transfer with compute, while devices with two copy engines can overlap host-to-device and device-to-host transfers with compute. The default stream serializes device work, so explicit non-default streams are used to unlock concurrency, and actual overlap depends on the device’s copy-engine count and how it partitions copy vs. execute work.
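The stream behavior described above can be sketched in a minimal CUDA program. This is an illustrative example, not production code: the kernel `scale` and the buffer names are assumptions, and pinned host memory (allocated with `cudaMallocHost`) is used because truly asynchronous copies require it.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: multiply each element by a factor.
__global__ void scale(float* d, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

int main() {
    const int N = 1 << 20;
    float *hA, *hB, *dA, *dB;
    cudaMallocHost(&hA, N * sizeof(float));  // pinned host memory:
    cudaMallocHost(&hB, N * sizeof(float));  // needed for true async copies
    cudaMalloc(&dA, N * sizeof(float));
    cudaMalloc(&dB, N * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);   // explicit non-default streams
    cudaStreamCreate(&s1);

    // Each stream runs in order internally; s0 and s1 are independent,
    // so the H2D copy in s1 can overlap the kernel running in s0.
    cudaMemcpyAsync(dA, hA, N * sizeof(float), cudaMemcpyHostToDevice, s0);
    scale<<<(N + 255) / 256, 256, 0, s0>>>(dA, N, 2.0f);
    cudaMemcpyAsync(dB, hB, N * sizeof(float), cudaMemcpyHostToDevice, s1);
    scale<<<(N + 255) / 256, 256, 0, s1>>>(dB, N, 2.0f);

    cudaDeviceSynchronize();  // one sync point at the end, not per step
    cudaStreamDestroy(s0); cudaStreamDestroy(s1);
    cudaFreeHost(hA); cudaFreeHost(hB); cudaFree(dA); cudaFree(dB);
    return 0;
}
```

Whether the copy and the kernel actually overlap depends on the device's copy-engine count, as noted above.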

Examples & Analogies

  • Batch data pipeline with overlap: A team splits a large dataset into tiles and uses three non-default streams—one to copy the next tile to the GPU, one to run a kernel on the current tile, and one to copy results back. On a device with two copy engines, both directions of transfer can overlap with compute, reducing end-to-end stage time.
  • Multi-GPU scale-out job: An application partitions work across several GPUs in one server and then extends to a cluster. The CUDA Toolkit provides the runtime and libraries to drive multi-GPU execution; at larger scales, additional orchestration layers are typically used to coordinate processes and scheduling beyond a single node.
  • Tuning with the APOD cycle: Following Assess → Parallelize → Optimize → Deploy, a team profiles code, ports hot loops into CUDA kernels, then iteratively improves memory access and stream usage. They deploy early speedups and repeat the cycle to uncover the next bottleneck.
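The tiled-pipeline pattern from the first bullet can be sketched as a loop that rotates over a few streams. Names here are assumptions: `process` is a hypothetical kernel, and `hIn`, `hOut` (pinned) and `dBuf` are assumed to be allocated elsewhere.

```cuda
// Rotate tiles across 3 streams: while stream (t % 3) runs the kernel on
// tile t, another stream can upload tile t+1 and a third can download
// tile t-1's results, hardware permitting.
const int TILES = 8, TILE = 1 << 18;
cudaStream_t s[3];
for (int i = 0; i < 3; ++i) cudaStreamCreate(&s[i]);

for (int t = 0; t < TILES; ++t) {
    cudaStream_t st = s[t % 3];
    size_t off = (size_t)t * TILE;
    // In-order within one stream: upload -> kernel -> download.
    cudaMemcpyAsync(dBuf + off, hIn + off, TILE * sizeof(float),
                    cudaMemcpyHostToDevice, st);
    process<<<TILE / 256, 256, 0, st>>>(dBuf + off, TILE);
    cudaMemcpyAsync(hOut + off, dBuf + off, TILE * sizeof(float),
                    cudaMemcpyDeviceToHost, st);
}
cudaDeviceSynchronize();
for (int i = 0; i < 3; ++i) cudaStreamDestroy(s[i]);
```

On a device with two copy engines, both transfer directions can overlap compute; with one copy engine, only one direction overlaps at a time.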

At a Glance

|                   | Default stream (0)               | Explicit non-default streams                      |
|-------------------|----------------------------------|---------------------------------------------------|
| Execution order   | Serializes device work           | In-order per stream; streams can run concurrently |
| Overlap potential | Limited; operations block others | Enables copy/compute overlap if hardware allows   |
| Timing use        | Reliable for device timing       | Requires per-stream events for accuracy           |
| Tuning focus      | Simplicity, correctness          | Concurrency, dependency management                |

Non-default streams unlock concurrency and copy/compute overlap, but only when the device’s copy engines and dependencies permit it.
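The "per-stream events" row can be made concrete with CUDA's event API, which timestamps points in a stream on the device itself. A minimal sketch, assuming a kernel `work` and a stream `s` exist:

```cuda
// Time only the work enqueued in stream s, using device-side events
// rather than host timers (which would include launch/sync overhead).
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, s);               // marker before the kernel, in s
work<<<blocks, threads, 0, s>>>(dBuf, n);
cudaEventRecord(stop, s);                // marker after the kernel, in s
cudaEventSynchronize(stop);              // wait for this stream's work only

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // GPU-side elapsed time in ms
cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Because the events are recorded into the same stream as the kernel, they measure that stream's work without serializing other streams.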

Where and Why It Matters

  • Shift to concurrency-first design: Teams routinely restructure pipelines into multiple streams to hide transfer time behind kernel execution, improving throughput and latency.
  • APOD as standard workflow: The Assess–Parallelize–Optimize–Deploy cycle guides profiling, porting hot spots, and iterative tuning until gains are production-ready.
  • Multi-GPU normalization: Applications increasingly assume more than one GPU and use built-in multi-GPU capabilities, with additional orchestration commonly added for clusters.
  • Operational discipline: Engineers avoid unnecessary CPU–GPU sync points because they stall the pipeline, using device timers and events instead to measure progress.
  • Deployment breadth: The same CUDA-based codebase can target embedded devices, workstations, data centers, and cloud/HPC installs, simplifying rollout across environments.
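The single-node multi-GPU pattern from the list above can be sketched with `cudaSetDevice`, which directs subsequent runtime calls at one GPU. The kernel `step` and the per-device arrays `d[]`, streams `st[]`, and host buffer `h` are assumptions for illustration.

```cuda
// Partition one large job across every GPU in the node, then
// synchronize each device's stream. Cluster-level orchestration
// (processes, scheduling) sits above this, as noted in the text.
int count = 0;
cudaGetDeviceCount(&count);

for (int g = 0; g < count; ++g) {
    cudaSetDevice(g);  // subsequent calls target GPU g
    cudaMemcpyAsync(d[g], h + (size_t)g * chunk, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, st[g]);
    step<<<blocks, threads, 0, st[g]>>>(d[g], chunk);
    cudaMemcpyAsync(h + (size_t)g * chunk, d[g], chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, st[g]);
}
for (int g = 0; g < count; ++g) {
    cudaSetDevice(g);
    cudaStreamSynchronize(st[g]);  // per-device sync, no global barrier
}
```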

Common Misconceptions

  • ❌ Myth: "If I use streams, operations will always run in parallel." → ✅ Reality: Concurrency depends on dependencies and hardware limits like the number of copy engines.
  • ❌ Myth: "The default stream is fine for high throughput." → ✅ Reality: It serializes device work; non-default streams are needed to enable overlap and concurrency.
  • ❌ Myth: "CUDA is only for training AI models." → ✅ Reality: It’s a general GPU programming platform with libraries, tools, and runtime for many accelerated apps.

How It Sounds in Conversation

  • "Profiling shows our H2D memcpy is 12 ms; with two streams we can overlap it with the kernel and cut tail latency by ~30%."
  • "Let’s check asyncEngineCount on this GPU; if it’s 2 we can do H2D and D2H concurrently while the kernel runs."
  • "The default stream is serializing our pipeline—switch to per-stage streams and add events for timing before Friday’s load test."
  • "We’re scaling to 4 GPUs on the node; CUDA gives us the multi-GPU runtime, but we still need a plan for process orchestration in the cluster."
  • "Action items: remove extra host syncs, add device events for timing, and re-run APOD to target the next hotspot by end of sprint."
