GPU
Graphics Processing Unit
A GPU (Graphics Processing Unit) is a processor built with thousands of small cores to execute many operations in parallel. Originally designed for graphics rendering, GPUs now accelerate data-intensive workloads such as AI and machine learning by leveraging parallelism and advanced memory architectures, significantly reducing training and inference time.
Plain Explanation
There was a bottleneck: computers had to crunch huge piles of numbers for tasks like training AI, but doing them one-by-one was too slow. A GPU solves this by doing many small calculations at the same time—like having an army of helpers each handle a tiny piece of a giant spreadsheet.
Why this works: a GPU has thousands of lightweight cores designed for parallel processing. AI workloads (like neural network layers) break down into many similar, independent math operations that can be run concurrently. GPUs also include advanced memory features—shared memory, caches (L1/L2), and memory coalescing—that feed these cores efficiently so they spend more time computing and less time waiting on data.
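The independence described above can be sketched in plain Python (a toy illustration, not GPU code): each output neuron of a layer is its own dot product with no dependency on the others, so all of them can be dispatched concurrently, which is exactly the structure a GPU's many cores exploit.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy layer: each output neuron is an independent dot product, so all of
# them can run at the same time -- this independence is what a GPU's
# thousands of cores exploit. Names here are illustrative.
def dot(weights_row, x):
    return sum(w * v for w, v in zip(weights_row, x))

def layer_forward(weights, x):
    # Each row's dot product has no dependency on the others,
    # so they can be mapped across workers concurrently.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda row: dot(row, x), weights))

W = [[1, 0, 2],
     [0, 1, 1]]
x = [3, 4, 5]
print(layer_forward(W, x))  # [13, 9]
```

A real GPU runs thousands of such operations per clock rather than a handful of threads, but the decomposition is the same.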
Concretely: memory coalescing groups many nearby memory accesses from threads into a single wide transfer, so instead of making dozens of small, slow trips to external DRAM, the GPU issues one large, efficient read. Shared memory lets threads in the same block reuse data stored on-chip, avoiding repeated external memory reads altogether. Together, these features reduce external DRAM transactions and latency, which boosts throughput for AI training and inference.
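A simplified model of coalescing (an illustration with made-up constants, not real hardware behavior): if the memory system services a warp of 32 threads in 128-byte segments, consecutive 4-byte accesses collapse into one wide transaction, while strided accesses touch many segments and need many transactions.

```python
# Toy model of memory coalescing. Assumed constants: 32-thread warps,
# 4-byte elements, 128-byte memory segments (illustrative, not a spec).
SEGMENT_BYTES = 128
ELEM_BYTES = 4

def transactions(addresses):
    # Count the distinct 128-byte segments the warp's accesses touch;
    # each distinct segment costs one memory transaction.
    return len({addr // SEGMENT_BYTES for addr in addresses})

warp = range(32)
coalesced = [tid * ELEM_BYTES for tid in warp]     # thread i reads element i
strided = [tid * 32 * ELEM_BYTES for tid in warp]  # thread i reads element 32*i

print(transactions(coalesced))  # 1
print(transactions(strided))    # 32
```

The same data volume costs one transaction in the coalesced case and 32 in the strided case, which is why access layout matters so much for GPU throughput.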
Example & Analogy
Real-time quality checks in manufacturing
- A factory streams sensor readings and high-resolution images from a production line. With GPUs running parallel image and signal processing, defects are flagged in near real time so the conveyor doesn’t have to slow down. This parallelism turns what would be batch, minutes-later analysis into immediate feedback, cutting scrap and rework.
Satellite imagery tiling for crop health
- Large satellite images are split into thousands of small tiles, each analyzed for vegetation changes. A GPU processes many tiles at once, accelerating per-tile inference so analysts can scan entire regions the same day imagery arrives. In practice, this shifts per-frame or per-tile processing from seconds on a CPU to near real-time on a GPU, enabling timely alerts; exact gains depend on the model and I/O throughput.
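The tiling pattern above can be sketched as follows (function names and the "analysis" metric are placeholders for a real vegetation-index model):

```python
from concurrent.futures import ThreadPoolExecutor

def tiles(image, size):
    # Split a 2D image (list of rows) into size x size tiles.
    h, w = len(image), len(image[0])
    for r in range(0, h, size):
        for c in range(0, w, size):
            yield [row[c:c + size] for row in image[r:r + size]]

def analyze(tile):
    # Placeholder per-tile metric standing in for model inference.
    return sum(sum(row) for row in tile)

image = [[1] * 4 for _ in range(4)]  # tiny 4x4 "image"
with ThreadPoolExecutor() as pool:
    # Process all tiles concurrently, the way a GPU handles many at once.
    scores = list(pool.map(analyze, tiles(image, 2)))
print(scores)  # [4, 4, 4, 4]
```

On a GPU the per-tile work would be a batched inference call rather than a thread pool, but the split-then-process-in-parallel structure is identical.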
Bulk video processing for compliance
- A media archive must blur sensitive information across countless hours of footage. GPUs parallelize frame decoding and filtering so hundreds of frames are processed concurrently. This reduces end-to-end job time dramatically and makes overnight compliance runs feasible.
Rapid model fine-tuning for internal analytics
- A team fine-tunes a recommendation model on a new dataset. GPUs speed up the matrix-heavy training steps, turning multi-day experiments into same-day iterations. Faster feedback loops mean the team can test more hypotheses before a release window.
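One reason batching speeds up the matrix-heavy steps can be shown in miniature: N separate matrix-vector products fold into a single matrix-matrix product over the batch, giving the hardware one large regular workload instead of many tiny ones (pure-Python sketch; a real pipeline would use a GPU library).

```python
def matmul(A, B):
    # Naive matrix multiply: result[i][j] = row i of A dot column j of B.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

W = [[1, 2], [3, 4]]
batch = [[1, 0], [0, 1], [1, 1]]  # three input samples, one per row

# One fused call over the whole batch: stack the samples as columns of B.
out = matmul(W, [list(col) for col in zip(*batch)])
# Column j of `out` equals W applied to sample j individually.
print(out)  # [[1, 2, 3], [3, 4, 7]]
```

This is why batch size matters for utilization: the fused product keeps all cores fed, whereas sample-at-a-time calls leave most of the GPU idle.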
At a Glance
| | CPU | GPU | AI Accelerator |
|---|---|---|---|
| Core design | Few powerful general-purpose cores | Thousands of smaller cores for parallel tasks | Specialized logic for targeted AI ops |
| Best for | Diverse tasks, control flow, low-latency single-thread work | Highly parallel math (e.g., neural nets, graphics) | Narrow AI workloads where specialization wins |
| Programming model | General-purpose languages | Parallel programming (e.g., CUDA on NVIDIA) | Vendor-specific stacks; less general-purpose |
| Performance vs AI | Slower for large-scale training/inference | Leading performance for training and inference | Can outperform GPUs for select models but less flexible |
| Cost/Power | Typically lower power per unit | High performance but power-hungry and costly | Often optimized for efficiency; trade-offs in versatility |
Why It Matters
- If you skip GPUs for parallel-friendly AI tasks, training might take days instead of hours, delaying launches and experiments.
- Without GPU memory features (coalescing, shared memory), your kernels stall on memory, wasting compute and missing real-time goals.
- If you treat a GPU like a CPU (sequential loops, poor batching), you pay for power and hardware but get little speedup.
- Ignoring the GPU software stack means you reinvent basics and risk unstable performance under production load.
Where It's Used
Verified information on this topic is limited.
Role-Specific Insights
- Junior Developer: Structure work for parallelism: use batching and keep memory accesses contiguous so they coalesce. Start with small kernels and profile to find where memory stalls occur.
- PM/Planner: Identify stages that are math-heavy and parallelizable (e.g., model inference, embeddings, video filters). Prioritize GPU resources where they meaningfully shrink cycle time or unlock real-time features.
- Senior Engineer: Tune end-to-end: overlap data transfers with compute, exploit shared memory, and validate with profilers. Revisit instance sizing if power and cost (a trade-off IBM highlights) outweigh the latency benefit.
- Data Scientist: Adjust batch sizes and precision settings to keep the GPU utilized while maintaining accuracy. Measure wall-clock gains, not just iteration speed, to confirm pipeline-level wins.
Precautions
❌ Myth: A GPU speeds up every workload. → ✅ Reality: GPUs excel at parallel workloads (like neural network math). CPU-like, branching-heavy tasks may see little gain.
❌ Myth: More GPU cores always mean faster AI. → ✅ Reality: Memory behavior (coalescing, cache use, shared memory) and data pipelines often limit performance more than raw core count.
❌ Myth: GPUs are always the most efficient choice for AI. → ✅ Reality: Per IBM, high-performance GPUs can be power-hungry and costly; specialized AI accelerators may be preferable depending on priorities.
❌ Myth: Programming a GPU is just like writing CPU code. → ✅ Reality: You must design for parallelism and data locality; otherwise, memory stalls erase theoretical speedups.
Communication
- "The QA pipeline chokes on memory bandwidth. Let’s refactor the kernel so accesses are coalesced and move the hot window into shared memory on the GPU."
- "Batch size 1 kills throughput. If we can group to 32 images, the GPU will stay saturated and we should hit the near real-time SLA."
- "Finance flagged power costs—can we compare the current GPU nodes with a more specialized accelerator for inference-only workloads?"
- "The model trains fine locally, but production stalls. Profiling shows 70% memory wait on the GPU; let’s re-tile the data and pin host memory to fix the transfer bottleneck."
- "We’ll keep the CPU for orchestration and send the matrix-heavy parts to the GPU. That split should cut end-to-end latency without changing the API."
Related Terms
- CPU — Great at diverse, sequential tasks and control logic; falls behind GPUs on massively parallel math used in deep learning.
- AI accelerator — Purpose-built chips for select AI ops; per IBM, can beat GPUs in efficiency for certain priorities but offer less general flexibility.
- CUDA — NVIDIA’s programming model that opened general-purpose computing on GPUs, enabling developers to tap massive parallelism.
- Parallel processing — The core idea behind GPU speedups: break big problems into many small, simultaneous tasks instead of doing them one by one.
- L1/L2 cache & shared memory — On-GPU memories that keep data close to the cores; when used well, they reduce expensive trips to external memory and unlock performance.
- High-Performance Computing (HPC) — Field where GPUs drive huge gains by accelerating scientific and data-intensive workloads beyond what CPUs alone can deliver.
What to Read Next
- Parallel Processing — Understand how problems are split into many tasks that run at once; this is the essence of GPU speedups.
- CUDA — Learn the mainstream programming model that exposes GPU cores and memory features like shared memory and coalescing.
- AI Accelerator — Explore when specialized chips can beat GPUs on efficiency or cost for specific AI workloads.