Observability
Plain Explanation
Modern systems span services, containers, and GPUs, so a CPU or error-rate chart rarely explains why one request was slow. You need to pivot from a symptom (latency) to the exact path that request took across services and queues, without redeploying to add logs. Observability does this by collecting traces, metrics, and logs and stitching them together with a shared trace ID, like a shipment number that follows one package everywhere. With that, you can jump from a slow API trace to the database span and the log lines written alongside it. Concretely, OpenTelemetry instruments services and ships telemetry via OTLP to a Collector, which batches, samples, scrubs PII, and forwards to backends. On Kubernetes, run Collectors as sidecars (or a node agent) feeding a central gateway, so vendor and policy changes happen in one place. The 128-bit trace ID must propagate end to end; if it is dropped at a queue or async boundary, correlation breaks.
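A minimal sketch of that pipeline in Python, assuming the opentelemetry-sdk and OTLP exporter packages are installed and a Collector is listening on the default gRPC port; the service name and endpoint are illustrative:

```python
# Minimal OpenTelemetry setup: spans go over OTLP to a Collector, and the
# 128-bit trace ID is echoed into log lines so logs and traces can be joined.
# Assumes: pip install opentelemetry-sdk opentelemetry-exporter-otlp
import logging

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "checkout-api"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout")

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("checkout")

with tracer.start_as_current_span("charge-card") as span:
    # Put the trace ID on the log line so the two signals correlate later.
    trace_id = format(span.get_span_context().trace_id, "032x")
    log.info("charging card trace_id=%s", trace_id)
```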
Examples & Analogies
- GPU training cluster cost spike: Correlate high-cardinality traces from training tasks with GPU device metrics to find a few long jobs saturating interconnects while others idle.
- Checkout via HTTP + queue: Inject/extract trace context in message headers so the API and worker appear on one trace, exposing a slow worker step (see the propagation sketch after this list).
- Fault injection demo: With multi-signal instrumentation, watch how an injected error propagates and confirm remediation in a single correlated view.
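A sketch of the checkout example above, using OpenTelemetry's real inject/extract propagation API; the queue client and its send/headers shape are hypothetical stand-ins:

```python
# Carrying trace context across a queue hop. inject() writes the W3C traceparent
# into the message headers on the producer side; extract() rebuilds the context
# on the worker side so its spans join the same trace.
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("checkout")

def publish_order(queue, order):
    with tracer.start_as_current_span("publish-order"):
        headers: dict[str, str] = {}
        inject(headers)                            # adds "traceparent" to the carrier
        queue.send(body=order, headers=headers)    # hypothetical queue client

def handle_order(message):
    ctx = extract(message.headers)                 # restore the producer's context
    with tracer.start_as_current_span("process-order", context=ctx):
        ...                                        # worker spans attach to the API trace
```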
At a Glance
| Aspect | Monitoring | Observability |
|---|---|---|
| Questions | Known in advance (CPU, 5xx) | Unanticipated (why this request was slow) |
| Data | Pre-aggregated metrics | High-cardinality events + traces + logs |
| Correlation | Weak/manual | Strong via shared 128-bit trace ID |
| New questions | Often redeploy | No redeploy; slice by attributes |
| Sampling | Head-based common | Tail-based to keep errors/slowness |
Observability emphasizes correlating rich events via a shared trace ID so you can debug unknown-unknowns without code changes; a toy version of the tail-sampling decision from the table is sketched below.
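A toy tail-sampling decision in plain Python. Real deployments express this as Collector tail_sampling policies rather than application code, and the span dictionaries here are assumed shapes, not an SDK type:

```python
import random

# Inspect a *completed* trace, keep everything interesting, sample the routine rest.
def keep_trace(spans, routine_rate=0.05, slow_ms=5000):
    has_error = any(s["status"] == "ERROR" for s in spans)
    duration_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if has_error or duration_ms > slow_ms:
        return True                         # always keep errors and slow traces
    return random.random() < routine_rate   # keep ~5% of routine traffic

spans = [
    {"status": "OK", "start_ms": 0, "end_ms": 120},
    {"status": "OK", "start_ms": 20, "end_ms": 90},
]
print(keep_trace(spans))
```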
Where and Why It Matters
- OpenTelemetry Collector on Kubernetes: Sidecar + gateway centralizes exporters, sampling, and PII scrubbing.
- GPU observability: Unify GPU, container, and application context to expose inefficiencies and prevent failures.
- Async and queue-heavy systems: Inject/extract trace context to keep end-to-end visibility across producers and workers.
- Practice shift: Instrument once with vendor-neutral OTel, keep the trace ID in logs/exemplars, manage sampling centrally.
- Deployment gate: If trace ID propagation is broken, block promotion until fixed (a minimal check is sketched below).
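One way to make that gate concrete: send a request carrying a known W3C traceparent and assert the downstream hop kept the trace ID. The /debug/last-trace echo endpoint and API_URL are hypothetical test scaffolding; only the traceparent header format is standard:

```python
# Promotion gate sketch: issue a request with a known trace ID and confirm the
# worker saw the same one. Requires the requests package.
import os
import secrets
import requests

API = os.environ.get("API_URL", "http://localhost:8080")   # assumed test target

def check_propagation():
    trace_id = secrets.token_hex(16)    # 128-bit trace ID as 32 hex chars
    span_id = secrets.token_hex(8)      # 64-bit parent span ID
    traceparent = f"00-{trace_id}-{span_id}-01"

    requests.post(f"{API}/checkout", headers={"traceparent": traceparent}, timeout=5)
    seen = requests.get(f"{API}/debug/last-trace", timeout=5).json()["trace_id"]

    assert seen == trace_id, f"trace ID dropped across the queue hop: {seen!r}"

if __name__ == "__main__":
    check_propagation()
    print("trace context propagated end to end")
```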
Common Misconceptions
- ❌ Observability is just nicer dashboards → ✅ It’s correlated traces, metrics, and logs to answer new questions without redeploying.
- ❌ You must trace 100% of traffic → ✅ Tail-based sampling keeps errors/slow paths and samples routine requests.
- ❌ It’s fine to put user IDs on metric labels → ✅ High-cardinality IDs belong on spans/logs, not metric labels (see the sketch below).
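A sketch of that split using the OpenTelemetry Python API; no exporter is configured, so it runs as a no-op, and the metric and attribute names are illustrative:

```python
# Keep metric labels low-cardinality; put per-user / per-request detail on spans.
from opentelemetry import metrics, trace

meter = metrics.get_meter("checkout")
tracer = trace.get_tracer("checkout")

checkout_counter = meter.create_counter("checkout.requests")

def record_checkout(user_id: str, route: str):
    # Metric labels: a small, bounded set of values (route, outcome), never user IDs.
    checkout_counter.add(1, {"http.route": route, "outcome": "ok"})

    # Span attributes: high-cardinality detail lives here, queryable per trace.
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("user.id", user_id)
        span.set_attribute("http.route", route)

record_checkout("user-8472", "/checkout")
```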
How It Sounds in Conversation
- "Run the Collector as a gateway so we can change vendors and sampling without touching app pods."
- "Carry traceparent through the SQS/Kafka hop or the worker spans won’t attach to the API trace."
- "Enable tail-based sampling to keep 100% of ERROR and >5s traces, and 1–5% of the rest."
- "GPU utilization looks high, but traces show the tokenizer is the bottleneck before kernels launch."
- "Set OTEL_EXPORTER_OTLP_ENDPOINT on each service; add container stats in local runs."
Related Reading
- Observability primer: Official OpenTelemetry primer explaining logs, metrics, traces, and trace/span correlation.
- CNCF Observability Whitepaper: CNCF TAG Observability overview of observability concepts, signals, and cloud-native operations.
- Architecture Requirements | OpenTelemetry Community Demo: Official demo architecture covering Collector topology and telemetry routing.
- GPU Monitoring Reference Architecture: Reference architecture for GPU observability across AI/ML and HPC workloads.
- What Is Observability? Fundamentals & Architecture: High-level fundamentals and use cases for observability.