Vol.01 · No.10 CS · AI · Infra May 30, 2026

AI Glossary

GlossaryReferenceLearn
CS Fundamentals Data Engineering

Inverted Index

Difficulty

Plain Explanation

Scanning every row of a large text column is slow. An inverted index flips the mapping to term→documents so the engine looks up a token in a dictionary and jumps straight to its postings list. Many queries can be answered from the dictionary+postings without touching the base text.

Examples & Analogies

  • Book index: look up "wind" and get the pages; in a DB you get doc IDs.
  • Ops log filter: "timeout AND 500" intersects the two postings lists, e.g., timeout=[12,15,99], 500=[15,50,99] → [15,99].
  • Fast single‑term responses: some implementations keep a small top‑N table per term to avoid scanning full postings for common queries.

At a Glance

Inverted indexBloom filter indexFull table scan
ResultExact doc IDs (opt: TF/positions)Maybe present (false positives)Evaluate all rows
QueriesAND/OR/phrase/NEARLimitedAny (slow)
StorageDictionary + postings; often answer from indexBitset/hash summaryNo index
TuningDict/block layout, compression, skip/scoreHash count/sizeLittle to none

Where and Why It Matters

  • Remote object storage: minimize small random reads with block‑based dictionaries and contiguous postings for sequential I/O.
  • Search/in‑memory modules: compress docIDs with delta+varint; use skip/score indexes to speed intersections and top‑N.
  • IR practice: postings are sorted by docID and stored contiguously; the dictionary maps tokens to offsets and tracks stats.

Common Misconceptions

  • ❌ You must read the original text for every match. → ✅ Many queries are satisfied by dictionary+postings alone.
  • ❌ A Bloom filter is equivalent. → ✅ It may return false positives and supports limited predicates.
  • ❌ A hashmap dictionary is enough. → ✅ Layouts like FST or block+front‑coding with a sparse index cut random I/O.

How It Sounds in Conversation

  • "Switch that text column to an inverted index; we’ll tune the dictionary block size for object storage."
  • "For 'wind AND winter', just intersect postings; no need to open the base column."
  • "Let’s delta+varint encode doc IDs and keep a skip table for long lists."
  • "Phrase queries are slow because we don’t store positions."

Related Reading

References

Helpful?