CS Fundamentals Data Engineering

Inverted Index

Difficulty

Plain Explanation

Scanning every row of a large text column is slow. An inverted index flips the mapping to term→documents so the engine looks up a token in a dictionary and jumps straight to its postings list. Many queries can be answered from the dictionary+postings without touching the base text.

Examples & Analogies

Book index: look up "wind" and get the pages; in a DB you get doc IDs.
Ops log filter: "timeout AND 500" intersects the two postings lists, e.g., timeout=[12,15,99], 500=[15,50,99] → [15,99].
Fast single‑term responses: some implementations keep a small top‑N table per term to avoid scanning full postings for common queries.

At a Glance

	Inverted index	Bloom filter index	Full table scan
Result	Exact doc IDs (opt: TF/positions)	Maybe present (false positives)	Evaluate all rows
Queries	AND/OR/phrase/NEAR	Limited	Any (slow)
Storage	Dictionary + postings; often answer from index	Bitset/hash summary	No index
Tuning	Dict/block layout, compression, skip/score	Hash count/size	Little to none

Where and Why It Matters

Remote object storage: minimize small random reads with block‑based dictionaries and contiguous postings for sequential I/O.
Search/in‑memory modules: compress docIDs with delta+varint; use skip/score indexes to speed intersections and top‑N.
IR practice: postings are sorted by docID and stored contiguously; the dictionary maps tokens to offsets and tracks stats.

Common Misconceptions

❌ You must read the original text for every match. → ✅ Many queries are satisfied by dictionary+postings alone.
❌ A Bloom filter is equivalent. → ✅ It may return false positives and supports limited predicates.
❌ A hashmap dictionary is enough. → ✅ Layouts like FST or block+front‑coding with a sparse index cut random I/O.

How It Sounds in Conversation

"Switch that text column to an inverted index; we’ll tune the dictionary block size for object storage."
"For 'wind AND winter', just intersect postings; no need to open the base column."
"Let’s delta+varint encode doc IDs and keep a skip table for long lists."
"Phrase queries are slow because we don’t store positions."

References

★Docs
Apache Lucene index package
Lucene documentation for postings, term dictionaries, TermsEnum, and PostingsEnum.
★Docs
Elasticsearch index_options
Official mapping option controlling which inverted-index details are stored for search and highlighting.
★Book
A first take at building an inverted index
Stanford IR book chapter introducing inverted indexes and postings lists.
★Book
Index construction
Indexing process for large text collections and inverted-index construction.

Helpful?

0to1log Weekly

AI Glossary