Inverted Index
Plain Explanation
Scanning every row of a large text column is slow. An inverted index flips the mapping to term→documents so the engine looks up a token in a dictionary and jumps straight to its postings list. Many queries can be answered from the dictionary+postings without touching the base text.
Examples & Analogies
- Book index: look up "wind" and get the pages; in a DB you get doc IDs.
- Ops log filter: "timeout AND 500" intersects the two postings lists, e.g., timeout=[12,15,99], 500=[15,50,99] → [15,99].
- Fast single‑term responses: some implementations keep a small top‑N table per term to avoid scanning full postings for common queries.
At a Glance
| Inverted index | Bloom filter index | Full table scan | |
|---|---|---|---|
| Result | Exact doc IDs (opt: TF/positions) | Maybe present (false positives) | Evaluate all rows |
| Queries | AND/OR/phrase/NEAR | Limited | Any (slow) |
| Storage | Dictionary + postings; often answer from index | Bitset/hash summary | No index |
| Tuning | Dict/block layout, compression, skip/score | Hash count/size | Little to none |
Where and Why It Matters
- Remote object storage: minimize small random reads with block‑based dictionaries and contiguous postings for sequential I/O.
- Search/in‑memory modules: compress docIDs with delta+varint; use skip/score indexes to speed intersections and top‑N.
- IR practice: postings are sorted by docID and stored contiguously; the dictionary maps tokens to offsets and tracks stats.
Common Misconceptions
- ❌ You must read the original text for every match. → ✅ Many queries are satisfied by dictionary+postings alone.
- ❌ A Bloom filter is equivalent. → ✅ It may return false positives and supports limited predicates.
- ❌ A hashmap dictionary is enough. → ✅ Layouts like FST or block+front‑coding with a sparse index cut random I/O.
How It Sounds in Conversation
- "Switch that text column to an inverted index; we’ll tune the dictionary block size for object storage."
- "For 'wind AND winter', just intersect postings; no need to open the base column."
- "Let’s delta+varint encode doc IDs and keep a skip table for long lists."
- "Phrase queries are slow because we don’t store positions."
Related Reading
References
- Apache Lucene index package
Lucene documentation for postings, term dictionaries, TermsEnum, and PostingsEnum.
- Elasticsearch index_options
Official mapping option controlling which inverted-index details are stored for search and highlighting.
- A first take at building an inverted index
Stanford IR book chapter introducing inverted indexes and postings lists.
- Index construction
Indexing process for large text collections and inverted-index construction.