LLM
Large Language Model
Plain Explanation
Handling real-world language is messy: rules break, phrasing varies, and long documents require tracking ideas across many sentences. Traditional systems struggled to capture long-range relations while staying efficient, which hurt tasks like summarizing long articles or translating with nuance. LLMs address this by learning patterns directly from very large datasets and then generating text one token at a time based on those learned patterns. A useful picture is a roundtable where every word can “look at” every other word to decide what matters now. Concretely, the Transformer’s self‑attention produces attention weights that score how strongly each token should attend to the others, building context‑aware representations for all positions in parallel. The model then computes scores for possible next tokens (logits), turns them into probabilities, and decodes step by step (greedy choice, sampling, or beam search) until a special end token or a length limit stops generation. Training typically uses vast text (and often programming code) so the model can generalize to translation, summarization, and question answering. Many LLMs use a decoder‑only setup for next‑token prediction, while encoder‑decoder variants are strong for conditional generation where outputs must closely reflect a specific input.
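The logits → probabilities → decoding loop described above can be sketched in a few lines. This is a minimal illustration, not how any particular model is implemented: the vocabulary, the `TOY_LOGITS` table, and the function names are all hypothetical, and the "model" here looks only at the last token, whereas a real LLM conditions on the full context.

```python
import math
import random

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy vocabulary and hand-written logits standing in for a trained model.
VOCAB = ["<eos>", "the", "cat", "sat"]
TOY_LOGITS = {
    "<bos>": [0.0, 3.0, 1.0, 0.5],
    "the":   [0.0, 0.1, 3.0, 1.0],
    "cat":   [0.5, 0.1, 0.2, 3.0],
    "sat":   [3.0, 0.5, 0.1, 0.2],
}

def next_token_logits(tokens):
    """Toy stand-in: scores depend only on the last token."""
    return TOY_LOGITS[tokens[-1]]

def decode(strategy="greedy", max_new_tokens=10, seed=0):
    """Generate tokens one at a time until <eos> or the length limit."""
    rng = random.Random(seed)
    tokens = ["<bos>"]
    for _ in range(max_new_tokens):
        probs = softmax(next_token_logits(tokens))
        if strategy == "greedy":
            # Greedy: always take the single most probable token.
            index = max(range(len(probs)), key=probs.__getitem__)
        else:
            # Sampling: draw a token in proportion to its probability.
            index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        token = VOCAB[index]
        if token == "<eos>":
            break  # the end token stops generation
        tokens.append(token)
    return tokens[1:]  # drop the <bos> marker

print(decode("greedy"))    # ['the', 'cat', 'sat']
print(decode("sampling"))  # varies with the seed
```

Greedy decoding is deterministic, while sampling trades repeatability for variety. Beam search is omitted here to keep the sketch short.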
Examples & Analogies
- Customer support triage: The model drafts a prioritized, polite response and suggests related FAQs based on the user’s message. Because outputs are pattern-based and can be unreliable, an agent reviews and edits before sending.
- Code generation and review: Given a docstring and brief description, the model proposes a function, comments, and possible tests. It can help surface likely edge cases but must be validated by engineers before use.
- Cross-language contract briefing: Paste a contract in one language and request a plain-English risk summary that combines translation and summarization. Treat this as a starting point; add evidence from the text and have legal staff review.
At a Glance
| | Encoder-only (Auto-encoding) | Decoder-only (Auto-regressive) | Encoder–Decoder (Seq2Seq) |
|---|---|---|---|
| Primary objective | Learn bidirectional representations | Open-ended text generation | Conditional generation given a source |
| Context directionality | Uses left and right context | Left-to-right (causal) | Encoder is bidirectional; decoder is causal with cross-attention |
| Input grounding strength | N/A (no decoder) | Weaker without explicit conditioning | Strong: attends to full source each step |
| Decoding at inference | Not required | Required (greedy/sampling/beam) | Required (greedy/sampling/beam) |
| Typical tasks | Classification, retrieval, tagging | Free-form writing, chat-style Q&A | Translation, source-tied summarization |
Pick decoder-only for open-ended generation, encoder–decoder when strict source conditioning matters, and encoder-only when you need understanding without generation.
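The "context directionality" row comes down to which positions each token is allowed to attend to. The sketch below computes scaled dot-product attention weights with and without a causal mask. The two-dimensional vectors are made-up values, and using the same vectors as both queries and keys is a simplification; a trained model derives them through learned projections.

```python
import math

def softmax(scores):
    """Normalize scores into weights that sum to 1 (masked -inf scores become 0)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys, causal=False):
    """Row i holds how strongly token i attends to each token j."""
    dim = len(keys[0])
    weights = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if causal and j > i:
                # Causal (decoder-style): future positions are hidden.
                scores.append(float("-inf"))
            else:
                dot = sum(a * b for a, b in zip(q, k))
                scores.append(dot / math.sqrt(dim))
        weights.append(softmax(scores))
    return weights

# Hypothetical vectors for a three-token input.
vectors = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]

bidirectional = attention_weights(vectors, vectors)            # encoder-style
causal = attention_weights(vectors, vectors, causal=True)      # decoder-style

print(bidirectional[0])  # first token spreads weight over all three tokens
print(causal[0])         # [1.0, 0.0, 0.0]: first token sees only itself
```

An encoder–decoder model combines both patterns: bidirectional attention in the encoder, causal attention in the decoder, plus unmasked cross-attention from the decoder to the encoder's output, which this sketch does not show.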
Where and Why It Matters
- Broad NLP coverage in one interface: Many teams apply LLMs to translation, summarization, and prompt-driven Q&A with a shared workflow, reducing one-off task-specific pipelines.
- Transformer-first—with caveats: Self-attention’s parallelism and long-range handling drive adoption, though lighter or task-specific models remain viable trade-offs depending on latency and cost.
- Lifecycle formalization: Data preparation → model preparation → training → alignment → inference → evaluation clarifies responsibilities and risk gates before deployment.
- Multimodal expansion: Extending text models with image/audio/video encoders enables captioning, visual Q&A, and media-aware assistance in a unified experience.
- Evaluation emphasis: Because outputs are learned patterns and can be unreliable, high-stakes uses often add review and targeted evaluations before rollout.
Common Misconceptions
- ❌ Myth: An LLM is a fact database you can query. → ✅ Reality: It predicts likely next tokens from learned patterns; condition it on relevant source text and include human review for high-stakes use.
- ❌ Myth: All useful LLMs share the same architecture. → ✅ Reality: Encoder-only, decoder-only, and encoder–decoder variants exist; match the choice to generation needs and input-grounding strength.
- ❌ Myth: Bigger always means better. → ✅ Reality: Scale helps, but data quality, alignment, decoding choices, and evaluation also determine reliability and cost.
How It Sounds in Conversation
- "For free-form replies, consider a decoder-only checkpoint; for strict source mapping, keep an encoder–decoder baseline."
- "We're close to the context window; trim boilerplate or we'll hit truncation at inference."
- "Add two in-context examples so the output format stabilizes, then compare greedy vs sampling decoding."
- "Post-alignment, tone improved, but we still need a held-out evaluation set before rollout."
- "Track tokens per prompt so finance can forecast run costs and set a budget per request."
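Several of these remarks (in-context examples, the context window, tokens per prompt) meet in prompt assembly. The following is a rough sketch under stated assumptions: `count_tokens` uses whitespace splitting as a crude stand-in for a real tokenizer, and the window size, reserved output length, and price are invented numbers, not figures from any provider.

```python
def count_tokens(text):
    """Crude stand-in for a tokenizer: whitespace-separated words.
    Real tokenizers split text differently, so treat counts as approximate."""
    return len(text.split())

def build_prompt(examples, query):
    """Prepend in-context examples so the model sees the expected output format."""
    shots = [f"Input: {src}\nOutput: {dst}" for src, dst in examples]
    return "\n\n".join(shots + [f"Input: {query}\nOutput:"])

def fit_to_window(examples, query, context_window, reserved_for_output):
    """Drop the earliest examples until prompt plus reserved output fits.
    If the query alone is too long, the prompt is returned unshortened."""
    kept = list(examples)
    while kept and (count_tokens(build_prompt(kept, query))
                    + reserved_for_output > context_window):
        kept.pop(0)
    return build_prompt(kept, query)

def estimate_cost(prompt_tokens, output_tokens, price_per_1k_tokens):
    """Per-request cost under a flat, hypothetical per-token price."""
    return (prompt_tokens + output_tokens) / 1000 * price_per_1k_tokens

# Hypothetical examples and limits, for illustration only.
examples = [("hello there", "greeting"), ("bye now", "farewell")]
prompt = fit_to_window(examples, "good morning team",
                       context_window=20, reserved_for_output=8)
prompt_tokens = count_tokens(prompt)
print(prompt)
print(prompt_tokens, estimate_cost(prompt_tokens, 8, price_per_1k_tokens=0.5))
```

Dropping the earliest examples first is only one possible policy; trimming boilerplate from the query or shortening the examples are equally reasonable choices depending on the task.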
References
- Beyond the Black Box: A Survey on the Theory and Mechanism of Large Language Models
Survey of LLM mechanisms and lifecycle: data → model → training → alignment → inference → evaluation.
- Survey of different Large Language Model Architectures: Trends, Benchmarks, and Challenges
Architecture families of LLMs and multimodal extensions summarized in one place.
- Large Language Models - Stanford University (SLP3 Chapter 7)
Clear overview of encoder-only, decoder-only, and encoder–decoder models and generation basics.
- The architecture of language: Understanding the mechanics behind LLMs
Overview of Transformer and LLM mechanisms.
- Large language models use a surprisingly simple mechanism to retrieve stored knowledge
News explainer on probing how LLMs retrieve facts during generation.