The neural network architecture underlying modern LLMs, introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. at Google. Unlike earlier recurrent neural networks (RNNs) that processed text sequentially word-by-word, transformers process entire sequences in parallel using attention mechanisms. This parallelization enabled dramatically faster training and better handling of long-range dependencies in text.
Key innovation: Transformers replaced recurrence with self-attention, allowing every position in a sequence to attend to every other position simultaneously. This architecture now powers GPT, Claude, Gemini, LLaMA, and virtually all frontier language models.
Analogy: Think of RNNs like reading a book one word at a time while trying to remember everything. Transformers are like having the entire book spread out on a table, allowing you to instantly see connections between any two passages.
A technique that allows models to dynamically focus on the most relevant parts of input when generating each output. Self-attention computes relationships between all tokens in a sequence by transforming each input into three vectors:
- Query: what am I looking for?
- Key: what do I contain?
- Value: what information do I provide?
The attention scores determine how much each token should influence others.
Multi-head attention: Modern transformers use multiple parallel attention "heads," each learning different relationship patterns. One head might track syntactic structure while another captures semantic meaning. GPT-2 (small), for example, uses 12 attention heads per layer.
Example: In "The cat sat on the mat because it was tired," attention helps the model understand that "it" refers to "the cat" by computing high attention scores between these tokens.
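A minimal NumPy sketch of single-head scaled dot-product attention. The dimensions and weights are toy values, not any particular model's; real implementations add masking, multiple heads, and per-head learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over a sequence of embeddings X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # project each token into query/key/value vectors
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # how strongly each token matches every other
    weights = softmax(scores, axis=-1)        # each row sums to 1: one attention pattern per token
    return weights @ V                        # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16
X = rng.normal(size=(seq_len, d_model))       # five toy token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (5, 16): one context-aware vector per token
```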
A computing system loosely inspired by biological neurons, consisting of interconnected nodes (artificial neurons) organized in layers that process information. Each connection has a learnable weight that adjusts during training. Modern LLMs are "deep" neural networks with many layers: GPT-2 variants stack 12 to 48 transformer blocks sequentially, with information becoming more abstract at higher layers.
Analogy: Like a factory assembly line where each station (layer) transforms the raw material (input) into increasingly refined products, with early stages handling basic features and later stages assembling complex patterns.
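A minimal sketch of the assembly-line idea: each layer applies a learned transformation to the previous layer's output. Random weights stand in for trained ones here.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_weights = [rng.normal(size=(8, 8)) for _ in range(3)]  # three stacked layers

def forward(x, layer_weights):
    for W in layer_weights:          # each "station" transforms the previous output
        x = np.maximum(0, x @ W)     # linear combination + ReLU nonlinearity
    return x

print(forward(rng.normal(size=8), layer_weights))  # the final, most refined representation
```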
The numerical weights that define what an LLM has learned. Each parameter is a tunable value in the network's connections, adjusted during training to minimize prediction errors. More parameters generally mean greater capacity to learn complex patterns, though better architectures or training data can achieve similar results with fewer parameters.
The basic units of text that LLMs process: subword chunks that balance vocabulary size against sequence length. Rather than whole words or individual characters, modern tokenizers split text into meaningful subunits. This approach handles rare words by breaking them into familiar pieces while keeping common words intact.
Example: "unhappiness" might tokenize to ["un", "happi", "ness"]. "Hello, world!" typically becomes 4 tokens: ["Hello", ",", " world", "!"]. GPT-4 uses ~100,000 tokens in its vocabulary; approximately 1 token ≈ 0.75 words in English.
Byte Pair Encoding (BPE): Originally a compression algorithm, now the dominant tokenization method (used by GPT, LLaMA, DeepSeek). The algorithm (a toy sketch follows below):
1. Starts with individual characters
2. Iteratively merges the most frequent adjacent pair
3. Continues until reaching the desired vocabulary size
Byte-level BPE extends this to raw bytes, ensuring any text can be encoded.
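The toy sketch referenced above. This is illustrative only: production tokenizers (such as GPT-2's) operate on raw bytes and pre-split text with a regex before merging.

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Toy BPE: learn merge rules from a tiny corpus of words."""
    # each word starts as a tuple of single characters, weighted by frequency
    vocab = Counter({tuple(w): c for w, c in Counter(words).items()})
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, count in vocab.items():
            for pair in zip(word, word[1:]):       # count adjacent symbol pairs
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)           # most frequent pair wins
        merges.append(best)
        new_vocab = Counter()
        for word, count in vocab.items():          # apply the merge everywhere
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1]); i += 2
                else:
                    merged.append(word[i]); i += 1
            new_vocab[tuple(merged)] += count
        vocab = new_vocab
    return merges

print(bpe_train(["low", "lower", "lowest", "low"], num_merges=3))
# [('l', 'o'), ('lo', 'w'), ('low', 'e')]: frequent pairs become single tokens
```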
WordPiece: Developed by Google for BERT. Similar to BPE but selects merges based on likelihood improvement rather than raw frequency. Uses a ## prefix to mark continuation tokens.
Why it matters: Tokenization directly impacts model efficiency. A larger vocabulary means shorter sequences but more parameters in the embedding layer. GPT-2 uses ~50K tokens; GPT-4 uses ~100K: a tradeoff between compression and model size.
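A back-of-envelope look at that tradeoff, assuming GPT-2 small's 768-dimensional embeddings (the vocabulary sizes are the approximate figures above):

```python
d_model = 768                          # embedding width of GPT-2 small
for vocab_size in (50_257, 100_000):   # ~GPT-2 and ~GPT-4 vocabulary sizes
    params = vocab_size * d_model      # one d_model-wide vector per token
    print(f"{vocab_size:>7} tokens -> {params / 1e6:.1f}M embedding parameters")
# 50257 tokens -> 38.6M; 100000 tokens -> 76.8M
```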
Dense numerical vectors representing words, tokens, or concepts in high-dimensional space. Models learn embeddings during training such that semantically similar items cluster together. These vectors capture rich relationships: not just similarity but analogies and hierarchies.
Classic example: vector("king") - vector("man") + vector("woman") ≈ vector("queen"). Modern embeddings encode much richer relationships across thousands of dimensions, enabling semantic search, clustering, and transfer learning.
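A sketch of that arithmetic with hand-picked 3-dimensional toy vectors (real embeddings are learned and far higher-dimensional):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# toy vectors chosen so the analogy works exactly; real values are learned
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.9, 0.0]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.0, 1.0]),
}
result = vecs["king"] - vecs["man"] + vecs["woman"]
print(cosine(result, vecs["queen"]))  # ~1.0: "queen" is the closest vector
```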
The maximum amount of text (in tokens) an LLM can consider simultaneously. This defines how much "memory" the model has within a single interaction. Larger windows enable processing entire books, codebases, or long conversations, but the computational cost of standard attention grows quadratically with sequence length.
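A rough sketch of where the quadratic cost comes from: naively materialized, the attention score matrix alone holds seq_len² entries. (Real systems use techniques like FlashAttention to avoid storing it, but compute still scales quadratically.)

```python
# memory for one attention head's score matrix, naively stored in fp16
for seq_len in (4_000, 128_000, 1_000_000):
    score_bytes = seq_len ** 2 * 2            # seq_len x seq_len scores, 2 bytes each
    print(f"{seq_len:>9} tokens -> {score_bytes / 2**30:10,.1f} GiB per head per layer")
```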
Evolution of context windows:

| Model | Year | Context Window |
| --- | --- | --- |
| GPT-3 | 2020 | 4K tokens |
| GPT-4 Turbo | 2023 | 128K tokens |
| Gemini 1.5 Pro | 2024 | 1M tokens |
| Claude Sonnet 4 | 2025 | 1M tokens |
| Llama 4 | 2025 | up to 10M tokens |
Caveat: Research shows "context rot": most models degrade in reliability as input length grows, often performing well below advertised limits. A 200K window might become unreliable around 130K tokens in practice.
The foundational training phase where LLMs learn language patterns by predicting the next token across massive text corpora (books, websites, code repositories, and academic papers) comprising hundreds of billions to trillions of tokens. This self-supervised learning (no human labels needed) allows models to acquire grammar, facts, reasoning patterns, and world knowledge implicitly.
Core insight: Next-token prediction is deceptively powerful. To accurately predict what comes next, models must implicitly learn syntax, semantics, facts, logical relationships, and even approximate reasoning, all emerging from this simple objective.
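A minimal sketch of the objective with a three-word toy vocabulary: the model scores every token, and training penalizes low probability on the token that actually came next.

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2}        # toy three-token vocabulary
logits = np.array([0.5, 2.0, 0.1])            # model's raw scores for the next token
probs = np.exp(logits) / np.exp(logits).sum() # softmax into a probability distribution
target = vocab["sat"]                          # the token that actually appears next
loss = -np.log(probs[target])                  # cross-entropy: low p(truth) -> high loss
print(f"p(sat) = {probs[target]:.2f}, loss = {loss:.2f}")  # p(sat) = 0.11, loss = 2.22
```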
Additional training on specific data to adapt a pre-trained model for particular tasks, domains, or behaviors. This takes a general-purpose language model and specializes it (for instruction-following, medical knowledge, coding, or specific organizational needs) using far less data than pre-training required.
Common approaches:
- Supervised fine-tuning (SFT) on curated examples
- Instruction tuning on diverse task formats
- Domain adaptation on specialized corpora
- Parameter-efficient methods like LoRA, which enable fine-tuning with minimal compute (sketched below)
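A minimal sketch of the LoRA idea referenced above: freeze the pre-trained weight matrix and train only a low-rank correction. Toy sizes; real adapters also apply a scaling factor.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 768, 8                            # hidden width and low rank, r << d
W = rng.normal(size=(d, d))              # pre-trained weight, kept frozen
A = rng.normal(size=(d, r)) * 0.01       # trainable down-projection
B = np.zeros((r, d))                     # trainable up-projection; zero init makes the adapter a no-op at start

def adapted_forward(x):
    return x @ W + x @ A @ B             # original path plus low-rank update

full, lora = d * d, 2 * d * r            # trainable parameters: full fine-tune vs LoRA
print(f"LoRA trains {lora:,} of {full:,} parameters ({100 * lora / full:.1f}%)")
# LoRA trains 12,288 of 589,824 parameters (2.1%)
```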
A technique that aligns LLMs with human preferences by training on human judgments rather than predefined rewards.
The process:
1. Collect human comparisons of model outputs (which response is better?)
2. Train a reward model to predict human preferences (its loss is sketched below)
3. Use reinforcement learning (typically PPO) to optimize the LLM against this reward model
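A minimal sketch of step 2's training signal, the standard pairwise (Bradley-Terry) preference loss: the reward model is pushed to score the human-preferred response higher.

```python
import numpy as np

def preference_loss(reward_chosen, reward_rejected):
    """-log sigmoid(r_chosen - r_rejected): small when the preferred reply scores higher."""
    return -np.log(1 / (1 + np.exp(-(reward_chosen - reward_rejected))))

# hypothetical scalar rewards assigned to two candidate responses
print(preference_loss(1.5, 0.2))  # ~0.24: reward model agrees with the human ranking
print(preference_loss(0.2, 1.5))  # ~1.54: strong signal to fix the ordering
```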
Why it matters: RLHF transforms raw language models into helpful assistants. InstructGPT, ChatGPT, and Claude all use RLHF variants. It addresses the "alignment" problem: making AI systems do what humans actually want rather than what's literally specified.
2025 developments:
- RLAIF (AI feedback) achieves comparable results with less human annotation
- RLTHF achieves full alignment with only 6-7% of the traditional annotation effort
- Direct Preference Optimization (DPO) bypasses reward-model training entirely
- Modern training pipelines combine these techniques across many iterative rounds
The text corpus used to train an LLM, significantly impacting capabilities and behaviors. Quality and diversity matter as much as scale: research shows smaller models trained on high-quality data can outperform larger models trained on noisier data.
Abilities that appear suddenly in larger models but are absent in smaller ones: capabilities that cannot be predicted by extrapolating from smaller scales. Examples include chain-of-thought reasoning, in-context learning, and multi-step problem solving.
The scientific debate:
- Emergence is real: performance hovers near random until a critical threshold, then jumps dramatically (like phase transitions in physics)
- Emergent abilities may be tied to pre-training loss thresholds, not just parameter count
- Large Reasoning Models (LRMs) like o1 demonstrate emergent capabilities through reinforcement learning plus inference-time search
- OpenAI's o1 achieved 83.3% on Competition Math vs. GPT-4o's 13.4%, suggesting fundamental shifts in capability
Analogy: Water freezing is a phase transition; it doesn't gradually become "a little bit frozen." Similarly, models may acquire capabilities through sudden reorganizations of internal representations rather than smooth accumulation.
When an LLM generates content that is fluent and plausible-sounding but factually incorrect, unsupported by evidence, or entirely fabricated.
Types:
- Intrinsic hallucination: contradicts information in the provided context
- Extrinsic hallucination: invents unverifiable information not present in any source
Root causes (2025 research): Hallucinations are now understood as a systemic incentive problem: training objectives reward confident responses over calibrated uncertainty. Models learn to "bluff" rather than admit ignorance because benchmarks penalize "I don't know" responses.
Mitigation strategies:

| Strategy | Effectiveness |
| --- | --- |
| Chain-of-thought prompting | Reduces hallucinations 50%+ in prompt-sensitive scenarios |
| Retrieval-Augmented Generation (RAG) | Grounds responses in external knowledge (but not a panacea) |
| Calibration-aware reward training | Rewards appropriate uncertainty |
| Span-level verification | Validates claims against knowledge bases |
Real-world impact: In Mata v. Avianca (2023), a lawyer was sanctioned for submitting a brief with fabricated case citations generated by ChatGPT.
AI systems that can process and generate multiple types of content (text, images, audio, video), often within the same interaction. This requires specialized encoders (like vision transformers) to convert non-text inputs into representations the language model can understand.
Examples:
- GPT-4o ("omni"): Unifies text, image, and audio in a single architecture
- Gemini 2.5: Processes text, images, audio, and video with native 1M+ token context
- Claude 3+: Analyzes images within conversations
- DALL-E 3, Stable Diffusion, Midjourney: Generate images from text
A parameter controlling randomness in output generation. Temperature scales the probability distribution over possible next tokens before sampling.
| Temperature | Behavior | Use Cases |
| --- | --- | --- |
| 0.0 | Deterministic, most likely tokens | Factual Q&A, code generation, structured outputs |
| 0.3-0.5 | Balanced | General-purpose tasks |
| 0.7-1.0 | Creative, varied | Creative writing, brainstorming, diverse options |
| >1.0 | Highly random | Experimental, may become incoherent |
Technical detail: The logits (raw scores) are divided by the temperature before the softmax. Lower temperature sharpens the distribution; higher temperature flattens it.
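A minimal sketch of that detail: the same logits produce sharper or flatter distributions as the temperature changes.

```python
import numpy as np

def next_token_dist(logits, temperature):
    scaled = np.asarray(logits) / temperature  # divide logits by T before softmax
    e = np.exp(scaled - scaled.max())
    return e / e.sum()

logits = [2.0, 1.0, 0.5, 0.1]                  # toy raw scores for four tokens
for t in (0.2, 0.7, 1.5):
    print(t, np.round(next_token_dist(logits, t), 3))
# at T=0.2 nearly all mass sits on the top token; at T=1.5 it spreads out
```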
A sampling method that considers only the smallest set of most likely tokens whose cumulative probability exceeds threshold p. Unlike top-k (fixed number of candidates), top-p adapts dynamically.
Example: With top-p = 0.9, the model samples from tokens comprising the top 90% of probability mass. If one token has 95% probability, only it is considered. If the top token has 40% probability, many tokens might be included until reaching 90% cumulative.
Best practice: Often used in combination with temperature. A common setting is temperature=0.7, top-p=0.9.
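A minimal sketch of the filtering step (the probabilities are made up for illustration):

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of most likely tokens whose cumulative probability exceeds p."""
    order = np.argsort(probs)[::-1]              # sort tokens, most likely first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # first position where cum. prob. > p
    kept = np.zeros_like(probs)
    kept[order[:cutoff]] = probs[order[:cutoff]]
    return kept / kept.sum()                     # renormalize the survivors

probs = np.array([0.5, 0.3, 0.15, 0.04, 0.01])
print(top_p_filter(probs, p=0.9))  # the top 3 tokens survive (0.5+0.3+0.15 = 0.95 > 0.9)
```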
A component that converts images into tokens or embeddings that language models can understand. Vision Transformers (ViT) divide images into patches (like tokens), process them through transformer layers, and produce representations that can be integrated with text.
Process (sketched in code below):
1. Image divided into fixed-size patches (e.g., 16×16 pixels)
2. Each patch embedded as a vector
3. Positional encodings added
4. Patches processed through transformer layers
5. Output representations integrated with the language model
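A minimal NumPy sketch of steps 1-3, assuming a 224×224 RGB image and 16×16 patches (ViT-Base-style sizes; the random projection and positional noise stand in for learned weights):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))        # toy RGB image
P = 16                                   # patch size: 16x16 pixels

# 1. split into (224/16)^2 = 196 patches, each flattened to 16*16*3 = 768 values
patches = image.reshape(14, P, 14, P, 3).transpose(0, 2, 1, 3, 4).reshape(196, -1)

# 2. linearly embed each patch (a random matrix stands in for the learned projection)
tokens = patches @ rng.normal(size=(768, 768))

# 3. add positional encodings so patch order is recoverable (placeholder values here)
tokens += rng.normal(size=tokens.shape) * 0.02
print(tokens.shape)  # (196, 768): "visual tokens" ready for transformer layers
```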
A technique for generating images (and increasingly video) by learning to reverse a process of gradually adding noise.
Training: The model learns to denoise images step by step.
Generation: Starts from pure noise and iteratively refines it into a coherent image, guided by the text prompt.
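A minimal sketch of the forward (noising) process in a simplified DDPM style. The schedule below is one common cosine choice; a real model would be trained to predict the added noise at each step.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.random((8, 8))                          # toy "image"

def add_noise(x0, t, T=1000):
    """Blend the clean image with Gaussian noise; more noise as t grows."""
    alpha_bar = np.cos(t / T * np.pi / 2) ** 2   # simplified cosine noise schedule
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * noise

# training teaches the model to predict `noise` from the noisy image at step t;
# generation runs the chain in reverse, from pure noise back to a clean image
print(add_noise(x0, t=100).std())  # mostly image
print(add_noise(x0, t=900).std())  # mostly noise
```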
Key models:
- DALL-E 3 (OpenAI): Text-to-image, integrated with ChatGPT
- Sora (2024): Extends diffusion to video generation
Analogy: Like a sculptor starting with a rough block of marble (noise) and progressively chiseling away to reveal the statue (image), with the text prompt serving as the blueprint.