Core Architecture

Transformer

The neural network architecture underlying modern LLMs, introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. at Google. Unlike earlier recurrent neural networks (RNNs) that processed text sequentially word-by-word, transformers process entire sequences in parallel using attention mechanisms. This parallelization enabled dramatically faster training and better handling of long-range dependencies in text.

Key innovation: Transformers replaced recurrence with self-attention, allowing every position in a sequence to attend to every other position simultaneously. This architecture now powers GPT, Claude, Gemini, LLaMA, and virtually all frontier language models.

🔗 Analogy: Think of RNNs like reading a book one word at a time while trying to remember everything. Transformers are like having the entire book spread out on a table, allowing you to instantly see connections between any two passages.

Attention Mechanism

A technique that allows models to dynamically focus on the most relevant parts of input when generating each output. Self-attention computes relationships between all tokens in a sequence by transforming each input into three vectors:

  • Query — what am I looking for?
  • Key — what do I contain?
  • Value — what information do I provide?

The attention scores determine how much each token should influence others.
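
A minimal sketch of this computation in NumPy, with toy sizes and random weight matrices (all shapes and values here are illustrative, not taken from any real model):

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(X, W_q, W_k, W_v):
        """Single-head scaled dot-product self-attention over X: (seq_len, d_model)."""
        Q, K, V = X @ W_q, X @ W_k, X @ W_v      # project each token into query/key/value space
        scores = Q @ K.T / np.sqrt(K.shape[-1])  # how well each query matches every key
        weights = softmax(scores, axis=-1)       # each row sums to 1: attention over the sequence
        return weights @ V                       # blend value vectors by attention weight

    rng = np.random.default_rng(0)
    seq_len, d_model, d_head = 5, 16, 8
    X = rng.normal(size=(seq_len, d_model))      # toy token embeddings
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    print(self_attention(X, W_q, W_k, W_v).shape)   # (5, 8): one contextualized vector per token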

Multi-head attention: Modern transformers use multiple parallel attention "heads," each learning different relationship patterns. One head might track syntactic structure while another captures semantic meaning. The smallest GPT-2 model, for example, uses 12 attention heads per layer.

🔗 Example: In "The cat sat on the mat because it was tired," attention helps the model understand that "it" refers to "the cat" by computing high attention scores between these tokens.

Neural Network

A computing system loosely inspired by biological neurons, consisting of interconnected nodes (artificial neurons) organized in layers that process information. Each connection has a learnable weight that adjusts during training. Modern LLMs are "deep" neural networks with many layers—GPT-2 has 12-48 transformer blocks stacked sequentially, with information becoming more abstract at higher layers.
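
A toy forward pass through stacked layers, in NumPy, to make the "layers refine representations" idea concrete (sizes and weights are illustrative stand-ins):

    import numpy as np

    def layer(x, W, b):
        return np.maximum(0, x @ W + b)      # linear transform followed by a ReLU non-linearity

    rng = np.random.default_rng(0)
    x = rng.normal(size=(1, 8))              # toy input features
    for _ in range(3):                       # three stacked layers
        W = 0.1 * rng.normal(size=(8, 8))    # learnable weights (random stand-ins here)
        b = np.zeros(8)
        x = layer(x, W, b)                   # each pass yields a more abstract representation
    print(x.shape)                           # (1, 8)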

🔗 Analogy: Like a factory assembly line where each station (layer) transforms the raw material (input) into increasingly refined products, with early stages handling basic features and later stages assembling complex patterns.

Parameters

The numerical weights that define what an LLM has learned. Each parameter is a tunable value in the network's connections, adjusted during training to minimize prediction errors. More parameters generally mean greater capacity to learn complex patterns, though better architectures and higher-quality training data can achieve similar results with fewer parameters.

Scale reference:

Model          Parameters
GPT-2 (2019)   1.5B
GPT-4 (est.)   ~1.7T
Mixtral 8x7B   47B total, 13B active per token

Data Processing

Tokens

The basic units of text that LLMs process—subword chunks that balance vocabulary size against sequence length. Rather than whole words or individual characters, modern tokenizers split text into meaningful subunits. This approach handles rare words by breaking them into familiar pieces while keeping common words intact.

🔗 Example: "unhappiness" might tokenize to ["un", "happi", "ness"]. "Hello, world!" typically becomes 4 tokens: ["Hello", ",", " world", "!"]. GPT-4 uses ~100,000 tokens in its vocabulary; approximately 1 token ≈ 0.75 words in English.

Tokenization (BPE & WordPiece)

Byte Pair Encoding (BPE): Originally a compression algorithm, now the dominant tokenization method (used by GPT, LLaMA, DeepSeek). The algorithm (a code sketch follows the list):

  1. Starts with individual characters
  2. Iteratively merges the most frequent adjacent pairs
  3. Continues until reaching the desired vocabulary size

Byte-level BPE extends this to raw bytes, ensuring any text can be encoded.
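
A toy sketch of the merge loop above, operating on one string for simplicity (real implementations count pairs over a word-frequency table and respect word boundaries):

    from collections import Counter

    def bpe_train(text, num_merges):
        tokens = list(text)                               # 1. start with individual characters
        merges = []
        for _ in range(num_merges):
            pairs = Counter(zip(tokens, tokens[1:]))
            if not pairs:
                break
            (a, b), _ = pairs.most_common(1)[0]           # 2. most frequent adjacent pair
            merges.append(a + b)
            merged, i = [], 0
            while i < len(tokens):                        # replace every occurrence of the pair
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(tokens[i])
                    i += 1
            tokens = merged                               # 3. repeat until the vocab budget is reached
        return tokens, merges

    tokens, merges = bpe_train("low lower lowest", num_merges=4)
    print(merges)                                         # learned subword units, e.g. 'lo', 'low', ...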

WordPiece: Developed by Google for BERT. Similar to BPE but selects merges based on likelihood improvement rather than raw frequency. Uses ## prefix to mark continuation tokens.

"playing" → ["play", "##ing"]
"unhappiness" → ["un", "##happi", "##ness"]

Why it matters: Tokenization directly impacts model efficiency. A larger vocabulary means shorter sequences but more parameters in the embedding layer. GPT-2 uses ~50K tokens; GPT-4 uses ~100K—a tradeoff between compression and model size.
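
🔗 Example: with a 50K-token vocabulary and GPT-2 XL's 1,600-dimensional embeddings, the embedding matrix alone holds roughly 50,000 × 1,600 = 80M parameters.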


Embeddings

Dense numerical vectors representing words, tokens, or concepts in high-dimensional space. Models learn embeddings during training such that semantically similar items cluster together. These vectors capture rich relationships—not just similarity but analogies and hierarchies.

🔗 Classic example: vector("king") - vector("man") + vector("woman") ≈ vector("queen"). Modern embeddings encode much richer relationships across thousands of dimensions, enabling semantic search, clustering, and transfer learning.
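
A minimal sketch of that arithmetic with hand-picked toy vectors (real embeddings are learned and span hundreds or thousands of dimensions):

    import numpy as np

    # Hand-picked 3-D vectors chosen so the analogy works; purely illustrative.
    emb = {
        "king":  np.array([0.9, 0.8, 0.1]),
        "man":   np.array([0.5, 0.1, 0.1]),
        "woman": np.array([0.5, 0.1, 0.9]),
        "queen": np.array([0.9, 0.8, 0.9]),
    }

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    target = emb["king"] - emb["man"] + emb["woman"]          # the classic analogy arithmetic
    print(max(emb, key=lambda w: cosine(emb[w], target)))     # 'queen'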

Context Window

The maximum amount of text (in tokens) an LLM can consider simultaneously. This defines how much "memory" the model has within a single interaction. Larger windows enable processing entire books, codebases, or long conversations, but the computational cost of attention grows quadratically with sequence length.

Evolution of context windows:

Model            Year   Context Window
GPT-3            2020   2K tokens
GPT-4 Turbo      2023   128K tokens
Gemini 1.5 Pro   2024   1M tokens
Claude Sonnet 4  2025   1M tokens
Llama 4          2025   up to 10M tokens

Caveat: Research shows "context rot"—most models degrade in reliability as input length grows, often performing well below advertised limits. A 200K window might become unreliable around 130K tokens in practice.


Training Process

Pre-training

The foundational training phase where LLMs learn language patterns by predicting the next token across massive text corpora—books, websites, code repositories, and academic papers comprising hundreds of billions to trillions of tokens. This self-supervised learning (no human labels needed) allows models to acquire grammar, facts, reasoning patterns, and world knowledge implicitly.

🔗 Core insight: Next-token prediction is deceptively powerful. To accurately predict what comes next, models must implicitly learn syntax, semantics, facts, logical relationships, and even approximate reasoning—all emerging from this simple objective.
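
A sketch of how the objective pairs inputs with targets (the token ids below are hypothetical):

    # Next-token prediction: each position's target is simply the following token.
    token_ids = [464, 3797, 3332, 319, 262, 2603]   # hypothetical ids for "The cat sat on the mat"
    inputs  = token_ids[:-1]                        # model sees:   The cat sat on the
    targets = token_ids[1:]                         # must predict: cat sat on the mat
    # Training minimizes cross-entropy between the model's predicted distribution
    # at each position and the actual next token.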

Fine-tuning

Additional training on specific data to adapt a pre-trained model for particular tasks, domains, or behaviors. This takes a general-purpose language model and specializes it—for instruction-following, medical knowledge, coding, or specific organizational needs—using far less data than pre-training required.

Common approaches:

  • Supervised fine-tuning (SFT) on curated examples
  • Instruction tuning on diverse task formats
  • Domain adaptation on specialized corpora
  • Parameter-efficient methods such as LoRA, which fine-tune with minimal compute (sketched below)
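
A minimal sketch of the LoRA idea in NumPy (shapes illustrative): the pre-trained weight stays frozen while a low-rank pair of matrices learns the update.

    import numpy as np

    rng = np.random.default_rng(0)
    d, r = 512, 8                              # model width and LoRA rank, with r << d

    W = rng.normal(size=(d, d))                # frozen pre-trained weight
    A = 0.01 * rng.normal(size=(d, r))         # trainable down-projection
    B = np.zeros((r, d))                       # trainable up-projection, zero-initialized

    def lora_forward(x):
        return x @ W + x @ A @ B               # original path plus low-rank update

    print(lora_forward(rng.normal(size=(1, d))).shape)   # (1, 512)
    # Trainable parameters: 2*d*r = 8,192 vs d*d = 262,144 for full fine-tuning.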

RLHF (Reinforcement Learning from Human Feedback)

A technique that aligns LLMs with human preferences by training on human judgments rather than predefined rewards.

The process:

  1. Collect human comparisons of model outputs (which response is better?)
  2. Train a reward model to predict human preferences (see the loss sketch after this list)
  3. Use reinforcement learning (typically PPO) to optimize the LLM against this reward model
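
Step 2 commonly uses a pairwise (Bradley-Terry style) loss; a minimal sketch with illustrative scalar reward scores:

    import math

    def preference_loss(r_chosen, r_rejected):
        # Maximize the probability that the human-preferred response scores higher.
        return -math.log(1 / (1 + math.exp(-(r_chosen - r_rejected))))

    print(preference_loss(2.0, 0.5))   # small loss: reward model agrees with the human
    print(preference_loss(0.5, 2.0))   # large loss: reward model disagrees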

Why it matters: RLHF transforms raw language models into helpful assistants. InstructGPT, ChatGPT, and Claude all use RLHF variants. It addresses the "alignment" problem—making AI systems do what humans actually want rather than what's literally specified.

2025 developments:

  • RLAIF (AI feedback) achieves comparable results with less human annotation
  • RLTHF reports matching full human-annotation alignment quality with only 6-7% of the traditional annotation effort
  • Direct Preference Optimization (DPO) bypasses reward model training entirely
  • Modern training involves many iterative rounds combining multiple techniques

Training Data

The text corpus used to train an LLM, significantly impacting capabilities and behaviors. Quality and diversity matter as much as scale—research shows smaller models trained on high-quality data can outperform larger models trained on noisier data.

Typical sources:

  • Web crawls (Common Crawl)
  • Books and literature
  • Wikipedia
  • Academic papers
  • Code repositories (GitHub)
  • Curated instruction datasets

Capabilities & Phenomena

Emergent Capabilities

Abilities that appear suddenly in larger models but are absent in smaller ones—capabilities that cannot be predicted by extrapolating from smaller scales. Examples include chain-of-thought reasoning, in-context learning, and multi-step problem solving.

The scientific debate:

  • Emergence is real: performance hovers near random until a critical threshold, then jumps dramatically (like phase transitions in physics)
  • Emergence is a mirage: smoother metrics reveal gradual improvement; apparent discontinuities stem from non-linear evaluation choices

2025 research findings:

  • Emergent abilities may be tied to pre-training loss thresholds, not just parameter count
  • Large Reasoning Models (LRMs) like o1 demonstrate emergent capabilities through reinforcement learning + inference-time search
  • OpenAI's o1 achieved 83.3% on Competition Math vs GPT-4o's 13.4%—suggesting fundamental shifts in capability

🔗 Analogy: Like phase transitions in physics—water doesn't gradually become "a little bit frozen." Similarly, models may acquire capabilities through sudden reorganizations of internal representations rather than smooth accumulation.

Hallucination

When an LLM generates content that is fluent and plausible-sounding but factually incorrect, unsupported by evidence, or entirely fabricated.

Types:

  • Intrinsic hallucination: Contradicts information in the provided context
  • Extrinsic hallucination: Invents unverifiable information not present in any source

Root causes (2025 research): Hallucinations are now understood as a systemic incentive problem—training objectives reward confident responses over calibrated uncertainty. Models learn to "bluff" rather than admit ignorance because benchmarks penalize "I don't know" responses.

Mitigation strategies:

  • Chain-of-thought prompting: reduces hallucinations 50%+ in prompt-sensitive scenarios
  • Retrieval-Augmented Generation (RAG): grounds responses in external knowledge (but not a panacea)
  • Calibration-aware reward training: rewards appropriate uncertainty
  • Span-level verification: validates claims against knowledge bases

🔗 Real-world impact: In Mata v. Avianca (2023), a lawyer was sanctioned for submitting a brief with fabricated case citations generated by ChatGPT.

Inference

The process of generating output from a trained model—what happens when you chat with an AI. Each response involves:

  1. Processing your input through all model layers
  2. Generating tokens one at a time (autoregressive generation)
  3. Each new token conditioned on everything that came before

Inference costs (compute, latency, money) are a major practical consideration for deployment.
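
A minimal greedy decoding loop showing the three steps above; next_token_logits is a random stand-in for a real model's forward pass:

    import numpy as np

    def next_token_logits(token_ids):
        rng = np.random.default_rng(sum(token_ids))   # deterministic stand-in for a model
        return rng.normal(size=50_000)                # one score per vocabulary entry

    def generate(prompt_ids, max_new_tokens, eos_id=0):
        ids = list(prompt_ids)
        for _ in range(max_new_tokens):
            logits = next_token_logits(ids)           # 1. process the whole context
            token = int(np.argmax(logits))            # 2. pick the next token (greedy here)
            ids.append(token)                         # 3. condition the next step on it
            if token == eos_id:
                break
        return ids

    print(generate([101, 2023], max_new_tokens=5))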


Multimodal

AI systems that can process and generate multiple types of content—text, images, audio, video—often within the same interaction. This requires specialized encoders (like vision transformers) to convert non-text inputs into representations the language model can understand.

Examples:

  • GPT-4o ("omni"): Unifies text, image, and audio in a single architecture
  • Gemini 2.5: Processes text, images, audio, and video with native 1M+ token context
  • Claude 3+: Analyzes images within conversations
  • DALL-E 3, Stable Diffusion, Midjourney: Generate images from text
  • Sora: Generates video from text using diffusion

Generation Controls

Temperature

A parameter controlling randomness in output generation. Temperature scales the probability distribution over possible next tokens before sampling.

Temperature  Behavior                           Use Cases
0.0          Deterministic, most likely tokens  Factual Q&A, code generation, structured outputs
0.3-0.5      Balanced                           General-purpose tasks
0.7-1.0      Creative, varied                   Creative writing, brainstorming, diverse options
>1.0         Highly random                      Experimental, may become incoherent

Technical detail: Temperature divides the logits (raw scores) before softmax. Lower temperature sharpens the distribution; higher temperature flattens it.
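
A minimal sketch of that scaling (logit values illustrative):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    logits = np.array([2.0, 1.0, 0.5, 0.1])            # raw scores for four candidate tokens
    for temp in (0.5, 1.0, 2.0):
        print(temp, softmax(logits / temp).round(3))   # low temp sharpens, high temp flattens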


Top-p (Nucleus Sampling)

A sampling method that considers only the smallest set of most likely tokens whose cumulative probability exceeds threshold p. Unlike top-k (fixed number of candidates), top-p adapts dynamically.

🔗 Example: With top-p = 0.9, the model samples from tokens comprising the top 90% of probability mass. If one token has 95% probability, only it is considered. If the top token has 40% probability, many tokens might be included until reaching 90% cumulative.

Best practice: Often used in combination with temperature. A common setting is temperature=0.7, top-p=0.9.
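
A minimal sketch of nucleus sampling over an illustrative distribution:

    import numpy as np

    def top_p_sample(probs, p=0.9, seed=0):
        rng = np.random.default_rng(seed)
        order = np.argsort(probs)[::-1]                           # most likely tokens first
        cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1  # smallest set reaching p
        nucleus = order[:cutoff]
        nucleus_p = probs[nucleus] / probs[nucleus].sum()         # renormalize within the nucleus
        return int(rng.choice(nucleus, p=nucleus_p))

    probs = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])        # illustrative next-token distribution
    print(top_p_sample(probs))                                    # samples only from the nucleus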


Architectures & Models

Vision Encoder

A component that converts images into tokens or embeddings that language models can understand. Vision Transformers (ViT) divide images into patches (like tokens), process them through transformer layers, and produce representations that can be integrated with text.

Process:

  1. Image divided into fixed-size patches (e.g., 16×16 pixels)
  2. Each patch embedded as a vector
  3. Positional encodings added
  4. Processed through transformer layers
  5. Output representations integrated with language model
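
A NumPy sketch of steps 1-3 (patch size, projection, and encodings are illustrative stand-ins):

    import numpy as np

    rng = np.random.default_rng(0)
    image = rng.random((224, 224, 3))                 # toy RGB image
    P, d_model = 16, 768                              # 16x16 patches, embedding width

    # 1. Split into non-overlapping P x P patches and flatten each one.
    patches = image.reshape(224 // P, P, 224 // P, P, 3).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(-1, P * P * 3)          # (196, 768): one row per patch

    # 2. A linear projection embeds each patch as a vector ("patch tokens").
    W = 0.02 * rng.normal(size=(P * P * 3, d_model))
    tokens = patches @ W                              # (196, d_model)

    # 3. Add positional encodings (random stand-ins for learned ones).
    tokens += 0.02 * rng.normal(size=tokens.shape)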

Diffusion Models

A technique for generating images (and increasingly video) by learning to reverse a process of gradually adding noise.

Training: Model learns to denoise images step-by-step
Generation: Starts from pure noise, iteratively refines into coherent images guided by text prompts
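
A toy sketch of the forward (noising) process that training learns to reverse (the noise schedule is illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    x0 = rng.random((64, 64))                         # toy "image"
    alpha_bar = np.linspace(1.0, 0.01, 1000)          # illustrative noise schedule

    def noisy_sample(x0, t):
        eps = rng.normal(size=x0.shape)               # Gaussian noise
        x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps
        return x_t, eps                               # network is trained to predict eps from (x_t, t)

    x_t, eps = noisy_sample(x0, t=500)                # a half-noised image
    # Generation runs the reverse: start from pure noise and denoise step by step.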

Key models:

  • DALL-E 3 (OpenAI): Text-to-image, integrated with ChatGPT
  • Midjourney: Known for artistic, stylized outputs
  • Stable Diffusion 3 (2024): Open-source, uses transformers
  • Sora (2024): Extends diffusion to video generation

🔗 Analogy: Like a sculptor starting with a rough block of marble (noise) and progressively chiseling away to reveal the statue (image), with the text prompt serving as the blueprint.

Quick Reference: Model Context Windows (2025)

Model             Context Window           Notes
GPT-5             400K input, 128K output  Large output window for long-form generation
GPT-4.1           1M                       API access
Claude Opus 4     200K                     Optimized for precision
Claude Sonnet 4   1M                       Upgraded from 200K
Gemini 2.5 Pro    1M (2M coming)           Native multimodal
Llama 4 Maverick  1M                       Open weights, MoE architecture
DeepSeek R1/V3    128K                     Strong reasoning, open-source

Last updated: January 2025

Research compiled from arXiv surveys, peer-reviewed publications, and industry documentation including: "Emergent Abilities in Large Language Models: A Survey" (2025), "Large Language Models Hallucination: A Comprehensive Survey" (2025), Hugging Face documentation, and model technical reports.
