Feb 27

Language Model Fundamentals

Mindli Team

AI-Generated Content

At its core, the remarkable ability of AI to write, translate, and converse stems from a foundational task: predicting what comes next. Language modeling is the computational process of learning a probability distribution over sequences of words or tokens. By mastering how to assign likelihoods to text, models learn the patterns, grammar, and even reasoning embedded in human language.

From N-Grams to Neural Probabilities

The fundamental goal of a language model is to compute the probability of a sequence of tokens (words or sub-words). For a sequence of tokens w_1, w_2, …, w_n, we want P(w_1, w_2, …, w_n).

The simplest approach is the n-gram model. It makes a simplifying Markov assumption: a token's probability depends only on the previous n − 1 tokens. For a bigram model (n = 2), the probability of a sequence is approximated as:

P(w_1, w_2, …, w_n) ≈ P(w_1) × P(w_2 | w_1) × P(w_3 | w_2) × … × P(w_n | w_{n−1})

Each conditional probability is estimated from a corpus using maximum likelihood estimation:

P(w_i | w_{i−1}) = count(w_{i−1}, w_i) / count(w_{i−1})

While intuitive, n-gram models suffer from the curse of dimensionality (the number of possible n-grams explodes with vocabulary size) and data sparsity. They cannot handle long-range dependencies or unseen word combinations gracefully.
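
This count-ratio estimate can be sketched in a few lines of Python. The corpus below is a toy example, and no smoothing is applied, so an unseen bigram gets probability zero, which illustrates the sparsity problem directly:

```python
from collections import Counter

# A tiny toy corpus; in practice the counts come from millions of sentences.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count bigram occurrences and the contexts they condition on.
bigram_counts = Counter(zip(corpus, corpus[1:]))
context_counts = Counter(corpus[:-1])

def bigram_prob(prev, word):
    """MLE: P(word | prev) = count(prev, word) / count(prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

print(bigram_prob("the", "cat"))  # 0.25: "the" occurs 4 times, followed by "cat" once
print(bigram_prob("the", "car"))  # 0.0: unseen bigram, the sparsity problem
```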

This is where neural language models take over. Instead of storing explicit counts, a neural network learns dense, continuous vector representations (embeddings) for each token. These embeddings capture semantic and syntactic similarities. A model like a Recurrent Neural Network (RNN) or, more effectively, a Transformer, processes a sequence of these embeddings to compute a probability distribution for the next token over the entire vocabulary at each step. The neural model's parameters are learned by maximizing the probability of the training data, allowing it to generalize to unseen sequences far better than sparse n-gram models.
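
A minimal sketch of that idea, with a toy vocabulary and random, untrained weights: the context is compressed into a dense vector (here by simple averaging, a hypothetical stand-in for an RNN or Transformer), scored against every vocabulary word, and normalized with a softmax into a distribution over the whole vocabulary:

```python
import math
import random

random.seed(0)

# Illustrative sizes and weights only -- nothing here is trained.
vocab = ["the", "cat", "sat", "mat"]
dim = 3

# Each token gets a dense embedding vector instead of a sparse count entry.
embeddings = {w: [random.gauss(0, 1) for _ in range(dim)] for w in vocab}
# Output layer: one weight vector per vocabulary word.
output_weights = {w: [random.gauss(0, 1) for _ in range(dim)] for w in vocab}

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_distribution(context):
    """Average the context embeddings, score every vocabulary word,
    and normalize the scores into a probability distribution."""
    h = [sum(embeddings[w][i] for w in context) / len(context) for i in range(dim)]
    scores = [sum(h[i] * output_weights[w][i] for i in range(dim)) for w in vocab]
    return dict(zip(vocab, softmax(scores)))

dist = next_token_distribution(["the", "cat"])
print(dist)  # probabilities over the whole vocabulary, summing to 1
```

The key contrast with n-grams: every token in the vocabulary receives a nonzero probability at every step, even in contexts never seen during training.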

Evaluating Language Models: The Role of Perplexity

How do we know if one language model is better than another? While task-specific metrics (like translation accuracy) are useful, an intrinsic evaluation metric for language models is perplexity. Conceptually, perplexity measures how "surprised" or "perplexed" a model is by unseen text. A lower perplexity indicates a better model.

Mathematically, perplexity is the exponentiated average negative log-likelihood per token on a test set. For a test sequence W = w_1, w_2, …, w_N with N tokens:

PPL(W) = exp( −(1/N) Σ_{i=1}^{N} log P(w_i | w_1, …, w_{i−1}) )

Think of it this way: if a model assigns a high probability to the actual tokens in the test set, the negative log-likelihood is low, leading to low perplexity. For example, a model with a perplexity of 100 is, in a sense, as "confused" as if it had to choose uniformly among 100 equally likely tokens at each step. It is the primary metric for comparing pure language modeling performance during training and development.
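
The formula is short enough to compute by hand. The sketch below takes the probabilities a model assigned to the actual test tokens and confirms the intuition above: a uniform 1-in-100 guess at every step yields a perplexity of exactly 100:

```python
import math

def perplexity(token_probs):
    """Exponentiated average negative log-probability per token.
    token_probs: the model's probability for each actual test token."""
    n = len(token_probs)
    avg_nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_nll)

# Uniform probability 1/100 on every token -> perplexity 100,
# the "choosing among 100 equally likely options" intuition.
print(perplexity([1 / 100] * 5))   # ≈ 100

# Higher probabilities on the actual tokens -> lower perplexity.
print(perplexity([0.5, 0.25, 0.5]))
```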

Core Training Paradigms: Autoregressive and Masked Modeling

Modern language models are trained using two primary paradigms, each with different strengths.

Autoregressive language modeling is the classic "next token prediction" task. The model is trained to predict the next token in a sequence given all previous tokens. It processes text sequentially from left to right, and the probability of the full sequence is factored by the chain rule as:

P(w_1, w_2, …, w_N) = Π_{i=1}^{N} P(w_i | w_1, …, w_{i−1})

This is the objective used in models like GPT (Generative Pre-trained Transformer). It excels at natural text generation, as the training process directly mimics the act of generating one word after another. During inference, you give it a "prompt," and it generates the most probable subsequent tokens.
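
The chain-rule factorization can be made concrete with a tiny hand-built table of hypothetical next-token distributions (an illustrative stand-in for a real model's outputs), summing log-probabilities one conditional at a time:

```python
import math

# Hypothetical P(w_i | w_1..w_{i-1}), keyed by the context seen so far.
# These numbers are made up for illustration, not from a real model.
conditional = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.5, "dog": 0.5},
    ("the", "cat"): {"sat": 0.9, "ran": 0.1},
}

def sequence_log_prob(tokens):
    """Chain rule: log P(w_1..w_N) = sum_i log P(w_i | w_1..w_{i-1})."""
    total = 0.0
    for i, tok in enumerate(tokens):
        total += math.log(conditional[tuple(tokens[:i])][tok])
    return total

lp = sequence_log_prob(["the", "cat", "sat"])
print(math.exp(lp))  # the product 0.6 * 0.5 * 0.9 = 0.27
```

Working in log space, as here, is also how real implementations avoid numerical underflow when multiplying thousands of small probabilities.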

In contrast, masked language modeling (MLM) is a denoising objective. A random subset (e.g., 15%) of tokens in the input sequence is replaced with a special [MASK] token. The model is then trained to predict the original identity of these masked tokens based on the surrounding context—both left and right. This bidirectional understanding allows the model to develop a richer, contextual representation of each word. This is the core pre-training objective for models like BERT (Bidirectional Encoder Representations from Transformers). While great for understanding tasks, BERT is not inherently designed for fluent, left-to-right text generation.
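
The corruption step can be sketched as follows. This is a simplification of BERT's actual recipe, which also sometimes swaps in a random token or leaves the chosen token unchanged; here every selected position just becomes [MASK]:

```python
import random

random.seed(42)

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Replace a random subset of tokens with [MASK]; return the corrupted
    sequence plus the positions (and original words) the model must predict."""
    masked = list(tokens)
    targets = {}
    for i in range(len(tokens)):
        if random.random() < mask_rate:
            targets[i] = tokens[i]   # the label the model must reconstruct
            masked[i] = mask_token
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens)
print(masked)   # input the model sees, with some tokens hidden
print(targets)  # positions and original words to be predicted
```

Because the model sees tokens on both sides of each [MASK], its prediction can draw on bidirectional context, unlike the left-to-right autoregressive setup.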

From Modeling to Capabilities: How LLMs Leverage Their Training

You might wonder: if a model is just trained to predict the next word, how can it follow instructions, write code, or answer questions? The answer lies in the transformative power of scale and a technique called transfer learning.

First, by training a massive neural network (a Transformer with billions of parameters) on a vast corpus of internet text using an autoregressive or masked objective, the model internalizes an incredible amount of world knowledge, reasoning patterns, and linguistic structure. It learns a sophisticated, high-dimensional representation of language.

Second, this pre-trained model is not the final product. Through a process called fine-tuning, the model is further trained (or "adapted") on a smaller, curated dataset for a specific downstream task. For example, to create an instruction-following assistant like ChatGPT, the base language model (GPT) is fine-tuned on thousands of examples of human-written prompts and desired responses. This teaches the model to align its "next token prediction" engine with the format and intent of the user's query. This paradigm means a single foundational language modeling skill can be specialized for translation (by fine-tuning on parallel text), summarization, classification, and much more.

Common Pitfalls

  1. Confusing Model Type with Capability: Assuming an autoregressive model (like GPT) cannot understand context because it processes text left-to-right is a mistake. While its training is sequential, the self-attention mechanism in Transformers allows any position in a sequence to attend to all previous positions, building a rich contextual representation. Conversely, assuming a masked model (like BERT) is good at text generation is also incorrect; its training objective does not optimize for fluent, extended sequence generation.
  2. Misinterpreting Perplexity: Perplexity is only meaningful for comparing models trained on the same vocabulary and tested on the same data distribution. A lower perplexity on Wikipedia text does not necessarily mean a model will perform better as a chatbot. It is an intrinsic measure of modeling efficiency, not a direct measure of usefulness for all downstream tasks.
  3. Overlooking the Data Foundation: A language model is a compressed representation of its training data. Pitfalls like generating biased, toxic, or factually incorrect text often stem from patterns and imperfections in the pre-training corpus, not just the model architecture. Understanding a model's output requires considering what it learned from, not just how it learned.
  4. Treating Probability as Certainty: A language model outputs a probability distribution. The word it selects (via "sampling" or "argmax") is the most statistically likely next token according to its training, not a verified truth. This is why models can "hallucinate" plausible-sounding but incorrect information; they are optimizing for linguistic plausibility, not factual accuracy.
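
The sampling-versus-argmax distinction in the last pitfall is easy to demonstrate. With a hypothetical next-token distribution (the words and probabilities below are made up), greedy decoding always returns the single most probable token, while sampling occasionally surfaces low-probability ones:

```python
import random

random.seed(1)

# A hypothetical model output: a distribution over candidate next tokens.
dist = {"paris": 0.55, "london": 0.25, "rome": 0.15, "banana": 0.05}

def argmax_decode(dist):
    """Greedy decoding: always pick the single most probable token."""
    return max(dist, key=dist.get)

def sample_decode(dist, temperature=1.0):
    """Sampling: draw a token in proportion to its (temperature-scaled)
    probability. Even unlikely tokens can be chosen."""
    words = list(dist)
    weights = [dist[w] ** (1.0 / temperature) for w in words]
    return random.choices(words, weights=weights, k=1)[0]

print(argmax_decode(dist))                      # always "paris"
print([sample_decode(dist) for _ in range(5)])  # varies from run to run
```

Either way, the output is the statistically plausible choice under the model's distribution, not a verified fact, which is exactly why plausible-sounding errors occur.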

Summary

  • The fundamental task of a language model is to assign probabilities to sequences of tokens, evolving from simple statistical n-gram models to powerful neural language models that use dense embeddings.
  • Perplexity is the key intrinsic evaluation metric, quantifying how well a model predicts a held-out sample; lower perplexity indicates a better fit to the data.
  • The two dominant training paradigms are autoregressive language modeling (predicting the next token, ideal for generation) and masked language modeling (predicting masked tokens within context, ideal for understanding).
  • Modern large language models (LLMs) leverage scale and transfer learning via fine-tuning to adapt their core language modeling capability to a wide array of downstream tasks like translation, summarization, and dialogue.
  • Critically evaluating a model's output requires understanding its training objective, its data sources, and the probabilistic nature of its generations.
