Word Embeddings: Word2Vec and GloVe
Word embeddings are the foundational layer of modern natural language processing, transforming discrete words into meaningful, dense vectors. Understanding how to create and use these representations is essential for building intelligent systems that can comprehend language, from search engines to chatbots.
From Intuition to Vector: The Core Idea
Before diving into algorithms, let's solidify the intuition. The fundamental idea behind word embeddings is the distributional hypothesis, which posits that words appearing in similar contexts tend to have similar meanings. Word embeddings operationalize this by mapping each word in a vocabulary to a dense vector (e.g., 300 dimensions, far smaller than a vocabulary-sized one-hot vector) such that the geometric relationships between these vectors reflect semantic and syntactic relationships between the words.
For example, in a well-trained embedding space, the vectors for "king," "man," "woman," and "queen" would be arranged such that the vector operation king - man + woman results in a vector very close to queen. This ability to perform embedding arithmetic demonstrates that the model has captured abstract relational concepts like gender and royalty. The primary goal is to move from sparse, one-hot encodings, which treat every word as an isolated unit, to these dense, informative vectors.
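This arithmetic can be demonstrated with a minimal sketch. The vectors below are hand-crafted toy values in four dimensions (real embeddings have hundreds of dimensions learned from data), chosen so that "royalty" and "gender" occupy separate directions:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: the standard metric for comparing embeddings."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 4-dimensional vectors (illustrative placeholders, not trained values).
vectors = {
    "king":  np.array([0.9, 0.8, 0.1, 0.2]),
    "man":   np.array([0.1, 0.8, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9, 0.1]),
    "queen": np.array([0.9, 0.1, 0.9, 0.2]),
}

result = vectors["king"] - vectors["man"] + vectors["woman"]
# Find the vocabulary word whose vector is nearest to the result.
best = max(vectors, key=lambda w: cosine(vectors[w], result))
# → "queen"
```

In practice the query words themselves are usually excluded from the nearest-neighbor search, since the result vector often lands closest to one of the inputs.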
Word2Vec: Learning from Local Context Windows
The Word2Vec framework, introduced by researchers at Google, provides an efficient method for learning word embeddings using shallow neural networks. Its brilliance lies in reframing the unsupervised learning problem as a supervised prediction task. Word2Vec comes in two primary architectures: Continuous Bag-of-Words (CBOW) and Skip-gram.
Continuous Bag-of-Words (CBOW) predicts a target word given its surrounding context words. Imagine a sliding window of text: "The quick brown fox jumps." For the target word "fox," the context might be ["The", "quick", "brown", "jumps"]. The CBOW model averages the vectors of the context words and tries to predict the target word "fox." This architecture is faster and works well with more frequent words.
Skip-gram inverts this objective: it predicts the surrounding context words given a target word. Using the same example, given the target word "fox," the Skip-gram model tries to predict each word in the context window ["The", "quick", "brown", "jumps"] independently. While computationally more demanding, Skip-gram often produces better embeddings for rare words, as it uses each target word as multiple training examples.
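The two architectures consume the same sliding window in mirrored ways, which a short sketch makes concrete (the function name and pair layout here are illustrative, not from any particular library):

```python
def training_pairs(tokens, window=2):
    """Yield (context, target) pairs as CBOW sees them; Skip-gram
    reverses each pair into separate (target, context_word) examples."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        # Words within `window` positions on either side of the target.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        cbow.append((context, target))
        for c in context:
            skipgram.append((target, c))
    return cbow, skipgram

sentence = ["the", "quick", "brown", "fox", "jumps"]
cbow_pairs, sg_pairs = training_pairs(sentence, window=2)
```

Note that Skip-gram turns one window into several training examples, which is why it extracts more signal from rare words at a higher computational cost.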
Both models are typically trained with a technique called negative sampling (hierarchical softmax is an alternative). Instead of calculating a computationally expensive softmax over the entire vocabulary (which can have millions of words), the model learns to distinguish the true target word from a handful of randomly sampled "negative" words. This reduces each update from vocabulary-sized to a few dot products, making training feasible on large corpora.
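A minimal NumPy sketch of the negative-sampling objective for a single training pair (the vector values are random placeholders; a real trainer would update them by gradient descent):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(target_vec, context_vec, negative_vecs):
    """Binary classification objective: the true (target, context) pair
    should score high, while k randomly sampled pairs should score low."""
    pos = np.log(sigmoid(np.dot(target_vec, context_vec)))
    neg = sum(np.log(sigmoid(-np.dot(target_vec, n))) for n in negative_vecs)
    return -(pos + neg)

dim = 8
target = rng.normal(size=dim)      # placeholder embedding for the target word
context = rng.normal(size=dim)     # placeholder embedding for a context word
negatives = [rng.normal(size=dim) for _ in range(5)]  # k = 5 negatives
loss = negative_sampling_loss(target, context, negatives)
```

The cost of one update now scales with the number of negatives (typically 5-20) rather than the vocabulary size.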
GloVe: Capturing Global Co-Occurrence Statistics
While Word2Vec learns from local context windows, the Global Vectors (GloVe) model, developed at Stanford, takes a different approach by leveraging global word-word co-occurrence statistics from the entire corpus. The key insight is that the ratio of co-occurrence probabilities can encode meaningful semantic relationships.
GloVe's construction involves three main steps. First, it builds a co-occurrence matrix $X$, where each entry $X_{ij}$ represents how often word $j$ appears in the context of word $i$ within a defined window over the entire corpus. Second, it formulates a weighted least squares regression model: the core idea is to learn vectors such that the dot product of two word vectors equals the logarithm of their probability of co-occurrence. Third, it minimizes the resulting cost function:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

Here, $w_i$ and $\tilde{w}_j$ are the primary and context word vectors for words $i$ and $j$, $b_i$ and $\tilde{b}_j$ are bias terms, and $f(X_{ij})$ is a weighting function that prevents rare co-occurrences from dominating the objective. By factorizing this log co-occurrence matrix, GloVe directly captures global statistical information, often leading to strong performance on word analogy tasks.
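The GloVe objective can be evaluated directly. This sketch uses the weighting function from the original paper, $f(x) = \min((x/x_{\max})^{\alpha}, 1)$ with $x_{\max} = 100$ and $\alpha = 0.75$; the random matrices stand in for parameters that a real implementation would optimize:

```python
import numpy as np

def glove_cost(W, W_ctx, b, b_ctx, X, x_max=100, alpha=0.75):
    """Weighted least-squares GloVe objective over co-occurrence matrix X.
    W, W_ctx: primary/context word vectors; b, b_ctx: bias terms."""
    cost = 0.0
    for i, j in zip(*np.nonzero(X)):        # sum only observed co-occurrences
        weight = min((X[i, j] / x_max) ** alpha, 1.0)          # f(X_ij)
        diff = W[i] @ W_ctx[j] + b[i] + b_ctx[j] - np.log(X[i, j])
        cost += weight * diff ** 2
    return cost

rng = np.random.default_rng(0)
V, d = 5, 3                                 # tiny vocabulary, 3-d vectors
X = rng.integers(0, 10, size=(V, V)).astype(float)   # toy co-occurrence counts
W, W_ctx = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_ctx = rng.normal(size=V), rng.normal(size=V)
cost = glove_cost(W, W_ctx, b, b_ctx, X)
```

Because zero entries are skipped, the sum touches only observed pairs, which is what makes training on large, sparse co-occurrence matrices tractable.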
FastText: Enriching Embeddings with Subword Information
A significant limitation of both standard Word2Vec and GloVe is that they treat each word as an atomic unit. This means they cannot infer vectors for words not seen during training (out-of-vocabulary words) and ignore internal word structure, like morphemes. FastText, developed by Facebook AI Research, addresses this by representing each word as a bag of character n-grams.
For example, the word "where" with n=3 would be represented by the character trigrams: <wh, whe, her, ere, re>, and the whole word <where>. The angle brackets denote word boundaries. The vector for "where" is then the sum of the vectors for these constituent n-grams. This subword model has two major advantages: it can generate reasonable embeddings for rare or misspelled words by sharing n-gram representations, and it can even construct a vector for a completely unseen word if its character combinations are familiar.
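The n-gram decomposition described above is simple to reproduce (the function name is illustrative; FastText itself uses a range of n, typically 3 through 6):

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word with boundary markers, plus the
    whole marked word itself, following the FastText scheme."""
    marked = f"<{word}>"                     # angle brackets mark boundaries
    grams = [marked[i:i + n] for i in range(len(marked) - n + 1)]
    grams.append(marked)                     # the special whole-word token
    return grams

grams = char_ngrams("where", n=3)
# → ['<wh', 'whe', 'her', 'ere', 're>', '<where>']
```

The word's vector is the sum of the vectors for these units, so "her" inside "where" shares a representation with "her" inside "there" or the standalone word.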
Leveraging Pre-Trained Embeddings
In practice, you rarely need to train word embeddings from scratch on your own data for general tasks. A vast ecosystem of pre-trained embeddings exists, trained on enormous corpora like Wikipedia, Common Crawl, or domain-specific text. Using these is a form of transfer learning: you import semantic knowledge learned from billions of words into your model.
The process is straightforward. You download a file (e.g., glove.6B.300d.txt for 300-dimensional GloVe vectors trained on 6 billion tokens) which maps words to their vector values. You then load these into an embedding layer in your neural network, often initializing it with these weights. A critical decision is whether to freeze these embeddings (keep them static) or fine-tune them (allow them to be updated during training on your specific task). Freezing is faster and prevents overfitting on small datasets, while fine-tuning can help the embeddings adapt to a specialized domain.
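A sketch of that loading step, assuming the common GloVe text format of one word followed by its float components per line (the file name and `vocab` here are stand-ins; point `load_glove` at a real downloaded file in practice):

```python
import numpy as np

def load_glove(path, vocab, dim):
    """Build an embedding matrix for `vocab` from a GloVe-format text file.
    Words missing from the file keep a small random initialization."""
    rng = np.random.default_rng(0)
    matrix = rng.normal(scale=0.1, size=(len(vocab), dim))
    index = {w: i for i, w in enumerate(vocab)}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            if word in index:
                matrix[index[word]] = np.asarray(values, dtype=np.float32)
    return matrix

# Tiny demo file standing in for a real pre-trained vector file:
with open("demo_vectors.txt", "w", encoding="utf-8") as f:
    f.write("cat 0.1 0.2 0.3\n")
    f.write("dog 0.4 0.5 0.6\n")

emb = load_glove("demo_vectors.txt", vocab=["cat", "dog", "platypus"], dim=3)
```

The resulting matrix would then initialize your framework's embedding layer, with a flag (e.g., "trainable" or "frozen") implementing the freeze-versus-fine-tune decision.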
Common Pitfalls
- Ignoring Preprocessing and Hyperparameters: Simply dumping raw text into Word2Vec will yield poor results. Failing to properly clean text (lowercasing, handling punctuation), tune the context window size, or select an appropriate embedding dimension (e.g., 50, 100, 300) are common errors. A small window (e.g., 2-5) captures syntactic relationships, while a larger window (e.g., 10-20) captures more topical/thematic similarities.
- Misapplying Embedding Arithmetic: The famous "king - man + woman ≈ queen" analogy works for clear, linear relationships but is not a magic bullet. Performing arbitrary arithmetic (e.g., "Paris - France + Germany") assumes the relationship is consistently encoded and can produce nonsensical results if the embedding space hasn't robustly learned that specific relation.
- Using the Wrong Embedding for the Task: GloVe often excels at word analogy tasks due to its global objective, while Skip-gram with Negative Sampling (SGNS) can be better for capturing nuanced similarity on rare words. FastText is superior for morphologically rich languages or tasks with many typos and out-of-vocabulary terms. Choose based on your data and objective.
- Assuming Embeddings Encode True Meaning or World Knowledge: Word embeddings capture statistical patterns in text, which can reflect and amplify human biases present in the training data (e.g., gender or racial stereotypes). They are a powerful engineering tool but do not represent conceptual understanding. Always evaluate their output critically.
Summary
- Word embeddings are dense vector representations that capture semantic and syntactic word relationships based on their usage in context, enabling machines to work with language numerically.
- Word2Vec learns embeddings via local context prediction, using either the CBOW (context predicts target) or Skip-gram (target predicts context) architecture, optimized with negative sampling for efficiency.
- GloVe constructs embeddings by factorizing a global co-occurrence matrix, aiming to capture the ratios of co-occurrence probabilities, which often effectively models linear analogies.
- FastText extends the Word2Vec approach by using subword information (character n-grams), allowing it to handle out-of-vocabulary words and morphologically complex languages effectively.
- In practice, leveraging pre-trained embeddings is standard, with a choice between keeping them static or fine-tuning them on your specific dataset and task.