Feb 27

Bag of Words and TF-IDF

Mindli Team

AI-Generated Content


Before machine learning algorithms can understand text, documents must be converted into a numerical language they can process. This transformation from unstructured words to structured numbers is the critical first step in any text analysis pipeline. Bag-of-words (BoW) and its refined cousin, Term Frequency-Inverse Document Frequency (TF-IDF), are foundational, traditional methods for this exact purpose, forming the bedrock upon which many modern NLP applications were built. Mastering these techniques is essential not only for historical context but for understanding the core challenges of text representation—sparsity, importance, and context—that even advanced models must address.

The Foundation: Bag-of-Words Representation

The bag-of-words model is a simplifying representation where a text (a document) is represented as a "bag" (multiset) of its words, disregarding grammar, word order, and even sentence structure, but keeping track of word frequency. The core idea is that the frequency of a word in a document can be a useful feature for classification or analysis tasks, such as sentiment analysis or topic labeling.

The practical implementation is typically done via a tool like CountVectorizer (common in Python's scikit-learn library). This process involves three key steps:

  1. Tokenization: Splitting the raw text into individual words (tokens).
  2. Vocabulary Building: Creating a dictionary of all unique tokens found across the entire corpus (the collection of documents). This defines the vocabulary size.
  3. Vectorization: Encoding each document as a numerical vector. The length of the vector equals the vocabulary size. Each position in the vector corresponds to a specific word from the master vocabulary, and the value is the count (frequency) of that word in the current document.

Consider a tiny corpus:

  • Document 1: "The cat sat on the mat."
  • Document 2: "The dog played on the mat."

Ignoring case and punctuation, the vocabulary is: ['the', 'cat', 'sat', 'on', 'mat', 'dog', 'played'] (size = 7). The BoW vectors become:

  • Doc1: [2, 1, 1, 1, 1, 0, 0] // "the" appears twice, "cat", "sat", "on", "mat" once.
  • Doc2: [2, 0, 0, 1, 1, 1, 1] // "the" twice, "on", "mat", "dog", "played" once.

Vocabulary size management is a critical consideration here. A raw vocabulary built from thousands of documents can easily reach tens or hundreds of thousands of unique tokens, leading to extremely high-dimensional vectors. This is managed through parameters in CountVectorizer such as max_features (keep only top N most frequent words), min_df/max_df (ignore terms that appear in too few or too many documents), and stop word removal (filtering out common words like "the," "is," "and").

Beyond Single Words: Capturing Phrases with N-grams

A major limitation of the basic BoW model is its complete loss of word order and context. The phrases "not good" and "good not" would be identically represented. N-grams solve this by capturing contiguous sequences of N items (words or characters).

  • Unigrams: Single words (N=1). This is the standard BoW. ['the', 'cat', 'sat']
  • Bigrams: Pairs of consecutive words (N=2). ['the cat', 'cat sat', 'sat on']
  • Trigrams: Triplets of consecutive words (N=3). ['the cat sat', 'cat sat on']

Using n-grams (e.g., setting ngram_range=(1,2) in CountVectorizer) expands the feature space to include these multi-word expressions. This allows the model to capture negation ("not good"), common phrases ("New York"), or other meaningful patterns. However, it dramatically increases the vocabulary size, exacerbating the problem of high-dimensional, sparse representations (vectors filled mostly with zeros).

Weighting for Importance: Introducing TF-IDF

A raw word count is often a poor indicator of importance. Common words like "the" will have high counts but low discriminative power across documents. Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical weight designed to reflect how important a word is to a document in a collection.

It is the product of two components:

  1. Term Frequency (TF): Measures how frequently a term appears in a document. It is often normalized (e.g., by the document length) to prevent bias towards longer documents. A simple form is: TF(t, d) = count(t, d) / (total number of terms in d).
  2. Inverse Document Frequency (IDF): Measures how rare or common a term is across the entire corpus. It downweights terms that appear in many documents. The formula is: IDF(t) = log(N / (1 + df(t))), where N is the number of documents in the corpus and df(t) is the number of documents containing t. The "+1" is a smoothing factor to avoid division by zero.

Thus, the TF-IDF score is: TF-IDF(t, d) = TF(t, d) × IDF(t).
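These two components can be computed directly in a few lines of plain Python. This is a sketch of the simple variant above (whitespace tokenization assumed); note that with this formula a term appearing in every document gets a slightly negative IDF, which is why libraries such as scikit-learn use a further-smoothed variant that stays positive.

```python
from math import log

def tf(term, doc):
    # Term frequency: raw count normalized by document length.
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Inverse document frequency with "+1" smoothing in the denominator.
    df = sum(1 for doc in corpus if term in doc)
    return log(len(corpus) / (1 + df))

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

corpus = [
    "the cat sat on the mat".split(),
    "the dog played on the mat".split(),
    "the bird sang".split(),
]
doc1 = corpus[0]

# "cat" is frequent in doc1 but rare in the corpus -> positive weight;
# "the" appears in every document, so its weight is driven down (here, below zero).
print(tf_idf("cat", doc1, corpus), tf_idf("the", doc1, corpus))
```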

A high TF-IDF score occurs when a term appears frequently in a specific document (high TF) but rarely in the rest of the corpus (high IDF). For instance, in a corpus consisting only of sports articles, the word "penalty" may have just a moderate TF-IDF score, because it appears in many of the documents and its IDF is low. In a corpus mixing sports and finance articles, "penalty" would have a high TF-IDF for sports articles (discriminating them from finance) while "interest" would have a high TF-IDF for finance articles.

Practical Workflow and Sparse Representation

A standard workflow for text feature extraction using these methods is:

  1. Preprocess: Clean text (lowercase, remove punctuation, handle special characters).
  2. Vectorize: Instantiate CountVectorizer or TfidfVectorizer. Define parameters (max_df, min_df, stop_words, ngram_range, max_features).
  3. Fit and Transform: fit_transform() the vectorizer on the training corpus. This learns the vocabulary and IDF weights (if using TF-IDF) from the training data.
  4. Transform New Data: Use the transform() method on new, unseen documents using the learned vocabulary and IDF.

The resulting document-term matrix, whether from BoW or TF-IDF, is almost always a sparse matrix. With a vocabulary of 20,000 words, a single 100-word document will have a vector with at least 19,900 zeros. Storing this in a dense format is massively inefficient. Libraries like scikit-learn use sparse data structures (e.g., Compressed Sparse Row format) that only store non-zero entries and their positions, making computation feasible.
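The four-step workflow above can be sketched end to end (the tiny corpus is illustrative). The key points are that fit_transform() runs only on the training corpus, transform() reuses the learned vocabulary and IDF on unseen documents, and the output stays sparse throughout.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = [
    "the cat sat on the mat",
    "the dog played on the mat",
    "a bird sang in the garden",
]
new_docs = ["the cat played in the garden"]

# Steps 1-2: preprocessing (lowercasing) and parameters are handled by the vectorizer.
vec = TfidfVectorizer(stop_words="english", ngram_range=(1, 1))

# Step 3: learn vocabulary + IDF weights from the training corpus only.
X_train = vec.fit_transform(train_docs)

# Step 4: apply the learned vocabulary/IDF to unseen documents.
# Words not seen during fitting are silently ignored.
X_new = vec.transform(new_docs)

print(type(X_train))              # a SciPy sparse matrix, not a dense array
print(X_train.shape, X_new.shape)  # same number of columns for both
```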

Limitations and the Path Forward

While indispensable, these traditional methods have clear limitations:

  • Sparsity: The high-dimensional vectors are sparse and computationally challenging for very large vocabularies.
  • Semantic Ignorance: The models have no understanding of word meaning. "Smart," "intelligent," and "clever" are treated as completely unrelated features.
  • Context Loss: Even with n-grams, long-range dependencies and syntactic structure are lost.
  • Vocabulary-Based: They cannot handle out-of-vocabulary (OOV) words not seen during training.

These limitations are precisely what motivated the development of dense embedding models like Word2Vec, GloVe, and modern contextual embeddings (BERT, GPT). However, TF-IDF remains a remarkably strong, fast, and interpretable baseline for many text classification tasks, especially when training data is limited or computational resources are constrained.

Common Pitfalls

  1. Ignoring Stop Words and max_df: Failing to filter ubiquitous terms (common stop words or corpus-specific frequent terms) allows uninformative features to dominate the vector space. Correction: Always use a stop word list and tune the max_df parameter (e.g., max_df=0.85 to ignore terms in >85% of documents).
  2. Applying IDF Incorrectly to New Data: A common mistake is recalculating IDF on a new test set, which changes the meaning of the features. The IDF must be derived only from the training corpus and then applied to test documents. Correction: Always fit the TfidfVectorizer on the training set and use it to transform both train and test sets.
  3. Mismanaging Vocabulary Size: Letting the vocabulary grow unchecked leads to massive, inefficient models prone to overfitting. Correction: Use min_df (e.g., min_df=2 or 5) to prune rare terms and max_features to set a hard limit. Experiment with these as hyperparameters.
  4. Treating Vectors as Dense: Attempting to convert a large sparse matrix to a dense array can crash your kernel due to memory overflow. Correction: Keep data in sparse format and use algorithms (like those in scikit-learn) that are optimized for sparse matrix operations.

Summary

  • The Bag-of-Words (BoW) model converts text documents into numerical vectors based on word counts, using tools like CountVectorizer, but it completely discards word order and context.
  • N-grams (bigrams, trigrams) extend BoW to capture contiguous word sequences, helping to preserve some local context at the cost of a significantly increased vocabulary size.
  • TF-IDF weighting refines raw counts by multiplying a term's frequency in a document (TF) by a logarithmic measure of its rarity across the corpus (IDF), highlighting terms that are discriminative for a specific document.
  • The resulting document-term matrix is inherently a sparse representation, requiring specialized data structures for efficient storage and computation.
  • While foundational, these methods have key limitations, including semantic blindness and the curse of high-dimensional sparsity, which are addressed by more modern embedding-based approaches.
