Tokenization with BPE and SentencePiece
In natural language processing (NLP), raw text is meaningless to a neural network. Tokenization—the art of breaking text into meaningful chunks—is the critical first step that bridges human language and machine learning. While simple word-level tokenization falters with rare words and massive vocabularies, and character-level tokenization loses semantic coherence, subword tokenization strikes an elegant balance.
From Characters to Subwords: The Byte Pair Encoding (BPE) Algorithm
Byte Pair Encoding (BPE) is a data compression algorithm repurposed for text tokenization. Its core principle is iterative merging: start with a base vocabulary of individual characters and repeatedly merge the most frequent pair of adjacent symbols to create new, longer tokens. This process learns a vocabulary of subword units from a training corpus.
The algorithm follows a clear, step-by-step procedure. First, you preprocess the text (e.g., normalize, add word delimiters like _) and split it into a sequence of characters. This forms your initial vocabulary. You then calculate the frequency of every adjacent pair of symbols within words. The most frequent pair (e.g., "e" and "s" if "es" is common) is merged into a new single token (e.g., "es"). This merge operation is added to a list of merge rules. You repeat the frequency counting and merging process for a predefined number of iterations or until a target vocabulary size is reached. The final output is a vocabulary of subword units and a set of merge rules.
Consider a tiny corpus: "low lower lowest". The initial split with word delimiters might be ['l', 'o', 'w', '_', 'l', 'o', 'w', 'e', 'r', '_', 'l', 'o', 'w', 'e', 's', 't', '_']. The pairs ('l', 'o') and ('o', 'w') are the most frequent. If ('l', 'o') merges first, we get ['lo', 'w', '_', 'lo', 'w', 'e', 'r', '_', 'lo', 'w', 'e', 's', 't', '_']. Next, ('lo', 'w') will likely merge, creating the subword "low". This efficiently captures the common root.
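The training loop described above can be sketched in a few lines of Python. The corpus, the trailing `_` delimiter, and the iteration count are illustrative choices, not a canonical implementation:

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Corpus "low lower lowest": each word split into characters plus a delimiter.
words = {("l", "o", "w", "_"): 1,
         ("l", "o", "w", "e", "r", "_"): 1,
         ("l", "o", "w", "e", "s", "t", "_"): 1}

merge_rules = []
for _ in range(3):  # a small, fixed number of merge iterations
    counts = get_pair_counts(words)
    best = max(counts, key=counts.get)  # most frequent adjacent pair
    merge_rules.append(best)
    words = merge_pair(words, best)

print(merge_rules)  # → [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

The ordered `merge_rules` list is exactly the artifact a real BPE trainer saves alongside the vocabulary.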
To tokenize new text with a trained BPE model, you split it into characters and then apply the learned merge rules in the exact order they were created during training, greedily combining pairs until no more rules apply. This ensures consistent segmentation.
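Encoding with a trained model is the same merge loop replayed in training order. A minimal sketch, with a hypothetical (but training-ordered) rule list:

```python
def bpe_encode(word, merge_rules):
    """Split a word into characters, then apply merges in training order."""
    symbols = list(word) + ["_"]  # trailing word delimiter, as used in training
    for pair in merge_rules:  # order matters: replay merges as they were learned
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Hypothetical merge rules, in the order training produced them.
rules = [("l", "o"), ("lo", "w"), ("e", "s"), ("es", "t")]
print(bpe_encode("lowest", rules))  # → ['low', 'est', '_']
print(bpe_encode("lower", rules))   # → ['low', 'e', 'r', '_']
```

Note that "lower" still segments sensibly even though no rule covers "er": unmatched characters simply remain as single-character tokens.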
SentencePiece: Language-Agnostic and Model-Co-Designed Tokenization
While BPE is powerful, it has limitations: it typically requires pre-tokenization by whitespace or rules, which struggles with languages without clear word boundaries (e.g., Chinese, Japanese). SentencePiece sidesteps this by treating the input as a raw character stream, making it language-agnostic. It assumes no pre-tokenization, letting the model learn optimal segmentation directly from the text; whitespace itself is encoded as an ordinary symbol (▁, often rendered as _) and can be losslessly restored at decoding time.
SentencePiece implements two core tokenization algorithms. The first is BPE, as described, applied directly to the raw Unicode characters or byte sequences. The second, and a key innovation, is unigram language model tokenization. This method starts with a large seed vocabulary (e.g., all characters plus common substrings) and iteratively prunes it down to a target size using a unigram language model to assess the likelihood of a sentence under different segmentation options. It removes the least necessary tokens to maximize the overall likelihood of the training data. This probabilistic approach often yields more balanced and robust vocabularies compared to the purely frequency-based BPE.
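Once a unigram vocabulary with per-token probabilities exists, the best segmentation of a string is found by dynamic programming (Viterbi). The sketch below illustrates the scoring idea only; the vocabulary and its log-probabilities are invented for the example, not taken from a trained model:

```python
import math

# Toy unigram vocabulary with hypothetical log-probabilities.
log_probs = {
    "low": math.log(0.1), "est": math.log(0.05), "er": math.log(0.08),
    "l": math.log(0.02), "o": math.log(0.02), "w": math.log(0.02),
    "e": math.log(0.02), "s": math.log(0.02), "t": math.log(0.02),
}

def viterbi_segment(text, log_probs):
    """Return the segmentation maximizing the sum of token log-probabilities."""
    n = len(text)
    best = [(-math.inf, 0)] * (n + 1)  # (best score, backpointer) per position
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(end):
            token = text[start:end]
            if token in log_probs:
                score = best[start][0] + log_probs[token]
                if score > best[end][0]:
                    best[end] = (score, start)
    # Walk backpointers to recover the token sequence.
    tokens, pos = [], n
    while pos > 0:
        start = best[pos][1]
        tokens.append(text[start:pos])
        pos = start
    return tokens[::-1]

print(viterbi_segment("lowest", log_probs))  # → ['low', 'est']
```

Two in-vocabulary tokens score far better than six single characters here, which is why the unigram model prefers the coarser split. Training prunes the seed vocabulary by repeatedly asking how much the corpus likelihood under such segmentations would drop if a token were removed.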
A critical aspect of SentencePiece is its handling of unknown tokens. Since it operates directly on raw input, it can always fall back to decomposing a rare sequence into smaller known subwords, or, with byte fallback enabled, into raw bytes, effectively eliminating out-of-vocabulary tokens. This is crucial for robust real-world applications, where encountering unseen words is inevitable.
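The fallback idea can be illustrated with a simplified greedy encoder (this is not SentencePiece's actual algorithm, and the vocabulary is invented): any span not covered by the vocabulary degrades to single characters instead of an unknown token.

```python
def encode_with_fallback(text, vocab):
    """Greedy longest-match; uncovered spans fall back to single characters."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest vocabulary entry starting at position i.
        for end in range(len(text), i, -1):
            if text[i:end] in vocab:
                tokens.append(text[i:end])
                i = end
                break
        else:
            # Fallback: emit the raw character, so nothing maps to <unk>.
            tokens.append(text[i])
            i += 1
    return tokens

vocab = {"low", "est", "er"}
print(encode_with_fallback("lowestqx", vocab))  # → ['low', 'est', 'q', 'x']
```

As long as every base character (or byte) is in the vocabulary, any input string is encodable.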
Vocabulary Size Tradeoffs and Performance Implications
Choosing the vocabulary size is a fundamental hyperparameter with direct tradeoffs. A larger vocabulary (e.g., 50k-100k subwords) means most common words and phrases are stored as single tokens, leading to shorter, more efficient sequences for the model to process. However, it increases the embedding matrix size, raising memory usage and the risk of overfitting on rare tokens. A smaller vocabulary (e.g., 5k-10k subwords) is more compact and generalizes better at the subword level, but it produces longer, more granular sequences, which increases computational cost during training and inference.
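The parameter cost of the vocabulary is easy to quantify: the embedding matrix holds vocab_size × d_model weights, and an untied output projection doubles that. A quick back-of-the-envelope calculation (the model width of 1024 is an assumed example):

```python
def embedding_params(vocab_size, d_model, tied=True):
    """Parameters in the embedding matrix (doubled if the output head is untied)."""
    return vocab_size * d_model * (1 if tied else 2)

# Hypothetical model width of 1024, tied embeddings.
for v in (5_000, 50_000, 100_000):
    print(f"vocab {v:>7,}: {embedding_params(v, 1024):>12,} params")
```

Going from 5k to 100k tokens at this width adds roughly 97M parameters, which is why vocabulary size must be weighed against model size and compute budget rather than chosen "to be safe."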
The optimal size is a product of tokenizer-model co-design. The tokenizer's granularity must align with the model's capacity and task. For example, a translation model benefiting from morphological understanding might work better with a smaller vocabulary that captures prefixes and suffixes. A large language model aiming for broad coverage might use a larger vocabulary. The tokenizer is not a pre-processing step isolated from model training; it is trained on the same corpus as the downstream model to ensure its statistics are matched, making the segmentation an integral part of the model's design for optimal NLP performance.
Common Pitfalls
- Training and Applying Merge Rules Out of Order: BPE's greedy encoding during inference must apply merge rules in the exact same order they were learned during training. Applying them in a different order or using a frequency-sorted list will produce incorrect segmentations. Always store and load the full, ordered list of merge operations.
- Ignoring the Training Corpus Distribution: A tokenizer trained exclusively on formal Wikipedia text will segment informal social media text poorly, leading to excessive fragmentation of slang and misspellings. Always ensure your tokenizer's training data is representative of your application's text domain.
- Overlooking the Space/Whitespace Representation: How spaces are handled is critical. SentencePiece's default behavior of encoding space as the symbol ▁ and decoding it back is standard, but customizing this (e.g., for code or specific languages) requires careful configuration. Forgetting that spaces are tokens can lead to surprising sequence length calculations.
- Setting Vocabulary Size Without Consideration for Compute: Arbitrarily choosing a very large vocabulary to "be safe" directly increases the parameters of your model's embedding layer and output layer, slowing training and increasing memory costs. The choice should be informed by model size, dataset size, and computational budget.
Summary
- Byte Pair Encoding (BPE) builds a subword vocabulary by iteratively merging the most frequent pairs of adjacent characters or symbols, starting from a character-level base. It requires pre-tokenization and applies merge rules greedily.
- SentencePiece enables language-agnostic tokenization by processing raw text as a byte sequence, without pre-tokenization. It supports both BPE and a unigram language model algorithm, which uses a probabilistic framework to optimize the vocabulary.
- The vocabulary size presents a key tradeoff: larger vocabularies yield shorter sequences but more parameters, while smaller ones create longer sequences but better generalization at the subword level.
- Effective tokenization requires tokenizer-model co-design, where the tokenizer is trained on data representative of the target domain and its granularity is chosen to complement the model's architecture and task objectives.
- Robust tokenizers like SentencePiece mitigate the unknown token problem by allowing fallback segmentation to smaller units, ensuring any text can be processed.