DS: Wavelet Trees
AI-Generated Content
DS: Wavelet Trees
Imagine you have a long string of text—like the entire genome of an organism or every word in a library—and you need to ask complex questions about its contents. Traditional arrays and trees quickly become inefficient when you need to find how many times a specific letter appears up to a certain point or locate the tenth occurrence of a character across millions of symbols. Wavelet trees are a powerful data structure that elegantly solves these problems by cleverly decomposing a sequence. They transform complex queries over large alphabets into simple operations on bits, enabling you to answer rank, select, and range queries in time proportional to the logarithm of the alphabet size. This makes them indispensable in fields like bioinformatics, text compression, and information retrieval.
From Intuition to Construction: Recursive Alphabet Splitting
A wavelet tree is built on a simple but powerful idea: recursive alphabet decomposition. You start with your original sequence, , composed of characters from an alphabet of size sigma. The core operation is to split this alphabet into two roughly equal halves. For example, if your alphabet is the lowercase letters a-z, you might split it into a-m and n-z.
For the root node of the tree, you create a bitvector that represents the original sequence . For each character in , you assign a 0 if it belongs to the first half of the alphabet and a 1 if it belongs to the second half. This bitvector is stored, along with two crucial pieces of auxiliary data that support fast operations: a rank structure (which can tell you how many 0s or 1s appear up to any position) and a select structure (which can find the position of the i-th 0 or 1).
The recursion then begins. You create two child nodes: a left child for the 0s and a right child for the 1s. For the left child, you take the subsequence of containing only characters from the first alphabet half (the 0s), and you repeat the process, splitting that sub-alphabet further. The right child does the same for the second half. You continue recursively until you reach nodes where the alphabet contains only one character—these are the leaves of the tree. The final structure is a balanced binary tree of height , where each internal node stores a bitvector representing the "path" of its subsequence.
Construction Example: Let's build a wavelet tree for the sequence "banana\Sigma = {a, b, n}$.
- Root: Alphabet split: (0) vs (1). Bitvector: → . Left subsequence: "baa" (from positions 0,1,3,5). Right subsequence: "nn" (from positions 2,4).
- Left Child (handling {a,b}): Split: (0) vs (1). Bitvector for "baa": → . Its left child is a leaf for 'a', right child a leaf for 'b'.
- Right Child (handling {n}): This is already a leaf node for 'n'.
Answering Core Queries: Rank, Select, and Access
The true power of the wavelet tree is revealed when answering queries. All operations leverage the tree traversal and the precomputed rank/select support on the bitvectors at each node.
- Access(, i): Retrieve the character at position in the original sequence. You start at the root. Look at the bit at . If , you go to the left child. The new position in the child's subsequence is not
i, butrank_0(B_root, i)—the number of0s in the root bitvector up to positioni. If , you go right usingrank_1(B_root, i). You repeat this process until you reach a leaf, whose associated character is your answer. This takes steps.
- Rank(, i): Count how many times character appears in (up to position i). You traverse the tree down the path to the leaf for character . At each node, you check which half of its alphabet belongs to. This tells you which child to go to, and you update the position
iusing the rank operation on the current node's bitvector (e.g.,i = rank_0(B_node, i)if going left). When you finally arrive at the leaf, the finalivalue is the answer—the rank of up to the original position. Time complexity is again .
- Select(, j): Find the position in of the -th occurrence of character (where is 1-based). This is the inverse of rank. You start at the leaf node for . You must now traverse up the tree to the root. This requires each node to know how many characters went to its left child (a simple count of
0s in its bitvector). The process is recursive: at a leaf, the "position" isj. When moving to its parent, you need to find which position in the parent's bitvector thisj-th occurrence came from. You do this using a select operation on the parent's bitvector. You continue upward until you reach the root, where the calculated position is your final answer. This also runs in time.
Practical Considerations and Space Efficiency
A naive implementation storing the bitvectors and tree pointers can use excessive memory. In practice, wavelet trees are made space-efficient. The bitvectors are typically compressed using techniques like Raman, Raman, and Rao (RRR) encoding or SDArray, which provide compressed storage while still supporting constant-time rank and select operations. This allows the overall structure to occupy space close to the zero-order entropy of the sequence bits, making it comparable to compressed text indexes.
Another critical consideration is alphabet mapping and ordering. The standard construction assumes a known, ordered alphabet (like ASCII or a sorted list of symbols). For non-contiguous or very large alphabets, you must first map the symbols to a contiguous range [0, sigma-1]. The choice of splitting strategy (perfectly balanced vs. based on frequency) can also affect performance; a balanced split guarantees the height, while a Huffman-shaped tree can improve average-case time for skewed distributions.
Application to Advanced Problems: Document Retrieval and Substring Search
Wavelet trees are not just for single-sequence queries. They form the backbone for solving more complex problems.
- Document Retrieval: Imagine you have a collection of documents concatenated into one string , with a separate bitvector marking document boundaries. A common query is: "Which distinct documents contain pattern ?" By building a suffix array and Burrows-Wheeler Transform (BWT) of , and then overlaying a wavelet tree on the BWT, you can answer these queries efficiently. The wavelet tree allows you to track intervals in the suffix array during pattern search while simultaneously using rank queries to count occurrences across document boundaries, enabling efficient reporting of distinct document IDs.
- Substring Search & Range Queries: A wavelet tree built over a suffix array (as an array of integers) can answer range quantile queries (find the k-th smallest element in a subarray) or range counting (how many symbols in are lexicographically between
aandb). For substring search, the combination of a suffix array and a wavelet tree on the LCP array (Longest Common Prefix) can be used to simulate the functionality of more complex structures like suffix trees, enabling rapid enumeration of repeated substrings or finding maximal repeats in genomic data.
Common Pitfalls
- Ignoring Alphabet Preprocessing: Attempting to build a wavelet tree directly on raw character codes (e.g., Unicode code points) without mapping them to a dense range
[0, sigma-1]will create a degenerate, overly deep tree, destroying the guarantee. Always map your symbols first. - Confusing Rank and Select Semantics: A frequent error is mixing up the arguments for
rankandselect. Remember:rank_c(i)returns a count (how manycs up to indexi).select_c(j)returns an index (the position of thej-thc). Using an index where a count is needed, or vice versa, will lead to incorrect traversal and wrong answers. - Forgetting 0-based vs. 1-based Indexing: Implementations of rank/select on bitvectors and the wavelet tree queries must be consistent. Most theoretical descriptions use 1-based indexing for
select(the 1st occurrence, 2nd occurrence, etc.). In code, you often work with 0-based arrays. Failing to convert between these correctly—for example, passing a 0-basedjto aselectfunction expecting 1-based—is a common source of off-by-one errors. - Overlooking Construction Time: While queries are fast, building the tree naively by repeatedly scanning and partitioning subsequences can take time and temporary memory. For large sequences, you need an efficient construction algorithm that uses global scanning and in-place partitioning techniques to keep memory overhead low.
Summary
- A wavelet tree is a recursive data structure that represents a sequence by splitting its alphabet and storing a bitvector at each internal node, enabling efficient queries in time.
- Its core operations—access, rank, and select—are performed by traversing the tree up or down while using rank/select on node bitvectors to transform indices between levels.
- For practical use, bitvectors are compressed (e.g., with RRR encoding), allowing the structure to occupy space proportional to the compressed size of the original sequence.
- It is crucial to preprocess the alphabet into a dense, ordered range before construction to maintain logarithmic query performance.
- Beyond basic queries, wavelet trees are fundamental components in advanced text indexing tasks, such as document retrieval (finding which documents contain a pattern) and complex substring search operations when combined with suffix arrays and the BWT.