Algo: Coordinate Compression
Algo: Coordinate Compression
Coordinate compression is a crucial technique for solving computational problems where data values are theoretically large or widely spaced, but the number of distinct values is manageably small. It transforms unwieldy, sparse coordinate systems into dense, consecutive integers, enabling the use of efficient data structures like arrays, segment trees, or binary indexed trees that would otherwise be impossible due to memory constraints. Mastering this technique allows you to tackle a wide range of problems in computational geometry, range query processing, and discrete event simulation.
What Is Coordinate Compression?
At its core, coordinate compression is a mapping process. You take a set of numbers—often representing coordinates, times, or IDs—and replace each original value with its rank (or index) among all the sorted, unique values. This rank becomes the new, compressed coordinate.
Consider you have points at x-coordinates: {7, 1, 1000000, 7, 1}. The distinct values are [1, 7, 1000000]. After compression:
- 1 → 0 (or 1, if using 1-based indexing)
- 7 → 1 (or 2)
- 1000000 → 2 (or 3)
The problem's spatial or logical structure is preserved (order is maintained), but you've reduced the coordinate universe from a range of 0 to 1,000,000 to just 0-2. This is a classic space-time tradeoff: you spend time preprocessing to compress coordinates, which then allows you to use -sized data structures instead of -sized ones.
Implementing Compression: Sorting and Binary Search
The standard implementation involves two clear phases: compression and lookup.
1. The Compression Phase (Preprocessing):
- Collect all values that need to be compressed into a list.
- Create a sorted list of unique values. This is typically done by adding all values to a vector, sorting it, and then using
std::unique(in C++) or similar methods to remove duplicates. - This sorted, unique list is your compression map.
2. The Lookup Phase (Runtime):
- To find the compressed index for any original value, you perform a binary search on the sorted unique list.
- The index returned by the binary search (e.g., using
std::lower_bound) is the compressed coordinate.
Here’s a conceptual snippet:
Original coordinates: [120, 5, 999, 5, 120]
1. Collect & Sort Uniques: [5, 120, 999]
2. Compress:
- 5 -> index 0
- 120 -> index 1
- 999 -> index 2
3. To query: What is compressed index of 120? Binary search in [5,120,999] returns index 1.The reverse mapping (from compressed index back to the original value) is often stored in an array for quick reference when needed.
Key Application: Computing the Union Area of Rectangles
A quintessential problem solved by coordinate compression is finding the total area covered by a set of axis-aligned rectangles. A naive approach might try to mark every unit square on a huge grid, which is infeasible if coordinates are large.
Coordinate compression enables a sweep line algorithm:
- Compress: Gather all unique x-coordinates and y-coordinates from the rectangles' edges. Compress them independently.
- Sweep: Imagine a vertical line sweeping from left to right across the compressed x-grid.
- Event Processing: When the sweep line hits a rectangle's left edge, you activate that rectangle's y-interval (using the compressed y-indices). When it hits a right edge, you deactivate the interval.
- Area Calculation: The distance between the current and next compressed x-coordinate gives a width. The total active height across all y-intervals, calculated using the original y-values mapped by the compression, gives the height. You sum
width * active heightfor each vertical strip.
This works because compression turns the continuous coordinate plane into a discrete grid of segments between compressed coordinates. The area over each grid cell is uniform (either fully covered or not), allowing precise calculation.
Example: Two rectangles: [(1,1)-(4,3)] and [(2,2)-(5,4)].
- Unique Xs:
[1, 2, 4, 5]→ compressed indices 0,1,2,3. - Unique Ys:
[1, 2, 3, 4]→ compressed indices 0,1,2,3. - Sweeping through x-indices 0 to 3, you add and remove y-intervals. The active height between x=2 (compressed index 1) and x=4 (index 2) is from y=1 to y=4 (original values), which is a height of 3. The width is 2 (4-2). This strip's contribution is 6. Summing all strips yields the correct union area.
Combining with Segment Trees for Efficiency
In the rectangle area problem, a critical sub-problem is efficiently maintaining the "total active length" of a set of y-intervals as rectangles are added and removed. A segment tree built on the compressed y-coordinates is the perfect tool for this.
The segment tree's nodes represent ranges between compressed y-indices. Instead of storing a simple sum, each node stores:
-
count: How many rectangles currently cover this entire y-segment. -
length: The total original length (e.g., in y-units) this node contributes ifcount > 0.
The rules for updating and querying become:
- Update (Add/Remove an interval): You perform a range update on the segment tree over the compressed y-range of the rectangle's edge. You modify the
countfor the affected nodes. - Query Active Length: The root node's
lengthattribute always holds the total original y-length that is currently covered by at least one rectangle. This value is directly used in the area calculation:area_strip = (x_next - x_current) * root.length.
This combination—compressing to define the segment tree's scale and using a specialized segment tree for range updates—is powerful. It generalizes to many "geometric plane sweep" problems and range modification/query problems with large coordinate bounds.
Recognizing When to Apply Coordinate Compression
You should consider coordinate compression when you see these problem characteristics:
- Large/Sparse Value Range: The input numbers are very large (e.g., up to ), but the number of unique values is proportional to the input size N (e.g., up to ).
- Need for Dense Data Structures: The algorithm you want to use requires array-like indexing (e.g., segment tree, Fenwick tree, union-find on coordinates, or a 2D grid) that would be impossible with the original value range.
- Order Preservation is Sufficient: Your logic relies only on the relative order (
<,=,>) of values, not their absolute magnitudes for arithmetic (except possibly for final output calculations using a reverse map). - Discrete Event Modeling: Problems involving intervals on a line (like scheduling, rectangle unions, or paint problems) where events happen only at specific, discrete points.
Common Pitfalls
- Off-by-One Errors in Interval Representation: Compression maps points, but you often work with intervals between points. Confusing the index of a coordinate with the segment to its right is a common source of error. When updating a segment tree for the range
[y1_compressed, y2_compressed], you are typically affecting the continuous segment from the original value ofy1up to the original value ofy2. Ensure your segment tree implementation correctly handles this. Using half-open intervals[start, end)consistently can help.
- Forgetting to Decompress for Final Calculations: Compression is a means to an end. When calculating final answers like area or distance, you must often convert back to the original scale. In the rectangle area example, the segment tree stores and returns lengths in the original y-coordinate values, not in compressed indices. Failing to multiply by the difference in original x-coordinates (using the reverse map) will give an incorrect, unit-less result.
- Incomplete Event Point Collection: Every coordinate that acts as a boundary in the problem must be included in the compression list. For rectangles, this is all x-left and x-right coordinates. If you miss some, your compressed grid will not have the resolution needed to detect all area changes. A good rule is to collect all numbers from the input that will be used as query points or interval endpoints in your post-compression algorithm.
Summary
- Coordinate compression replaces large or sparse numbers with their rank among unique values, preserving order to enable efficient data structure usage.
- Implementation is a two-step process: create a sorted list of unique values (the map), then use binary search to convert values to their compressed indices at runtime.
- A primary application is computing the union area of rectangles via a sweep line algorithm, where x and y coordinates are compressed to form a manageable event grid.
- For optimal performance, combine compression with a segment tree designed to track "active" lengths over compressed intervals, allowing for solutions to complex geometric problems.
- Recognize compression opportunities by looking for large value ranges with few unique values and algorithms that require dense indexing but only depend on value ordering.