Problem Restatement
Given a rooted tree where each vertex i has a value A[i], answer Q queries. Each query (s, t, x) asks:
On the simple path from s to t, what is the rank of value x?
The rank is defined as:
\[\text{rank}(x) = 1 + \left|\{y \mid \text{cnt}(y) > \text{cnt}(x)\}\right|\]where cnt(y) is the number of occurrences of y on the path. If x does not appear on the path, cnt(x) = 0, so the rank equals 1 + (number of distinct values on the path).
The challenge: queries may be online (T = 1), meaning each query’s parameters depend on the previous answer. This rules out algorithms that require knowing all queries in advance (e.g., offline Mo’s algorithm, offline divide-and-conquer).
High-Level Strategy
The query decomposes into two sub-problems:
-
Compute cnt(x) for a specific value x — count how many times value x appears on the path from s to t. This is straightforward.
-
Count how many distinct values appear more than k times on the path, where k = cnt(x). This is the core difficulty.
Our approach uses a checkpoint variant of Mo’s algorithm on trees that works online by precomputing intermediate states at regular grid points, then stepping from the nearest checkpoint for each query.
Phase 1: Tree Setup
Euler Tour (Entry-Exit DFS Order)
Run a DFS from the root. For each node v, record:
tin[v]— the time when we first enter v (entry time)tout[v]— the time when we finish processing v’s subtree (exit time)
This gives an Euler tour array tour[] of length 2N, where each node appears exactly twice (entry and exit).
Key property for path queries
For nodes s and t with tin[s] ≤ tin[t], the segment tour[tin[s]..tin[t]] has the structure:
- Nodes on the path s → t: appear exactly once in this segment
- Nodes not on the path: appear either 0 times or 2 times (both entry and exit fall in the segment)
To track which nodes are on the path, maintain a bitset<N> called on_path. Whenever a tour position is added to or removed from the current Mo’s window, toggle the bit for the corresponding node: on_path[v] ^= 1. After the toggle:
- If
on_path[v]becomes 1, the node now appears exactly once in the window → it is on the path. Add its value tofreqand updateff. - If
on_path[v]becomes 0, the node now appears 0 or 2 times → it is not on the path. Remove its value fromfreqand updateff.
Each Mo’s step (add/remove one tour position) is O(1): toggle the bit, conditionally update freq and ff, and erase any zero entries from ff.
LCA Precomputation
Build a binary lifting table up[v][j] = the 2^j-th ancestor of v. This gives:
- LCA queries in O(log N)
- Depth information for each node
Position Lists for Subproblem 1
For each distinct value v, store a sorted list positions[v] of all nodes (by entry time tin) that have value A[node] = v.
To count occurrences of value x on the path from s to t:
k = count of nodes with value x in the subtree-range query [tin[LCA(s,t)], tout[LCA(s,t)]]
using binary search on positions[x]
adjusted for the path endpoints s and t
More precisely: a node u is on the path from s to t iff LCA(s, t) is an ancestor of u AND u is an ancestor of s or u is an ancestor of t. Using entry/exit times, we binary search positions[x] for nodes whose entry time falls in the range [tin[lca], tout[lca]] and who are also on the s→t branch. A simpler formulation: count all nodes with value x whose entry time is in [tin[lca], tout[lca]] that are ancestors of s or ancestors of t.
Simpler approach: For each value x, store nodes with that value in DFS order. A node u is on the path s→t iff tin[lca(s,t)] ≤ tin[u] ≤ tout[lca(s,t)] AND depth[u] ≥ depth[lca(s,t)] AND (u is on the s-branch or t-branch). The cleanest way: count nodes with value x in [tin[s], tin[t]] (assuming tin[s] ≤ tin[t]) in the Euler tour, treating the LCA’s double appearance correctly. Use binary search on positions[x] restricted to the range — O(log N) per query.
Phase 2: Checkpoint-Based Online Mo’s Algorithm
Why Standard Mo’s Doesn’t Work Online
Standard Mo’s algorithm sorts queries by (block_of_l, r) and processes them in that order, moving the window incrementally. This requires knowing all queries ahead of time.
The Checkpoint Idea
Instead of sorting, we precompute states at regular grid points and answer each query by stepping from the nearest checkpoint.
Grid of checkpoints: For block size B, define checkpoints at all (l, r) where:
lis a multiple of Bris a multiple of Bl ≤ r
Number of checkpoints: O((N/B)² / 2)
What We Store at Each Checkpoint
For each checkpoint (l, r), we store:
ff_cache[(l, r)]— a hashmap of the frequency-of-frequencies at this checkpointff_cache[(l, r)][j]= number of distinct values that appear exactly j times on the path corresponding to the Euler tour segment [l, r]- Keys with value 0 are removed, so the hashmap contains only nonzero entries
- At most O(√N) nonzero keys (if d values have frequencies f₁ ≥ f₂ ≥ … ≥ f_d ≥ 1, then d² ≤ ∑fᵢ ≤ N, so d ≤ √N)
freq_cache[(l, r)] — the value frequencies of the boundary nodes onlyfreq_cache[(l, r)][v]= current (full) frequency of value v in the window at checkpoint (l, r), but stored only for values appearing within B positions of each endpoint- With nearest-checkpoint rounding, stepping moves each endpoint at most ~B/2 in either direction, so the only
freqentries ever read while stepping are those for nodes in the symmetric boundary regions[l-B, l+B]and[r-B, r+B]. The stored frequency is the full path frequency (it may count nodes outside the boundary), which is exactly whatstep_add/step_removeneed. - This is the key to bounding memory: a full
freqcopy is O(window) ≈ O(N) per checkpoint, giving O((N/B)²·N) total; the boundary-only copy is O(B) per checkpoint, giving O((N/B)²·B). On the worst case (path tree) this is the difference between ~847 MB and well under 200 MB.
Storage format. Both caches are stored compactly as
vector<pair<int,int>>(flat key/value lists), not as live hash tables — storing ~10⁵gp_hash_tableobjects directly wastes memory on per-table bucket overhead. On query, the chosen checkpoint’s vectors are rehydrated intogp_hash_tables (cur_ff,cur_freq) for O(1) stepping.
Total memory per checkpoint: O(√N + B)
Total memory across all checkpoints: O((N/B)² · (√N + B))
Precomputation
No need to sort checkpoints. Use a simple nested loop: for each l (multiple of B), start from an empty window and iterate r from l upward (in steps of B to 2N), stepping one position at a time and maintaining:
freq[v]— hashmap from value → current frequency in the windowff— hashmap from frequency → count of distinct values with that frequency; entries whereff[f] == 0are removed immediately
Each step (add tour position tour[r] to the window as r increases):
- Let
v = tour[r]. Toggleon_path[v] ^= 1. - If
on_path[v] == 1(node now on the path — appears once):- Decrement
ff[freq[A[v]]]; erase key if value becomes 0 freq[A[v]]++; incrementff[freq[A[v]]]
- Decrement
- If
on_path[v] == 0(node now off the path — appeared twice, so a full subtree round-trip):- Decrement
ff[freq[A[v]]]; erase key if value becomes 0 freq[A[v]]--; incrementff[freq[A[v]]]
- Decrement
This is O(1) amortized per step (hashmap erase is O(1) average). For each fixed l, the r-sweep costs O(2N) steps. With N/B choices for l, total time: O(N/B · 2N) = O(N²/B).
At each checkpoint (l, r), store ff_cache[(l, r)] = ff and freq_cache[(l, r)] = freq (both copies of the current hashmaps).
Answering Each Query
Given query (s, t, x):
Step 1: Compute k = cnt(x) on the path (Subproblem 1)
Use the position lists and binary search. O(log N) time.
Step 2: Find the nearest checkpoint
Map the Euler tour segment [L, R] to the grid. Round each endpoint to the nearest multiple of B (up or down, whichever is closer):
l₀ = ((L + B/2) / B) · B, then clamp to the valid grid[0, max_r]r₀ = ((R + B/2) / B) · B, then clamp to[l₀, max_r]
where max_r = (tour_size-1) - (tour_size-1) % B is the largest multiple of B inside the tour (and is itself a multiple of B, so the clamp lands on a real checkpoint). Each endpoint moves at most B/2 (a little more only near the tour boundary where clamping applies, still < B), so the stepping distance per side is ≤ B/2.
Finding the checkpoint takes O(1) — just integer division. This doesn’t depend on knowing the query in advance; we compute it when the query arrives.
Step 3: Load the checkpoint state
Set ff = ff_cache[(l₀, r₀)] and freq = freq_cache[(l₀, r₀)]. O(√N + B) to copy.
Step 4: Move the range to match the query [L, R]
Starting from [l₀, r₀], add or remove tour positions one at a time until the window is [L, R]. Because rounding lets either endpoint move in either direction, the stepping order must expand both sides first, then shrink both sides (expand right → expand left → shrink left → shrink right). Expanding first means the window only ever grows, then contracts, so it is never inverted (cl ≤ L ≤ R ≤ cr is maintained throughout the shrink phase). Each step decides add-vs-remove not via an on_path bit but by appearances_in(v, old_cl, old_cr) — how many of {tin[v], tout[v]+1} lay in the window before the step (1 ⇒ remove, else add). At most B/2 steps per side.
Each step: O(1) amortized.
Step 5: Query the ff hashmap
After reaching the exact query window, the answer is:
\[\text{rank} = 1 + \sum_{j=k+1}^{2N} \text{ff}[j]\]Since ff is sparse with at most O(√N) nonzero entries, iterate over the entries and sum those with index > k. O(√N) time.
Complexity Analysis
Per-query time: O(B + √N)
- Finding checkpoint: O(1)
- Loading checkpoint: O(√N + B) (but we can avoid full copy by using a reference + delta approach)
- Stepping: O(B)
- Querying ff: O(√N)
Precomputation time: O(N²/B)
Total time: O(N²/B + Q · (B + √N))
Total memory: O(N²/B · (√N + B))
Choosing B
Optimize B to balance precomputation and per-query cost:
N²/B = Q · B → B = N / √Q
With N = Q = 10⁵: B ≈ √N ≈ 316.
B ≈ √N ≈ 316:
- Checkpoints: (N/B)² ≈ (316)² ≈ 10⁵
- Memory per checkpoint: O(√N + 2B) ≈ 316 + 632 ≈ 948 integers
- Total memory: 10⁵ × 948 × 4 ≈ 380 MB — fine under 1G
- Precomputation: N²/B ≈ 3 × 10^7
- Per query: B/2 + √N ≈ 484
- Total: ~10^8 operations — feasible in 5s
B ≈ n^(2/3) ≈ 2154 (memory-conservative alternative):
- Checkpoints: (N/B)² ≈ 47² ≈ 2209
- Memory per checkpoint: √N + 2B ≈ 316 + 4308 ≈ 4624
- Total memory: 2209 × 4624 × 4 ≈ 41 MB
- Precomputation: N²/B ≈ 4.6 × 10^6
- Per query: B/2 + √N ≈ 1077 + 316 ≈ 1393
- Total query time: 10⁵ × 1393 ≈ 1.4 × 10^8 — a bit slower but still feasible
Recommended: B ≈ √N ≈ 316, giving O((N + Q)√N) total time and ~380 MB memory.
Empirical tuning (the analysis above understates the worst case)
The theory assumes O(√N) distinct values per window, but the path (line) tree is far worse: its Euler-tour windows have large on-path sets, inflating each checkpoint’s cached ff/freq vectors. Measured on N = Q = 10⁵, T = 1 (online), values in [1,1000]:
| B | Random tree | Path tree (worst case) |
|---|---|---|
| 400 | 2.70 s / 60 MB | 5.82 s / 847 MB |
| 700 | — | 3.30 s / 331 MB |
| 1000 | 2.58 s / 26 MB | 2.76 s / 181 MB |
| 1500 | 3.09 s / 23 MB | 2.79 s / 97 MB |
At B = 316–400 the path case blows past a 5 s / 1 GB budget. Because total memory scales as ~(2N/B)²·(per-checkpoint size), raising B is very effective — it cuts the checkpoint count quadratically while only linearly increasing per-query stepping. (These figures predate the array-freq / hybrid-ff rewrite below; the B sweep table re-measures on the current code.)
Measured impact of the array freq + hybrid ff change
The numbers above predate the array-freq / hybrid-ff rewrite. Re-measured on N = Q = 10⁵, T = 1, values in [1,1000] (g++ -O2), the two changes are independent — the data structures buy time, B buys memory:
freq/ff storage |
B | Random tree | Path tree (worst case) |
|---|---|---|---|
| hash / hash (original) | 300 | 4.36 s / 90 MB | 12.59 s / 1291 MB |
| hash / hash | 1000 | 3.60 s / 27 MB | 4.38 s / 182 MB |
| array / hybrid | 300 | 2.55 s / 90 MB | 9.61 s / 1314 MB |
| array / hybrid | 1000 | 1.60 s / 28 MB | 1.89 s / 184 MB |
At fixed B the data-structure change is ~1.3–1.7× faster (array indexing replaces hashing on the stepping path) with essentially unchanged memory (checkpoint caches are flat vectors either way). Raising B from 300→1000 cuts memory ~7× (checkpoint count ∝ (2N/B)²). Both are needed: array/hybrid at B = 300 still busts memory on the path case, and hash/hash at B = 1000 is ~2.3× slower than the combined result. End-to-end (original → current) the worst case improves from 12.6 s / 1.3 GB (over budget) to under 2 s.
B sweep on the current code (choosing the default)
Re-measured on the array/hybrid build (N = Q = 10⁵, T = 1, values in [1,1000]). As B grows the random case slows slightly (more per-query stepping) while the path case gets both faster and much lighter (fewer checkpoints → less cached state; path windows dominate memory):
| B | Random tree | Path tree (worst case) |
|---|---|---|
| 1000 | 1.62 s / 27.6 MB | 1.93 s / 184 MB |
| 1200 | 1.75 s / 25.6 MB | 1.76 s / 138 MB |
| 1500 | 1.76 s / 24.0 MB | 1.58 s / 99 MB |
| 2000 | 2.01 s / 22.5 MB | 1.54 s / 67 MB |
The implementation uses B = 1500 — the best all-around balance: ≤ 1.76 s on both shapes and 99 MB worst case, nearly halving worst-case memory vs B = 1000 (184→99 MB) for a ~0.14 s random-case cost. (B = 2000 trades ~0.25 s of random-case time for the lightest footprint, 67 MB; B = 1000 is marginally faster on random but carries ~2× the worst-case memory.)
Phase 3: Implementation Details
Data Structures
freq: int[D] array, compressed value index → frequency in current window
(values coordinate-compressed to 0..D-1; reused across queries, see below)
ff: FFTable, frequency → count of distinct values with that frequency
hybrid array+hash: keys < B2 in a flat int[B2], keys ≥ B2 in a gp_hash_table;
0 means absent in both halves; at most O(√N) nonzero keys
positions: in_times / out_times, value → sorted list of tin / tout
up[v][j]: binary lifting table for LCA
tour: Euler tour array of length 2N
ff_cache: map<(l,r), vector<pair<int,int>>> — full ff at checkpoint, flat list
freq_cache: map<(l,r), vector<pair<int,int>>> — boundary freq at checkpoint, flat list
(rehydrated into freq[] / a local FFTable on query)
No
on_patharray. An earlier design tracked abitset<N> on_pathto decide add-vs-remove on each toggle. That bitset is O(N) bits per checkpoint if cached, which is prohibitive. It was eliminated:toggle_tour_posinstead recomputes membership fromappearances_in(v, old_cl, old_cr)(how many of{tin[v], tout[v]+1}lie in the pre-step window — 1 ⇒ remove, else add). Thefreq/ffupdates below are driven by that count rather than by a flipped bit.
freq and ff: arrays instead of hash tables
Both freq and ff started as gp_hash_table<int,int>. On the hot stepping path (millions of add/remove steps across precompute and queries), hashing dominates. Both were replaced by direct-indexed storage, keeping the same O(1)-per-step interface:
freq — a single reused array. Values are coordinate-compressed to 0..D-1 (Acomp[v] = compressed index of A[v]), so freq is a plain int[D] indexed by compressed value. There is one global freq array, reused by both precompute and every query, kept all-zero between uses:
- Precompute resets
freq[0..D)to 0 at the start of eachl-sweep, and once more after the whole sweep so the query phase starts from a clean array. - Each query loads only the checkpoint’s boundary entries into
freq, runs the query, then reverts exactly the touched entries back to zero — never an O(D) clear. The touched set is the loaded boundary keys plus every value written while stepping (recorded via afreq_touchpointer thatstep_add/step_removeappend to when non-null). This matters because stepping can make values nonzero that were not in the boundary cache (they had frequency 0 at the checkpoint); reverting only the loaded keys would leak those into the next query.
ff — a hybrid array + hash table (FFTable). ff is keyed by frequency (1..window size). Small frequencies are dense and overwhelmingly common; large ones are rare (a path tree can push a single value’s frequency up toward N). So FFTable stores keys < B2 in a flat int small[B2] and keys ≥ B2 in a gp_hash_table, with 0 meaning absent in both. get/set/inc/dec/for_each route by key; the rank sum and cache (de)hydration use for_each. The common small-key path becomes pure array indexing while the rare large-key tail still works. B2 = 1024. Per-query cur_ff is a stack FFTable (its int[B2] is zero-initialized cheaply each query); precompute uses one global FFTable cleared per l-sweep.
Handling the Online Query Decoding (T = 1)
When T = 1, each query’s parameters depend on the last answer:
last_0 = 0
s_i = ((s_hat_i + last_{i-1} × (T-1)) mod N) + 1
t_i = ((t_hat_i + last_{i-1} × (T-1)) mod N) + 1
x_i = ((x_hat_i + last_{i-1} × (T-1)) mod 10^9) + 1
Note: when T = 1, the term last_{i-1} × (T-1) = 0, so queries are NOT actually encoded! Wait — let me re-read the formula.
The formula is: s_i = ((s_hat_i + last_{i-1} * T - 1) mod N) + 1. When T = 1: s_i = ((s_hat_i + last_{i-1} - 1) mod N) + 1. So queries ARE encoded when T = 1 — the previous answer shifts the parameters.
This means we must answer queries sequentially and cannot reorder them. The checkpoint approach handles this naturally: each query independently looks up its nearest checkpoint and steps.
Counting cnt(x) — Subproblem 1 Detail
For each distinct value x, maintain two sorted lists:
in_times[x]— entry times (tin[i]) of all nodesiwhereA[i] = xout_times[x]— exit times (tout[i]) of all nodesiwhereA[i] = x
Ancestor count (including v itself): The number of nodes with value x on the root-to-v path:
cnt(1, v, x) = |{i : in[i] <= in[v]}| - |{i : out[i] < in[v]}|
where both sets range over nodes with value x. A node i is an ancestor of v iff in[i] <= in[v] AND out[i] >= out[v]. The first term includes v itself (since in[v] <= in[v]). Subtracting the second term removes nodes that entered and exited before v was entered — these are on a different branch and are not ancestors. The remainder is exactly the ancestor set of v (including v).
Both terms are computed via binary search on the sorted lists in O(log N).
Path count: For the path from u to v:
cnt(u, v, x) = cnt(1, u, x) + cnt(1, v, x) - 2 * cnt(1, lca(u, v), x) + correction
where the correction is +1 if A[lca(u, v)] = x, and 0 otherwise. The lca is counted in both cnt(1, u) and cnt(1, v) since it’s an ancestor of both. The term 2*cnt(1, lca) subtracts it twice (net: 1+1−2 = 0), so if the lca has value x we add it back once.
Total time: O(log N) per query (three binary searches + one LCA lookup).
Frequency-of-Frequencies Query
Given the ff hashmap and threshold k:
answer = 1
for each (freq, count) in ff where freq > k:
answer += count
return answer
Iterate over the sparse ff hashmap. Since there are O(√N) distinct frequency values, this is O(√N).
Optimization: Store ff as a sorted array and use binary search to find the first entry with frequency > k, then scan. This doesn’t change the asymptotic bound but reduces constants.
Boundary Frequencies Detail
freq_cache[(l, r)] holds value→frequency entries copied from the running freq at checkpoint time, but only for the values appearing in the two boundary regions [l-B, l+B] and [r-B, r+B] (positions clamped to the tour). With nearest-checkpoint rounding, stepping toward any query that maps to (l, r) touches only positions within ~B/2 of each endpoint — in either direction — so those are the only freq entries ever read. The stored value is the node’s full path frequency (which may count nodes outside the boundary), exactly what step_add/step_remove need. Caching a generous symmetric superset (±B rather than ±B/2) costs only a constant factor and removes any off-by-one risk at the boundary.
This is what bounds memory: a full freq snapshot is O(window) per checkpoint; the boundary snapshot is O(B). On the path worst case the full version reached 847 MB; the boundary version stays well under 200 MB.
Folding the LCA into ff Without freq
In the non-ancestor case the Euler segment [tout[u]+1, tin[v]] excludes the LCA, so the LCA’s value must be added back to the answer. The naive way is step_add(cur_freq, cur_ff, A[lca]) — but that reads cur_freq[A[lca]], and the LCA is generally not a boundary node, so it is absent from the truncated freq cache. Reading it would silently use 0 and corrupt ff.
Instead we compute the LCA value’s frequency within the segment directly: it is cnt(A[lca] on path s..t) − 1 (the same cnt_x binary-search routine, minus the LCA itself). Knowing that count f, adding the LCA just moves one value from bucket f to f+1 in ff (ff[f]--; ff[f+1]++), with no freq lookup at all. This is the change that makes the boundary-only freq cache correct.
Disabling the Built-in Brute-Force Check
A brute-force recomputation of every answer (walk the path, build a std::map frequency table, compare) is compiled in only under -DBRUTE_FORCE_CHECK and writes MISMATCH … to stderr on any disagreement. The default release build (g++ -O2 solution.cpp) omits it entirely — no path walk, no extra allocation, nothing on stderr — so it does not affect the measured time/memory above.
Complexity Summary
| Component | Complexity |
|---|---|
| Tree setup (DFS, LCA, position lists) | O(N log N) |
| Checkpoint precomputation | O(N²/B) |
| Per-query: find checkpoint | O(1) |
| Per-query: step to query | O(B) |
| Per-query: count cnt(x) | O(log N) |
| Per-query: sum ff above k | O(√N) |
| Total time (B = √N) | O((N + Q)√N) ≈ 6 × 10⁷ |
| Total memory (B = √N) | O(N²/B · (√N + B)) = O(N√N) ≈ 250 MB |
Why This Works for T = 1 (Online)
The checkpoint grid is precomputed independently of any queries. For each online query:
- Decode the parameters using the previous answer
- Map to Euler tour segment [L, R]
- Look up checkpoint by integer division — no query knowledge needed
- Step to the exact segment and answer
The algorithm never reorders queries or needs to see future queries.
Potential Optimizations
-
Value compression: Map the up to 10⁹ distinct values to ranks 1..D using coordinate compression. This allows arrays instead of hash maps for
freq[]. -
Block size tuning: The optimal B depends on the ratio of N to Q and the actual constant factors in stepping vs ff-querying. Empirical tuning may help.
-
Checkpoint layout: Store checkpoints in a 1D array indexed by
(l/B) * (N/B) + (r/B)for cache-friendly access. -
Parallel checkpoint precomputation: Process different blocks of l independently.
Bugs and Problems Encountered During Implementation
1. Wrong Euler-tour segment formula
Problem: Initially used [tin[s], tin[t]] (after ensuring tin[s] ≤ tin[t]) for all queries. This only works when s is an ancestor of t. For non-ancestor cases, the segment misses or includes extra nodes.
Fix: Split into two cases: ancestor case uses [tin[u], tin[v]] where u = LCA(s,t) (LCA included in segment), non-ancestor case uses [tout[u]+1, tin[v]] where u is the node with smaller tin (LCA excluded from segment, added separately).
2. Stepping creates empty window
Problem: When stepping from checkpoint [l₀, r₀] to query [L, R], shrinking an endpoint before expanding the opposite one could create an invalid range (e.g., [1, 0]), causing incorrect appearances_in queries.
Fix: Always expand both endpoints first, then shrink. The stepping order is: expand right → shrink left → shrink right → expand left.
3. appearances_in uses wrong exit position
Problem: The function checked {tin[v], tout[v]} but the DFS records the exit at tour[timer_dfs] = v after tout[v] = timer_dfs - 1, so the actual tour position of the exit is tout[v] + 1.
Fix: Changed to check {tin[v], tout[v]+1} in the range. The tout[v]+1 offset is critical for correctly determining whether a node appears 0, 1, or 2 times in a window.
4. Checkpoint clamping exceeds valid grid
Problem: Nearest-checkpoint rounding ((L + B/2) / B) * B can round up to a position where no checkpoint was precomputed (e.g., timer_dfs=400, B=400 — the only checkpoint is (0,0) but rounding can yield (400,400)). Checkpoints exist only for multiples of B that are < tour_size.
Fix: Clamp both l₀ and r₀ into the valid grid [0, max_r], where max_r = timer_dfs - 1 - (timer_dfs - 1) % B is the largest multiple of B inside the tour. Crucially max_r is itself a multiple of B, so the clamp always lands on a real checkpoint: l₀ = max(0, min(l₀, max_r)), r₀ = max(l₀, min(r₀, max_r)). (An interim version sidestepped the issue with floor division (L/B)*B, but that was reverted in favor of true nearest rounding once the clamp made rounding safe — see Bug 9.)
5. on_path bitset too large to cache per checkpoint
Problem: The on_path bitset (O(N) bits per checkpoint) made checkpoint storage O(N²/B · N) which is prohibitive.
Fix: Eliminated on_path entirely. Instead, toggle_tour_pos recomputes whether a node is on the path by checking appearances_in(v, old_cl, old_cr) — how many of {tin[v], tout[v]+1} fall in the window bounds before the toggle. If 1, the node is on the path and needs to be removed; otherwise it needs to be added.
6. step_add/step_remove must operate on local maps
Problem: The frequency maps (freq, ff) are copied from checkpoints for each query. The stepping functions needed to work on these local copies, not global state.
Fix: Made step_add and step_remove take map references as parameters, and toggle_tour_pos passes the local copies through.
7. test_b.sh shell quoting issues
Problem: Embedding a complex Python brute-force verification inline in a bash heredoc caused quoting conflicts with triple-quoted strings.
Fix: Moved the B sweep into a standalone test_b.py script that patches B with sed, compiles, and runs all test cases. All B values 1–100 then passed cleanly.
8. freq_cache was storing the full map, not the boundary range
Problem: Although the design called for freq_cache to hold only boundary-node frequencies, the code stored a copy of the entire freq map at every checkpoint (freq_cache[{l,r}] = freq). That makes checkpoint memory O((N/B)²·N) instead of O((N/B)²·B) — on the path worst case, 847 MB.
Fix: Build the cache from only the values appearing in the symmetric boundary regions [l-B, l+B] and [r-B, r+B]. This depends on Bug 9’s LCA fix to be correct (otherwise stepping reads a non-boundary frequency).
9. Nearest-checkpoint rounding reintroduced the inverted window — and broke the LCA add
Problem: Switching from floor ((L/B)*B) to nearest rounding (((L+B/2)/B)*B) lets each endpoint step ≤ B/2, but it also lets an endpoint move in either direction. (a) The old stepping order (expand right → shrink left → shrink right → expand left) could then invert the window, e.g. l₀=r₀=400, query L=210,R=220 drives cr down to 220 while cl=400 → [400,220] (a resurfaced Bug 2). (b) With the now boundary-only freq cache, adding the LCA via step_add(cur_freq, A[lca]) reads a frequency that isn’t cached (the LCA isn’t a boundary node).
Fix: (a) Reorder stepping to expand both sides first, then shrink both so the window only grows then contracts (never inverts). (b) Compute the LCA value’s segment frequency as cnt_x(s,t,A[lca]) − 1 and move the ff bucket directly, with no freq lookup. Also widened the boundary cache regions to be symmetric (±B) to match bidirectional stepping.
10. map → gp_hash_table, with caches as flat vectors
Problem: std::map for freq/ff adds a log factor and heavy node allocation on the hot stepping path; and storing ~10⁵ live hash tables in the caches wastes memory on per-table bucket overhead.
Fix: Use __gnu_pbds::gp_hash_table<int,int> for freq/ff (O(1) open addressing). Store each checkpoint compactly as a vector<pair<int,int>> and rehydrate into a gp_hash_table on query. Note gp_hash_table has no erase(iterator) — erase by key instead.
11. Built-in brute force left enabled by default
Problem: The in-main brute-force comparison ran on every query in normal builds, walking each path and allocating a std::map — pure overhead in production.
Fix: Guarded it behind #ifdef BRUTE_FORCE_CHECK (and the path scratch vector with it). Default g++ -O2 builds skip it; build with -DBRUTE_FORCE_CHECK to re-enable the self-check.
12. B too small for the path worst case
Problem: B ≈ √N (316–400), optimal under the O(√N)-distinct-values assumption, blows the budget on the path tree (5.82 s / 847 MB at B=400) because its windows hold many distinct values.
Fix: Tuned B = 1500 empirically — the best all-around balance, ≤ 1.76 s on both random and path trees and ≤ 99 MB worst case. See the B sweep table. (A stray B = 300 had earlier been left in the source, contradicting this; it was corrected.)
13. gp_hash_table freq/ff → direct-indexed arrays
Problem: Even open-addressed, hashing freq/ff on every one of the millions of stepping operations is the dominant per-step cost.
Fix: (a) freq → a single reused int[D] array keyed by compressed value, kept all-zero between uses. (b) ff → FFTable, a hybrid int[B2] (keys < B2) + gp_hash_table (keys ≥ B2). At fixed B this is ~1.3–1.7× faster end-to-end with unchanged memory; see Measured impact table. Two pitfalls surfaced while doing this:
- Leak from the reused
freqarray (Bug 13a). Reverting a query only zeroed the loaded boundary keys, but stepping can make other values nonzero (they had frequency 0 at the checkpoint, so weren’t cached). Those leaked into the next query, corrupting later answers. Fix: record every value written during stepping (afreq_touchpointer thatstep_add/step_removeappend to) and revert that full set, not just the loaded keys. - Dirty array after precompute (Bug 13b). Precompute now shares the one global
freqarray with queries and left it nonzero after its finall-sweep, so query 1 read garbage. Fix: zerofreq[0..D)once at the end ofprecompute_checkpointsbefore the query phase.