Problem C — Tree Path Value Rank Query

Problem Restatement

Given a rooted tree where each vertex i has a value A[i], answer Q queries. Each query (s, t, x) asks:

On the simple path from s to t, what is the rank of value x?

The rank is defined as:

\[\text{rank}(x) = 1 + \left|\{y \mid \text{cnt}(y) > \text{cnt}(x)\}\right|\]

where cnt(y) is the number of occurrences of y on the path. If x does not appear on the path, cnt(x) = 0, so the rank equals 1 + (number of distinct values on the path).

The challenge: queries may be online (T = 1), meaning each query’s parameters depend on the previous answer. This rules out algorithms that require knowing all queries in advance (e.g., offline Mo’s algorithm, offline divide-and-conquer).

High-Level Strategy

The query decomposes into two sub-problems:

Compute cnt(x) for a specific value x — count how many times value x appears on the path from s to t. This is straightforward.
Count how many distinct values appear more than k times on the path, where k = cnt(x). This is the core difficulty.

Our approach uses a checkpoint variant of Mo’s algorithm on trees that works online by precomputing intermediate states at regular grid points, then stepping from the nearest checkpoint for each query.

Phase 1: Tree Setup

Euler Tour (Entry-Exit DFS Order)

Run a DFS from the root. For each node v, record:

tin[v] — the time when we first enter v (entry time)
tout[v] — the time when we finish processing v’s subtree (exit time)

This gives an Euler tour array tour[] of length 2N, where each node appears exactly twice (entry and exit).

Key property for path queries

For nodes s and t with tin[s] ≤ tin[t], the segment tour[tin[s]..tin[t]] has the structure:

Nodes on the path s → t: appear exactly once in this segment
Nodes not on the path: appear either 0 times or 2 times (both entry and exit fall in the segment)

To track which nodes are on the path, maintain a bitset<N> called on_path. Whenever a tour position is added to or removed from the current Mo’s window, toggle the bit for the corresponding node: on_path[v] ^= 1. After the toggle:

If on_path[v] becomes 1, the node now appears exactly once in the window → it is on the path. Add its value to freq and update ff.
If on_path[v] becomes 0, the node now appears 0 or 2 times → it is not on the path. Remove its value from freq and update ff.

Each Mo’s step (add/remove one tour position) is O(1): toggle the bit, conditionally update freq and ff, and erase any zero entries from ff.

LCA Precomputation

Build a binary lifting table up[v][j] = the 2^j-th ancestor of v. This gives:

LCA queries in O(log N)
Depth information for each node

Position Lists for Subproblem 1

For each distinct value v, store a sorted list positions[v] of all nodes (by entry time tin) that have value A[node] = v.

To count occurrences of value x on the path from s to t:

k = count of nodes with value x in the subtree-range query [tin[LCA(s,t)], tout[LCA(s,t)]]
    using binary search on positions[x]
    adjusted for the path endpoints s and t

More precisely: a node u is on the path from s to t iff LCA(s, t) is an ancestor of u AND u is an ancestor of s or u is an ancestor of t. Using entry/exit times, we binary search positions[x] for nodes whose entry time falls in the range [tin[lca], tout[lca]] and who are also on the s→t branch. A simpler formulation: count all nodes with value x whose entry time is in [tin[lca], tout[lca]] that are ancestors of s or ancestors of t.

Simpler approach: For each value x, store nodes with that value in DFS order. A node u is on the path s→t iff tin[lca(s,t)] ≤ tin[u] ≤ tout[lca(s,t)] AND depth[u] ≥ depth[lca(s,t)] AND (u is on the s-branch or t-branch). The cleanest way: count nodes with value x in [tin[s], tin[t]] (assuming tin[s] ≤ tin[t]) in the Euler tour, treating the LCA’s double appearance correctly. Use binary search on positions[x] restricted to the range — O(log N) per query.

Phase 2: Checkpoint-Based Online Mo’s Algorithm

Why Standard Mo’s Doesn’t Work Online

Standard Mo’s algorithm sorts queries by (block_of_l, r) and processes them in that order, moving the window incrementally. This requires knowing all queries ahead of time.

The Checkpoint Idea

Instead of sorting, we precompute states at regular grid points and answer each query by stepping from the nearest checkpoint.

Grid of checkpoints: For block size B, define checkpoints at all (l, r) where:

l is a multiple of B
r is a multiple of B
l ≤ r

Number of checkpoints: O((N/B)² / 2)

What We Store at Each Checkpoint

For each checkpoint (l, r), we store:

ff_cache[(l, r)] — a hashmap of the frequency-of-frequencies at this checkpoint
- ff_cache[(l, r)][j] = number of distinct values that appear exactly j times on the path corresponding to the Euler tour segment [l, r]
- Keys with value 0 are removed, so the hashmap contains only nonzero entries
- At most O(√N) nonzero keys (if d values have frequencies f₁ ≥ f₂ ≥ … ≥ f_d ≥ 1, then d² ≤ ∑fᵢ ≤ N, so d ≤ √N)
freq_cache[(l, r)] — the value frequencies of the boundary nodes only
- freq_cache[(l, r)][v] = current (full) frequency of value v in the window at checkpoint (l, r), but stored only for values appearing within B positions of each endpoint
- With nearest-checkpoint rounding, stepping moves each endpoint at most ~B/2 in either direction, so the only freq entries ever read while stepping are those for nodes in the symmetric boundary regions [l-B, l+B] and [r-B, r+B]. The stored frequency is the full path frequency (it may count nodes outside the boundary), which is exactly what step_add/step_remove need.
- This is the key to bounding memory: a full freq copy is O(window) ≈ O(N) per checkpoint, giving O((N/B)²·N) total; the boundary-only copy is O(B) per checkpoint, giving O((N/B)²·B). On the worst case (path tree) this is the difference between ~847 MB and well under 200 MB.

Storage format. Both caches are stored compactly as vector<pair<int,int>> (flat key/value lists), not as live hash tables — storing ~10⁵ gp_hash_table objects directly wastes memory on per-table bucket overhead. On query, the chosen checkpoint’s vectors are rehydrated into gp_hash_tables (cur_ff, cur_freq) for O(1) stepping.

Total memory per checkpoint: O(√N + B)

Total memory across all checkpoints: O((N/B)² · (√N + B))

Precomputation

No need to sort checkpoints. Use a simple nested loop: for each l (multiple of B), start from an empty window and iterate r from l upward (in steps of B to 2N), stepping one position at a time and maintaining:

freq[v] — hashmap from value → current frequency in the window
ff — hashmap from frequency → count of distinct values with that frequency; entries where ff[f] == 0 are removed immediately

Each step (add tour position tour[r] to the window as r increases):

Let v = tour[r]. Toggle on_path[v] ^= 1.
If on_path[v] == 1 (node now on the path — appears once):
- Decrement ff[freq[A[v]]]; erase key if value becomes 0
- freq[A[v]]++; increment ff[freq[A[v]]]
If on_path[v] == 0 (node now off the path — appeared twice, so a full subtree round-trip):
- Decrement ff[freq[A[v]]]; erase key if value becomes 0
- freq[A[v]]--; increment ff[freq[A[v]]]

This is O(1) amortized per step (hashmap erase is O(1) average). For each fixed l, the r-sweep costs O(2N) steps. With N/B choices for l, total time: O(N/B · 2N) = O(N²/B).

At each checkpoint (l, r), store ff_cache[(l, r)] = ff and freq_cache[(l, r)] = freq (both copies of the current hashmaps).

Answering Each Query

Given query (s, t, x):

Step 1: Compute k = cnt(x) on the path (Subproblem 1)

Use the position lists and binary search. O(log N) time.

Step 2: Find the nearest checkpoint

Map the Euler tour segment [L, R] to the grid. Round each endpoint to the nearest multiple of B (up or down, whichever is closer):

l₀ = ((L + B/2) / B) · B, then clamp to the valid grid [0, max_r]
r₀ = ((R + B/2) / B) · B, then clamp to [l₀, max_r]

where max_r = (tour_size-1) - (tour_size-1) % B is the largest multiple of B inside the tour (and is itself a multiple of B, so the clamp lands on a real checkpoint). Each endpoint moves at most B/2 (a little more only near the tour boundary where clamping applies, still < B), so the stepping distance per side is ≤ B/2.

Finding the checkpoint takes O(1) — just integer division. This doesn’t depend on knowing the query in advance; we compute it when the query arrives.

Step 3: Load the checkpoint state

Set ff = ff_cache[(l₀, r₀)] and freq = freq_cache[(l₀, r₀)]. O(√N + B) to copy.

Step 4: Move the range to match the query [L, R]

Starting from [l₀, r₀], add or remove tour positions one at a time until the window is [L, R]. Because rounding lets either endpoint move in either direction, the stepping order must expand both sides first, then shrink both sides (expand right → expand left → shrink left → shrink right). Expanding first means the window only ever grows, then contracts, so it is never inverted (cl ≤ L ≤ R ≤ cr is maintained throughout the shrink phase). Each step decides add-vs-remove not via an on_path bit but by appearances_in(v, old_cl, old_cr) — how many of {tin[v], tout[v]+1} lay in the window before the step (1 ⇒ remove, else add). At most B/2 steps per side.

Each step: O(1) amortized.

Step 5: Query the ff hashmap

After reaching the exact query window, the answer is:

\[\text{rank} = 1 + \sum_{j=k+1}^{2N} \text{ff}[j]\]

Since ff is sparse with at most O(√N) nonzero entries, iterate over the entries and sum those with index > k. O(√N) time.

Complexity Analysis

Per-query time: O(B + √N)

Finding checkpoint: O(1)
Loading checkpoint: O(√N + B) (but we can avoid full copy by using a reference + delta approach)
Stepping: O(B)
Querying ff: O(√N)

Precomputation time: O(N²/B)

Total time: O(N²/B + Q · (B + √N))

Total memory: O(N²/B · (√N + B))

Choosing B

Optimize B to balance precomputation and per-query cost:

N²/B = Q · B      →      B = N / √Q

With N = Q = 10⁵: B ≈ √N ≈ 316.

B ≈ √N ≈ 316:

Checkpoints: (N/B)² ≈ (316)² ≈ 10⁵
Memory per checkpoint: O(√N + 2B) ≈ 316 + 632 ≈ 948 integers
Total memory: 10⁵ × 948 × 4 ≈ 380 MB — fine under 1G
Precomputation: N²/B ≈ 3 × 10^7
Per query: B/2 + √N ≈ 484
Total: ~10^8 operations — feasible in 5s

B ≈ n^(2/3) ≈ 2154 (memory-conservative alternative):

Checkpoints: (N/B)² ≈ 47² ≈ 2209
Memory per checkpoint: √N + 2B ≈ 316 + 4308 ≈ 4624
Total memory: 2209 × 4624 × 4 ≈ 41 MB
Precomputation: N²/B ≈ 4.6 × 10^6
Per query: B/2 + √N ≈ 1077 + 316 ≈ 1393
Total query time: 10⁵ × 1393 ≈ 1.4 × 10^8 — a bit slower but still feasible

Recommended: B ≈ √N ≈ 316, giving O((N + Q)√N) total time and ~380 MB memory.

Empirical tuning (the analysis above understates the worst case)

The theory assumes O(√N) distinct values per window, but the path (line) tree is far worse: its Euler-tour windows have large on-path sets, inflating each checkpoint’s cached ff/freq vectors. Measured on N = Q = 10⁵, T = 1 (online), values in [1,1000]:

B	Random tree	Path tree (worst case)
400	2.70 s / 60 MB	5.82 s / 847 MB
700	—	3.30 s / 331 MB
1000	2.58 s / 26 MB	2.76 s / 181 MB
1500	3.09 s / 23 MB	2.79 s / 97 MB

At B = 316–400 the path case blows past a 5 s / 1 GB budget. Because total memory scales as ~(2N/B)²·(per-checkpoint size), raising B is very effective — it cuts the checkpoint count quadratically while only linearly increasing per-query stepping. (These figures predate the array-freq / hybrid-ff rewrite below; the B sweep table re-measures on the current code.)

Measured impact of the array `freq` + hybrid `ff` change

The numbers above predate the array-freq / hybrid-ff rewrite. Re-measured on N = Q = 10⁵, T = 1, values in [1,1000] (g++ -O2), the two changes are independent — the data structures buy time, B buys memory:

`freq`/`ff` storage	B	Random tree	Path tree (worst case)
hash / hash (original)	300	4.36 s / 90 MB	12.59 s / 1291 MB
hash / hash	1000	3.60 s / 27 MB	4.38 s / 182 MB
array / hybrid	300	2.55 s / 90 MB	9.61 s / 1314 MB
array / hybrid	1000	1.60 s / 28 MB	1.89 s / 184 MB

At fixed B the data-structure change is ~1.3–1.7× faster (array indexing replaces hashing on the stepping path) with essentially unchanged memory (checkpoint caches are flat vectors either way). Raising B from 300→1000 cuts memory ~7× (checkpoint count ∝ (2N/B)²). Both are needed: array/hybrid at B = 300 still busts memory on the path case, and hash/hash at B = 1000 is ~2.3× slower than the combined result. End-to-end (original → current) the worst case improves from 12.6 s / 1.3 GB (over budget) to under 2 s.

B sweep on the current code (choosing the default)

Re-measured on the array/hybrid build (N = Q = 10⁵, T = 1, values in [1,1000]). As B grows the random case slows slightly (more per-query stepping) while the path case gets both faster and much lighter (fewer checkpoints → less cached state; path windows dominate memory):

B	Random tree	Path tree (worst case)
1000	1.62 s / 27.6 MB	1.93 s / 184 MB
1200	1.75 s / 25.6 MB	1.76 s / 138 MB
1500	1.76 s / 24.0 MB	1.58 s / 99 MB
2000	2.01 s / 22.5 MB	1.54 s / 67 MB

The implementation uses B = 1500 — the best all-around balance: ≤ 1.76 s on both shapes and 99 MB worst case, nearly halving worst-case memory vs B = 1000 (184→99 MB) for a ~0.14 s random-case cost. (B = 2000 trades ~0.25 s of random-case time for the lightest footprint, 67 MB; B = 1000 is marginally faster on random but carries ~2× the worst-case memory.)

Phase 3: Implementation Details

Data Structures

freq: int[D] array, compressed value index → frequency in current window
      (values coordinate-compressed to 0..D-1; reused across queries, see below)
ff:   FFTable, frequency → count of distinct values with that frequency
      hybrid array+hash: keys < B2 in a flat int[B2], keys ≥ B2 in a gp_hash_table;
      0 means absent in both halves; at most O(√N) nonzero keys
positions: in_times / out_times, value → sorted list of tin / tout
up[v][j]: binary lifting table for LCA
tour: Euler tour array of length 2N
ff_cache:   map<(l,r), vector<pair<int,int>>>  — full ff at checkpoint, flat list
freq_cache: map<(l,r), vector<pair<int,int>>>  — boundary freq at checkpoint, flat list
            (rehydrated into freq[] / a local FFTable on query)

No on_path array. An earlier design tracked a bitset<N> on_path to decide add-vs-remove on each toggle. That bitset is O(N) bits per checkpoint if cached, which is prohibitive. It was eliminated: toggle_tour_pos instead recomputes membership from appearances_in(v, old_cl, old_cr) (how many of {tin[v], tout[v]+1} lie in the pre-step window — 1 ⇒ remove, else add). The freq/ff updates below are driven by that count rather than by a flipped bit.

`freq` and `ff`: arrays instead of hash tables

Both freq and ff started as gp_hash_table<int,int>. On the hot stepping path (millions of add/remove steps across precompute and queries), hashing dominates. Both were replaced by direct-indexed storage, keeping the same O(1)-per-step interface:

freq — a single reused array. Values are coordinate-compressed to 0..D-1 (Acomp[v] = compressed index of A[v]), so freq is a plain int[D] indexed by compressed value. There is one global freq array, reused by both precompute and every query, kept all-zero between uses:

Precompute resets freq[0..D) to 0 at the start of each l-sweep, and once more after the whole sweep so the query phase starts from a clean array.
Each query loads only the checkpoint’s boundary entries into freq, runs the query, then reverts exactly the touched entries back to zero — never an O(D) clear. The touched set is the loaded boundary keys plus every value written while stepping (recorded via a freq_touch pointer that step_add/step_remove append to when non-null). This matters because stepping can make values nonzero that were not in the boundary cache (they had frequency 0 at the checkpoint); reverting only the loaded keys would leak those into the next query.

ff — a hybrid array + hash table (FFTable). ff is keyed by frequency (1..window size). Small frequencies are dense and overwhelmingly common; large ones are rare (a path tree can push a single value’s frequency up toward N). So FFTable stores keys < B2 in a flat int small[B2] and keys ≥ B2 in a gp_hash_table, with 0 meaning absent in both. get/set/inc/dec/for_each route by key; the rank sum and cache (de)hydration use for_each. The common small-key path becomes pure array indexing while the rare large-key tail still works. B2 = 1024. Per-query cur_ff is a stack FFTable (its int[B2] is zero-initialized cheaply each query); precompute uses one global FFTable cleared per l-sweep.

Handling the Online Query Decoding (T = 1)

When T = 1, each query’s parameters depend on the last answer:

last_0 = 0
s_i = ((s_hat_i + last_{i-1} × (T-1)) mod N) + 1
t_i = ((t_hat_i + last_{i-1} × (T-1)) mod N) + 1
x_i = ((x_hat_i + last_{i-1} × (T-1)) mod 10^9) + 1

Note: when T = 1, the term last_{i-1} × (T-1) = 0, so queries are NOT actually encoded! Wait — let me re-read the formula.

The formula is: s_i = ((s_hat_i + last_{i-1} * T - 1) mod N) + 1. When T = 1: s_i = ((s_hat_i + last_{i-1} - 1) mod N) + 1. So queries ARE encoded when T = 1 — the previous answer shifts the parameters.

This means we must answer queries sequentially and cannot reorder them. The checkpoint approach handles this naturally: each query independently looks up its nearest checkpoint and steps.

Counting cnt(x) — Subproblem 1 Detail

For each distinct value x, maintain two sorted lists:

in_times[x] — entry times (tin[i]) of all nodes i where A[i] = x
out_times[x] — exit times (tout[i]) of all nodes i where A[i] = x

Ancestor count (including v itself): The number of nodes with value x on the root-to-v path:

cnt(1, v, x) = |{i : in[i] <= in[v]}| - |{i : out[i] < in[v]}|

where both sets range over nodes with value x. A node i is an ancestor of v iff in[i] <= in[v] AND out[i] >= out[v]. The first term includes v itself (since in[v] <= in[v]). Subtracting the second term removes nodes that entered and exited before v was entered — these are on a different branch and are not ancestors. The remainder is exactly the ancestor set of v (including v).

Both terms are computed via binary search on the sorted lists in O(log N).

Path count: For the path from u to v:

cnt(u, v, x) = cnt(1, u, x) + cnt(1, v, x) - 2 * cnt(1, lca(u, v), x) + correction

where the correction is +1 if A[lca(u, v)] = x, and 0 otherwise. The lca is counted in both cnt(1, u) and cnt(1, v) since it’s an ancestor of both. The term 2*cnt(1, lca) subtracts it twice (net: 1+1−2 = 0), so if the lca has value x we add it back once.

Total time: O(log N) per query (three binary searches + one LCA lookup).

Frequency-of-Frequencies Query

Given the ff hashmap and threshold k:

answer = 1
for each (freq, count) in ff where freq > k:
    answer += count
return answer

Iterate over the sparse ff hashmap. Since there are O(√N) distinct frequency values, this is O(√N).

Optimization: Store ff as a sorted array and use binary search to find the first entry with frequency > k, then scan. This doesn’t change the asymptotic bound but reduces constants.

Boundary Frequencies Detail

freq_cache[(l, r)] holds value→frequency entries copied from the running freq at checkpoint time, but only for the values appearing in the two boundary regions [l-B, l+B] and [r-B, r+B] (positions clamped to the tour). With nearest-checkpoint rounding, stepping toward any query that maps to (l, r) touches only positions within ~B/2 of each endpoint — in either direction — so those are the only freq entries ever read. The stored value is the node’s full path frequency (which may count nodes outside the boundary), exactly what step_add/step_remove need. Caching a generous symmetric superset (±B rather than ±B/2) costs only a constant factor and removes any off-by-one risk at the boundary.

This is what bounds memory: a full freq snapshot is O(window) per checkpoint; the boundary snapshot is O(B). On the path worst case the full version reached 847 MB; the boundary version stays well under 200 MB.

Folding the LCA into `ff` Without `freq`

In the non-ancestor case the Euler segment [tout[u]+1, tin[v]] excludes the LCA, so the LCA’s value must be added back to the answer. The naive way is step_add(cur_freq, cur_ff, A[lca]) — but that reads cur_freq[A[lca]], and the LCA is generally not a boundary node, so it is absent from the truncated freq cache. Reading it would silently use 0 and corrupt ff.

Instead we compute the LCA value’s frequency within the segment directly: it is cnt(A[lca] on path s..t) − 1 (the same cnt_x binary-search routine, minus the LCA itself). Knowing that count f, adding the LCA just moves one value from bucket f to f+1 in ff (ff[f]--; ff[f+1]++), with no freq lookup at all. This is the change that makes the boundary-only freq cache correct.

Disabling the Built-in Brute-Force Check

A brute-force recomputation of every answer (walk the path, build a std::map frequency table, compare) is compiled in only under -DBRUTE_FORCE_CHECK and writes MISMATCH … to stderr on any disagreement. The default release build (g++ -O2 solution.cpp) omits it entirely — no path walk, no extra allocation, nothing on stderr — so it does not affect the measured time/memory above.

Complexity Summary

Component	Complexity
Tree setup (DFS, LCA, position lists)	O(N log N)
Checkpoint precomputation	O(N²/B)
Per-query: find checkpoint	O(1)
Per-query: step to query	O(B)
Per-query: count cnt(x)	O(log N)
Per-query: sum ff above k	O(√N)
Total time (B = √N)	O((N + Q)√N) ≈ 6 × 10⁷
Total memory (B = √N)	O(N²/B · (√N + B)) = O(N√N) ≈ 250 MB

Why This Works for T = 1 (Online)

The checkpoint grid is precomputed independently of any queries. For each online query:

Decode the parameters using the previous answer
Map to Euler tour segment [L, R]
Look up checkpoint by integer division — no query knowledge needed
Step to the exact segment and answer

The algorithm never reorders queries or needs to see future queries.

Potential Optimizations

Value compression: Map the up to 10⁹ distinct values to ranks 1..D using coordinate compression. This allows arrays instead of hash maps for freq[].
Block size tuning: The optimal B depends on the ratio of N to Q and the actual constant factors in stepping vs ff-querying. Empirical tuning may help.
Checkpoint layout: Store checkpoints in a 1D array indexed by (l/B) * (N/B) + (r/B) for cache-friendly access.
Parallel checkpoint precomputation: Process different blocks of l independently.

Bugs and Problems Encountered During Implementation

1. Wrong Euler-tour segment formula

Problem: Initially used [tin[s], tin[t]] (after ensuring tin[s] ≤ tin[t]) for all queries. This only works when s is an ancestor of t. For non-ancestor cases, the segment misses or includes extra nodes. Fix: Split into two cases: ancestor case uses [tin[u], tin[v]] where u = LCA(s,t) (LCA included in segment), non-ancestor case uses [tout[u]+1, tin[v]] where u is the node with smaller tin (LCA excluded from segment, added separately).

2. Stepping creates empty window

Problem: When stepping from checkpoint [l₀, r₀] to query [L, R], shrinking an endpoint before expanding the opposite one could create an invalid range (e.g., [1, 0]), causing incorrect appearances_in queries. Fix: Always expand both endpoints first, then shrink. The stepping order is: expand right → shrink left → shrink right → expand left.

3. `appearances_in` uses wrong exit position

Problem: The function checked {tin[v], tout[v]} but the DFS records the exit at tour[timer_dfs] = v after tout[v] = timer_dfs - 1, so the actual tour position of the exit is tout[v] + 1. Fix: Changed to check {tin[v], tout[v]+1} in the range. The tout[v]+1 offset is critical for correctly determining whether a node appears 0, 1, or 2 times in a window.

4. Checkpoint clamping exceeds valid grid

Problem: Nearest-checkpoint rounding ((L + B/2) / B) * B can round up to a position where no checkpoint was precomputed (e.g., timer_dfs=400, B=400 — the only checkpoint is (0,0) but rounding can yield (400,400)). Checkpoints exist only for multiples of B that are < tour_size. Fix: Clamp both l₀ and r₀ into the valid grid [0, max_r], where max_r = timer_dfs - 1 - (timer_dfs - 1) % B is the largest multiple of B inside the tour. Crucially max_r is itself a multiple of B, so the clamp always lands on a real checkpoint: l₀ = max(0, min(l₀, max_r)), r₀ = max(l₀, min(r₀, max_r)). (An interim version sidestepped the issue with floor division (L/B)*B, but that was reverted in favor of true nearest rounding once the clamp made rounding safe — see Bug 9.)

5. `on_path` bitset too large to cache per checkpoint

Problem: The on_path bitset (O(N) bits per checkpoint) made checkpoint storage O(N²/B · N) which is prohibitive. Fix: Eliminated on_path entirely. Instead, toggle_tour_pos recomputes whether a node is on the path by checking appearances_in(v, old_cl, old_cr) — how many of {tin[v], tout[v]+1} fall in the window bounds before the toggle. If 1, the node is on the path and needs to be removed; otherwise it needs to be added.

6. `step_add`/`step_remove` must operate on local maps

Problem: The frequency maps (freq, ff) are copied from checkpoints for each query. The stepping functions needed to work on these local copies, not global state. Fix: Made step_add and step_remove take map references as parameters, and toggle_tour_pos passes the local copies through.

7. `test_b.sh` shell quoting issues

Problem: Embedding a complex Python brute-force verification inline in a bash heredoc caused quoting conflicts with triple-quoted strings. Fix: Moved the B sweep into a standalone test_b.py script that patches B with sed, compiles, and runs all test cases. All B values 1–100 then passed cleanly.

8. `freq_cache` was storing the full map, not the boundary range

Problem: Although the design called for freq_cache to hold only boundary-node frequencies, the code stored a copy of the entire freq map at every checkpoint (freq_cache[{l,r}] = freq). That makes checkpoint memory O((N/B)²·N) instead of O((N/B)²·B) — on the path worst case, 847 MB. Fix: Build the cache from only the values appearing in the symmetric boundary regions [l-B, l+B] and [r-B, r+B]. This depends on Bug 9’s LCA fix to be correct (otherwise stepping reads a non-boundary frequency).

9. Nearest-checkpoint rounding reintroduced the inverted window — and broke the LCA add

Problem: Switching from floor ((L/B)*B) to nearest rounding (((L+B/2)/B)*B) lets each endpoint step ≤ B/2, but it also lets an endpoint move in either direction. (a) The old stepping order (expand right → shrink left → shrink right → expand left) could then invert the window, e.g. l₀=r₀=400, query L=210,R=220 drives cr down to 220 while cl=400 → [400,220] (a resurfaced Bug 2). (b) With the now boundary-only freq cache, adding the LCA via step_add(cur_freq, A[lca]) reads a frequency that isn’t cached (the LCA isn’t a boundary node). Fix: (a) Reorder stepping to expand both sides first, then shrink both so the window only grows then contracts (never inverts). (b) Compute the LCA value’s segment frequency as cnt_x(s,t,A[lca]) − 1 and move the ff bucket directly, with no freq lookup. Also widened the boundary cache regions to be symmetric (±B) to match bidirectional stepping.

10. `map` → `gp_hash_table`, with caches as flat vectors

Problem: std::map for freq/ff adds a log factor and heavy node allocation on the hot stepping path; and storing ~10⁵ live hash tables in the caches wastes memory on per-table bucket overhead. Fix: Use __gnu_pbds::gp_hash_table<int,int> for freq/ff (O(1) open addressing). Store each checkpoint compactly as a vector<pair<int,int>> and rehydrate into a gp_hash_table on query. Note gp_hash_table has no erase(iterator) — erase by key instead.

11. Built-in brute force left enabled by default

Problem: The in-main brute-force comparison ran on every query in normal builds, walking each path and allocating a std::map — pure overhead in production. Fix: Guarded it behind #ifdef BRUTE_FORCE_CHECK (and the path scratch vector with it). Default g++ -O2 builds skip it; build with -DBRUTE_FORCE_CHECK to re-enable the self-check.

12. B too small for the path worst case

Problem: B ≈ √N (316–400), optimal under the O(√N)-distinct-values assumption, blows the budget on the path tree (5.82 s / 847 MB at B=400) because its windows hold many distinct values. Fix: Tuned B = 1500 empirically — the best all-around balance, ≤ 1.76 s on both random and path trees and ≤ 99 MB worst case. See the B sweep table. (A stray B = 300 had earlier been left in the source, contradicting this; it was corrected.)

13. `gp_hash_table` `freq`/`ff` → direct-indexed arrays

Problem: Even open-addressed, hashing freq/ff on every one of the millions of stepping operations is the dominant per-step cost. Fix: (a) freq → a single reused int[D] array keyed by compressed value, kept all-zero between uses. (b) ff → FFTable, a hybrid int[B2] (keys < B2) + gp_hash_table (keys ≥ B2). At fixed B this is ~1.3–1.7× faster end-to-end with unchanged memory; see Measured impact table. Two pitfalls surfaced while doing this:

Leak from the reused freq array (Bug 13a). Reverting a query only zeroed the loaded boundary keys, but stepping can make other values nonzero (they had frequency 0 at the checkpoint, so weren’t cached). Those leaked into the next query, corrupting later answers. Fix: record every value written during stepping (a freq_touch pointer that step_add/step_remove append to) and revert that full set, not just the loaded keys.
Dirty array after precompute (Bug 13b). Precompute now shares the one global freq array with queries and left it nonzero after its final l-sweep, so query 1 read garbage. Fix: zero freq[0..D) once at the end of precompute_checkpoints before the query phase.