From 7c0f230e3de58319bc26558aed5ca3153d5a58fe Mon Sep 17 00:00:00 2001 From: Henri Bourcereau Date: Tue, 10 Mar 2026 08:19:24 +0100 Subject: [PATCH] doc: tensor research --- doc/tensor_research.md | 253 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 253 insertions(+) create mode 100644 doc/tensor_research.md diff --git a/doc/tensor_research.md b/doc/tensor_research.md new file mode 100644 index 0000000..b0d0ede --- /dev/null +++ b/doc/tensor_research.md @@ -0,0 +1,253 @@ +# Tensor research + +## Current tensor anatomy + +[0..23] board.positions[i]: i8 ∈ [-15,+15], positive=white, negative=black (combined!) +[24] active player color: 0 or 1 +[25] turn_stage: 1–5 +[26–27] dice values (raw 1–6) +[28–31] white: points, holes, can_bredouille, can_big_bredouille +[32–35] black: same +───────────────────────────────── +Total 36 floats + +The C++ side (ObservationTensorShape() → {kStateEncodingSize}) treats this as a flat 1D vector, so OpenSpiel's +AlphaZero uses a fully-connected network. + +### Fundamental problems with the current encoding + +1. Colors mixed into a signed integer. A single value encodes both whose checker is there and how many. The network + must learn from a value of -3 that (a) it's the opponent, (b) there are 3 of them, and (c) both facts interact with + all the quarter-filling logic. Two separate, semantically clean channels would be much easier to learn from. + +2. No normalization. Dice (1–6), counts (−15 to +15), booleans (0/1), points (0–12) coexist without scaling. Gradient + flow during training is uneven. + +3. Quarter fill status is completely absent. Filling a quarter is the dominant strategic goal in Trictrac — it + triggers all scoring. The network has to discover from raw counts that six adjacent fields each having ≥2 checkers + produces a score. Including this explicitly is the single highest-value addition. + +4. Exit readiness is absent. 
Whether all own checkers are in the last quarter (fields 19–24) governs an entirely different mode of play. Knowing this explicitly saves the network from summing 18 entries and comparing against 0.

5. dice_roll_count is missing. It is used for the "jan de 3 coups" (the small jan must be filled within 3 dice rolls from the starting position). It's in the Player struct but not exported.

## Key Trictrac distinctions from backgammon that shape the encoding

| Concept                   | Backgammon             | Trictrac                                                  |
| ------------------------- | ---------------------- | --------------------------------------------------------- |
| Hitting a blot            | Removes checker to bar | Scores points, checker stays                              |
| 1-checker field           | Vulnerable (bar risk)  | Vulnerable (battage target) but not physically threatened |
| 2-checker field           | Safe "point"           | Minimum for quarter fill (critical threshold)             |
| 3-checker field           | Safe with spare        | Safe with spare                                           |
| Strategic goal early      | Block and prime        | Fill quarters (all 6 fields ≥ 2)                          |
| Both colors on a field    | Impossible             | Perfectly legal                                           |
| Rest corner (field 12/13) | Does not exist         | Special two-checker rules                                 |

The critical thresholds — 1, 2, 3 — align exactly with TD-Gammon's encoding rationale. Splitting them into binary indicators directly teaches the network the phase transitions the game hinges on.

## Options

### Option A — Separated colors, TD-Gammon per-field encoding (flat 1D)

The minimum viable improvement.

For each of the 24 fields, encode own and opponent separately with 4 indicators each:

own_1[i]: 1.0 if exactly 1 own checker at field i (blot — battage target)
own_2[i]: 1.0 if exactly 2 own checkers (minimum for quarter fill)
own_3[i]: 1.0 if exactly 3 own checkers (stable with 1 spare)
own_x[i]: max(0, count − 3) (overflow)
opp_1[i]: same for opponent
…

Plus the unchanged game-state fields (current player, turn stage, dice, scores); this encoding replaces the current to_vec().
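Concretely, the four indicators for one color on one field can be sketched as a small helper (an illustrative sketch; `encode_field` is an assumed name, and the same helper would run once for own counts and once for opponent counts):

```rust
// Illustrative sketch of the per-field TD-Gammon split described above.
// `count` is the number of checkers of one color on the field (0..=15).
fn encode_field(count: u8) -> [f32; 4] {
    [
        if count == 1 { 1.0 } else { 0.0 }, // blot (battage target)
        if count == 2 { 1.0 } else { 0.0 }, // minimum for quarter fill
        if count == 3 { 1.0 } else { 0.0 }, // stable with one spare
        count.saturating_sub(3) as f32,     // overflow: max(0, count - 3)
    ]
}
```

Running it for both colors over all 24 fields yields the 24 × 8 = 192 board values.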
+ +Size: 24 × 8 = 192 (board) + 2 (dice) + 1 (current player) + 1 (turn stage) + 8 (scores) = 204 +Cost: Tensor is 5.7× larger. In practice the MCTS bottleneck is game tree expansion, not tensor fill; measured +overhead is negligible. +Benefit: Eliminates the color-mixing problem; the 1-checker vs. 2-checker distinction is now explicit. Learning from +scratch will be substantially faster and the converged policy quality better. + +### Option B — Option A + Trictrac-specific derived features (flat 1D) + +Recommended starting point. + +Add on top of Option A: + +// Quarter fill status — the single most important derived feature +quarter_filled_own[q] (q=0..3): 1.0 if own quarter q is fully filled (≥2 on all 6 fields) +quarter_filled_opp[q] (q=0..3): same for opponent +→ 8 values + +// Exit readiness +can_exit_own: 1.0 if all own checkers are in fields 19–24 +can_exit_opp: same for opponent +→ 2 values + +// Rest corner status (field 12/13) +own_corner_taken: 1.0 if field 12 has ≥2 own checkers +opp_corner_taken: 1.0 if field 13 has ≥2 opponent checkers +→ 2 values + +// Jan de 3 coups counter (normalized) +dice_roll_count_own: dice_roll_count / 3.0 (clamped to 1.0) +→ 1 value + +Size: 204 + 8 + 2 + 2 + 1 = 217 +Training benefit: Quarter fill status is what an expert player reads at a glance. Providing it explicitly can halve +the number of self-play games needed to learn the basic strategic structure. The corner status similarly removes +expensive inference from the network. + +### Option C — Option B + richer positional features (flat 1D) + +More complete, higher sample efficiency, minor extra cost. 
+ +Add on top of Option B: + +// Per-quarter fill fraction — how close to filling each quarter +own_quarter_fill_fraction[q] (q=0..3): (count of fields with ≥2 own checkers in quarter q) / 6.0 +opp_quarter_fill_fraction[q] (q=0..3): same for opponent +→ 8 values + +// Blot counts — number of own/opponent single-checker fields globally +// (tells the network at a glance how much battage risk/opportunity exists) +own_blot_count: (number of own fields with exactly 1 checker) / 15.0 +opp_blot_count: same for opponent +→ 2 values + +// Bredouille would-double multiplier (already present, but explicitly scaled) +// No change needed, already binary + +Size: 217 + 8 + 2 = 227 +Tradeoff: The fill fractions are partially redundant with the TD-Gammon per-field counts, but they save the network +from summing across a quarter. The redundancy is not harmful (it gives explicit shortcuts). + +### Option D — 2D spatial tensor {K, 24} + +For CNN-based networks. Best eventual architecture but requires changing the training setup. + +Shape {14, 24} — 14 feature channels over 24 field positions: + +Channel 0: own_count_1 (blot) +Channel 1: own_count_2 +Channel 2: own_count_3 +Channel 3: own_count_overflow (float) +Channel 4: opp_count_1 +Channel 5: opp_count_2 +Channel 6: opp_count_3 +Channel 7: opp_count_overflow +Channel 8: own_corner_mask (1.0 at field 12) +Channel 9: opp_corner_mask (1.0 at field 13) +Channel 10: final_quarter_mask (1.0 at fields 19–24) +Channel 11: quarter_filled_own (constant 1.0 across the 6 fields of any filled own quarter) +Channel 12: quarter_filled_opp (same for opponent) +Channel 13: dice_reach (1.0 at fields reachable this turn by own checkers) + +Global scalars (dice, scores, bredouille, etc.) embedded as extra all-constant channels, e.g. one channel with uniform +value dice1/6.0 across all 24 positions, another for dice2/6.0, etc. Alternatively pack them into a leading "global" +row by returning shape {K, 25} with position 0 holding global features. 
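As an illustration of how one of these channels could be filled, here is a hedged sketch of channel 11 (quarter_filled_own), broadcasting each quarter's fill flag across its six fields; the function name and the 0-based, quarter-contiguous field indexing are assumptions:

```rust
// Sketch of channel 11: 1.0 across all six fields of any fully filled
// own quarter (every field holding >= 2 own checkers), 0.0 elsewhere.
// `counts[i]` is the number of own checkers on field i (0-based).
fn quarter_filled_channel(counts: &[u8; 24]) -> [f32; 24] {
    let mut channel = [0.0f32; 24];
    for q in 0..4 {
        let quarter = &counts[q * 6..(q + 1) * 6];
        if quarter.iter().all(|&c| c >= 2) {
            // broadcast the quarter's fill flag over its six positions
            channel[q * 6..(q + 1) * 6].fill(1.0);
        }
    }
    channel
}
```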
+ +Size: 14 × 24 + few global channels ≈ 336–384 +C++ change needed: ObservationTensorShape() → {14, 24} (or {kNumChannels, 24}), kStateEncodingSize updated +accordingly. +Training setup change needed: The AlphaZero config must specify a ResNet/ConvNet rather than an MLP. OpenSpiel's +alpha_zero.cc uses CreateTorchResnet() which already handles 2D input when the tensor shape has 3 dimensions ({C, H, +W}). Shape {14, 24} would be treated as 2D with a 1D spatial dimension. +Benefit: A convolutional network with kernel size 6 (= quarter width) would naturally learn quarter patterns. Kernel +size 2–3 captures adjacent-field "tout d'une" interactions. + +### On 3D tensors + +Shape {K, 4, 6} — K features × 4 quarters × 6 fields — is the most semantically natural for Trictrac. The quarter is +the fundamental tactical unit. A 2D conv over this shape (quarters × fields) would learn quarter-level patterns and +field-within-quarter patterns jointly. + +However, 3D tensors require a 3D convolutional network, which OpenSpiel's AlphaZero doesn't use out of the box. The +extra architecture work makes this premature unless you're already building a custom network. The information content +is the same as Option D. + +### Recommendation + +Start with Option B (217 values, flat 1D, kStateEncodingSize = 217). It requires only changes to to_vec() in Rust and +the one constant in the C++ header — no architecture changes, no training pipeline changes. The three additions +(quarter fill status, exit readiness, corner status) are the features a human expert reads before deciding their move. + +Plan Option D as a follow-up once you have a baseline trained on Option B. The 2D spatial CNN becomes worthwhile when +the MCTS games-per-second is high enough that the limit shifts from sample efficiency to wall-clock training time. 
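The Option B additions could be appended to to_vec() roughly as follows (a hedged sketch: all names are assumptions, fields are taken as 0-based indices so fields 19–24 map to indices 18..24 and fields 12/13 to indices 11/12, and the opponent counts are assumed already mirrored into the same indexing):

```rust
// Hedged sketch of the 13 Option B values (8 + 2 + 2 + 1) appended
// to the flat tensor; names and indexing are illustrative assumptions.
fn option_b_features(own: &[u8; 24], opp: &[u8; 24], dice_roll_count: u8) -> Vec<f32> {
    let mut v = Vec::with_capacity(13);
    // Quarter fill status, own then opponent (8 values)
    for side in [own, opp] {
        for q in 0..4 {
            let filled = side[q * 6..(q + 1) * 6].iter().all(|&c| c >= 2);
            v.push(if filled { 1.0 } else { 0.0 });
        }
    }
    // Exit readiness: all checkers in fields 19-24, i.e. indices 18..24 (2 values)
    for side in [own, opp] {
        let total: u32 = side.iter().map(|&c| c as u32).sum();
        let in_last: u32 = side[18..24].iter().map(|&c| c as u32).sum();
        v.push(if total > 0 && total == in_last { 1.0 } else { 0.0 });
    }
    // Rest corner status: field 12 -> index 11, field 13 -> index 12 (2 values)
    v.push(if own[11] >= 2 { 1.0 } else { 0.0 });
    v.push(if opp[12] >= 2 { 1.0 } else { 0.0 });
    // Jan de 3 coups counter, normalized and clamped (1 value)
    v.push((dice_roll_count as f32 / 3.0).min(1.0));
    v
}
```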
+ +Costs summary: + +| Option | Size | Rust change | C++ change | Architecture change | Expected sample-efficiency gain | +| ------- | ---- | ---------------- | ----------------------- | ------------------- | ------------------------------- | +| Current | 36 | — | — | — | baseline | +| A | 204 | to_vec() rewrite | constant update | none | moderate (color separation) | +| B | 217 | to_vec() rewrite | constant update | none | large (quarter fill explicit) | +| C | 227 | to_vec() rewrite | constant update | none | large + moderate | +| D | ~360 | to_vec() rewrite | constant + shape update | CNN required | large + spatial | + +One concrete implementation note: since get_tensor() in cxxengine.rs calls game_state.mirror().to_vec() for player 2, +the new to_vec() must express everything from the active player's perspective (which the mirror already handles for +the board). The quarter fill status and corner status should therefore be computed on the already-mirrored state, +which they will be if computed inside to_vec(). + +## Other algorithms + +The recommended features (Option B) are the same or more important for DQN/PPO. But two things do shift meaningfully. + +### 1. Without MCTS, feature quality matters more + +AlphaZero has a safety net: even a weak policy network produces decent play once MCTS has run a few hundred +simulations, because the tree search compensates for imprecise network estimates. DQN and PPO have no such backup — +the network must learn the full strategic structure directly from gradient updates. + +This means the quarter-fill status, exit readiness, and corner features from Option B are more important for DQN/PPO, +not less. With AlphaZero you can get away with a mediocre tensor for longer. With PPO in particular, which is less +sample-efficient than MCTS-based methods, a poorly represented state can make the game nearly unlearnable from +scratch. + +### 2. 
Normalization becomes mandatory, not optional + +AlphaZero's value target is bounded (by MaxUtility) and MCTS normalizes visit counts into a policy. DQN bootstraps +Q-values via TD updates, and PPO has gradient clipping but is still sensitive to input scale. With heterogeneous raw +values (dice 1–6, counts 0–15, booleans 0/1, points 0–12) in the same vector, gradient flow is uneven and training can +be unstable. + +For DQN/PPO, every feature in the tensor should be in [0, 1]: + +dice values: / 6.0 +checker counts: overflow channel / 12.0 +points: / 12.0 +holes: / 12.0 +dice_roll_count: / 3.0 (clamped) + +Booleans and the TD-Gammon binary indicators are already in [0, 1]. + +### 3. The shape question depends on architecture, not algorithm + +| Architecture | Shape | When to use | +| ------------------------------------ | ---------------------------- | ------------------------------------------------------------------- | +| MLP | {217} flat | Any algorithm, simplest baseline | +| 1D CNN (conv over 24 fields) | {K, 24} | When you want spatial locality (adjacent fields, quarter patterns) | +| 2D CNN (conv over quarters × fields) | {K, 4, 6} | Most semantically natural for Trictrac, but requires custom network | +| Transformer | {24, K} (sequence of fields) | Attention over field positions; overkill for now | + +The choice between these is independent of whether you use AlphaZero, DQN, or PPO. It depends on whether you want +convolutions, and DQN/PPO give you more architectural freedom than OpenSpiel's AlphaZero (which uses a fixed ResNet +template). With a custom DQN/PPO implementation you can use a 2D CNN immediately without touching the C++ side at all +— you just reshape the flat tensor in Python before passing it to the network. + +### One thing that genuinely changes: value function perspective + +AlphaZero and ego-centric PPO always see the board from the active player's perspective (handled by mirror()). This +works well. 
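For illustration, one common ego-centric mirroring convention (reverse the field order and swap the checker sign) can be sketched as follows; this is purely an assumption about what a board mirror might do, and the engine's actual mirror() defines the real Trictrac convention:

```rust
// Illustrative ego-centric mirror over the current signed encoding
// (+own / -opp). NOT the engine's mirror(); just one common convention.
fn mirror_counts(counts: &[i8; 24]) -> [i8; 24] {
    let mut m = [0i8; 24];
    for i in 0..24 {
        // field i from the opponent's seat is field 23 - i from ours,
        // and their checkers (negative) become ours (positive)
        m[i] = -counts[23 - i];
    }
    m
}
```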
DQN in a two-player game sometimes uses a canonical absolute representation (always White's view, with an explicit current-player indicator), because a single Q-network estimates action values for both players simultaneously. With the current ego-centric mirroring, the same board position looks different depending on whose turn it is, and DQN must learn both "sides" through the same weights — which it can do, but a canonical representation removes the ambiguity. This is a minor point for a symmetric game like Trictrac, but worth keeping in mind.

Bottom line: Stick with Option B (217 values, normalized to [0, 1]), flat 1D. If you later add a CNN, reshape in Python — there's no need to change the Rust/C++ tensor format. The features themselves are the same regardless of algorithm.
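As a closing sketch, the [0, 1] scaling rules listed in point 2 could be collected into one helper (parameter names are illustrative):

```rust
// Hedged sketch of the per-feature scaling from point 2; every output
// lands in [0, 1]. Parameter names are illustrative assumptions.
fn scale_features(dice: u8, overflow: u8, points: u8, holes: u8, dice_roll_count: u8) -> [f32; 5] {
    [
        dice as f32 / 6.0,                       // dice values 1..=6
        overflow as f32 / 12.0,                  // overflow channel: at most 15 - 3
        points as f32 / 12.0,                    // points 0..=12
        holes as f32 / 12.0,                     // holes 0..=12
        (dice_roll_count as f32 / 3.0).min(1.0), // jan de 3 coups, clamped
    ]
}
```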