From 7c0f230e3de58319bc26558aed5ca3153d5a58fe Mon Sep 17 00:00:00 2001 From: Henri Bourcereau Date: Tue, 10 Mar 2026 08:19:24 +0100 Subject: [PATCH] doc: tensor research --- doc/tensor_research.md | 253 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 253 insertions(+) create mode 100644 doc/tensor_research.md diff --git a/doc/tensor_research.md b/doc/tensor_research.md new file mode 100644 index 0000000..b0d0ede --- /dev/null +++ b/doc/tensor_research.md @@ -0,0 +1,253 @@ +# Tensor research + +## Current tensor anatomy + +[0..23] board.positions[i]: i8 ∈ [-15,+15], positive=white, negative=black (combined!) +[24] active player color: 0 or 1 +[25] turn_stage: 1–5 +[26–27] dice values (raw 1–6) +[28–31] white: points, holes, can_bredouille, can_big_bredouille +[32–35] black: same +───────────────────────────────── +Total 36 floats + +The C++ side (ObservationTensorShape() → {kStateEncodingSize}) treats this as a flat 1D vector, so OpenSpiel's +AlphaZero uses a fully-connected network. + +### Fundamental problems with the current encoding + +1. Colors mixed into a signed integer. A single value encodes both whose checker is there and how many. The network + must learn from a value of -3 that (a) it's the opponent, (b) there are 3 of them, and (c) both facts interact with + all the quarter-filling logic. Two separate, semantically clean channels would be much easier to learn from. + +2. No normalization. Dice (1–6), counts (−15 to +15), booleans (0/1), points (0–12) coexist without scaling. Gradient + flow during training is uneven. + +3. Quarter fill status is completely absent. Filling a quarter is the dominant strategic goal in Trictrac — it + triggers all scoring. The network has to discover from raw counts that six adjacent fields each having ≥2 checkers + produces a score. Including this explicitly is the single highest-value addition. + +4. Exit readiness is absent. 
Whether all own checkers are in the last quarter (fields 19–24) governs an entirely different mode of play. Knowing this explicitly saves the network from summing 18 entries and comparing against 0.

5. dice_roll_count is missing. It is used for the "jan de 3 coups" (the small jan must be filled within 3 dice rolls from the starting position). It's in the Player struct but not exported.

## Key Trictrac distinctions from backgammon that shape the encoding

| Concept                   | Backgammon             | Trictrac                                                  |
| ------------------------- | ---------------------- | --------------------------------------------------------- |
| Hitting a blot            | Removes checker to bar | Scores points, checker stays                              |
| 1-checker field           | Vulnerable (bar risk)  | Vulnerable (battage target) but not physically threatened |
| 2-checker field           | Safe "point"           | Minimum for quarter fill (critical threshold)             |
| 3-checker field           | Safe with spare        | Safe with spare                                           |
| Strategic goal early      | Block and prime        | Fill quarters (all 6 fields ≥ 2)                          |
| Both colors on a field    | Impossible             | Perfectly legal                                           |
| Rest corner (field 12/13) | Does not exist         | Special two-checker rules                                 |

The critical thresholds — 1, 2, 3 — align exactly with TD-Gammon's encoding rationale. Splitting them into binary indicators directly teaches the network the phase transitions the game hinges on.

## Options

### Option A — Separated colors, TD-Gammon per-field encoding (flat 1D)

The minimum viable improvement.

For each of the 24 fields, encode own and opponent separately with 4 indicators each:

own_1[i]: 1.0 if exactly 1 own checker at field i (blot — battage target)
own_2[i]: 1.0 if exactly 2 own checkers (minimum for quarter fill)
own_3[i]: 1.0 if exactly 3 own checkers (stable with 1 spare)
own_x[i]: max(0, count − 3) (overflow)
opp_1[i]: same for opponent
…

Plus the unchanged game-state fields (current player, turn stage, dice, scores); this encoding replaces the current to_vec().
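Concretely, the four indicators for one color on one field can be sketched as a small helper (an illustrative sketch; `encode_field` is an assumed name, and the same helper would run once for own counts and once for opponent counts):

```rust
// Illustrative sketch of the per-field TD-Gammon split described above.
// `count` is the number of checkers of one color on the field (0..=15).
fn encode_field(count: u8) -> [f32; 4] {
    [
        if count == 1 { 1.0 } else { 0.0 }, // blot (battage target)
        if count == 2 { 1.0 } else { 0.0 }, // minimum for quarter fill
        if count == 3 { 1.0 } else { 0.0 }, // stable with one spare
        count.saturating_sub(3) as f32,     // overflow: max(0, count - 3)
    ]
}
```

Running it for both colors over all 24 fields yields the 24 × 8 = 192 board values.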
+ +Size: 24 × 8 = 192 (board) + 2 (dice) + 1 (current player) + 1 (turn stage) + 8 (scores) = 204 +Cost: Tensor is 5.7× larger. In practice the MCTS bottleneck is game tree expansion, not tensor fill; measured +overhead is negligible. +Benefit: Eliminates the color-mixing problem; the 1-checker vs. 2-checker distinction is now explicit. Learning from +scratch will be substantially faster and the converged policy quality better. + +### Option B — Option A + Trictrac-specific derived features (flat 1D) + +Recommended starting point. + +Add on top of Option A: + +// Quarter fill status — the single most important derived feature +quarter_filled_own[q] (q=0..3): 1.0 if own quarter q is fully filled (≥2 on all 6 fields) +quarter_filled_opp[q] (q=0..3): same for opponent +→ 8 values + +// Exit readiness +can_exit_own: 1.0 if all own checkers are in fields 19–24 +can_exit_opp: same for opponent +→ 2 values + +// Rest corner status (field 12/13) +own_corner_taken: 1.0 if field 12 has ≥2 own checkers +opp_corner_taken: 1.0 if field 13 has ≥2 opponent checkers +→ 2 values + +// Jan de 3 coups counter (normalized) +dice_roll_count_own: dice_roll_count / 3.0 (clamped to 1.0) +→ 1 value + +Size: 204 + 8 + 2 + 2 + 1 = 217 +Training benefit: Quarter fill status is what an expert player reads at a glance. Providing it explicitly can halve +the number of self-play games needed to learn the basic strategic structure. The corner status similarly removes +expensive inference from the network. + +### Option C — Option B + richer positional features (flat 1D) + +More complete, higher sample efficiency, minor extra cost. 
+ +Add on top of Option B: + +// Per-quarter fill fraction — how close to filling each quarter +own_quarter_fill_fraction[q] (q=0..3): (count of fields with ≥2 own checkers in quarter q) / 6.0 +opp_quarter_fill_fraction[q] (q=0..3): same for opponent +→ 8 values + +// Blot counts — number of own/opponent single-checker fields globally +// (tells the network at a glance how much battage risk/opportunity exists) +own_blot_count: (number of own fields with exactly 1 checker) / 15.0 +opp_blot_count: same for opponent +→ 2 values + +// Bredouille would-double multiplier (already present, but explicitly scaled) +// No change needed, already binary + +Size: 217 + 8 + 2 = 227 +Tradeoff: The fill fractions are partially redundant with the TD-Gammon per-field counts, but they save the network +from summing across a quarter. The redundancy is not harmful (it gives explicit shortcuts). + +### Option D — 2D spatial tensor {K, 24} + +For CNN-based networks. Best eventual architecture but requires changing the training setup. + +Shape {14, 24} — 14 feature channels over 24 field positions: + +Channel 0: own_count_1 (blot) +Channel 1: own_count_2 +Channel 2: own_count_3 +Channel 3: own_count_overflow (float) +Channel 4: opp_count_1 +Channel 5: opp_count_2 +Channel 6: opp_count_3 +Channel 7: opp_count_overflow +Channel 8: own_corner_mask (1.0 at field 12) +Channel 9: opp_corner_mask (1.0 at field 13) +Channel 10: final_quarter_mask (1.0 at fields 19–24) +Channel 11: quarter_filled_own (constant 1.0 across the 6 fields of any filled own quarter) +Channel 12: quarter_filled_opp (same for opponent) +Channel 13: dice_reach (1.0 at fields reachable this turn by own checkers) + +Global scalars (dice, scores, bredouille, etc.) embedded as extra all-constant channels, e.g. one channel with uniform +value dice1/6.0 across all 24 positions, another for dice2/6.0, etc. Alternatively pack them into a leading "global" +row by returning shape {K, 25} with position 0 holding global features. 
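As an illustration of how one of these channels could be filled, here is a hedged sketch of channel 11 (quarter_filled_own), broadcasting each quarter's fill flag across its six fields; the function name and the 0-based, quarter-contiguous field indexing are assumptions:

```rust
// Sketch of channel 11: 1.0 across all six fields of any fully filled
// own quarter (every field holding >= 2 own checkers), 0.0 elsewhere.
// `counts[i]` is the number of own checkers on field i (0-based).
fn quarter_filled_channel(counts: &[u8; 24]) -> [f32; 24] {
    let mut channel = [0.0f32; 24];
    for q in 0..4 {
        let quarter = &counts[q * 6..(q + 1) * 6];
        if quarter.iter().all(|&c| c >= 2) {
            // broadcast the quarter's fill flag over its six positions
            channel[q * 6..(q + 1) * 6].fill(1.0);
        }
    }
    channel
}
```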
+ +Size: 14 × 24 + few global channels ≈ 336–384 +C++ change needed: ObservationTensorShape() → {14, 24} (or {kNumChannels, 24}), kStateEncodingSize updated +accordingly. +Training setup change needed: The AlphaZero config must specify a ResNet/ConvNet rather than an MLP. OpenSpiel's +alpha_zero.cc uses CreateTorchResnet() which already handles 2D input when the tensor shape has 3 dimensions ({C, H, +W}). Shape {14, 24} would be treated as 2D with a 1D spatial dimension. +Benefit: A convolutional network with kernel size 6 (= quarter width) would naturally learn quarter patterns. Kernel +size 2–3 captures adjacent-field "tout d'une" interactions. + +### On 3D tensors + +Shape {K, 4, 6} — K features × 4 quarters × 6 fields — is the most semantically natural for Trictrac. The quarter is +the fundamental tactical unit. A 2D conv over this shape (quarters × fields) would learn quarter-level patterns and +field-within-quarter patterns jointly. + +However, 3D tensors require a 3D convolutional network, which OpenSpiel's AlphaZero doesn't use out of the box. The +extra architecture work makes this premature unless you're already building a custom network. The information content +is the same as Option D. + +### Recommendation + +Start with Option B (217 values, flat 1D, kStateEncodingSize = 217). It requires only changes to to_vec() in Rust and +the one constant in the C++ header — no architecture changes, no training pipeline changes. The three additions +(quarter fill status, exit readiness, corner status) are the features a human expert reads before deciding their move. + +Plan Option D as a follow-up once you have a baseline trained on Option B. The 2D spatial CNN becomes worthwhile when +the MCTS games-per-second is high enough that the limit shifts from sample efficiency to wall-clock training time. 
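The Option B additions could be appended to to_vec() roughly as follows (a hedged sketch: all names are assumptions, fields are taken as 0-based indices so fields 19–24 map to indices 18..24 and fields 12/13 to indices 11/12, and the opponent counts are assumed already mirrored into the same indexing):

```rust
// Hedged sketch of the 13 Option B values (8 + 2 + 2 + 1) appended
// to the flat tensor; names and indexing are illustrative assumptions.
fn option_b_features(own: &[u8; 24], opp: &[u8; 24], dice_roll_count: u8) -> Vec<f32> {
    let mut v = Vec::with_capacity(13);
    // Quarter fill status, own then opponent (8 values)
    for side in [own, opp] {
        for q in 0..4 {
            let filled = side[q * 6..(q + 1) * 6].iter().all(|&c| c >= 2);
            v.push(if filled { 1.0 } else { 0.0 });
        }
    }
    // Exit readiness: all checkers in fields 19-24, i.e. indices 18..24 (2 values)
    for side in [own, opp] {
        let total: u32 = side.iter().map(|&c| c as u32).sum();
        let in_last: u32 = side[18..24].iter().map(|&c| c as u32).sum();
        v.push(if total > 0 && total == in_last { 1.0 } else { 0.0 });
    }
    // Rest corner status: field 12 -> index 11, field 13 -> index 12 (2 values)
    v.push(if own[11] >= 2 { 1.0 } else { 0.0 });
    v.push(if opp[12] >= 2 { 1.0 } else { 0.0 });
    // Jan de 3 coups counter, normalized and clamped (1 value)
    v.push((dice_roll_count as f32 / 3.0).min(1.0));
    v
}
```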
+ +Costs summary: + +| Option | Size | Rust change | C++ change | Architecture change | Expected sample-efficiency gain | +| ------- | ---- | ---------------- | ----------------------- | ------------------- | ------------------------------- | +| Current | 36 | — | — | — | baseline | +| A | 204 | to_vec() rewrite | constant update | none | moderate (color separation) | +| B | 217 | to_vec() rewrite | constant update | none | large (quarter fill explicit) | +| C | 227 | to_vec() rewrite | constant update | none | large + moderate | +| D | ~360 | to_vec() rewrite | constant + shape update | CNN required | large + spatial | + +One concrete implementation note: since get_tensor() in cxxengine.rs calls game_state.mirror().to_vec() for player 2, +the new to_vec() must express everything from the active player's perspective (which the mirror already handles for +the board). The quarter fill status and corner status should therefore be computed on the already-mirrored state, +which they will be if computed inside to_vec(). + +## Other algorithms + +The recommended features (Option B) are the same or more important for DQN/PPO. But two things do shift meaningfully. + +### 1. Without MCTS, feature quality matters more + +AlphaZero has a safety net: even a weak policy network produces decent play once MCTS has run a few hundred +simulations, because the tree search compensates for imprecise network estimates. DQN and PPO have no such backup — +the network must learn the full strategic structure directly from gradient updates. + +This means the quarter-fill status, exit readiness, and corner features from Option B are more important for DQN/PPO, +not less. With AlphaZero you can get away with a mediocre tensor for longer. With PPO in particular, which is less +sample-efficient than MCTS-based methods, a poorly represented state can make the game nearly unlearnable from +scratch. + +### 2. 
Normalization becomes mandatory, not optional + +AlphaZero's value target is bounded (by MaxUtility) and MCTS normalizes visit counts into a policy. DQN bootstraps +Q-values via TD updates, and PPO has gradient clipping but is still sensitive to input scale. With heterogeneous raw +values (dice 1–6, counts 0–15, booleans 0/1, points 0–12) in the same vector, gradient flow is uneven and training can +be unstable. + +For DQN/PPO, every feature in the tensor should be in [0, 1]: + +dice values: / 6.0 +checker counts: overflow channel / 12.0 +points: / 12.0 +holes: / 12.0 +dice_roll_count: / 3.0 (clamped) + +Booleans and the TD-Gammon binary indicators are already in [0, 1]. + +### 3. The shape question depends on architecture, not algorithm + +| Architecture | Shape | When to use | +| ------------------------------------ | ---------------------------- | ------------------------------------------------------------------- | +| MLP | {217} flat | Any algorithm, simplest baseline | +| 1D CNN (conv over 24 fields) | {K, 24} | When you want spatial locality (adjacent fields, quarter patterns) | +| 2D CNN (conv over quarters × fields) | {K, 4, 6} | Most semantically natural for Trictrac, but requires custom network | +| Transformer | {24, K} (sequence of fields) | Attention over field positions; overkill for now | + +The choice between these is independent of whether you use AlphaZero, DQN, or PPO. It depends on whether you want +convolutions, and DQN/PPO give you more architectural freedom than OpenSpiel's AlphaZero (which uses a fixed ResNet +template). With a custom DQN/PPO implementation you can use a 2D CNN immediately without touching the C++ side at all +— you just reshape the flat tensor in Python before passing it to the network. + +### One thing that genuinely changes: value function perspective + +AlphaZero and ego-centric PPO always see the board from the active player's perspective (handled by mirror()). This +works well. 
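For illustration, one common ego-centric mirroring convention (reverse the field order and swap the checker sign) can be sketched as follows; this is purely an assumption about what a board mirror might do, and the engine's actual mirror() defines the real Trictrac convention:

```rust
// Illustrative ego-centric mirror over the current signed encoding
// (+own / -opp). NOT the engine's mirror(); just one common convention.
fn mirror_counts(counts: &[i8; 24]) -> [i8; 24] {
    let mut m = [0i8; 24];
    for i in 0..24 {
        // field i from the opponent's seat is field 23 - i from ours,
        // and their checkers (negative) become ours (positive)
        m[i] = -counts[23 - i];
    }
    m
}
```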
DQN in a two-player game sometimes uses a canonical absolute representation (always White's view, with an explicit current-player indicator), because a single Q-network estimates action values for both players simultaneously. With the current ego-centric mirroring, the same board position looks different depending on whose turn it is, and DQN must learn both "sides" through the same weights — which it can do, but a canonical representation removes the ambiguity. This is a minor point for a symmetric game like Trictrac, but worth keeping in mind.

Bottom line: Stick with Option B (217 values, normalized to [0, 1]), flat 1D. If you later add a CNN, reshape in Python — there's no need to change the Rust/C++ tensor format. The features themselves are the same regardless of algorithm.
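As a closing sketch, the [0, 1] scaling rules listed in point 2 could be collected into one helper (parameter names are illustrative):

```rust
// Hedged sketch of the per-feature scaling from point 2; every output
// lands in [0, 1]. Parameter names are illustrative assumptions.
fn scale_features(dice: u8, overflow: u8, points: u8, holes: u8, dice_roll_count: u8) -> [f32; 5] {
    [
        dice as f32 / 6.0,                       // dice values 1..=6
        overflow as f32 / 12.0,                  // overflow channel: at most 15 - 3
        points as f32 / 12.0,                    // points 0..=12
        holes as f32 / 12.0,                     // holes 0..=12
        (dice_roll_count as f32 / 3.0).min(1.0), // jan de 3 coups, clamped
    ]
}
```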