Poker Attention

A transformer-based poker AI that rapidly adapts to unknown opponents using attention-based opponent fingerprints. The system features a 4M-parameter model trained via reinforcement learning and supervised learning, with real-time opponent modeling through persistent memory. Built with PyTorch and React, it includes an interactive frontend for visualizing agent behavior, opponent archetypes, and training metrics. The project demonstrates advanced ML engineering with mixed-precision training, efficient inference (~50ms per batch), and comprehensive evaluation tools.

2026
Master’s University Project (Machine Learning module)

PyTorch

React

TypeScript

Python

Poker Attention: Poker is Basically Just a Token Stream

A Master's Project That Got Out of Hand

1. The Hook: "How Hard Can Poker Be?"

It started innocently enough. I was sitting in my Machine Learning module, final project deadline looming, when I thought: "You know what? I'll build a poker bot. How hard can it be?"

Narrator: It was, in fact, quite hard.

Two months later, I had written 10,227 lines of code. Alone. Every single line—from the transformer architecture to the tokenization system, from the PPO training loop to the React frontend with real-time WebSocket updates. The kind of project where you wake up at 3 AM thinking "wait, are my opponent embeddings getting gradients?" and immediately open your laptop to check TensorBoard.

My professor asked for a "simple machine learning application." I delivered a production-grade poker AI with dual-branch transformer attention, opponent fingerprinting, mixed-precision training, and a full-stack web interface.

Complete overkill? Absolutely.

Worth it? Also absolutely.

This is the story of how I taught a transformer to play poker by treating it like a language problem. Because when you think about it—and I've thought about it a lot—poker is just a token stream with money on the line.


2. Why Poker Is the Perfect AI Challenge

Before I explain why I chose poker, let me tell you why poker is genuinely one of the most interesting problems in AI. It's not just cards and chips—it's a microcosm of real-world decision-making under uncertainty.

Imperfect Information: The Fog of War

Unlike chess or Go where you see the entire board, in poker you're flying blind. You have two hole cards that only you can see. Your opponents' cards? Hidden. You're making multi-thousand dollar decisions based on incomplete information, probability, and reading patterns in behavior.

This is huge for AI. Most classic game-playing AIs (Deep Blue, AlphaGo) operate in perfect information environments. Poker is closer to the real world: negotiation, trading, military strategy—anywhere you have to act without knowing what your opponent knows.

Adaptation: The Meta-Game

Here's where it gets interesting: good poker isn't just about playing optimal strategy. It's about adapting to your opponents.

If you're at a table with a player who only bets when they have the nuts? You fold when they bet. Simple.

But if they figure out that you're folding all the time? They start bluffing. Then you start calling. Then they stop bluffing. It's a game of recursive mind-reading.

This opponent modeling problem is what makes poker fascinating from an ML perspective. The best play against one opponent might be the worst play against another. Your AI needs to learn who it's playing against, not just how to play poker.

Psychology: Reading Humans (or AIs)

Poker players talk about "tells"—physical or behavioral cues that reveal information. An AI doesn't have physical tells, but it has behavioral patterns:

  • Bet sizing patterns
  • Aggression frequency
  • Position awareness
  • Reaction to pressure

Can you teach a neural network to recognize these patterns? Can you teach it to exploit them? That's what I wanted to find out.

Sequential Decisions: Every Action Matters

Poker is fundamentally sequential. The same board state can be reached through wildly different action sequences, and those sequences contain crucial information:

Hand A: Player 1 raises, you call, flop comes, P1 bets big
Hand B: Player 1 limps, you raise, flop comes, P1 check-raises big

Same cards on the board, but completely different meanings. In Hand A, P1 showed strength early. In Hand B, they're trapping. The history matters as much as the current state.

This is perfect for transformers, which excel at sequence modeling.

So yeah, "simple" poker bot. Sure. Let's go with that.


3. Poker 101: The Basics

If you already know poker, feel free to skip this section. But if you're coming from an ML background without poker knowledge, here's the crash course you need to understand the rest of this post.

The Players: Hero vs. Villains

In poker terminology:

  • Hero = You (or in my case, the AI agent)
  • Villain(s) = Your opponents

This isn't just colorful language—it reflects the information asymmetry. You know your cards and your strategy. Everyone else is a black box you're trying to model.

Big Blinds: The Currency of Poker

We don't talk in dollars—we talk in Big Blinds (BB).

The blinds are forced bets to start the action:

  • Small Blind (SB): Half a bet (e.g., $1)
  • Big Blind (BB): One full bet (e.g., $2)

If you have $200 in chips and the BB is $2, you have 100 BB. This normalization is crucial because it makes strategy comparable across different stakes. "Raise to 3BB" means the same thing whether you're playing $1/$2 or $100/$200.

This is also how I normalized my training data—everything in BB multiples, not absolute amounts. The model learns relative sizing, not dollar amounts.

Positions: Location, Location, Location

Your position at the table matters massively. Players act in order, and acting last gives you more information:

Position order (earliest to latest):

  1. SB (Small Blind) - Worst position, acts first on all streets after preflop
  2. BB (Big Blind) - Slightly better than SB
  3. MP (Middle Position) - Neutral
  4. CO (Cutoff) - Good position, one before the button
  5. BTN (Button) - Best position, acts last on all post-flop streets

Acting last means you see what everyone else does before you decide. This is huge. You can play way more hands from the button than from the small blind.

My tokenization system encodes position relative to the button, not absolute seat numbers. Because what matters isn't that you're in seat 3—it's that you're 2 positions before the button.
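
To make that concrete, the mapping is essentially one line (a sketch; the function name is mine, not the project's API):

def relative_position(seat: int, button_seat: int, num_players: int) -> int:
    # 0 = on the button, 2 = two seats before the button, etc.
    return (button_seat - seat) % num_players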

Streets: The Four Betting Rounds

A poker hand progresses through four streets:

1. Pre-flop
  • Everyone gets 2 hole cards (private)
  • First betting round
  • No community cards yet
2. Flop
  • 3 community cards revealed
  • Second betting round
  • Now you can make hands (pairs, straights, flushes)
3. Turn
  • 1 more community card (4 total)
  • Third betting round
  • Your hand is almost complete
4. River
  • Final community card (5 total)
  • Final betting round
  • Showdown if multiple players remain

Showdown: If more than one player hasn't folded by the end, everyone reveals their cards. Whoever has the best 5-card hand (using any combination of their 2 hole cards + 5 community cards) wins the pot.

Each street is a checkpoint where the information changes and the optimal strategy shifts. My model treats each street as a separate context in the token stream.

Actions: Your Decision Set

At any decision point, you can:

  • Fold - Give up, lose any chips you've put in
  • Check - Pass action (only if no one has bet)
  • Call - Match the current bet
  • Bet - Put chips in (if no one has bet yet)
  • Raise - Increase the current bet
  • All-in - Bet everything you have

Simple, right? Except bet sizing is continuous (you can bet any amount up to your stack), which means the action space is technically infinite.

More on how I solved that in the tokenization section. Spoiler: discretization with intelligent binning based on BB-relative sizing (e.g., 1BB, 2BB, 4BB) and pot-relative sizing (e.g., half-pot, full pot). This way the model learns that a "half-pot bet" is semantically similar whether it's $10 or $1000—only the relative sizing matters.

The Goal: Extract Maximum Value

You win chips by:

  1. Having the best hand at showdown, OR
  2. Making everyone else fold

Notice that #2 means you don't need good cards—you just need to convince others you have good cards. This is why poker is part psychology, part math.

The objective isn't to win every hand. It's to maximize expected value over many hands. Sometimes folding a decent hand is correct. Sometimes bluffing with garbage is correct. It depends on your opponents, the board texture, the pot size, your position...

This is why rule-based poker bots suck. The decision tree explodes. You need something that can learn from patterns, not hardcoded rules.

You need a transformer. You need attention. You need my solution.

(See what I did there?)


4. The Big Idea: "What If Poker Was Just a Token Stream?"

Here's where I get to the core insight that drove this entire project.

I didn't start with "I want to build a poker bot." I started with "I'm completely obsessed with transformer attention mechanisms and I need an excuse to implement one from scratch."

Poker was the perfect excuse.

Why Transformers? (The Honest Answer)

Let me be completely transparent: I just really, really wanted to work with attention mechanisms.

I'd spent the semester studying BERT, GPT, and all the transformer architectures that were dominating NLP. The idea that you could learn which parts of a sequence matter for each prediction—the whole query-key-value attention mechanism—was so elegant it kept me up at night.

But here's the thing: it wasn't just because I thought they were cool. There are genuine technical reasons why transformers are perfect for poker.

Poker IS a Sequence

Look at a hand from the model's perspective:

Token stream (single hand)
ordered events → next-action prediction

0001  HAND_START              # sequence boundary                 [MARKER]
0002  HERO_HOLE[As Kd]        # you hold Ace-King offsuit         [CARD]
0003  PREFLOP[P1 RAISE 3BB]   # Player 1 raises                   [ACTION]
0004  PREFLOP[P2 FOLD]        # Player 2 folds                    [ACTION]
0005  PREFLOP[HERO CALL 3BB]  # you call                          [ACTION]
0006  FLOP[Ah 7c 2d]          # board update (you have top pair)  [CARD]
0007  FLOP[P1 BET 5BB]        # opponent bets                     [ACTION]
0008  FLOP[HERO ???]          # next token to predict             [DECISION]

This is literally a token sequence. Each event is a discrete token with semantic meaning. The model needs to understand:

  • What cards you have
  • What the board shows
  • What your opponent has done (bet sizing, timing, frequency)
  • What position you're in
  • What the pot size is

And it needs to use attention to figure out which tokens matter for the current decision.

Does that bet on the flop matter? Absolutely.
Does the fact that P2 folded matter? Maybe not for this decision.
Does the fact that P1 raised preflop matter? Yes—it suggests a strong starting hand.

Attention lets the model learn these relevance weights automatically.

The Five Reasons Transformers Are Perfect for This

Let me break down why I chose this architecture:

1. Variable-Length Sequences

Poker hands have different lengths:

  • Preflop all-in: 3-4 tokens
  • Multi-street hand with lots of action: 30+ tokens

CNNs want fixed-size inputs. RNNs handle variable length but struggle with long-term dependencies. Transformers? They eat variable-length sequences for breakfast.

Plus, I wanted the model to potentially learn from previous hands in the session. If the context window is long enough (I used 128 for the hero encoder), you could theoretically condition your current play on what happened 5 hands ago.

Imagine: "This opponent has been super aggressive for the last 10 hands, so I should call down lighter." That's cross-hand learning, and transformers make it possible.

2. Flexible Player Counts

My tokenization system works for any number of players: 2-player heads-up, 6-max, 9-handed, whatever. Same architecture, same model, no retraining needed.

Why? Because transformers don't care about sequence length. They process however many tokens you give them. More players = more action tokens in the sequence, but the model handles it just fine.

Try doing that with a fixed-architecture CNN. I'll wait.

3. Attention Is Opponent Modeling

Here's the magic: I built two separate transformer branches:

  • Hero encoder: Processes the full hand history
  • Opponent encoder: Processes each opponent's action history

Then I use attention pooling to combine them. The attention weights learn: "In this situation, which opponent's behavior is most relevant?"

If there's an aggressive player behind you who might re-raise, the model should focus on their patterns. If there's a passive player who already folded, their actions matter less.

This is exactly what attention was designed for: learning relevance weights in context.

4. Positional Encoding for Street Awareness

Transformers use positional encoding to understand sequence order. In my case, this naturally captures:

  • Street progression (preflop → flop → turn → river)
  • Action order within a street
  • Relative timing of events

The model learns that a bet on the river means something different than the same bet on the flop, even if the bet size is identical. The position in the sequence provides context.
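
For reference, this is the standard sinusoidal encoding from the original transformer paper; a sketch of the mechanism, not necessarily the exact variant in my model:

import math
import torch

def positional_encoding(seq_len: int, dim: int) -> torch.Tensor:
    # Each position gets a unique pattern of sines and cosines across the dims
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)      # [seq_len, 1]
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / dim))                      # [dim / 2]
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe   # added to the token embeddings before the first layer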

5. Interpretability (Bonus)

Attention weights are interpretable. I can visualize which tokens the model focuses on when making decisions. This was invaluable for debugging:

"Why did the model fold here? Oh, it's heavily weighting that earlier raise from the opponent. Makes sense."

"Why did it go all-in? Wait, the attention is uniform across all tokens? The opponent encoder isn't learning anything. Time to debug."

The Key Insight: Treat Poker Like Language

Once you see it, you can't unsee it:

  • Cards are words → "Ace of spades" is a word in the poker vocabulary
  • Actions are words → "Raise to 5BB on the flop" is a phrase
  • Hands are sentences → A complete hand is a sentence with semantic meaning
  • Sessions are documents → Multiple hands form a document with context

Language models predict the next word. My model predicts the next action.

Language models learn from context. My model learns from hand history.

Language models use attention to focus on relevant words. My model uses attention to focus on relevant opponents.

It's the same fundamental architecture, just applied to a different sequence domain.

Now let me show you how I actually built this thing...


5. The Architecture: Dual-Branch Attention

This is where the rubber meets the road. I had the big idea (poker = tokens), and now I needed to design an architecture that could actually pull it off.

The core insight was this: poker decisions depend on two fundamentally different things that need different processing.

  1. Hand understanding - What cards do I have? What's on the board? What's the pot size? This requires understanding the full narrative of the hand.

  2. Opponent patterns - Who's betting? How often? How aggressive? This requires modeling individual players separately and then figuring out which ones matter right now.

If I just threw everything into one transformer, these signals would get mixed together. I needed to separate concerns.

The Problem Statement

Imagine you're playing poker with someone you've never seen before. On the first hand, they raise from the button and you have to decide whether to call.

You don't know anything about them yet. So your decision is based on:

  • Your hand strength (pocket aces? bottom pair?)
  • Your position (can you act last?)
  • The pot odds (is it worth calling?)

But by hand 10, you've seen this opponent's patterns:

  • They raise from the button 70% of the time (loose!)
  • When they face a 3-bet, they always fold (weak to aggression)
  • They bet 2.5x the pot with value but 0.8x with bluffs (readable sizing)

Your decision-making has adapted. You're no longer playing "optimal poker"—you're exploiting this specific opponent's weaknesses.

The question was: How do I architect a model to learn and use these insights?

The Solution: Two Encoder Branches

I decided on a dual-branch architecture:

(architecture diagram: hand history → hero encoder; per-opponent action histories → opponent encoders; attention pooling merges both into the action head)

Why two branches?

  1. Hero encoder learns poker fundamentals: hand strength, position, board texture, stack depth. These are universal truths that apply to any opponent.

  2. Opponent encoder learns opponent behavior patterns: aggression frequency, bet sizing preferences, position adjustments. These are opponent-specific.

  3. Attention pooling learns context-dependent relevance: which opponents matter right now?

The Hero Encoder (Left Branch)

Architecture:

  • 4 transformer layers
  • 8 attention heads
  • 160-dimensional embeddings
  • Max sequence length: 128 tokens

Input: Full hand history—every card, every action, every bet.

Output: A 160-dimensional vector representing "what I've learned about this situation."
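
In PyTorch terms, the branch looks roughly like this (a sketch using the stock encoder module; the real model wraps it with the factorized embeddings and masking, and dim_feedforward is my assumption of 4× d_model):

import torch.nn as nn

# Hero branch: 4 layers, 8 heads, 160-dim embeddings (numbers from the list above)
hero_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(
        d_model=160, nhead=8, dim_feedforward=640, batch_first=True,
    ),
    num_layers=4,
)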

Why 4 layers?

I started with 2 layers and the model was too weak. With 8 layers it was overkill and slow. 4 was the sweet spot—enough capacity to model poker complexity without overfitting to opponent-specific quirks.

Why 160-dim?

This is the "compression rate" of hand understanding. I tried:

  • 64-dim: Too constrained, underfitting
  • 160-dim: Good tradeoff between expressiveness and generalization
  • 256-dim: Slightly better but with marginal gains and slower inference

Honestly? 160 became my magic number because my local GPU could barely handle it. I wasn't running on a 5-GPU cluster—I was on a single RTX 4070 training from my apartment. Going to 256-dim meant training took 30% longer and I could only fit smaller batch sizes. Given the deadline pressure, 160-dim was the sweet spot between "good enough performance" and "can actually run on my hardware without waiting 4 hours per epoch."

(This is the part of ML research nobody talks about: half your architecture decisions are made because of hardware constraints, not theoretical optimality.)

The Opponent Encoder (Right Branch)

Architecture:

  • 2 transformer layers
  • 4 attention heads
  • 64-dimensional embeddings per opponent
  • Max sequence length: 15 tokens per opponent

Input: For each opponent, their recent action history (up to 15 actions).

Output: A 64-dimensional "opponent fingerprint" for each opponent.

Why separate from hero encoder?

Because opponent patterns and hand understanding are learned from different training signals. If I mixed them:

  • The opponent patterns might drown out poker fundamentals (or vice versa)
  • It would be harder to debug what the model is learning
  • Information from a folded opponent shouldn't influence hand understanding

By splitting them, the model can learn two things independently and combine them intelligently.

Why 2 layers for opponent encoder?

Opponent behavior is simpler than full hand understanding. You're looking at action sequences, not complex board textures or position dynamics. 2 layers is enough to recognize patterns like "this player raises 40% of the time" or "they bet half-pot when strong, 1/4 pot when weak."

Attention Pooling (The Magic)

This is where the two branches meet. I have:

  • Hero understanding: 160-dim vector
  • Opponent fingerprints: 6 × 64-dim (for 6-max poker)

I can't just concatenate them—that gives me 160 + 384 = 544 dimensions, which is overkill. Instead, I use attention pooling:

# Pseudocode (PyTorch-style)
hero_vector = hero_encoder(hand_history)          # shape: [160]
opponent_vectors = torch.stack([
    opponent_encoder(opp_actions)                 # each: [64]
    for opp_actions in opponent_histories
])                                                # shape: [6, 64]

# Attention: which opponents matter?
# hero_to_query is a learned [160] -> [64] projection so the dot product lines up
query = hero_to_query(hero_vector)                # [64]
attention_weights = torch.softmax(opponent_vectors @ query, dim=0)
# shape: [6], sums to 1

opponent_context = attention_weights @ opponent_vectors
# Weighted sum of fingerprints: [64]

# Combine
combined = torch.cat([hero_vector, opponent_context])  # [160 + 64] = [224]
action_logits = output_head(combined)             # [num_actions]

What's happening:

The hero understanding acts as a query. For each opponent, we compute how "relevant" they are to this situation. If you have top pair on the flop and there's an aggressive opponent behind you, that opponent gets high attention weight.

The output is a weighted average of opponent fingerprints—a single 64-dim vector representing "the opponent context I should focus on right now."

Then I concatenate hero (160) + opponent context (64) = 224-dim, and a linear layer predicts the action.

Why This Matters

Before you object with "Why not just use one big transformer?", let me explain why this separation is crucial:

  1. Gradient flow: The opponent branch only gets gradients when opponent information helps. If an opponent has folded, their branch might get zero gradients for that hand—which is correct.

  2. Interpretability: I can visualize "which opponent is the model focusing on?" by looking at attention weights.

  3. Generalization: The hero encoder learns poker fundamentals that transfer across opponents. The opponent encoder learns patterns that generalize to new opponent types.

  4. Scalability: Adding a new opponent doesn't require retraining. Just compute their fingerprint and let attention pooling handle the rest.

The Numbers

Total parameters: 4 million

Breakdown:

  • Hero encoder embeddings: 1.2M
  • Hero encoder layers: 1.8M
  • Opponent encoders: 0.6M
  • Attention pooling: ~10K
  • Output head: ~50K

Model size on disk: 16MB

Inference time (GPU): ~50ms per decision

This is small enough to run on mobile, fast enough for real-time play.


6. Tokenization: The Make-or-Break Design

If architecture is the skeleton of the project, tokenization is the DNA.

Get tokenization wrong and the entire system collapses. The model could be perfect, but if it's reading gibberish tokens, it will learn nonsense.

This is why I spent so much time on token design. I probably spent 30-40% of the project time here, and it was absolutely worth it.

The Challenge: Infinite Action Space

In chess, there are ~30 legal moves per position. The action space is small and discrete.

In poker, bet sizing is continuous. You can bet any amount from $1 to your entire stack. That's functionally infinite actions.

Some previous poker AIs tried to handle this by:

  • Regression: Predict a continuous bet size. Problem: It's unstable, and you can't really "softmax" over a continuous space.
  • Fixed bins: 10 discrete bet sizes (e.g., 0.5x, 1x, 2x, ...). Problem: Doesn't generalize to different stack sizes or pot sizes.
  • Raw amounts: Predict in dollars. Problem: Model trained on $1/$2 poker doesn't work on $10/$20.

My solution: Mixed-radix factorized tokenization with BB-relative and pot-relative binning.

Token Anatomy

A token is not a single integer—it's a composite of factors. Think of it like a hash code:

TOKEN = player_id + street * 10 + action_type * 100 + bet_size_bin * 1000

Actually, the math is more sophisticated, but the idea is the same: each token encodes multiple pieces of information.
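
A hedged sketch of what "more sophisticated" means: a mixed-radix encoding where each factor's base is its own cardinality (factor names and sizes here are illustrative):

FACTORS = [("action", 16), ("player", 6), ("street", 4), ("bet_bin", 12)]

def encode_token(**fields) -> int:
    token, base = 0, 1
    for name, size in FACTORS:
        value = fields[name]
        assert 0 <= value < size, f"{name} out of range: {value}"
        token += value * base   # each factor occupies its own "digit"
        base *= size
    return token

# encode_token(action=5, player=2, street=1, bet_bin=0) → one unique small integer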

Token types:

  1. MARKER tokens - Structural (HAND_START, STREET_CHANGE, HAND_END)
  2. CARD tokens - Observations (specific card: As, Kd, etc.)
  3. ACTION tokens - Behavior (who did what)

Why Factorized Tokens?

If I used a flat vocabulary where every unique combination had its own token ID, the vocab would explode:

Players × Streets × Actions × Bet-bins × Pot-odds × Board-textures × ...
= 6 × 4 × 6 × 12 × 8 × 20 × ... = millions of tokens

That's memory-inefficient and poor generalization.

Instead, I factor each token:

ACTION_TOKEN = encode(
    player_id: 0-5,           # Which opponent acted
    street: PREFLOP/FLOP/..., # Which street
    action_type: RAISE/BET/...,
    bet_size_bin: 0-11,       # BB-relative or pot-relative
    is_pot_relative: bool,    # Which type of bin
    position: 0-5             # Position relative to button
)

The killer detail: Each factor gets its own embedding:

# Instead of:
embedding = embed_layer(token_id)  # [vocab_size, dim] = [1M, 160]

# Do:
embedding = (
    embed_player(player_id) +
    embed_street(street) +
    embed_action(action_type) +
    embed_bet_size(bet_size_bin) +
    embed_position(position)
)  # All embeddings are [batch, 160]

This reduces parameters from 1M to ~100K for token embeddings, and it teaches the model compositionality. "RAISE on FLOP at different bet sizes" are treated as related concepts, not completely different tokens.
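
As a module, the factorized embedding can be sketched like this (the class name is mine; the factor sizes match the fields above):

import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    def __init__(self, dim: int = 160):
        super().__init__()
        # One small table per factor instead of one giant table per token ID
        self.player = nn.Embedding(6, dim)
        self.street = nn.Embedding(4, dim)
        self.action = nn.Embedding(16, dim)
        self.bet_bin = nn.Embedding(12, dim)
        self.position = nn.Embedding(6, dim)

    def forward(self, player, street, action, bet_bin, position):
        # Summing the factor embeddings is what buys compositionality
        return (self.player(player) + self.street(street) + self.action(action)
                + self.bet_bin(bet_bin) + self.position(position))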

Mixed BB-Relative and Pot-Relative Bins: 16 Actions

This is where the magic happens. I don't use two separate binning systems—instead, I use a single 16-way action space that mixes both BB-relative and pot-relative sizing:

0  FOLD
1  CHECK
2  CALL
3  BET_MIN           # Minimum legal bet
4  BET_1BB           # 1x Big Blind
5  BET_2BB           # 2x Big Blind
6  BET_4BB           # 4x Big Blind
7  BET_HALF_POT      # Half the pot
8  BET_POT           # Full pot (pot-sized bet)
9  RAISE_MIN         # Minimum legal raise
10 RAISE_1BB         # Raise to 1BB over call
11 RAISE_2BB         # Raise to 2BB over call
12 RAISE_4BB         # Raise to 4BB over call
13 RAISE_HALF_POT    # Raise to half-pot
14 RAISE_POT         # Raise pot-sized
15 ALL_IN

Why this mix?

The action space captures both initiation and response patterns:

  • BB-relative (1BB, 2BB, 4BB) captures "how many blinds am I betting?" which is natural for humans and generalizes across stakes
  • Pot-relative (half-pot, pot, min) captures "what fraction of the pot am I committing?" which is the key strategic concept

A player betting 1BB preflop (action 4) is semantically different from a player betting 1BB on the flop when the pot is 20BB. The first is aggressive preflop, the second is passive postflop. By having explicit pot-relative actions (7, 8, 13, 14), the model can learn these contextual distinctions.

Example:

Hand 1: Hero bets 2BB preflop → action 5 (BET_2BB)
Hand 2: Hero bets 2BB on the flop into a 4BB pot → maps to action 7 (BET_HALF_POT)

Different actions, because the meaning is different. But the model learns
that both are "moderate aggression" through the embedding space.

The key insight: the model learns that sizing relative to the pot is what matters for decision-making. Whether you're betting $20 or $2000, a "pot-sized bet" means the same thing strategically.

Example:

Hand 1: $1/$2 game (BB = $2), hero bets $8 → 4 BB
Hand 2: $10/$20 game (BB = $20), hero bets $80 → 4 BB

Both encode to the same token: BET_4BB
Model learns they're the same concept across stakes, not different actions

This dramatically improves generalization across different stakes and stack sizes.
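
Here's a hedged sketch of how a raw bet could map into the 16-way space (the thresholds are illustrative; the project's exact cutoffs may differ):

def bucket_bet(bet_bb: float, pot_bb: float, is_raise: bool) -> int:
    base = 9 if is_raise else 3                  # RAISE_* block vs. BET_* block
    ratio = bet_bb / pot_bb if pot_bb > 0 else 0.0
    if 0.4 <= ratio <= 0.6:
        return base + 4                          # *_HALF_POT (action 7 or 13)
    if 0.9 <= ratio <= 1.1:
        return base + 5                          # *_POT (action 8 or 14)
    # Otherwise pick the nearest BB-relative bin: 1BB, 2BB, 4BB
    for action, size in zip(range(base + 1, base + 4), (1.0, 2.0, 4.0)):
        if bet_bb <= size * 1.5:
            return action
    return 15                                    # bigger than everything: ALL_IN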

Validation: Token Ranges and Debugging

I spent days debugging token out-of-bounds errors. To prevent this, I built validation:

assert 0 <= player_id < num_players, f"Invalid player: {player_id}"
assert 0 <= street < len(STREETS), f"Invalid street: {street}"
assert 0 <= bet_size_bin < len(BET_BINS), f"Invalid bin: {bet_size_bin}"

And I built decoders to reconstruct actions from tokens:

def decode_token(token: int) -> dict:
    kind = token % NUM_TOKEN_KINDS
    rest = token // NUM_TOKEN_KINDS
    player = rest % NUM_PLAYERS
    street = (rest // NUM_PLAYERS) % NUM_STREETS
    # ... remaining factors peel off the same way
    return {"kind": kind, "player": player, "street": street}

This let me print out the token stream for every hand and verify it made sense:

HAND_START
HERO_HOLE[As Kd]
PREFLOP: P1(BTN) RAISE 3BB
PREFLOP: P2(SB) FOLD
PREFLOP: HERO(BB) CALL 3BB
FLOP[Ah 7c 2d]
FLOP: P1 BET 5BB (pot-relative)
FLOP: HERO [???]

Human-readable transcripts caught bugs that raw numbers wouldn't.

Normalization: The Unsung Hero

Beyond binning, I normalized everything:

Effective stack depth:

Hero stack → min(hero_stack, largest_opponent_stack) / BB
Capped at 300 BB to prevent outliers

Pot size:

Current pot → pot / BB
Capped at 1000

Position:

Hero position → position relative to button
Not absolute seat number

Why? Because only the effective stack matters strategically: if the shorter stack covers 50 BB, the hand plays like a 50 BB hand whether your own stack is 50 BB or 500 BB. Normalization reduces the input distribution variance and helps the model generalize.
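
In code, the whole normalization step is only a few lines (a sketch; the caps come from the numbers above):

def normalize_features(hero_stack: float, opp_stacks: list[float],
                       pot: float, bb: float) -> tuple[float, float]:
    effective = min(hero_stack, max(opp_stacks)) / bb    # effective stack in BB
    pot_bb = pot / bb                                    # pot in BB
    return min(effective, 300.0), min(pot_bb, 1000.0)    # cap outliers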


7. Training: The Journey from Supervised Learning to PPO

Training was the part that taught me the most, mostly because it broke in spectacular ways.

I split training into two phases: supervised learning (copy opponent archetype strategies) and reinforcement learning (learn to beat them).

Phase 1: Supervised Learning

The idea: Generate 50,000 hands of poker with known opponent archetypes (Nit, TAG, Maniac, etc.), then train the model to predict what action they took.

This is basically imitation learning. I'm not teaching the model to be a good poker player—I'm teaching it to recognize patterns in how different opponents play.

Data generation:

uv run python -m poker_attention.training.data_generator \
  --out training_data_50k.npz \
  --num-examples 50000 \
  --num-players 6 \
  --num-actions 16 \
  --opponent-archetypes all \
  --use-memory

This generates 50,000 hands with a diverse set of opponent archetypes. The model learns to recognize and exploit:

  • Nit (only plays AA/KK): Super tight, easy to exploit
  • TAG (Tight-Aggressive): Solid poker, plays top 15% of hands aggressively
  • Maniac (raises everything): Chaos player, plays 100% of hands
  • Calling Station (never folds): Recreational player, bleeds chips
  • Loose-Passive (plays loose but checks often): Weak and predictable
  • Random (legally random): Baseline for comparison
  • Squeezer (aggressive 3-bettor): Specialized strategy, punishes weak opening raises
  • Bluff-Catcher (calls down light): Opposite of Nit, calls everything
  • Adaptive (adjusts to hero's play): Learns and exploits weaknesses
  • Tilt Player (emotional, changes based on recent results): Models human psychology

Training config:

uv run python -m poker_attention.training.train_supervised \
  --data training_data_50k.npz \
  --lr 0.0003 \
  --weight-decay 0.01 \
  --grad-clip 1.0 \
  --epochs 3 \
  --batch-size 32 \
  --device cuda \
  --amp \
  --seed 0 \
  --save model_supervised.pt

Results:

  • Accuracy on training set: 68%
  • Accuracy on validation set: 61%
  • Loss curves smooth and convergent
  • Training time: ~30 minutes on GPU

What does 61% accuracy mean? The model is predicting from 6 action types (fold/check/call/bet/raise/all-in), so random chance is 16%. 61% means it's learning real patterns.

But here's the thing: supervised learning teaches you what opponents do, not how to beat them.

The model learned "when the Nit has AA, they bet 3x." But it didn't learn "I should exploit the Nit by betting more against them." That requires reinforcement learning.

Phase 2: Reinforcement Learning with PPO

This is where it got interesting (and where I hit a massive wall).

The idea: Use PPO (Proximal Policy Optimization) to train the hero against a population of opponents in tournament play. The hero gets rewarded for winning chips and punished for losing them.

Initial attempt (Disaster):

uv run python -m poker_attention.training.train_rl \
  --init-checkpoint model_supervised.pt \
  --total-updates 50 \
  --num-players 6 \
  --device cuda

Training ran. Seemed fine. But then I checked the tournament logs...

{"update":1,"hand":0,"busted_seat":0,"stack_before_bust":200.0}
{"update":1,"hand":0,"busted_seat":1,"stack_before_bust":200.0}
{"update":1,"hand":0,"busted_seat":2,"stack_before_bust":200.0}
{"update":1,"hand":0,"busted_seat":3,"stack_before_bust":200.0}

Everyone was busting on hand 0, with starting stacks of 200 chips. That meant everyone was going all-in preflop. The model had learned the degenerate strategy: "Call preflop, then all-in every hand."

Why does this "work"? Against weak opponents:

  • All-in preflop wins ~50% of the time (best hand or others fold)
  • Tournament format: Winner takes all chips → huge reward spikes
  • Simplified credit assignment: Fewer decisions = easier learning

But it's completely wrong poker. The model skipped learning entirely.

The Root Cause Analysis (3 days of debugging):

  1. Entropy collapse: The policy entropy dropped from 1.83 (healthy exploration) to 0.34 (near-deterministic).
  2. No penalties: Folding had 0 cost, all-in had 0 cost.
  3. Weak opponents: Default archetypes included passive types that don't punish aggression.
  4. Learning rate: Too high (default 3e-4) → converged to local optimum too fast.

The Fix (7 Critical Changes):

I rewrote the training script with:

uv run python -m poker_attention.training.train_rl \
  --init-checkpoint model_supervised.pt \
  --entropy-coef 0.05 \
  --fold-penalty 0.02 \
  --all-in-penalty 0.05 \
  --opponent-archetypes "tag,nit,maniac,station" \
  --ppo-lr 1e-4 \
  --clip-ratio 0.1 \
  --total-updates 200 \
  --device cuda

# Changes from the first attempt:
#   --entropy-coef   0.01 → 0.05  (force exploration)
#   --fold-penalty   0.0  → 0.02  (discourage passivity)
#   --all-in-penalty 0.0  → 0.05  (discourage recklessness)
#   --ppo-lr         3e-4 → 1e-4  (slower convergence)
#   --clip-ratio     0.2  → 0.1   (smaller updates)
#   --total-updates  50   → 200   (longer training)

What each fix does:

  1. Entropy coefficient 0.05: Rewards the model for maintaining diverse actions. "You'll get a bonus for not being too sure about your actions." This prevents the model from collapsing to a single strategy.

  2. Fold penalty 0.02: Small cost per fold. Encourages the model to play more hands actively instead of folding everything.

  3. All-in penalty 0.05: Penalizes going all-in. Forces the model to learn other bet sizes and strategies.

  4. Diverse opponents: Mix of TAG (plays tight, strong), Nit (super tight), Maniac (plays loose, aggressive), and Calling Station (calls everything). This prevents the model from finding a degenerate strategy that works against one archetype.

  5. Lower learning rate (1e-4): Slower gradient updates mean the model converges more gradually. Less likely to overshoot into local optima.

  6. Smaller clip ratio (0.1): PPO clips policy updates to prevent catastrophic forgetting. 0.1 is smaller than standard 0.2, giving more conservative updates.

  7. Longer training (200 updates): More time to explore different strategies.
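
To make the shaping concrete, here's roughly where the first three fixes land (a hedged sketch with illustrative names, not the repo's exact code):

FOLD, ALL_IN = 0, 15   # indices in the 16-way action space

def shaped_reward(chips_won_bb: float, action: int,
                  fold_penalty: float = 0.02, all_in_penalty: float = 0.05) -> float:
    reward = chips_won_bb
    if action == FOLD:
        reward -= fold_penalty      # discourage blanket folding
    elif action == ALL_IN:
        reward -= all_in_penalty    # discourage jam-only play
    return reward

# In the PPO loss, the entropy bonus (coef 0.05) is subtracted, so higher
# entropy lowers the loss:
#   loss = policy_loss + value_coef * value_loss - entropy_coef * entropy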

Results after the fix:

Training logs now showed:

  • Entropy decreasing gradually (healthy learning curve)
  • Mix of actions in tournament play (fold, call, raise, bet)
  • Multiple players busting at different hands (not everyone on hand 0)
  • Win rates improving against specific opponent types

The model learned real poker, not a degenerate trick.


8. Opponent Archetypes: Teaching the AI Psychology

At this point, the model can learn poker. But what's it learning against?

I built a "villain gallery"—ten distinct opponent archetypes that exhibit different playing styles. These aren't just random noise; they're calibrated to be realistic poker strategies that force the hero to learn different exploitations.

The Villain Gallery

Each archetype is implemented as a heuristic policy with parametrized difficulty. Here's the complete lineup:

1. The Nit

"I only play premium hands"

  • Plays: AA, KK, QQ, AK (top 1.5% of hands)
  • Strategy: Tight and passive
  • Exploitable by: Stealing blinds with weak hands
  • Behavior: Only bets when they have monsters

Why include this? Because real poker tables have nits. The hero learns that against super tight opponents, you can steal blinds relentlessly.

2. TAG (Tight-Aggressive)

"I play solid poker"

  • Plays: Top 15% of hands
  • Strategy: Tight preflop, aggressive postflop
  • Exploitable by: Tricky play, reading bet sizing
  • Behavior: Raises with strong hands, bets large when confident

This is closest to "optimal" poker. If the hero can beat TAGs consistently, it's learning real poker, not exploiting recreational quirks.

3. Maniac

"YOLO poker"

  • Plays: Every hand
  • Strategy: Hyper-aggressive
  • Exploitable by: Tightening up, value betting
  • Behavior: Raises 80% of the time, bet sizes massive and random

Pure chaos. But chaos is predictable. The hero learns: "when someone is raising 80% of the time, re-raise with marginal hands."

4. Calling Station

"I never fold"

  • Plays: Most hands
  • Strategy: Loose/Passive
  • Exploitable by: Value betting everything
  • Behavior: Always calls, rarely raises, never bluffs

The opposite of optimal, but realistic—recreational players who just want to see cards. The hero exploits by betting for value aggressively.

5. Loose-Passive

"I'll call, but I won't raise"

  • Plays: Top 40% of hands
  • Strategy: Loose hand selection, passive betting
  • Exploitable by: Aggressive pressure and positioning
  • Behavior: Calls frequently, rarely bets or raises

Weak and predictable. The hero learns to apply relentless pressure and dominate the table.

6. Random

"Legally random"

  • Plays: Any legal action with equal probability
  • Strategy: Injects entropy (anti-jam)
  • Exploitable by: Not the point
  • Behavior: Deliberately unpredictable within legal actions

This one isn’t here as a “baseline.” It’s an anti-jam opponent: by occasionally doing weird-but-legal things, it prevents the training loop from collapsing into brittle patterns like “everyone always jams preflop” or “the meta converges to one degenerate line.” The hero still tends to beat it, but the real value is that it keeps the policy honest.

7. Squeezer

"I 3-bet aggressively"

  • Strategy: Specialized aggressive 3-betting
  • Exploitable by: Wider opening ranges, tighter calling ranges
  • Behavior: Re-raises frequently when facing opens

Teaches position dynamics and how to handle aggressive re-raising preflop.

8. Bluff-Catcher

"I call everything down"

  • Strategy: Opposite of Nit—calls with very wide ranges
  • Exploitable by: Bluffing less, value betting more
  • Behavior: Calls light, rarely folds, catches bluffs

The hero learns when bluffing is profitable and when to just value bet into a calling range.

9. Adaptive

"I learn from you"

  • Strategy: Adjusts play based on hero's tendencies
  • Exploitable by: Balancing your own play, varying bet sizing
  • Behavior: Tightens/loosens based on how exploitable you are

The hardest opponent—forces the hero to play balanced poker instead of one-dimensional exploitation.

10. Tilt Player

"My emotions control me"

  • My favorite archetype
  • Base strategy: Tight-aggressive
  • Behavior: Switches between aggressive and passive based on recent results
  • When ahead: Becomes passive (afraid of losing)
  • When behind: Becomes aggressive (trying to win it back)

def tilt_update(tilt_level: float, result: str, fluctuation: float = 0.08) -> float:
    if result == "win":
        return tilt_level - fluctuation   # winning makes them more passive
    if result == "loss":
        return tilt_level + fluctuation   # losing makes them more aggressive
    return tilt_level

Parameters:

  • --tilt-fluctuation 0.08: How much wins/losses swing behavior
  • --tilt-randomness 0.03: Random variance in tilt
  • --tilt-retain 0.90: How persistent tilt is (decays slowly)
  • --tilt-baseline 0.50: Base aggression level
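
Putting the four knobs together, one plausible per-hand update looks like this (a sketch of my reading of the parameters, not the exact code):

import random

TILT_FLUCTUATION, TILT_RANDOMNESS = 0.08, 0.03   # values from the flags above
TILT_RETAIN, TILT_BASELINE = 0.90, 0.50

def step_tilt(tilt: float, won: bool) -> float:
    tilt = TILT_RETAIN * tilt + (1 - TILT_RETAIN) * TILT_BASELINE  # slow decay toward baseline
    tilt += -TILT_FLUCTUATION if won else TILT_FLUCTUATION         # wins calm, losses enrage
    tilt += random.uniform(-TILT_RANDOMNESS, TILT_RANDOMNESS)      # random variance
    return max(0.0, min(1.0, tilt))                                # clamp to [0, 1]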

This is realistic because humans tilt. Teaching the hero to exploit tilt (loosen up and steal when they've just won and turned passive, tighten up and value bet when they've just lost and turned aggressive) teaches real poker psychology.

Why Fixed Archetypes?

You might ask: "Why not just train all the opponents with RL too?"

Good question.

  1. Fast iteration: No training overhead, generate data instantly
  2. Diverse training data: I control how many Nits vs. Maniacs are in my data
  3. Interpretability: I know exactly what each opponent does
  4. Baseline: Compare the hero's learning against known reference strategies

Measuring Skill Against Archetypes

To evaluate the hero, I run 500-hand tournaments and measure:

Win rate (BB/100): big blinds won per 100 hands, the standard poker win-rate metric

(results chart: hero win rate in BB/100 against each archetype)

A positive BB/100 means the hero is winning. These numbers are reasonable—it's beating weak opponents significantly but struggling against good ones.

9. The Frontend: Making Attention Visible

I didn’t build the web demo because I wanted to become a frontend developer. I built it because I needed a way to see what the agent thought it was doing.

If you’ve ever trained an RL policy, you know the feeling:

  • Loss curves look “fine.”
  • Rewards kind of go up.
  • The model still plays like a drunk raccoon.

So I built a small React UI that turns my model into something I can interrogate.

The Loop

The architecture is intentionally simple:

  • A FastAPI backend exposes a WebSocket at /ws.
  • The browser connects with query params like ?checkpoint=...&num_players=...&opponent_archetypes=....
  • The backend streams structured events: hand_started, action_applied, and—most importantly—hero_decision.

There’s also a quality-of-life endpoint to list checkpoint files (/api/models), so I can click through experiments without remembering whether the model was called tmp_model_16.pt or tmp_model_16_amp_smoke_v2_final_really_final.pt.
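
The server side of that loop is small; here's a hedged sketch of its shape (make_session and its play() generator are stand-ins for the real session code):

from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/ws")
async def demo_socket(ws: WebSocket):
    await ws.accept()
    checkpoint = ws.query_params.get("checkpoint")   # e.g. ?checkpoint=model.pt
    session = make_session(checkpoint)               # hypothetical helper
    async for event in session.play():               # yields dicts with a "type" field
        # "type" is one of: hand_started, action_applied,
        # tokens_delta, hero_decision, hand_ended
        await ws.send_json(event)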

What the UI Shows

The web demo is basically a poker table plus an x-ray view of the agent:

  • Seat map + stacks (updated every action)
  • Action log (so you can reconstruct the story of the hand)
  • Opponent labels (archetype per seat, color-coded)
  • Board + revealed cards at showdown
  • A hero decision panel with:
    • action names
    • raw probabilities
    • masked probabilities (after removing illegal actions)
    • chosen action
    • opponent-memory lengths per seat
    • the current token count and, optionally, decoded tokens

That last panel is why the UI exists. When the model makes a move, I can tell whether it was:

  • confident vs. uncertain
  • constrained by legality
  • reacting to a specific opponent (via memory lengths and token deltas)

The Unexpected Benefit: Debugging Tokenization

Tokenization bugs are brutal because they don’t crash. They just… slowly poison your model.

Having a UI that can show the latest appended tokens—decoded into human-readable fields—made it dramatically easier to catch issues like “a bin went out of range” or “the token stream stopped updating at hand boundaries.”

The frontend didn’t make the model smarter. It made me smarter.

10. The Cool Technical Details (That Were Way More Important Than They Sound)

This project has a bunch of engineering details that are easy to dismiss as “implementation choices”… until you break them and the whole thing collapses.

1. Action Masking: Legal Poker or Bust

Poker is full of illegal actions:

  • You can’t check if there’s a bet.
  • You can’t raise if you don’t have chips.
  • You can’t bet less than the minimum.

So the model outputs logits for 16 actions, but the system also computes a legal action mask. In the UI you can see both:

  • probs: what the model would do in a vacuum
  • masked_probs: what it can do in this actual game state

This sounds like a detail, but without it the model wastes capacity learning “don’t pick illegal moves” instead of learning poker.
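
Mechanically, masking is one masked softmax (a sketch of the idea):

import torch

def masked_probs(logits: torch.Tensor, legal: torch.Tensor) -> torch.Tensor:
    # legal: boolean [16] mask of actions allowed in the current game state
    masked = logits.masked_fill(~legal, float("-inf"))
    return torch.softmax(masked, dim=-1)   # illegal actions end up with probability 0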

2. Streaming Everything (Not Just the Final Decision)

The WebSocket protocol streams discrete event types: hand_started, action_applied, tokens_delta, hero_decision, hand_ended, and so on.

The key one for debugging is tokens_delta: instead of re-sending the entire history every step, the backend can send only the new slice (start index + token IDs + decoded view). That keeps the UI responsive and makes debugging feel like watching a live log.
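
A tokens_delta event might look something like this (the field names are my guess at the shape described above, not the exact protocol):

delta = {
    "type": "tokens_delta",
    "start": 42,                          # index where the new slice begins
    "tokens": [19087, 20111],             # raw token IDs appended since last event
    "decoded": ["FLOP: P1 BET 5BB", "FLOP: HERO ???"],
}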

3. Opponent Memory: Simple, Persistent, and Visible

A lot of “opponent modeling” projects die because they hide the memory mechanism behind a black-box recurrent state.

Here, memory is boring on purpose: the session tracks per-opponent histories, and the frontend literally shows the memory lengths per seat.

That sounds like a toy feature, but it prevents self-delusion. If I claim the model is “adapting across hands” but memory length is always 0… I don’t get to lie to myself.

4. Deterministic UX Details That Matter

Even the UI has some surprisingly important little helpers:

  • inferring the button seat from the first real action in the log
  • generating position labels correctly for 2–9 players
  • persisting the demo config in localStorage so experiments are reproducible

None of that improves win rate. All of that improves iteration speed.

11. What I Learned (And What I’d Do Differently)

This project made me better at ML engineering in a way that’s hard to fake.

1. Tokenization Is Model Design

I went in thinking tokenization was a preprocessing step.

I left believing tokenization is half the architecture.

In poker, the discretization choices define what the model can even represent. If your action space is wrong, your model will be wrong in a very confident way.

2. RL Is Less About Algorithms and More About Monitoring

PPO wasn’t the “magic ingredient.”

The magic ingredient was:

  • entropy tracking
  • collapse detection
  • reward shaping that discourages degenerate play

If you’re not visualizing those things, you’re basically training blind.

3. Tooling Is a Force Multiplier

The web demo was not an extra.

It was the thing that turned this from “I think it works” into “I can prove what it’s doing, step by step, and debug it when it lies.”

4. Constraints Make the Work Real

I built this on a single consumer GPU. That constraint shaped everything:

  • embedding sizes
  • batch sizes
  • how much history I could afford to encode

And honestly? That’s most ML in the real world. The best design is the one you can iterate on fast enough to actually finish.

Conclusion: Overkill, On Purpose

Poker Attention started as a “simple ML project” and turned into a full-stack agent with tokenization, transformers, opponent memory, RL stability hacks, and a UI that makes the whole thing debuggable.

The headline result isn’t “I solved poker.” It’s that I built a system that can watch a table, form a working model of who’s sitting in each seat, and change its play based on that—without collapsing into the usual RL degeneracy.

If there’s one lesson I’m taking forward, it’s this: in real projects, the difference between a clever idea and a working agent is everything around the model—data, constraints, monitoring, and the tools that tell you when your model is lying.

And yes: poker really is just a token stream. It just happens to be a token stream that punishes you for being wrong.