TL;DR: The Cognitive Gulf in AI Decision-Making
- The Illusion of Competence: In the Kaggle Game Arena poker tournament, top large language models (such as OpenAI’s GPT-5.2 and o3) demonstrated a massive gap between generating eloquent Game Theory Optimal (GTO) analyses and executing them correctly under imperfect information.
- State Tracking Collapse: Models frequently hallucinated game states. In one absurd hand, GPT-5 Mini and Grok 4.1 shoved all-in believing they held “nut flushes” when neither model even had a pair or a draw.
- Cognitive Biases Replicated: The o3 model justified a terrible all-in with a textbook “sunk cost fallacy” (refusing to fold because chips were “already invested”), showing how AI internalizes human irrationality from its training data.
- The Paradox of Skill: While AI can mathematically crush top human professionals with near-GTO play, it frequently loses to erratic novices by over-analyzing their random, illogical moves—an area where human intuition and dynamic opponent modeling still reign supreme.
- The Human Imperative: In high-stakes, uncertain environments (from poker to finance and military strategy), pure language models are too brittle. A “human-on-the-loop” framework—where AI handles data and humans provide metacognitive oversight, risk management, and intuition—is critical for success.
Introduction: The Paradigm Shift from Perfect Calculation to Reasoning Under Uncertainty
Board games have long been AI’s benchmark—Deep Blue in chess (1997) and AlphaGo in Go (2016) showcase dominance in perfect-information settings, where the full state is visible and solutions can be computed via brute force, MCTS, and reinforcement learning.
Real-world decision-making, however, is different: poker (No-Limit Texas Hold’em) is an imperfect-information game with hidden cards, randomness, and opponent-dependent strategy, exposing key weaknesses in current LLMs. The Kaggle Game Arena—an exhibition by DeepMind and Kaggle that ran 180,000 heads‑up hands among top foundation models (GPT‑5, Gemini 3, Claude 4.5, Grok, etc.)—made this clear.
Models can produce polished Game‑Theory‑Optimal analyses yet fail in execution: they mistrack game state, make bizarre tactical errors, and see exploit attempts backfire. The event highlights a “speak eloquently vs. act correctly” gap and illustrates the “paradox of skill”—systems that beat experts can still lose to erratic novices.
This report analyzes those results, examining LLM limitations under uncertainty, calibration differences, neural-network analogues of sunk‑cost bias, and the enduring role of human intuition in imperfect‑information settings.
Empirical Foundation: Analyzing the Kaggle Game Arena Data
To understand LLM decision-making under uncertainty, we must begin with the empirical data from this stress test. The Kaggle Game Arena poker event used two distinct evaluation frameworks: a statistically robust “Leaderboard” based on 180,000 hands, and a smaller-sample, highly entertaining knockout “Bracket” tournament.
In No-Limit Texas Hold’em, sample size is critical for measuring true skill. Because of the game’s inherent randomness, short-term results carry extremely high variance; a sample of 180,000 hands filters out most of that noise, allowing each model’s true Expected Value (EV) to emerge.
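Why 180,000 hands? The standard error of an observed winrate shrinks with the square root of the sample size. The Monte Carlo sketch below uses hypothetical numbers (a true winrate of +0.05 bb/hand and a per-hand standard deviation of 100 bb, roughly typical for heads-up play) to illustrate how small samples swamp skill with noise:

```python
import random

# Toy variance model: a player with a true winrate of +0.05 bb/hand whose
# per-hand results have a standard deviation of 100 bb. Both numbers are
# hypothetical, chosen only to illustrate scale.
TRUE_WINRATE_BB = 0.05
STDDEV_BB = 100.0

def simulate_winrate(hands: int, rng: random.Random) -> float:
    """Observed winrate (bb/hand) over a sample of `hands` hands."""
    total = sum(rng.gauss(TRUE_WINRATE_BB, STDDEV_BB) for _ in range(hands))
    return total / hands

rng = random.Random(42)
for n in (1_000, 10_000, 180_000):
    # Standard error = 100 / sqrt(n): about 3.16 bb/hand at 1k hands,
    # but only about 0.24 bb/hand at 180k, small enough for skill to show.
    print(f"{n:>7} hands: observed winrate {simulate_winrate(n, rng):+.3f} bb/hand")
```

At 1,000 hands the observed winrate is dominated by luck; at 180,000 hands it converges toward the true value, which is why the leaderboard is the more meaningful ranking.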
| Developer | Specific Models | Performance and Technical Characteristics |
| --- | --- | --- |
| OpenAI | GPT-5.2, GPT-5 Mini, o3 | GPT-5.2 won the 180k-hand leaderboard championship (profit: +$167,614) and o3 won the bracket tournament; both displayed extreme “hyper-aggression.” Conversely, GPT-5 Mini finished last as the biggest loser (loss: -$341,546). |
| Google DeepMind | Gemini 3 Pro, Gemini 3 Flash | According to professional player Doug Polk, Gemini 3 exhibited the most fundamentally sound strategy, playing closest to Game Theory Optimal (GTO), despite not winning the overall title. |
| Anthropic | Claude Opus 4.5, Claude Sonnet 4.5 | Relatively stable, “pretty reasonable” strategy, but it ultimately failed against the extreme aggression of the OpenAI models. |
| xAI | Grok 4, Grok 4.1 Fast Reasoning | Decision-making logic frequently disconnected, with disastrous hallucinations in basic hand-strength recognition. |
| DeepSeek | DeepSeek 3.2 | Participated in the event but did not demonstrate dominant dynamic adjustment in extreme hands and aggressive matchups. |
Exposed Fault Lines: Logical Flaws in LLM Decision-Making Under Uncertainty
The tournament results presented an interesting asymmetry: in the 180,000-hand statistical robustness test, GPT-5.2 emerged as the most profitable champion; yet in the short-term bracket tournament, o3 defeated GPT-5.2 to claim the title. This reflects a common phenomenon in poker: in small-sample matchups, strategies with higher volatility or randomness can win on short-term luck.
However, what truly captured the attention of academia and industry was not which model won the most chips, but the logical flaws exposed along the way. Not only the worst performers, GPT-5 Mini and Grok 4.1, but even the victorious OpenAI models committed shockingly amateur mistakes. These errors did not stem from insufficient computational power, but from inherent blind spots in the language-model architecture itself when handling state tracking and uncertainty.
The Cognitive Gulf: State Tracking Collapse and Latent Space Hallucinations
LLMs often show a sharp gap between polished analysis and actual play. Poker requires exact state-tracking—streets, pot size, invested chips, and shifting ranges. Humans use visual symbols and facts; LLMs rely on attention weights over context. When states grow complex, LLMs frequently suffer severe “state hallucinations” in their latent space—clearly seen in what pro Doug Polk called the “worst AI poker showdown.”
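To make the contrast concrete, here is a minimal sketch of the explicit, symbolic bookkeeping poker requires: the state a human tracks with chips and cards on the table, and that an LLM must instead reconstruct from attention over raw context tokens on every decision. Class and field names are illustrative:

```python
from dataclasses import dataclass, field

# Minimal explicit game-state tracker (names illustrative): the symbolic
# bookkeeping (street, board, pot, per-player investment) that an LLM
# must reconstruct from attention over its context window.
@dataclass
class HandState:
    street: str = "preflop"
    pot: int = 0
    board: list = field(default_factory=list)
    invested: dict = field(default_factory=dict)

    def post(self, player: str, amount: int) -> None:
        """Record chips a player puts in: tracked per player AND in the pot."""
        self.invested[player] = self.invested.get(player, 0) + amount
        self.pot += amount

    def deal(self, street: str, cards: list) -> None:
        """Advance to a new street and extend the board."""
        self.street = street
        self.board.extend(cards)

state = HandState()
state.post("GPT-5 Mini", 100)
state.post("Grok 4.1", 100)
state.deal("flop", ["6c", "Jd", "9d"])
print(state.street, state.pot, state.board)  # flop 200 ['6c', 'Jd', '9d']
```

The point is not that this code is sophisticated; it is that a few trivially maintainable variables are exactly what the models failed to maintain.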
The Phantom Flush: The Absurd Showdown Between Grok 4.1 and GPT-5 Mini
In this highly discussed hand, the Flop community cards were: 6♣‑J♦‑9♦.
GPT‑5 Mini: A♦K♣ — Ace‑high, no pair, no flush draw.
Grok 4.1: A♣10♣ — Ace‑high, no pair, only a weak backdoor.
Despite this, both models escalated: bet → raise → all‑in → call, shoving with nothing.
CoT logs revealed the collapse: Grok claimed a “nut flush draw with three clubs”; GPT‑5 Mini claimed a “nut flush with three diamonds.” Both claims were false.
Likely cause: the models latched onto tokens like “Ace” and “diamonds” and auto‑completed poker phrases (e.g., “nut flush”). Once hallucinated in the reasoning stream, the models treated it as fact.
Outcome: severe state‑tracking failure — confusing made hands with draws and misplaying obvious folds. This highlights a key risk of using pure text‑generation LLMs in high‑stakes, imperfect‑information settings.
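Both hallucinated claims are checkable with a few lines of deterministic code. The sketch below (card notation such as "Ad" for the ace of diamonds, and the helper name, are our own) re-evaluates both hole-card combinations against the actual flop:

```python
from collections import Counter

# Re-checking the "phantom flush" hand deterministically. Card notation
# ("Ad" = ace of diamonds, "Tc" = ten of clubs) and helper names are ours.
FLOP = ["6c", "Jd", "9d"]

def analyze(hole: list, board: list) -> dict:
    """Report whether the hole cards give a pair, flush, or flush draw."""
    suit_counts = Counter(c[-1] for c in hole + board)
    hole_ranks = [c[:-1] for c in hole]
    board_ranks = [c[:-1] for c in board]
    return {
        "pair": hole_ranks[0] == hole_ranks[1]
                or any(r in board_ranks for r in hole_ranks),
        "flush": max(suit_counts.values()) >= 5,   # needs 5 of one suit
        "flush_draw": max(suit_counts.values()) == 4,  # needs 4 of one suit
    }

for name, hole in [("GPT-5 Mini", ["Ad", "Kc"]), ("Grok 4.1", ["Ac", "Tc"])]:
    print(name, analyze(hole, FLOP))  # every flag comes back False
```

Neither hand reaches even four cards of one suit (three diamonds for GPT‑5 Mini, three clubs for Grok), so “nut flush” and “nut flush draw” were both pure confabulation.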
Behavioral Economics in Silicon Neural Networks: Re-emergence of the Sunk Cost Fallacy
The o3 model’s catastrophic all‑in during the semi‑finals and finals revealed a different, equally troubling failure mode from the “phantom flush”: it internalized a human cognitive bias and used that as the basis for a high‑stakes decision. In post‑hand analysis o3 justified its shove by claiming that folding would mean “giving up the chips already invested.” This reasoning exemplifies the sunk cost fallacy—an error in which past, irrecoverable costs are allowed to influence forward‑looking choices. In poker, rational decisions must be based solely on current expected value and remaining equity; previously committed chips are no longer yours and should not affect whether you call an opponent’s all‑in.
The root cause is training and fine‑tuning: LLMs absorb vast human text (including irrational reasoning) and RLHF can reinforce responses that “sound” human. Without an explicit EV/equity calculator, o3 relied on semantic associations—linking large pot investments to loss aversion—instead of computing math, thus replicating human bias.
This case shows that language fluency or knowledge of game theory isn’t enough; reliable play in imperfect‑information, high‑stakes settings requires explicit quantitative tools and safeguards to prevent human cognitive biases from driving machine decisions.
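The correct decision rule is easy to state in code: the EV of a call depends only on the current price and current equity, and chips already invested never appear in the formula. A minimal sketch with hypothetical numbers:

```python
# EV of calling a bet: win the current pot with probability `equity`,
# lose `to_call` otherwise. Chips already invested are deliberately
# absent from the signature. All numbers below are hypothetical.
def call_ev(pot: float, to_call: float, equity: float) -> float:
    return equity * pot - (1 - equity) * to_call

# Facing a 1,000-chip all-in into a 1,500-chip pot with 25% equity:
print(call_ev(1500.0, 1000.0, 0.25))  # -375.0 => fold
# Required equity is to_call / (pot + to_call) = 1000 / 2500 = 40%.
# Whether we previously committed 50 chips or 5,000 changes nothing above:
# "chips already invested" is simply not an argument to call_ev.
```

The sunk cost fallacy, in these terms, is smuggling a fourth parameter into a three-parameter function.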
The AI Beats the Best but Loses to the Worst
The best LLMs can produce polished, professional-sounding analyses while their underlying logic is rotten. They can cite GTO principles, yet their plays often diverge wildly. This mirrors everyday LLM use: there’s a large gap between “sounding correct” and actually doing the right thing—especially when tasks require exact state tracking, handling uncertainty, and consistent decisions under pressure.
Doug Polk made a salient point: you’d expect AI to excel at data-driven opponent modeling—calculating frequencies, EV, and pot odds—but instead many LLMs reason like an amateur player. Their CoT often reads as narrative intuition (“I saw the opponent open with J5” or “He folded under pressure before”) rather than quantitative inference (“the opponent’s fold-to-3-bet is X%, so I need Y% equity”).
Even worse, these models frequently produce self-contradictory judgments. In one hand, Grok claimed an opponent “calls somewhat loosely” and simultaneously asserted it had “high fold equity”—two mutually exclusive positions—yet continued down the broken logic chain and made a catastrophic play. This is the same failure mode we see when people rely on LLMs for workplace decisions: the output looks fluent and authoritative, but the reasoning often collapses under scrutiny.
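The quantitative inference Polk has in mind is a one-line formula: a zero-equity bluff of size `bet` into `pot` profits only when the opponent folds more often than `bet / (pot + bet)`. The sketch below (hypothetical numbers) shows why “calls somewhat loosely” and “high fold equity” cannot both support a bluff:

```python
# EV of a zero-equity bluff of size `bet` into `pot`: win the pot when the
# opponent folds, lose the bet when called. Numbers are hypothetical.
def bluff_ev(pot: float, bet: float, fold_freq: float) -> float:
    return fold_freq * pot - (1 - fold_freq) * bet

def break_even_fold_freq(pot: float, bet: float) -> float:
    """Minimum fold frequency for a pure bluff to break even."""
    return bet / (pot + bet)

pot, bet = 1000.0, 1000.0
print(break_even_fold_freq(pot, bet))  # 0.5: a pot-sized bluff needs 50% folds
# An opponent who "calls somewhat loosely" might fold only ~30% of the time,
# which is flatly inconsistent with claiming "high fold equity":
print(bluff_ev(pot, bet, 0.30))  # -400.0: the bluff loses chips
```

A model that held both beliefs and ran this arithmetic would catch its own contradiction before shoving; the tournament models did not.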
A New Paradigm for Human-AI Collaboration: The Critical Role of Human Reviewers and Supervisors
From poker to corporate strategy, AI shows vast data-processing power but critical weaknesses under uncertainty. In high-stakes settings, AI cannot operate autonomously; human reviewers and supervisors are essential.
Research on human-AI collaboration shows “complementary team performance” (CTP): the best results come from allocating tasks to leverage each side’s strengths. AI should handle high-volume automation and large-scale analytics, assist with mid-level judgments, and free humans to focus on complex, ambiguous, and value-driven decisions. This bidirectional loop pairs AI’s scale with human contextual reasoning, business sense, and ethics.
LLMs also “fail silently with confidence” when conditions change, so metacognitive human oversight is required to catch hallucinations, reconcile shifting stakeholder values, and keep behavior aligned with strategic goals. A practical governance model is “human-on-the-loop,” where supervisors continuously monitor AI, retain authority to abort actions, and audit outcomes. As agentic AI is deployed, organizations should establish governance boards and clear supervisory roles (safety monitoring, stakeholder communication, decision audits) to capture AI’s benefits while managing its risks.
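A human-on-the-loop gate can be sketched in a few lines: the agent acts autonomously, every action is logged for audit, and a supervisor retains veto power over actions above a risk threshold. All names and thresholds below are hypothetical, not a reference to any real framework:

```python
from dataclasses import dataclass

# Human-on-the-loop sketch (all names and thresholds hypothetical): the agent
# acts on its own, every action is audit-logged, and a supervisor may veto
# any action whose risk score crosses a threshold.
@dataclass
class ProposedAction:
    description: str
    risk_score: float  # e.g. fraction of bankroll at stake, 0.0 to 1.0

def run_with_oversight(actions, supervisor_veto, risk_threshold=0.5):
    """Execute low-risk actions freely; refer high-risk ones to the human."""
    audit_log = []
    for action in actions:
        if action.risk_score >= risk_threshold and supervisor_veto(action):
            audit_log.append((action.description, "VETOED"))
        else:
            audit_log.append((action.description, "EXECUTED"))
    return audit_log

# Simulated supervisor who vetoes everything referred to them:
log = run_with_oversight(
    [ProposedAction("small value bet", 0.05),
     ProposedAction("all-in shove on hallucinated flush", 0.95)],
    supervisor_veto=lambda action: True,
)
print(log)
```

The design choice that matters is that the human sits on, not in, the loop: routine actions flow through untouched, while the audit trail and the veto path exist for exactly the catastrophic, high-confidence failures this report documents.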
A Cautionary Tale Beyond the Poker Table
The spectacular quarterfinals of the Kaggle Game Arena offered far more than just highly entertaining technological theater. It served as a profound stress test probing the boundaries of uncertainty for contemporary AI systems.
Revisiting the question of “how smart AI truly is,” the answer from the poker table is: current foundation language models are not yet smart enough. They suffer from severe state-tracking hallucinations, mistaking semantic probabilities for objective reality; they inherit cognitive flaws from human corpora, brazenly exhibiting sunk cost fallacies; and they lack the ability to dynamically model irrational behavior, leading to catastrophic defeats through over-analysis when facing amateur players. Moreover, the variance in self-calibration among models reveals the delicate ecosystem between high stability (like Claude) and hyper-aggression (like OpenAI) within zero-sum games.
However, what is truly alarming about this experiment is the projection of these flaws onto the real world. The skills required at the poker table—risk management, quantifying probabilities under uncertainty, detecting deception, and executing strategic planning amid asymmetric information—are exactly the core competencies indispensable in financial trading, corporate negotiations, cybersecurity defense, and even international military conflicts.
Recent wargaming research involving LLMs in nuclear crises demonstrated that, much like the hyper-aggression GPT-5.2 exhibited in poker, these AIs frequently disregard the “Nuclear Taboo” when simulating military conflicts, recklessly recommending escalation and even nuclear strikes. If a highly articulate AI trader in financial markets experiences a “phantom flush” state hallucination, or refuses to cut losses due to a “sunk cost fallacy,” the consequences would be catastrophic.
Artificial intelligence is already capable of writing flawless game theory analysis reports, but this empirical experiment serves as an explicit warning: until they truly master precise state tracking, handle uncertainty effectively, and overcome inherent neural network hallucinations, handing over critical decision-making authority (especially high-pressure, imperfect-information decisions) entirely to language models carries immeasurable risk. In the foreseeable future, human emotional decoupling, intuitive judgment, metacognitive supervision, and resilience against irrational chaos remain the irreplaceable final line of defense in the ultimate game of strategic decision-making.
REFERENCES
- Barkan, C. O., Black, S., & Sourbut, O. (2026). Do Large Language Models Know What They Are Capable Of? ICLR 2026.
- Google DeepMind. (2026). Kaggle Game Arena Updates.
- Human-AI Collaboration in High-Stakes Decision-Making Environments. (2026).
- Kyoto University Discussion Paper. (2025). Prospect Theory, Loss Aversion, and LLM Behavior.
- Mauboussin, M. (2016). The Paradox of Skill.
- Moontower Meta. (2025). Poker as an Adaptive, Adversarial Game.
- NIST. AI Risk Management Framework.
- Poker.org. (2026). Hyper-aggressive OpenAI bots reign supreme as silicon poker battle concludes.
- Poker.org. (2026). Polk, Schulman, Boeree explore AI poker showdown as OpenAI dominates.
- Reddit (r/reinforcementlearning). (2026). Game Arena Poker results are in: GPT 5.2 won the leaderboard but o3 won the bracket.
- Reddit (r/technology). (2026). AIs can’t stop recommending nuclear strikes in war simulations.
- ToolPoker: Leveraging LLMs for Imperfect-Information Games. (2026).
- YouTube / Doug Polk Poker. (2026). Epic Conclusion To AI Poker Showdown.
DISCLAIMER
Past performance does not guarantee future results.
Opinions and estimates offered constitute our judgment and are subject to change without notice, as are statements of financial market trends, which are based on current market conditions. We believe the information provided here is reliable, but do not warrant its accuracy or completeness. This material is not intended as an offer or solicitation for the purchase or sale of any cryptocurrencies. The views and strategies described may not be suitable for all investors. This material has been prepared for informational purposes only, and is not intended to provide, and should not be relied on for, accounting, legal or tax advice. Any forecasts contained herein are for illustrative purposes only and are not to be relied upon as advice or interpreted as a recommendation.
©Linux Group, October 2024.
Unless otherwise stated, all data is as of October 7, 2024 or as of most recently available.