The PokeAgent Challenge: Competitive and Long-Context Learning at Scale

TL;DR

The PokeAgent Challenge tests AI decision-making through competitive Pokemon battles and RPG speedrunning, offering a dataset of 20M+ battle trajectories and a standardized evaluation framework.

cs.LG · 2026-03-17 · 4 citations
Seth Karten Jake Grigsby Tersoo Upaa Junik Bae Seonghun Hong Hyunyoung Jeong Jaeyoon Jung Kun Kerdthaisong Gyungbo Kim Hyeokgi Kim Yujin Kim Eunju Kwon Dongyu Liu Patrick Mariglia Sangyeon Park Benedikt Schink Xianwei Shi Anthony Sistilli Joseph Twin Arian Urdu Matin Urdu Qiao Wang Ling Wu Wenli Zhang Kunsheng Zhou Stephanie Milani Kiran Vodrahalli Amy Zhang Fei Fang Yuke Zhu Chi Jin
multi-agent systems partial observability long-horizon planning reinforcement learning large language models

Key Findings

Methodology

The PokeAgent Challenge evaluates AI decision-making capabilities through two complementary tracks: the Battling Track and the Speedrunning Track. The Battling Track provides a dataset of over 20 million battle trajectories and includes heuristic, reinforcement learning (RL), and large language model (LLM)-based baselines. The Speedrunning Track offers the first standardized evaluation framework for RPG speedrunning, including an open-source multi-agent orchestration system for modular, reproducible comparisons. The NeurIPS 2025 competition validated the quality of these resources and the research community's interest in Pokemon.

Key Results

  • Result 1: In the Battling Track, significant gaps were found between generalist LLM, specialist RL, and elite human performance. Analysis against the BenchPress evaluation matrix shows that Pokemon battling is nearly orthogonal to standard LLM benchmarks, measuring capabilities not captured by existing suites.
  • Result 2: The Speedrunning Track provides the first standardized evaluation framework for RPG speedrunning, with participants using novel methods such as Scripted Policy Distillation and iterative offline RL with dynamic data weighting.
  • Result 3: The NeurIPS 2025 competition attracted over 100 teams across both tracks, with winning solutions detailed in the paper.

Significance

The PokeAgent Challenge provides a large-scale benchmark for AI decision-making research, particularly in partial observability, game-theoretic reasoning, and long-horizon planning. By leveraging Pokemon's multi-agent battle system and RPG environment, researchers can simultaneously examine these three aspects under realistic conditions. This challenge not only fills a gap in existing benchmarks but also provides new momentum for RL and LLM research. With a standardized evaluation framework and rich datasets, the PokeAgent Challenge offers an important research tool for academia and industry.

Technical Contribution

The technical contributions of the PokeAgent Challenge include providing a standardized evaluation framework that combines competitive battling via Pokemon Showdown with RPG speedrunning via Pokemon Emerald. It offers the largest publicly available Pokemon battle dataset and introduces the first open-source multi-agent orchestration system for long-horizon RPG play. Empirical validation through the NeurIPS 2025 competition reveals significant gaps between generalist LLM, specialist RL, and elite human performance, with analysis against the BenchPress evaluation matrix showing that Pokemon battling measures capabilities not captured by existing benchmark suites.

Novelty

The PokeAgent Challenge is the first benchmark to simultaneously examine partial observability, game-theoretic reasoning, and long-horizon planning under realistic conditions. Unlike existing benchmarks, it combines adversarial reasoning with large-scale long-horizon planning and provides a living competitive ecosystem.

Limitations

  • Limitation 1: Although the PokeAgent Challenge offers rich datasets and a standardized evaluation framework, its complexity may lead to high computational costs, limiting participation by smaller research teams.
  • Limitation 2: The complexity and dynamism of the Pokemon environment may pose challenges for models adapting to continuously evolving metagames.
  • Limitation 3: Despite revealing gaps between generalist LLM and specialist RL, the challenge does not fully address how to close this gap in practical applications.

Future Work

Future research directions include developing more efficient algorithms to address the complexity of the PokeAgent Challenge, particularly in partial observability and long-horizon planning. Researchers can also explore applying the challenge's techniques to other complex multi-agent systems and dynamic environments.

AI Executive Summary

The PokeAgent Challenge is a large-scale benchmark for AI decision-making research, addressing the core challenges of partial observability, game-theoretic reasoning, and long-horizon planning. Existing benchmarks often focus on one aspect, whereas the PokeAgent Challenge simultaneously examines these three capabilities under realistic conditions through Pokemon's multi-agent battle system and RPG environment.

The challenge is divided into two complementary tracks: the Battling Track and the Speedrunning Track. The Battling Track provides a dataset of over 20 million battle trajectories and includes heuristic, reinforcement learning (RL), and large language model (LLM)-based baselines. The Speedrunning Track offers the first standardized evaluation framework for RPG speedrunning, including an open-source multi-agent orchestration system for modular, reproducible comparisons.

In the NeurIPS 2025 competition, over 100 teams participated across both tracks, revealing significant gaps between generalist LLM, specialist RL, and elite human performance. Analysis against the BenchPress evaluation matrix shows that Pokemon battling is nearly orthogonal to standard LLM benchmarks, measuring capabilities not captured by existing suites.

The technical contributions of the PokeAgent Challenge include providing a standardized evaluation framework that combines competitive battling via Pokemon Showdown with RPG speedrunning via Pokemon Emerald. It offers the largest publicly available Pokemon battle dataset and introduces the first open-source multi-agent orchestration system for long-horizon RPG play.

Despite the rich datasets and standardized evaluation framework, the complexity of the PokeAgent Challenge may lead to high computational costs, limiting participation by smaller research teams. Future research directions include developing more efficient algorithms to address the challenge's complexity, particularly in partial observability and long-horizon planning.

Deep Analysis

Background

In the field of artificial intelligence, partial observability, game-theoretic reasoning, and long-horizon planning have been core challenges in sequential decision-making. However, existing benchmarks often focus on one aspect, such as imperfect-information games emphasizing equilibrium computation in short episodes, while open-ended environments test exploration but lack adversarial opponents. Pokemon is an environment that combines all three: competitive battles require reasoning under hidden information against a strategic adversary, while single-player campaigns demand thousands of cumulative decisions spanning exploration, resource management, and combat over extended horizons. Pokemon's complexity and dynamism make it a more complex testbed than most existing benchmarks. Recently, Pokemon has gained significant interest for evaluating frontier AI systems. Demonstrations like Claude Plays Pokemon, Gemini 2.5 Pro, and OpenAI's GPT-5 have reinforced Pokemon's suitability as an AI testbed, but efforts have been fragmented due to different games, harnesses, and evaluation criteria.

Core Problem

The PokeAgent Challenge aims to address the core challenges of partial observability, game-theoretic reasoning, and long-horizon planning. Existing benchmarks often focus on one aspect, whereas the PokeAgent Challenge simultaneously examines these three capabilities under realistic conditions through Pokemon's multi-agent battle system and RPG environment. Pokemon's complexity and dynamism make it a more complex testbed than most existing benchmarks. With a standardized evaluation framework and rich datasets, the PokeAgent Challenge offers an important research tool for academia and industry.

Innovation

The core innovations of the PokeAgent Challenge lie in its standardized evaluation framework and rich datasets. Firstly, it combines competitive battling via Pokemon Showdown with RPG speedrunning via Pokemon Emerald, providing a living competitive ecosystem. Secondly, it offers the largest publicly available Pokemon battle dataset, comprising over 20 million battle trajectories. Lastly, it introduces the first open-source multi-agent orchestration system for long-horizon RPG play. These innovations enable the PokeAgent Challenge to simultaneously examine the capabilities of partial observability, game-theoretic reasoning, and long-horizon planning under realistic conditions.

Methodology

The PokeAgent Challenge evaluates AI decision-making capabilities through two complementary tracks:


  • Battling Track: provides a dataset of over 20 million battle trajectories and includes heuristic, reinforcement learning (RL), and large language model (LLM)-based baselines.

  • Speedrunning Track: offers the first standardized evaluation framework for RPG speedrunning, including an open-source multi-agent orchestration system for modular, reproducible comparisons.

  • Dataset: the Battling Track supplies the largest publicly available Pokemon battle dataset, comprising over 20 million trajectories; the Speedrunning Track provides self-contained evaluation in Pokemon Emerald.

  • Baselines: the Battling Track includes heuristic, RL, and LLM-based agents capable of high-level competitive play; the Speedrunning Track supports modular comparisons of harness-based LLM approaches.
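As an illustration of the heuristic baseline family, here is a minimal sketch of a greedy battle policy over a deliberately simplified state. The move list, types, and tiny type chart below are illustrative assumptions; the challenge's actual baselines act on full Pokemon Showdown battle states with hidden opponent information.

```python
from dataclasses import dataclass

# Toy type chart: multiplier for (attacking type, defending type).
# Unlisted pairs default to neutral (1.0). Illustrative only.
TYPE_CHART = {
    ("water", "fire"): 2.0, ("fire", "grass"): 2.0, ("grass", "water"): 2.0,
    ("fire", "water"): 0.5, ("grass", "fire"): 0.5, ("water", "grass"): 0.5,
}

@dataclass
class Move:
    name: str
    base_power: int
    move_type: str

def effectiveness(move_type: str, defender_type: str) -> float:
    """Look up the type multiplier, defaulting to neutral."""
    return TYPE_CHART.get((move_type, defender_type), 1.0)

def choose_move(moves: list[Move], defender_type: str) -> Move:
    """Greedy heuristic: pick the move maximizing base power x effectiveness."""
    return max(moves, key=lambda m: m.base_power * effectiveness(m.move_type, defender_type))
```

Even this crude rule illustrates why Pokemon battling is hard for learners: the greedy choice ignores hidden information (opponent items, movesets) and game-theoretic considerations like predicting switches.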

Experiments

The experimental design of the PokeAgent Challenge includes two complementary tracks: the Battling Track and the Speedrunning Track. The Battling Track provides a dataset of over 20 million battle trajectories and includes heuristic, reinforcement learning (RL), and large language model (LLM)-based baselines. The Speedrunning Track offers the first standardized evaluation framework for RPG speedrunning, including an open-source multi-agent orchestration system for modular, reproducible comparisons. In the NeurIPS 2025 competition, over 100 teams participated across both tracks, revealing significant gaps between generalist LLM, specialist RL, and elite human performance.

Results

The key results of the PokeAgent Challenge include:


  • In the Battling Track, significant gaps were found between generalist LLM, specialist RL, and elite human performance. Analysis against the BenchPress evaluation matrix shows that Pokemon battling is nearly orthogonal to standard LLM benchmarks, measuring capabilities not captured by existing suites.

  • The Speedrunning Track provides the first standardized evaluation framework for RPG speedrunning, with participants using novel methods such as Scripted Policy Distillation and iterative offline RL with dynamic data weighting.

  • The NeurIPS 2025 competition attracted over 100 teams, revealing considerable gaps between generalist LLM, specialist RL, and elite human performance.

Applications

The application scenarios of the PokeAgent Challenge include:


  • Academic Research: With a standardized evaluation framework and rich datasets, the PokeAgent Challenge offers an important research tool for academia.

  • Industrial Applications: The PokeAgent Challenge provides a standardized platform for evaluating AI decision-making capabilities, particularly in partial observability, game-theoretic reasoning, and long-horizon planning.

  • Game Development: The PokeAgent Challenge offers a standardized platform for testing AI decision-making capabilities, particularly in complex multi-agent systems and dynamic environments.

Limitations & Outlook

Despite the rich datasets and standardized evaluation framework, the complexity of the PokeAgent Challenge may lead to high computational costs, limiting participation by smaller research teams. Additionally, the complexity and dynamism of the Pokemon environment may pose challenges for models adapting to continuously evolving metagames. Future research directions include developing more efficient algorithms to address the challenge's complexity, particularly in partial observability and long-horizon planning.

Plain Language (accessible to non-experts)

Imagine playing a complex board game where the rules keep changing and you can't see all of your opponent's pieces. This is the essence of the PokeAgent Challenge: making the best decisions in an environment full of uncertainty. The challenge tests AI decision-making through Pokemon battles and an RPG setting, where an AI must act quickly on limited information and adapt its strategy as the environment changes, much like a chef adjusting a recipe mid-competition based on the ingredients at hand. In this way, the PokeAgent Challenge provides a unique testbed for AI research, helping researchers develop smarter algorithms.

ELI14 (explained like you're 14)

Imagine playing a super complex game where the situation is always changing and you can't see all of your opponent's moves. That's the core of the PokeAgent Challenge! It tests AI decision-making through Pokemon battles and an RPG setting, where the AI has to make quick decisions with limited information and adapt its strategy on the fly, like competing in a big school competition where you never have the full picture. This gives AI researchers a unique testbed for developing smarter algorithms.

Glossary

Multi-agent System

A system involving multiple interacting agents, often used to simulate complex social or natural phenomena.

In the PokeAgent Challenge, Pokemon's battle system is considered a multi-agent system.

Partial Observability

Refers to a decision-making process where the agent cannot fully observe all states of the environment.

In Pokemon battles, players cannot see all of the opponent's information, which is an example of partial observability.
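A toy sketch of how an agent might cope with this hidden information: maintain a belief over the opponent's possible movesets and filter out candidates inconsistent with what has been revealed. The candidate movesets below are hypothetical, and this is not the challenge's actual inference code.

```python
def update_belief(candidate_sets, revealed_move):
    """Keep only candidate movesets consistent with a move the opponent revealed."""
    return [s for s in candidate_sets if revealed_move in s]

# Hypothetical candidate movesets for an unseen opponent Pokemon.
candidates = [
    {"thunderbolt", "surf", "ice-beam", "protect"},
    {"thunderbolt", "volt-switch", "hidden-power", "protect"},
    {"earthquake", "rock-slide", "swords-dance", "protect"},
]

# After the opponent reveals Thunderbolt, only the first two sets remain plausible.
belief = update_belief(candidates, "thunderbolt")
```

Real agents must go further, weighting candidates by usage statistics rather than eliminating them outright, but the principle of narrowing a belief state as information is revealed is the same.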

Game-theoretic Reasoning

The use of game theory methods to analyze and formulate strategies in competitive environments.

The Battling Track of the PokeAgent Challenge requires AI to perform game-theoretic reasoning.

Long-horizon Planning

A decision-making process that involves a long time span, often requiring consideration of multiple future actions.

The Speedrunning Track requires AI to perform long-horizon planning to achieve RPG game objectives.

Reinforcement Learning

A machine learning method where agents learn strategies by interacting with the environment to maximize cumulative rewards.

In the PokeAgent Challenge, reinforcement learning is used to train AI performance in the Battling Track.
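A minimal, self-contained sketch of the idea on a toy chain environment (tabular Q-learning; this is a generic textbook example, unrelated to the challenge's actual RL baselines):

```python
import random

def q_learning(n_states=5, episodes=2000, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    """Tabular Q-learning on a toy chain: reward 1.0 for reaching the rightmost state."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(n_states) for a in (0, 1)}  # actions: 0 = left, 1 = right
    for _ in range(episodes):
        s = 0
        while s < n_states - 1:
            # Epsilon-greedy action selection.
            a = rng.choice((0, 1)) if rng.random() < eps else max((0, 1), key=lambda x: q[(s, x)])
            s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            r = 1.0 if s_next == n_states - 1 else 0.0
            # Temporal-difference update toward the bootstrapped target.
            q[(s, a)] += alpha * (r + gamma * max(q[(s_next, 0)], q[(s_next, 1)]) - q[(s, a)])
            s = s_next
    return q
```

After training, the learned Q-values prefer moving right in every state, recovering the optimal policy. The challenge's battling environment differs in every hard way: partial observability, an adaptive opponent, and a vastly larger state space.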

Large Language Model

A deep learning-based language model capable of generating and understanding natural language.

The baselines in the PokeAgent Challenge include strategies based on large language models.

Standardized Evaluation Framework

A unified evaluation standard used to compare the performance of different algorithms or models.

The PokeAgent Challenge provides a standardized evaluation framework for comparing AI performance across different tracks.

Open-source Multi-agent Orchestration System

An open-source software system for coordinating multiple agents, supporting modular and reproducible experiments.

The Speedrunning Track uses an open-source multi-agent orchestration system for RPG speedrunning evaluation.

BenchPress Evaluation Matrix

A matrix used to evaluate AI model performance, containing multiple benchmark tests.

Analysis against the BenchPress evaluation matrix shows that Pokemon battling is nearly orthogonal to standard LLM benchmarks.

Scripted Policy Distillation

A technique for converting high-level strategies into executable policies, often used in reinforcement learning.

In the Speedrunning Track, participants used Scripted Policy Distillation to improve AI performance.
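A toy sketch of the idea behind this technique, assuming a trivially simple discrete state space: roll out a hand-written scripted teacher to collect (state, action) demonstrations, then distill them into a student policy. The scripted rule and states here are hypothetical, and the student is a mere lookup table; the participants' actual method distills into learned policies.

```python
from collections import Counter, defaultdict

def scripted_policy(state: int) -> str:
    """Hand-written teacher: a hypothetical rule, e.g. heal when HP is low, else attack."""
    return "heal" if state < 3 else "attack"

def collect_demonstrations(states):
    """Roll the scripted teacher over states to produce (state, action) pairs."""
    return [(s, scripted_policy(s)) for s in states]

def distill(demos):
    """Student: majority-vote lookup table distilled from the demonstrations."""
    counts = defaultdict(Counter)
    for s, a in demos:
        counts[s][a] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}
```

The payoff of distillation is that the student can generalize beyond the teacher's script when the lookup table is replaced by a function approximator trained on the same demonstrations.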

Open Questions (unanswered questions from this research)

  • Open Question 1: How can AI decision-making capabilities be improved in environments combining partial observability and long-horizon planning? Existing methods still struggle with complex multi-agent systems, requiring more efficient algorithms.
  • Open Question 2: How can models adapt to new strategies in dynamically changing metagames? The complexity and dynamism of the Pokemon environment pose challenges for models facing a continuously evolving metagame.
  • Open Question 3: How can the performance gap between generalist LLMs and specialist RL be narrowed? Although the PokeAgent Challenge reveals this gap, it does not address how to close it in practical applications.
  • Open Question 4: How can participation by smaller research teams be supported despite high computational costs? The complexity of the PokeAgent Challenge may otherwise limit who can compete.
  • Open Question 5: How can the techniques developed for the PokeAgent Challenge be applied to other complex multi-agent systems and dynamic environments?

Applications

Immediate Applications

Academic Research

With a standardized evaluation framework and rich datasets, the PokeAgent Challenge offers an important research tool for academia.

Industrial Applications

The PokeAgent Challenge provides a standardized platform for evaluating AI decision-making capabilities, particularly in partial observability, game-theoretic reasoning, and long-horizon planning.

Game Development

The PokeAgent Challenge offers a standardized platform for testing AI decision-making capabilities, particularly in complex multi-agent systems and dynamic environments.

Long-term Vision

Future Development of Agent Systems

The PokeAgent Challenge provides an important research direction for the future development of agent systems, particularly in partial observability and long-horizon planning.

AI Applications in Complex Environments

The PokeAgent Challenge provides an important research direction for AI applications in complex environments, particularly in dynamically changing environments.

Abstract

We present the PokeAgent Challenge, a large-scale benchmark for decision-making research built on Pokemon's multi-agent battle system and expansive role-playing game (RPG) environment. Partial observability, game-theoretic reasoning, and long-horizon planning remain open problems for frontier AI, yet few benchmarks stress all three simultaneously under realistic conditions. PokeAgent targets these limitations at scale through two complementary tracks: our Battling Track, which calls for strategic reasoning and generalization under partial observability in competitive Pokemon battles, and our Speedrunning Track, which requires long-horizon planning and sequential decision-making in the Pokemon RPG. Our Battling Track supplies a dataset of 20M+ battle trajectories alongside a suite of heuristic, RL, and LLM-based baselines capable of high-level competitive play. Our Speedrunning Track provides the first standardized evaluation framework for RPG speedrunning, including an open-source multi-agent orchestration system for modular, reproducible comparisons of harness-based LLM approaches. Our NeurIPS 2025 competition validates both the quality of our resources and the research community's interest in Pokemon, with over 100 teams competing across both tracks and winning solutions detailed in our paper. Participant submissions and our baselines reveal considerable gaps between generalist (LLM), specialist (RL), and elite human performance. Analysis against the BenchPress evaluation matrix shows that Pokemon battling is nearly orthogonal to standard LLM benchmarks, measuring capabilities not captured by existing suites and positioning Pokemon as an unsolved benchmark that can drive RL and LLM research forward. We transition to a living benchmark with a live leaderboard for Battling and self-contained evaluation for Speedrunning at https://pokeagentchallenge.com.

cs.LG cs.AI

Cited By (4)

  • Automatic Generation of High-Performance RL Environments
  • Benchmarking In-context Experiential Learning Through Repeated Product Recommendations (2025, 1 citation)
  • A Survey on Large Language Model-Based Game Agents (2024, 116 citations)
  • GameDevBench: Evaluating Agentic Capabilities Through Game Development