NitroGen: A Foundation Model for Generalist Gaming Agents

1 NVIDIA, 2 Stanford, 3 Caltech, 4 UChicago, 5 UT Austin
Corresponding authors: lmagne@nvidia.com, guanzhiw@nvidia.com, anasa@stanford.edu


Abstract

We introduce NitroGen, a vision-action foundation model for generalist gaming agents that is trained on 40,000 hours of gameplay videos across more than 1,000 games. We incorporate three key ingredients: 1) an internet-scale video-action dataset constructed by automatically extracting player actions from publicly available gameplay videos, 2) a multi-game benchmark environment that can measure cross-game generalization, and 3) a unified vision-action policy trained with large-scale behavior cloning. NitroGen exhibits strong competence across diverse domains, including combat encounters in 3D action games, high-precision control in 2D platformers, and exploration in procedurally generated worlds. It transfers effectively to unseen games, achieving up to 52% relative improvement in task success rates over models trained from scratch. We release the dataset, evaluation suite, and model weights to advance research on generalist embodied agents.


Overview. NitroGen consists of three main components: (1) Multi-game foundation agent (center): a generalist vision-action model that takes game observations as input and generates gamepad actions, enabling zero-shot gameplay across multiple titles and serving as a foundation for fine-tuning on new games; (2) Universal simulator (left): an environment wrapper that allows any commercial game to be controlled through a Gymnasium API; and (3) Internet-scale dataset (right): the largest and most diverse open-source gaming dataset, curated from 40,000 hours of publicly available gameplay videos spanning more than 1,000 games, with extracted action labels.
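The universal simulator is described here only at the interface level. As a concrete illustration, below is a minimal sketch of what such a Gymnasium wrapper could look like; the `capture_frame` and `send_gamepad_state` callables are hypothetical stand-ins for platform-specific screen-capture and virtual-controller backends, which this page does not specify.

```python
import gymnasium as gym
import numpy as np

class GamepadEnv(gym.Env):
    """Hypothetical sketch of a universal game wrapper behind the
    Gymnasium API: observations are raw screenshots, actions are a
    virtual gamepad state (two analog sticks plus buttons)."""

    metadata = {"render_modes": ["rgb_array"]}

    def __init__(self, capture_frame, send_gamepad_state, height=224, width=224):
        self._capture = capture_frame    # () -> HxWx3 uint8 screenshot
        self._send = send_gamepad_state  # action dict -> drives a virtual gamepad
        self.observation_space = gym.spaces.Box(0, 255, (height, width, 3), np.uint8)
        # Two sticks (4 continuous axes) and 12 binary buttons, as an example layout.
        self.action_space = gym.spaces.Dict({
            "sticks": gym.spaces.Box(-1.0, 1.0, (4,), np.float32),
            "buttons": gym.spaces.MultiBinary(12),
        })

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        return self._capture(), {}

    def step(self, action):
        self._send(action)
        obs = self._capture()
        # Commercial games expose no reward signal; task success is judged externally.
        return obs, 0.0, False, False, {}
```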

Introduction

Building generally capable embodied agents that can operate in unknown environments has long been considered a holy grail of AI research. While computer vision and large language models (LLMs) have achieved this kind of generalization through large-scale pre-training on internet data, comparable progress in embodied AI has been impeded by the lack of large, diverse, labeled action datasets. Video games are an ideal domain for advancing embodied AI: they offer visually rich environments and tasks that span a wide range of complexities and temporal horizons. However, prior approaches face substantial limitations. LLM-based methods rely on either (1) hand-crafted programmatic APIs that expose internal game state and control the agent, or (2) bespoke perception modules for textual information extraction and object detection; they enable complex task-solving but demand extensive domain-specific engineering and tuning. Reinforcement learning has achieved superhuman performance in individual games such as StarCraft II and Dota 2, but the resulting agents are narrow, costly to train, and dependent on specialized simulators that are rarely available for arbitrary games. Behavior-cloning approaches based on pixel observations have relied on expensive contractor-collected demonstrations, restricting training to only a few game titles due to prohibitive data collection costs.

Internet-Scale Multi-Game Video-Action Dataset


Video-action dataset pipeline overview. We extract actions from on-screen displays, known as "input overlays," that show the player's gamepad inputs in real time. (Left) Dataset curation. We collect publicly available videos that display a gamepad overlay. The diversity of these overlays poses significant challenges: they vary widely across content creators in controller type (e.g., Xbox, PlayStation, or others), transparency level, and the visual artifacts introduced by video compression. (Right) Action extraction. For each collected video, we localize the gamepad by sampling 25 frames and running keypoint matching against a curated set of templates using SIFT and XFeat features. The template-matching results are used to localize and crop the gamepad region from each video. A hybrid classification–segmentation network is then trained to predict joystick positions and button states from the cropped controller images, enabling accurate reconstruction of player inputs.
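To make the keypoint-matching step concrete, here is a minimal sketch of overlay localization with OpenCV SIFT features, Lowe's ratio test, and a RANSAC-fitted homography. It illustrates the general technique rather than the exact pipeline (which also uses XFeat features and aggregates results across 25 sampled frames); `locate_overlay` and its thresholds are our own for illustration.

```python
import cv2
import numpy as np

def locate_overlay(frame_gray, template_gray, min_matches=10):
    """Find a gamepad-overlay template in a video frame (both grayscale
    uint8 images); return its four corners in frame coordinates, or None."""
    sift = cv2.SIFT_create()
    kp_t, des_t = sift.detectAndCompute(template_gray, None)
    kp_f, des_f = sift.detectAndCompute(frame_gray, None)
    if des_t is None or des_f is None:
        return None
    # Lowe's ratio test keeps only distinctive correspondences.
    pairs = cv2.BFMatcher().knnMatch(des_t, des_f, k=2)
    good = [p[0] for p in pairs
            if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
    if len(good) < min_matches:
        return None
    src = np.float32([kp_t[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_f[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    # Robustly fit a homography mapping template coordinates into the frame.
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    if H is None:
        return None
    h, w = template_gray.shape
    corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]]).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(corners, H)
```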

Action Quality Control


Action extraction quality. We verify the correctness of our action extraction pipeline by comparing its outputs against ground-truth data across different controller families. (a) Joystick R² scores (averaged over the left and right joysticks), with an overall average of 0.84. (b) Button frame accuracy, with an overall average of 0.96.
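The two reported metrics can be computed as below. This reflects our reading of the caption; in particular, we assume "frame accuracy" counts a frame as correct only when every button state matches, which the text does not spell out.

```python
import numpy as np

def joystick_r2(pred, true):
    """Coefficient of determination between predicted and ground-truth
    joystick positions; `pred`/`true` are (T, 2) arrays of x/y axes."""
    ss_res = np.sum((true - pred) ** 2)
    ss_tot = np.sum((true - true.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot

def button_frame_accuracy(pred, true):
    """Fraction of frames whose full button vector is predicted exactly;
    `pred`/`true` are (T, num_buttons) boolean arrays."""
    return float(np.mean(np.all(pred == true, axis=1)))
```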

Dataset Analysis


Distribution of the NitroGen dataset across games and genres. After filtering, the dataset contains 40,000 hours of gameplay videos spanning more than 1,000 games. (a) Hours per game shows broad coverage, with 846 games having over one hour of data, 91 games with over 100 hours, and 15 games exceeding 1,000 hours each. (b) Genre distribution reveals Action-RPG games are most common (34.9% of total hours), followed by Platformer (18.4%) and Action-Adventure (9.2%) games, with the remainder distributed across various genres.

Experiments



Off-the-shelf Multi-Game Capabilities

NitroGen 500M pre-training results across different games. We train a single 500M-parameter model on the entire NitroGen dataset using a flow-matching GR00T architecture and evaluate it directly after behavior-cloning pre-training. For each game, we measure the average task-completion rate over 3 tasks with 5 rollouts per task. Without further fine-tuning, and despite being trained on a very noisy internet dataset, NitroGen performs non-trivial tasks across games with different visual styles (3D, 2D top-down, 2D side-scrolling) and genres (platformer, action-RPG, roguelike, etc.).
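The caption names the objective (behavior cloning) and the action head (flow matching) without giving the loss. Below is a minimal rectified-flow sketch of such a behavior-cloning objective; the `policy(obs, noisy_actions, t)` signature is an assumption for illustration, not the GR00T interface.

```python
import torch
import torch.nn.functional as F

def flow_matching_bc_loss(policy, obs, actions):
    """One training step of flow-matching behavior cloning: regress the
    velocity of a straight-line path from Gaussian noise to the expert
    action chunk. `actions` has shape (batch, action_dim)."""
    noise = torch.randn_like(actions)                    # x_0 ~ N(0, I)
    t = torch.rand(actions.shape[0], 1, device=actions.device)
    x_t = (1 - t) * noise + t * actions                  # interpolant x_t
    target_v = actions - noise                           # d x_t / d t
    pred_v = policy(obs, x_t, t)                         # predicted velocity field
    return F.mse_loss(pred_v, target_v)
```

At inference time, actions are recovered by integrating the learned velocity field from noise over a few Euler steps, the standard sampling procedure for flow-matching policies.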

Pre-Training Transfer on Unseen Games

Post-training experiments. NitroGen pre-training improves downstream agents in unseen environments. We pre-train NitroGen on the dataset described above while holding out one game, then fine-tune the pre-trained checkpoint on the held-out game and compare against a model trained from scratch with the same architecture, data, and compute budget. (a) When varying data quantity, task-completion rate scales with dataset size, and fine-tuning achieves a 10% average relative improvement in task-completion rate. (b) When varying task type in the low-data regime (30 hours), fine-tuning achieves up to a 52% relative improvement in task-completion rate.
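For clarity, "relative improvement" here is how we read the reported numbers: the fine-tuned model's gain over the from-scratch baseline, normalized by the baseline. A one-line check with illustrative (not reported) completion rates:

```python
def relative_improvement(finetuned, scratch):
    # (fine-tuned - from-scratch) / from-scratch, e.g. for task-completion rates.
    return (finetuned - scratch) / scratch

# Illustrative numbers only: 0.25 -> 0.38 is a 52% relative improvement.
assert abs(relative_improvement(0.38, 0.25) - 0.52) < 1e-9
```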

Conclusion

In this work, we introduce NitroGen, an approach for scaling up foundation pre-training for video game agents, and demonstrate how internet pre-training can yield a generalist policy. We leverage a new source of publicly available data to build an internet-scale video-action dataset and empirically demonstrate its effectiveness by training a single multi-game policy. NitroGen also shows positive signs of generalization in fine-tuning experiments. By lowering the barrier to training agents in new environments, NitroGen serves as a starting point for developing more powerful and general-purpose agents.

Team

Loïc Magne *
Anas Awadalla *
Guanzhi Wang *
Yinzhen Xu
Joshua Belofsky
Fengyuan Hu
Joohwan Kim
Ludwig Schmidt
Georgia Gkioxari
Jan Kautz
Yisong Yue
Yejin Choi
Yuke Zhu
Linxi "Jim" Fan

* Co-lead    † Co-advise