Leaderboard
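Best scores, normalized against state-of-the-art (SOTA) baselines. Higher is better; 100% means the agent matches SOTA. Columns: CL = Continual Learning, MDT = Materials Tokenization, CMR = Cross-Modal Retrieval, TIM = Time-Series Explanation, IRB = Improving Replay Buffers.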
| # | Agent | Model | CL | MDT | CMR | TIM | IRB | Avg |
|---|---|---|---|---|---|---|---|---|
| 1 | RG-Agent | gpt-5-high | 94.3% | 49.4% | 96.3% | 107.2% | 34.3% | 76.3% |
| 2 | Codex | gpt-5.2-codex xhigh | 97.9% | 49.0% | 96.6% | 17.1% | 49.7% | 62.1% |
| 3 | Claude Code | claude-opus-4.5 | 98.5% | -- | -- | -- | 21.7% | 24.0% |
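
The Avg column appears consistent with a plain mean over all five tasks in which a missing entry (`--`) counts as 0. That convention is inferred from the numbers in the table, not stated by the source. A minimal sketch under that assumption:

```python
from typing import Optional

# Normalized scores (percent of SOTA) copied from the leaderboard,
# in column order CL, MDT, CMR, TIM, IRB. None marks a "--" entry.
LEADERBOARD: dict[str, list[Optional[float]]] = {
    "RG-Agent": [94.3, 49.4, 96.3, 107.2, 34.3],
    "Codex": [97.9, 49.0, 96.6, 17.1, 49.7],
    "Claude Code": [98.5, None, None, None, 21.7],
}


def average_score(scores: list[Optional[float]]) -> float:
    """Mean over all tasks, counting missing entries as 0 (an inferred convention)."""
    filled = [s if s is not None else 0.0 for s in scores]
    return round(sum(filled) / len(filled), 1)


if __name__ == "__main__":
    for agent, scores in LEADERBOARD.items():
        print(f"{agent}: {average_score(scores)}%")
```

Under this assumption the sketch reproduces the three Avg values in the table (76.3%, 62.1%, 24.0%).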
Research Tasks
Each task mirrors the full ML research cycle: problem understanding, approach design, implementation, and experimentation with fixed compute.
Continual Learning
Scalable continual learning for foundation models without rehearsal. ImageNet-R/A, CIFAR-100, CUB-200.
Cross-Modal Retrieval
Query shift adaptation in cross-modal retrieval. COCO-C, Flickr-C with 16 image + 15 text corruptions.
Improving Replay Buffers
Memory systems for efficient experience replay. DeepMind Control Suite, OpenAI Gym environments.
Materials Tokenization
Domain-preserving tokenization for materials science. SciBERT backbone, MatSci-NLP benchmark.
Time-Series Explanation
Directionally-aware explanations for time series. PAM, Boiler, Epilepsy, Wafer, Freezer datasets.
Agent Scaffolds
Three primary agent implementations with frontier model backends for comprehensive research capabilities.
RG-Agent
gpt-5-high
Primary research agent with comprehensive tooling for file operations, code execution, and experiment management.
Codex
gpt-5.2-codex xhigh
Code-specialized agent optimized for implementation tasks, rapid prototyping, and iterative development.
Claude Code
claude-opus-4.5
Anthropic's agentic coding assistant with strong reasoning and multi-step planning capabilities.
Architecture
Modular adapter pattern with unified interfaces for agents, runtimes, and evaluation systems.
Budget Enforcement
Real-time cost tracking with per-model pricing via LiteLLM
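
As an illustration of the idea only, budget enforcement can be sketched as a running cost total that is checked after every model call. The class names, the `example-model` entry, and the prices below are hypothetical; this is not ResearchGym's implementation, and a real setup would take per-model prices from LiteLLM rather than a hard-coded table.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens (illustrative values only).
PRICES_PER_MTOK = {"example-model": {"input": 1.00, "output": 4.00}}


class BudgetExceeded(RuntimeError):
    """Raised once cumulative spend passes the configured limit."""


@dataclass
class BudgetTracker:
    limit_usd: float
    spent_usd: float = 0.0

    def record(self, model: str, input_tokens: int, output_tokens: int) -> float:
        """Add the cost of one call to the running total and enforce the limit."""
        price = PRICES_PER_MTOK[model]
        cost = (
            input_tokens * price["input"] + output_tokens * price["output"]
        ) / 1_000_000
        self.spent_usd += cost
        if self.spent_usd > self.limit_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.4f} of ${self.limit_usd:.4f} budget"
            )
        return cost
```

A caller would invoke `record` after each completion and stop the run when `BudgetExceeded` is raised. Checking after the call means the final request can overshoot the limit slightly; a stricter variant would estimate the cost before sending.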
Resume & Continuation
Session state preservation with transcript-based context recovery
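
One common way to get transcript-based recovery is an append-only JSON Lines file that is replayed on resume. The sketch below assumes that layout; the function names, the `role`/`content` fields, and the `max_turns` cutoff are hypothetical and not taken from ResearchGym.

```python
import json
from pathlib import Path


def append_turn(transcript_path: Path, role: str, content: str) -> None:
    """Append one turn to the transcript as a single JSON line."""
    with transcript_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"role": role, "content": content}) + "\n")


def load_context(transcript_path: Path, max_turns: int = 50) -> list[dict]:
    """Rebuild the most recent context from the transcript, if one exists."""
    if not transcript_path.exists():
        return []  # Fresh session: nothing to resume from.
    turns = []
    with transcript_path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                turns.append(json.loads(line))
    # Keep only the tail so a long session still fits in the model context.
    return turns[-max_turns:]
```

Because every turn is written as soon as it happens, an interrupted run loses at most the turn in flight, and resuming amounts to calling `load_context` before the next model call.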
Docker Isolation
Containerized execution with deterministic environments
Comprehensive Logging
Transcripts, cost summaries, and execution artifacts
Workspace Security
Path-bounded file operations with process isolation
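
A path-bounded file operation can be approximated by resolving every requested path and rejecting anything that lands outside the workspace root. The helper below is a hypothetical sketch of that check, not ResearchGym's code, and it does not cover the process-isolation side, which would be handled by the container runtime.

```python
from pathlib import Path


class PathEscapeError(PermissionError):
    """Raised when a requested path resolves outside the workspace."""


def resolve_in_workspace(workspace: Path, requested: str) -> Path:
    """Return the absolute path for `requested`, or raise if it escapes `workspace`."""
    root = workspace.resolve()
    # resolve() collapses ".." segments and follows symlinks, so the check
    # below sees the real target. An absolute `requested` replaces `root`
    # in the join and is therefore rejected unless it lies inside the workspace.
    candidate = (root / requested).resolve()
    if not candidate.is_relative_to(root):
        raise PathEscapeError(f"{requested!r} is outside the workspace")
    return candidate
```

File tools would call `resolve_in_workspace` before every read or write, so a request such as `../secrets.txt` fails before any file is touched.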
Citation
If you use ResearchGym in your research, please cite our paper.
@misc{garikaparthi2026researchgymevaluatinglanguagemodel,
  title={ResearchGym: Evaluating Language Model Agents on Real-World AI Research},
  author={Aniketh Garikaparthi and Manasi Patwardhan and Arman Cohan},
  year={2026},
  eprint={2602.15112},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2602.15112},
}
Quick Start
Requires Python 3.12+ and the uv package manager.