autoresearch.studio

Turn any ML training script into an autonomous research loop. An AI agent modifies your code, runs experiments, keeps what improves the metric, discards what doesn't — and repeats, while you sleep.

Get started
terminal
# Install and init — one command creates everything
$ pip install autoresearchstudio
$ ars init
Creating project files:
  train.py
  prepare.py
  CLAUDE.md
Connecting to dashboard...
  API key: ars_bd02b4cc...
  Dashboard: https://autoresearch.studio/?key=ars_bd02...

# Launch Claude Code — it handles everything
$ claude
> configure and start the experiments

Claude: Setting up run...
  ars setup --tag mar23
  ars run -d "baseline"
  val_accuracy: 0.9506
  KEEP — baseline

Claude: Trying dropout + cosine lr...
  ars run -d "add dropout + cosine lr"
  val_accuracy: 0.9720
  KEEP — 0.9506 → 0.9720

Claude: Testing gelu activation...
  ars run -d "replace relu with gelu"
  val_accuracy: 0.9680
  DISCARD — regressed, reverted

Claude: Continuing to next experiment...

The loop

The agent repeats this cycle autonomously until your model converges.

Modify — AI edits the code
Train — runs the experiment
Judge — LLM evaluates the result
Keep — or discard
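The keep-or-discard decision at the end of each cycle can be sketched in a few lines. This is illustrative pseudologic, not the actual agent implementation; the function name and signature are assumptions:

```python
def keep_or_discard(best_score, new_score, direction="maximize"):
    """Return True if the new experiment beats the current best (illustrative)."""
    if direction == "maximize":
        return new_score > best_score
    return new_score < best_score

# Mirrors the transcript above:
print(keep_or_discard(0.9506, 0.9720))  # dropout + cosine lr: True (keep)
print(keep_or_discard(0.9720, 0.9680))  # gelu: False (discard)
```

The `direction` argument matches the `metric.direction` field in the config, so the same rule works for loss-style metrics you want minimized.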

One command, full project

ars init generates everything. Customize for your task, or run the MNIST default as-is.

🧪 train.py
# ── Model (modify this) ──────────
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3)
        self.conv2 = nn.Conv2d(32, 64, 3)
        self.fc1 = nn.Linear(64*7*7, 128)
        self.fc2 = nn.Linear(128, 10)

# ── Hyperparams (modify these) ───
LEARNING_RATE = 1e-3
BATCH_SIZE = 128

# ── Output (keep this format) ────
print(f"val_accuracy: {acc:.6f}")
The agent modifies this file. Every change is a git commit — keeps advance the branch, discards are reverted.
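The commit-per-experiment bookkeeping can be as simple as the following shell sketch (illustrative only, run in a throwaway repo; these are not the tool's actual commands):

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q . && git config user.email ars@example.com && git config user.name ars
echo "LEARNING_RATE = 1e-3" > train.py
git add train.py && git commit -qm "baseline"                # KEEP: advances the branch
echo "LEARNING_RATE = 3e-4" > train.py
git add train.py && git commit -qm "experiment: lower lr"    # candidate change
git revert -n HEAD && git commit -qm "revert: regressed"     # DISCARD: revert the commit
cat train.py                                                 # back to the baseline value
```

Keeps simply leave their commit in place; discards are undone with a revert, so the full experiment history stays inspectable in the log.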
🔒 prepare.py
# Read-only during experiments
TIME_BUDGET = 20  # seconds

def download_data():
    # downloads MNIST (or your data)
    ...

def load_data():
    # returns (train_X, train_y,
    #          test_X, test_y)
    ...

def evaluate_accuracy(model, X, y):
    # fixed eval — agent can't game it
    ...
Data loading + evaluation harness. Read-only during experiments so the agent can't game the metric.
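A fixed evaluator like `evaluate_accuracy` can be very small. A minimal sketch, assuming numpy arrays and a model exposing a `predict()` method (your harness may differ; `ConstantModel` is a toy stand-in invented here for demonstration):

```python
import numpy as np

def evaluate_accuracy(model, X, y):
    """Fraction of correct predictions on a held-out set (illustrative)."""
    preds = model.predict(X)  # assumption: model exposes predict()
    return float(np.mean(preds == y))

class ConstantModel:
    """Toy stand-in that always predicts class 0."""
    def predict(self, X):
        return np.zeros(len(X), dtype=int)

y = np.array([0, 0, 1, 0])
acc = evaluate_accuracy(ConstantModel(), np.zeros((4, 2)), y)
print(f"val_accuracy: {acc:.6f}")  # val_accuracy: 0.750000
```

Because this function lives in the read-only file, the agent can optimize the metric only by improving the model, not by editing the yardstick.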
βš™οΈ autoresearch.yaml
project:
  name: my-autoresearch
  goal: Get the highest val_accuracy.

files:
  editable: [train.py]
  readonly: [prepare.py]

experiment:
  run_command: python train.py
  timeout: 60

metric:
  name: val_accuracy
  pattern: "^val_accuracy:\\s+([\\d.]+)"
  direction: maximize
Config: metric, timeout, files. API key and dashboard URL are set automatically by ars init.
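The metric is scraped from the run's stdout using the `pattern` field above. A minimal Python sketch of that extraction, using the exact regex from the config (the surrounding variable names are assumptions):

```python
import re

# Same pattern as metric.pattern in autoresearch.yaml
PATTERN = r"^val_accuracy:\s+([\d.]+)"

stdout = "epoch 5/5\nval_accuracy: 0.972000\n"
match = re.search(PATTERN, stdout, flags=re.MULTILINE)
score = float(match.group(1)) if match else None
print(score)  # 0.972
```

This is why train.py must keep the `print(f"val_accuracy: {acc:.6f}")` line intact: if the output format drifts, the regex finds no match and the run cannot be scored.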
😴💤

Now go to sleep.
Claude will take it from here.