
Autoresearch: 100+ AI Experiments While You Sleep

4 min read

Key takeaways

  • 630 lines of Python run 100+ experiments autonomously overnight on a single GPU
  • Shopify CEO saw 19% improvement in 8 hours; 0.8B model outperformed larger 1.6B model
  • Paradigm shift: ML researchers write strategy, the AI agent writes the code

Andrej Karpathy, former AI lead at Tesla and co-founder of OpenAI, recently released autoresearch: an open source tool that lets AI agents run hundreds of machine learning experiments fully autonomously overnight. The project reached 30,000 GitHub stars in a single week, and the results show that autonomous research is no longer science fiction.


What is autoresearch?

Autoresearch is a minimalist Python framework of just 630 lines of code. The premise is simple: you write a strategy description in a file called program.md, and an AI agent takes over from there. The agent reads the training code, proposes changes, runs experiments, evaluates results and decides whether the changes are worth keeping.

What makes autoresearch particularly effective is that the entire codebase fits within the context window of a large language model. This means the AI agent can understand the whole system at once, which drastically reduces errors compared to working with large, complex codebases.

How the autonomous loop works

Autoresearch follows a tight, repeating cycle:

  1. You write strategy: A program.md file describes what the agent should optimize, such as architecture changes, hyperparameters or training setup
  2. The agent reads the code: It analyzes train.py (the training script) and understands the entire system
  3. The agent proposes changes: Based on the strategy, it modifies the code
  4. 5 minutes of training: Each experiment trains for a fixed 5 minutes, whatever the hardware, so all results within a run are directly comparable
  5. Automatic evaluation: The agent measures validation bits-per-byte (val_bpb) and determines whether the change is an improvement
  6. Git commit or discard: Improvements are committed; worse results are discarded via git reset
  7. Repeat: The loop continues autonomously, roughly 12 experiments per hour

That means approximately 100 experiments overnight, with zero human intervention.
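The loop above can be sketched in a few lines of Python. This is a hedged illustration of the control flow only, not the project's actual code: the helper functions are stubs standing in for the LLM agent, the 5-minute training run, and the git operations.

```python
import random

# Hypothetical sketch of the autoresearch loop described above.
# A real run would call an LLM agent, execute train.py, and use git;
# here each step is stubbed so only the control flow remains.

def propose_change(rng):
    """Stand-in for the agent editing train.py per program.md."""
    return rng.uniform(-0.02, 0.01)  # pretend effect on val_bpb

def run_training():
    """Stand-in for the fixed 5-minute training budget."""
    pass

def evaluate(current_bpb, delta):
    """Stand-in for measuring validation bits-per-byte (val_bpb)."""
    return current_bpb + delta

def autoresearch_loop(n_experiments=100, seed=0):
    rng = random.Random(seed)
    best_bpb = 1.00  # baseline val_bpb (illustrative number)
    kept = 0
    for _ in range(n_experiments):
        delta = propose_change(rng)            # 3. agent proposes a change
        run_training()                         # 4. 5-minute training run
        candidate = evaluate(best_bpb, delta)  # 5. automatic evaluation
        if candidate < best_bpb:               # 6. lower val_bpb is better:
            best_bpb = candidate               #    keep it (git commit)
            kept += 1
        # otherwise discard the change (git reset) and continue  # 7. repeat
    return best_bpb, kept
```

The key design point is step 6: because every change is evaluated against a single scalar metric, the loop never needs a human in it to decide what to keep.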

Results that speak for themselves

Karpathy tested autoresearch on his own nanochat project (a compact but complete GPT training setup), and the results are striking:

  • One night: 126 experiments run autonomously. Validation bits-per-byte (val_bpb) improved from 0.9979 to 0.9697
  • Two days: ~700 changes processed, ~20 additive improvements found that transferred to larger models. 11% efficiency improvement on the "Time to GPT-2" benchmark (from 2.02 hours to 1.80 hours)
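The percentages can be checked directly from the reported figures; the short calculation below reproduces the ~11% "Time to GPT-2" number:

```python
# Sanity check of the reported improvements above.
val_bpb_before, val_bpb_after = 0.9979, 0.9697
bpb_gain = (val_bpb_before - val_bpb_after) / val_bpb_before

hours_before, hours_after = 2.02, 1.80
time_gain = (hours_before - hours_after) / hours_before

print(f"val_bpb improvement: {bpb_gain:.1%}")          # prints 2.8%
print(f"Time-to-GPT-2 improvement: {time_gain:.1%}")   # prints 10.9%, the ~11% cited
```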

But the most remarkable example comes from Shopify. CEO Tobi Lutke tested autoresearch on an internal 0.8 billion parameter model:

  • 37 experiments in 8 hours (overnight)
  • 19% improvement in model quality
  • The improved 0.8B model outperformed the previous 1.6B model it was meant to replace

In other words: a model half the size became better than the original, purely through automated optimization. The implications for cost and resource usage are significant.

From coding to strategy: A paradigm shift

Autoresearch represents something bigger than a useful tool. It marks a shift in how machine learning research is conducted.

Traditionally, ML research has required researchers to manually write and adjust training code, run experiments, analyze results and iterate. It is time-consuming and limited by how many hours a human has in a day.

With autoresearch, the roles change:

  • The human becomes the strategist who defines what to explore in program.md
  • The AI agent becomes the executor who writes code, runs experiments and reports results

Research quality is now largely determined by how well the human formulates the strategy document. That is a skill closer to product management and research strategy than traditional programming.
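As an illustration, a strategy document might look something like the sketch below. The structure and field names here are hypothetical, invented for this example rather than taken from the project's documentation:

```markdown
# program.md — research strategy (hypothetical example)

## Goal
Reduce validation bits-per-byte (val_bpb) on the nanochat training setup.

## Allowed changes
- Architecture tweaks (attention variants, normalization, activations)
- Optimizer and learning-rate schedule
- Batch size and data pipeline

## Constraints
- Each experiment must fit the 5-minute training budget on one GPU
- Keep train.py small enough to fit in the agent's context window
```

The quality of a night's run depends on how well this file balances freedom (what the agent may try) against constraints (what must stay fixed).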

What does this mean for Norwegian businesses?

For Norwegian businesses that train or fine-tune AI models, autoresearch is directly relevant:

  • Cost reduction: When a 0.8B model can beat a 1.6B model after automated optimization, it means lower GPU costs and faster inference
  • Accessibility: Autoresearch runs on a single GPU. You do not need a large data center or access to hundreds of GPUs
  • Round-the-clock research: Researchers and developers can start experiments at the end of the workday and analyze results the next morning
  • Competency shift: ML teams spend less time on code and more time on strategy, experiment design and result analysis

For businesses considering building or adapting their own AI models, whether for Norwegian language technology, domain-specific chatbots or internal tools, autoresearch significantly lowers the barrier.

If you want to explore how autonomous AI agents can automate processes in your business, this is a good example of the direction technology is moving.

FAQ: Autoresearch and autonomous AI research

What does it cost to run autoresearch?

Autoresearch is free and open source. The main cost is GPU time (one GPU, e.g. NVIDIA H100) and API calls to a language model (Claude, GPT or similar) that serves as the agent. An overnight run of 100 experiments typically costs a few hundred NOK in API usage.

How long does it take to set up autoresearch?

Setup is minimal. You clone the GitHub repo, run a preparation script and define your program.md. For someone with basic Python experience, it takes less than an hour to get started.

Can autoresearch be used for things other than language models?

Today, autoresearch is designed for GPT-style training setups. The principle, an AI agent that autonomously runs experiments with a clear evaluation metric, can be adapted to other domains over time. The community has already started creating adaptations for other platforms.

Do you need an expensive GPU to use autoresearch?

Autoresearch has been tested on NVIDIA H100, but the design (5-minute training budget) makes it adaptable to different GPUs. Simpler GPUs yield fewer training steps per experiment, but the principle works. Cloud-based GPU services are an alternative for businesses without their own hardware.

Does autoresearch replace the need for ML engineers?

No. Autoresearch automates repetitive experiment execution, but still requires someone to define the research strategy, evaluate results and make decisions about the path forward. The ML engineer shifts role from code writer to research strategist.

What is the difference between autoresearch and traditional hyperparameter tuning?

Traditional hyperparameter tuning systematically searches through a predefined parameter space. Autoresearch is more flexible: the AI agent can change architecture, optimizers, batch sizes and training setup, not just adjust numerical values. It is closer to what a human researcher would do, just faster and more persistent.
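The contrast can be made concrete. A traditional grid search enumerates a fixed, predefined space, as in this minimal sketch; nothing outside the grid can ever be discovered:

```python
from itertools import product

# Traditional hyperparameter tuning: try every combination from a
# predefined space. The search can only find points inside this grid.
grid = {
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "batch_size": [32, 64],
}
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))  # prints 6: the search space is fixed in advance

# An autoresearch-style agent is not limited to such a grid: it can also
# rewrite the architecture, swap the optimizer, or restructure the
# training setup itself, guided by the strategy in program.md.
```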
