
Autoresearch: 100+ AI Experiments While You Sleep

4 min read

Key takeaways

  • 630 lines of Python run 100+ experiments autonomously overnight on a single GPU
  • Shopify CEO saw 19% improvement in 8 hours; 0.8B model outperformed larger 1.6B model
  • Paradigm shift: ML researchers write strategy, the AI agent writes the code

Andrej Karpathy, former AI lead at Tesla and co-founder of OpenAI, recently released autoresearch: an open source tool that lets AI agents run hundreds of machine learning experiments fully autonomously overnight. The project reached 30,000 GitHub stars in a single week, and the results show that autonomous research is no longer science fiction.


What is autoresearch?

Autoresearch is a minimalist Python framework of just 630 lines of code. The premise is simple: you write a strategy description in a file called program.md, and an AI agent takes over from there. The agent reads the training code, proposes changes, runs experiments, evaluates results and decides whether the changes are worth keeping.

What makes autoresearch particularly effective is that the entire codebase fits within the context window of a large language model. This means the AI agent can understand the whole system at once, which drastically reduces errors compared to working with large, complex codebases.

How the autonomous loop works

Autoresearch follows a tight, repeating cycle:

  1. You write strategy: A program.md file describes what the agent should optimize, such as architecture changes, hyperparameters or training setup
  2. The agent reads the code: It analyzes train.py (the training script) and understands the entire system
  3. The agent proposes changes: Based on the strategy, it modifies the code
  4. 5 minutes of training: Each experiment trains for a fixed 5 minutes, whatever the hardware, so all results within a run are directly comparable
  5. Automatic evaluation: The agent measures validation bits-per-byte (val_bpb) and determines whether the change is an improvement
  6. Git commit or discard: Improvements are committed; worse results are discarded via git reset
  7. Repeat: The loop continues autonomously, roughly 12 experiments per hour

That means approximately 100 experiments overnight, with zero human intervention.
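The loop above can be sketched in a few lines of Python. This is a hedged illustration of the control flow only, not the project's actual code: the helper functions are stubs standing in for the LLM agent, the 5-minute training run, and the git operations.

```python
import random

# Hypothetical sketch of the autoresearch loop described above.
# A real run would call an LLM agent, execute train.py, and use git;
# here each step is stubbed so only the control flow remains.

def propose_change(rng):
    """Stand-in for the agent editing train.py per program.md."""
    return rng.uniform(-0.02, 0.01)  # pretend effect on val_bpb

def run_training():
    """Stand-in for the fixed 5-minute training budget."""
    pass

def evaluate(current_bpb, delta):
    """Stand-in for measuring validation bits-per-byte (val_bpb)."""
    return current_bpb + delta

def autoresearch_loop(n_experiments=100, seed=0):
    rng = random.Random(seed)
    best_bpb = 1.00  # baseline val_bpb (illustrative number)
    kept = 0
    for _ in range(n_experiments):
        delta = propose_change(rng)            # 3. agent proposes a change
        run_training()                         # 4. 5-minute training run
        candidate = evaluate(best_bpb, delta)  # 5. automatic evaluation
        if candidate < best_bpb:               # 6. lower val_bpb is better:
            best_bpb = candidate               #    keep it (git commit)
            kept += 1
        # otherwise discard the change (git reset) and continue  # 7. repeat
    return best_bpb, kept
```

The key design point is step 6: because every change is evaluated against a single scalar metric, the loop never needs a human in it to decide what to keep.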

Results that speak for themselves

Karpathy tested autoresearch on his own nanochat project (a compact but complete GPT training setup), and the results are striking:

  • One night: 126 experiments run autonomously. Validation bits-per-byte (val_bpb) improved from 0.9979 to 0.9697
  • Two days: ~700 changes processed, ~20 additive improvements found that transferred to larger models. 11% efficiency improvement on the "Time to GPT-2" benchmark (from 2.02 hours to 1.80 hours)
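The percentages can be checked directly from the reported figures; the short calculation below reproduces the ~11% "Time to GPT-2" number:

```python
# Sanity check of the reported improvements above.
val_bpb_before, val_bpb_after = 0.9979, 0.9697
bpb_gain = (val_bpb_before - val_bpb_after) / val_bpb_before

hours_before, hours_after = 2.02, 1.80
time_gain = (hours_before - hours_after) / hours_before

print(f"val_bpb improvement: {bpb_gain:.1%}")          # prints 2.8%
print(f"Time-to-GPT-2 improvement: {time_gain:.1%}")   # prints 10.9%, the ~11% cited
```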

But the most remarkable example comes from Shopify. CEO Tobi Lutke tested autoresearch on an internal 0.8 billion parameter model:

  • 37 experiments in 8 hours (overnight)
  • 19% improvement in model quality
  • The improved 0.8B model outperformed the previous 1.6B model it was meant to replace

In other words: a model half the size became better than the original, purely through automated optimization. The implications for cost and resource usage are significant.

From coding to strategy: A paradigm shift

Autoresearch represents something bigger than a useful tool. It marks a shift in how machine learning research is conducted.

Traditionally, ML research has required researchers to manually write and adjust training code, run experiments, analyze results and iterate. It is time-consuming and limited by how many hours a human has in a day.

With autoresearch, the roles change:

  • The human becomes the strategist who defines what to explore in program.md
  • The AI agent becomes the executor who writes code, runs experiments and reports results

Research quality is now largely determined by how well the human formulates the strategy document. That is a skill closer to product management and research strategy than traditional programming.
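As an illustration, a strategy document might look something like the sketch below. The structure and field names here are hypothetical, invented for this example rather than taken from the project's documentation:

```markdown
# program.md — research strategy (hypothetical example)

## Goal
Reduce validation bits-per-byte (val_bpb) on the nanochat training setup.

## Allowed changes
- Architecture tweaks (attention variants, normalization, activations)
- Optimizer and learning-rate schedule
- Batch size and data pipeline

## Constraints
- Each experiment must fit the 5-minute training budget on one GPU
- Keep train.py small enough to fit in the agent's context window
```

The quality of a night's run depends on how well this file balances freedom (what the agent may try) against constraints (what must stay fixed).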

What does this mean for Norwegian businesses?

For Norwegian businesses that train or fine-tune AI models, autoresearch is directly relevant:

  • Cost reduction: When a 0.8B model can beat a 1.6B model after automated optimization, it means lower GPU costs and faster inference
  • Accessibility: Autoresearch runs on a single GPU. You do not need a large data center or access to hundreds of GPUs
  • Round-the-clock research: Researchers and developers can start experiments at the end of the workday and analyze results the next morning
  • Competency shift: ML teams spend less time on code and more time on strategy, experiment design and result analysis

For businesses considering building or adapting their own AI models, whether for Norwegian language technology, domain-specific chatbots or internal tools, autoresearch significantly lowers the barrier.

If you want to explore how autonomous AI agents can automate processes in your business, this is a good example of the direction technology is moving.

FAQ: Autoresearch and autonomous AI research

What does it cost to run autoresearch?

Autoresearch is free and open source. The main cost is GPU time (one GPU, e.g. NVIDIA H100) and API calls to a language model (Claude, GPT or similar) that serves as the agent. An overnight run of 100 experiments typically costs a few hundred NOK in API usage.

How long does it take to set up autoresearch?

Setup is minimal. You clone the GitHub repo, run a preparation script and define your program.md. For someone with basic Python experience, it takes less than an hour to get started.

Can autoresearch be used for things other than language models?

Today, autoresearch is designed for GPT-style training setups. The principle, an AI agent that autonomously runs experiments with a clear evaluation metric, can be adapted to other domains over time. The community has already started creating adaptations for other platforms.

Do you need an expensive GPU to use autoresearch?

Autoresearch has been tested on NVIDIA H100, but the design (5-minute training budget) makes it adaptable to different GPUs. Simpler GPUs yield fewer training steps per experiment, but the principle works. Cloud-based GPU services are an alternative for businesses without their own hardware.

Does autoresearch replace the need for ML engineers?

No. Autoresearch automates repetitive experiment execution, but still requires someone to define the research strategy, evaluate results and make decisions about the path forward. The ML engineer shifts role from code writer to research strategist.

What is the difference between autoresearch and traditional hyperparameter tuning?

Traditional hyperparameter tuning systematically searches through a predefined parameter space. Autoresearch is more flexible: the AI agent can change architecture, optimizers, batch sizes and training setup, not just adjust numerical values. It is closer to what a human researcher would do, just faster and more persistent.
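The contrast can be made concrete. A traditional grid search enumerates a fixed, predefined space, as in this minimal sketch; nothing outside the grid can ever be discovered:

```python
from itertools import product

# Traditional hyperparameter tuning: try every combination from a
# predefined space. The search can only find points inside this grid.
grid = {
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "batch_size": [32, 64],
}
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))  # prints 6: the search space is fixed in advance

# An autoresearch-style agent is not limited to such a grid: it can also
# rewrite the architecture, swap the optimizer, or restructure the
# training setup itself, guided by the strategy in program.md.
```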
