On March 6, 2026, Andrej Karpathy published a 630-line Python script on GitHub. Within a week it had 30,000 stars. Not because it looked spectacular, but because it broke a paradigm.
The project is called autoresearch. It changes everything we thought we knew about optimisation.
What is autoresearch?
The idea is almost offensively simple.
You give an AI agent a piece of code, a measurable metric, and a five-minute time budget. The agent reads the code, thinks of an improvement, modifies the code, runs the experiment, and checks whether the metric improved. If yes: keep it. If no: discard it. And repeat.
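In code, the whole pattern fits on one screen. A minimal sketch of the idea, not Karpathy's exact script: the names agent_edit and measure_metric are mine, standing in for an LLM call and your benchmark run.

```python
import shutil

def autoresearch(code_path, measure_metric, agent_edit, n_experiments=100):
    """Keep-if-better loop: one experiment per iteration."""
    best = measure_metric(code_path)                    # baseline score
    for _ in range(n_experiments):
        shutil.copy(code_path, code_path + ".bak")      # snapshot the current best
        agent_edit(code_path, best)                     # agent reads the code, writes a change
        score = measure_metric(code_path)               # run the experiment
        if score > best:
            best = score                                # improved: keep the change
        else:
            shutil.copy(code_path + ".bak", code_path)  # regressed: roll it back
    return best
```

Everything that follows in this article is a variation on these dozen lines. Swap out measure_metric and you have a new application.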
Twelve experiments per hour. A hundred experiments per night. While you sleep.
Karpathy originally used it to train a small language model. The agent discovered optimisations he had missed in twenty years of manual research. Shopify CEO Tobi Lutke aimed the same loop at an internal search model. The result: a 0.8-billion-parameter model that scored 19% better than its 1.6-billion-parameter predecessor. Smaller, faster, better. Without human intervention.
The pattern escaped machine learning within a week.
A/B testing on steroids? No. The end of A/B testing
Let's be honest about how A/B testing works today.
Your marketing team comes up with a hypothesis. "Maybe a green button converts better than a blue one." Someone creates a ticket. A developer builds the variant. A tool like VWO or Optimizely splits the traffic. After two weeks you have statistical significance. Or not. Then you start again.
Thirty experiments per year. If you're lucky.
Autoresearch runs a hundred experiments per night.
The real difference runs deeper than speed. A/B testing tools test choices that humans come up with. Green versus blue. Text A versus text B. Human imagination is the bottleneck.
An autoresearch loop comes up with its own hypotheses. The agent looks at the current code, combines that with what did and didn't work before, and proposes improvements no human would think of. Not just the colour of a button, but the entire structure of the page. The order of elements. The way JavaScript is loaded. The dimensions of images. Everything at once.
This is not a better version of VWO. This is a different category.
Faster code. Every night. Automatically
Here it gets concrete for anyone running a website.
There is a variant called pi-autoresearch that applies the pattern to web performance. You aim the loop at Lighthouse scores, bundle size, or build times. The agent modifies your frontend code, runs a Lighthouse audit, checks whether the score improved, and continues.
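What does that metric look like in practice? A sketch of one possible measure_metric, assuming the Lighthouse CLI is installed (npm install -g lighthouse) and your site runs locally. This illustrates the idea; it is not pi-autoresearch's actual code.

```python
import json
import subprocess

def lighthouse_performance(url="http://localhost:3000"):
    """Run a headless Lighthouse audit and return the performance score (0-100)."""
    subprocess.run(
        ["lighthouse", url,
         "--output=json", "--output-path=report.json",
         "--quiet", "--chrome-flags=--headless"],
        check=True,
    )
    with open("report.json") as f:
        report = json.load(f)
    return report["categories"]["performance"]["score"] * 100
```

Plug this into the loop above and the agent edits your frontend code instead of a training script. Nothing else changes.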
Imagine this. You go home on Friday afternoon. On Monday morning your Lighthouse score has gone from 72 to 94. Not because someone worked through the weekend, but because an agent ran 200 experiments of which 15 actually produced improvements that stacked on top of each other.
Tobi Lutke did something similar with Shopify's Liquid templating engine. The agent ran 120 experiments and landed 93 commits with improvements. The result: 53% faster parsing, 61% less memory usage. The agent discovered that a simple switch from regex to direct byte matching delivered a 12% speedup: an optimisation that had been within reach for years but that nobody had picked up.
This is not about "we need to make our site faster." This is about a world where your site automatically gets faster every night, where Core Web Vitals are no longer a project but an ongoing process running in the background.
Code gets cheaper, structurally
The implications for development costs are enormous.
A senior developer costs at least €100 per hour. Performance optimisation is specialist work. A two-week sprint focused on improving load times easily costs €15,000 to €20,000. And then you might have found five improvements.
Autoresearch finds twenty in a night. On a GPU that costs less than your development team's coffee budget.
But it goes further than performance alone. Harrison Chase, the founder of LangChain, built a variant within days in which an agent optimises the code of another agent. Agent-on-agent optimisation. The metric: an evaluation score. The loop: endless.
An ever-larger portion of optimisation work is shifting from human expertise to compute. Not the creative work. Not deciding what to build. But the endless grinding of making something better, faster, and more efficient. That is now compute. And compute gets cheaper every day.
Six applications nobody is thinking about yet
So far the conversation about autoresearch has mainly been about machine learning, web performance, and marketing. Understandably: those are the first use cases.
But the pattern is universal. Anything you can measure, you can put into an autoresearch loop. And that opens doors that almost nobody sees yet.
1. Token optimisation: programming that makes itself cheaper
This may be the most meta application you can imagine. AI agents that write code consume tokens. Every token costs money. With Claude Sonnet you pay per million input and output tokens. With a serious codebase of thousands of files those costs add up quickly. A complex refactoring can easily consume hundreds of thousands of tokens in one session.
But how many of those tokens are really necessary?
Imagine an autoresearch loop aimed at reducing token consumption in code generation. The metric: the number of tokens needed to correctly execute a defined set of programming tasks. The agent adjusts the system prompt, optimises the structure of instructions, experiments with more compact code patterns and smarter context window strategies.
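A sketch of what that metric could look like. Here run_agent is a hypothetical wrapper around your LLM API that returns the generated code plus the token count the API reports; the key design choice is that a cheaper run only counts if it stays correct.

```python
def token_cost(system_prompt_path, tasks):
    """Total tokens to solve a fixed task suite; wrong answers disqualify the run."""
    with open(system_prompt_path) as f:
        system_prompt = f.read()
    total = 0
    for task in tasks:
        code, tokens_used = run_agent(system_prompt, task.description)  # hypothetical LLM wrapper
        if not task.tests_pass(code):
            return float("inf")     # cheaper-but-wrong must never win
        total += tokens_used
    return total                    # lower is better; negate it if your loop maximises
```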
The implications are dizzying. What if the loop discovers that a particular way of structuring functions requires 30% fewer tokens when generating comparable code? Or that a specific prompting strategy forces the agent to write shorter but equally correct solutions?
This is AI optimising itself to program more efficiently. Every round it gets cheaper to run the next round.
Prompt optimisation tools like GEPA from ICLR 2026 use genetic evolution to improve prompts on frozen models. But autoresearch goes a step further: it optimises not just the prompt, but the entire workflow. How code is structured, how context is presented, how tasks are divided. Everything that affects token consumption is fair game.
2. Legal contracts: clauses that minimise risk
Large companies have thousands of active contracts. Each contract contains clauses that distribute risk: liability limitations, penalty clauses, warranty periods. Which combination of clauses best protects against financial loss is currently the domain of expensive lawyers working from experience and intuition.
But contractual risk is measurable. You can quantify historical claims, disputes and outcomes. The metric becomes expected financial exposure per contract type.
An autoresearch loop can generate variants of standard clauses, simulate them against historical dispute data, and find the combination that minimises financial exposure, with a guardrail that every clause must remain legally valid, enforced by a compliance check.
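A sketch of the metric, where simulate_dispute and passes_compliance_check are hypothetical placeholders for a dispute-replay model and the legal guardrail:

```python
def expected_exposure(clause_set, historical_disputes):
    """Mean simulated financial loss per dispute; invalid clauses are a hard fail."""
    if not passes_compliance_check(clause_set):   # hypothetical legal-validity guardrail
        return float("inf")
    losses = [simulate_dispute(clause_set, d)     # hypothetical replay of one past dispute
              for d in historical_disputes]
    return sum(losses) / len(losses)
```

The float("inf") is the important part: constraints become scores so bad that the loop can never prefer them.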
No lawyer thinks of a hundred variants of a liability clause. An agent does. And does it in a night.
3. Supply chain routing: logistics that optimises itself
Logistics companies optimise routes with software. But current systems work with fixed algorithms and predefined constraints. They find local optima. Not global ones.
Autoresearch can approach this fundamentally differently. The metric: total transport costs per delivered unit, including fuel, time, personnel costs and CO₂ levies. The agent adjusts the routing logic — not just the routes themselves but the rules by which routes are determined. Simulates a week of deliveries. Checks the metric. Repeats.
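A sketch of that metric, with simulate_week as a hypothetical simulator that replays a week of historical orders through whatever routing rules the agent has written:

```python
def cost_per_unit(routing_module, weekly_orders, hourly_wage=35.0, co2_levy_per_kg=0.10):
    """Total transport cost per delivered unit over one simulated week."""
    plan = simulate_week(routing_module, weekly_orders)   # hypothetical logistics simulator
    total = (plan.fuel_cost
             + plan.driver_hours * hourly_wage
             + plan.co2_kg * co2_levy_per_kg)
    return total / plan.units_delivered                   # lower is better
```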
The difference from existing route optimisation is that the agent may rewrite the rules themselves. Perhaps it discovers that splitting certain deliveries across two smaller vehicles is cheaper than one large transport. Or that shifting deliveries to night hours in certain regions reduces total costs by 8%.
4. Chips and data centres: hardware that learns to consume less energy
Data centres already consume more electricity than some countries. AI workloads' energy consumption is expected to double or triple in the coming years.
But here lies an enormous opportunity. Because the energy consumption of chips and data centres is largely a software problem.
Take GPU kernels. These are the small pieces of code running on the chip. How those kernels are written determines how much energy a calculation costs. A project called AutoKernel applies the autoresearch pattern to GPU kernel optimisation. The agent profiles which kernels consume the most energy, rewrites them, benchmarks the result and repeats.
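A sketch of how such a metric could guard correctness while chasing energy. Here profile_energy is a hypothetical measurement hook (say, sampled GPU power draw); a kernel that produces wrong results is disqualified before its energy even counts.

```python
import numpy as np

def kernel_score(candidate_kernel, reference_kernel, test_inputs):
    """Energy per correct run; wrong output disqualifies the kernel entirely."""
    for x in test_inputs:
        if not np.allclose(candidate_kernel(x), reference_kernel(x)):
            return float("inf")                           # wrong results: hard fail
    return profile_energy(candidate_kernel, test_inputs)  # hypothetical, in joules; lower is better
```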
A data centre has thousands of configuration parameters. Cooling algorithms, workload schedulers, power management, thermal models. These systems are currently tuned by engineers based on best practices and manual tuning. But the interactions between those systems are so complex that no human can find the optimal point.
The finest irony of all: AI deploying itself to make AI cheaper and greener.
5. Energy networks: balancing that learns itself
The electricity grid is becoming increasingly complex. Solar panels, wind turbines, home batteries, electric cars charging and discharging. The balance between supply and demand must be right in real time.
The metric is clear: minimum cost of energy balancing, with the constraint that grid frequency stays within tolerances. An autoresearch loop can simulate and optimise the control logic of an energy network. When does your home battery switch from charging to supplying? At what price threshold do you ask large industrial consumers to reduce their consumption?
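A sketch of that metric, with run_grid_sim as a hypothetical simulator that replays a day of historical supply and demand through the agent's control logic (50 Hz being the European grid frequency):

```python
def balancing_cost(control_logic, historical_day, tolerance_hz=0.2):
    """Cost of balancing one replayed day; frequency excursions are a hard fail."""
    result = run_grid_sim(control_logic, historical_day)   # hypothetical grid simulator
    if max(abs(f - 50.0) for f in result.frequency_trace) > tolerance_hz:
        return float("inf")          # out of tolerance: unacceptable, whatever it saves
    return result.balancing_cost_eur # lower is better
```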
This is the kind of problem with thousands of interacting variables that is impossible for humans to oversee. But perfect for a loop that tests a new configuration every five minutes.
The energy transition is not just a hardware problem. It is a software optimisation problem. And autoresearch is built for this type of challenge.
6. Education personalisation: learning paths that improve themselves
Educational platforms have struggled with personalisation for years. Which order of learning material leads to the best learning outcomes? Which combination of video, text, and exercises works for which type of learner?
The metric: score on a standardised test after completing a module. The agent adjusts the order, difficulty level, and mix of content. Simulates learning paths based on historical student data. Checks whether the average test score improves without the dropout rate rising.
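A sketch of that metric, with simulate_cohort as a hypothetical model that estimates outcomes from historical student data. Note the dropout constraint: a learning path that scores higher by losing weaker students must never win.

```python
def learning_path_score(path_config, historical_students, max_dropout=0.15):
    """Mean test score of a simulated cohort; rising dropout is a hard fail."""
    result = simulate_cohort(path_config, historical_students)  # hypothetical replay on past data
    if result.dropout_rate > max_dropout:
        return float("-inf")         # better scores don't count if more students quit
    return result.mean_test_score    # higher is better
```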
An autoresearch loop can test thousands of variants of a learning path and discover that a counterintuitive order — for example, first a difficult exercise and then the theory — leads to better results.
Education is too important to base only on intuition.
The pattern is the product
This is what most people don't see yet.
Autoresearch is not a tool. It is a pattern. A recipe. Take a modifiable file, define a measurable metric, set an AI agent on it, and let the loop run.
Karpathy himself summarised it:
You no longer write the code. You write the markdown that tells the agent how to write the code. The human becomes the meta-researcher. The strategy is yours. The tactics belong to the machine.
The implications are enormous. Every company that is trying to improve a number somewhere now has access to a method that experiments a hundred times faster than a human team. Not in ten years. Now. The script is on GitHub. MIT licence. 630 lines of Python.
The question is not whether this pattern will impact your industry.
The question is: where do you aim the loop?
