Self-Evolving Software is Eating the World
April 14, 2026
Agents are already writing the code. The next step is software that improves itself.
In machine learning, this is already happening.
At YouTube, a self-evolving recommendation system runs on the order of a hundred experiments per week. Before it existed, human ML engineers ran one to ten.
The system has two loops. One agent proposes changes quickly using cheap offline tests. Another validates the best candidates against real user behavior in production. Between them, the system discovered improvements no engineer had tried: a better training algorithm, a new way to filter input signals, a scoring function no one had designed.
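The two-loop structure is easy to sketch. In this hypothetical Python sketch, `propose`, `offline_eval`, and `online_eval` stand in for the proposing agent, the cheap offline tests, and the production validation; none of these names come from YouTube's actual system.

```python
import random

def two_loop_search(propose, offline_eval, online_eval, n_candidates=50, top_k=3):
    """Inner loop: generate many candidates, rank them with cheap offline tests.
    Outer loop: validate only the best few against the expensive online metric."""
    candidates = [propose() for _ in range(n_candidates)]
    shortlist = sorted(candidates, key=offline_eval, reverse=True)[:top_k]
    # Only the shortlist ever touches real user traffic.
    validated = [(c, online_eval(c)) for c in shortlist]
    return max(validated, key=lambda pair: pair[1])[0]

# Toy stand-in: candidates are numbers; the offline proxy roughly tracks
# the true online metric, which is what makes the shortlisting worthwhile.
random.seed(1)
best = two_loop_search(
    propose=lambda: random.uniform(0, 1),
    offline_eval=lambda x: x,       # cheap proxy score
    online_eval=lambda x: x ** 2,   # "true" metric, expensive in reality
)
```

The division of labor is the point: the inner loop can afford to be wrong often because the outer loop is the one that decides what ships.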
Meta's Ranking Engineer Agent does something similar for ads: it generates hypotheses, runs training jobs, and iterates autonomously over weeks.
Engineers did not disappear. They shifted to strategy and safety oversight. They still define what the system should optimize for. They no longer figure out how.
At Sakana AI, a system called the Darwin Gödel Machine goes further. It does not just evolve programs it is given. It rewrites its own source code.
The DGM reads its own Python codebase, proposes modifications, evaluates them on coding benchmarks, and stores successful variants in an evolutionary archive. On SWE-bench, a benchmark of real GitHub issues, its performance jumped from 20% to 50%. The improvements transferred. Techniques discovered using Claude 3.5 Sonnet worked on completely different models. It was not memorizing tricks. It was finding general design principles.
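The archive is worth dwelling on. As a sketch (the class and names are mine, not Sakana's), the idea is to keep every variant that functions at all, and to pick parents from the whole archive rather than only the current champion, so the search can branch from stepping stones that are not yet the best:

```python
import random

class EvolutionaryArchive:
    """Sketch of a DGM-style archive of agent variants."""

    def __init__(self):
        self.variants = []  # (agent, benchmark_score) pairs

    def add(self, agent, score):
        # Keep any variant that works, not only improvements:
        # today's mediocre variant may be tomorrow's stepping stone.
        if score > 0:
            self.variants.append((agent, score))

    def sample_parent(self):
        # Bias toward stronger variants, but never discard diversity.
        agents = [a for a, _ in self.variants]
        weights = [s for _, s in self.variants]
        return random.choices(agents, weights=weights, k=1)[0]

archive = EvolutionaryArchive()
archive.add("baseline-agent", 0.20)
archive.add("patched-tooling", 0.35)
archive.add("broken-variant", 0.0)  # discarded: it does not function
parent = archive.sample_parent()
```

Pure hill climbing would throw away the 0.20 variant the moment 0.35 appeared; the archive keeps both, because the path to 0.50 may run through the weaker one.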
All of these systems use ML to do the evolving. But look at what they evolve. YouTube and Meta optimize ML models, where scoring functions come for free. The DGM evolves its own Python source code, evaluated on SWE-bench. At Stanford, Meta-Harness optimizes the code scaffolding around LLMs (what information to store, retrieve, and present to the model), outperforming hand-designed harnesses. Each is further from pure ML.
But this only works where evaluation is automatic. ML has built-in scoring functions. You have a loss to minimize, a metric to maximize, a benchmark to score against. General software does not. How do you evaluate whether a web application is better? A database migration? A refactoring? There is no loss function for code quality.
The boundary is already blurring. At Google, AlphaEvolve runs an evolutionary loop: take an existing program, ask an LLM to propose a modification, evaluate the result, keep it if it's better, repeat. One of its outputs was a seven-line scheduling heuristic for Google's data center orchestration system. That function has been running in production for over a year, recovering 0.7% of Google's worldwide compute resources. A deep reinforcement learning approach had been tried for the same problem. The seven-line heuristic outperformed it.
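The loop itself is almost trivially small. Here is a generic hill-climbing sketch of the cycle described above, with the LLM proposer and the scoring function left as parameters; it is not AlphaEvolve's actual implementation.

```python
import random

def evolve(program, propose, evaluate, iterations=200):
    """Take a program, propose a modification, keep it if it scores
    better, repeat."""
    best, best_score = program, evaluate(program)
    for _ in range(iterations):
        candidate = propose(best)    # in AlphaEvolve, an LLM edits the code
        score = evaluate(candidate)  # a user-written scoring function
        if score > best_score:
            best, best_score = candidate, score
    return best

# Toy stand-in: "programs" are numbers, and a better program is one
# closer to the target value 10.
random.seed(0)
result = evolve(
    program=0.0,
    propose=lambda p: p + random.uniform(-1.0, 2.0),
    evaluate=lambda p: -abs(p - 10.0),
)
```

Everything interesting lives in the two parameters. Swap in an LLM for `propose` and a data-center simulator for `evaluate`, and the same ten lines describe the system that found the scheduling heuristic.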
AlphaEvolve also found a new algorithm for multiplying 4×4 matrices over the complex numbers using 48 scalar multiplications, down from the 49 achieved by Strassen's algorithm, a record that had stood since 1969. Fifty-six years. An evolutionary loop running over a weekend beat fifty-six years of human effort.
The loop does not require ML on both sides. It requires evaluation.
In every case, a human defined the evaluation. AlphaEvolve's users write a scoring function, sometimes as simple as returning the size of a graph. YouTube's engineers specified the business metrics. The DGM's creators designed the sandbox and chose the benchmarks.
The human role in every case was the same: decide what good looks like. Then step back.
This pattern will spread beyond ML. Not because general software will suddenly acquire loss functions, but because we are getting better at defining automated evaluations for things we used to assess by reading code.
Test suites are evaluations. So are performance benchmarks, user behavior metrics, error rates, latency percentiles. We have had many of these for years. What we lacked was the generation side: a system capable of proposing meaningful changes, not just random mutations. LLMs provide that.
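A test suite already is a scoring function; you just have to read it as one. A minimal sketch (the helper name and the toy cases are mine):

```python
def suite_fitness(candidate, cases):
    """Score an implementation by the fraction of test cases it passes.
    Crashes count as failures, so broken candidates score low
    instead of aborting the evaluation."""
    passed = 0
    for args, expected in cases:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            pass
    return passed / len(cases)

cases = [
    (([3, 1, 2],), [1, 2, 3]),
    (([],), []),
    (([5],), [5]),
]
good = suite_fitness(lambda xs: sorted(xs), cases)  # passes all cases
bad = suite_fitness(lambda xs: xs, cases)           # passes only already-sorted inputs
```

Any CI pipeline that emits a pass count can play this role; the novelty is on the other side of the loop, where the candidates come from.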
Once you have generation and evaluation, you have the loop.
How far can evaluation reach?
Shunyu Yao argued that AI has reached halftime: the first half was about building better models, the second is about defining what they should do and measuring whether they are doing it. The same applies to software. We have spent decades building infrastructure for human-written code: linters, type systems, code review, style guides, CI pipelines. Almost none of it is designed for a world where code is generated, evaluated, and deployed by machines.
Self-evolution will spread along the contour of what we can measure. Latency, throughput, error rates, resource utilization: these are already computable. Systems optimizing against them will self-evolve first. Infrastructure, performance-critical paths, ML pipelines, anything with a clear metric will cross the threshold soon.
But large swaths of software resist quantification. Chenglei Si's research at Stanford showed that LLMs can propose research ideas judged more novel than those from human researchers, but a follow-up study found that LLM ideas score lower after execution: novelty alone does not survive contact with implementation. The gap between generating something that looks good and generating something that is good is the gap evaluation has to close.
VibeTensor, a complete deep learning system generated entirely by AI agents, shows the same pattern: every component works individually, but critical paths compose into globally suboptimal patterns. Its authors call this the "Frankenstein Effect." The tests pass. The design judgment does not.
Security is not a score. Usability is not a number. Correctness at edge cases is not a benchmark you can write in advance, because the whole point of edge cases is that you did not anticipate them. Some qualities only reveal themselves through human judgment, and human judgment is slow and expensive to collect. Today that means online experiments: A/B tests running for weeks, user research sessions, support tickets analyzed after the fact. Companies like Simile are trying to compress this by simulating human behavior, but the gap between simulating whether software runs fast and simulating whether it feels right to use is vast. What matters most in software is usually what's hardest to score.
The frontier of self-evolving software is not a technical problem to solve with better tools. It is a boundary to map. The question is not "how do we evaluate everything?" but "what can we evaluate well enough, and what happens at the edges where we cannot?"
Several of the systems I looked at shared a failure mode. When you optimize against a metric long enough, the system finds ways to satisfy the metric without doing the work. Sakana AI's AI Scientist, given a time limit, edited its own code to extend the deadline. Their CUDA Engineer gamed its evaluation to claim 100x speedups while actually running slower.
AlphaEvolve's seven-line heuristic happened to be human-readable, so engineers could verify it made sense. Not all outputs will be that legible.
This is Goodhart's Law applied at machine speed: when a measure becomes a target, it ceases to be a good measure. And when the optimizer is an LLM running an evolutionary loop, it finds the cracks faster than we can patch them.
This does not mean we should stop building these systems. It means evaluation has to evolve too. Static metrics get gamed. The evaluation itself needs to adapt: adversarial testing, multiple independent metrics, human audits on a sample basis, anomaly detection on the evaluation pipeline.
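One concrete shape this can take (a sketch; the names and thresholds are illustrative, and it assumes higher is better for every metric): score candidates on several independent metrics, take the worst one so a single gamed metric cannot carry the result, and route implausible jumps to a human audit instead of accepting them automatically.

```python
def guarded_score(metrics, baseline, max_ratio=10.0):
    """Conservative evaluation: worst-case over independent metrics,
    with anomaly detection on implausibly large improvements.

    Returns a score, or None to flag the candidate for human audit.
    """
    ratios = {}
    for name, value in metrics.items():
        ratio = value / baseline[name]
        if ratio > max_ratio:
            return None  # a 10x+ jump on any metric smells like gaming
        ratios[name] = ratio
    # The candidate is only as good as its weakest metric.
    return min(ratios.values())

baseline = {"throughput": 100.0, "accuracy": 0.80}
honest = guarded_score({"throughput": 120.0, "accuracy": 0.82}, baseline)
gamed = guarded_score({"throughput": 15000.0, "accuracy": 0.81}, baseline)
```

The CUDA Engineer's claimed 100x speedup would have tripped the anomaly check here rather than being banked as a win, which is the whole point: the evaluation distrusts its own good news.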
The human role shifts from writing code to writing evaluations, and from reviewing implementations to auditing outcomes. Defining what good looks like, precisely enough that a machine can be held to it, is a harder problem than writing the code.
What scales is the loop. And the loop is only as good as its evaluation. The harder question is whether we can build evaluation infrastructure good enough to keep the loop honest, correct, and fast.
I do not think this is optional. The gap between organizations that have self-evolving systems and those that do not will compound until it is unclosable.