Loop Engineering with Claude: Building AI Loops That Run Themselves
Loop engineering is the fourth layer above prompt, context, and harness. Instead of prompting the agent line by line, you build the system that triggers it. Osmani's morning-triage loop and Stripe's 1,300 pull requests a week show what that looks like in practice, with Claude Code as the worked example.
Loop engineering replaces the human as the clock: instead of prompting the AI agent directly, you build the loop that triggers it on a schedule, spawns helpers, and feeds its own results back. A loop is five moves (discovery, handoff, verification, persistence, and scheduling), realized by six parts that have concrete names in Claude Code. The hardest move is verification: an agent praises its own work, so you need a separate, skeptical evaluator. Stripe merges over 1,300 machine-written pull requests a week this way, and the reliability comes from the constraints, not from a bigger model. For companies, value shifts from generation to judgment.
What loop engineering is
Loop engineering means replacing yourself as the person who prompts the agent and building the system that does it instead. You no longer feed the agent line by line; you design the machine that feeds it. That puts you outside the loop, building it, rather than inside it. The weight of the definition falls on replacing yourself.
The term arrived almost at once from three practitioners in June 2026. Peter Steinberger, author of OpenClaw, posted that one should no longer prompt coding agents but design the loops that prompt them, and the post passed eight million views. Boris Cherny, lead on Claude Code at Anthropic, was saying the same thing, that he no longer prompts Claude but has loops running that prompt Claude. On 7 June 2026 Addy Osmani of the Google Chrome team wrote it up under the title Loop Engineering and gave the practice its name.
Why now? Three things had quietly crossed a threshold. Coding agents had become reliable enough to finish a non-trivial task unattended. Scheduling primitives had appeared in the major harnesses. And a single agent run had dropped far enough in cost that running it repeatedly on a timer stopped looking wasteful. The practice came before the name, the way teams paired a writer agent with a reviewer agent long before the generator-evaluator split had a name.
One floor above the harness
Loop engineering does not replace the earlier layers, it stacks on top of them. Each layer minds something larger: one sentence at the prompt, one window at the context, one run at the harness, and finally a loop that turns itself. Anyone who knows harness engineering will recognize loop engineering as the next floor above it.
Three verbs separate the loop from the harness. A loop runs on a timer: it wakes on schedule with no button press. It spawns helpers: one sub-agent writes, another does nothing but pick the work apart in review. And it feeds itself: what the loop produces becomes its own input next round, with memory living in a file rather than the context window. That memory across conversations is exactly what makes it a loop and not a one-off task run many times.
The catch is distance. At the loop layer the system runs while you sleep, changes code you never looked at, and feeds its own errors into the next round. The cost of a mistake scales with the number of turns it survives before someone catches it, and a loop is, by construction, a machine for maximizing that number.
The five moves of a loop
The word loop is easy to misread as idle spinning. Each turn does something concrete: it finds work, hands it off, checks the result, saves state, then decides the next step. Drop one of the five moves and the loop either will not turn or turns in place.
Discovery
The agent finds its own work rather than being handed a list. The logic belongs in a skill, not a cron job no one maintains. Discovery sets the ceiling on the quality of the whole loop.
Handoff
Each task gets its own isolated git worktree so multiple agents working in parallel do not collide in the same directory. The cleaner a task is cut, the easier verification and merging become.
Verification
A second agent checks, with different instructions and often a different model. This is the part that can say no. A loop without a real check is an agent nodding at itself.
Persistence
The result lands outside the conversation: a PR, an updated ticket, a state file. A loop's memory cannot live only in the context window. The agent forgets, the repo does not.
Scheduling
An automation triggers the round on its own and turns one run into a loop. The state file lets unfinished findings carry to the next round, which picks up on its own.
The no
Verification is the move easiest to cut corners on and the least affordable to skip. It is the point where the loop can stop instead of accumulating plausible mistakes at machine speed.
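The five moves can be sketched as one function call per turn. This is a minimal illustration, not how any particular harness implements a loop: `run_turn`, `Finding`, `LoopState`, and the `max_attempts` cap are hypothetical names, and the three callables stand in for whatever discovery skill, generator, and evaluator you actually use. Scheduling is whatever calls `run_turn` again later.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Finding:
    name: str
    status: str = "open"  # open -> fixed, or inbox once attempts run out
    attempts: int = 0


@dataclass
class LoopState:
    findings: list[Finding] = field(default_factory=list)


def run_turn(
    state: LoopState,
    discover: Callable[[], list[str]],
    draft_fix: Callable[[str], str],
    verify: Callable[[str, str], bool],
    max_attempts: int = 2,
) -> LoopState:
    """Run one turn; a scheduler is expected to call this repeatedly."""
    # Discovery: add work that is not tracked yet.
    known = {f.name for f in state.findings}
    for name in discover():
        if name not in known:
            state.findings.append(Finding(name))
            known.add(name)

    for finding in state.findings:
        if finding.status != "open":
            continue
        # Handoff: each open finding is drafted on its own.
        patch = draft_fix(finding.name)
        finding.attempts += 1
        # Verification: a separate callable decides, and it can say no.
        if verify(finding.name, patch):
            finding.status = "fixed"
        elif finding.attempts >= max_attempts:
            finding.status = "inbox"  # hand over to a human
        # Otherwise the finding stays open and carries to the next turn.

    # Persistence: the caller writes the returned state to disk.
    return state
```

A rejected finding stays open for the next turn and moves to the human inbox once the attempt cap is reached, so a failing fix cannot spin forever.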
The six parts in Claude Code
If the moves describe what happens, the parts describe what must be in hand for it to turn. In Claude Code they have concrete names, and the same capabilities exist in Codex under other labels.
| Part | What it does | In Claude Code |
|---|---|---|
| Automations | Trigger the loop on a schedule or event | /loop locally, Cloud Routines for running with the machine off |
| Worktrees | Give each parallel agent its own working directory | --worktree (or -w) per background agent |
| Skills | Make project knowledge durable instead of re-derived each round | SKILL.md, paying off recurring intent debt |
| Connectors | Hook the loop to issue tracker, database, Slack | MCP servers and plugins, setting the radius of vision |
| Sub-agents | Separate the one that writes from the one that judges | Definitions under .claude/agents/ |
| Memory | Persistent state that survives any single conversation | Markdown file or board on disk |
With all six in place a loop has a skeleton: the automation moves it, worktrees keep it from fighting itself, skills keep it from redoing work, connectors let it see outside, sub-agents let it correct itself, and memory lets it remember. To go deeper on individual parts, see skills and parallel agent swarms.
Generator and evaluator: the part that says no
The hardest part of a loop is not getting the agent to run but putting something inside that can say no, and the agent writing the code is the one least likely to say it. Ask an agent to grade what it just produced and it tends to praise it confidently, even when a human can plainly see the quality is mediocre.
This is not a smarts problem, it is grading one's own homework. The context in which the code was written is already stuffed with the reasons it was written that way. When the agent looks at its own output it does not see the result, it sees the chain of self-persuasion that led there. Inside a loop the flaw is amplified: if every round the agent decides whether something is good enough, it nods at itself each round and drifts further from real quality.
Tuning a standalone evaluator to be skeptical is far more tractable than making a generator critical of its own work.
Prithvi Rajasekaran, Anthropic, on building long-running applications
Anthropic engineer Prithvi Rajasekaran describes three steps that make the checker sharp. First, separate the generator from the evaluator structurally: a second agent with entirely different instructions that looks at the code from scratch. The idea is borrowed from generative adversarial networks, where one builds and one picks faults. Second, let the evaluator act rather than read: Rajasekaran hooked it to Playwright MCP so it could open the page, click buttons, take screenshots, and inspect the DOM like a QA engineer. That shifts the basis from "this looks right" to "I clicked it, here is the screenshot." Third, swap the model too, because the same model with new instructions often keeps its blind spots.
Claude Code turns this into a primitive. /goal takes a condition and runs until it is met. After each turn a fresh, small model checks whether the condition holds rather than returning control. Completion is decided by a fresh model, not the one doing the work. This is the maker-checker principle from banking, where the person entering a large transfer and the person reviewing it must differ. A common calibration tells the evaluator to assume the code is broken until proven otherwise. The default stance is doubt, not trust.
A loop's floor is its evaluator. The generator's level decides what a loop can produce, the evaluator's level decides what it will not produce.
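The maker-checker split can be sketched as two separate callables, where the evaluator starts from doubt and its reasons are the only thing fed back to the generator. This is an illustration under assumptions: `Verdict`, `evaluate`, `make_and_check`, and the `max_rounds` cap are hypothetical names, and in a real loop each check would run tests or drive a browser rather than inspect a string.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# A check returns None when it holds, or a reason string when it fails.
Check = Callable[[str], Optional[str]]


@dataclass
class Verdict:
    passed: bool
    reasons: list[str]


def evaluate(candidate: str, checks: list[Check]) -> Verdict:
    """Default stance is doubt: with no checks there is no evidence, so reject."""
    if not checks:
        return Verdict(False, ["no checks ran"])
    reasons = []
    for check in checks:
        reason = check(candidate)
        if reason is not None:
            reasons.append(reason)
    # PASS only if every check holds.
    return Verdict(passed=not reasons, reasons=reasons)


def make_and_check(
    generate: Callable[[list[str]], str],
    checks: list[Check],
    max_rounds: int = 3,
) -> tuple[Optional[str], Verdict]:
    """The generator never grades itself; it only receives the evaluator's reasons."""
    feedback: list[str] = []
    verdict = Verdict(False, ["not attempted"])
    for _ in range(max_rounds):
        candidate = generate(feedback)
        verdict = evaluate(candidate, checks)
        if verdict.passed:
            return candidate, verdict
        feedback = verdict.reasons
    return None, verdict  # still rejected after the cap: escalate to a human
```

The design choice worth noting is that an empty check list rejects rather than passes, which mirrors the "broken until proven otherwise" calibration.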
Loop engineering with Claude in detail
The theory gets concrete the moment you see the commands. The examples below show one bad and one good implementation per part in Claude Code. The bad version almost always runs too; its failure just surfaces later, and that delay is exactly what makes the loop layer dangerous.
Discovery: a skill, not a wall of instructions
Discovery belongs in a skill you can reuse and maintain, not in a cron job with a prompt pasted into it. The prompt wall rots in a schedule no one touches, the skill stays named, versioned, and testable. It pays off the recurring intent debt, the cost of re-explaining to every run what the project is and where the traps are.
# Bad: a long prompt pasted straight into the cron job
0 6 * * * claude -p "Look at CI, find failures, check
open issues, decide what matters, build fixes, and
remember the tests ..." # rots
The good version is a triage skill with clearly separated sections. Read, Judge, and Write cover discovery and persistence; the last section, Stop, is the boundary you write in yourself:
# .claude/skills/morning-triage/SKILL.md
---
name: morning-triage
trigger: invoked daily by the automation
---
## Read (the DISCOVERY inputs)
- CI runs that failed since the last run
- issues opened in the last 24 hours
- commits merged since yesterday
- the previous ./state/triage.md
## Judge (sets the quality ceiling)
For each candidate decide:
- actionable now, or just noise?
- does it block a release? -> priority
- already tracked? -> skip
## Write (the PERSISTENCE output)
Append to ./state/triage.md, one row per finding:
| finding | source | priority | status |
## Stop (the boundary you keep)
Never merge, never delete. Anything uncertain goes
to ./inbox/ for a human, not into a PR.
The Stop section is not decoration. It is the one place you write down what the loop cannot know on its own. Leave it out and the loop merges with a confidence it has not earned.
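The Write section's state file can be read and appended with a few lines of code. The sketch below assumes the four-column row format from the skill; `append_finding` and `open_findings` are hypothetical helper names, and the parser is deliberately naive (a cell containing a `|` character would break it).

```python
from pathlib import Path

COLUMNS = ("finding", "source", "priority", "status")
HEADER = "| finding | source | priority | status |\n|---|---|---|---|\n"


def append_finding(path: Path, finding: str, source: str,
                   priority: str, status: str = "open") -> None:
    """Append one row, creating the file with its table header if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    if not path.exists():
        path.write_text(HEADER, encoding="utf-8")
    with path.open("a", encoding="utf-8") as f:
        f.write(f"| {finding} | {source} | {priority} | {status} |\n")


def open_findings(path: Path) -> list[dict[str, str]]:
    """Return the rows still marked open, so the next round can pick them up."""
    if not path.exists():
        return []
    rows = []
    lines = path.read_text(encoding="utf-8").splitlines()
    for line in lines[2:]:  # skip the header row and the separator row
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != len(COLUMNS):
            continue  # ignore anything that is not a well-formed row
        row = dict(zip(COLUMNS, cells))
        if row["status"] == "open":
            rows.append(row)
    return rows
```

Because the state lives in a file in the repo, the next run can start from `open_findings(Path("state/triage.md"))` without any memory of the previous conversation.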
Verification: a separate evaluator, not self-grading
The generator must not be its own checker. The bad path lets the same agent that wrote the code ask at the end whether it is good, and the answer is almost always yes. The good path is a separate agent with its own instructions, a different model, and the default stance that the code is broken until proven otherwise.
# Bad: the same agent grades its own homework
claude "build the login and tell me if it is good"
# -> nods at itself, every round, at machine speed
Good is a dedicated, skeptical reviewer agent that acts rather than just reads:
# Evaluator agent (.claude/agents/reviewer.md)
ROLE: Adversarial code reviewer.
ASSUME: this code is BROKEN until proven otherwise.
Do not praise. Find what fails.
CHECK, in order:
1. Does it run at all? (execute, do not read)
2. Run the tests, paste real output.
3. Edge cases the author skipped.
4. Does behavior match the ticket?
USE Playwright MCP: open the page, click, take a
screenshot, inspect the DOM. Judge behavior,
not intent.
VERDICT: PASS only if every check holds.
Otherwise REJECT and list each reason.
The stop condition goes to /goal. After each turn a fresh, small model checks whether it holds, not the agent that did the work:
# Stop condition, judged by a fresh model
/goal all tests in test/auth pass and the
lint step is clean
Do not confuse /goal with /loop. /goal runs until a condition is met, and a fresh model decides completion. /loop merely reruns on an interval, with no checker. A loop that relies on /loop alone is the nodding loop from the previous section.
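The difference between the two control flows can be sketched in a few lines. This models only the semantics described above, not how Claude Code implements either command; `rerun_on_interval`, `run_until_goal`, and `max_turns` are hypothetical names.

```python
from typing import Callable


def rerun_on_interval(task: Callable[[], None], rounds: int,
                      wait: Callable[[], None] = lambda: None) -> int:
    """/loop-style: rerun the task each round; nothing judges the result."""
    for _ in range(rounds):
        task()
        wait()  # a real scheduler would sleep for the interval here
    return rounds


def run_until_goal(task: Callable[[], None], goal_met: Callable[[], bool],
                   max_turns: int) -> tuple[bool, int]:
    """/goal-style: after each turn an independent check decides completion."""
    for turn in range(1, max_turns + 1):
        task()
        if goal_met():  # the checker is not the task itself
            return True, turn
    return False, max_turns  # cap reached without meeting the goal
```

The first function has no way to stop early or to report failure, which is why an interval alone does not give a loop its checker.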
Handoff: one worktree per task, not a shared directory
Several agents in the same directory is the same problem as two developers committing to the same lines at once. The failure only shows up under parallelism: a single agent runs cleanly, and the first morning five run at once the merge becomes a mess. One isolated worktree per task turns "runs but messy" into "runs and clean".
# Bad: all agents in the same directory
claude "fix auth bug" # terminal 1
claude "fix null deref" # terminal 2 -> collision
# Good: one isolated worktree per finding
claude --worktree fix/auth-test "draft the fix"
claude --worktree fix/null-deref "draft the fix"
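For readers who want to see what per-task isolation means at the git level, here is a sketch using plain `git worktree add`. It is not a claim about what the `--worktree` flag runs internally; `worktree_command`, `create_worktree`, and the sibling `<repo>-worktrees` directory layout are assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def worktree_command(repo: Path, branch: str, base: str = "HEAD") -> list[str]:
    """Build the git command that creates a new branch in its own directory."""
    # Keep worktrees beside the repo; slashes in branch names become dashes
    # so each task maps to exactly one flat directory.
    target = repo.parent / f"{repo.name}-worktrees" / branch.replace("/", "-")
    return ["git", "-C", str(repo), "worktree", "add",
            "-b", branch, str(target), base]


def create_worktree(repo: Path, branch: str, base: str = "HEAD") -> Path:
    """Create the worktree and return its path; raises if git fails."""
    cmd = worktree_command(repo, branch, base)
    subprocess.run(cmd, check=True)
    return Path(cmd[-2])
```

Calling `create_worktree(Path("."), "fix/auth-test")` inside a real repository would give one agent its own checkout on its own branch, leaving the main working directory untouched.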
Scheduling: trigger locally or in the cloud
Local means frequent and with access to local files, but the machine must stay on. Cloud means real autonomy, at the cost of a coarser interval and a fresh clone per run. A mature loop uses both, local for the tight inner checks, cloud for the overnight sweep.
# Local: reruns while the machine is on
/loop 5m check the deploy # fixed: every 5 minutes
/loop check the deploy # the agent paces itself
# Cloud: runs even with the machine off
# .github/workflows/triage.yml
on:
schedule:
- cron: '0 6 * * *' # 06:00 daily
jobs:
triage:
runs-on: ubuntu-latest
steps:
- run: claude --skill morning-triage
A complete first loop, annotated
Small enough to read in one sitting, yet complete. The six numbered comments are the five moves plus the human checkpoint, each in two or three lines:
# 1. SCHEDULING -- a real trigger
on:
schedule:
- cron: '0 6 * * *' # 06:00 daily, cloud
# 2. DISCOVERY -- a skill, not a wall of text
run: claude --skill morning-triage
# 3. PERSISTENCE -- state on disk
# the skill writes ./state/triage.md
# and commits it back to the repo
# 4. HANDOFF -- one worktree per finding
for finding in $(parse ./state/triage.md); do
claude --worktree "fix/$finding" \
--goal "tests pass and lint is clean" \
"draft a fix for $finding"
done
# 5. VERIFICATION -- a fresh model judges
# /goal checks after each turn, a reviewer.md
# additionally picks holes
# 6. HUMAN REVIEW -- the open door
# PRs are opened, never auto-merged,
# anything uncertain lands in ./inbox/
A loop with all six is a real loop, even a tiny one. Missing one, it is one of the five failure patterns in disguise. The safe order of growth: first prove the evaluator catches real mistakes, then raise parallelism, not the other way around.
Two loops in practice
The two public cases below differ wildly in scale but share one skeleton: a trigger presses start, constraints keep the loop on the rails, a human checkpoint sits at the end. Running while you sleep was never about how strong the model is; it is about how solid that skeleton is.
Osmani's morning-triage loop
Osmani built himself a morning triage loop. In the morning an automation kicks off on its own. A triage skill reads yesterday's failing CI tests, the still-open issues, and recent commits, and writes its findings into a markdown file or a Linear board. For each finding worth acting on it opens an isolated worktree, one sub-agent drafts the fix, a second reviews it against the project's skills and tests. A connector opens the pull request and updates the ticket. Anything it cannot handle goes to an inbox for a human, and a state file survives so the next day picks up where this one left off.
Stripe's Minions: 1,300 pull requests a week
For enterprise scale, Stripe's Minions is the case to study: over 1,300 pull requests merged a week, not one line written by hand, described by Stripe engineer Steve Kaliski on the How I AI podcast. The trigger is light, an @ to the bot in Slack or an emoji reaction. What makes it reliable is the stretch before the model wakes up: a deterministic orchestrator first assembles context, scanning links, pulling Jira, finding docs, and using Sourcegraph plus MCP to locate relevant code. Anything deterministic logic can solve never goes to a probabilistic model.
The most counterintuitive point: Minions does not run on a stronger model, it is a fork of the open-source tool Goose. Reliability comes from the quality of the constraints, not the size of the model. The 1,300 PRs are still reviewed by humans. The human did not leave, they changed desks, from writing to reviewing.
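The deterministic-first principle can be illustrated with a small router. This is not Stripe's code and makes no claim about how Minions is built; `route`, `ticket_lookup`, and the ticket-ID pattern are hypothetical, and the `model` callable stands in for any probabilistic step.

```python
import re
from typing import Callable, Optional

# A deterministic handler returns an answer, or None if it cannot help.
Handler = Callable[[str], Optional[str]]


def ticket_lookup(request: str) -> Optional[str]:
    """Example handler: pull a ticket ID such as PAY-123 out of the request."""
    match = re.search(r"\b[A-Z]+-\d+\b", request)
    return f"ticket {match.group(0)}" if match else None


def route(request: str, deterministic: list[Handler],
          model: Callable[[str], str]) -> tuple[str, str]:
    """Try every deterministic handler first; call the model only as a fallback."""
    for handler in deterministic:
        answer = handler(request)
        if answer is not None:
            return "deterministic", answer
    return "model", model(request)
```

Returning which path produced the answer makes it easy to measure how much work never reaches the model at all.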
Local or cloud: what running while you sleep relies on
The choice between local and cloud scheduling follows from one question: is the loop's work glued to the local machine, or can it leave? A loop that checks a local dev server every minute can only run locally. A loop that scans open issues at three in the morning should never be tied to a laptop whose lid gets closed.
| Property | Cloud Routines | Local /loop |
|---|---|---|
| Runs on | cloud machine | your machine |
| Machine must stay on | no | yes |
| Minimum interval | 1 hour | 1 minute |
| Sees local files | no | yes |
| Per run | fresh clone | running session |
The distortion to avoid is treating local rerun as the whole of running while you sleep. Local rerun means "run a few extra rounds while I am here." Cloud scheduling means "run even when I am not." A mature loop often uses both: local for the tight inner checks, cloud for the overnight sweep. innobu already covers Cloud Routines and dynamic workflows in their own right.
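The table above reduces to a small decision function. The sketch below takes the two minimum intervals from the table and treats everything else as an assumption: `LoopNeeds`, `choose_scheduler`, the `"redesign"` outcome, and the preference for cloud when both options fit are illustrative choices, not product behavior.

```python
from dataclasses import dataclass

CLOUD_MIN_INTERVAL_MIN = 60  # "1 hour" in the table
LOCAL_MIN_INTERVAL_MIN = 1   # "1 minute" in the table


@dataclass(frozen=True)
class LoopNeeds:
    needs_local_files: bool
    must_run_with_machine_off: bool
    interval_minutes: int


def choose_scheduler(needs: LoopNeeds) -> str:
    """Return "cloud", "local", or "redesign" when neither option fits."""
    if needs.interval_minutes < LOCAL_MIN_INTERVAL_MIN:
        raise ValueError("interval is below the smallest supported interval")
    can_local = not needs.must_run_with_machine_off
    can_cloud = (not needs.needs_local_files
                 and needs.interval_minutes >= CLOUD_MIN_INTERVAL_MIN)
    if can_cloud:
        return "cloud"  # assumed preference: no machine has to stay on
    if can_local:
        return "local"
    return "redesign"  # e.g. needs local files and must run unattended
```

A loop that lands on "redesign" usually has to be split: the part glued to the machine stays local, and the rest moves to the cloud schedule.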
Four silent costs
A loop that runs itself is, at the same time, a loop that makes mistakes by itself. The more cheerfully it runs, the more quietly it errs. Four costs accrue, none of which sounds an alarm while the loop is running, and they reinforce one another.
Verification debt. Every merged PR saves time now but adds unverified output, a debt waiting to be paid back. The problem hides where tests do not cover, in the gap between "runs" and "right", accumulating until some shipping morning when it blows up at once. The guard is an independent evaluator.
Comprehension rot. The faster the loop ships code you did not write, the bigger the gap between what exists and what you understand. Reading code is more boring than writing it, and the loop has taken the writing. The guard is to read a sample regularly and force yourself to explain a few changes.
Cognitive surrender. When the loop runs itself it is tempting to stop having an opinion and just take whatever it hands back. The more reliable the loop, the easier it is to outsource judgment. The guard is one line: the loop can execute, but it cannot decide.
Token blowout. This is the only cost that hits the bill directly, and it is hard to estimate in advance. The loop spawns helpers, retries, and runs round after round, so one bug can spin idle all night. Anthropic's own experiments show the range: a full harness run for a DAW app cost about 124.70 US dollars over nearly four hours, and a simple retro game maker cost 9 US dollars in 20 minutes without the full harness versus 200 US dollars in six hours with it. The guard is hard caps set before the start: per-run budget, daily budget, max retries.
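The three caps can be enforced by a small meter that the loop consults on every charge and every retry. This is a sketch with hypothetical names (`Caps`, `Meter`, `BudgetExceeded`); how the cost of a model call is measured is left to the caller, and the dollar values in any usage are placeholders.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a cap is hit; the loop should stop, not retry."""


@dataclass
class Caps:
    per_run_usd: float
    daily_usd: float
    max_retries: int


@dataclass
class Meter:
    caps: Caps
    run_usd: float = 0.0
    day_usd: float = 0.0
    retries: int = 0

    def start_run(self) -> None:
        """Reset the per-run counters; the daily total keeps accumulating."""
        self.run_usd = 0.0
        self.retries = 0

    def charge(self, usd: float) -> None:
        """Record a cost and trip the breaker if either budget is exceeded."""
        self.run_usd += usd
        self.day_usd += usd
        if self.run_usd > self.caps.per_run_usd:
            raise BudgetExceeded("per-run budget exceeded")
        if self.day_usd > self.caps.daily_usd:
            raise BudgetExceeded("daily budget exceeded")

    def retry(self) -> None:
        """Count a retry and trip the breaker past the maximum."""
        self.retries += 1
        if self.retries > self.caps.max_retries:
            raise BudgetExceeded("max retries exceeded")
```

Raising an exception rather than returning a flag is deliberate: a cap that the loop can ignore by accident is not a circuit breaker.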
European perspective
The maker-checker principle has been standard in banking for decades and transfers directly to a loop's checkpoint. Whoever enters a large transfer does not review it themselves. That separation is exactly what the evaluator builds into the loop, and for regulated industries it is not a nice-to-have.
- The EU AI Act requires effective human oversight for high-risk systems. An open checkpoint in the loop, where PRs are opened but never auto-merged, is then not just good practice but alignment with the regulation.
- GDPR and trade secrets: cloud loops often pull a fresh clone of the repo, and connectors reach into tickets and databases. Data flows and access rights belong settled before the first loop runs unattended.
- Smaller European firms benefit most, because a well-built loop multiplies scarce developer capacity. That only holds if judgment stays in house and is not automated away along with the typing.
Challenges and risks
The five typical failure patterns each correspond to one skipped move. The most common is the nodding loop, where the same agent declares its own work good and accumulates plausible-looking mistakes at machine speed.
Widely circulated numbers deserve caution. Claims such as "90 percent of Claude Code writes itself" or reports of large migration speedups are mostly secondhand summaries and serve only as rough reference. The cases in this article trace to firsthand sources, which holds up better than one impressive-sounding figure.
What companies should do now
Stripe's pipeline is the endpoint, not the start. A first loop should be so small it barely looks like a system, a little thing that checks something on a timer, but with the checkpoint and human control built in.
- Start small with a /loop. Rerun one task on an interval. That is not yet a loop, but it is the entry point. Then add a triage skill that reads CI failures, new issues, and commits and lists what is worth acting on.
- Harden the evaluator first. A strong generator with a weak checker produces confident garbage. Prove the evaluator catches real mistakes, then increase parallelism. Parallelism comes last, after the checks.
- Set caps before the first unattended run. A per-run budget, a daily budget, and a maximum retry count. These numbers are circuit breakers that turn an open-ended risk into a bounded one, not a fix after the first surprising bill.
- Keep one door open. Keep at least one checkpoint where the loop waits for a human. PRs are opened, never auto-merged, and anything uncertain lands in an inbox. The human review point is not scaffolding to remove later; it is the feature that keeps the loop trustworthy.
- Read a sample every day. Not everything the loop produces, but a representative sample, and force yourself to explain each change. An inability to explain a change is a precise signal that your mental map has fallen behind, and it is far cheaper to find on a quiet morning than in a production incident.
Conclusion
The same loop, built by two people, ends in opposite places, and the difference is not in the loop. One person uses it to move faster on things already mastered: they read the code, hold a firm direction, and scale judgment they already had. Another uses the same loop so they never have to understand again, and six months later becomes the gatekeeper of a machine they cannot read.
The loop makes generation extremely cheap: code, plans, PRs, and fixes become nearly free. What stays scarce is judgment: knowing which plan is right, which line should be stopped, which output runs fine but is wrong at the root. Loop engineering does not devalue judgment; it strips away everything that does not require judgment and leaves judgment as all that remains. Build the loop, but build it like someone who intends to stay the engineer, not just the person who presses go.
Frequently asked questions
What is loop engineering?
Loop engineering means replacing yourself as the person who prompts the agent and building the system that does it instead. Rather than prompting the agent line by line, you design a loop that wakes on a schedule, spawns sub-agents for parallel work, and feeds its own results back as the next round's input. The term was written up in June 2026 by Addy Osmani and sits as a fourth layer above prompt, context, and harness.
How does loop engineering differ from harness engineering?
Harness engineering arms a single run: tools, allowed actions, the done condition. It does not make the run repeat. Loop engineering sits one floor above and makes the run self-running through three verbs: it runs on a timer, it spawns helpers, and it feeds itself by writing its results to a file and reading them again next round.
What are the five moves of a loop?
Discovery finds this round's work on its own. Handoff hands each task to an isolated worktree. Verification lets a second agent check the result, the agent that can say no. Persistence saves state outside the conversation in a PR, ticket, or state file. Scheduling triggers the round on its own and turns one run into a loop. Drop any one of these five and a typical failure pattern appears.
Why does a loop need a separate evaluator?
An agent grading its own work tends to praise it confidently, even when a human can plainly see the quality is mediocre. Anthropic engineer Prithvi Rajasekaran found it far more tractable to tune a standalone evaluator to be skeptical than to make the generator self-critical. The evaluator should also act rather than just read, for example clicking and taking screenshots via Playwright MCP, and should assume the code is broken until proven otherwise.
Which Claude Code features does a loop use?
/loop reruns a task on an interval and runs locally. /goal takes a condition and runs until it is met, with a fresh small model checking after each turn whether the condition holds. Add worktrees via --worktree for parallel isolation, skills in SKILL.md for durable project knowledge, sub-agents under .claude/agents/, MCP connectors for external systems, and Cloud Routines for running with the machine off.
What does a loop cost?
Beyond the token bill, three silent costs accrue: verification debt as unverified output in the gap between "runs" and "right", comprehension rot as a widening gap between code that exists and code you understand, and cognitive surrender when you stop having an opinion. Anthropic's own experiments show the token range: a full harness run for a DAW app cost about 124.70 US dollars over nearly four hours. Hard caps set before the first unattended run are the guard.