Loop Engineering with Claude: Self-Running AI Loops

What loop engineering is

Loop engineering means replacing yourself as the person who prompts the agent and building the system that does it instead. You no longer feed the agent line by line; you design the machine that feeds it. That puts you outside the loop, building it, rather than inside it. The weight of the sentence falls on replacing yourself.

Loop engineering is the practice of designing the system that triggers an AI agent instead of prompting it by hand. The loop runs on a schedule, spawns sub-agents, and feeds its own results back as the next round's input.

The term arrived almost at once from three practitioners in June 2026. Peter Steinberger, author of OpenClaw, posted that one should no longer prompt coding agents but design the loops that prompt them, and the post passed eight million views. Boris Cherny, lead on Claude Code at Anthropic, was saying the same thing, that he no longer prompts Claude but has loops running that prompt Claude. On 7 June 2026 Addy Osmani of the Google Chrome team wrote it up under the title Loop Engineering and gave the practice its name.

Why now? Three things had quietly crossed a threshold. Coding agents had become reliable enough to finish a non-trivial task unattended. Scheduling primitives had appeared in the major harnesses. And a single agent run had dropped far enough in cost that running it repeatedly on a timer stopped looking wasteful. The practice came before the name, the way teams paired a writer agent with a reviewer agent long before the generator-evaluator split had a name.

One floor above the harness

Loop engineering does not replace the earlier layers, it stacks on top of them. Each layer minds something larger: one sentence at the prompt, one window at the context, one run at the harness, and finally a loop that turns itself. Anyone who knows harness engineering will recognize loop engineering as the next floor above it.

Figure: the four-layer stack from prompt through context and harness up to loop engineering. Each layer minds a larger unit, with loop engineering at the very top.

Three verbs separate the loop from the harness. The loop runs on a timer: it wakes on schedule with no button press. It spawns helpers: one sub-agent writes, another does nothing but pick the work apart in review. And it feeds itself: what the loop produces becomes its own input next round, with memory living in a file rather than the context window. That memory across conversations is exactly what makes it a loop and not a one-off task run many times.

The catch is distance. At the loop layer the system runs while you sleep, changes code you never looked at, and feeds its own errors into the next round. The cost of a mistake scales with the number of turns it survives before someone catches it, and a loop is, by construction, a machine for maximizing that number.

The five moves of a loop

The word loop is easy to misread as idle spinning. Each turn does something concrete: it finds work, hands it off, checks the result, saves state, then decides the next step. Drop one of the five moves and the loop either will not turn or turns in place.

1. Discovery. The agent finds its own work rather than being handed a list. The logic belongs in a skill, not a cron job no one maintains. Discovery sets the ceiling on the quality of the whole loop.

2. Handoff. Each task gets its own isolated git worktree so multiple agents working in parallel do not collide in the same directory. The cleaner a task is cut, the easier verification and merging become.

3. Verification. A second agent checks, with different instructions and often a different model. This is the part that can say no. A loop without a real check is an agent nodding at itself.

4. Persistence. The result lands outside the conversation: a PR, an updated ticket, a state file. A loop's memory cannot live only in the context window. The agent forgets, the repo does not.

5. Scheduling. An automation triggers the round on its own and turns one run into a loop. The state file lets unfinished findings carry to the next round, which picks up on its own.

Plus: the no. Verification is the move easiest to cut corners on and the least affordable to skip. It is the point where the loop can stop instead of accumulating plausible mistakes at machine speed.
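As a rough illustration, one round of the five moves can be sketched in Python. Everything named here is hypothetical: the `Finding` type, the `run_round` function, and the JSON state file are assumptions made for the sketch, not part of Claude Code. The generator and the evaluator are passed in as plain callables, and scheduling stays outside the function, in whatever timer invokes it.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import Callable, Iterable, List


@dataclass
class Finding:
    name: str
    status: str = "open"  # open -> fixed | rejected


def run_round(
    discover: Callable[[List[Finding]], Iterable[Finding]],
    draft_fix: Callable[[Finding], str],
    verify: Callable[[Finding, str], bool],
    state_file: Path,
) -> List[Finding]:
    """One turn of the loop; a scheduler (cron, timer) calls this repeatedly."""
    # Persistence, read side: load what the previous round left behind.
    previous: List[Finding] = []
    if state_file.exists():
        previous = [Finding(**row) for row in json.loads(state_file.read_text())]

    # 1. Discovery: the loop finds its own work, seeded with the old state.
    findings = list(discover(previous))

    for finding in findings:
        if finding.status != "open":
            continue  # already settled in an earlier round
        # 2. Handoff: each finding is drafted on its own (a worktree in practice).
        patch = draft_fix(finding)
        # 3. Verification: a separate checker decides, not the drafter.
        finding.status = "fixed" if verify(finding, patch) else "rejected"

    # 4. Persistence, write side: the state outlives this conversation.
    state_file.write_text(json.dumps([asdict(f) for f in findings], indent=2))
    # 5. Scheduling happens outside: the next timer tick calls run_round again.
    return findings
```

Passing `verify` in as its own callable keeps the checker structurally separate from `draft_fix`, which is the property the rest of the article leans on.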

The six parts in Claude Code

If the moves describe what happens, the parts describe what must be in hand for it to turn. In Claude Code they have concrete names, and the same capabilities exist in Codex under other labels.

Part	What it does	In Claude Code
Automations	Trigger the loop on a schedule or event	/loop locally, Cloud Routines for running with the machine off
Worktrees	Give each parallel agent its own working directory	--worktree (or -w) per background agent
Skills	Make project knowledge durable instead of re-derived each round	SKILL.md, paying off recurring intent debt
Connectors	Hook the loop to issue tracker, database, Slack	MCP servers and plugins, setting the radius of vision
Sub-agents	Separate the one that writes from the one that judges	Definitions under .claude/agents/
Memory	Persistent state that survives any single conversation	Markdown file or board on disk

With all six in place a loop has a skeleton: the automation moves it, worktrees keep it from fighting itself, skills keep it from redoing work, connectors let it see outside, sub-agents let it correct itself, and memory lets it remember. To go deeper on individual parts, see skills and parallel agent swarms.

Generator and evaluator: the part that says no

The hardest part of a loop is not getting the agent to run but putting something inside that can say no, and the agent writing the code is the one least likely to say it. Ask an agent to grade what it just produced and it tends to praise it confidently, even when a human can plainly see the quality is mediocre.

This is not a smarts problem, it is grading one's own homework. The context in which the code was written is already stuffed with the reasons it was written that way. When the agent looks at its own output it does not see the result, it sees the chain of self-persuasion that led there. Inside a loop the flaw is amplified: if every round the agent decides whether something is good enough, it nods at itself each round and drifts further from real quality.

Tuning a standalone evaluator to be skeptical is far more tractable than making a generator critical of its own work.

Prithvi Rajasekaran, Anthropic, on building long-running applications

Anthropic engineer Prithvi Rajasekaran describes three steps that make the checker sharp. First, separate the generator from the evaluator structurally: a second agent with entirely different instructions looks at the code from scratch. The idea is borrowed from generative adversarial networks, where one side builds and the other picks faults. Second, let the evaluator act rather than read: Rajasekaran hooked it to Playwright MCP so it could open the page, click buttons, take screenshots, and inspect the DOM like a QA engineer. That shifts the basis from "this looks right" to "I clicked it, here is the screenshot". Third, swap the model too, because the same model with new instructions often keeps its blind spots.

Claude Code turns this into a primitive. /goal takes a condition and runs until it is met. After each turn a fresh, small model checks whether the condition holds rather than returning control. Completion is decided by a fresh model, not the one doing the work. This is the maker-checker principle from banking, where the person entering a large transfer and the person reviewing it must differ. A common calibration tells the evaluator to assume the code is broken until proven otherwise. The default stance is doubt, not trust.

Key point

A loop's floor is its evaluator. The generator's level decides what a loop can produce, the evaluator's level decides what it will not produce.
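The generator-evaluator split can be sketched as a small Python control loop. This is a minimal sketch under stated assumptions, not how /goal or any Claude Code feature is implemented: `Verdict`, `build_until_accepted`, and both callables are hypothetical names, and in practice each callable would wrap a separate agent with its own instructions and, ideally, its own model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    passed: bool = False  # default stance: broken until proven otherwise
    reasons: List[str] = field(default_factory=list)


def build_until_accepted(
    generate: Callable[[str, List[str]], str],
    evaluate: Callable[[str, str], Verdict],
    task: str,
    max_rounds: int = 3,
) -> Tuple[Optional[str], List[str]]:
    """Return (accepted_output, open_reasons); output is None if never accepted."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        candidate = generate(task, feedback)
        # The checker sees only the task and the result, never the
        # generator's reasoning, so it cannot inherit the self-persuasion.
        verdict = evaluate(task, candidate)
        if verdict.passed:
            return candidate, []
        feedback = verdict.reasons  # rejection reasons drive the next draft
    return None, feedback
```

The generator never gets to declare success: only a passing `Verdict` ends the loop early, and `max_rounds` bounds how long a stubborn rejection can run.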

Loop engineering with Claude in detail

The theory gets concrete the moment you see the commands. The examples below show one bad and one good implementation per part in Claude Code. The bad version almost always runs too; its failure just surfaces later, and that delay is exactly what makes the loop layer dangerous.

Discovery: a skill, not a wall of instructions

Discovery belongs in a skill you can reuse and maintain, not in a cron job with a prompt pasted into it. The prompt wall rots in a schedule no one touches, the skill stays named, versioned, and testable. It pays off the recurring intent debt, the cost of re-explaining to every run what the project is and where the traps are.

# Bad: a long prompt pasted straight into the cron job
0 6 * * *  claude -p "Look at CI, find failures, check
open issues, decide what matters, build fixes, and
remember the tests ..."                    # rots

The good version is a triage skill with clearly separated sections. The first three, Read, Judge, and Write, cover discovery and persistence. The fourth, Stop, is the boundary you write in yourself:

# .claude/skills/morning-triage/SKILL.md
---
name: morning-triage
description: Daily triage of CI failures, new issues, and recent commits; run by the scheduled automation
---

## Read (the DISCOVERY inputs)
- CI runs that failed since the last run
- issues opened in the last 24 hours
- commits merged since yesterday
- the previous ./state/triage.md

## Judge (sets the quality ceiling)
For each candidate decide:
- actionable now, or just noise?
- does it block a release? -> priority
- already tracked? -> skip

## Write (the PERSISTENCE output)
Append to ./state/triage.md, one row per finding:
| finding | source | priority | status |

## Stop (the boundary you keep)
Never merge, never delete. Anything uncertain goes
to ./inbox/ for a human, not into a PR.

The Stop section is not decoration. It is the one place you write down what the loop cannot know on its own. Leave it out and the loop merges with a confidence it has not earned.
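The Write section of the skill boils down to appending rows to a Markdown table and reading the open ones back next round. A minimal Python sketch of that state file follows. The helper names `append_finding` and `open_findings` are hypothetical, and the sketch assumes the file always begins with the two-line table header it writes itself.

```python
from pathlib import Path
from typing import List

HEADER = "| finding | source | priority | status |\n|---|---|---|---|\n"


def append_finding(state: Path, finding: str, source: str,
                   priority: str, status: str = "open") -> None:
    """Append one row; create the file with its header on first use."""
    state.parent.mkdir(parents=True, exist_ok=True)
    if not state.exists():
        state.write_text(HEADER)
    with state.open("a") as fh:
        fh.write(f"| {finding} | {source} | {priority} | {status} |\n")


def open_findings(state: Path) -> List[str]:
    """Return the names of findings whose status is still 'open'."""
    if not state.exists():
        return []
    names = []
    for line in state.read_text().splitlines()[2:]:  # skip header and divider
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) == 4 and cells[3] == "open":
            names.append(cells[0])
    return names
```

Because the state is plain Markdown in the repo, a human can read and correct it with the same tools used for code review.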

Verification: a separate evaluator, not self-grading

The generator must not be its own checker. The bad path lets the same agent that wrote the code ask at the end whether it is good, and the answer is almost always yes. The good path is a separate agent with its own instructions, a different model, and the default stance that the code is broken until proven otherwise.

# Bad: the same agent grades its own homework
claude "build the login and tell me if it is good"
# -> nods at itself, every round, at machine speed

Good is a dedicated, skeptical reviewer agent that acts rather than just reads:

# Evaluator agent (.claude/agents/reviewer.md)
ROLE: Adversarial code reviewer.
ASSUME: this code is BROKEN until proven otherwise.
        Do not praise. Find what fails.

CHECK, in order:
1. Does it run at all? (execute, do not read)
2. Run the tests, paste real output.
3. Edge cases the author skipped.
4. Does behavior match the ticket?

USE Playwright MCP: open the page, click, take a
screenshot, inspect the DOM. Judge behavior,
not intent.

VERDICT: PASS only if every check holds.
Otherwise REJECT and list each reason.

The stop condition goes to /goal. After each turn a fresh, small model checks whether it holds, not the agent that did the work:

# Stop condition, judged by a fresh model
/goal all tests in test/auth pass and the
      lint step is clean

Do not confuse /goal with /loop. /goal runs until a condition is met, and a fresh model decides completion. /loop merely reruns on an interval, with no checker. A loop that relies on /loop alone is the nodding loop from the previous section.

Handoff: one worktree per task, not a shared directory

Several agents in the same directory is the same problem as two developers committing to the same lines at once. The failure only shows up under parallelism: a single agent runs cleanly, and the first morning five run at once the merge becomes a mess. One isolated worktree per task turns "runs but messy" into "runs and clean".

# Bad: all agents in the same directory
claude "fix auth bug"      # terminal 1
claude "fix null deref"    # terminal 2  -> collision

# Good: one isolated worktree per finding
claude --worktree fix/auth-test  "draft the fix"
claude --worktree fix/null-deref "draft the fix"
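What the isolation amounts to underneath can be sketched with plain git from Python. The source does not say how Claude Code creates its worktrees, so this uses the standard `git worktree add -b <branch> <path>` command; the helper names and the sibling `<repo>-worktrees` directory layout are assumptions made for illustration.

```python
import subprocess
from pathlib import Path
from typing import List


def worktree_command(repo: Path, branch: str) -> List[str]:
    """Build the `git worktree add` call for one task branch."""
    # Hypothetical layout: worktrees live next to the repo, one dir per branch.
    target = repo.parent / f"{repo.name}-worktrees" / branch.replace("/", "-")
    # `-b` creates the branch and checks it out in its own directory,
    # so parallel tasks never write to the same files on disk.
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(target)]


def create_worktrees(repo: Path, branches: List[str]) -> None:
    """Create one isolated worktree per task; raises if git reports an error."""
    for branch in branches:
        subprocess.run(worktree_command(repo, branch), check=True)
```

Building the command separately from running it keeps the path logic testable without touching a real repository.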

Scheduling: trigger locally or in the cloud

Local means frequent and with access to local files, but the machine must stay on. Cloud means real autonomy, at the cost of a coarser interval and a fresh clone per run. A mature loop uses both, local for the tight inner checks, cloud for the overnight sweep.

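# Local: reruns while the machine is on
/loop 5m check the deploy   # fixed: every 5 minutes
/loop check the deploy      # the agent paces itself

# Cloud: runs even with the machine off
# .github/workflows/triage.yml
on:
  schedule:
    - cron: '0 6 * * *'      # 06:00 UTC daily
jobs:
  triage:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4   # the runner starts empty: fetch the repo and its skills
      # assumes the claude CLI is installed and authenticated on the runner
      - run: claude -p "Run the morning-triage skill"   # -p: non-interactive, as in the cron example above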

A complete first loop, annotated

Small enough to read in one sitting, yet complete. The six numbered comments are the five moves plus the human checkpoint, each in two or three lines:

# 1. SCHEDULING -- a real trigger
on:
  schedule:
    - cron: '0 6 * * *'        # 06:00 daily, cloud

# 2. DISCOVERY -- a skill, not a wall of text
run: claude -p "Run the morning-triage skill"   # -p: non-interactive run

# 3. PERSISTENCE -- state on disk
#    the skill writes ./state/triage.md
#    and commits it back to the repo

# 4. HANDOFF -- one worktree per finding
for finding in $(parse ./state/triage.md); do
  claude --worktree "fix/$finding" \
    --goal "tests pass and lint is clean" \
    "draft a fix for $finding"
done

# 5. VERIFICATION -- a fresh model judges
#    /goal checks after each turn, a reviewer.md
#    additionally picks holes

# 6. HUMAN REVIEW -- the open door
#    PRs are opened, never auto-merged,
#    anything uncertain lands in ./inbox/

A loop with all six is a real loop, even a tiny one. Missing one, it is one of the five failure patterns in disguise. The safe order of growth: first prove the evaluator catches real mistakes, then raise parallelism, not the other way around.

Built badly

Prompt wall in a cron job no one maintains

The same agent writes and grades itself

All agents share one working directory

/loop with no checker, automatic merge

No caps, one bug spins idle all night

Built well

Discovery in a named SKILL.md

Separate reviewer.md, different model, default doubt

One --worktree per task

/goal with a fresh judge model, PR instead of merge

Per-run and daily budget before the first run

Two loops in practice

The two public cases differ wildly in scale but share one skeleton: a trigger presses start, constraints keep the loop on the rails, a human checkpoint sits at the end. Running while you sleep was never about how strong the model is; it is about how solid that skeleton is.

Osmani's morning-triage loop

Osmani built himself a morning triage loop. In the morning an automation kicks off on its own. A triage skill reads yesterday's failing CI tests, the still-open issues, and recent commits, and writes its findings into a markdown file or a Linear board. For each finding worth acting on it opens an isolated worktree, one sub-agent drafts the fix, a second reviews it against the project's skills and tests. A connector opens the pull request and updates the ticket. Anything it cannot handle goes to an inbox for a human, and a state file survives so the next day picks up where this one left off.

Figure: a single developer reviews changes early in the morning in an otherwise empty open-plan office. The human has not left, they changed desks, from writing to reviewing.

Stripe's Minions: 1,300 pull requests a week

For enterprise scale, Stripe's Minions is the case to study: over 1,300 pull requests merged a week, not one line written by hand, described by Stripe engineer Steve Kaliski on the How I AI podcast. The trigger is light, an @ to the bot in Slack or an emoji reaction. What makes it reliable is the stretch before the model wakes up: a deterministic orchestrator first assembles context, scanning links, pulling Jira, finding docs, and using Sourcegraph plus MCP to locate relevant code. Anything deterministic logic can solve never goes to a probabilistic model.

Figure	What it refers to	Source
1,300+	pull requests merged per week at Stripe	How I AI / Lenny's Newsletter, 2026
0	lines of those PRs written by hand	Steve Kaliski, Stripe
Goose	open-source fork as the base, not a bigger model	How I AI, 2026
8M+	views on Steinberger's loop post	June 2026

The most counterintuitive point: Minions does not run on a stronger model, it is a fork of the open-source tool Goose. Reliability comes from the quality of the constraints, not the size of the model. The 1,300 PRs are still reviewed by humans. The human did not leave, they changed desks, from writing to reviewing.
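The principle that deterministic logic runs before the model can be illustrated with a small pre-pass in Python. This is not Stripe's orchestrator, whose code the source does not show. The function name, the regular expressions, and the Jira-style key format are all assumptions for the sketch.

```python
import re
from typing import Dict, List

URL_RE = re.compile(r"https?://[^\s>)]+")
TICKET_RE = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")  # Jira-style key, e.g. PAY-123


def assemble_context(message: str) -> Dict[str, List[str]]:
    """Deterministic pre-pass: pull links and ticket keys out of the trigger
    message so the model receives them as facts instead of guessing them."""
    return {
        "links": URL_RE.findall(message),
        "tickets": sorted(set(TICKET_RE.findall(message))),
    }
```

Anything a regular expression or an API lookup can settle this way costs no tokens and gives the same answer every time.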

Local or cloud: what running while you sleep relies on

The choice between local and cloud scheduling follows from one question: is the loop's work glued to the local machine, or can it leave? A loop that checks a local dev server every minute can only run locally. A loop that scans open issues at three in the morning should never be tied to a laptop whose lid gets closed.

Property	Cloud Routines	Local /loop
Runs on	cloud machine	your machine
Machine must stay on	no	yes
Minimum interval	1 hour	1 minute
Sees local files	no	yes
Per run	fresh clone	running session

The distortion to avoid is treating local rerun as the whole of running while you sleep. Local rerun means "run a few extra rounds while I am here". Cloud scheduling means "run even when I am not". A mature loop often uses both: local for the tight inner checks, cloud for the overnight sweep. innobu already covers Cloud Routines and dynamic workflows in their own right.
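The decision in the table above reduces to two constraints, which a few lines of Python can make explicit. The function name is hypothetical, and the interval limits are taken from the table rather than checked against any product documentation.

```python
CLOUD_MIN_INTERVAL_MIN = 60  # minimum cloud interval, per the table (1 hour)
LOCAL_MIN_INTERVAL_MIN = 1   # minimum local interval, per the table (1 minute)


def choose_scheduler(needs_local_files: bool, interval_minutes: int) -> str:
    """Pick 'local' or 'cloud' from the two constraints in the table."""
    if interval_minutes < LOCAL_MIN_INTERVAL_MIN:
        raise ValueError("interval is below the 1-minute local minimum")
    if needs_local_files or interval_minutes < CLOUD_MIN_INTERVAL_MIN:
        return "local"  # glued to the machine, or too frequent for the cloud
    return "cloud"      # can leave the laptop, so it survives a closed lid
```

For example, a dev-server check every minute resolves to "local", while an issue sweep every 24 hours with no local file access resolves to "cloud".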

Four silent costs

A loop that runs itself is, at the same time, a loop that makes mistakes by itself. The more cheerfully it runs, the more quietly it errs. Four costs accrue, none of which sounds an alarm while the loop is running, and they reinforce one another.

Figure: an empty desk at night with a monitor left on and a vacant office chair. A loop keeps running while the human is gone, and that is exactly where the silent costs accrue.

Verification debt. Every merged PR saves time now and leaves behind unverified output that has to be paid back later. The problem hides where tests do not cover, in the gap between "runs" and "right", accumulating until some shipping morning when it blows up at once. The guard is an independent evaluator.

Comprehension rot. The faster the loop ships code you did not write, the bigger the gap between what exists and what you understand. Reading code is more boring than writing it, and the loop has taken the writing. The guard is to read a sample regularly and force yourself to explain a few changes.

Cognitive surrender. When the loop runs itself it is tempting to stop having an opinion and just take whatever it hands back. The more reliable the loop, the easier it is to outsource judgment. The guard is one line: the loop can execute, but it cannot decide.

Token blowout. This is the only cost that hits the bill directly, and it is hard to estimate in advance. The loop spawns helpers, retries, and runs round after round, so one bug can spin idle all night. Anthropic's own experiments show the range: a full harness run for a DAW app cost about 124.70 US dollars over nearly four hours, and a simple retro game maker cost 9 US dollars in 20 minutes without the full harness versus 200 US dollars in six hours with it. The guard is hard caps set before the start: per-run budget, daily budget, max retries.
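The three caps can be sketched as a small circuit breaker in Python. The class and function names are hypothetical, and the sketch assumes a fixed, known cost per attempt, which real token billing does not give you; in practice the cost would be read from usage data after each call.

```python
from dataclasses import dataclass
from typing import Callable


class BudgetExceeded(RuntimeError):
    """Raised when a charge would push spending past a cap."""


@dataclass
class Caps:
    per_run_usd: float
    daily_usd: float
    max_retries: int
    spent_today_usd: float = 0.0

    def charge(self, run_spent_usd: float, cost_usd: float) -> float:
        """Record a cost and return the new run total; refuse to exceed a cap."""
        if run_spent_usd + cost_usd > self.per_run_usd:
            raise BudgetExceeded("per-run budget")
        if self.spent_today_usd + cost_usd > self.daily_usd:
            raise BudgetExceeded("daily budget")
        self.spent_today_usd += cost_usd
        return run_spent_usd + cost_usd


def run_with_caps(step: Callable[[], bool], caps: Caps,
                  cost_per_attempt_usd: float) -> bool:
    """Retry `step` until it succeeds, retries run out, or a cap trips."""
    run_spent = 0.0
    for _ in range(caps.max_retries + 1):  # first attempt plus retries
        run_spent = caps.charge(run_spent, cost_per_attempt_usd)
        if step():
            return True
    return False
```

The check happens before the money is spent, so a bug that would otherwise spin all night stops at the first attempt that does not fit the budget.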

European perspective

The maker-checker principle has been standard in banking for decades and transfers directly to a loop's checkpoint. Whoever enters a large transfer does not review it themselves. That separation is exactly what the evaluator builds into the loop, and for regulated industries it is not a nice-to-have.

The EU AI Act requires effective human oversight for high-risk systems. An open checkpoint in the loop, where PRs are opened but never auto-merged, is then not just good practice but alignment with the regulation.
GDPR and trade secrets: cloud loops often pull a fresh clone of the repo, and connectors reach into tickets and databases. Data flows and access rights should be settled before the first loop runs unattended.
Smaller European firms benefit most, because a well-built loop multiplies scarce developer capacity. That only holds if judgment stays in house and is not automated away along with the typing.

Challenges and risks

The five typical failure patterns each correspond to one skipped move. The most common is the nodding loop, where the same agent declares its own work good and accumulates plausible-looking mistakes at machine speed.

Failure pattern	Remedy
Nodding loop: verification skipped, the agent self-approves	Separate, skeptical evaluator with a /goal stop check
Amnesiac loop: persistence skipped, each morning starts over	State file on disk, the agent forgets, the repo does not
Manual loop: scheduling skipped, a script someone forgets	A real trigger: timer or event, no manual button
Blind loop: discovery skipped, the human still hands over the work	Move discovery into a skill that finds its own work
Tangled loop: handoff skipped, parallel agents collide	One isolated worktree per task
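The one-to-one mapping between skipped moves and failure patterns is simple enough to turn into a checklist. The following Python sketch is only an illustration of that mapping, and the function name `diagnose` is hypothetical.

```python
from typing import List, Set

# Skipped move -> the failure pattern it produces, as listed above.
FAILURE_BY_MISSING_MOVE = {
    "discovery": "blind loop",
    "handoff": "tangled loop",
    "verification": "nodding loop",
    "persistence": "amnesiac loop",
    "scheduling": "manual loop",
}


def diagnose(moves_present: Set[str]) -> List[str]:
    """List the failure patterns implied by whichever moves are missing."""
    return [
        failure
        for move, failure in FAILURE_BY_MISSING_MOVE.items()
        if move not in moves_present
    ]
```

Running it against a planned loop before the first unattended run is a cheap way to see which move was left out.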

Widely circulated numbers deserve caution. Claims such as "Claude Code writes 90 percent of itself" or large migration speedups are mostly secondhand summaries and serve only as rough reference. The cases in this article trace to firsthand sources, which holds up better than one impressive-sounding figure.

What companies should do now

Stripe's pipeline is the endpoint, not the start. A first loop should be so small it barely looks like a system, a little thing that checks something on a timer, but with the checkpoint and human control built in.

Start small with a /loop

Rerun one task on an interval. That is not yet a loop, but the entry point. Then add a triage skill that reads CI failures, new issues, and commits and lists what is worth acting on.

Harden the evaluator first

A strong generator with a weak checker produces confident garbage. Prove the evaluator catches real mistakes, then increase parallelism. Parallelism comes last, after the checks.

Set caps before the first unattended run

A per-run budget, a daily budget, and a maximum retry count. These numbers are circuit breakers that turn an open-ended risk into a bounded one, not a fix after the first surprising bill.

Keep one door open

At least one checkpoint where the loop waits for a human. PRs are opened, never auto-merged, and anything uncertain lands in an inbox. The human review point is not scaffolding to remove later, it is the feature that keeps the loop trustworthy.

Read a sample every day

Not everything the loop produces, but a representative sample, and force yourself to explain each change. An inability to explain a change is a precise signal that your mental map has fallen behind, far cheaper to find on a quiet morning than in a production incident.

Conclusion

The same loop, built by two people, ends in opposite places, and the difference is not in the loop. One person uses it to move faster on things already mastered: they read the code, hold a firm direction, and scale judgment they already had. Another uses the same loop so they never have to understand again, and six months later becomes the gatekeeper of a machine they cannot read.

The loop makes generation extremely cheap, code, plans, PRs, fixes, nearly free. What stays scarce is judgment: knowing which plan is right, which line should be stopped, which output runs fine but is wrong at the root. Loop engineering does not devalue judgment, it strips away everything that does not require judgment and leaves judgment as all that remains. Build the loop, but build it like someone who intends to stay the engineer, not just the person who presses go.

Frequently asked questions

What is loop engineering?

Loop engineering means replacing yourself as the person who prompts the agent and building the system that does it instead. Rather than prompting the agent line by line, you design a loop that wakes on a schedule, spawns sub-agents for parallel work, and feeds its own results back as the next round's input. The term was written up in June 2026 by Addy Osmani and sits as a fourth layer above prompt, context, and harness.

How does loop engineering differ from harness engineering?

Harness engineering arms a single run: tools, allowed actions, the done condition. It does not make the run repeat. Loop engineering sits one floor above and makes the run self-running through three verbs: it runs on a timer, it spawns helpers, and it feeds itself by writing its results to a file and reading them again next round.

What are the five moves of a loop?

Discovery finds this round's work on its own. Handoff hands each task to an isolated worktree. Verification lets a second agent check the result, the agent that can say no. Persistence saves state outside the conversation in a PR, ticket, and state file. Scheduling triggers the round on its own and turns one run into a loop. Drop any one of these five and a typical failure pattern appears.

Why does a loop need a separate evaluator?

An agent grading its own work tends to praise it confidently, even when a human can plainly see the quality is mediocre. Anthropic engineer Prithvi Rajasekaran found it far more tractable to tune a standalone evaluator to be skeptical than to make the generator self-critical. The evaluator should also act rather than just read, for example clicking and taking screenshots via Playwright MCP, and should assume the code is broken until proven otherwise.

Which Claude Code commands belong to a loop?

/loop reruns a task on an interval and runs locally. /goal takes a condition and runs until it is met, with a fresh small model checking after each turn whether the condition holds. Add worktrees via --worktree for parallel isolation, skills in SKILL.md for durable project knowledge, sub-agents under .claude/agents/, MCP connectors for external systems, and Cloud Routines for running with the machine off.

What does a self-running loop really cost?

Beyond the token bill, three silent costs accrue: verification debt as unverified output in the gap between "runs" and "right", comprehension rot as a widening gap between code that exists and code you understand, and cognitive surrender when you stop having an opinion. Anthropic's own experiments show the token range: a full harness run for a DAW app cost about 124.70 US dollars over nearly four hours. Hard caps set before the first unattended run are the guard.
