Massive disclaimer

I do not do this at work, and I am not suggesting you do. I will never push a line of code to a shared codebase that I have not at least read and thought about. For me, this exercise is about seeing what I can get out of these tools and how far I can push them.

How I get good quality code from Claude over long autonomous sessions

I have reached the point where I can let Claude build software unattended for hours at a time while I sleep, work my job, or spend time with my family. The huge frontier models we have access to now are a big part of that, of course, but mostly it's thanks to the fantastic skills people have written to get the most out of those models.

Most of my effort now goes into the preparation, spec and planning phase. I have also developed a command that wraps some of those skills to let Claude work autonomously through large features. I share it at the end.

The skills behind this come from Addy Osmani's Agent Skills, which encode Google's software development lifecycle as mandated workflows rather than advice: specify, plan, build, verify, review, ship. Left alone, agents skip most of those phases. The skills hold them to the steps and make each one produce evidence before the next begins.

I usually start away from Claude, because I find Claude does a great job of critically reviewing half-thought-through ideas when it meets them with a fresh, unpolluted context window. This probably sounds daft, but for some reason I also find it helpful to do the creative brainstorming bit in other tools before moving to Claude Code for the serious technical stuff. Most of that thinking and research happens with Codex or my Hermes agent, until I have a rough draft of what I think I want: all the features, but none of the implementation details. Only then do I open Claude Code and run the spec-driven-development skill. I do not ask for the spec straight away, though. First I have Claude read the rough draft, then play the goals and end state back to me in its own words. That tells me whether the idea is clearly expressed, and confirms that Claude gets it.

Next I have it tear the draft apart. Claude asks both functional and implementation questions, and we thrash out the details together. This is slow, and deliberately so. A spec routinely takes me a couple of days across many sittings. One recent feature came into Claude as a 9KB research note and ended as a 38KB specification, much of that growth being decisions I had not realised I still needed to make.

When the spec is right, the planning-and-task-breakdown skill turns it into a short implementation plan: ordered tasks, each with acceptance criteria and a way to verify it.
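To make the shape of such a plan concrete, here is a minimal sketch in Python. The `Task` fields, the `next_pending` helper, and the example tasks are all hypothetical illustrations; they are not the format the planning-and-task-breakdown skill actually produces.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    """One ordered step in an implementation plan (hypothetical structure)."""
    title: str
    acceptance_criteria: list[str]
    verify: str  # how to check the task, e.g. a test command (assumed)
    done: bool = False


def next_pending(plan: list[Task]) -> Optional[Task]:
    """Return the first task that is not done, preserving plan order."""
    for task in plan:
        if not task.done:
            return task
    return None


# Example plan with made-up tasks; the first one is already complete.
plan = [
    Task("Add config loader",
         ["Loads defaults when no file exists"],
         "pytest tests/test_config.py",
         done=True),
    Task("Add CLI flag",
         ["--config overrides the default path"],
         "pytest tests/test_cli.py"),
]
```

Keeping the verification step on the task itself means "done" is never a matter of opinion: a task is only complete once its own check passes.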

By that point, if I did a good job, there should be enough detail for Claude to implement it all with little to no further input, so I launch the build with a command I wrote called autonomous-build-and-review. It is about seventy lines, and all it really does is chain Addy's skills together, with a bit of extra sauce to keep Claude moving through a plan with multiple phases. For each task it runs incremental implementation and test-driven development. Then comes a simplification pass, then a review sub-agent that checks the work across correctness, readability, architecture, security and performance, then fixes, and finally a pull request.
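The chaining is simple enough to sketch. The Python below is an illustration only: the phase names follow the description above, while `run_pipeline`, the handler callables, and the stop-on-failure rule are hypothetical. The real command is a prompt, not a program.

```python
from typing import Callable

# Phase order as described in the post; names are shorthand, not skill names.
PHASES = ["build", "simplify", "review", "fix", "push"]


def run_pipeline(handlers: dict[str, Callable[[], bool]]) -> list[str]:
    """Run each phase in order and stop at the first one that reports failure.

    Returns the names of the phases that completed successfully.
    """
    completed: list[str] = []
    for name in PHASES:
        if not handlers[name]():
            break  # assumed behaviour: a failed phase halts the run
        completed.append(name)
    return completed
```

The point of the fixed order is that no phase can be skipped: review only ever sees code that has already been built, tested and simplified.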

If Claude hits a blocker it will stop and ask for help, but for anything small and easy to reverse it just makes a call then adds the decision to a notes file for me to review later. I work through those notes with Claude when I am back, and have it change anything I don't agree with.
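That stop-or-note rule can be written down as a small decision function. This is a sketch under stated assumptions: `handle_decision`, its boolean inputs, and the bullet-line format are hypothetical, and only the `AUTONOMOUS-BUILD-NOTES-{YY-MM-DD}.md` naming pattern comes from the command at the end of this post.

```python
from datetime import date
from pathlib import Path


def notes_path(root: Path, today: date) -> Path:
    """Build the notes file path using the YY-MM-DD pattern from the command."""
    return root / f"AUTONOMOUS-BUILD-NOTES-{today:%y-%m-%d}.md"


def handle_decision(summary: str, blocking: bool, costly_to_reverse: bool,
                    root: Path, today: date) -> str:
    """Stop only for blocking decisions that would be costly to change later.

    Everything else is appended to the notes file for later review
    (the one-bullet-per-decision format is an assumption).
    """
    if blocking and costly_to_reverse:
        return "stop"
    with notes_path(root, today).open("a", encoding="utf-8") as fh:
        fh.write(f"- {summary}\n")
    return "continue"
```

Passing the date in, rather than reading the clock inside the function, keeps the sketch easy to test.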

Sometimes, Claude will stop after a phase and recommend doing the rest in a new session with fresh context. I usually take its advice. Before ending the session I like asking "Is there anything you want to tell yourself in the new session?". Claude responds with "Done — left myself a note in project memory (auto-loads next session in this repo).".

Not vibe coding

This is not "vibe coding" - that awful term for having LLMs build fun throwaway projects that Andrej Karpathy coined back in February 2025. A year later he described what he calls "agentic engineering": orchestrating agents against detailed specifications, under human oversight.

"engineering" to emphasize that there is an art & science and expertise to it. It's something you can learn and become better at, with its own depth of a different kind.

I can leave Claude to eight hours of unattended work because by the time it begins, as many as possible of the important decisions have been settled up front, and there is enough context for it to know how I would want it to approach the rest.

Like Andrej, I am excited to see how these tools progress in the near future, especially the agentic harness layer where I can make meaningful contributions.

Here is my autonomous-build-and-review.md command. Just drop it in your ~/.claude/commands directory if you'd like to give it a try.

---
description: Implement the plan incrementally — build, test, verify, commit, simplify, review
---

This is to be completely autonomous. Do not ask me for confirmation or permission to proceed to the next task in the plan. You should only stop if there is a blocking decision or conflict in requirements that you need my help to resolve, and that would require significant work to change later. For any small non-blocking issues, or decisions that would be quick to reverse, maintain an AUTONOMOUS-BUILD-NOTES-{YY-MM-DD}.md file with points for me to consider at the end. Keep this notes file out of every commit (add it to `.gitignore` if needed) so it never lands in a PR.

As the one exception to "do not ask for confirmation" above: before any work begins, ask me to confirm auto-mode is enabled, and do not start until I confirm. (You cannot detect or change the permission mode yourself.)

## Build

Work from the implementation plan I specify: its path is in `$ARGUMENTS`. If no path was given, ask me for it before starting rather than guessing which file is the plan.

Always work on a fresh feature branch cut from an up-to-date `main` (or `master`); pull first if needed.

Invoke the agent-skills:incremental-implementation skill alongside agent-skills:test-driven-development.

Pick the next pending task from the plan. For each task:

1. Read the task's acceptance criteria
2. Load relevant context (existing code, patterns, types)
3. Write a failing test for the expected behavior (RED)
4. Implement the minimum code to pass the test (GREEN)
5. Run the full test suite to check for regressions
6. Run the build to verify compilation
7. Commit with a descriptive message
8. Mark the task complete and move to the next one

If any step fails, follow the agent-skills:debugging-and-error-recovery skill.

## Simplify

Invoke the agent-skills:code-simplification skill.

Simplify recently changed code (or the specified scope) while preserving exact behavior:

- Read CLAUDE.md and study project conventions
- Identify the target code — recent changes unless a broader scope is specified
- Understand the code's purpose, callers, edge cases, and test coverage before touching it

Scan for simplification opportunities:

- Deep nesting → guard clauses or extracted helpers
- Long functions → split by responsibility
- Nested ternaries → if/else or switch
- Generic names → descriptive names
- Duplicated logic → shared functions
- Dead code → remove after confirming

Apply each simplification incrementally — run tests after each change

Verify all tests pass, the build succeeds, and the diff is clean, then commit the simplifications so the Review step covers them

If tests fail after a simplification, revert that change and reconsider. Use code-review-and-quality to review the result.

## Review

Spawn a sub agent that invokes the agent-skills:code-review-and-quality skill.

The sub agent is to review the current changes (staged or recent commits) across all five axes:

Correctness — Does it match the spec? Edge cases handled? Tests adequate?

Readability — Clear names? Straightforward logic? Well-organized?

Architecture — Follows existing patterns? Clean boundaries? Right abstraction level?

Security — Input validated? Secrets safe? Auth checked? (Use security-and-hardening skill)

Performance — No N+1 queries? No unbounded ops? (Use performance-optimization skill)

Categorize findings as Critical, Important, or Suggestion. Output a structured review with specific file:line references and fix recommendations.

## Fix

Fix Critical and Important issues from the review. Consider Suggestions and implement those that make sense.

## Push

Commit, push, and raise a PR, then poll the checks with `gh pr checks --watch`.

If CI fails, diagnose with the agent-skills:debugging-and-error-recovery skill, push a fix, and re-check. Only merge on green. If the same check fails twice for the same root cause, stop and record it in the notes file rather than thrashing.

Once CI is green, merge and delete the feature branch.
