Spec-Driven Development: Keeping Humans in the Loop While Scaling AI-Assisted Code Review

A field report from building Real Polite Protocol, an open protocol and reference implementation on Deno Deploy.


The problem with “vibe coding” at scale

AI coding agents have gotten genuinely good. They will happily implement a feature end-to-end, write the tests, run them, and report green. The trouble is that “the tests pass” is a much weaker signal than it used to be — because the agent that wrote the code also wrote the tests, and both were optimizing to satisfy a prompt you typed two minutes ago.

That is fine for a throwaway script. It is a disaster for a protocol implementation, an API surface, or anything where what the system is supposed to do matters more than whether today’s code happens to do something.

The reviewer’s job is being squeezed from two sides at once. Agents can produce more code per unit time than any reviewer can meaningfully read. And the prompts driving them are ephemeral, so by the time you notice a behavior drifted, the conversation that justified it is gone.

The technique I want to describe is the one I’ve been using on Real Polite Protocol (RPP). It is not novel — formal methods folks have been doing this for decades — but it adapts cleanly to a workflow where most of the typing is done by an AI agent and most of the deciding still needs to be done by a human.

The authority chain

The core idea is a strict, written authority order that every change in the repo must respect:

  1. RFC / spec: rpp-spec.md
  2. Requirement documents: .github/requirements/**
  3. Requirement tests: src/requirements/**
  4. Scenarios: .github/scenarios/** + src/scenarios/**
  5. Implementation code: src/**

A lower-authority artifact MUST NOT contradict a higher-authority one. When a test fails, the implementation gets fixed — the test is not weakened. When the desired behavior conflicts with a requirement, the requirement (and possibly the spec) is changed deliberately and explicitly, then the tests and code follow.
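
To make the ordering concrete, here is a minimal sketch that maps a repo-relative path to its tier in the chain. The function names (authorityTier, highestTierTouched) are hypothetical and not taken from the RPP repo; only the path prefixes come from the list above.

```typescript
// Authority tiers from the chain above; a lower number means higher authority.
type Tier = 1 | 2 | 3 | 4 | 5;

// Ordered most-specific first so src/requirements/ wins over the src/ catch-all.
const TIER_RULES: ReadonlyArray<{ prefix: string; tier: Tier }> = [
  { prefix: ".github/requirements/", tier: 2 },
  { prefix: "src/requirements/", tier: 3 },
  { prefix: ".github/scenarios/", tier: 4 },
  { prefix: "src/scenarios/", tier: 4 },
  { prefix: "src/", tier: 5 },
];

// Returns the tier for a repo-relative path, or null for files outside the chain.
function authorityTier(path: string): Tier | null {
  if (path === "rpp-spec.md") return 1;
  const rule = TIER_RULES.find((r) => path.startsWith(r.prefix));
  return rule ? rule.tier : null;
}

// Highest-authority tier touched by a set of changed paths (hypothetical helper).
function highestTierTouched(paths: string[]): Tier | null {
  const tiers = paths
    .map((p) => authorityTier(p))
    .filter((t): t is Tier => t !== null);
  return tiers.length === 0 ? null : (Math.min(...tiers) as Tier);
}
```

A change whose highest tier is 5 touches only implementation code; one whose highest tier is 1 or 2 changes intent and deserves the closest read.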

Tying specs directly to tests does something subtler than enforce discipline — it gives the agent durable context across features. The requirement files are not scratch notes thrown away at the end of a conversation; they accumulate as a structured record of every behavior the system has ever committed to. When a user asks for a change that is contradictory, partially overlapping, or just complex enough to interact with three other features, the agent loads the relevant requirement files and reasons against the full history of decided behavior, not against whatever the current code happens to look like. The default failure mode of an unscoped agent is to silently regress old requirements while satisfying the new prompt — because the old requirements existed only in a prior conversation it cannot see. With the requirements pinned in the repo and pinned to tests, the agent has to confront the conflict explicitly: either the new request is consistent with the existing requirements, or it requires changing one of them, and that change is itself a reviewable artifact.

Requirement documents are the unit of intent

In RPP, every behavior — “listeners can accept a pending invitation”, “rotating a verification key archives the previous active key”, “the server refuses messages to a blocked contact” — is a small Markdown file under requirements, grouped by feature area. Each one carries frontmatter that ties it back to the spec:

---
id: invitations-003
title: Listeners can accept a pending invitation
spec_ref: "10.2, 10.4, 11.2, 12.2"
---

The body describes the expected behavior in plain language: what the tool accepts, what it returns, what error codes it emits, what state transitions it performs. It is detailed enough that two different implementers — human or agent — would produce functionally equivalent code from it.
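
Because the frontmatter is flat, tooling can read it with very little machinery. The sketch below is an assumption about how such a reader might look, not the repo's own code: parseRequirementMeta and RequirementMeta are hypothetical names, and the parser is a tiny stand-in for a real YAML library that handles only `key: value` lines.

```typescript
interface RequirementMeta {
  id: string;
  title: string;
  specRefs: string[]; // e.g. ["10.2", "10.4"]
}

// Parses the leading "---" block of flat `key: value` lines.
// Anything beyond that shape (nesting, lists, multi-line values) is not handled.
function parseRequirementMeta(markdown: string): RequirementMeta {
  const match = /^---\r?\n([\s\S]*?)\r?\n---/.exec(markdown);
  if (!match) throw new Error("missing frontmatter block");
  const block = match[1] ?? "";

  const fields = new Map<string, string>();
  for (const line of block.split(/\r?\n/)) {
    const colon = line.indexOf(":");
    if (colon === -1) continue;
    const key = line.slice(0, colon).trim();
    // Strip one pair of surrounding double quotes, if present.
    const value = line.slice(colon + 1).trim().replace(/^"(.*)"$/, "$1");
    fields.set(key, value);
  }

  const id = fields.get("id");
  const title = fields.get("title");
  const specRef = fields.get("spec_ref");
  if (!id || !title || !specRef) {
    throw new Error("frontmatter must define id, title, and spec_ref");
  }
  return {
    id,
    title,
    // Split the comma-separated section list into individual references.
    specRefs: specRef.split(",").map((s) => s.trim()).filter((s) => s.length > 0),
  };
}
```

Splitting spec_ref into individual section numbers is what lets a report answer "which spec sections have no requirement pointing at them?" without reading the requirement bodies.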

Critically, the requirement file is the smallest reviewable unit of intent in the repo. When an agent proposes a change, the most important diff in the PR is usually the requirement diff. If that diff is empty and the code diff is large, something is wrong — either the agent is doing unrequested work, or there is a behavior change that should have been promoted to a requirement first.
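
That "empty requirement diff, large code diff" smell is easy to approximate mechanically. The following is only a sketch: FileChange and isUnanchoredChange are hypothetical names, and the 200-line default threshold is an arbitrary assumption rather than a number from the RPP repo.

```typescript
interface FileChange {
  path: string; // repo-relative path
  linesChanged: number; // added + removed lines
}

// True when the path is implementation code rather than a requirement or scenario test.
function isImplementationPath(path: string): boolean {
  return (
    path.startsWith("src/") &&
    !path.startsWith("src/requirements/") &&
    !path.startsWith("src/scenarios/")
  );
}

// Flags a change set that modifies a lot of implementation code
// without touching any requirement document.
function isUnanchoredChange(changes: FileChange[], threshold = 200): boolean {
  const touchesRequirement = changes.some((c) =>
    c.path.startsWith(".github/requirements/")
  );
  const codeLines = changes
    .filter((c) => isImplementationPath(c.path))
    .reduce((sum, c) => sum + c.linesChanged, 0);
  return !touchesRequirement && codeLines > threshold;
}
```

A flag here is a prompt to ask questions, not a verdict: a pure refactor can legitimately change many lines of code without changing intent.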

Requirement tests are the mechanical pin

Every requirement document under .github/requirements has a mirrored test file under src/requirements at the exact same relative path:

Requirement document                     Test file
startup.requirement.md                   startup.requirement.test.ts
003-accept-invitation.requirement.md     003-accept-invitation.requirement.test.ts

Each test file has a single top-level Deno.test() named after the requirement id:

Deno.test("req:invitations-003 - Listeners can accept a pending invitation", async (t) => {
  await t.step("creates a bilateral contact", async () => { /* ... */ });
  await t.step("dispatches an invitation_reply envelope", async () => { /* ... */ });
  await t.step("rejects non-pending invitations with E_INVITATION_NOT_PENDING", async () => { /* ... */ });
});
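
The naming convention in that example is regular enough to check by machine. A small sketch, assuming the test name is always "req:" plus the id, then " - ", then the title, as in the example above; both function names are hypothetical.

```typescript
// Builds the expected top-level test name from a requirement's id and title.
function requirementTestName(id: string, title: string): string {
  return `req:${id} - ${title}`;
}

// Recovers the requirement id from a test name,
// or returns null if the name does not follow the convention.
function requirementIdFromTestName(testName: string): string | null {
  const match = /^req:(\S+) - .+$/.exec(testName);
  return match ? (match[1] ?? null) : null;
}
```

Comparing the id recovered from the test name with the id in the mirrored document's frontmatter catches copy-paste drift between the two files.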

A few details that turn out to matter a lot:

  • Mirrored paths are mandatory, which makes structural coverage analysis trivial — a script can walk both trees and report any requirement without a test, or any test without a requirement, without understanding the contents of either.
  • Tests are integration-scoped. They exercise the real controllers, managers, and repositories against an isolated Deno KV store. They verify what the requirement says, not what the implementation happens to do today.
  • No inline helpers. Reusable test helpers live in helpers, one per file. Test files contain only imports and Deno.test() calls. This sounds like nitpicking but it is load-bearing — it prevents agents from quietly introducing test-local “helpers” that paper over behavior the requirement does not actually mandate.
  • Tautological tests are explicitly banned. Asserting typeof fn === "function" is not a test. Asserting assertEquals(true, true) is not a test. If a test fails, the rule is to diagnose and fix the root cause — never to weaken the assertion until it passes.
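
The structural check described in the first bullet can be sketched in a few lines. This is an illustration under stated assumptions, not the repo's gap-analysis skill: it takes a flat list of repo-relative paths instead of walking the filesystem, and mirroredTestPath and findStructuralGaps are hypothetical names.

```typescript
const DOC_ROOT = ".github/requirements/";
const TEST_ROOT = "src/requirements/";

// Maps a requirement document path to the test path expected to mirror it.
function mirroredTestPath(docPath: string): string {
  const relative = docPath.slice(DOC_ROOT.length);
  return TEST_ROOT + relative.replace(/\.requirement\.md$/, ".requirement.test.ts");
}

interface StructuralGaps {
  requirementsWithoutTests: string[];
  testsWithoutRequirements: string[];
}

// Compares the two trees purely by path, never opening a file.
function findStructuralGaps(allPaths: string[]): StructuralGaps {
  const docs = allPaths.filter(
    (p) => p.startsWith(DOC_ROOT) && p.endsWith(".requirement.md"),
  );
  const tests = allPaths.filter(
    (p) => p.startsWith(TEST_ROOT) && p.endsWith(".requirement.test.ts"),
  );
  const actualTests = new Set(tests);
  const expectedTests = new Set(docs.map((d) => mirroredTestPath(d)));
  return {
    requirementsWithoutTests: docs.filter((d) => !actualTests.has(mirroredTestPath(d))),
    testsWithoutRequirements: tests.filter((t) => !expectedTests.has(t)),
  };
}
```

The check depends only on the naming convention, which is why the convention has to be mandatory rather than a suggestion.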

The last point is where most of the human-in-the-loop value lives. The instruction file says it directly:

Never reduce a test’s scope or remove assertions to work around a failure. The requirement document is the source of truth — the test must faithfully verify what the requirement states.

Agents will absolutely try to weaken tests when they get stuck. Codifying “no” in a rule the agent reads on every turn is what makes the system robust.

Scenarios for whole-system journeys

Requirement tests verify a single normative behavior in-process. They are not enough on their own, because protocols are emergent — the interesting bugs live in the seams between two users, two domains, and two HTTP round-trips.

So RPP has a second tier of executable spec: scenarios. Each is a Markdown document under scenarios describing a real-world journey in plain English (“Alice opens a receptive window; Justin sends her an invitation; Alice accepts and replies with a message”), paired with a mirrored src/scenarios/**/<name>.scenario.test.ts that drives the journey through the real MCP HTTP endpoint using authenticated persona clients.

Scenarios sit below requirement tests in the authority chain. They complement requirement tests; they do not replace them. If a scenario fails because of a conflict with a higher-authority artifact, the implementation gets fixed — the scenario is not weakened.

Coverage as a continuous artifact

Because the structure is mechanically uniform, coverage can be reported as a continuous, regenerable artifact rather than a manual review pass. Two reports live under reports and are regenerated as the codebase evolves:

  • gap-analysis-report.md — structural coverage. Which RFC sections have requirement docs? Which requirement docs have tests? What is missing?
  • evaluate-report.md — semantic coverage. How thoroughly does each requirement test actually verify the meaning of its requirement document, with concrete suggestions for closing gaps?

Coverage is reported as X/Y requirements covered by tests (Z%), and significant gaps (missing requirements, weak normative language, untested behaviors) are called out explicitly so they can be addressed before they become regressions. The reports overwrite themselves on every run; git history is the audit trail.
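
The headline number is simple arithmetic. A sketch of how that line might be produced, assuming the percentage is rounded to the nearest whole number and that an empty requirement set reports 0%; neither choice is confirmed by the source, and coverageSummary is a hypothetical name.

```typescript
// Formats the "X/Y requirements covered by tests (Z%)" headline.
function coverageSummary(covered: number, total: number): string {
  if (
    !Number.isInteger(covered) || !Number.isInteger(total) ||
    covered < 0 || covered > total
  ) {
    throw new RangeError("expected integers with 0 <= covered <= total");
  }
  // Assumption: with no requirements at all, report 0% rather than dividing by zero.
  const percent = total === 0 ? 0 : Math.round((covered / total) * 100);
  return `${covered}/${total} requirements covered by tests (${percent}%)`;
}
```

Keeping the format fixed means the diff between two report runs shows a coverage regression as a one-line change.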

This is the part that changes the reviewer’s job. Instead of reading every line of a 2,000-line PR, a reviewer can look at:

  1. The requirement diff (did intent change?).
  2. The gap-analysis diff (did coverage regress?).
  3. The evaluate-report diff (did any requirement get less thoroughly verified?).

…and then spot-check the code only where one of those three signals fires.

The workflow

When adding a new behavior, the order is always:

  1. If the spec does not already mandate the behavior, propose a spec change first.
  2. Add or update a requirement document under .github/requirements with a stable id and a spec_ref back to the relevant spec section(s).
  3. Add or update a mirrored requirement test under src/requirements that fails until the behavior is implemented.
  4. Implement the behavior in src until the test passes.
  5. Where the behavior spans multiple personas or HTTP round-trips, add a scenario.
  6. Re-run the gap-analysis and evaluate reports and address any regressions.

An agent can do all six steps. A human only needs to deeply read step 2 — and lightly review steps 3 and 5 — to know that the rest of the work is anchored to something they approved.

Where spec-driven flexes: repo-level and implementation-level instructions

Before someone reads the above and concludes “this is just waterfall in a trench coat,” I want to push back: the spec only governs what the system does, not how. The repo still needs opinions about how to write the code, and those opinions live in a separate, complementary layer of customization files. They are loaded by the agent on every relevant turn, and they apply alongside the spec, not in tension with it.

In RPP, that layer looks like this:

Instructions

instructions holds rules that apply automatically based on file path globs:

  • general.instructions.md (applyTo: **) — repo-wide rules: Deno-first, JSR over npm, all third-party imports must be mapped in deno.json, Date objects are the canonical internal representation everywhere (strings only exist on serialized boundaries), Zod validates every external boundary, and so on.
  • patterns.instructions.md (applyTo: src/**) — Grove API conventions: how the context initialization chain layers services → repositories → managers, how controllers accept Request and return Response, where MCP tools register.
  • requirement-testing.instructions.md (applyTo: **/*.requirement.*) — the rules described above: mirrored paths, single top-level Deno.test(), no inline helpers, no tautological tests, never weaken a failing test.
  • scenario-testing.instructions.md (applyTo: **/*.scenario.{md,test.ts}) — scenario conventions: persona clients only, HTTP only, fresh server per test.

These instructions are orthogonal to the spec. The spec says “the system must accept invitations.” The instructions say “and when you write the code, use Zod at the boundary, keep dates as Date objects internally, register the tool through Grove’s MCP layer, and follow the mirrored test path convention.” Neither is sufficient without the other.
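
To see how the applyTo globs layer, the sketch below resolves which of the four instruction files apply to a given path. The matcher is a deliberately small hand-written one that supports only `**`, `*`, and `{a,b}`; the agent tooling's real glob semantics may differ, so treat this as an approximation. globToRegExp and instructionsFor are hypothetical names.

```typescript
// Escapes characters that have special meaning in a regular expression.
function escapeLiteral(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Converts a small glob subset into an anchored regular expression.
function globToRegExp(glob: string): RegExp {
  let source = "";
  let i = 0;
  while (i < glob.length) {
    if (glob.startsWith("**/", i)) {
      source += "(?:.*/)?"; // zero or more leading directories
      i += 3;
    } else if (glob.startsWith("**", i)) {
      source += ".*"; // anything, including path separators
      i += 2;
    } else if (glob.charAt(i) === "*") {
      source += "[^/]*"; // anything within one path segment
      i += 1;
    } else if (glob.charAt(i) === "{") {
      const close = glob.indexOf("}", i);
      if (close === -1) throw new Error(`unclosed "{" in glob: ${glob}`);
      const options = glob.slice(i + 1, close).split(",").map((o) => escapeLiteral(o));
      source += `(?:${options.join("|")})`;
      i = close + 1;
    } else {
      source += escapeLiteral(glob.charAt(i));
      i += 1;
    }
  }
  return new RegExp(`^${source}$`);
}

// File names and globs copied from the list above.
const INSTRUCTION_FILES = [
  { file: "general.instructions.md", applyTo: "**" },
  { file: "patterns.instructions.md", applyTo: "src/**" },
  { file: "requirement-testing.instructions.md", applyTo: "**/*.requirement.*" },
  { file: "scenario-testing.instructions.md", applyTo: "**/*.scenario.{md,test.ts}" },
];

// Returns every instruction file whose glob matches the repo-relative path.
function instructionsFor(path: string): string[] {
  return INSTRUCTION_FILES
    .filter((rule) => globToRegExp(rule.applyTo).test(path))
    .map((rule) => rule.file);
}
```

Under these assumptions, a requirement test under src/requirements picks up the general, patterns, and requirement-testing rules together, which is the layering the paragraph above describes.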

Skills

skills holds reusable, named procedures the agent can invoke:

  • deno-add-module — the right way to add a dependency: map it in deno.json first, prefer JSR, then import from the bare specifier.
  • gap-analysis — regenerate the structural coverage report.
  • evaluate — regenerate the semantic coverage report.
  • get-rpp-token — acquire a bearer token for client scripts (prefers RPP_TOKEN, falls back to az account get-access-token).

Skills are how you capture “the validated way we do this thing” without inlining it into the spec or the instructions. When an agent needs to add a module, it follows deno-add-module rather than improvising — and when the procedure changes, you update one file and every future agent run picks it up.
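
As an illustration of what a skill pins down, here is a sketch of the precedence that get-rpp-token describes: use RPP_TOKEN when it is set, otherwise fall back. The fallback is injected as a plain function so the sketch stays runnable anywhere; in practice it would wrap the az account get-access-token call, whose arguments are not given in the source and are not guessed here. resolveRppToken is a hypothetical name.

```typescript
// A fallback that produces a token some other way (for example, by invoking a CLI).
type TokenFallback = () => string;

// Prefers the RPP_TOKEN environment value and only calls the fallback when it is absent.
function resolveRppToken(
  env: Record<string, string | undefined>,
  fallback: TokenFallback,
): string {
  const fromEnv = env["RPP_TOKEN"]?.trim();
  if (fromEnv) return fromEnv;

  // Command-line tools often emit a trailing newline, so trim the fallback too.
  const token = fallback().trim();
  if (!token) throw new Error("no token available from RPP_TOKEN or the fallback");
  return token;
}
```

Passing the environment and the fallback as arguments keeps the precedence rule testable without real credentials.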

Agents and prompts

agents defines specialized subagents (deno-deploy-builder, deno-deploy-reviewer) with narrower personas and tool sets than the default. prompts holds named, reusable prompts (e.g. quality-checks.prompt.md) that bundle a standard check into a single invocation.

These exist so that implementation-specific concerns — “this is a Deno Deploy edge runtime, no Node-only APIs, no local disk state, fast startup” — can be enforced without polluting the spec, which should remain implementation-agnostic. The RPP spec does not say “the reference implementation runs on Deno Deploy.” That fact lives in the customization layer, exactly where it belongs.

Why this scales the reviewer

The combination is what actually makes a human reviewer effective against an agent producing PRs faster than they can read code:

  • The spec is small, slow-changing, and human-authored. Reviewing a spec diff is a focused intellectual task, not a marathon.
  • The requirements are mechanically pinned to the spec by spec_ref and mechanically pinned to tests by mirrored paths. A reviewer can check coverage without reading test bodies.
  • The tests cannot be tautological, cannot share inline helpers, and cannot be weakened to pass. The agent reads those rules on every turn.
  • The scenarios verify the seams between requirements that no single requirement test would catch.
  • The coverage reports make regressions visible at the PR level instead of at the next outage.
  • The instructions, skills, agents, and prompts keep the how — runtime, framework, dependency policy, dev workflow — out of the spec and in a layer that can evolve independently.

The agent does the typing. The human does the deciding. The artifacts mechanically force the typing to match the deciding.

That, in one sentence, is the whole technique.

Try it

The full repo is at github.com/justinmchase/real-polite-protocol. The interesting files to read first, in order, are:

  1. rpp-spec.md: the protocol itself.
  2. .github/requirements: the decomposition of the spec into reviewable units of intent.
  3. src/requirements: the executable pins.
  4. instructions: the layer that adapts the technique to this particular codebase.
  5. reports: what the continuous coverage artifact actually looks like.

You do not need to adopt all of it at once. The single highest-leverage change — the one that pays for itself the first week — is the authority chain itself. Write it down. Make it the first thing every agent reads. Then start moving behaviors into requirement files one PR at a time.

The agents get faster. The reviewer stays in control. Both need to be true at the same time, and spec-driven development is the cheapest way I have found to make them so.

Author: justinmchase

I'm a Software Developer from Minnesota.
