My CLAUDE.md is 155 lines. My harness is 28,000.

Why the context file is the index, not the encyclopedia — and what actually scaled my agentic workflow.

Jun 12, 2026

Agentic Workflow Harness & Governance Framework

Here are two numbers from my Git repository.

Over the past quarter of building with — and deliberately experimenting on — agentic development practices, my CLAUDE.md grew from 154 lines to 155.

In the same period, the .claude/ directory beside it grew to 28,431 lines across 118 files.

That is a ratio of roughly 183 to 1. And the context file growing by exactly one line was not stagnation or neglect — it was the design goal. Everything I have learned about making agentic development work in earnest is contained in why that ratio looks the way it does.

This piece covers five things: the failure mode almost everyone hits first; the reframe that fixed it; the six layers where the real weight now lives; the security and governance lens that ties the layers together; and what the arrangement bought me in measurable terms. It closes with three honest caveats. The argument that frames all of it: a context file is an index, not a manual.

1. The failure mode everyone hits first

If you have used Claude Code, Cursor, or any comparable agentic tool for more than a week, you will recognise the arc:

1. You start with a small, tidy context file.
2. The agent does something wrong: it uses the wrong naming convention, skips a test, logs a user’s email.
3. You add a rule to the file.
4. Return to step 2.

Six weeks later your context file is a 900-line wall of imperatives — and, here is the hard truth, the agent ignores it more than it did when the file was small. This is not mere anecdote: Anthropic’s own guidance names the over-specified context file as an anti-pattern, and the creator of Claude Code keeps his own team’s file deliberately short — his advice when it bloats is to delete it and start again.

This is not the model being lazy. It is three structural problems compounding:

Everything loads on every request. Your migration rules are in context when the agent is fixing a CSS bug. Your CSS conventions are in context when it is writing a database trigger. Every irrelevant rule is noise diluting the relevant ones.

Instruction-following degrades with volume. A model attending to 40 rules follows them noticeably worse than a model attending to 8. Past a certain point, each rule you add reduces compliance with the rules you already had.

Prose rules are requests, not enforcement. “Never commit directly to main” is an instruction. More than once I have watched the agent lose the thread in a long session, or get confused across a subagent handoff, and the instruction did nothing. The distinction that matters: an instruction is something an agent can ignore, whereas a rule is something that can be enforced. A “rule” that exists only in prose is not a rule at all; it is an instruction wearing a rule’s clothing.

I hit all three. An early version of my context file had the engineering standards inlined in full — naming tables, the TDD workflow, security rules, the lot. About a month in, I gutted it in one deliberate restructuring: the standards moved out, and what stayed behind was pointers.

For example, my testing guidance used to live in the context file as prose:

Testing. We follow strict TDD: write a failing test first (RED), write minimal code to pass (GREEN), then refactor. Coverage must stay above 80% on branches, functions, lines, and statements. Unit tests are co-located next to source files. Never use test.only() or test.skip() in committed code. Integration tests live in…

…and so on, for every discipline. After the restructuring, the same topic occupies one row of a table:

One sentence of essence, one pointer to the file that owns the detail. That pattern, repeated across every engineering concern, is how the file lost hundreds of lines of prose while gaining authority.

2. The reframe: an index, not a manual

The mental model that fixed this for me: a context file is a README plus a routing table. Its job is to tell the agent what exists and where to look — not everything there is to know.

My current 155 lines contain four things:

Short project overview — stack, packages, status; what any new collaborator needs in the first thirty seconds.
Principles table — 39 engineering principles, each one sentence, each with an “Owner” column pointing at the file that holds the normative detail.
Essential commands (npm test, the full verification gate, and friends).
Skill Navigator — a map of where the deep standards live and which auto-loaded rules exist.

Almost nothing normative lives in the file itself. Moreover, the file says so explicitly, in a line I would now call the most important one in it:

“This is a navigation index. The Owner column points to the file containing the normative detail. Update standards/rules — never duplicate definitions here.”

That is the single-source-of-truth principle, applied to agent instructions exactly as you would apply it to code. The moment a rule exists in two places, the copies drift, and the agent follows whichever version it happened to read.
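One way to keep an index honest is to treat it as data and lint it. The following is a minimal sketch, not my actual tooling: `PrincipleRow`, `lintIndex`, the `"enforced"` status value, and the 200-character cap are all hypothetical names and thresholds chosen for illustration.

```typescript
// Hypothetical status tags; "enforced" is an assumed label for fully backed rows.
type Status = "enforced" | "partial" | "pending" | "aspirational";

// Hypothetical shape of one row in the principles table.
interface PrincipleRow {
  principle: string; // the one-sentence essence
  owner: string;     // path to the file that holds the normative detail
  status: Status;
}

// Returns human-readable problems; an empty array means the index is clean.
function lintIndex(rows: PrincipleRow[], maxChars = 200): string[] {
  const problems: string[] = [];
  const seen = new Set<string>();
  for (const row of rows) {
    const text = row.principle.trim();
    if (row.owner.trim() === "") {
      problems.push(`"${text}" has no owner file`);
    }
    if (text.length > maxChars) {
      // A long row usually means detail is creeping back into the index.
      problems.push(`"${text.slice(0, 30)}..." exceeds ${maxChars} characters`);
    }
    const key = text.toLowerCase();
    if (seen.has(key)) {
      problems.push(`"${text}" appears more than once`);
    }
    seen.add(key);
  }
  return problems;
}
```

Run against a table where every row is short, unique, and owned, `lintIndex` returns an empty array; a row with a blank owner or a repeated sentence produces one message each.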

3. Where the 28,000 lines actually live: the six layers

So if the context file is thin, where did the weight go? Into six layers, ordered from “always loaded” to “physically cannot be ignored”. This stack is the real answer to “how do you make agents follow the rules?”

[Table: the six harness layers, with four columns: layer, weight, load trigger, enforcement strength. Caption: The six layers, ordered from always-loaded persuasion to prompt-independent denial.]

Layer 1: the context file. Always loaded, pure navigation: a 155-line index file.

Layer 2: path-scoped rules — loaded when relevant. Ten short, numbered files (coding, API routes, database, security, testing, AI…) that auto-load based on what the agent is touching. Working on an API route? The route pattern and schema-validation rules are in context. Writing a React component? They are not. Each file is terse — constraints only, no tutorials — because it loads alongside live work. This layer alone fixed most of the “irrelevant rules as noise” problem.
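The selection logic behind path scoping can be sketched in a few lines. This is an illustration under assumptions, not the mechanism the tool itself uses: the rule file names, the glob patterns, and the deliberately tiny glob dialect (`**` and `*` only) are all hypothetical.

```typescript
// Hypothetical mapping from a rule file to the paths that should trigger it.
interface ScopedRule {
  file: string;
  globs: string[];
}

// Minimal glob support: "**/" spans directories, "**" matches anything,
// "*" matches within a single path segment. Everything else is literal.
function globToRegExp(glob: string): RegExp {
  let out = "";
  let i = 0;
  while (i < glob.length) {
    if (glob.startsWith("**/", i)) {
      out += "(?:.*/)?";
      i += 3;
    } else if (glob.startsWith("**", i)) {
      out += ".*";
      i += 2;
    } else if (glob.charAt(i) === "*") {
      out += "[^/]*";
      i += 1;
    } else {
      // Escape regex metacharacters so they match literally.
      out += glob.charAt(i).replace(/[.+?^${}()|[\]\\]/, "\\$&");
      i += 1;
    }
  }
  return new RegExp(`^${out}$`);
}

// Returns the rule files that apply to the path being edited.
function rulesFor(path: string, rules: ScopedRule[]): string[] {
  return rules
    .filter((rule) => rule.globs.some((g) => globToRegExp(g).test(path)))
    .map((rule) => rule.file);
}

// Illustrative configuration only.
const EXAMPLE_RULES: ScopedRule[] = [
  { file: "rules/api-routes.md", globs: ["src/app/api/**"] },
  { file: "rules/testing.md", globs: ["**/*.test.ts"] },
  { file: "rules/database.md", globs: ["db/migrations/*.sql"] },
];
```

With this configuration, `rulesFor("src/app/api/users/route.ts", EXAMPLE_RULES)` yields only the API-route rules, and `rulesFor("src/components/Button.tsx", EXAMPLE_RULES)` yields nothing, which is the whole point: a component edit carries no route or migration rules in context.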

Layer 3: standards skills — loaded on demand. Eleven deep standards (testing, security, architecture, error handling, git, database, and so on), each a directory with a skill file plus reference documents. This is where the bulk of the 28,000 lines sits: full TDD workflows, row-level-security policy patterns, error taxonomies, migration safety procedures. None of it loads until the work calls for it — and when it does load, it is the complete treatment, not a summary that loses the edge cases.
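As a mental model for "nothing loads until the work calls for it", consider a lazy registry. This is a conceptual sketch only; `SkillRegistry` and its methods are hypothetical, and the real loading decision is made by the agent tooling rather than by code like this.

```typescript
// Hypothetical registry: each standard registers a loader,
// and nothing is read until a task actually asks for it.
class SkillRegistry {
  private loaders = new Map<string, () => string>();
  private cache = new Map<string, string>();

  register(name: string, loader: () => string): void {
    this.loaders.set(name, loader);
  }

  // Loads the complete standard on first request, then reuses it.
  load(name: string): string {
    const cached = this.cache.get(name);
    if (cached !== undefined) {
      return cached;
    }
    const loader = this.loaders.get(name);
    if (loader === undefined) {
      throw new Error(`unknown skill: ${name}`);
    }
    const body = loader();
    this.cache.set(name, body);
    return body;
  }

  // Names of the standards currently occupying context.
  loadedNames(): string[] {
    return Array.from(this.cache.keys());
  }
}
```

Registering eleven standards costs nothing in context; only the ones a task requests ever appear in `loadedNames()`.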

Layer 4: subagents — context isolation by role. Ten specialised agents (developer, test-designer, code-reviewer, critic, architect…), each with its own role definition, allowed tools, and operating modes. The point is not merely specialisation — it is that each agent carries its own scoped context instead of inflating one giant session. The test-designer knows the testing standard intimately and does not need the deployment runbook.

Layer 5: hooks — where “please don’t” becomes “you can’t”. This is the layer that changed how I think about agent reliability. Tool-use hooks run before every file write: one blocks writes to sensitive paths (.env, keys, .git/) outright; another logs every modification to an audit trail; and a fail-closed write-scope engine enforces capabilities — the developer agent literally cannot edit test files, the test-designer literally cannot edit implementation files, and no agent can edit the workflow configuration itself, because the hook denies those paths too. These are not instructions the agent might forget. They are denials at the tool layer. The prompt could say “ignore all rules” and the hook would still refuse the write.
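The decision at the heart of a fail-closed write scope can be sketched as a pure function. This is a simplified illustration, not my actual engine: the role table, the `src/` convention, the test-file pattern, and the function names are hypothetical, and the wiring into the tool's hook mechanism is omitted.

```typescript
type Decision = { allow: true } | { allow: false; reason: string };

// Paths no agent may write, whatever its role.
const SENSITIVE: RegExp[] = [/(^|\/)\.env(\.|$)/, /(^|\/)\.git\//, /\.(pem|key)$/];

function isTestFile(path: string): boolean {
  return /\.(test|spec)\.[jt]sx?$/.test(path);
}

// Hypothetical capability table: role -> predicate over file paths.
// A Map is used so that unknown role names can never resolve to anything.
const WRITE_SCOPES = new Map<string, (path: string) => boolean>([
  ["developer", (p) => p.startsWith("src/") && !isTestFile(p)],
  ["test-designer", (p) => isTestFile(p)],
]);

function deny(reason: string): Decision {
  return { allow: false, reason };
}

// Note what is absent from the signature: the prompt. Nothing the model
// says or is told can change the outcome.
function decideWrite(agent: string, path: string): Decision {
  if (path.includes("..")) {
    return deny("path traversal is not allowed");
  }
  if (SENSITIVE.some((re) => re.test(path))) {
    return deny("sensitive path");
  }
  if (path.startsWith(".claude/")) {
    return deny("workflow configuration is read-only for agents");
  }
  const scope = WRITE_SCOPES.get(agent);
  if (scope === undefined) {
    return deny(`no write scope defined for agent "${agent}"`); // fail closed
  }
  return scope(path) ? { allow: true } : deny(`outside write scope of "${agent}"`);
}
```

The design choice that matters is the default: an unknown agent, an unlisted path, or an unparseable request all end in a denial, so a gap in the configuration narrows access rather than widening it.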

Layer 6: CI gates — the backstop that does not care what anyone intended. Git hooks enforce branch naming, conventional commits, and a full verification gate before push. CI enforces a coverage gate on changed files, an architecture-alignment check that validates the diff against machine-readable architecture documents, lint, type-checks, and a secrets scan. By the time code reaches a pull request, it has passed gates that no amount of agent confusion can talk its way through.
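Two of these gates are simple enough to sketch directly. The following is an illustrative sketch rather than my pipeline's code: the branch-name convention, the function names, and the per-file coverage map are assumptions, and the 80% default simply mirrors the coverage threshold quoted earlier.

```typescript
// Hypothetical branch convention: type/kebab-case-description.
const BRANCH = /^(feature|fix|chore|docs)\/[a-z0-9]+(-[a-z0-9]+)*$/;

// Conventional-commit subject: type, optional (scope), optional "!", ": ", description.
const COMMIT =
  /^(build|chore|ci|docs|feat|fix|perf|refactor|revert|style|test)(\([a-z0-9-]+\))?!?: \S.*$/;

interface GateResult {
  ok: boolean;
  failures: string[];
}

// Pre-push check: branch name plus every commit subject being pushed.
function checkPush(branch: string, commitSubjects: string[]): GateResult {
  const failures: string[] = [];
  if (!BRANCH.test(branch)) {
    failures.push(`branch "${branch}" does not match the naming convention`);
  }
  for (const subject of commitSubjects) {
    if (!COMMIT.test(subject)) {
      failures.push(`commit "${subject}" is not a conventional commit`);
    }
  }
  return { ok: failures.length === 0, failures };
}

// Coverage gate on changed files only. A changed file with no coverage
// entry counts as 0%, so missing data fails the gate rather than passing it.
function coverageGate(
  changedFiles: string[],
  coveragePercent: Map<string, number>,
  threshold = 80
): string[] {
  return changedFiles.filter((file) => (coveragePercent.get(file) ?? 0) < threshold);
}
```

Neither function knows or cares whether a human or an agent produced the change, which is exactly what makes it a backstop.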

The organising principle across all six layers, stated plainly:

Push every instruction down to the cheapest layer that can enforce it.

Prose is the most expensive and least reliable enforcement mechanism — it consumes context on every request and depends entirely on the model’s attention. Tool-level denial and CI checks are the cheapest and most reliable — they consume zero context and cannot be ignored. Put differently: the workflow itself is deterministic — scopes, gates, and pipelines behave identically on every run — and the model’s stochastic intelligence is spent only where judgement is genuinely required: designing, implementing, reviewing. Determinism owns the rails; intelligence rides on them. A rule should only live in the context file if no lower layer can hold it.

4. The security and governance lens

It took me a while to notice what I had actually built, because I had seen it before, just never applied to an AI. This is control design. An agent with write access to a codebase is a privileged user, and twenty-plus years in cybersecurity have taught me that you never govern a privileged user with a policy document alone. You govern them with controls:

[Table: six security principles mapped to where each lives in the harness. Caption: The same controls you'd apply to any privileged user (least privilege, separation of duties, change control, audit trail), expressed as harness mechanisms rather than policy.]

Seen through this lens, the six layers stop being a productivity hack and become a governance architecture: policy expressed as versioned, reviewable files; enforcement pushed into mechanisms the governed party cannot alter; and an evidence trail for every decision. However, the inverse also holds — if your agentic setup has none of these properties, you have granted a tireless, fallible collaborator unrestricted production access and a polite request to behave. Few organisations would accept that posture for a human contractor. It is worth asking why we accept it for agents.

5. What this bought me

The payoffs were concrete, not aesthetic.

Instructions stopped being ignored — because most of them stopped being instructions. The rules that matter most are no longer competing for the model’s attention; they are hooks and gates.

Violations became measurable and fixable. For example, when I decided to ban unsafe type assertions, I did not add a paragraph of prose — I added a short rule plus lint enforcement, and double-casts in the codebase dropped from 17 instances to 2 audited bridge modules. When agent handoff documents kept failing validation on false positives, the fix was schema validation with drift detection, and false failures dropped from 62 to 6. Prose cannot produce numbers like that; mechanisms can.
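To show why a mechanism yields a number where prose cannot, here is a rough sketch of a double-cast counter with an allowlist for audited modules. It is an illustration under assumptions: real enforcement belongs in an AST-aware lint rule, whereas this regex version would also match inside comments and strings, and the function name, file paths, and allowlist are hypothetical.

```typescript
interface Finding {
  file: string;
  line: number; // 1-based
}

// Matches "as unknown as" and "as any as", the usual double-cast shapes.
const DOUBLE_CAST = /\bas\s+(unknown|any)\s+as\b/;

// files: path -> source text. allowlist: audited modules exempt from the ban.
function findDoubleCasts(files: Map<string, string>, allowlist: Set<string>): Finding[] {
  const findings: Finding[] = [];
  files.forEach((source, file) => {
    if (allowlist.has(file)) {
      return; // audited bridge module, reviewed by hand
    }
    source.split("\n").forEach((text, index) => {
      if (DOUBLE_CAST.test(text)) {
        findings.push({ file, line: index + 1 });
      }
    });
  });
  return findings;
}
```

The length of the returned array is the metric: it can be tracked over time, gated at zero in CI, and reduced one finding at a time.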

Maintenance collapsed to single edits. When a standard changes, I edit one file. The index never needs updating because it never contained the detail. No drift, no archaeology, no “which version of this rule is current?”

6. The honest caveats

Three things to know before copying any of this.

It was not designed up front. This harness co-evolved with the work over the quarter, through a steady stream of deliberate tooling changes. The layers appeared in response to real failures, in roughly this order: the architecture documents became machine-checkable early on; the rules extraction happened around the one-month mark; the specialised agents and pipelines matured over the following fortnight; the capability-based write scopes landed last. Do not build six layers for a weekend project.

Each layer has a trigger. Extract rules out of your context file when it passes ~200 lines. Add a hook when an agent violates the same prose rule twice. Add a CI gate when a violation makes it all the way into a pull request. Let the failures tell you which layer to build next.
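Those triggers are mechanical enough to write down as a function. This is purely a restatement of the paragraph above as code; the `Signals` shape and the action labels are hypothetical.

```typescript
// Hypothetical observations about the current state of a harness.
interface Signals {
  contextFileLines: number;
  repeatsOfSameProseViolation: number; // how often one prose rule has been broken
  violationReachedPullRequest: boolean;
}

type Action = "extract path-scoped rules" | "add a hook" | "add a CI gate";

// Maps observed failures to the next layer worth building.
function nextLayers(signals: Signals): Action[] {
  const actions: Action[] = [];
  if (signals.contextFileLines > 200) {
    actions.push("extract path-scoped rules");
  }
  if (signals.repeatsOfSameProseViolation >= 2) {
    actions.push("add a hook");
  }
  if (signals.violationReachedPullRequest) {
    actions.push("add a CI gate");
  }
  return actions;
}
```

A 155-line context file with no repeated violations returns an empty list, which is the correct answer for a weekend project: build nothing yet.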

Some of it is still aspirational, and the index says so. My principles table honestly tags rows as partial, pending, or aspirational. AI-behaviour evals, for example, are a stated principle with a defined boundary, but the harness is yet to be built. An index that claims more enforcement than actually exists is worse than no index: agents and humans alike learn to distrust it.

Final Thoughts

The context file didn’t scale. The architecture around it did.

The instinct, when an agent misbehaves, is to write a longer manual. The discipline that actually works is the opposite: a smaller index, with every instruction pushed down to a layer that can genuinely enforce it — scoped rules, on-demand standards, role-isolated agents, tool-level denials, and pipeline gates that do not negotiate. Persuasion is a last resort, not a strategy.

This is the first post in a series on how my agentic harness has evolved — and it is the only one that argues from principle. The posts that follow take the layers one at a time: architecture documents that bite back when code drifts from them; path-scoped rules in depth; agents treated as versioned APIs with schema-validated handoffs; capability-based write sandboxing; making RED-GREEN-REFACTOR agent-proof; the time I audited my own pipeline with 129 agents; and the eval harness that I am building.

If your context file has been growing while your compliance has been shrinking, the ratio at the top of this post is the diagnosis — and the next post is the first treatment. Subscribe and the series lands in your inbox, one layer at a time.

And before you go, I would genuinely like to hear it: what is the largest context file you have let an agent ignore?

Vishal Garg
