Understanding Harness Engineering: A Simple Yet Deep Dive into Context-Layer Architecture for Agentic Development

For developers navigating between AGENTS.md/CLAUDE.md, skills, hooks, MCP, and everything in between.


Why this matters

You’ve set up Claude Code, added a few MCP servers, run /init to generate a CLAUDE.md, and maybe dropped in some skills. With the latest generation of state-of-the-art LLMs now genuinely capable of producing high-quality production code, it mostly works.

But over time (more features, more repos, more edge cases), things can still get messy: the agent ignores skills, context fills up fast, and output quality degrades across long sessions.

The usual reaction is to add more: more skills, more rules, more docs. Counterintuitively, that often decreases code quality. Most agent failures are context-management failures, and stuffing more content into the window usually makes things worse.

Every rule added to CLAUDE.md, every skill, every hook, is a patch.

It compensates for something the codebase fails to communicate on its own. A well-structured module with consistent conventions and enforced boundaries does not need a paragraph spelling out its conventions: the agent can read them from the code.

That reframe matters because it changes what harness engineering is actually for. The goal is not to accumulate layers, but to make each one unnecessary, one decision at a time, by moving that decision into the codebase itself, where it becomes permanent, visible, and impossible to ignore.

Analyzing the context layers is precisely what reveals where those gaps are.

The core problem

Context is a signal-quality problem, not a capacity problem

The core execution model of an agent is an iterative loop:

Gather context -> Take action -> Verify result -> Done or loop back

At every step, the agent draws from its context window: a fixed-size buffer holding everything it currently “knows” about the session, including instructions, conversation history, file contents, and tool call results. When that buffer gets noisy or overloaded, the agent doesn’t degrade cleanly: it starts making subtle mistakes.

Wrong information in context is worse than missing information. Useful signal gets buried under irrelevant content, and the agent stops separating the two reliably.

The layered answer

The answer is not more context, but a harness designed around how each tool interacts with the context window.

Your harness is the set of tools, constraints, and feedback loops that make those layers work together.

A simple mental model

Context-Layer Architecture

  • Prompt
  • Permanent Layer: CLAUDE.md · AGENTS.md
  • On-Demand Layer: Skills · MCP · WebSearch · CLI · Subagents
  • System Layer: Hooks · Permissions
  • Feedback Layer: Tests · Linter · Type Checker · Build

In practice:

  1. Permanent: what belongs in every turn.
  2. On-demand: what should load only when needed.
  3. System: what must be enforced without trusting the model.
  4. Feedback: what checks the result after execution.

Layer 1: Permanent context (AGENTS.md/CLAUDE.md)

This is the Markdown file at the root of your project that loads into the agent’s context on every turn, without being explicitly invoked.

The first instinct when setting one up is to write everything: architecture overview, folder structure, team conventions, library choices, onboarding notes. That instinct is worth resisting. A permanent context file should be small, strict, and operational. If a rule is not worth enforcing on every single task, it probably does not belong here.

Keep it short

AGENTS.md/CLAUDE.md files tend to reduce task-success rates compared to providing no file at all, while simultaneously increasing inference cost by over 20%. Auto-generated files (via /init or similar) are the primary culprits: they force the agent to spend reasoning tokens on information it could infer directly from reading the code. Bloated, contradictory, or over-specified files turn useful signal into noise.

What belongs here

The agent can already read your codebase. What helps is the stuff it cannot infer from code: tribal knowledge, non-obvious constraints, reasoning directives, and traps that have already caused real issues. Think of it as the short list of things you would tell a senior engineer on day one.

Three kinds of content are worth keeping:

Hard technical obligations

Constraints that apply unconditionally and that the agent might not pick up from context alone.

  • “Always use pnpm, not npm.”

Gotchas

Non-obvious traps specific to this codebase.

  • “We need to keep folder /pointOfSaleOld for backward compatibility. We will remove it once we turn the feature flag on.”
  • “The auth token lifecycle is per-session, not per-request. Storing it in a closure or WeakMap will cause stale-token bugs on long connections.”

Retrieval nudges

Help the agent get relevant context when API docs are not in training data by routing to relevant skills, using web search, or looking into sibling repos.

  • “Prefer retrieval-led reasoning over pre-training-led reasoning when using the Expo SDK: always use WebSearch to get docs matching the specific version.”
  • “business-logic is a sibling repo you may need to navigate and edit when necessary (cd ../business-logic).”

Note: in upcoming sections, you will see that some of these can be moved to the on-demand or system layer to improve context engineering further.


Layer 2: On-Demand tools (Skills, MCP, WebSearch, CLI, Subagents)

This layer covers everything the agent can reach for when needed, but that does not load automatically. These tools do different jobs, and treating them as interchangeable is a good way to get a messy setup.

  • Skills: portable packages of instructions, scripts, and resources. Use when the agent needs domain knowledge, best practices, or procedural steps.
  • MCP: external service integrations with persistent state. Use for structured tool access to authenticated or stateful external systems.
  • WebFetch / WebSearch: real-time web access. Use when the agent needs up-to-date, precise information not in training data.
  • CLI: direct execution through shell commands and installed command-line tools. Use when the task is best handled through local commands, scripts, or developer tooling.
  • Subagents: spawned helper agents for scoped exploration or execution. Use when the task can be decomposed into bounded subtasks or parallelized.

Skills

Done well, skills are one of the most effective levers in a harness. They move specialized knowledge out of permanent context and into a retrieval model: the agent reaches for what it needs, when it needs it. That keeps the context window clean, and you only pay the cost of expertise when you actually need it. This is the fundamental argument against a CLAUDE.md that keeps growing forever: permanent context is a fixed overhead while skills are a variable cost. In practice, move as many AGENTS.md / CLAUDE.md rules as possible into dedicated skills.

A skill is more than a single .md file: it is a directory with up to three parts:

  • SKILL.md (required): Contains YAML frontmatter (metadata) and Markdown instructions. The agent only loads the name and description from the frontmatter into its context. If it judges that the description matches the user’s request, it then opens the entire file to follow the instructions.
  • scripts/ (optional): Executable code (Bash, JS/TS, Python) that lets the agent perform actions the model alone cannot.
  • references/ (optional): In-depth documentation loaded only if the agent needs to look something up mid-task. This is an additional sub-layer of on-demand context.

The three core skill types

To keep a harness usable, categorize skills by intent.

1. Documentation and knowledge skills

Even the most advanced models have a knowledge cutoff: a date, set by the end of their training data, beyond which they know nothing.

  • Purpose: Provide information the agent doesn’t know or might misremember.
  • Example: If you use Expo SDK 55, the agent might not know the API details simply because this specific API version may not have been in its training data.
  • Solution: Expo Skills
2. Behaviors and best practices

LLMs tend to generate “average” code.

  • Purpose: Encode the conventions, patterns, and practices you want the agent to follow, so its output reflects your standards rather than the statistical average of its training data.

3. Tooling skills

This is probably the most underused skill type.

  • Purpose: Give the agent capabilities it doesn’t have natively, by bundling scripts that produce output the model alone cannot.
  • Example: A codebase-visualizer skill that runs a bundled script to generate an interactive HTML tree of your project.
  • Why it matters: Without the script, this is a prompt. With the script, it is a tool.

Risks: bloat and security

It is tempting to install every best-practice skill you can find, but it is usually a mistake.

  1. Context bloat: even with lazy loading, the agent still scans every installed skill description on every turn. If you have 50 skills, you have added 2,000+ tokens of routing noise to each prompt.
  2. Prompt-injection risk: a skill is an executable prompt. A malicious third-party skill can embed hidden instructions that alter agent behavior. Always audit SKILL.md and any associated scripts before adding a skill to your harness.

How to: install skills for your agent

MCP for stateful integrations

MCP (Model Context Protocol) is an open standard for structured communication between an agent and external systems. In practice, an MCP server is a small Node.js or Python service that exposes typed tools the agent can call. The agent discovers tools, invokes one, and gets a structured response back.

That matters in two main cases:

1. Authenticated integrations

Some systems need a persistent, credentialed connection: Atlassian, GitHub, Context7, and others.

Manually managing tokens in shell environment variables or passing credentials as CLI flags is fragile and error-prone. MCP solves authentication once and exposes structured actions on top.
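
In Claude Code, for example, project-scoped MCP servers are declared once in a .mcp.json file at the repo root, with credentials supplied through the environment rather than pasted into prompts. A minimal sketch (the server and package names are illustrative):

{
  "mcpServers": {
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"],
      "env": { "GITHUB_PERSONAL_ACCESS_TOKEN": "${GITHUB_TOKEN}" }
    }
  }
}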

2. External state manipulation

MCP is also the right tool when the agent needs to operate inside another system, not just query it.

A good example is Chrome DevTools MCP: the agent can open Chrome, inspect the live DOM and CSS, read console and network activity, simulate user flows, and record a performance trace through DevTools. It is not just fetching documentation about the page. It is operating inside a running browser session and reading the resulting state back. The state lives in Chrome, not in the context window, and MCP is the bridge.

MCP tools are usually not very token efficient. If you do not need authentication or persistent external state to operate inside another system, you probably do not need MCP. A skill usually solves the same problem with less overhead and less complexity.

WebSearch and WebFetch for retrieval

These tools are native to most modern agents. They solve two problems:

  • Knowledge cutoff: a language model trains on a snapshot of the world at a specific date. For anything that changes, such as a new Next.js release, a revised Expo SDK, or a breaking change, the model does not know.
  • Precision errors: even for stable APIs in training data, the model may generate plausible but incorrect details, such as wrong method signatures or invented edge-case behavior.

WebSearch and WebFetch are the answer to both. Architecturally, they provide retrieval on demand: instead of trusting pre-training weights, the agent fetches factual data from up-to-date sources and reasons from there.

  • “Upgrade Storybook from v8 to v10.33 (latest). Don’t just upgrade version, make necessary corresponding API changes in the codebase. Use WebSearch to get up to date docs”

It is often worth making WebSearch usage explicit in your prompts, AGENTS.md, CLAUDE.md, or skills to replace the LLM’s default behavior:

“Prefer retrieval-led reasoning over pre-training-led reasoning whenever precision matters”

That shifts the default from “the model probably knows” to “check first before acting.”

CLI as the execution surface

CLI is the natural execution surface for agents, and it falls into two categories:

Native tools

Unix fundamentals (find, grep, sed, awk, jq, curl) and core git commands are deeply embedded in most agents’ training. They need no introduction and carry almost no context cost. The agent can chain them and adapt them to novel situations without explicit instructions.

Augmented CLIs

These are CLIs you can install to extend your agent’s capabilities, tools that are not part of the base toolchain but become available as soon as they are installed on the machine. In practice, if you want the agent to use them reliably, you also need to explicitly tell it they exist in AGENTS.md or in a skill.

A good example is gh, the official GitHub CLI. It unlocks direct access to GitHub operations from the shell.

The same logic applies across a broader tool set:

  • agent-browser gives the agent the ability to control a headless browser from the command line, which is useful for testing, debugging, or navigating the web UI during execution.
  • Cloud-provider CLIs such as AWS CLI and Azure CLI expose hundreds of operations the agent can chain directly, using syntax it already knows from training.
  • Custom CLIs built specifically for your infrastructure can expose internal operations behind an interface the agent can discover on demand via --help.

When should you use the CLI? If a tool has a mature CLI and the agent can use it from its own training as a starting point, prefer the CLI. MCP wins when the tool has no CLI, when authentication is too awkward to manage cleanly in shell, or when the workflow requires persistent state in an external system.

Subagents as isolated workers

A subagent is an agent spawned by the main agent to handle a bounded subtask. It gets:

  • its own context window
  • its own tool access
  • its own scope

and returns a result to the parent when it finishes.

From a context-architecture point of view, this matters because it moves work out of the main context entirely.

Instead of loading a large codebase analysis or a long diagnostic sequence into the primary window, you delegate it. The parent agent sees a clean result, not all intermediate reasoning and file reads that produced it.

The practical gains are:

  • Isolation: A subagent that goes wrong does not corrupt the main session’s context.
  • Parallelism: Subagents can run concurrently on independent tasks, such as writing tests for module A while refactoring module B.

In practice, most agents handle this automatically. Claude Code, Codex, Kiro, and similar tools spawn subagents when tasks warrant it. You usually do not configure this, but if you want finer control, you can explicitly spawn custom subagents for well-defined subtasks.


Layer 3: the System layer (hooks and permissions)

This is the enforcement layer. Unlike the permanent and on-demand layers, it does not rely on the model’s judgment at all. It intercepts execution at lifecycle events and allows, blocks, or transforms actions before they reach the filesystem or external systems. Permissions and hooks run deterministically. They do not forget rules when the context gets crowded, which is why they are the most reliable enforcement surface in the harness.

Permissions

Permissions define what the agent is allowed to attempt: file-system access, network access, and whitelisted CLI commands. There is usually little to tweak here, but avoid whitelisting destructive commands you would never want executed without approval.
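
In Claude Code, for instance, these rules live in the project settings as allow/deny patterns over tools. A minimal sketch (the specific patterns are illustrative):

{
  "permissions": {
    "allow": ["Bash(pnpm test:*)", "Bash(pnpm lint)"],
    "deny": ["Read(./.env)", "Bash(curl:*)"]
  }
}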

Hooks: deterministic enforcement

Where an AGENTS.md/CLAUDE.md rule can be ignored, a hook is a hard gate.

Unlike AGENTS.md/CLAUDE.md, hooks do not live in the prompt. They only inject content into context when they fail. That makes hooks ideal for rules you never want violated, without paying an ongoing context cost.

Note: the hook implementation described in this section corresponds to Claude Code. Other agents that implement hooks may expose a different model, event set, or handler system, since this layer is not yet truly standardized.

Handler types

Claude Code supports three handler types:

  • command: runs a shell script. Use for structural checks, enforcement, and formatting.
  • prompt: sends context to a model for a judgment call. Use when the decision requires interpretation, not a hard rule.
  • agent: spawns a subagent with tool access. Use for deep verification that needs codebase exploration.

Focus on command first. It is deterministic, fast, has no inference cost, and covers most enforcement needs.

Lifecycle events

Hooks attach to specific points in the agent’s execution cycle. Claude Code exposes many; two matter most:

PreToolUse fires before any tool executes. It is the only event that can block actions. Every tool call (Bash, Edit, Write, Read, WebFetch, Task, or any MCP tool) passes through here first. Your hook receives a JSON payload on stdin with the tool name, its full input, and session context.

Exit 0, and execution proceeds. Exit 2 with a message on stderr, and the action is blocked, with that message fed directly back to the agent.

# Read the PreToolUse JSON payload from stdin
INPUT=$(cat)
# Extract the shell command the agent is about to run
command=$(echo "$INPUT" | jq -r '.tool_input.command // empty')

# Block npm whether it starts the command or follows a separator (;, &&, |)
if echo "$command" | grep -qE "(^|[[:space:]&|;])[[:space:]]*npm "; then
  echo "Blocked: use pnpm, not npm." >&2
  exit 2
fi

That makes PreToolUse the right place for policy enforcement and human-in-the-loop gates on irreversible operations like production deploys, database migrations, and git writes.
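
For completeness, a script like the one above still has to be registered. In Claude Code, that happens in settings; a minimal sketch, assuming the script is saved as .claude/hooks/block-npm.sh:

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [{ "type": "command", "command": ".claude/hooks/block-npm.sh" }]
      }
    ]
  }
}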

PostToolUse fires after a tool completes successfully. It cannot block, but it can inject structured feedback via additionalContext. The pattern is straightforward: run a quality check, capture output, return it to the agent. A linter catches an error; the error description flows back into context; the agent resolves it in its next action. This closes the loop without any human intervention.

# Read the PostToolUse payload and extract the path of the edited file
FILE=$(jq -r '.tool_input.file_path // empty')
[[ "$FILE" =~ \.(ts|tsx)$ ]] || exit 0

# Auto-format the file the agent just touched
npx prettier --write "$FILE" 2>/dev/null

# Type-check the project; on failure, feed the first errors back to the agent
if ! ERRORS=$(npx tsc --noEmit 2>&1); then
  echo "Type errors introduced - resolve before proceeding:" >&2
  echo "$ERRORS" | head -20 >&2
fi

Use PreToolUse for policy guards and PostToolUse for cleanup and feedback.

How to: install hooks for Claude Code

Move hard rules out of AGENTS.md context

Many rules that clutter AGENTS.md/CLAUDE.md are actually enforcement candidates, not context candidates:

  • “Always use pnpm, not npm or yarn.”
  • “Never manually edit files in the __generated__ directory.”
  • “All commits must follow conventional commit format.”

These are hard constraints, not implicit knowledge. The use-pnpm rule becomes a PreToolUse hook inspecting every Bash command. The __generated__ protection becomes a file-path check on Write operations. Commit-format enforcement runs on Bash tools invoking git commit.

Moving enforcement rules out of permanent context and into hooks is one of the highest-leverage cleanups you can make. It keeps AGENTS.md/CLAUDE.md focused on what genuinely needs reasoning context and reserves the system layer for what requires absolute guarantees.


Layer 4: the Feedback layer (tests, build, lint, type checker)

This verification loop closes the agent action cycle. It is one of the most underbuilt layers in agentic setups, and one of the most important to get right.

The agent can produce something, report success, and still be wrong. The feature might work, but the code quality can be low. The feedback layer exists to catch that. Tests validate functional correctness, type checking catches structural breakage early, and linting enforces consistency without needing a human to step in every time. Together, these checks keep the codebase maintainable and high-quality, and let the agent operate more autonomously.
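
A practical prerequisite is that each check is one command away, so the agent (or a hook) can run it without ceremony. A sketch of what that might look like in package.json, assuming ESLint and Vitest:

{
  "scripts": {
    "typecheck": "tsc --noEmit",
    "lint": "eslint . --max-warnings 0",
    "test": "vitest run",
    "check": "pnpm typecheck && pnpm lint && pnpm test"
  }
}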

Type checking

tsc --noEmit is usually the fastest deterministic signal in a TypeScript stack. It knows your interfaces, exports, and function signatures. When the agent refactors a shared utility or changes the shape of a DTO, tsc reports the downstream breakage before tests or builds even start.

Stricter rules are free signal

With a human developer, a strict type config can feel like friction. It slows you down, forces explicit decisions, and surfaces errors you meant to clean up later. In agentic development, that logic flips. The agent has no real concept of “later.” It produces code, gets a signal, and reacts immediately.

The stricter the compiler, the richer the signal. A strict tsconfig is not a constraint on the agent. It is a free quality multiplier applied to everything it produces.

The rules worth enabling:

  • strict: true in tsconfig.json is non-negotiable in an agentic context.
  • noUnusedLocals and noUnusedParameters catch the debris of refactoring. The agent reorganizes logic and leaves behind variables and parameters that no longer serve a purpose.
  • allowUnreachableCode: false and allowUnusedLabels: false surface dead code the moment it is introduced.
  • noUncheckedSideEffectImports: true blocks side-effect-only imports where the module cannot be verified to exist.
  • noFallthroughCasesInSwitch: true forces explicit intent on every switch case.
  • paths: { "@/*": ["./src/*"] } is not a validation rule, but a structural contract. It forces imports through resolved aliases rather than relative paths.
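
Put together, the compilerOptions block might look like this (a sketch to adapt to your setup):

{
  "compilerOptions": {
    "strict": true,
    "noUnusedLocals": true,
    "noUnusedParameters": true,
    "allowUnreachableCode": false,
    "allowUnusedLabels": false,
    "noUncheckedSideEffectImports": true,
    "noFallthroughCasesInSwitch": true,
    "baseUrl": ".",
    "paths": { "@/*": ["./src/*"] }
  }
}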

A stricter compiler does not slow the agent down. It gives it better signal on every turn.

Linting

The linter is an architectural contract

The same logic that applies to a strict tsconfig applies here too. Every lint rule you add is a zero-token sensor that fires on every change the agent makes: no hoping the model remembered the right paragraph in CLAUDE.md, no waiting for review, no human spotting the issue later. The difference is that a type checker enforces structural correctness. A linter enforces intent: architectural decisions, team conventions, deprecated patterns, and domain-specific rules the type system cannot express.

An agent that writes “average” code is often an agent operating without enough constraints. The linter is one way to raise the floor.

The philosophy of strict baselines

Before writing custom rules, start with a strict baseline that treats lint errors as failures, not warnings. A strict baseline catches a whole class of LLM-shaped mistakes (unnecessary assertions, overly broad error handling, sloppy generics, barrel imports, missing exhaustive checks) right when they appear. Quality then becomes a property of the environment, not something you have to ask for in a new prompt.

Ultracite is a good example of this philosophy. It is a highly opinionated lint preset that bundles hundreds of rules across TypeScript, React, accessibility, imports, and code quality, pre-tuned to be strict without being noisy. Whether you adopt Ultracite itself or assemble your own equivalent, the principle is the same: a strict baseline replaces tedious back-and-forth with the agent and gives you high signal-to-noise enforcement out of the box.

File and function size limits as architectural guardrails

LLMs tend to produce large, monolithic files. A 200-line utility quickly becomes an 800-line file as the agent iterates. The problem is not just readability: performance degrades as context within a file grows. The model spends more tokens tracking internal references, local state, and nested logic, and less on the actual task, and the file becomes harder to test, review, and maintain.

You can solve this deterministically with built-in ESLint/OxLint rules that enforce size limits:

{
  "rules": {
    "max-lines": ["error", { "max": 600, "skipBlankLines": true, "skipComments": true }],
    "max-lines-per-function": ["error", { "max": 250, "skipBlankLines": true, "skipComments": true }]
  }
}

These constraints encode principles you would enforce as a developer anyway if you care about clean code architecture and patterns: composability, separation of concerns, and testable units. The difference is that a lint rule applies them automatically and immediately, enforcing deterministically what would otherwise require constant vigilance—without waiting for review, without relying on the LLM’s judgment in the moment. The agent adapts by producing smaller, more focused units from the start, and the codebase stays navigable as it grows.

Project-specific rules are the real leverage

The highest-leverage linting work is the rules you write yourself, specific to your codebase, your domain, and your team’s accumulated knowledge.

Every architectural decision that currently lives as tacit team knowledge is a lint rule waiting to exist:

  • “Do not import the database layer from UI components.”
  • “Use the internal `httpClient` wrapper, not raw `fetch`.”
  • “The payments module cannot import from analytics.”
  • “We deprecated `moment`, use `date-fns`.”

Each of these exists as a comment in a PR, a section in a wiki, or tribal knowledge in someone’s head, all of which the agent will never reliably reach, and none of which survive team turnover. Turn them into rules, and they become part of the environment the agent operates inside.

no-restricted-imports is the simplest governance primitive:

"no-restricted-imports": ["error", {
  "paths": [
    { "name": "axios", "message": "Use the internal httpClient wrapper instead." },
    { "name": "moment", "message": "Use date-fns. moment is deprecated." }
  ]
}]

For architectural boundaries, eslint-plugin-boundaries goes further. It lets you declare which layers (UI, domain, infrastructure, shared) can import from which, and turns every violation into an immediate, local error before it reaches review, before it reaches CI, before it propagates across the codebase.
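
A sketch of what that can look like (layer names and globs are illustrative):

{
  "plugins": ["boundaries"],
  "settings": {
    "boundaries/elements": [
      { "type": "ui", "pattern": "src/ui/*" },
      { "type": "domain", "pattern": "src/domain/*" },
      { "type": "infrastructure", "pattern": "src/infrastructure/*" }
    ]
  },
  "rules": {
    "boundaries/element-types": ["error", {
      "default": "disallow",
      "rules": [
        { "from": "ui", "allow": ["domain"] },
        { "from": "domain", "allow": ["infrastructure"] }
      ]
    }]
  }
}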

Every time a pattern appears more than twice in code review, ask whether it can become a lint rule. If yes, it probably should. A recurring review comment is a lint rule waiting to exist, and in an agentic workflow, a lint rule is considerably more reliable than a comment.

The more project-specific rules you encode, the more the agent’s output reflects your actual codebase instead of statistical averages from training data. Each rule is another sensor. More sensors means better signal. Better signal usually means better output.

Tests

Tests as behavioral signal

Tests are the most direct feedback signal in your harness. A type checker tells the agent the code is structurally valid, a linter tells it the code follows the rules, and tests tell it whether the code does what it’s supposed to do.

Writing tests used to be expensive and tedious, so teams sometimes settled for thin coverage and happy-path-only suites. The feedback loop was limited by how much pain the team was willing to absorb.

That cost structure has changed. Describe the behavior, point the agent at the module, and it can draft a test suite quickly. The practical implication is that coverage gaps are now feedback-loop gaps, and weak tests are bad signals. The agent will keep moving either way. If the suite does not clearly define correct behavior, nothing reliably catches drift when it happens.

A strict baseline and high-quality tests create a virtuous circle: they become tangible anchors that guide the agent’s next changes and let it evolve in the codebase with confidence.


Your codebase is the highest signal

A tight CLAUDE.md and quality skills are simply good documentation. A strict TypeScript configuration is what good engineers try to enforce on every codebase. Lint rules that encode architectural decisions are written institutional knowledge. Treating tests as a “feedback loop” is not a new insight; it is one of the oldest ideas in software quality.

Harness engineering is just good engineering

What is new is the cost of not doing it. When a human developer skips documentation or writes a weak test, the team’s judgment and memory usually compensate for the gap. The system is imperfect, but it generally holds together.

An agent has none of that. Every gap in your harness is a gap the agent may fall into.

The paradox is that a well-engineered codebase barely needs CLAUDE.md at all.

Agents are strong pattern matchers. If architectural decisions and code patterns show up consistently, the agent does not need the rules spelled out every time because it can read them from the environment.

Manual context layers exist to compensate for gaps. Eliminate the gaps and you eliminate most of what those AGENTS.md and skill files needed to say.

The discipline harness engineering asks for is the same discipline good engineering has always asked for: encode decisions so they outlive the people who made them, prefer deterministic enforcement over tribal knowledge, and close feedback loops early.

What has changed is where your attention goes: the agent writes the code, and your job is to review and improve the environment it operates in. The underrated promise of agentic development is that a well-designed codebase, under constant automated pressure, converges toward optimal quality faster than any team ever could manually.


Sources