
Harness Engineering: The Discipline After the Hype

April 9, 2026 · 16 min read

"It feels faster" isn't a metric. "The vibes are good" doesn't pass an audit. Here's how to actually prove your AI-assisted workflow is working — with numbers, not feelings.

Part 6 of 6 in the "Context Engineering for Real Teams" series

Welcome to the final post of this series. If you've been following along, you've set up context seeds, expanded context engineering to your whole team, and picked your framework. Your AI is producing better code, your PM is getting faster status reports, your QA engineer is generating test scenarios from specs.

Everything feels great. The vibes are immaculate.

Here's my problem with vibes: they don't scale.

You know what scales? A CI pipeline that fails when your AI agent hallucinates a function that doesn't exist. A scoring system that catches when your context seed references a pattern you abandoned three sprints ago. A validation loop that gives you a number — not a feeling — about whether your AI setup is actually making you better or just making you feel faster.

That's harness engineering. And if context engineering is the 2026 skill for working with AI, harness engineering is the 2026 skill for trusting AI.

The Maturity Curve Nobody Talks About

The industry went through three phases in roughly 18 months:

Phase 1: Vibe Coding (early 2025). "Hey Claude, build me a landing page." It works! It's magic! Ship it! Nobody checks the output because the sheer novelty of AI-generated code overrides any quality concerns. The engineering equivalent of eating raw cookie dough — delightful, questionable, occasionally regrettable.

Phase 2: Context Engineering (mid 2025 - early 2026). "The AI writes better code when we give it structured context." This is where we've spent most of this series. CLAUDE.md files, spec-driven development, team context systems. The output quality improves dramatically. But validation is still mostly manual — someone reads the code, someone runs the tests, someone eyeballs the results.

Phase 3: Harness Engineering (2026 and beyond). "We can prove the AI's output meets our standards, consistently, at scale." This is where it gets real. This is where you stop saying "it seems better" and start saying "our evaluation suite verified 94.3% spec compliance before any human touched this PR."

For teams selling their services to clients — whether you're an outsourcing company, an agency, or a freelancer — that shift from "seems better" to "measurably better" is the difference between a nice internal experiment and a competitive advantage you can put in a proposal.


What IS an Eval Harness?

An eval harness (or evaluation harness) is a statistical validation loop that systematically measures the quality of AI-generated output. It's not a test suite. Tests are binary — pass or fail. An eval harness is multi-dimensional — it measures how good the output is across several axes simultaneously.

Think of it like this: a test asks "does this function return the right value?" An eval harness asks "did the AI generate a function that's correct, relevant, grounded in the project context, and doesn't contain hallucinated dependencies?"

The key metrics an eval harness measures (a minimal scoring sketch follows them):

Faithfulness — Is the output grounded in the context provided? When the AI writes code that uses a processPayment() function, does that function actually exist in your codebase? Faithfulness is the hallucination detector. It catches the moments when your AI confidently references things that don't exist.

Retrieval Precision — Did the AI pull the right context? If you asked it to fix a bug in the auth module, did it actually look at the auth module's files, or did it read some unrelated utility and base its answer on that? Precision tells you whether your context loading strategy is working or whether the AI is drowning in irrelevant information.

Relevance — Does the output actually address the question asked? This sounds obvious, but AI has a remarkable talent for generating beautifully written, technically correct answers to questions nobody asked. A relevance score catches the drift.

Context Utilization — How much of the context you injected was actually used? If you're feeding 10,000 tokens of project context into every prompt and the AI only uses 2,000 of them, you've identified 8,000 tokens of noise that's costing you money and potentially diluting attention. This metric helps you trim your context seeds to just the high-signal content.
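
To make those four metrics concrete, here's a minimal sketch of what a single eval record and a run summary could look like. The 0-to-1 scale, the field names, and the 0.8 threshold are illustrative assumptions; use whatever scales and thresholds your team agrees on.

// A minimal sketch of an eval record for one AI-generated output.
// Metric names mirror the four above; the 0-1 scale is illustrative.
interface EvalResult {
  promptId: string;            // which request from your evaluation set
  faithfulness: number;        // 0-1: are referenced symbols grounded in the codebase?
  retrievalPrecision: number;  // 0-1: share of retrieved files that were actually relevant
  relevance: number;           // 0-1: does the output address the question asked?
  contextUtilization: number;  // 0-1: share of injected context the output drew on
}

// Aggregate a run into a report. The 0.8 threshold is an assumption.
function summarize(results: EvalResult[], threshold = 0.8) {
  const avg = (pick: (r: EvalResult) => number) =>
    results.reduce((sum, r) => sum + pick(r), 0) / Math.max(results.length, 1);

  const report = {
    faithfulness: avg(r => r.faithfulness),
    retrievalPrecision: avg(r => r.retrievalPrecision),
    relevance: avg(r => r.relevance),
    contextUtilization: avg(r => r.contextUtilization),
  };
  return { ...report, passed: Object.values(report).every(score => score >= threshold) };
}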

The LLM-as-Judge Pattern

"But wait," you say, "evaluating AI output is itself a judgment call. Who evaluates the evaluator?"

Welcome to the LLM-as-Judge pattern. The idea: use a stronger model to grade the outputs of your specialized agents. Concretely:

Your development workflow uses Sonnet (fast, affordable) for code generation. Your eval harness uses Opus (slower, more capable) to review the Sonnet-generated code on a standardized scale. Opus reads the original spec, the generated code, and the project context, then scores the output from 0 to 3 on each metric — faithfulness, relevance, compliance with coding standards.
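
Here's a minimal sketch of that judge step using the Anthropic TypeScript SDK. The rubric, the JSON output format, and the model ID are assumptions to adapt; the only real requirement is that the judge sees the spec, the project context, and the generated code together.

import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Ask a stronger model to grade generated code against the spec and project context.
// The rubric, the 0-3 scale, and the model ID below are illustrative assumptions.
async function judgeOutput(spec: string, projectContext: string, generatedCode: string) {
  const response = await anthropic.messages.create({
    model: 'claude-opus-4-1', // placeholder: whatever your "judge" tier is
    max_tokens: 512,
    messages: [{
      role: 'user',
      content: [
        'You are reviewing AI-generated code. Score each criterion from 0 to 3.',
        'Reply with JSON only: {"faithfulness": n, "relevance": n, "standards": n, "notes": "..."}',
        '## Spec\n' + spec,
        '## Project context\n' + projectContext,
        '## Generated code\n' + generatedCode,
      ].join('\n\n'),
    }],
  });

  const first = response.content[0];
  // Assumes the judge returned raw JSON; add retry or repair handling in practice.
  return first.type === 'text' ? JSON.parse(first.text) : null;
}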

This isn't theoretical. Teams running this pattern report catching 15-20% of issues that would have otherwise made it to human code review — saving the human reviewer time and catching problems earlier in the cycle.

For simpler validations, you don't even need an LLM judge. Rule-based checks work perfectly: Does the generated code compile? Do the referenced functions exist? Does it import from the correct paths? Does it follow the naming conventions specified in the constitution? These are automated checks that run in seconds and catch the most common AI mistakes.
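
Two of those checks, sketched in TypeScript. The compile check shells out to the project's own tsc; the import rule (no reaching up the tree with ../ paths) is an assumption standing in for whatever your constitution actually specifies.

import { execSync } from 'node:child_process';
import { readFileSync } from 'node:fs';

// Check 1: does the generated code even type-check? Shells out to the project's tsc.
function compiles(): boolean {
  try {
    execSync('npx tsc --noEmit', { stdio: 'pipe' });
    return true;
  } catch {
    return false;
  }
}

// Check 2: do imports follow the convention? The "no ../ parent imports" rule here
// is an assumption standing in for your real constitution.
function importViolations(file: string): string[] {
  const violations: string[] = [];
  for (const line of readFileSync(file, 'utf8').split('\n')) {
    const match = line.match(/from ['"](\.\.\/[^'"]+)['"]/);
    if (match) violations.push(`${file}: parent-relative import ${match[1]}`);
  }
  return violations;
}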


Context Rot: The Silent Killer

Here's a failure mode that doesn't get enough attention. You set up your CLAUDE.md three months ago. It references a validateOrder() function that was renamed to verifyOrder() in Sprint 7. It describes an architecture pattern you abandoned when you migrated from REST to tRPC. It lists a library version that's two major releases behind.

Your context seed is rotting. And because the AI faithfully follows the instructions in the seed, it's now generating code that references nonexistent functions, follows deprecated patterns, and imports outdated library APIs. The output "looks right" because it matches the seed — but the seed is wrong.

This is context rot, and it's the context engineering equivalent of technical debt. The only defense is automated validation: checks that run regularly (ideally in CI/CD) to verify that your context seeds are still accurate.

A basic context rot prevention script:

# Check if functions referenced in CLAUDE.md still exist in the codebase
grep -oP '`\K[a-zA-Z_][a-zA-Z0-9_]*\(\)' CLAUDE.md | while read -r func; do
  # Strip the trailing () and search the source tree for the bare function name
  if ! grep -rq "${func%()}" src/ --include="*.ts"; then
    echo "WARNING: $func referenced in CLAUDE.md but not found in codebase"
  fi
done

Is this script perfect? No. Is it better than nothing? Enormously. The first time it catches a stale reference that would have caused 30 minutes of debugging, it pays for itself.


Context Drift: The Subtle One

Context rot is when your seed is factually wrong. Context drift is subtler — it's when your AI's behavior gradually shifts because the context window is growing with each interaction, accumulating noise and contradictory information.

Over a long coding session, you give the AI corrections: "Actually, use Zod not Yup." "No wait, use the v2 API, not v1." "Forget what I said earlier about the error handling." Each correction adds tokens. The original context gets buried under layers of amendments. The AI starts making inconsistent decisions because it's trying to reconcile contradictory instructions from different points in the conversation.

The solution is what we described in Part 3: the artifact-driven workflow. Brainstorm in conversation, distill into a document, start a fresh session with the document as context. The document is your versioned, validated, clean context. The conversation is the messy process of creating it. Once the document exists, the conversation is disposable — and so is its accumulated drift.

Superpowers (discussed in Part 4) addresses this architecturally through its subagent isolation. Each worker agent receives only the context it needs for its specific task — not the entire conversation history, not the accumulated corrections, not the debates from the brainstorming phase. Clean context in, clean output out.

Continuous Evaluation: Context in Your CI/CD

The final step is treating context like code — version it, test it, review it in PRs. Here's what that looks like practically:

Run eval suites after every change to prompts, context files, or retrieval logic. If a change to your CLAUDE.md degrades your eval metrics by more than 5%, the build fails. This prevents well-intentioned "improvements" to context files from actually making things worse. A minimal sketch of such a gate follows this list.

Version your context files. They should be in Git. They should be reviewed in PRs just like code. When someone modifies the project constitution, the team should see and discuss it — not discover two weeks later that someone silently changed the error handling pattern.

Track eval metrics over time. A dashboard showing faithfulness, relevance, and context utilization trends tells you whether your context engineering practice is improving, plateauing, or degrading. When you see a metric dropping, investigate before it becomes a problem.

Error budgets, not binary gates. Treat eval metrics as error budgets rather than strict pass/fail gates. An AI output that scores 2.7/3.0 on faithfulness is probably fine. One that scores 1.2/3.0 needs investigation. Make the thresholds explicit and agree on them as a team.
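
A minimal sketch of the build-failing gate from the first point: compare the current run against a committed baseline and exit non-zero when any metric drops past the budget. The file names and the 5% figure are assumptions; wire it into whichever CI you already run.

import { readFileSync } from 'node:fs';

// Fail the build if any eval metric dropped more than the allowed budget relative
// to the committed baseline. File names and the 5% budget are assumptions.
const BUDGET = 0.05;

const baseline: Record<string, number> = JSON.parse(readFileSync('eval/baseline.json', 'utf8'));
const current: Record<string, number> = JSON.parse(readFileSync('eval/current.json', 'utf8'));

let failed = false;
for (const metric of Object.keys(baseline)) {
  const drop = baseline[metric] - (current[metric] ?? 0);
  if (drop > baseline[metric] * BUDGET) {
    console.error(`FAIL: ${metric} dropped from ${baseline[metric]} to ${current[metric]}`);
    failed = true;
  }
}
process.exit(failed ? 1 : 0);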

The "Detonation Chamber" Concept

OpenAI's engineering team introduced a useful metaphor: harness engineering keeps AI agents in safe, observed "detonation chambers" where their outputs are validated before hitting production. The agent can experiment, hallucinate, make mistakes — but those mistakes are caught and contained before they cause real damage.

This maps directly to the Superpowers workflow we covered in Part 4: the two-stage code review (spec compliance, then code quality), the mandatory TDD (failing test first, then implementation), and the verification-before-completion step. These aren't optional quality checks — they're mandatory methodology that wraps the agent in validation at every step.

The result: you can trust the AI more because you've built a system that doesn't require trust. Trust is earned through verification, not assumed through vibes.

Your Eval Harness Starter Kit

Here's a minimal but functional starting point:

1. Define your metrics. For code generation: faithfulness (do referenced functions exist?), standard compliance (does it follow the constitution?), test coverage (are tests included?). For non-code outputs: relevance (does it answer the question?), completeness (is anything missing?), format compliance (does it match the template?).

2. Create a ground truth dataset. Collect 20-30 representative prompts from your actual work — not synthetic examples, but real requests your team makes regularly. Record the expected output quality for each. This is your evaluation baseline (see the sketch after this list).

3. Automate the easy checks. Compile the code. Run the linter. Check for stale references in context files. These rule-based checks catch 40-60% of issues with zero LLM cost.

4. Add LLM-as-Judge for the rest. For subjective quality checks (is this API design reasonable? Does this test scenario cover the right edge cases?), use a stronger model to evaluate. Score on a 0-3 scale. Log everything.

5. Integrate into CI/CD. Run your eval suite on every PR that modifies context files. Run the context rot check weekly. Dashboard the trends. Celebrate when metrics go up. Investigate when they go down.
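
And for step 2, a ground-truth entry doesn't need to be elaborate. Here's a sketch of one record; every field name and value is an assumption, including the renamed-function trap borrowed from the context rot example above.

// One entry in the ground-truth dataset (step 2). All names and values are illustrative.
interface GroundTruthCase {
  id: string;
  prompt: string;              // a real request your team actually makes
  contextFiles: string[];      // files that should be loaded for this request
  mustReference: string[];     // symbols the output is expected to use
  mustNotReference: string[];  // known traps: deprecated or renamed symbols
  minJudgeScore: number;       // minimum acceptable LLM-as-Judge score, 0-3
}

const cases: GroundTruthCase[] = [
  {
    id: 'orders-bugfix-001',
    prompt: 'Fix the duplicate-charge bug in order processing',
    contextFiles: ['src/orders/checkout.ts', 'src/orders/verify.ts'],
    mustReference: ['verifyOrder'],
    mustNotReference: ['validateOrder'], // the pre-Sprint-7 name: a context-rot trap
    minJudgeScore: 2.5,
  },
];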


The Full Picture: Everything Connected

Let me zoom out one last time and show how the entire series connects:

Post 1 gave you the "why" — context engineering is the discipline that separates productive AI usage from expensive chat therapy with a language model.

Post 2 gave you the "how" for solo work — seed files, multi-project isolation, the context flywheel that gets smarter over time.

Post 3 expanded to the team — every role contributes context, every role consumes context, the artifact-driven workflow connects them all.

Post 4 armed you with frameworks — OpenSpec for planning, Spec-Kit for governance, BMAD for agile simulation, Superpowers for execution discipline, ClaudeKit for team orchestration, MCP for connectivity.

Post 5 laid out the complete lifecycle — brainstorm, plan, implement, validate, and the security step that most AI workflows skip entirely.

This post closed the loop with validation — because the only thing worse than not using AI is using AI and having no idea whether it's helping.

The pattern across all six posts is the same: move from ad-hoc to systematic, from vibes to validation, from execution to orchestration. Start small. Build the flywheel. Measure the results. Adjust. Repeat.

Context engineering isn't a one-time setup. It's a practice — like testing, like code review, like any other engineering discipline. The teams that treat it as infrastructure will outperform the teams that treat it as a trick. Not by a little. By a lot.

Now go build something. Your AI is ready. And thanks to the context you've set up, this time it actually remembers who you are.


That's a wrap on the "Context Engineering for Real Teams" series. If you found this useful, share it with your team — especially the non-developers. They're the ones who need it most and hear about it least.

If you want to go deeper into the tool ecosystem, check out ClaudeKit for Claude Code team orchestration, Superpowers for execution methodology, and the Awesome Context Engineering repo for a comprehensive curated list of resources.


This is Part 6 of the "Context Engineering for Real Teams" series. Part 1 → | Part 2 → | Part 3 → | Part 4 → | Part 5 →