From a Single File to an MCP Server: Six Rewrites of My Own Harness
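Summary
The article details six stages in the evolution of an AI instruction set (harness): from splitting a single Markdown file into modules, to directory-based scoping, to a semantic design that distinguishes rules, skills, and commands. The configuration then became a scaffold and a framework through the Sellier and Keystone tools, and finally evolved into an MCP server that lets an agent actively fetch context through structured interfaces, run multi-step workflows, and manage its own state.
Why read it
It lays out the full architectural progression from a single-file prompt to an MCP server, which you can use to improve context management for AI agents: structuring rules, skills, and workflows as MCP tools addresses the instruction conflicts and vague agent understanding that long Markdown files cause.
Original article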
The file was 1,800 lines long. It lived at ~/.claude/CLAUDE.md. It had grown a table of contents on top, then nested headings, then a few inline rules that contradicted each other, then a section I had forgotten I wrote. I was halfway through adding another rule when I realized I could not remember which of the previous rules it would override. I closed the file. The next morning I broke it into five smaller files. That was the first rewrite.
Six months later, the same set of ideas lives inside an MCP server that any agent can talk to. The rules are still there. The skills are still there. The lifecycle is still there. The shape is unrecognizable from the 1,800-line single file, and at the same time, nothing about what the harness means has changed. Every rewrite preserved the content and reshaped the container around it.
I want to walk through the six rewrites in order, because each one was driven by a question I asked about my own setup that turned out to be the right question to ask. The rewrites are not a recommendation. You do not need to follow this path. The recommendation is the question that drove each one: what is this rule actually doing, and is the place it lives the place it belongs? If you ask that often enough, your own path will fall out.
The thing I did not expect, looking back, is that the rewrites kept making my setup more useful to other people, not less. The opposite of what you would predict. A personal config file should drift toward being more personal over time. Mine drifted toward being more general. I want to explain why.
The file got too big
The first version of my Claude harness was a single CLAUDE.md at the user-global path. It started small. A paragraph about how I name files. A few rules about testing. A note about not writing comments that narrated the code. Plain markdown, no structure, no frontmatter, nothing the tool was supposed to parse.
That worked for about a week.
Then I added a section about commit messages. Then a section about how to ask before deleting files. Then a long block about the kinds of refactoring I trust without review and the kinds I do not. Then a paragraph about preferring paratest to phpunit in PHP projects, which contradicted a paragraph I had written a month earlier saying I did not have a strong preference between them. By the time the file hit 800 lines, I was the only person in the world who could load it without skimming, and even I was skimming.
The first failure mode was not the agent’s. It was mine. I could not remember what was in my own config. I would write a rule about something, forget I had written it, and write a slightly different version a month later. Two contradictory rules sitting in the same file, both ostensibly mine, both ostensibly active, and no way for the agent to know which one I meant.
I broke the file into pieces. One file per topic. Testing. Commits. Editing. Communication. A small MEMORY.md at the top that linked to the others. The split took an afternoon. The morning after, I could find every rule I had written by skimming the directory instead of skimming the file. The agent’s behavior changed, too. The rules stopped fighting each other.
That was rewrite one. Modularity at the file level. The lesson was the cheapest one: a configuration file that does not fit on your screen does not fit in your head either, and a thing that does not fit in your head will go wrong in ways you cannot predict.
Subdirectory files, scoped to the work
The second rewrite came from a different irritation. Most of my rules were global by accident. I had written them in the global config because that was the file I was editing, not because they applied to every project I worked on. The rule about preferring pytest applied to Python projects. The rule about Rails conventions applied to one Rails app. The rule about writing TypeScript types before implementations applied to TypeScript work. None of those were true everywhere, and the agent was loading all of them everywhere.
Claude Code had grown subdirectory CLAUDE.md files by then. A file at ~/foo/bar/CLAUDE.md only loaded when the agent was working under that subtree. The path itself decided activation. No frontmatter, no flag, no rule resolver; just the location.
I pulled the project-specific rules down to project repos. I pulled the language-specific rules down to language-specific subtrees. The global file shrank to maybe a third of its size. What stayed was the stuff that was actually true about me regardless of project: how I write, how I think about tradeoffs, what I want the agent to do when it is uncertain.
The shape that emerged was something like this:
~/.claude/CLAUDE.md # me, in every project
~/work/acme/CLAUDE.md # this company, every repo
~/work/acme/backend/CLAUDE.md # this team
~/work/acme/backend/service-x/CLAUDE.md # this service
Four layers, each one narrower than the last, each one loading by virtue of its path. The agent saw the union of whichever ones applied to where it was working. I did not have to think about activation rules because the filesystem was the activation rule.
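To make "the filesystem was the activation rule" concrete, here is a minimal sketch of path-based activation. It is not how Claude Code resolves its files internally; the function names, the `existing` argument, and the `/home/me` paths are all assumptions made for illustration.

```python
from pathlib import PurePosixPath


def candidate_configs(workdir, home, filename="CLAUDE.md"):
    """List config paths that could apply to workdir, broadest scope first.

    Hypothetical helper: the global file under home/.claude comes first,
    then one candidate per directory on the way from home down to workdir.
    """
    workdir = PurePosixPath(workdir)
    home = PurePosixPath(home)
    candidates = [home / ".claude" / filename]
    current = home
    # relative_to raises ValueError if workdir is not under home.
    for part in workdir.relative_to(home).parts:
        current = current / part
        candidates.append(current / filename)
    return candidates


def active_configs(workdir, home, existing):
    """Keep only the candidates that exist (existing stands in for the disk)."""
    existing = {PurePosixPath(p) for p in existing}
    return [p for p in candidate_configs(workdir, home) if p in existing]
```

Given the four-file layout above, working in the service directory would activate all four files, while working in a sibling repo would activate only the layers on its own path. No flag or resolver is involved; the working directory alone decides.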
This is the rewrite where I learned that scope is a first-class property of a rule. Not all rules are global. Not all rules are local. Some are team-shaped, some are language-shaped, some are repo-shaped. A harness that does not give you a place to put each of those will collapse them all into one file and you will be back where I started.
Rules, skills, and commands, when the tool grew a vocabulary
Claude Code grew first-class primitives next. Rules became one thing — pieces of always-loaded context. Skills became another — named procedures the agent could invoke. Commands became a third — slash commands the user typed. The vocabulary mattered. A rule is not a skill. A skill is not a command. Mashing them all into one markdown file had been hiding the differences.
I rewrote my config to match the new shape. Rules went under ~/.claude/rules/. Skills went under ~/.claude/skills/. Commands went under ~/.claude/commands/. The bodies of the markdown files barely changed. The directories did most of the work.
The interesting thing was watching what got easier. Skills with names became searchable. Commands with names became invokable directly. Rules with names became reviewable. The boundaries forced me to ask, for each thing I had written, what kind of thing is this? Some answers were obvious. Many were not. A few were both: the same paragraph had been doing the work of a rule and the work of a skill, and pulling it apart improved both pieces.
This was rewrite three. The lesson: the vocabulary of the tool is doing real work even when you do not see it. When the tool grows a named slot for skills, your skills are easier to write, easier to read, and easier to reason about than they were when they lived inside a paragraph in a wall of text. The naming is the design. This is the Rumpelstiltskin Principle in action: when you give something a name, you gain power over it, because you can talk about it.
By the end of rewrite three, my personal harness had stopped looking like a personal config and started looking like a small system. Half a dozen rule files. A dozen skills. A handful of commands. A MEMORY.md that indexed the rest. It worked well enough that I started copying chunks of it into project repos when I started new work, the way you copy a dotfile into a new machine.
That copy-pasting was the next problem.
Sellier, the first time it left my laptop
The first time I built a tool to scaffold this stuff was a project I called Sellier. It was a small Python binary. It took a project directory and dropped a starter set of rules, skills, and commands into the right places. I built it because I had copied my harness into nine different repos by hand and the variations were starting to diverge in ways I could not track.
Sellier was the minimum thing that would have stopped me from copy-pasting. A CLI with one command. A handful of embedded templates. No configuration. You ran sellier init, you got a harness, you moved on.
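A scaffolder of that size fits in a few lines. The sketch below is not Sellier's source; the template paths and bodies are hypothetical stand-ins, and the skip-if-exists behavior is an assumption about what a cautious `init` would do.

```python
import tempfile
from pathlib import Path

# Hypothetical embedded templates; the real starter set is not shown here.
TEMPLATES = {
    ".claude/rules/testing.md": "# Testing\n\nDescribe how tests are run here.\n",
    ".claude/skills/release.md": "# Release\n\nList the release steps here.\n",
    ".claude/commands/review.md": "# Review\n\nDescribe the review command here.\n",
}


def init(project_dir):
    """Write each template under project_dir and return the files created."""
    created = []
    for relative, body in TEMPLATES.items():
        target = Path(project_dir) / relative
        if target.exists():
            continue  # never clobber something the user already wrote
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(body, encoding="utf-8")
        created.append(target)
    return created


if __name__ == "__main__":
    # Scaffold into a throwaway directory to show the call shape.
    with tempfile.TemporaryDirectory() as tmp:
        print(len(init(tmp)), "files created")
```

Note what is missing: the function records nothing about which version of the templates it wrote, so it has no way to tell later whether an installed harness is stale.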
It worked. It also revealed its own limits within a month.
The first limit was that one starter set is not a real answer to every project. A Rails app and a Python data pipeline and an Elixir service do not want the same testing rules, the same lint conventions, the same commit conventions. Sellier dropped the same set of files into all three and let you edit them after. That was barely better than copy-paste. The work of customizing the starter was still my work, just on slightly fresher templates.
The second limit was that I had no way to share the harness with other people. Sellier scaffolded my opinions. If a colleague used it, they got my testing rules whether they agreed with them or not, and the only path to disagree was to delete my rules and write theirs. There was no layer where their team could customize the defaults without forking the tool.
The third limit was the one that mattered. Sellier had no story for evolution. Once it dropped files, it was done. If I changed my mind about a rule, every Sellier-installed harness was stale. There was no sellier update, because update implied a notion of what version of the rules you had and a notion of what the new version looked like, and Sellier had neither.
Sellier was a thing that pretended to be smaller than it needed to be. The pretense kept it simple, and the simplicity was what made it unusable for the second project that tried to adopt it.
Keystone, when Sellier ran out of room
I started Keystone the week I gave up on Sellier. The thesis was different: instead of one scaffold for everyone, build a system that knew what kind of project it was scaffolding for and produced a harness tuned to that kind. Backend, frontend, monorepo, library, CLI tool. Each one wanted slightly different defaults. Each one had its own opinions baked into the templates.
Keystone 0.1 was a Go binary that ran keystone init, asked you which agent, asked you which kind of project, and dropped a typed starter set. It was Sellier with more options, written carefully enough that the options could grow.
The next two weeks were a sequence of features that each looked like a small addition and added up to something bigger.
First, agent multi-select. Some teams used Claude Code alongside Cursor. The scaffold needed to lay down both sets of files at once.
Then a kind taxonomy on the rules themselves. Some rules were inferential — the agent reading markdown and reasoning about it. Others were computational — a deterministic tool, a linter, a type checker, a test runner. Different load behavior, different verification semantics. The same starter pack needed to handle both.
Then a forward-migration path. Keystone shipped with a migrate command that knew how to walk an installed harness from one version to the next without breaking anything the user had customized. The harness had a schema now. The schema had a version. The migration command was the proof that I had to take both seriously.
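A versioned schema plus a forward walk can be sketched as a chain of single-step upgrades. Everything here is assumed for illustration: the `schema_version` field, the two example migrations, and the in-memory dict standing in for an installed harness. Keystone's actual schema is not described in this article.

```python
import copy


def _v1_to_v2(harness):
    # Hypothetical change: every rule gains an explicit "kind" field.
    for rule in harness["rules"]:
        rule.setdefault("kind", "inferential")  # keep a kind the user already set
    return harness


def _v2_to_v3(harness):
    # Hypothetical change: rules are keyed by name instead of stored in a list.
    harness["rules"] = {rule["name"]: rule for rule in harness["rules"]}
    return harness


# Each entry upgrades a harness from that version to the next one.
MIGRATIONS = {1: _v1_to_v2, 2: _v2_to_v3}
LATEST = 3


def migrate(harness):
    """Return a copy of harness walked forward to LATEST, one step at a time."""
    result = copy.deepcopy(harness)  # leave the caller's data untouched
    while result["schema_version"] < LATEST:
        step = MIGRATIONS[result["schema_version"]]
        result = step(result)
        result["schema_version"] += 1
    return result
```

Because each step only adds or reshapes fields, text the user customized survives the walk, and running `migrate` on an already current harness changes nothing.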
Then policy plugins. Organizations that wanted to share governance across many projects (e.g., vendor lists, license rules, release gates, compliance rules for HIPAA or GDPR) needed a way to ship that content as a unit. Plugins lived in git repos with a small manifest. Projects pulled them in by reference, pinned them with a lockfile, and resolved conflicts against project-level files.
Then teams between org and project. The clean org-to-project model broke as soon as one company had a backend team and a frontend team with different conventions. The team layer became its own tier, sitting above the project and below the org.
Then a cascade. With three layers (org, team, project) and rules that could exist at any of them, resolution order became a thing the tool had to be opinionated about. Project overrode team overrode org, unless the higher tier marked something as strict, in which case the higher tier won. Required items propagated downward as gaps the lower tier had to fill or explain.
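The resolution order described above can be sketched in a few lines. This is an illustration of the stated rule, not Keystone's implementation: the `Entry` type and its field names are hypothetical, and the "or explain" half of required items is left out.

```python
from dataclasses import dataclass
from typing import Optional

TIERS = ["org", "team", "project"]  # broadest to narrowest


@dataclass
class Entry:
    value: Optional[str] = None  # None means declared but not filled in
    strict: bool = False         # a strict value cannot be overridden below
    required: bool = False       # some tier must supply a value


def resolve(layers):
    """Resolve tier -> key -> Entry into (resolved, gaps).

    resolved maps each key to (value, winning tier); gaps lists required
    keys that no tier has filled in.
    """
    resolved, locked, required = {}, set(), set()
    for tier in TIERS:
        for key, entry in layers.get(tier, {}).items():
            if entry.required:
                required.add(key)
            if key in locked or entry.value is None:
                continue  # a higher tier marked it strict, or nothing to apply
            resolved[key] = (entry.value, tier)  # narrower tier overrides
            if entry.strict:
                locked.add(key)
    gaps = sorted(key for key in required if key not in resolved)
    return resolved, gaps
```

Walking the tiers from broadest to narrowest means a later write is an override, and the `locked` set is the only thing that stops it. A required item with no value anywhere surfaces as a gap rather than as a silent default.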
By Keystone 0.13, the binary had grown a runtime, conventions, plugins, a cascade, a lockfile, migrations, and per-agent rendering. The README still called it a scaffolder. I had been calling it a scaffolder for half a year. It was not a scaffolder. The moment of recognition came when I sat down to plan the 1.0 release and could not write the next paragraph without using the word framework. The 1.0 work was the refactor that admitted what the code had already become.
When the framework wanted to be a server
Keystone 1.0 shipped as a Go binary. You ran keystone init in a project. The binary scaffolded a harness, wrote a lockfile, and walked away. After that you had a set of files in your repo and you edited them like any other source code. If you wanted to add a guide, you ran keystone new guide <name>. If you wanted to verify drift, you ran keystone verify. If you wanted to install a policy from your org, you ran keystone plugin add <ref>.
The 1.0 surface was good. It was the right shape for an installed CLI. It also had a quiet problem I did not see until I started using it on a real project for a few days: the agent did not know anything had happened.
The harness existed on disk. The agent loaded the rules ambiently when its context window allowed. But Keystone the binary was a separate world from Claude the agent. They did not talk. The agent did not know which guides were active. It did not know which sensors had been run recently. It could not ask Keystone “what is the deploy policy on this project?” and get an answer. It could only read the markdown file the same way it read any other file in the repo, hoping the right section had been loaded into context.
There was a perfectly serviceable workaround. The agent read the markdown. Markdown was the source of truth. It worked.
I sat with the question for about a week: should Keystone be an MCP server?
I tried to think of a reason it should not. The harness has structured data — topics, sources, rules, severities, classifications. An agent that can call a tool to fetch structured data with a known schema gets better answers than an agent that scans paragraphs of markdown looking for relevant sentences. The harness has workflows — bootstrap, task, audit, learn. An agent that can be handed a workflow as a multi-step prompt is more reliable than an agent that has to read a playbook file and try to follow it. The harness has live state — sensor results, drift reports, budget calculations. An agent that can ask “show me the current drift report” and get a JSON answer is more useful than an agent that runs a shell command and parses the output.
Every one of those was an argument for the server. None of them were arguments against. The only argument against was inertia.
I built keystone-mcp the week after I admitted there was no good reason not to.
What keystone-mcp actually does
The MCP server is the same harness, repackaged as a thing the agent can call directly. Three surfaces:
Tools. Functions the agent can invoke. keystone_get_context(topic) returns a structured envelope with the rules, the reasoning, the skills, and the commands relevant to a topic. keystone_list_topics() enumerates what is available. keystone_harness_bootstrap() scaffolds a new harness skeleton. There are scaffolders for every kind of thing the harness supports — guides, sensors, scripts, prompts, skills, actions, playbooks, corpus entries, adapters. The CLI’s keystone new <thing> commands all became MCP tools that the agent could call without leaving its loop.
Prompts. Multi-step workflows the agent can be handed. bootstrap() walks the agent through analyzing the codebase and filling in state ledgers. task(description) runs an end-to-end work loop (spec, orient, implement, verify, review) with handoffs at each step. audit() runs the dual flywheel: learn from what worked, prune what is stale. learn(finding) captures a finding from the current session into the learning queue, where it gets promoted to a real rule later.
Resources. Read-only data the agent can fetch by URI. keystone://harness/status returns the layout audit. keystone://harness/verify returns a cascade report: what resolved, what is unreachable, what violates canonical rules, and what required items are missing. keystone://harness/budget returns the per-port token budget so the agent knows when ambient load is getting heavy. keystone://context/{topic} returns the full envelope for a topic.
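To show what a structured envelope buys over scanning markdown, here is a minimal sketch in plain Python. The names keystone_get_context, keystone_list_topics, and the keystone://context/{topic} URI come from the description above; the envelope fields, the `HARNESS` data, and `read_resource` are assumptions, and the sketch deliberately skips registration with any MCP SDK.

```python
import json

# Hypothetical harness content, keyed by topic.
HARNESS = {
    "testing": {
        "rules": ["Write tests before refactoring legacy code."],
        "reasoning": "Tests pin behavior down before the structure moves.",
        "skills": ["run-test-suite"],
        "commands": ["/test"],
    },
    "commits": {
        "rules": ["Separate behavioral changes from structural changes."],
        "reasoning": "Reviewers can check each kind of change on its own.",
        "skills": [],
        "commands": ["/commit"],
    },
}


def keystone_list_topics():
    """Enumerate the topics the harness knows about."""
    return sorted(HARNESS)


def keystone_get_context(topic):
    """Return one envelope with everything relevant to a topic."""
    entry = HARNESS.get(topic)
    if entry is None:
        # An unknown topic still gets a structured answer the agent can act on.
        return {"topic": topic, "found": False, "available": keystone_list_topics()}
    return {"topic": topic, "found": True, **entry}


def read_resource(uri):
    """Serve the same envelope as read-only JSON under a context URI."""
    prefix = "keystone://context/"
    if not uri.startswith(prefix):
        raise KeyError(uri)
    return json.dumps(keystone_get_context(uri[len(prefix):]))
```

The agent gets the same keys on every call, so it can ask for `rules` directly instead of hoping the right paragraph was loaded into context.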
The shape that emerged is one I would not have predicted when I started. The harness goes in. The MCP server exposes it. The agent calls the server, gets structured answers, and operates on them without ever having to skim a 1,800-line markdown file.
The same content, exposed to the agent through tools, prompts, and resources instead of ambient file loads.
The harness creation, the additions, the pruning, the flywheels — every part of the lifecycle is something the agent can do through the server. Nothing about the user’s side has changed. The files are still in the repo. The user still edits markdown. The MCP server is the thing the agent sees; the markdown is the thing the human sees. Two views of the same content, each tuned for its consumer.
Why I kept questioning my own setup
The six rewrites had one thing in common that I did not see until I lined them up next to each other. Every one started with me looking at my own setup and asking, is this the right shape for what it is doing? Not *does it work?* It always worked, more or less; the question was the shape.
The 1,800-line file worked. I rewrote it because it did not fit in my head.
The split into per-topic files worked. I rewrote it because the topics were global by accident, not by design.
The flat global config worked. I rewrote it because the tool had grown a vocabulary for the distinctions I had been flattening.
Sellier worked. I rewrote it because one starter set was not enough for the variety of projects I was scaffolding.
Keystone 0.x worked. I rewrote it because it had been pretending to be a scaffolder while doing framework work.
Keystone 1.0 worked. I rewrote it as an MCP server because the agent could not talk to it, and that was a worse problem than I had admitted.
None of those rewrites were forced by a bug. They were forced by friction I noticed when I looked carefully. The friction is the signal. You can ship a working setup forever and never look at the friction; you can also look at the friction once a quarter and find the next layer that wants to come off.
I think this is the most useful lesson for anyone running an agent setup of their own. The setup will be wrong in a way that does not break it. The wrongness is the thing to listen for. Why does this feel awkward? Why do I copy-paste this every time? Why do I edit this file with one hand on the back button? Why do I keep adding rules that contradict the rules I added last month? The questions point at the next rewrite.
The accidental generalization
The thing I keep meaning to make peace with is that asking those questions did not make my setup more personal. It made it more general.
If you had asked me at the start what I was building, I would have said the best harness for me, Ian, working the way I work. What came out of six rewrites is a harness framework with no opinion about who is using it, scaffolders that produce starters tuned to the project type, tools that let an org enforce its own conventions from relevant sources, an MCP server that any agent can call.
That generalization was not the plan. It was a side effect of asking honest questions about my own setup. The questions kept revealing that the thing I had thought was personal was actually structural. The split between rules and skills and commands is not specific to me. The cascade from org to team to project is not specific to my company. The need for a forward-migration path on a harness with a schema is not specific to my workflow. Each time I asked what is this rule actually doing?, the answer came back at a level of abstraction that other people could share.
The corollary is the part I want to leave with you. Your personal setup, examined honestly, has more structure in it than you think. Most of the work you have done is structural, not personal. The personal part (your specific testing preferences, your specific naming conventions, your specific aesthetic) is a thin layer on top of a thicker layer of the kinds of things a harness needs to do. Pull on the threads of your setup and the structural layer comes free.
This is the thing I did not expect to find. When I built Keystone for myself, I built a tool other people could use. When I built keystone-mcp for myself, I built an MCP server other agents could call. The selfishness of building for yourself, taken seriously, turns into the generosity of building for everyone — but only if you keep asking is this the right shape? and listen when the answer is no.
Structure, not content
The thing Keystone provides is structure. Not content. This is worth being explicit about, because the most common question I get when I show someone the tool is what rules does it ship with?
The answer is: as few as possible. There is a small set of universal engineering rules in the default policy. These are things that are true regardless of language or framework, like write tests before refactoring legacy code or separate behavioral changes from structural changes in commits. Everything else is up to the project. Keystone scaffolds the directories. Keystone scaffolds the file shapes. Keystone scaffolds the lockfile, the cascade, the agent adapters, the sensor stubs. The content that goes inside those scaffolds is yours.
The reason the tool can be useful to people who do not share my opinions is that it does not have opinions about most things. It has opinions about layout. It has opinions about how the agent should be told about a rule. It has opinions about cascade resolution. The opinions are structural, and they are opinions a wide range of teams can adopt without their own opinions changing.
The bootstrap step is where the content gets filled in. The first thing keystone-mcp’s bootstrap() prompt does is hand the agent a long workflow for analyzing the codebase: which language, which framework, which tests, which lints, which deploy targets, which conventions are already in evidence. The agent does the work of looking at the code, writing it up, and dropping the writeups into the harness’s state ledgers. By the time bootstrap finishes, the harness has been populated with content that is specific to that codebase — not because Keystone shipped it, but because the agent generated it during bootstrap.
This is the division of labor I would not have planned but that fell out of the rewrites:
Keystone gives you the shape. Bootstrap fills it with content. The harness is the result.
The reason this matters for anyone building their own tooling: the more general the structural layer becomes, the more the content layer has to come from somewhere else. If your tool ships opinions, it works for the people whose opinions match. If your tool ships structure, it works for everyone, and the content comes from the project itself. Structure is what travels. Content is what stays local.
I did not start out thinking about this. I started out thinking about my own 1,800-line markdown file. The rewrites taught me the distinction by punishing every time I conflated the two.
Try one rewrite this week
If you are running a harness of any kind, even a small one, and you have not looked at it in a while, here is the work I would do this week.
Open your global config and read it end to end. Not a skim. Read every paragraph and ask, for each one, is this true everywhere? If the answer is no (if the rule applies to a specific language, a specific framework, a specific project), pull it out. Make a new file under the path where it actually applies. Your global config should shrink. A smaller global config means the agent loads less irrelevant ambient context on every turn, and you can find the rules you wrote when you need to revise them.
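If the file is long, a rough first pass can flag the paragraphs worth a closer look. This is a sketch under loud assumptions: the keyword list is hypothetical, and a keyword match only suggests that a paragraph might be scoped. The end-to-end read is still the real work.

```python
import re

# Hypothetical keyword list; extend it with whatever your own projects use.
SCOPE_HINTS = ["python", "pytest", "rails", "typescript", "php", "phpunit", "elixir"]

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(hint) for hint in SCOPE_HINTS) + r")\b",
    re.IGNORECASE,
)


def scoped_paragraphs(markdown_text):
    """Return (paragraph, hints) pairs for paragraphs naming a language or tool."""
    findings = []
    # Paragraphs are separated by one or more blank lines.
    for paragraph in re.split(r"\n\s*\n", markdown_text.strip()):
        hints = sorted({match.lower() for match in _PATTERN.findall(paragraph)})
        if hints:
            findings.append((paragraph.strip(), hints))
    return findings
```

Each flagged paragraph is a candidate to move into a file under the path where it actually applies; paragraphs with no hits may still be scoped, which is why this only narrows the reading.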
Find one rule that has drifted. Look for a paragraph in your config that you wrote months ago and would write differently today. Rewrite it. If you cannot remember why you wrote it, delete it. Anything you cannot defend in writing is not earning its keep. The agent does not care about rules that contradict each other; you do.
Pick one workflow you do by hand and write it down as a skill. Something repetitive: a release process, a code review checklist, or a debugging sequence. Write a short markdown file that walks through the steps. Drop it under wherever your tool keeps skills. The next time you do that work, ask the agent to invoke the skill instead of doing it from memory. If the skill works, keep it. If it does not, edit it once. The cost of writing a skill is one afternoon. The savings compound from then on.
Those three pieces of work, done in an hour each, are version zero of every rewrite I described. Read your config. Trim it. Hoist what is universal up, push what is local down. Capture one workflow as a named thing. Version zero is enough to start. The bigger rewrites (splitting your setup across layers, building a scaffolder, exposing the whole thing as an MCP server) come later, if at all. You may never need them. And if you do, keystone-mcp is there to try. The work of asking is this the right shape? is the work that matters, regardless of how far you take it.
The thing I want to leave you with is the one I learned by accident across six rewrites: a harness is a thing you keep rebuilding because it keeps teaching you what it actually is. The first version is the version where you find out you do not yet know what you are building. The second version is the version where the vocabulary starts to settle. By the sixth version, the structural part of what you built is general enough to be useful to anyone, and the content part is specific enough to be useful to you. Both layers are the harness. Neither one is the whole thing on its own.
If you have been adding rules to a config file for a year, your harness is in there. Ask it what shape it wants to be. The answer will probably not be the shape you started with.