As you know we're building a coding agent harness for both open and closed models. Processing billions of tokens an hours is teaching us lots of amazing things.
Command Code is purpose-built for each model we launch, and it's getting better every day. The closest analogy I can think of is teaching a human to drive a car when they make a mistake, first you save them from the accident, then you teach them how to handle it next time.
"Open source models are bad at coding" is, almost always, a taste problem in the harness, not a capability problem in the model.
Claude Code is a great harness. For Claude. Codex is fine too. For GPT.
If you're using either for an open source model, you're going to have a subpar experience and conclude the model isn't smart enough yet. It usually is. Even something as small as a couple of failed tool calls can completely disable a session.
CONTEXT: At Command Code we've now pushed 10B+ tokens through our agent across closed (opus 4.7, gpt) and open (kimi k2.6, deepseek v4 pro, glm, qwen) paths, and we're adding ~1k new devs a day, mostly on the open side. You see things at that volume that I'm happy to share. The most consistent thing I see: "model X is bad at tool calls" almost always decomposes into "harness Y is built for model Z and silently breaks for model X, and nobody at the harness vendor cares because they don't run model X."
That's not a moral failing it's an incentive structure. Claude Code is Anthropic's agent, tuned for Anthropic's model. Codex is OpenAI's, tuned for OpenAI's. They're doing the thing they're supposed to do. But it produces a market where every other harness is a commodity router 1000 models in a dropdown, no per-model tuning, no per-model evals, no per-model repair logic. You click Kimi, the harness shrugs, and when tool calls bounce six turns in a row, nothing gets logged. The model gets blamed. The harness moves on to the next dropdown entry.
The analogy I keep coming back to is Apple vs Windows not the consumer-marketing version, the design-engineering version. One ships fewer SKUs and tunes the stack for each one. The other ships every SKU and ships you the optimization problem. Neither is wrong they're different products. Coding agents have mostly been the second kind. I think there's a large opening for the first kind, and that's what we're building. Command Code with taste. The current landscape of big coding agents feels a lot like Windows.
"Purpose-built per model" sounds vague until you list what it actually means. Four things from the last few weeks we've observed and shipped:
- Per-model tool input repair.
Deepseek-flash sometimes emits filePath as a markdown auto-link. Opus never does. Opus sometimes sends null for an optional field. Deepseek doesn't. Kimi wraps a single arg in {} when the schema wants an array. Glm sends "foo" where the schema wants ["foo"].
A generic harness sees all of these as "invalid tool call" and ships the model back a raw zod-issues blob. The model can't read it well, retries the same wrong call, gets the same blob. Five turns of that and the user assumes the model is broken. It wasn't it was screaming into a room with no one in it. A purpose-built harness has a small ordered repair table per shape failure (we have ~four per model on avg), watches which models hit which repair, and turns that data into per-model defaults. The model gets to look as smart as it actually is.
- Canonical model id at the request layer, slug translation at the sdk boundary.
We route a single canonical model (kimi-k2-6) through 3-12 providers per model in priority order. Each wants a different slug, a different request shape, a different auth header. The temptation is to fork that all the way through the agent loop. Don't. One canonical id flows through billing, telemetry, evals, fallback. Slug translation happens at exactly one boundary.
A generic harness can't do this because it doesn't have the concept of "the same model on three providers" it has moonshotai/Kimi-K2-Instruct and moonshot/kimi-k2-6 as two separate dropdown entries and your eval data is split across both. You can't tune what you can't aggregate.
- Prefix-cache pinning, per provider.
Closed models have prompt caching as a product. Open models have prefix caching as a side effect of inference-server implementation. The second your conversation lands on a different GPU pod, the warm prefix evaporates, TTFT jumps from <1s to 6-8s, and the model spends its turn re-prefilling the system prompt instead of thinking.
A one-line session pin (best-effort: same value, same pod) fixes it. A generic harness doesn't ship this because it would need a different one for every provider. We ship a different one for every provider. Closed-model harnesses never have to think about this because Anthropic and OpenAI handle caching server-side and silently. Open-model harnesses pay the cost loudly, and most just leave it there.
- Tiered context compaction, tuned to the model that's loaded.
At 50% of effective context, drop the oldest tool calls their results are dead bytes the model already extracted from. At 80%, the same, harder. At 90%, summarize, and only what's old. Never compact the in-flight task. One primitive, three parameters.
The thresholds aren't 50/80/90 of the model's advertised max. They're percentages of effective context, which sits well below the marketing number, because the last ~10% of any model's window is where its own attention degrades. You have to pick that ceiling per model. Nobody publishes it. You find it in the data. A generic harness can't pick it per model, so it picks the marketing number for everyone, and the resulting agent gets dumb at turn 25 across the board. Users assume that's just how agents are. It isn't.
A few things to realize from all this:
A coding agent's intelligence per turn is bounded by:
- what fraction of context is actually load-bearing (compaction)
- what fraction of tool calls survive contact with the schema (repair logic)
- what fraction of turns get warm cache (pinning)
- whether your eval data is one model or three pretending to be one (canonical id)
Closed-model harnesses get most of these for free because the model vendor handles them server-side and doesn't tell you. Open-model harnesses don't get any of them for free. And the generic "1000 models in a dropdown" harnesses can't fix them at scale because there is no "the model" to tune for there is the union of all of them, which is the empty set.
There is no general coding agent. There is a coding agent for each model you actually want to run well, and every one of them is a different product underneath the same harness. We're building that for you.
Coding agents have mostly been windows so far. We're taking inspiration from Apple instead fewer SKUs, tuned for each one. Taste, in engineering, is the willingness to do that work.
The model isn't bad. Your harness is generic. Fix that.

