Why Every Coding Agent is Bad at Open Source Models?

The best harness engineering for open source models ain't easy, but we're doing just that Command Code.

This is another engineering deep dive into complex problems we're fixing so our users can have the best developer experience when building with Command and open source.

Founder pov here. We shipped what looked like a half-day refactor this week. Turned into a multi-day rabbit hole that taught me something about agent runtime design I keep relearning.

BACKGROUND: Our CLI lets users swap models mid-conversation with /model. Closed API models on one side, growing set of open-source models on the other, fronted by 12-20 different gateways per model. The moment you support that selector, context window stops being a constant. It becomes a runtime variable.

Almost every assumption baked into the agent loop breaks when the window can change under you. Auto-compact thresholds. The /context breakdown. The tokens-left gauge. The overflow guard.

We had one context_limit = 200,000 constant imported by ~12 call sites. Worked great when everything was 200k models. Falls apart the second the selector mixes a 1m model with a 128k open-source one. The agent either wastes 800k of budget or hits a cryptic "context length exceeded" error the next turn after you switch down.

Naive fix: replace the constant with a per-model lookup helper, sourced from a contextwindow field on each model option. Boring. The interesting design problem is what happens at the exact moment of switching.

Three calls we made.

Reconcile, don't just re-render.

User is at 600k tokens used, on a 1m model, switches to a 200k one. You can't just refresh the gauge and hope the next request goes through. It won't.

Our context engine, on a model change: recomputes the limit, re-emits a usage update so the UI is honest about the new ceiling. If usage is already past the tier-3 threshold (90%) of the new window, runs an in-place compaction before returning control. The compaction posts a feed entry: "switched models, compacted 612k to 84k to fit." Opaque upstream error turns into a legible state transition.

Tier-3 instead of the raw limit because the next turn needs headroom for input plus output.

Asymmetric. Only compact on a shrink.

First version compacted whenever usage was past 90% of the new window. Wrong on a same-or-wider switch. Two 1m models, 920k used, switching between them: technically over 90%, but the window didn't shrink. Summarizing now just throws context away. The natural tier-3 auto-compact fires on the next turn anyway.

Guard: if newlimit >= previouslimit, bail. Small change, big behavior delta. Users were seeing "switched to a smaller context window" feed entries when they hadn't actually switched to anything smaller.

Finally, a race condition.

User mashes /model twice in a row. Two summarize-and-compact runs start. Both reading the message list. Both writing. The compaction counter double-counts, the second summary runs against a partially-replaced list.

Single boolean in-flight lock around the reconcile path. If a switch lands while one is running, emit the usage update and bail. The in-flight one finishes with the right state. Boring concurrency, only shows up in replay tests.

Tests passed. Shipped to internal. Then someone on the team tried an open-source model through gateway 1, used about 120k tokens, watched it auto-compact. The model has a 1m window. Tier-1 should fire at 500k, not 100k.

This is where it got interesting for eng.

The bug: our context-limit lookup was running on the raw model id the request was made with. Gateway 1 wants the slug as "vendorname/model-name". Our model registry is keyed by the canonical id. Lookup misses. Falls through to the default of 200k. Tier-1 (50%) of 200k is 100k. So the agent was compacting at 100k instead of 500k. No error, no warning, just compacting too early on every long session.

Same class of bug for the other 2 gateways. Each had its own naming convention. Gateway 2 prefixes everything with a routing namespace. Gateway 3 lowercases vendor slugs in a way our registry didn't match. Plus dated snapshots, plus deprecated-model migrations. Each path was treating the same model as a distinct model.

The agent runtime had been treating model identity as a string equality check. It should always have been canonicalize-then-equal.

Fix: a single canonicalization function at the boundary, run before any registry lookup. It already existed for some call sites. We just hadn't been disciplined about routing every code path through it. Strips routing prefixes, strips dated suffixes, scans the oss model table case-insensitively, runs deprecation migrations. Unknown ids fall back gracefully.

One funnel. Every call site benefits.

The lesson I keep relearning, probably the reason I'm writing this: in an agent runtime, model names are the leakiest abstraction you have. A canonical identity ships with n gateway-flavored aliases. Every config lookup keyed on the wire-format string is a latent bug waiting for a user to switch providers. The defense is to canonicalize at the boundary and keep the canonical form everywhere internal.

Most of those lines are tests and call sites that swapped the constant for the helper. The conceptual work is small. The refactor pushes a runtime variable through a system that assumed it was a constant, and the bugs hide in places you don't think to look until a user with an unusual model slug finds them.

Open-source model ecosystems make this load-bearing. Closed-api providers ship a small, slow-moving set of ids. Folding oss models into the same selector means dozens of providers, each with their own slug conventions, each with windows ranging 32k to 1m. The agent has to absorb that variance without surfacing it as confusion or quiet wrongness to the user.

Three things I think you can learn from this:

One: every "obvious" constant in an agent runtime is a future bug. Our 200k constant was obvious for 8 months. Until it wasn't.

Two: the bug shows up in the second-most-popular path, not the first. Most usage was on closed-api models where the lookup happened to work. The oss path was a test of whether our abstraction held up to variance. It didn't.

Three: when the runtime gets a new degree of freedom (mid-session model swap, in this case), audit every place that read the old assumption. We didn't, and ate a few days of "why did it just compact me" reports before finding the canonicalization gap.

Per-model context windows. In-place compaction on shrinking switches. Single funnel for canonical id resolution. Sounds boring. Load-bearing in practice.

Hope this helps you improve your code. Or you can always try what we're building.

Why Every Coding Agent is Bad at Open Source Models?

More insights

Introducing the Command Code Provider API

/design: the skill that turns Command Code into your design partner

Rules Rot. Skills Decay. Taste Is What Scales.