Prompt caching is one of the most important optimization techniques in modern AI systems.
It can:
- reduce latency
- lower inference cost
- speed up coding agents
- improve long-context workflows
But most developers misunderstand what prompt caching actually means.
A lot of people assume it works like traditional caching:
“If the prompt is the same, the model just reuses the old response.”
That’s not prompt caching. That’s output caching.
Prompt caching is something much more interesting happening inside the model runtime itself.
Prompt Caching Is Not Output Caching
Let’s start with the easiest misconception first.
Imagine a database query.
You run:
1SELECT * FROM users WHERE id = 42The database processes the query and returns the result.
If someone runs the same query again shortly afterward, the system might skip the database work entirely and simply return the cached result.
That’s traditional output caching.
A lot of people think prompt caching works the same way for LLMs. It doesn’t.
What Output Caching Looks Like in LLMs
Imagine this prompt:
1What is the capital of France?The model generates:
1ParisIf the exact same request comes in again, an application could simply return the old response directly without calling the model.
That’s output caching.
The model itself never runs again. Useful sometimes, but that is not prompt caching.





































































//Take Command of your code.
Ship 10x faster with the same team, less time, and your coding taste. Install, sign in, and start coding.
What Prompt Caching Actually Does
Prompt caching caches the processing work of the input prompt itself.
To understand why this matters, you need to understand what happens when an LLM receives a prompt.
When you send a prompt into a transformer model, the model computes something called:
- key-value pairs (KV cache)
at every transformer layer for every token.
You can think of these KV pairs as:
- the model’s internal understanding of the prompt
- contextual memory representations
- token relationships across the sequence
This computation happens during the:
prefill phase
before the model even generates the first output token.
And for large prompts:
that computation becomes extremely expensive.
Why Long Context Is Expensive
Imagine a simple prompt:
1What is the capital of France?That’s cheap. Very few tokens. Very little processing.
Now imagine this instead:
- a 50-page document
- a giant system prompt
- multiple tool definitions
- few-shot examples
- long conversation history
followed by:
1Summarize this documentNow the model has to:
- process thousands of tokens
- compute KV cache across dozens of transformer layers
- build contextual relationships for the entire sequence
before it can generate even a single token.
That’s expensive.
Both in:
- latency
- GPU compute
- inference cost
Prompt Caching Reuses The Expensive Part
This is where prompt caching helps.
Instead of recomputing the same prompt repeatedly, the system stores the already-computed KV cache.
So later, if another request contains the same prompt prefix:
- the cached KV representations get reused
- the model skips recomputing the large shared context
- only the new tokens get processed
That dramatically reduces:
- latency
- compute cost
- time-to-first-token





































































//Take Command of your code.
Ship 10x faster with the same team, less time, and your coding taste. Install, sign in, and start coding.
Real Example of Prompt Caching
Imagine this first request:
1[50-page document]
2
3Summarize this document.The model processes:
- the entire document
- computes KV cache
- generates the summary
Now imagine a second request:
1[Same 50-page document]
2
3What are the warranty terms?Without prompt caching:
- the entire document gets processed again
With prompt caching:
- the system reuses the cached KV representations for the document
- only the new question gets processed
That’s a massive compute saving.
What Usually Gets Cached?
Prompt caching works best for content that stays mostly static across requests.
The most common examples are:
System Prompts
Things like:
1You are a helpful coding assistant...These instructions remain mostly unchanged across conversations. Perfect for caching.
Documents
Examples:
- PDFs
- manuals
- research papers
- legal contracts
- product docs
Especially useful in:
- RAG systems
- AI support agents
- coding assistants
Few-Shot Examples
Few-shot prompting often includes repeated formatting examples.
Those examples can be cached once and reused repeatedly.
Tool Definitions
Coding agents frequently send:
- MCP tool schemas
- function definitions
- JSON contracts
These are usually static and ideal for prompt caching.
Conversation History
Long-running agentic sessions often reuse large parts of prior conversation state.
Caching prevents repeated recomputation.
Prefix Matching Is What Makes Prompt Caching Work
Prompt caching usually relies on something called:
prefix matching
The cache system compares prompts token-by-token from the beginning.
The moment a token changes:
- cache reuse stops
- normal processing resumes
That means prompt structure matters a lot.
Why Prompt Structure Matters
Good prompt structure puts:
- static content first
- dynamic content last
Example:
1[System Prompt]
2[Large Document]
3[Few-Shot Examples]
4
5User Question:
6What are the warranty terms?Now another request can reuse everything above and only process:
1What is the return policy?That’s ideal for caching.
Bad Prompt Structure Breaks Caching
Now imagine the opposite:
1User Question:
2What are the warranty terms?
3
4[Large Document]
5[Few-Shot Examples]If the question changes:
- the very first tokens change
- prefix matching fails immediately
- the cache becomes useless
The model has to reprocess everything again.
This is why:
prompt structure directly affects inference cost.
Why Prompt Caching Matters for Coding Agents
Coding agents benefit enormously from prompt caching because they repeatedly send:
- giant system prompts
- tool schemas
- agent instructions
- repository context
- memory state
- conversation history
Most coding sessions are:
- append-only
- highly repetitive
- structurally similar between turns
That makes them almost perfect for cache reuse.
At Command Code, a huge amount of harness engineering focuses on preserving:
- cache locality
- stable prefixes
- provider consistency
- session routing
Because if requests bounce between providers or GPU nodes:
- prefix cache disappears
- prompts re-prefill
- latency spikes dramatically
The model itself didn’t get slower.
The runtime lost the cache.
That’s one reason generic coding agents often feel worse with open models.
Closed providers frequently hide caching internally.
Open-model systems expose the orchestration problem directly.
Prompt Caching Isn’t Infinite
Caches do not last forever.
Most providers:
- expire caches after a few minutes
- evict inactive KV states
- rotate GPU memory dynamically
Typical cache lifetimes are:
- 5–10 minutes
- sometimes longer
- occasionally up to 24 hours
Some providers:
- handle caching automatically
Others require:
- explicit cache markers
- API-level cache controls
Prompt Caching Usually Starts After Large Context
Prompt caching has overhead too.
Very small prompts often don’t benefit much.
Most systems typically start caching after:
- ~1024 tokens
- or larger prompt thresholds
Below that:
- cache management overhead may exceed savings
Caching becomes much more valuable as prompts grow.
Final Thought
Prompt caching is one of the most important optimizations in modern AI systems.
Because increasingly:
- AI cost
- latency
- responsiveness
depend less on raw model intelligence and more on:
- runtime orchestration
- context management
- cache behavior
- inference engineering
And as coding agents and long-context workflows grow larger:
Prompt caching becomes infrastructure, not just optimization.
