Disclaimer: All numbers in this post are estimates derived from static code analysis of the actual production prompts. Token counts use a BPE approximation of ~3.8 chars/token for mixed Spanish/English text. Production metrics will be collected via structured logging (`prompt_cache_hit_tokens` from the DeepSeek API response). Pricing sourced from DeepSeek’s official pricing page as of June 2026 for the V4 Flash model.
The Setup
IntelliAuto is an AI mechanic assistant that runs on Firebase Cloud Functions. Each user query sends a structured prompt to the DeepSeek API (V4 Flash model). A typical request contains:
- System rules (persona, formatting, JSON schema, action definitions)
- Vehicle context (make, model, VIN, mileage, maintenance history)
- Conversation history (previous turns)
- Dynamic context (current date, resolver data, affiliate info)
- The user’s question
How DeepSeek KV Cache Works
DeepSeek provides automatic disk-based KV caching. When you send a request, the API caches the input as “prefix units.” If a subsequent request shares the same prefix (byte-identical from the start), those tokens are served from cache instead of reprocessed.
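To make the prefix rule concrete, here is a minimal sketch that measures how much of two prompts matches from the start. It compares JavaScript string characters, which is only a stand-in: the real cache works on tokens and on DeepSeek’s internal storage units. The function name and the sample prompts are hypothetical.

```typescript
// Illustrative only: counts matching leading characters of two prompts.
// The real cache matches tokens, not characters.
function sharedPrefixLength(a: string, b: string): number {
  const limit = Math.min(a.length, b.length);
  let i = 0;
  while (i < limit && a[i] === b[i]) {
    i++;
  }
  return i;
}

// Hypothetical prompts: identical rules, then per-user vehicle data.
const rules = "You are a mechanic assistant. Answer in JSON.\n";
const requestA = rules + "Vehicle: Golf\nQuestion: oil change interval?";
const requestB = rules + "Vehicle: Focus\nQuestion: brake noise?";

// The requests agree through "Vehicle: " and diverge at the model name,
// so nothing after that point could be reused from cache.
const shared = sharedPrefixLength(requestA, requestB);
console.log(requestA.slice(0, shared));
```

Everything after the first differing character is lost to the cache, however similar the rest of the two prompts may be.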
DeepSeek V4 Flash pricing:
| Type | Cost per 1M tokens |
|---|---|
| Cache miss (input) | $0.14 |
| Cache hit (input) | $0.0028 |
| Output | $0.28 |
| Cache ratio | 50x cheaper |
Cache hits are 50x cheaper than misses. This means structuring your prompt so more tokens hit the cache has direct, significant cost impact.
The Problem: Interleaved Data Breaking Prefix Matching
The original code built a single systemPrompt string concatenating everything — static rules, vehicle data, and dynamic context — into one massive system message.
Measured from the actual code:
- The static rules portion alone is ~1,308 tokens (4,972 chars)
- Two issues broke prefix caching:
  - Vehicle data was interleaved with static rules: User A’s Golf and User B’s Focus produced different prefixes from the first diverging token. The cache could never be shared across users.
  - Examples contained interpolated variables: The static rules section included example phrases with the user’s actual mileage injected:

// BEFORE: interpolated mileage broke prefix matching
"A tus 85.000 km, yo revisaría..."
// AFTER: generic placeholder, prefix stays identical
"A tus km, yo revisaría..."
Because of these two issues, the only cache hits came from repeated requests within the same conversation — and even then, only partial matches on the earliest static portion before the first interpolated variable.
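One way to guard against this kind of regression is a small check that the static rules come out identical for every user. The sketch below is an assumption, not part of the original codebase: the `Vehicle` type, both builder functions, and `isUserIndependent` are hypothetical names, and the rule text is reduced to the single example phrase.

```typescript
// Hypothetical types and builders, for illustration only.
interface Vehicle {
  model: string;
  mileageKm: number;
}

// BEFORE-style: the example phrase embeds the user's mileage.
function buildRulesInterpolated(vehicle: Vehicle): string {
  return `Example: "A tus ${vehicle.mileageKm} km, yo revisaría..."`;
}

// AFTER-style: the example phrase is generic, so the vehicle is ignored.
function buildRulesStatic(_vehicle: Vehicle): string {
  return 'Example: "A tus km, yo revisaría..."';
}

// True when the builder returns the same string for every vehicle.
function isUserIndependent(
  build: (v: Vehicle) => string,
  vehicles: Vehicle[]
): boolean {
  return vehicles.every((v) => build(v) === build(vehicles[0]));
}

const sampleVehicles: Vehicle[] = [
  { model: "Golf", mileageKm: 85000 },
  { model: "Focus", mileageKm: 120000 },
];

console.log(isUserIndependent(buildRulesInterpolated, sampleVehicles)); // false
console.log(isUserIndependent(buildRulesStatic, sampleVehicles)); // true
```

A check like this could run as a unit test, so an interpolated variable in the static section fails the build rather than silently lowering the hit rate.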
Estimated cache hit rate before optimization: ~30-55% (midpoint estimate ~42%, assuming partial prefix matching within a session).
The Solution: Layered Context Splitting
The fix split the monolithic prompt into distinct messages, ordered from most-stable to most-dynamic:
const apiMessages = [
{ role: "system", content: staticSystemPrompt }, // Layer 1: identical for all users
{ role: "system", content: sessionContextPrompt }, // Layer 2: stable per conversation
...conversationHistory, // Layer 3: changes per turn
{ role: "system", content: dynamicContextPrompt }, // Layer 4: changes every request
{ role: "user", content: userQuestion }, // Layer 5: unique
];
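The same ordering can be wrapped in a small builder so that every call site assembles the layers identically. This is a sketch under assumptions: the `ChatMessage` and `PromptLayers` types and both function names are hypothetical, and the sample values (including the date) are placeholders.

```typescript
// Hypothetical types; the message array mirrors the layering shown above.
type ChatMessage = {
  role: "system" | "user" | "assistant";
  content: string;
};

interface PromptLayers {
  staticSystemPrompt: string;
  sessionContextPrompt: string;
  conversationHistory: ChatMessage[];
  dynamicContextPrompt: string;
  userQuestion: string;
}

// Orders the layers from most stable to most dynamic.
function buildApiMessages(layers: PromptLayers): ChatMessage[] {
  return [
    { role: "system", content: layers.staticSystemPrompt },
    { role: "system", content: layers.sessionContextPrompt },
    ...layers.conversationHistory,
    { role: "system", content: layers.dynamicContextPrompt },
    { role: "user", content: layers.userQuestion },
  ];
}

// Counts how many leading messages two requests have in common.
function sharedLeadingMessages(a: ChatMessage[], b: ChatMessage[]): number {
  let i = 0;
  while (
    i < a.length &&
    i < b.length &&
    a[i].role === b[i].role &&
    a[i].content === b[i].content
  ) {
    i++;
  }
  return i;
}

// Placeholder values: two users, same static rules, different vehicles.
const golfRequest = buildApiMessages({
  staticSystemPrompt: "RULES",
  sessionContextPrompt: "Vehicle: Golf",
  conversationHistory: [],
  dynamicContextPrompt: "Date: 2026-06-01",
  userQuestion: "Oil change interval?",
});
const focusRequest = buildApiMessages({
  staticSystemPrompt: "RULES",
  sessionContextPrompt: "Vehicle: Focus",
  conversationHistory: [],
  dynamicContextPrompt: "Date: 2026-06-01",
  userQuestion: "Brake noise?",
});

console.log(sharedLeadingMessages(golfRequest, focusRequest)); // 1
```

Only the first message is shared across the two users here, which is exactly the portion Layer 1 is meant to keep identical.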
Measured Token Breakdown (from actual production code)
| Layer | Content | Chars | Est. tokens | Cache behavior |
|---|---|---|---|---|
| Static rules | Persona, formatting, JSON schema, action definitions | 4,972 | ~1,308 | ✅ Hit — identical for all users of same language + tier |
| Session context | Vehicle info + maintenance history | 798 | ~210 | ✅ Hit — stable within a conversation |
| Conversation history | Previous turns (2-turn example) | 297 | ~78 | ❌ Miss — dynamic context injected after breaks prefix |
| Dynamic context | Date, resolver data, affiliate info | 654 | ~172 | ❌ Miss — changes every request |
| User question | The query | 53 | ~14 | ❌ Miss — unique per request |
| Total | | 6,774 | ~1,782 | |
- Cacheable tokens: ~1,518 (static + session)
- Non-cacheable tokens: ~264 (history + dynamic + question)
- Theoretical max cache hit rate: ~85%
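The figures above can be reproduced from the character counts and the ~3.8 chars/token approximation. A minimal sketch, with hypothetical names; each layer is rounded on its own, which is why the total is 1,782 rather than the 1,783 that rounding the summed character count would give.

```typescript
// Approximation stated in the disclaimer: ~3.8 characters per token.
const CHARS_PER_TOKEN = 3.8;

function estimateTokens(charCount: number): number {
  return Math.round(charCount / CHARS_PER_TOKEN);
}

// Character counts and cache behavior taken from the table above.
const layers = [
  { label: "static rules", chars: 4972, cacheable: true },
  { label: "session context", chars: 798, cacheable: true },
  { label: "conversation history", chars: 297, cacheable: false },
  { label: "dynamic context", chars: 654, cacheable: false },
  { label: "user question", chars: 53, cacheable: false },
];

let totalTokens = 0;
let cacheableTokens = 0;
for (const layer of layers) {
  const tokens = estimateTokens(layer.chars);
  totalTokens += tokens;
  if (layer.cacheable) {
    cacheableTokens += tokens;
  }
}

const maxHitRate = cacheableTokens / totalTokens;
console.log(totalTokens, cacheableTokens, maxHitRate.toFixed(3)); // 1782 1518 0.852
```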
Estimated Impact
Using the token measurements above and DeepSeek V4 Flash pricing:
| Metric | Before | After | Method |
|---|---|---|---|
| Cache hit rate | ~30-55% (est. 42%) | ~75-85% (est. 76%) | Static analysis |
| Cost per 1K requests | ~$0.15 | ~$0.06 | 1,782 input tokens × pricing (output tokens excluded) |
| Cost reduction | — | ~57% | Calculated |
| Monthly cost (10K req) | ~$1.47 | ~$0.64 | Same formula at 10K requests |
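The cost rows follow from a blended input price: hit tokens at the cache-hit rate, the rest at the cache-miss rate. The sketch below reproduces the monthly figures under the table’s assumptions (1,782 input tokens per request, 42% and 76% hit rates, output tokens excluded). Function and constant names are hypothetical.

```typescript
// Input pricing per 1M tokens, from the pricing table earlier in the post.
const PRICE_PER_MILLION = { cacheMiss: 0.14, cacheHit: 0.0028 };

// Input-only cost in USD for a batch of requests at a given hit rate.
function inputCostUsd(
  requests: number,
  tokensPerRequest: number,
  hitRate: number
): number {
  const totalTokens = requests * tokensPerRequest;
  const hitTokens = totalTokens * hitRate;
  const missTokens = totalTokens - hitTokens;
  return (
    (hitTokens * PRICE_PER_MILLION.cacheHit +
      missTokens * PRICE_PER_MILLION.cacheMiss) /
    1e6
  );
}

const costBefore = inputCostUsd(10000, 1782, 0.42);
const costAfter = inputCostUsd(10000, 1782, 0.76);
const reduction = 1 - costAfter / costBefore;

console.log(costBefore.toFixed(2)); // 1.47
console.log(costAfter.toFixed(2)); // 0.64
console.log(Math.round(reduction * 100) + "%"); // 57%
```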
These are estimates. The actual hit rate depends on DeepSeek’s internal cache eviction policy, traffic patterns, and whether concurrent users share cache slots. Production logging (the `DEEPSEEK_CACHE_METRICS` event in Firebase) will capture the real `prompt_cache_hit_tokens` / `prompt_cache_miss_tokens` values.
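As a rough sketch of what that logging step might compute, the function below turns the two usage counters into a structured event with a derived hit rate. The event’s field layout and the names `DeepSeekUsage` and `toCacheMetricsEvent` are assumptions; only the event name and the two counter names come from this post.

```typescript
// Subset of the usage object relevant here; other fields are omitted.
interface DeepSeekUsage {
  prompt_cache_hit_tokens?: number;
  prompt_cache_miss_tokens?: number;
}

// Builds a hypothetical structured log event from the usage counters.
function toCacheMetricsEvent(usage: DeepSeekUsage) {
  const hit = usage.prompt_cache_hit_tokens ?? 0;
  const miss = usage.prompt_cache_miss_tokens ?? 0;
  const total = hit + miss;
  return {
    event: "DEEPSEEK_CACHE_METRICS",
    prompt_cache_hit_tokens: hit,
    prompt_cache_miss_tokens: miss,
    // Guard against division by zero when the counters are absent.
    cache_hit_rate: total > 0 ? hit / total : 0,
  };
}

// Example using the estimated cacheable and non-cacheable token counts.
const sampleEvent = toCacheMetricsEvent({
  prompt_cache_hit_tokens: 1518,
  prompt_cache_miss_tokens: 264,
});
console.log(JSON.stringify(sampleEvent));
```

Logging the raw counters alongside the ratio keeps the event useful even if the way the hit rate is aggregated changes later.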
Why History Still Misses
The conversation history doesn’t cache because the Dynamic Context (current date, resolver data) is injected after the history. Since this block changes every turn, the prefix sequence ending at the history is never identical twice.
Moving dynamic context before history would enable history caching, but would mean the model reads dynamic instructions before conversation context — potentially changing response behavior. The current ordering was chosen to preserve existing behavior.
Conclusion
The core insight: prefix caching is fragile by design. One interpolated variable in the wrong place poisons the entire prefix chain. The fix isn’t about writing better prompts — it’s about structuring your API message array so stable data comes first.
The changes were minimal: split one system message into three, generalize two example strings, reorder the message array. No model change, no prompt rewrite, no behavioral difference.
Production metrics will be published once the updated Firebase Functions are deployed and traffic flows through.