
Optimizing DeepSeek KV Cache for Serverless AI Pipelines

How splitting a monolithic system prompt into static and per-session layers improved estimated KV cache hit rates from ~42% to ~76% and reduced input costs by an estimated 57% on a Firebase Functions app running DeepSeek V4 Flash.


Disclaimer: All numbers in this post are estimates derived from static code analysis of the actual production prompts. Token counts use a BPE approximation of ~3.8 chars/token for mixed Spanish/English text. Production metrics will be collected via structured logging (prompt_cache_hit_tokens from the DeepSeek API response). Pricing sourced from DeepSeek’s official pricing page as of June 2026 for the V4 Flash model.

The Setup

IntelliAuto is an AI mechanic assistant that runs on Firebase Cloud Functions. Each user query sends a structured prompt to the DeepSeek API (V4 Flash model). A typical request contains:

  • System rules (persona, formatting, JSON schema, action definitions)
  • Vehicle context (make, model, VIN, mileage, maintenance history)
  • Conversation history (previous turns)
  • Dynamic context (current date, resolver data, affiliate info)
  • The user’s question

How DeepSeek KV Cache Works

DeepSeek provides automatic disk-based KV caching. When you send a request, the API caches the input as “prefix units.” If a subsequent request shares the same prefix (byte-identical from the start), those tokens are served from cache instead of reprocessed.

DeepSeek V4 Flash pricing:

| Type | Cost per 1M tokens |
| --- | --- |
| Cache miss (input) | $0.14 |
| Cache hit (input) | $0.0028 |
| Output | $0.28 |
| Cache ratio | 50x cheaper |

Cache hits are 50x cheaper than misses. This means structuring your prompt so more tokens hit the cache has direct, significant cost impact.

The Problem: Interleaved Data Breaking Prefix Matching

The original code built a single systemPrompt string concatenating everything — static rules, vehicle data, and dynamic context — into one massive system message.

Measured from the actual code, the static rules portion alone is ~1,308 tokens (4,972 chars). Two issues broke prefix caching:

  1. Vehicle data was interleaved with static rules. User A’s Golf and User B’s Focus produced prefixes that diverged at the first vehicle-specific token, so nothing after that point could be shared across users.

  2. Examples contained interpolated variables. The static rules section included example phrases with the user’s actual mileage injected:

    // BEFORE: interpolated mileage broke prefix matching
    // (English: "At your 85,000 km, I would check...")
    "A tus 85.000 km, yo revisaría..."
    
    // AFTER: generic placeholder, prefix stays identical
    "A tus km, yo revisaría..."

Because of these two issues, the only cache hits came from repeated requests within the same conversation — and even then, only partial matches on the earliest static portion before the first interpolated variable.

Estimated cache hit rate before optimization: ~30-55% (midpoint estimate ~42%, assuming partial prefix matching within a session).
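
To make the divergence concrete, here is a minimal sketch that compares prompts for two users before and after the change. It is an illustration under stated assumptions, not the production code: `sharedPrefixLength`, `monolithic`, `layered`, and the `INTRO`/`RULES` strings are hypothetical stand-ins, and characters are used as a rough proxy for the token-level matching the real cache performs.

```javascript
// Length of the shared leading run of two strings. The real cache matches
// on tokens, so character counts are only a rough proxy.
function sharedPrefixLength(a, b) {
  const limit = Math.min(a.length, b.length);
  let i = 0;
  while (i < limit && a[i] === b[i]) i++;
  return i;
}

// Hypothetical stand-ins for the real static rules.
const INTRO = "You are a mechanic assistant.\n";
const RULES = "Reply in JSON. Follow the action definitions. ".repeat(20);

// BEFORE (assumed layout): vehicle data sits inside the rules and the
// example phrase carries the user's mileage.
const monolithic = (v) =>
  `${INTRO}Vehicle: ${v.make} ${v.model}\n${RULES}` +
  `Example: "A tus ${v.km} km, yo revisaría..."`;

// AFTER: variable-free rules first, vehicle data afterwards.
const layered = (v) =>
  `${INTRO}${RULES}Example: "A tus km, yo revisaría..."\n` +
  `Vehicle: ${v.make} ${v.model}, ${v.km} km`;

const golf = { make: "VW", model: "Golf", km: "85.000" };
const focus = { make: "Ford", model: "Focus", km: "120.000" };

const sharedBefore = sharedPrefixLength(monolithic(golf), monolithic(focus));
const sharedAfter = sharedPrefixLength(layered(golf), layered(focus));

console.log({ sharedBefore, sharedAfter });
```

In this toy setup the monolithic prompts stop matching at the vehicle make, while the layered prompts share the entire rules block and only diverge once the vehicle data begins.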

The Solution: Layered Context Splitting

The fix split the monolithic prompt into distinct messages, ordered from most-stable to most-dynamic:

const apiMessages = [
  { role: "system", content: staticSystemPrompt }, // Layer 1: identical for all users
  { role: "system", content: sessionContextPrompt }, // Layer 2: stable per conversation
  ...conversationHistory, // Layer 3: changes per turn
  { role: "system", content: dynamicContextPrompt }, // Layer 4: changes every request
  { role: "user", content: userQuestion }, // Layer 5: unique
];
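
The effect of this ordering can be checked at the message level. The sketch below is illustrative only: `buildApiMessages` and `sharedLeadingMessages` are hypothetical helpers, and the prompt strings are placeholders rather than the real production prompts. It counts how many leading messages two requests have in common, which is the part a prefix cache could reuse.

```javascript
// Hypothetical builder that assembles the layers in the order shown above.
function buildApiMessages(layers) {
  return [
    { role: "system", content: layers.staticSystemPrompt },
    { role: "system", content: layers.sessionContextPrompt },
    ...layers.conversationHistory,
    { role: "system", content: layers.dynamicContextPrompt },
    { role: "user", content: layers.userQuestion },
  ];
}

// Number of leading messages that are identical in both requests.
function sharedLeadingMessages(a, b) {
  let n = 0;
  while (
    n < a.length && n < b.length &&
    a[n].role === b[n].role && a[n].content === b[n].content
  ) n++;
  return n;
}

const STATIC = "RULES (placeholder, identical for all users)";

const golfTurn1 = buildApiMessages({
  staticSystemPrompt: STATIC,
  sessionContextPrompt: "Vehicle: VW Golf",
  conversationHistory: [],
  dynamicContextPrompt: "Date: 2026-06-01",
  userQuestion: "Why is my engine light on?",
});

const golfTurn2 = buildApiMessages({
  staticSystemPrompt: STATIC,
  sessionContextPrompt: "Vehicle: VW Golf",
  conversationHistory: [
    { role: "user", content: "Why is my engine light on?" },
    { role: "assistant", content: "(previous answer)" },
  ],
  dynamicContextPrompt: "Date: 2026-06-01; resolver: brakes",
  userQuestion: "And the brakes?",
});

const focusTurn1 = buildApiMessages({
  staticSystemPrompt: STATIC,
  sessionContextPrompt: "Vehicle: Ford Focus",
  conversationHistory: [],
  dynamicContextPrompt: "Date: 2026-06-01",
  userQuestion: "Why is my engine light on?",
});

console.log(sharedLeadingMessages(golfTurn1, focusTurn1)); // Layer 1 only
console.log(sharedLeadingMessages(golfTurn1, golfTurn2));  // Layers 1 and 2
```

Different users share only the static layer, while two turns of the same conversation also share the session layer.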

Measured Token Breakdown (from actual production code)

| Layer | Content | Chars | Est. tokens | Cache behavior |
| --- | --- | --- | --- | --- |
| Static rules | Persona, formatting, JSON schema, action definitions | 4,972 | ~1,308 | ✅ Hit: identical for all users of same language + tier |
| Session context | Vehicle info + maintenance history | 798 | ~210 | ✅ Hit: stable within a conversation |
| Conversation history | Previous turns (2-turn example) | 297 | ~78 | ❌ Miss: dynamic context injected after breaks prefix |
| Dynamic context | Date, resolver data, affiliate info | 654 | ~172 | ❌ Miss: changes every request |
| User question | The query | 53 | ~14 | ❌ Miss: unique per request |
| Total | | 6,774 | ~1,782 | |
  • Cacheable tokens: ~1,518 (static + session)
  • Non-cacheable tokens: ~264 (history + dynamic + question)
  • Theoretical max cache hit rate: ~85%
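
The figures above can be reproduced with a few lines of arithmetic. This is a sketch of the estimate only, using the ~3.8 chars/token approximation from the disclaimer; the `layers` array and `estimateTokens` helper are hypothetical names, and a real tokenizer would give somewhat different counts.

```javascript
// Approximation stated in the disclaimer; not a real tokenizer.
const CHARS_PER_TOKEN = 3.8;

// Character counts copied from the table above.
const layers = [
  { name: "Static rules", chars: 4972, cacheable: true },
  { name: "Session context", chars: 798, cacheable: true },
  { name: "Conversation history", chars: 297, cacheable: false },
  { name: "Dynamic context", chars: 654, cacheable: false },
  { name: "User question", chars: 53, cacheable: false },
];

const estimateTokens = (chars) => Math.round(chars / CHARS_PER_TOKEN);

let totalTokens = 0;
let cacheableTokens = 0;
for (const layer of layers) {
  const tokens = estimateTokens(layer.chars);
  totalTokens += tokens;
  if (layer.cacheable) cacheableTokens += tokens;
}

// Upper bound on the hit rate if every cacheable token actually hits.
const maxHitRate = cacheableTokens / totalTokens;

console.log({ totalTokens, cacheableTokens, maxHitRate: maxHitRate.toFixed(3) });
```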

Estimated Impact

Using the token measurements above and DeepSeek V4 Flash pricing:

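| Metric | Before | After | Method |
| --- | --- | --- | --- |
| Cache hit rate | ~30-55% (est. 42%) | ~75-85% (est. 76%) | Static analysis |
| Input cost per 1K requests | ~$0.15 | ~$0.06 | 1,782 tokens × pricing |
| Input cost reduction | | ~57% | Calculated |
| Monthly input cost (10K req) | ~$1.47 | ~$0.64 | |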
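
The cost rows follow from blending the hit and miss prices by the estimated hit rate. The sketch below reproduces that arithmetic for input tokens only (output tokens are excluded, as in the table); `inputCostPerRequest` is a hypothetical helper, and the prices and token count are the ones quoted in this post.

```javascript
// USD per 1M input tokens, from the pricing table in this post.
const PRICE_PER_MILLION = { miss: 0.14, hit: 0.0028 };
const INPUT_TOKENS = 1782; // estimated tokens per request

// Input cost of one request for a given cache hit rate in [0, 1].
function inputCostPerRequest(hitRate, tokens = INPUT_TOKENS) {
  const hitTokens = tokens * hitRate;
  const missTokens = tokens - hitTokens;
  return (
    (hitTokens * PRICE_PER_MILLION.hit + missTokens * PRICE_PER_MILLION.miss) /
    1_000_000
  );
}

const beforePer1K = inputCostPerRequest(0.42) * 1000;
const afterPer1K = inputCostPerRequest(0.76) * 1000;
const reduction = 1 - afterPer1K / beforePer1K;

console.log({
  beforePer1K: beforePer1K.toFixed(2),           // "0.15"
  afterPer1K: afterPer1K.toFixed(2),             // "0.06"
  reductionPct: Math.round(reduction * 100),     // 57
  monthlyBefore: (beforePer1K * 10).toFixed(2),  // "1.47"
  monthlyAfter: (afterPer1K * 10).toFixed(2),    // "0.64"
});
```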

These are estimates. The actual hit rate depends on DeepSeek’s internal cache eviction policy, traffic patterns, and whether concurrent users share cache slots. Production logging (DEEPSEEK_CACHE_METRICS event in Firebase) will capture the real prompt_cache_hit_tokens / prompt_cache_miss_tokens values.
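
A minimal sketch of how such a log entry could be derived from the API response is shown below. It assumes the response's `usage` object carries the `prompt_cache_hit_tokens` and `prompt_cache_miss_tokens` fields named above; `cacheMetrics` and the output field names are hypothetical, and plain `console.log` stands in for whatever logger the functions actually use.

```javascript
// Builds a structured log entry from the `usage` object of a response.
// Missing fields are treated as zero so the logger never throws.
function cacheMetrics(usage) {
  const hitTokens = usage?.prompt_cache_hit_tokens ?? 0;
  const missTokens = usage?.prompt_cache_miss_tokens ?? 0;
  const total = hitTokens + missTokens;
  return {
    event: "DEEPSEEK_CACHE_METRICS",
    hitTokens,
    missTokens,
    // Guard against division by zero when no usage data is present.
    hitRate: total === 0 ? 0 : hitTokens / total,
  };
}

// Example with the theoretical best case from the token breakdown.
const sample = cacheMetrics({
  prompt_cache_hit_tokens: 1518,
  prompt_cache_miss_tokens: 264,
});

// One JSON object per line keeps the entry easy to query later.
console.log(JSON.stringify(sample));
```

Logging the raw token counts alongside the ratio makes it possible to recompute weighted hit rates across requests of different sizes.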

Why History Still Misses

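The analysis counts the conversation history as a miss because the dynamic context (current date, resolver data) is injected after the history and changes every turn. The newest turn in the history has never been sent in that position before: on the previous request, that user message came after the dynamic block. Older turns may still match the previous request's prefix, so treating the whole history as uncached is a conservative assumption that production logging should confirm.

Moving dynamic context before history was considered, but the model would then read dynamic instructions before the conversation context, potentially changing response behavior, and a block that changes on every request would also invalidate the cache for everything placed after it. The current ordering was chosen to preserve existing behavior.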

Conclusion

The core insight: prefix caching is fragile by design. One interpolated variable in the wrong place poisons the entire prefix chain. The fix isn’t about writing better prompts — it’s about structuring your API message array so stable data comes first.

The changes were minimal: split one system message into three, generalize two example strings, reorder the message array. No model change, no prompt rewrite, no behavioral difference.

Production metrics will be published once the updated Firebase Functions are deployed and traffic flows through.