Situation
Building an AI assistant for personal finance requires absolute precision. If an LLM miscalculates a user’s savings rate or formats its response improperly, the application crashes or displays dangerous misinformation.
In this scenario, the AI needed to act as an “Expert Personal Financial Advisor,” categorizing bulk transactions and providing strategic insights without hallucinating numbers.
The Problem with LLM Math
Large Language Models generate text based on probabilities; they do not inherently “calculate” math. If you give an LLM a list of 50 transactions and ask it for the total spend, it will likely guess a number that looks plausible but is entirely wrong.
1. The Pre-Computed Hint Pattern
To solve the math hallucination issue, the architecture was flipped. The LLM is never asked to do math.
Instead, the client application (or a deterministic backend script) calculates the exact financial metrics—such as “Spending Velocity”, “Total Income”, and “Savings Rate”—before the AI is called.
These exact numbers are injected into the system prompt as computed_hints.
```javascript
// Example of injecting computed hints into the system prompt
// (field names on data.computed_hints are illustrative)
let computedHintsContext = "";
if (data.computed_hints) {
  const hints = data.computed_hints;
  computedHintsContext = `
FACTS YOU MUST USE:
- User has spent $${hints.total_spend} out of $${hints.budget} budget.
- Savings rate is exactly ${hints.savings_rate}%.
`;
}
```
The LLM is instructed to weave these exact figures into its narrative, eliminating arithmetic errors at the source.
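Upstream of that prompt injection, the deterministic side is plain arithmetic. A minimal sketch, assuming a transactions array with signed amounts (the function and field names here are illustrative, not from the original codebase):

```javascript
// Sketch of the deterministic pre-computation step (names illustrative).
// All arithmetic happens here, in plain code, never inside the LLM.
function computeHints(transactions, monthlyIncome, budget) {
  // Sum the absolute value of all outgoing (negative) transactions.
  const totalSpend = transactions
    .filter((t) => t.amount < 0)
    .reduce((sum, t) => sum + Math.abs(t.amount), 0);
  // Savings rate as a whole percentage of income, guarded against division by zero.
  const savingsRate =
    monthlyIncome > 0
      ? Math.round(((monthlyIncome - totalSpend) / monthlyIncome) * 100)
      : 0;
  return {
    total_spend: totalSpend.toFixed(2),
    budget: budget.toFixed(2),
    savings_rate: savingsRate, // exact figure the LLM must quote verbatim
  };
}
```

Because these values are computed before the model is called, the narrative can only ever restate them, never re-derive them.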
2. Enforcing Strict JSON Outputs
To ensure the client application can parse the AI’s response, the prompt engineering must be ruthless.
- JSON Structure Definition: The exact expected JSON schema is baked directly into the system prompt.
- API Enforcement: The API call sets `response_format: { type: "json_object" }` on OpenAI-compatible chat APIs that support it.
- Server-Side Scrubbing: Even with strict prompting, models sometimes wrap their output in markdown fences (e.g., `` ```json ``). A robust backend scrubber removes these fences and extracts the substring from the first `{` to the last `}` before returning it to the client.
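The scrubbing step can be sketched as follows (a minimal version; a production scrubber would add logging and a retry policy for outputs that still fail to parse):

```javascript
// Strip markdown fences, extract the first "{" ... last "}" span,
// then parse to confirm the result is valid JSON.
function scrubLLMJson(raw) {
  // Remove markdown code fences such as ```json ... ```
  const unfenced = raw.replace(/```(?:json)?/gi, "");
  const start = unfenced.indexOf("{");
  const end = unfenced.lastIndexOf("}");
  if (start === -1 || end === -1 || end < start) {
    throw new Error("No JSON object found in model output");
  }
  const candidate = unfenced.slice(start, end + 1);
  return JSON.parse(candidate); // throws if the model still produced invalid JSON
}
```

Parsing on the server, rather than trusting the raw string, means a malformed payload fails loudly in the backend instead of crashing the client.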
3. Zero-Shot Bulk Categorization
Categorizing transactions from merchants around the world (e.g., “Uber Eats”, “Tesco”, “Steam”) requires broad brand context.
Instead of training a custom classification model, a zero-shot prompt approach was used. The prompt dynamically injects the user’s available categories alongside explicit rules:
- International Awareness: Explicit instructions to recognize global brands.
- Context Inference: Rules for guessing (e.g., if it contains “Bistro”, default to “Restaurant”).
- Hard Overrides: Specific keywords (“Vanguard”, “BlackRock”) are strictly hardcoded to route to “Investments” regardless of other context.
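The rules above can be combined into a single routing step: hard overrides resolve deterministically in code, and only the remaining merchants fall through to a zero-shot prompt. A sketch (the keyword lists and prompt wording are illustrative, not the exact production rules):

```javascript
// Keywords that always route to a fixed category, regardless of other context.
const HARD_OVERRIDES = { vanguard: "Investments", blackrock: "Investments" };

function categorize(merchant, categories) {
  const lower = merchant.toLowerCase();
  // Hard overrides resolve deterministically, skipping the LLM entirely.
  for (const [keyword, category] of Object.entries(HARD_OVERRIDES)) {
    if (lower.includes(keyword)) return { category, viaLLM: false };
  }
  // Otherwise build a zero-shot prompt around the user's own categories.
  const prompt = `Categorize "${merchant}" into exactly one of: ${categories.join(", ")}.
Recognize international brands. Infer from context cues (e.g., "Bistro" implies "Restaurant").
Reply with the category name only.`;
  return { prompt, viaLLM: true };
}
```

Keeping the overrides in code rather than in the prompt guarantees they cannot be overridden by the model's probabilistic judgment.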
By combining pre-computed mathematics, aggressive JSON scrubbing, and highly structured zero-shot prompting, the backend transforms a probabilistic LLM into a highly deterministic financial engine.
Architecture Diagram
This diagram supports Engineering a Deterministic AI Financial Analyzer and highlights where controls, validation, and ownership boundaries sit in the workflow.
Post-Specific Engineering Lens
For this post, the primary objective is: Balance model quality with deterministic runtime constraints.
Implementation decisions for this case
- Chose a staged approach centered on the LLM to avoid high-blast-radius rollouts.
- Used fintech checkpoints to make regressions observable before full rollout.
- Treated prompt-engineering documentation as part of delivery, not a post-task artifact.
Practical command path
These are representative execution checkpoints relevant to this post:
```shell
./llama-server --ctx-size <n> --cache-type-k q4_0 --cache-type-v q4_0
curl -s http://localhost:8080/health
python benchmark.py --profile edge
```
Validation Matrix
| Validation goal | What to baseline | What confirms success |
|---|---|---|
| Functional stability | input quality, extraction accuracy, and processing latency | schema validation catches malformed payloads |
| Operational safety | rollback ownership + change window | confidence/fallback policy routes low-quality outputs safely |
| Production readiness | monitoring visibility and handoff notes | observability captures latency + quality per request class |
Failure Modes and Mitigations
| Failure mode | Why it appears in this type of work | Mitigation used in this post pattern |
|---|---|---|
| Over-allocated context | Memory pressure causes latency spikes or OOM | Tune ctx + cache quantization from measured baseline |
| Silent quality drift | Outputs degrade while latency appears fine | Track quality samples alongside perf metrics |
| Single-profile dependency | No graceful behavior under load | Define fallback profile and automatic failover rule |
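The confidence/fallback mitigation in the tables above can be expressed as a small routing policy. A sketch with illustrative thresholds and field names:

```javascript
// Illustrative confidence/fallback routing: malformed or low-quality
// outputs never reach the user-facing path.
function routeOutput(result) {
  if (result == null || result.parseError) {
    return "retry";     // malformed payload: re-prompt or re-scrub
  }
  if (result.confidence < 0.5) {
    return "fallback";  // low confidence: serve the deterministic summary only
  }
  return "serve";       // healthy output: pass through to the client
}
```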
Recruiter-Readable Impact Summary
- Scope: ship AI features with guardrails and measurable quality.
- Execution quality: guarded by staged checks and explicit rollback triggers.
- Outcome signal: repeatable implementation that can be handed over without hidden steps.