Situation
In document processing pipelines (like extracting data from insurance policies using OCR), you eventually hit a wall with deterministic normalization rules.
Regex works great for standard dates, but what happens when the OCR outputs "Premium payable: vierteljährlich" (German for quarterly), or "Sum insured: 1.200.000,50 EUR", or "Coverage starts at 00:00 on 12 de agosto de 2024"?
Building lookup tables and parsers for every possible language, typo, and formatting quirk becomes unmaintainable. This is where fine-tuning a Large Language Model (LLM) becomes the most efficient solution for data normalization.
Task 1 – When to Choose Fine-Tuning
You should consider fine-tuning an LLM when:
- You need consistent, human-quality normalization across many unpredictable edge cases.
- Your deterministic rules (regex, parsers) have become too brittle and are constantly failing on new document variations.
- You need to convert raw, messy extraction JSON into a strict, canonical JSON schema.
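To make the third point concrete, the "strict, canonical JSON schema" can be pinned down as a typed structure. This is a minimal sketch; the field names and enum values are illustrative, not taken from a real policy schema:

```python
from dataclasses import dataclass

# Hypothetical canonical target for normalized policy fields.
# Raw OCR text like "Sum insured: 1.200.000,50 EUR" or
# "Premium cadence: vierteljährlich" must be mapped into this shape.
@dataclass
class NormalizedPolicy:
    buildings_sum_insured: float  # e.g. 1200000.50
    currency: str                 # ISO 4217 code, e.g. "EUR"
    period_start: str             # ISO 8601 date, e.g. "2024-08-12"
    premium_cadence: str          # enum: "MONTHLY" | "QUARTERLY" | "ANNUAL"

record = NormalizedPolicy(
    buildings_sum_insured=1200000.50,
    currency="EUR",
    period_start="2024-08-12",
    premium_cadence="QUARTERLY",
)
print(record.premium_cadence)  # QUARTERLY
```

Having one canonical type like this is what makes "the rules are too brittle" measurable: every normalization path, deterministic or learned, must land in the same shape.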
Task 2 – Preparing the Training Data (.jsonl)
To fine-tune an LLM, you don’t write code; you provide examples. You create a .jsonl (JSON Lines) file where each line represents a single training example formatted as a chat conversation.
The format typically includes a system prompt defining the task, a user prompt containing the raw input, and the assistant response containing the perfect, normalized JSON output.
Here are real examples of how to train the model to handle complex normalization:
1. Normalizing noisy numeric formats (European decimals):
{"messages":[{"role":"system","content":"Normalize noisy OCR monetary formats to numeric values."},{"role":"user","content":"Sum insured: 1.200.000,50 EUR"},{"role":"assistant","content":"{\"buildings_sum_insured\":1200000.50,\"currency\":\"EUR\"}"}]}
2. Translating and standardizing dates to ISO format:
{"messages":[{"role":"system","content":"Normalize dates with month names in Spanish to ISO format."},{"role":"user","content":"Fecha inicio: 12 de agosto de 2024"},{"role":"assistant","content":"{\"period_start\":\"2024-08-12\"}"}]}
3. Mapping natural language to controlled enumerations (Enums):
{"messages":[{"role":"system","content":"Extract and normalize 'premium cadence' values."},{"role":"user","content":"Premium cadence: vierteljährlich"},{"role":"assistant","content":"{\"premium_cadence\":\"QUARTERLY\"}"}]}
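A subtle failure mode with these files is escaping: the assistant content is JSON embedded inside JSON, so inner quotes must be escaped. Before uploading, it is worth checking every line: it parses, the roles are as expected, and the assistant message itself parses as JSON. A minimal checker (the function name is my own):

```python
import json

def check_jsonl_line(line: str) -> dict:
    """Validate one chat-format training example; return the parsed assistant JSON."""
    example = json.loads(line)
    roles = [m["role"] for m in example["messages"]]
    assert roles == ["system", "user", "assistant"], f"unexpected roles: {roles}"
    # The assistant content must itself be valid JSON (the normalized output).
    return json.loads(example["messages"][2]["content"])

line = ('{"messages":[{"role":"system","content":"Normalize dates to ISO format."},'
        '{"role":"user","content":"Fecha inicio: 12 de agosto de 2024"},'
        '{"role":"assistant","content":"{\\"period_start\\": \\"2024-08-12\\"}"}]}')
print(check_jsonl_line(line))  # {'period_start': '2024-08-12'}
```

Running this over every line of train.jsonl catches malformed examples before they silently degrade the fine-tune.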
Task 3 – The Fine-Tuning Process
- Supervised Fine-Tuning: You upload your train.jsonl (containing hundreds of these examples) to your AI platform (like Azure AI Foundry).
- Validation: You also provide a validation.jsonl file containing different examples the model hasn't seen. The platform uses this to ensure the model is learning the concept of normalization, not just memorizing the training data (overfitting).
- Evaluation: Once trained, you evaluate the model against strict domain metrics: Does the output JSON perfectly match your schema? Are the numeric values mathematically correct?
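The evaluation step can be as simple as replaying held-out examples through the model and scoring schema and value matches separately. A hedged sketch, where `call_model` stands in for whatever inference API your platform exposes:

```python
import json

def evaluate(examples, call_model):
    """Score model outputs against gold normalized JSON.

    `examples` is a list of (raw_text, expected_dict) pairs;
    `call_model` is a placeholder for the fine-tuned model's inference call.
    """
    exact, schema_ok = 0, 0
    for raw_text, expected in examples:
        output = json.loads(call_model(raw_text))
        if set(output) == set(expected):  # same keys -> schema match
            schema_ok += 1
        if output == expected:            # identical values -> exact match
            exact += 1
    n = len(examples)
    return {"schema_match": schema_ok / n, "exact_match": exact / n}

# Stubbed model for illustration only.
fake_model = lambda text: '{"premium_cadence": "QUARTERLY"}'
examples = [("Premium cadence: vierteljährlich", {"premium_cadence": "QUARTERLY"})]
print(evaluate(examples, fake_model))  # {'schema_match': 1.0, 'exact_match': 1.0}
```

Tracking schema match and exact match separately is useful: a model can emit perfectly shaped JSON while still getting numeric values wrong, and the two failures need different fixes.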
The Result
By putting the fine-tuned LLM at the end of your extraction pipeline, you replace hundreds of lines of brittle parsing code with a single, intelligent API call. The LLM receives the raw OCR text and reliably outputs structured, validated data ready for your database.
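That "single, intelligent API call" might look like the following sketch, assuming an OpenAI-compatible chat client; the model name and system prompt here are hypothetical:

```python
import json

SYSTEM_PROMPT = "Normalize OCR-noisy policy fields to canonical JSON."

def normalize(raw_text: str, client, model: str = "ft:policy-normalizer"):
    """One call to the fine-tuned model; returns the parsed normalized record."""
    response = client.chat.completions.create(
        model=model,  # hypothetical fine-tuned model identifier
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": raw_text},
        ],
        temperature=0,  # normalization should be deterministic, not creative
    )
    return json.loads(response.choices[0].message.content)
```

In production you would still validate the returned dict against the canonical schema before it reaches the database; the LLM replaces the parsing code, not the validation gate.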
Architecture Diagram
This diagram supports "Fine-Tuning LLMs for Complex Data Normalization" and highlights where controls, validation, and ownership boundaries sit in the workflow.
Post-Specific Engineering Lens
For this post, the primary objective is: Balance model quality with deterministic runtime constraints.
Implementation decisions for this case
- Chose a staged approach centered on the LLM to avoid high-blast-radius rollouts.
- Used fine-tuning checkpoints to make regressions observable before full rollout.
- Treated data-engineering documentation as part of delivery, not a post-task artifact.
Practical command path
These are representative execution checkpoints relevant to this post:
./llama-server --ctx-size <n> --cache-type-k q4_0 --cache-type-v q4_0
curl -s http://localhost:8080/health
python benchmark.py --profile edge
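The health check above can be extended into a small latency probe so the benchmark profile has a measured baseline. A sketch, where `fetch` abstracts the actual HTTP request (in practice it would wrap something like `urllib.request.urlopen("http://localhost:8080/health").status`):

```python
import time

def probe(fetch, samples: int = 5) -> dict:
    """Time repeated health checks; `fetch` performs one request, returns HTTP status."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        status = fetch()
        assert status == 200, f"health check failed: {status}"
        timings.append(time.perf_counter() - start)
    return {"max_s": max(timings), "min_s": min(timings)}
```

Recording max as well as min matters here: context-size and cache-quantization changes tend to show up first as tail latency, not as a shift in the average.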
Validation Matrix
| Validation goal | What to baseline | What confirms success |
|---|---|---|
| Functional stability | input quality, extraction accuracy, and processing latency | schema validation catches malformed payloads |
| Operational safety | rollback ownership + change window | confidence/fallback policy routes low-quality outputs safely |
| Production readiness | monitoring visibility and handoff notes | observability captures latency + quality per request class |
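The "confidence/fallback policy" row can be sketched as a small routing function; the threshold, required fields, and queue names below are illustrative assumptions, not values from a real deployment:

```python
def route(output: dict, confidence: float, threshold: float = 0.85):
    """Route a normalized record: accept it, or send it to a fallback review queue."""
    required = {"buildings_sum_insured", "currency"}  # hypothetical required fields
    schema_ok = required <= set(output)
    if schema_ok and confidence >= threshold:
        return ("accept", output)
    # Malformed or low-confidence outputs never reach the database directly.
    return ("fallback_review", output)
```

For example, a complete record at confidence 0.93 is accepted, while a record missing a required field is routed to review regardless of confidence. This keeps the blast radius of a bad model output bounded to the review queue.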
Failure Modes and Mitigations
| Failure mode | Why it appears in this type of work | Mitigation used in this post pattern |
|---|---|---|
| Over-allocated context | Memory pressure causes latency spikes or OOM | Tune ctx + cache quantization from measured baseline |
| Silent quality drift | Outputs degrade while latency appears fine | Track quality samples alongside perf metrics |
| Single-profile dependency | No graceful behavior under load | Define fallback profile and automatic failover rule |
Recruiter-Readable Impact Summary
- Scope: ship AI features with guardrails and measurable quality.
- Execution quality: guarded by staged checks and explicit rollback triggers.
- Outcome signal: repeatable implementation that can be handed over without hidden steps.