← Back to posts

Data Normalization Strategies for AI Document Extraction

How to handle messy OCR data and normalize fields like dates, currencies, and names after extracting them with AI models.

Case Snapshot

Situation

During document intelligence and Generative AI implementation work, this case came from work related to "Data Normalization Strategies for AI Document Extraction."

Issue

Needed a repeatable way to handle messy OCR data and normalize fields like dates, currencies, and names after extracting them with AI models.

Solution

Implemented a practical runbook/automation pattern with clear safety checks, execution steps, and verification points.

Used In

Used in enterprise AI initiatives where extracted data must be production-ready and governable.

Impact

Increased reliability of AI outputs and made downstream integration easier for business systems.

Situation

Extracting data from unstructured documents (like PDFs or scanned images) using AI models is only half the battle. The raw JSON output from tools like Azure Document Intelligence often contains OCR noise, varied date formats, and inconsistent entity names.

To make this data useful for downstream systems, you need a robust normalization pipeline. The AI extraction model identifies where the data is; the normalization pipeline ensures what the data is conforms to your canonical schema.

Where Normalization Happens

Normalization should primarily happen after extraction.

You typically build a post-processing pipeline (using an Azure Function, Logic App, backend service, or Databricks) that intercepts the raw JSON output from the extraction model and transforms it before saving it to your database.

Common Normalization Tasks

Here are typical data types that require normalization and how to handle them:

  • Dates: OCR might return “1st January 2025”, “01/01/2025”, or “12 de agosto de 2024”. Your pipeline must convert all of these to standard ISO formats (e.g., YYYY-MM-DD).
  • Currency Strings: Raw text like “EUR 15,506,000.00” or “$12,000” needs to be split. You need a numeric float for the amount (15506000.00) and a standardized currency code field ("currency": "EUR").
  • Entity Names: “Example Insurance Corporation Limited” should map to a canonical token like EXAMPLE_INSURANCE_CORP.
  • Addresses: Break down a single string (“123 Business Avenue, Suite 500, 12345 Business City, Country”) into structured components (Street, PostalCode, City, Country).
  • Categorical Mapping: Map free-text descriptions (e.g., “Property: Cold storage warehouse”) to a controlled vocabulary (WAREHOUSE_COLD_STORAGE).

Implementation Patterns

You generally use three patterns to normalize data, often in combination:

1. Deterministic Rules

For highly predictable data, use standard coding techniques:

  • Regex: To strip out unwanted characters or extract specific patterns.
  • Locale-aware parsers: Libraries that understand date formats in different languages.
  • Lookup Tables: Simple key-value maps for standardizing known entity names or country codes.

2. Heuristic Corrections

Implement logic to fix common, known OCR errors. For example, if your OCR frequently confuses a ‘0’ (zero) with an ‘O’ (letter) in a specific numeric field, you can write a heuristic to automatically correct it.

3. ML/LLM Post-processing (Fine-tuning)

When deterministic rules become too brittle to handle the sheer variety of edge cases, you can use a fine-tuned Large Language Model (LLM) for post-processing.

You train the LLM by providing it with .jsonl files containing examples of the messy raw extraction JSON and the desired canonical JSON output.

Example train.jsonl snippet:

{"messages":[{"role":"system","content":"Extract monetary amounts as numbers and include currency code."},{"role":"user","content":"Buildings Declared Value: EUR 15,506,000"},{"role":"assistant","content":"{"buildings_declared_value":15506000.00,"currency":"EUR"}"}]}

The fine-tuned LLM acts as an intelligent normalization engine, capable of understanding context and formatting the data precisely according to your business rules.

Architecture Diagram

Data Normalization Strategies for AI Document Extraction execution diagram

This diagram supports Data Normalization Strategies for AI Document Extraction and highlights where controls, validation, and ownership boundaries sit in the workflow.

Post-Specific Engineering Lens

For this post, the primary objective is: Balance model quality with deterministic runtime constraints.

Implementation decisions for this case

  • Chose a staged approach centered on azure to avoid high-blast-radius rollouts.
  • Used document-intelligence checkpoints to make regressions observable before full rollout.
  • Treated data-engineering documentation as part of delivery, not a post-task artifact.

Practical command path

These are representative execution checkpoints relevant to this post:

./llama-server --ctx-size <n> --cache-type-k q4_0 --cache-type-v q4_0
curl -s http://localhost:8080/health
python benchmark.py --profile edge

Validation Matrix

Validation goalWhat to baselineWhat confirms success
Functional stabilityinput quality, extraction accuracy, and processing latencyschema validation catches malformed payloads
Operational safetyrollback ownership + change windowconfidence/fallback policy routes low-quality outputs safely
Production readinessmonitoring visibility and handoff notesobservability captures latency + quality per request class

Failure Modes and Mitigations

Failure modeWhy it appears in this type of workMitigation used in this post pattern
Over-allocated contextMemory pressure causes latency spikes or OOMTune ctx + cache quantization from measured baseline
Silent quality driftOutputs degrade while latency appears fineTrack quality samples alongside perf metrics
Single-profile dependencyNo graceful behavior under loadDefine fallback profile and automatic failover rule

Recruiter-Readable Impact Summary

  • Scope: ship AI features with guardrails and measurable quality.
  • Execution quality: guarded by staged checks and explicit rollback triggers.
  • Outcome signal: repeatable implementation that can be handed over without hidden steps.