A practical context-management design for long conversations on constrained hardware: estimate tokens, detect fatigue, and fold history through a 4-level hierarchy.
- Issue: As conversations grow, irrelevant middle context accumulates, token budgets get exceeded, and edge devices pay extra latency for input processing.
- Solution: Implemented a context folding hierarchy (RAW → DETAILED → SUMMARY → CONCEPTS) with fatigue detection thresholds (85%/95%/98%) and fast character-based token estimation (see the sketch below).
- Used In: Used in RADXA AI Suite (edge inference + RAG + multi-agent orchestration).
context-window · summarization · edge-ai · rag · agents
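A minimal sketch of the folding and fatigue logic described above. The four folding levels and the 85%/95%/98% thresholds come from the entry itself; the class and function names are illustrative rather than the project's actual API, and the ~4 characters-per-token heuristic is an assumption.

```python
# Sketch of context folding with fatigue detection (illustrative names).
from dataclasses import dataclass
from enum import IntEnum


class FoldLevel(IntEnum):
    RAW = 0        # verbatim turns
    DETAILED = 1   # lightly compressed
    SUMMARY = 2    # short summary
    CONCEPTS = 3   # key concepts only


@dataclass
class Turn:
    text: str
    level: FoldLevel = FoldLevel.RAW


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Fast character-based estimate; ~4 chars/token is a common heuristic."""
    return max(1, int(len(text) / chars_per_token))


def fatigue_level(used: int, budget: int) -> str:
    """Map context usage to a fatigue tier (thresholds from the entry)."""
    ratio = used / budget
    if ratio >= 0.98:
        return "critical"   # aggressive folding / truncation
    if ratio >= 0.95:
        return "high"       # fold middle history toward CONCEPTS
    if ratio >= 0.85:
        return "elevated"   # fold older turns toward SUMMARY
    return "normal"


def fold_history(turns: list[Turn], budget: int) -> list[Turn]:
    """Demote the oldest, least-folded turns one level until under budget."""
    while sum(estimate_tokens(t.text) for t in turns) > budget:
        candidates = [t for t in turns if t.level < FoldLevel.CONCEPTS]
        if not candidates:
            break
        oldest = candidates[0]
        oldest.level = FoldLevel(oldest.level + 1)
        # Placeholder compression: a real system would re-summarize here.
        oldest.text = oldest.text[: max(32, len(oldest.text) // 2)]
    return turns
```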
A small, practical benchmark showing how quantizing the attention KV cache can materially reduce RAM usage on edge hardware.
- Issue: A model may fit in memory, but its KV cache can still consume significant RAM as the context grows, limiting concurrency and increasing OOM risk.
- Solution: Benchmarked KV cache quantization modes (default vs q8 vs q4) at a fixed context window and compared startup time, request latency, RSS, and KV cache footprint (see the sizing sketch below).
- Used In: Used in Engram AI benchmark runs for CPU GGUF inference (llama.cpp) on ARM64.
llama.cpp · gguf · qwen · kv-cache · quantization · arm64
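A back-of-the-envelope sizing sketch, assuming the standard KV-cache formula (2 × layers × KV heads × head dim × context positions × bytes per element). The model dimensions and the approximate bytes-per-element for q8_0/q4_0 block formats are illustrative assumptions, not figures from the benchmark.

```python
# Rough KV cache sizing to sanity-check measured RSS (illustrative numbers).
# Bytes-per-element include block scale overhead, approximately.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 1.0625, "q4_0": 0.5625}


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, cache_type: str) -> float:
    """2x (K and V) per layer, per KV head, per head dim, per context position."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * BYTES_PER_ELEM[cache_type]


if __name__ == "__main__":
    # Hypothetical 7B-class model with grouped-query attention.
    layers, kv_heads, head_dim, ctx = 32, 8, 128, 8192
    for mode in ("f16", "q8_0", "q4_0"):
        gib = kv_cache_bytes(layers, kv_heads, head_dim, ctx, mode) / 2**30
        print(f"{mode:>5}: ~{gib:.2f} GiB KV cache at ctx={ctx}")
```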
Benchmarking local LLM inference on RK3588 and why NPU acceleration (RKLLM) is the difference between real-time chat and unusable latency.
- Issue: CPU-only inference on small models was too slow for interactive UX, and some NPU model runs initially failed for non-runtime reasons (corrupted downloads or wrong target platform conversions).
- Solution: Benchmarked CPU (Ollama) vs NPU (RKLLM), applied system and inference parameter optimizations, and documented failure modes to distinguish model-file issues from NPU/runtime issues (see the measurement sketch below).
- Used In: Used in Engram AI (local-first Discord bot) running on RK3588.
rk3588 · npu · rkllm · ollama · benchmarks · discord
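A minimal sketch of how the CPU-side (Ollama) throughput can be measured; the RKLLM/NPU path uses its own runtime and is not shown here. The model name and host are placeholders; the response fields (eval_count, eval_duration in nanoseconds) follow Ollama's /api/generate response format.

```python
# Measure prompt-processing and generation tokens/sec via Ollama's HTTP API.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint


def measure_tps(model: str, prompt: str) -> dict:
    """Return prompt-processing and generation speed in tokens/sec."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    return {
        # *_duration fields are reported in nanoseconds.
        "prompt_tps": data["prompt_eval_count"] / (data["prompt_eval_duration"] / 1e9),
        "gen_tps": data["eval_count"] / (data["eval_duration"] / 1e9),
    }


if __name__ == "__main__":
    # Placeholder model tag; substitute whichever small model is installed.
    print(measure_tps("qwen2.5:1.5b", "Explain the RK3588 NPU in one paragraph."))
```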
Why moving layers into system RAM kills token generation speed, and how the Roofline Model explains it.
- Issue: Moving model layers into system RAM caused token generation speed to collapse, and there was no repeatable way to explain or predict the slowdown.
- Solution: Applied the Roofline Model to relate decode speed to memory bandwidth, and captured the analysis as a practical runbook/automation pattern with clear safety checks, execution steps, and verification points (see the estimate sketch below).
- Used In: Used in local model testing to validate architecture decisions before broader rollout.
Hardware · Performance · Python
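A minimal roofline-style sketch, assuming decode is memory-bandwidth bound and that every weight byte is streamed once per generated token. The model size and bandwidth figures are illustrative assumptions, not measurements from this project.

```python
# Roofline-style estimate of decode speed when some layers live in system RAM.
def tokens_per_sec(model_gib: float, offload_frac: float,
                   vram_bw_gibs: float, ram_bw_gibs: float) -> float:
    """Per-token time = time to stream fast-memory bytes + system-RAM bytes."""
    fast_bytes = model_gib * (1 - offload_frac)
    ram_bytes = model_gib * offload_frac
    seconds_per_token = fast_bytes / vram_bw_gibs + ram_bytes / ram_bw_gibs
    return 1.0 / seconds_per_token


if __name__ == "__main__":
    # Hypothetical 7 GiB of quantized weights, 300 GiB/s VRAM vs 30 GiB/s DDR.
    for frac in (0.0, 0.25, 0.5, 1.0):
        print(f"{int(frac * 100):>3}% in system RAM: "
              f"~{tokens_per_sec(7.0, frac, 300.0, 30.0):.1f} tok/s")
```

Even a quarter of the weights in system RAM dominates per-token time, which is why partial offload hurts far more than the offloaded fraction alone would suggest.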
Experimenting with Context Folding to parse massive documentation sets on local hardware.
- Issue: Documentation sets far larger than a local model's context window could not be parsed in one pass, so a repeatable Context Folding workflow was needed on local hardware.
- Solution: Implemented a practical runbook/automation pattern with clear safety checks, execution steps, and verification points (see the folding sketch below).
- Used In: Used in local model testing to validate architecture decisions before broader rollout.
LocalLLM · MachineLearning · Architecture
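A minimal sketch of folding a large documentation set down to a digest that fits a local context window, assuming a map-reduce style pairwise merge. summarize() is a placeholder for a call to the local model, and the chunk sizes and chars-per-token heuristic are illustrative, not the project's actual parameters.

```python
# Fold chunk summaries level by level until the digest fits the context budget.
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    return max(1, int(len(text) / chars_per_token))


def chunk(text: str, max_tokens: int) -> list[str]:
    """Split text into pieces small enough to summarize in one pass."""
    max_chars = max_tokens * 4
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def summarize(text: str, target_tokens: int) -> str:
    """Placeholder: in practice this calls the local model with a summarize prompt."""
    return text[: target_tokens * 4]


def fold_documents(docs: list[str], ctx_budget: int, chunk_tokens: int = 1500) -> str:
    """Map each chunk to a summary, then merge summaries pairwise until they fit."""
    level = [summarize(c, chunk_tokens // 4)
             for doc in docs for c in chunk(doc, chunk_tokens)]
    while estimate_tokens("\n".join(level)) > ctx_budget and len(level) > 1:
        # Merge neighbouring summaries and re-summarize the merged text.
        level = [summarize("\n".join(level[i:i + 2]), chunk_tokens // 4)
                 for i in range(0, len(level), 2)]
    return "\n".join(level)
```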