Each post in this domain is written in case-study format: the issue encountered, the solution applied, and where it is used in practice.

Context Folding on Edge LLMs: Fatigue Thresholds and Hierarchical Compression

A practical context-management design for long conversations on constrained hardware: estimate tokens, detect fatigue, and fold history through a 4-level hierarchy.

  • Issue: As conversations grow, irrelevant middle context accumulates, token budgets get exceeded, and edge devices pay extra latency for input processing.
  • Solution: Implemented a context folding hierarchy (RAW → DETAILED → SUMMARY → CONCEPTS) with fatigue detection thresholds (85%/95%/98%) and fast character-based token estimation.
  • Used In: RADXA AI Suite (edge inference + RAG + multi-agent orchestration).
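The thresholds and hierarchy above can be sketched in a few lines. This is a minimal illustration, not the RADXA AI Suite implementation: the level names come from the post, while the ~4-characters-per-token ratio and the function names are assumptions for the sketch.

```python
from enum import Enum

class FoldLevel(Enum):
    RAW = 0        # full verbatim history
    DETAILED = 1   # verbatim detail trimmed, structure kept
    SUMMARY = 2    # middle turns compressed to summaries
    CONCEPTS = 3   # only high-level concepts retained

# Rough heuristic for English text (assumption): ~4 characters per token.
CHARS_PER_TOKEN = 4

def estimate_tokens(text: str) -> int:
    """Fast character-based token estimate; avoids a tokenizer round-trip."""
    return max(1, len(text) // CHARS_PER_TOKEN)

def fatigue_level(used_tokens: int, budget: int) -> FoldLevel:
    """Map context utilization to a folding level via the 85%/95%/98% thresholds."""
    ratio = used_tokens / budget
    if ratio >= 0.98:
        return FoldLevel.CONCEPTS
    if ratio >= 0.95:
        return FoldLevel.SUMMARY
    if ratio >= 0.85:
        return FoldLevel.DETAILED
    return FoldLevel.RAW
```

For example, with a 4096-token budget, 3900 tokens used (~95.2%) trips the SUMMARY threshold, while 4096 tokens used trips CONCEPTS.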

KV Cache Quantization on Qwen 3.5 (27B): Cutting Memory Without Breaking Latency

A small, practical benchmark showing how quantizing the attention KV cache can materially reduce RAM usage on edge hardware.

  • Issue: A large model may fit in memory at load time, but the KV cache keeps growing with context, limiting concurrency and increasing OOM risk.
  • Solution: Benchmarked KV cache quantization modes (default vs q8 vs q4) at a fixed context window and compared startup time, request latency, RSS, and KV cache footprint.
  • Used In: Engram AI benchmark runs for CPU GGUF inference (llama.cpp) on ARM64.
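Why the quantized modes shrink the footprint is plain arithmetic: the cache stores K and V activations for every layer and position. A back-of-envelope sizing sketch, using llama.cpp's block layouts (q8_0 stores 32 values in 34 bytes, q4_0 in 18 bytes) and an illustrative 27B-class shape that is an assumption, not the real Qwen spec:

```python
# Bytes per cached element, including per-block scale overhead in GGUF:
# f16 = 2.0, q8_0 = 34 bytes per 32 values, q4_0 = 18 bytes per 32 values.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, cache_type: str = "f16") -> int:
    """Total KV cache size: K and V tensors across all layers at full context."""
    return int(2 * n_layers * n_kv_heads * head_dim * ctx_len
               * BYTES_PER_ELEM[cache_type])

# Hypothetical 27B-class config (illustrative numbers only):
layers, kv_heads, hdim, ctx = 64, 8, 128, 32768

for ct in ("f16", "q8_0", "q4_0"):
    gib = kv_cache_bytes(layers, kv_heads, hdim, ctx, ct) / 2**30
    print(f"{ct:5s}: {gib:.2f} GiB")  # f16: 8.00, q8_0: 4.25, q4_0: 2.25
```

In llama.cpp these modes are selected with the `--cache-type-k` / `--cache-type-v` options, which is what makes the default-vs-q8-vs-q4 comparison a one-flag change per run.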

RK3588 LLM Performance: NPU vs CPU in a Discord Agent

Benchmarking local LLM inference on RK3588 and why NPU acceleration (RKLLM) is the difference between real-time chat and unusable latency.

  • Issue: CPU-only inference on small models was too slow for interactive UX, and some NPU model runs initially failed for non-runtime reasons (corrupted downloads or wrong target platform conversions).
  • Solution: Benchmarked CPU (Ollama) vs NPU (RKLLM), applied system and inference parameter optimizations, and documented failure modes to distinguish model-file issues from NPU/runtime issues.
  • Used In: Engram AI (local-first Discord bot) running on RK3588.
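The "non-runtime failure" triage above boils down to checking the model file before blaming the NPU stack. A minimal sketch of that check order; the function names, return strings, and the idea of a published checksum are assumptions for illustration, not part of RKLLM itself:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so multi-GB weights never sit in RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def triage_model_file(path: Path, expected_sha256: str, min_bytes: int) -> str:
    """Rule out model-file problems before debugging the NPU runtime."""
    if not path.exists():
        return "missing file: re-download"
    if path.stat().st_size < min_bytes:
        return "truncated download: re-download"
    if sha256_of(path) != expected_sha256:
        return "corrupted download: checksum mismatch"
    # File is intact; remaining failures point at the runtime or at a model
    # converted for the wrong target platform.
    return "file OK: investigate runtime/target-platform instead"
```

Running this first separates the two failure classes the post describes: corrupted or truncated downloads fail here, while wrong-target conversions and genuine runtime bugs pass the file check and fail at load time.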

Local AI: Stop Optimizing for VRAM Capacity. Start Optimizing for Bandwidth.

Why moving layers into system RAM kills token generation speed, and how the Roofline Model explains it.

  • Issue: Offloading model layers into system RAM collapsed token generation speed, and it was unclear how to show that memory bandwidth, not compute or VRAM capacity, was the bottleneck.
  • Solution: Applied the Roofline Model to local benchmarks: autoregressive decoding streams the full weight set once per token, so any layer spilled to slower system RAM adds its full read time to every generated token.
  • Used In: Local model testing to validate architecture decisions before broader rollout.
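The roofline argument above fits in a few lines of arithmetic: decode is bandwidth-bound, so the tokens/s ceiling is bandwidth divided by the bytes streamed per token. The model size and bandwidth figures below are illustrative assumptions, not measurements:

```python
def decode_tps_upper_bound(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Roofline-style ceiling: each generated token must stream all weights once."""
    return bandwidth_bytes_per_s / weight_bytes

def split_tps(weight_bytes: float, frac_in_dram: float,
              vram_bw: float, dram_bw: float) -> float:
    """With layers split across memories, per-token time is the sum of both reads."""
    t = (frac_in_dram * weight_bytes / dram_bw
         + (1 - frac_in_dram) * weight_bytes / vram_bw)
    return 1 / t

GB = 1e9
weights = 13 * GB   # e.g. a 13 GB quantized model (assumption)
vram_bw = 400 * GB  # ballpark discrete-GPU VRAM bandwidth
dram_bw = 50 * GB   # ballpark dual-channel system RAM bandwidth

print(decode_tps_upper_bound(weights, vram_bw))        # ~30.8 tok/s ceiling
print(decode_tps_upper_bound(weights, dram_bw))        # ~3.8 tok/s ceiling
print(split_tps(weights, 0.25, vram_bw, dram_bw))      # ~11.2 tok/s
```

The last line is the punchline: spilling just 25% of the weights to system RAM drops the ceiling from ~31 to ~11 tok/s, because the slow pool dominates per-token time regardless of how fast the VRAM-resident layers are.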

Recursive Language Models & Context Rot

Experimenting with Context Folding to parse massive documentation sets on local hardware.

  • Issue: Documentation sets were far larger than the local model's context window, and stuffing long contexts degraded answer quality (context rot).
  • Solution: Applied Context Folding recursively: chunk the corpus, summarize each chunk, and fold the summaries into progressively higher-level representations until the working context fits the token budget.
  • Used In: Local model testing to validate architecture decisions before broader rollout.
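The recursive shape of the experiment can be sketched as follows. This is an illustration of the control flow only: the `summarize` callable stands in for an LLM summarization call, and the ~4-characters-per-token estimate and guard logic are assumptions for the sketch.

```python
def fold(text: str, budget_tokens: int, summarize, chars_per_token: int = 4) -> str:
    """Recursively compress text until it fits the token budget.

    `summarize` is a placeholder for an LLM call (assumption); it only
    needs to return something shorter than its input for folding to converge.
    """
    if len(text) // chars_per_token <= budget_tokens:
        return text  # already fits: no folding needed
    # Split in half, compress each half, then fold the concatenation again.
    mid = len(text) // 2
    compressed = summarize(text[:mid]) + summarize(text[mid:])
    if len(compressed) >= len(text):  # guard against non-shrinking summarizers
        return compressed[: budget_tokens * chars_per_token]
    return fold(compressed, budget_tokens, summarize, chars_per_token)

# Toy summarizer: keep the first sentence of each chunk (LLM stand-in).
toy = lambda chunk: chunk.split(".")[0][:200] + ". "
```

Each recursion level corresponds to one step up the folding hierarchy: raw chunks become summaries, summaries of summaries become concepts, and the process terminates as soon as the working text fits the budget.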