A practical context-management design for long conversations on constrained hardware: estimate tokens, detect fatigue, and fold history through a 4-level hierarchy.
- Issue: As conversations grow, irrelevant middle context accumulates, token budgets get exceeded, and edge devices pay extra latency for input processing.
- Solution: Implemented a context folding hierarchy (RAW → DETAILED → SUMMARY → CONCEPTS) with fatigue detection thresholds (85%/95%/98%) and fast character-based token estimation (see the sketch below).
- Used In: Used in RADXA AI Suite (edge inference + RAG + multi-agent orchestration).
context-window · summarization · edge-ai · rag · agents
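A minimal sketch of the folding and fatigue logic described above. The four folding levels and the 85%/95%/98% thresholds come from the entry itself; the class and function names are illustrative rather than the project's actual API, and the ~4 characters-per-token heuristic is an assumption.

```python
# Sketch of context folding with fatigue detection (illustrative names).
from dataclasses import dataclass
from enum import IntEnum


class FoldLevel(IntEnum):
    RAW = 0        # verbatim turns
    DETAILED = 1   # lightly compressed
    SUMMARY = 2    # short summary
    CONCEPTS = 3   # key concepts only


@dataclass
class Turn:
    text: str
    level: FoldLevel = FoldLevel.RAW


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Fast character-based estimate; ~4 chars/token is a common heuristic."""
    return max(1, int(len(text) / chars_per_token))


def fatigue_level(used: int, budget: int) -> str:
    """Map context usage to a fatigue tier (thresholds from the entry)."""
    ratio = used / budget
    if ratio >= 0.98:
        return "critical"   # aggressive folding / truncation
    if ratio >= 0.95:
        return "high"       # fold middle history toward CONCEPTS
    if ratio >= 0.85:
        return "elevated"   # fold older turns toward SUMMARY
    return "normal"


def fold_history(turns: list[Turn], budget: int) -> list[Turn]:
    """Demote the oldest, least-folded turns one level until under budget."""
    while sum(estimate_tokens(t.text) for t in turns) > budget:
        candidates = [t for t in turns if t.level < FoldLevel.CONCEPTS]
        if not candidates:
            break
        oldest = candidates[0]
        oldest.level = FoldLevel(oldest.level + 1)
        # Placeholder compression: a real system would re-summarize here.
        oldest.text = oldest.text[: max(32, len(oldest.text) // 2)]
    return turns
```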
A small, practical benchmark showing how quantizing the attention KV cache can materially reduce RAM usage on edge hardware.
- Issue: A model may fit in memory, but its KV cache can still consume significant RAM as the context grows, limiting concurrency and increasing OOM risk.
- Solution: Benchmarked KV cache quantization modes (default vs q8 vs q4) at a fixed context window and compared startup time, request latency, RSS, and KV cache footprint (see the sizing sketch below).
- Used In: Used in Engram AI benchmark runs for CPU GGUF inference (llama.cpp) on ARM64.
llama.cpp · gguf · qwen · kv-cache · quantization · arm64
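A back-of-the-envelope sizing sketch, assuming the standard KV-cache formula (2 × layers × KV heads × head dim × context positions × bytes per element). The model dimensions and the approximate bytes-per-element for q8_0/q4_0 block formats are illustrative assumptions, not figures from the benchmark.

```python
# Rough KV cache sizing to sanity-check measured RSS (illustrative numbers).
# Bytes-per-element include block scale overhead, approximately.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 1.0625, "q4_0": 0.5625}


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, cache_type: str) -> float:
    """2x (K and V) per layer, per KV head, per head dim, per context position."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * BYTES_PER_ELEM[cache_type]


if __name__ == "__main__":
    # Hypothetical 7B-class model with grouped-query attention.
    layers, kv_heads, head_dim, ctx = 32, 8, 128, 8192
    for mode in ("f16", "q8_0", "q4_0"):
        gib = kv_cache_bytes(layers, kv_heads, head_dim, ctx, mode) / 2**30
        print(f"{mode:>5}: ~{gib:.2f} GiB KV cache at ctx={ctx}")
```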
Benchmarking local LLM inference on RK3588 and why NPU acceleration (RKLLM) is the difference between real-time chat and unusable latency.
- Issue: CPU-only inference on small models was too slow for interactive UX, and some NPU model runs initially failed for non-runtime reasons (corrupted downloads or wrong target platform conversions).
- Solution: Benchmarked CPU (Ollama) vs NPU (RKLLM), applied system and inference parameter optimizations, and documented failure modes to distinguish model-file issues from NPU/runtime issues (see the measurement sketch below).
- Used In: Used in Engram AI (local-first Discord bot) running on RK3588.
rk3588 · npu · rkllm · ollama · benchmarks · discord
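A minimal sketch of how the CPU-side (Ollama) throughput can be measured; the RKLLM/NPU path uses its own runtime and is not shown here. The model name and host are placeholders; the response fields (eval_count, eval_duration in nanoseconds) follow Ollama's /api/generate response format.

```python
# Measure prompt-processing and generation tokens/sec via Ollama's HTTP API.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint


def measure_tps(model: str, prompt: str) -> dict:
    """Return prompt-processing and generation speed in tokens/sec."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    return {
        # *_duration fields are reported in nanoseconds.
        "prompt_tps": data["prompt_eval_count"] / (data["prompt_eval_duration"] / 1e9),
        "gen_tps": data["eval_count"] / (data["eval_duration"] / 1e9),
    }


if __name__ == "__main__":
    # Placeholder model tag; substitute whichever small model is installed.
    print(measure_tps("qwen2.5:1.5b", "Explain the RK3588 NPU in one paragraph."))
```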
Why moving layers into system RAM kills token generation speed, and how the Roofline Model explains it.
- Issue: Moving model layers into system RAM caused token generation speed to collapse, and there was no repeatable way to explain or predict the slowdown.
- Solution: Applied the Roofline Model to relate decode speed to memory bandwidth, and captured the analysis as a practical runbook/automation pattern with clear safety checks, execution steps, and verification points (see the estimate sketch below).
- Used In: Used in local model testing to validate architecture decisions before broader rollout.
Hardware · Performance · Python
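A minimal roofline-style sketch, assuming decode is memory-bandwidth bound and that every weight byte is streamed once per generated token. The model size and bandwidth figures are illustrative assumptions, not measurements from this project.

```python
# Roofline-style estimate of decode speed when some layers live in system RAM.
def tokens_per_sec(model_gib: float, offload_frac: float,
                   vram_bw_gibs: float, ram_bw_gibs: float) -> float:
    """Per-token time = time to stream fast-memory bytes + system-RAM bytes."""
    fast_bytes = model_gib * (1 - offload_frac)
    ram_bytes = model_gib * offload_frac
    seconds_per_token = fast_bytes / vram_bw_gibs + ram_bytes / ram_bw_gibs
    return 1.0 / seconds_per_token


if __name__ == "__main__":
    # Hypothetical 7 GiB of quantized weights, 300 GiB/s VRAM vs 30 GiB/s DDR.
    for frac in (0.0, 0.25, 0.5, 1.0):
        print(f"{int(frac * 100):>3}% in system RAM: "
              f"~{tokens_per_sec(7.0, frac, 300.0, 30.0):.1f} tok/s")
```

Even a quarter of the weights in system RAM dominates per-token time, which is why partial offload hurts far more than the offloaded fraction alone would suggest.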
Experimenting with Context Folding to parse massive documentation sets on local hardware.
- Issue: Documentation sets far larger than a local model's context window could not be parsed in one pass, so a repeatable Context Folding workflow was needed on local hardware.
- Solution: Implemented a practical runbook/automation pattern with clear safety checks, execution steps, and verification points (see the folding sketch below).
- Used In: Used in local model testing to validate architecture decisions before broader rollout.
LocalLLM · MachineLearning · Architecture
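A minimal sketch of folding a large documentation set down to a digest that fits a local context window, assuming a map-reduce style pairwise merge. summarize() is a placeholder for a call to the local model, and the chunk sizes and chars-per-token heuristic are illustrative, not the project's actual parameters.

```python
# Fold chunk summaries level by level until the digest fits the context budget.
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    return max(1, int(len(text) / chars_per_token))


def chunk(text: str, max_tokens: int) -> list[str]:
    """Split text into pieces small enough to summarize in one pass."""
    max_chars = max_tokens * 4
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def summarize(text: str, target_tokens: int) -> str:
    """Placeholder: in practice this calls the local model with a summarize prompt."""
    return text[: target_tokens * 4]


def fold_documents(docs: list[str], ctx_budget: int, chunk_tokens: int = 1500) -> str:
    """Map each chunk to a summary, then merge summaries pairwise until they fit."""
    level = [summarize(c, chunk_tokens // 4)
             for doc in docs for c in chunk(doc, chunk_tokens)]
    while estimate_tokens("\n".join(level)) > ctx_budget and len(level) > 1:
        # Merge neighbouring summaries and re-summarize the merged text.
        level = [summarize("\n".join(level[i:i + 2]), chunk_tokens // 4)
                 for i in range(0, len(level), 2)]
    return "\n".join(level)
```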