Each post in this domain is written in case-study format: situation, issue, solution, usage context, and delivery impact.

10 min read

14 Models Benchmarked on RK3588: The Definitive CPU vs NPU Ranking

Benchmarked every viable local LLM (350M to 26B, CPU and NPU) through a live Discord agent pipeline on RK3588. Found that the NPU beats the CPU at equal quality, code generation is solved at every size tested, and 4B+ models are both slower and worse than 2B on this board.

Local AI
Issue Previous benchmarks measured raw llama.cpp throughput but not real quality through the agent pipeline. Models that looked fast in synthetic tests failed at reasoning, refused tool calls, or had their requests intercepted by workspace routing before ever reaching the model.
Solution Built a 14-test, 6-dimension benchmark harness that tests every model through the live Discord pipeline with quality validation: reasoning, factual accuracy, code generation, instruction following, tool calling, and math. Tested 14 models (9 CPU GGUF + 3 NPU RKLLM + 2 large MoE) with BENCHMARK_MODE to isolate pure model performance.
rk3588, radxa, rock-5b-plus, llama.cpp
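A multi-dimension harness like the one described can be sketched as a table of (dimension, prompt, validator) triples scored per model. This is a minimal illustration, not the post's actual 14-test suite; `run_model` stands in for the real Discord-pipeline call, and the prompts and validators are invented for the example.

```python
# Minimal sketch of a multi-dimension benchmark scorer.
# `run_model` is a stand-in for the real Discord-pipeline call.
from collections import defaultdict

TESTS = [
    # (dimension, prompt, validator) -- illustrative entries only
    ("math", "What is 17 * 23?", lambda out: "391" in out),
    ("instruction_following", "Reply with exactly the word OK.",
     lambda out: out.strip() == "OK"),
    ("factual", "What is the capital of France?", lambda out: "Paris" in out),
]

def score_model(run_model):
    """Return per-dimension pass rates for one model."""
    passed, total = defaultdict(int), defaultdict(int)
    for dim, prompt, validate in TESTS:
        total[dim] += 1
        if validate(run_model(prompt)):
            passed[dim] += 1
    return {dim: passed[dim] / total[dim] for dim in total}

# Example with a trivial stub "model":
stub = lambda p: {"What is 17 * 23?": "391"}.get(p, "Paris")
print(score_model(stub))
```

Validators on the output, rather than raw tokens/s, are what catch the failure modes named above: a model that streams quickly but fails the validator still scores zero on that dimension.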
6 min read

llamacpp-workbench: Remote llama.cpp Control and REAP Model Serving on RK3588

Publishing a practical local-AI control plane for llama.cpp: remote model loading, runtime tuning, streaming chat, and real REAP model serving on a Radxa ROCK 5B+.

Local AI
Issue Most local model UIs either abstract away the runtime details that actually matter on constrained hardware or assume desktop-class GPUs. On RK3588, that makes it harder to tune context, KV cache quantization, reasoning behavior, and model selection credibly.
Solution Built and published `llamacpp-workbench`, a remote llama.cpp workbench with explicit runtime controls, model presets, markdown chat rendering, streaming responses, and benchmark-backed defaults for REAP and dense GGUF models.
llama.cpp, rk3588, radxa, rock-5b-plus
15 min read

Qwen3.5 on RK3588 with llama.cpp: Real Benchmarks from a Radxa ROCK 5B+

An advanced benchmark report for running Qwen3.5 locally on RK3588 with source-built llama.cpp: prefill speed, decode speed, stable context, tool-calling behavior, and the practical model choices that actually work on a Radxa ROCK 5B+.

Local AI
Issue The usual local-AI advice overemphasizes parameter count and underexplains bandwidth, context budget, KV cache policy, and interactive latency. On RK3588, that leads to bad defaults: models that technically load but feel broken in real chat and tool-calling workloads.
Solution I ran a corrected Qwen3.5 sweep on RK3588 using source-built llama.cpp, quantized KV cache, and task-pass validation. Then I compared prefill, decode, stable context, average latency, and tool-calling behavior to determine the right model for each workload.
rk3588, radxa, rock-5b-plus, llama.cpp
15 min read

GPU VRAM, CPU Offload, and llama.cpp: The Real Performance Cliff

An advanced guide to local GPU inference with llama.cpp: why bandwidth matters more than model fit, how hybrid GPU+CPU offload behaves on cards like the RTX 3060 and 5070, what quantization really means mathematically, and how to run it on Linux, Windows, and WSL.

Local AI
Issue Operators lacked a practical framework for choosing quantization, sizing VRAM budgets, deciding when CPU offload is acceptable, and understanding the difference between weight quantization and KV cache quantization. Windows-specific setup questions also created confusion around native builds versus WSL.
Solution Documented the bandwidth-first model, explained hybrid offload behavior for 12 GB and mid-range modern GPUs, compared quantization choices such as Q4_K_M and q4_0 KV cache, and provided concrete llama.cpp launch patterns for Linux, Windows, and WSL.
local-ai, llama.cpp, cuda, vram
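The "what quantization really means mathematically" part can be sketched with the arithmetic behind symmetric 4-bit block quantization, in the spirit of llama.cpp's q4_0 (32-weight blocks, one fp16 scale per block). This is only the arithmetic, not the actual q4_0 bit layout, and the toy weights are invented for illustration.

```python
# Symmetric 4-bit block quantization: each block of 32 weights shares
# one scale; each weight is stored as a signed 4-bit integer.

def quantize_block(weights):               # weights: 32 floats
    scale = max(abs(w) for w in weights) / 7 or 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return scale, q

def dequantize_block(scale, q):
    return [scale * v for v in q]

block = [0.12, -0.4, 0.33, 0.05] * 8       # 32 toy weights
scale, q = quantize_block(block)
restored = dequantize_block(scale, q)
max_err = max(abs(a - b) for a, b in zip(block, restored))

# Storage: 32 4-bit ints + one fp16 scale = 18 bytes -> 4.5 bits/weight
bits_per_weight = (32 * 4 + 16) / 32
print(bits_per_weight, round(max_err, 4))
```

The key takeaway matches the bandwidth-first framing: the value of 4-bit weights is not just that the model fits, but that every decoded token reads roughly 4.5 bits per weight instead of 16, which directly multiplies decode speed on bandwidth-bound hardware.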
3 min read

Implementing Google's TurboQuant: Hybrid KV Cache for Edge LLM Deployment

How I implemented hybrid per-layer KV cache quantization on RK3588 using insights from Google's TurboQuant research, achieving 17% better compression with zero quality loss.

Local AI
Issue Every time you message an AI chatbot, the model stores your conversation in temporary memory called the KV cache. On large models, this cache alone can consume 40GB—more than the model itself. On a constrained edge device, this is the difference between working and broken.
Solution Implemented hybrid per-layer KV cache quantization inspired by Google's TurboQuant (ICLR 2026). By using 8-bit quantization for early transformer layers (where attention quality matters most) and 4-bit quantization for later layers, we achieved 17% better compression without quality loss.
local-ai, edge-ai, turboquant, kv-cache
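The sizing arithmetic behind a hybrid per-layer policy can be sketched directly: 8-bit cache on early layers, 4-bit on the rest, compared against a uniform 8-bit baseline. The model shape and layer split below are assumptions for illustration, not the post's exact configuration, so the saving printed here will differ from the post's 17% figure.

```python
# Illustrative KV-cache sizing for a hybrid per-layer quantization policy.

def kv_cache_bytes(layers, ctx, n_kv_heads, head_dim, bits):
    # K and V each store ctx * n_kv_heads * head_dim values per layer
    return layers * 2 * ctx * n_kv_heads * head_dim * bits // 8

LAYERS, CTX, KV_HEADS, HEAD_DIM = 28, 4096, 8, 128   # assumed model shape
EARLY = 10                                            # 8-bit layers (assumed)

uniform_8bit = kv_cache_bytes(LAYERS, CTX, KV_HEADS, HEAD_DIM, 8)
hybrid = (kv_cache_bytes(EARLY, CTX, KV_HEADS, HEAD_DIM, 8)
          + kv_cache_bytes(LAYERS - EARLY, CTX, KV_HEADS, HEAD_DIM, 4))

saving = 1 - hybrid / uniform_8bit
print(f"uniform 8-bit: {uniform_8bit / 2**20:.0f} MiB, "
      f"hybrid: {hybrid / 2**20:.0f} MiB, saving: {saving:.0%}")
```

The design choice mirrors the post's rationale: early layers keep 8-bit precision where attention quality matters most, and the later layers absorb the 4-bit compression.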
9 min read

RK3588 NPU Router Architecture: What Actually Runs, What Wins, and Why

A benchmark-backed deep dive into the real RK3588 inference stack: llama.cpp CPU winners, NPU roles, KV cache choices, quantization tradeoffs, and how to think about 27B with GPU+CPU offload.

Local AI
Issue Several paths technically loaded but were not practically usable. Large models timed out or delivered poor latency, CPU tuning mattered more than expected, and the product narrative needed to shift from 'many runtimes' to a benchmark-backed llama.cpp-first architecture.
Solution Benchmarked llama.cpp and RKLLM on RK3588, identified the winning CPU configs for Qwen 3.5 4B and 9B, clarified where the NPU helps, documented KV cache and quantization choices, and reframed the architecture as llama.cpp-first with NPU used selectively.
rk3588, npu, llama.cpp, rkllm
8 min read

Azure Provisioned Throughput: When Fixed Costs Beat Pay-Per-Token

Why we moved from Pay-As-You-Go to Provisioned Throughput Units (PTU) for our Azure OpenAI workloads—and how to know if it makes sense for you.

AI
Issue As your application scales, Microsoft's default rate limits can throttle your service, leading to slow responses and inconsistent user experiences. You're essentially stuck in traffic during peak hours.
Solution Think of it like a toll road. Standard use is like paying per mile, but you're stuck in traffic. Azure's Provisioned Throughput (PTU) is like renting your own dedicated express lane. We built a framework to calculate the exact financial break-even point between the two models.
azure, llm, cost-optimization, infrastructure
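The break-even framework mentioned above reduces to simple arithmetic: the monthly token volume at which reserved capacity undercuts per-token billing. The prices below are placeholders, not Azure's actual PTU or pay-as-you-go rates; substitute your own contract numbers.

```python
# Break-even sketch between pay-per-token and provisioned throughput.
# All prices below are placeholders, not real Azure rates.

PAYG_PER_1K_TOKENS = 0.01        # $ per 1K tokens, blended in+out (assumed)
PTU_MONTHLY_COST = 8000.0        # $ per month for reserved capacity (assumed)

def breakeven_tokens_per_month():
    """Monthly token volume above which PTU is cheaper than pay-as-you-go."""
    return PTU_MONTHLY_COST / PAYG_PER_1K_TOKENS * 1000

def cheaper_option(tokens_per_month):
    payg = tokens_per_month / 1000 * PAYG_PER_1K_TOKENS
    return "PTU" if payg > PTU_MONTHLY_COST else "PAYG"

print(f"{breakeven_tokens_per_month():,.0f} tokens/month")
print(cheaper_option(1_000_000_000))   # e.g. 1B tokens/month
```

At these placeholder rates the break-even sits at 800M tokens per month; below that volume the dedicated "express lane" is paying for capacity you never use.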
6 min read

Edge LLM Optimization: Memory Bandwidth and Context Management

Lessons learned running LLMs on constrained hardware—why bandwidth matters more than capacity, how KV cache quantization helps, and context folding for long conversations.

Local AI
Issue Edge devices have hard constraints: limited RAM, no GPU VRAM, and strict latency requirements for interactive applications. The naive approach of 'make the model fit' failed repeatedly—either latency was too high or context windows would overflow during long conversations.
Solution Developed a three-pronged approach: (1) enforce bandwidth-first model selection, (2) use KV cache quantization to reduce memory footprint, and (3) implement hierarchical context folding for long conversations.
local-ai, edge-ai, llama.cpp, kv-cache
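Prong (1), bandwidth-first model selection, follows from a one-line estimate: decode speed is roughly memory bandwidth divided by the bytes read per token, which for dense models is approximately the quantized model size. The bandwidth figure and model sizes below are rough assumptions for illustration, not measured numbers.

```python
# Bandwidth-first sizing: an upper bound on decode speed from memory
# bandwidth and model size. Figures below are assumptions, not measurements.

def decode_tokens_per_sec(bandwidth_gb_s, model_gb, efficiency=0.6):
    """Upper-bound decode rate; `efficiency` covers real-world overheads."""
    return bandwidth_gb_s * efficiency / model_gb

BOARD_BW = 25.0   # GB/s, an LPDDR4x/LPDDR5-class assumption

for name, gb in [("2B @ Q4", 1.5), ("4B @ Q4", 2.5), ("9B @ Q4", 5.5)]:
    print(f"{name}: ~{decode_tokens_per_sec(BOARD_BW, gb):.1f} tok/s")
```

This is why "make the model fit" fails on edge hardware: a model that fits in RAM but needs 5.5 GB per token of reads cannot be interactive on a 25 GB/s bus, no matter how much capacity remains.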
14 min read

Training Custom AI Models for Insurance Document Processing

How to build, train, and deploy custom document intelligence models for extracting structured data from multilingual insurance policies using Azure AI Foundry.

AI
Issue Off-the-shelf OCR solutions couldn't handle the complexity of insurance documents. Different insurers used different layouts, multilingual support was limited, and extracted data needed to conform to a strict canonical schema for downstream systems.
Solution Implemented a custom document intelligence solution using Azure AI Document Intelligence, training models on labeled examples to extract and normalize fields across multiple insurers and languages.
azure-ai, document-intelligence, machine-learning, automation
9 min read

IntelliAuto: AI-Powered Automotive Assistant with Secure Monetization

Building an intelligent car maintenance companion with LLM-powered diagnostics, dynamic affiliate commerce, and defense-in-depth AI security.

Kotlin AI
Issue Existing automotive apps are passive logs. Adding AI creates risks: prompt injection through user input, data privacy concerns, API cost runaway, and potential for incorrect safety-critical advice.
Solution Designed IntelliAuto with AutoMind AI assistant featuring backend proxy architecture, multi-layer prompt injection prevention, dynamic affiliate link generation, and strict safety disclaimers for automotive advice.
android, ai, kotlin, mobile
3 min read

RK3588 LLM Performance: NPU vs CPU in a Discord Agent

Benchmarking local LLM inference on RK3588 and why NPU acceleration (RKLLM) is the difference between real-time chat and unusable latency.

Local AI
Issue CPU-only inference on small models was too slow for interactive UX, and some NPU model runs initially failed for non-runtime reasons (corrupted downloads or wrong target platform conversions).
Solution Benchmarked CPU (Ollama) vs NPU (RKLLM), applied system and inference parameter optimizations, and documented failure modes to distinguish model-file issues from NPU/runtime issues.
rk3588, npu, rkllm, ollama
4 min read

Modernizing Android UX: High Refresh Rates & App Shortcuts

How to request 90Hz/120Hz rendering and implement static deep-linked app shortcuts to improve mobile application usability.

Kotlin AI
Issue The app was locked to standard 60Hz rendering, causing sub-optimal scrolling experiences on devices capable of 90Hz or 120Hz. Additionally, users had to navigate through multiple screens to perform frequent actions.
Solution Detected 90Hz+ display modes and configured window post-processing preferences for smoother rendering, then implemented static XML-based app shortcuts routed via deep links.
android, ux, performance, kotlin
4 min read

Securing and Scaling AI Context in an Automotive Assistant

How to implement rate limiting, context window management, and prompt injection prevention for an LLM-powered mobile application backend.

AI Kotlin
Issue Directly exposing LLMs to users risks massive API costs through spam or unbounded context windows. Furthermore, raw user input is vulnerable to jailbreaks (e.g., 'ignore previous instructions and execute code').
Solution Implemented a multi-tier model routing strategy (chat vs reasoning), robust context truncation, regex-based jailbreak detection, and strict timestamp-based rate limiting.
llm, security, nodejs, architecture
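The strict timestamp-based rate limiting described above can be sketched as a per-user sliding window that keeps recent request timestamps and rejects requests once the window is full. This is a minimal in-memory illustration (the production backend is Node.js; names and limits here are invented).

```python
# Sliding-window rate limiter keyed on per-user request timestamps.
import time
from collections import defaultdict, deque

WINDOW_S = 60        # window length (assumed)
MAX_REQUESTS = 10    # requests allowed per window (assumed)

_history = defaultdict(deque)   # user_id -> timestamps of recent requests

def allow_request(user_id, now=None):
    now = time.monotonic() if now is None else now
    q = _history[user_id]
    while q and now - q[0] > WINDOW_S:   # drop timestamps outside the window
        q.popleft()
    if len(q) >= MAX_REQUESTS:
        return False
    q.append(now)
    return True

# 10 requests pass; the 11th inside the same window is rejected.
results = [allow_request("u1", now=t) for t in range(11)]
print(results.count(True), results[-1])
```

Bounding requests this way caps worst-case API spend per user per window, which is the cost-runaway concern the Issue describes; context truncation then bounds the cost per individual request.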
4 min read

Building a Multilingual AI Backend for Part Recognition

How to handle multi-language AI queries to provide accurate predictions and generate tailored localized search queries in a serverless environment.

AI
Issue The backend AI needed to recognize user intent and categorize vehicle parts accurately regardless of the input language, and subsequently generate both localized predictive maintenance responses and tailored affiliate search queries.
Solution Implemented comprehensive multi-language keyword dictionaries, extracted user language context directly from client requests, and used mapping dictionaries to serve localized response templates.
nodejs, multilingual, llm, architecture
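The mapping-dictionary approach can be sketched as a language-keyed template lookup with an English fallback: detect the request language, then serve both a localized response template and a localized search query. The keys, templates, and language codes below are illustrative stand-ins, not the production dictionaries (which are in Node.js).

```python
# Dictionary-driven localization: language -> response and search templates.

TEMPLATES = {
    "en": {"brake_pads": "Your brake pads may need replacement.",
           "search": "buy brake pads {model}"},
    "de": {"brake_pads": "Ihre Bremsbeläge müssen eventuell ersetzt werden.",
           "search": "Bremsbeläge kaufen {model}"},
}

def localized_response(lang, part_key, model):
    entry = TEMPLATES.get(lang, TEMPLATES["en"])   # fall back to English
    return entry[part_key], entry["search"].format(model=model)

msg, query = localized_response("de", "brake_pads", "Golf VII")
print(query)
```

Keeping the search query a template per language, rather than machine-translating one English query, is what makes the affiliate queries match what users in each locale actually type.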
4 min read

Slashing LLM API Costs with System Prompt Caching

How to structure LLM requests for prompt caching (when supported) to reduce repeated system-prompt input costs.

AI
Issue LLM APIs charge per input token. When you send a 1,000-token system prompt alongside a 50-token user question, you pay for 1,050 tokens every time, even though 95% of the payload never changes between requests.
Solution Restructured the API payload to isolate static system instructions so the backend can take advantage of cached-input pricing or prompt caching features where the provider supports it.
llm, cost-optimization, architecture, caching
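The savings arithmetic is worth making explicit: with cached-input pricing, the static system prefix is billed at a discounted rate on repeat requests, so only the small user turn pays full price. The 90% cache discount and per-token price below are placeholder assumptions; check your provider's actual cached-input pricing.

```python
# Cost arithmetic for cached system prompts (all prices are placeholders).

SYSTEM_TOKENS, USER_TOKENS = 1000, 50
PRICE_PER_TOKEN = 0.00001          # $ per input token (assumed)
CACHE_DISCOUNT = 0.9               # cached tokens cost 10% of normal (assumed)

def cost_without_cache():
    return (SYSTEM_TOKENS + USER_TOKENS) * PRICE_PER_TOKEN

def cost_with_cache():
    cached = SYSTEM_TOKENS * PRICE_PER_TOKEN * (1 - CACHE_DISCOUNT)
    return cached + USER_TOKENS * PRICE_PER_TOKEN

saving = 1 - cost_with_cache() / cost_without_cache()
print(f"per-request saving: {saving:.0%}")
```

The structural requirement is the same across providers that support this: the static instructions must form a byte-identical prefix of the request, so anything dynamic (timestamps, user names) has to move after the cached portion.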