Publishing a practical local-AI control plane for llama.cpp: remote model loading, runtime tuning, streaming chat, and real REAP model serving on a Radxa ROCK 5B+.
Issue
Most local model UIs either abstract away the runtime details that actually matter on constrained hardware or assume desktop-class GPUs. On RK3588, that makes it hard to credibly tune context length, KV cache quantization, reasoning behavior, and model selection.
Solution
Built and published `llamacpp-workbench`, a remote llama.cpp workbench with explicit runtime controls, model presets, markdown chat rendering, streaming responses, and benchmark-backed defaults for REAP and dense GGUF models.
llama.cpp · rk3588 · radxa · rock-5b-plus
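The streaming-chat side of a workbench like this typically consumes llama-server's OpenAI-compatible `/v1/chat/completions` endpoint, which emits server-sent events. A minimal sketch of the client-side delta parsing (the function name and the canned payloads are illustrative, not code from the project):

```python
import json

def sse_deltas(lines):
    """Yield text deltas from OpenAI-style SSE chat-completion lines,
    the shape llama-server streams when stream=true is requested."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip comments / keep-alives
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            yield delta

# Canned server output for illustration:
raw = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    'data: [DONE]',
]
print("".join(sse_deltas(raw)))  # -> Hello
```

Rendering each delta as it arrives is what makes chat feel responsive even at single-digit tokens per second on this class of hardware.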
An advanced benchmark report for running Qwen3.5 locally on RK3588 with source-built llama.cpp: prefill speed, decode speed, stable context, tool-calling behavior, and the practical model choices that actually work on a Radxa ROCK 5B+.
Issue
The usual local-AI advice overemphasizes parameter count and underexplains bandwidth, context budget, KV cache policy, and interactive latency. On RK3588, that leads to bad defaults: models that technically load but feel broken in real chat and tool-calling workloads.
Solution
I ran a corrected Qwen3.5 sweep on RK3588 using source-built llama.cpp, quantized KV cache, and task-pass validation. Then I compared prefill, decode, stable context, average latency, and tool-calling behavior to determine the right model for each workload.
rk3588 · radxa · rock-5b-plus · llama.cpp
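The "bandwidth, not parameter count" point has a simple back-of-envelope form: at batch size 1, each decoded token streams roughly the whole weight tensor through memory, so decode speed is capped by memory bandwidth over model size. A sketch, with assumed (not measured) numbers:

```python
def est_decode_tps(weight_bytes, mem_bw_bytes_per_s):
    """Roofline-style upper bound on decode tokens/s: every generated
    token reads approximately all weights once from memory."""
    return mem_bw_bytes_per_s / weight_bytes

# Assumed example: a ~2.5 GB Q4 GGUF against ~25 GB/s of effective
# bandwidth (illustrative figures, not the report's measurements).
print(round(est_decode_tps(2.5e9, 25e9), 1))  # -> 10.0
```

Real throughput lands below this bound (cache effects, attention compute, KV reads), but the bound already explains why a larger model that "technically loads" can feel broken in interactive chat.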
An advanced guide to local GPU inference with llama.cpp: why bandwidth matters more than model fit, how hybrid GPU+CPU offload behaves on cards like the RTX 3060 and 5070, what quantization really means mathematically, and how to run it on Linux, Windows, and WSL.
Issue
Operators lacked a practical framework for choosing quantization, sizing VRAM budgets, deciding when CPU offload is acceptable, and understanding the difference between weight quantization and KV cache quantization. Windows-specific setup questions also created confusion around native builds versus WSL.
Solution
Documented the bandwidth-first model, explained hybrid offload behavior for 12 GB cards and mid-range modern GPUs, compared quantization choices such as Q4_K_M weights versus q4_0 KV cache, and provided concrete llama.cpp launch patterns for Linux, Windows, and WSL.
local-ai · llama.cpp · cuda · vram
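Sizing the VRAM budget for hybrid offload reduces to simple arithmetic: how many evenly-sized layers fit after reserving room for KV cache, activations, and framework overhead. A sketch of that calculation (function name, reserve size, and the example model shape are assumptions, not figures from the guide):

```python
def gpu_layers(vram_bytes, model_layers, model_bytes,
               reserve_bytes=1_500_000_000):
    """Estimate a value for llama.cpp's -ngl: number of transformer
    layers that fit on the GPU, assuming weights are spread evenly
    across layers and a fixed reserve for KV cache and overhead."""
    per_layer = model_bytes / model_layers
    fit = int((vram_bytes - reserve_bytes) // per_layer)
    return max(0, min(model_layers, fit))

# Assumed example: a 32-layer, ~13 GB quantized model on a 12 GB card.
print(gpu_layers(12e9, 32, 13e9))  # -> 25
```

The remaining layers run on the CPU; because decode is bandwidth-bound, even a handful of CPU-resident layers can dominate per-token latency, which is why the guide treats partial offload as acceptable only within limits.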
How I implemented hybrid per-layer KV cache quantization on RK3588 using insights from Google's TurboQuant research, achieving 17% better compression with zero quality loss.
Issue
Every time you message an AI chatbot, the model stores your conversation in temporary memory called the KV cache. On large models, this cache alone can consume 40 GB, more than the model itself. On a constrained edge device, that overhead is the difference between working and broken.
Solution
Implemented hybrid per-layer KV cache quantization inspired by Google's TurboQuant (ICLR 2026). By using 8-bit quantization for early transformer layers (where attention quality matters most) and 4-bit quantization for later layers, the scheme achieved 17% better compression without quality loss.
local-ai · edge-ai · turboquant · kv-cache
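The sizing behind a per-layer split is straightforward: the KV cache stores K and V tensors for every layer and context token, so mixing ~8-bit early layers with ~4-bit later layers buys a predictable saving. A sketch of that arithmetic (the model shape and 12/24 split are illustrative, not the post's actual configuration, and the byte counts ignore q8_0/q4_0 block-scale overhead):

```python
def kv_bytes(ctx, layers, kv_heads, head_dim, bytes_per_elem):
    # K and V tensors, per layer, per context token
    return 2 * layers * ctx * kv_heads * head_dim * bytes_per_elem

def hybrid_kv_bytes(ctx, layers, kv_heads, head_dim, early,
                    b_early=1.0, b_late=0.5):
    """Early layers at ~8-bit (1 B/elem), later layers at ~4-bit."""
    per_layer = 2 * ctx * kv_heads * head_dim
    return per_layer * (early * b_early + (layers - early) * b_late)

# Assumed shape: 8k context, 36 layers, 8 KV heads, head_dim 128.
ctx, layers, kv_heads, head_dim = 8192, 36, 8, 128
uniform8 = kv_bytes(ctx, layers, kv_heads, head_dim, 1.0)
hybrid = hybrid_kv_bytes(ctx, layers, kv_heads, head_dim, early=12)
print(f"saved vs uniform 8-bit: {1 - hybrid / uniform8:.0%}")
```

The quality argument is separate from the sizing argument: the split only works because later layers tolerate coarser quantization, which is the TurboQuant insight the post builds on.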
A deep dive into the cloud architectures, real-time data streaming capabilities, and Generative AI setups powering Formula 1 and Formula E in 2026.
Issue
Processing millions of high-velocity data points per second for immediate broadcast insights and race strategy required moving beyond traditional databases to highly decoupled, event-driven streaming architectures capable of sub-millisecond HTAP queries and GenAI integration.
Solution
A technical deep dive into F1's AWS 'Track Pulse' architecture utilizing Kinesis sharding and DynamoDB caching, compared with Formula E's GCP HTAP architecture leveraging Pub/Sub, AlloyDB's columnar engine, and Vertex AI for real-time coaching.
system-architecture · data-engineering · aws · google-cloud
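For the Kinesis side, shard count falls out of AWS's documented per-shard write limits: 1 MB/s or 1,000 records/s, whichever binds first. A sketch of the sizing (the telemetry load below is an assumed illustration, not F1's real numbers):

```python
import math

def shards_needed(records_per_s, avg_record_kb):
    """Minimum Kinesis shards for a write load, per the documented
    per-shard limits of 1 MB/s and 1,000 records/s."""
    by_throughput = math.ceil(records_per_s * avg_record_kb / 1024)
    by_records = math.ceil(records_per_s / 1000)
    return max(by_throughput, by_records)

# Assumed load: 50k small telemetry records/s of ~0.5 KB each.
print(shards_needed(records_per_s=50_000, avg_record_kb=0.5))  # -> 50
```

Note that with many small records the 1,000 records/s cap, not raw throughput, drives the shard count, which is one reason telemetry pipelines batch records before ingest.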
A benchmark-backed deep dive into the real RK3588 inference stack: llama.cpp CPU winners, NPU roles, KV cache choices, quantization tradeoffs, and how to think about 27B with GPU+CPU offload.
Issue
Several runtime paths technically loaded models but were not practically usable. Large models timed out or delivered poor latency, CPU tuning mattered more than expected, and the product narrative needed to shift from 'many runtimes' to a benchmark-backed llama.cpp-first architecture.
Solution
Benchmarked llama.cpp and RKLLM on RK3588, identified the winning CPU configs for Qwen 3.5 4B and 9B, clarified where the NPU helps, documented KV cache and quantization choices, and reframed the architecture as llama.cpp-first with NPU used selectively.
rk3588 · npu · llama.cpp · rkllm
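"Stable context" is ultimately a memory equation: given a fixed KV-cache budget, the cache type (llama.cpp's f16, q8_0, or q4_0) sets the per-token cost and therefore the largest context that fits. A sketch with an assumed 4B-class model shape and budget (illustrative, not the benchmark's exact figures; block-scale overhead of the quantized cache types is ignored):

```python
def max_context(mem_budget_bytes, layers, kv_heads, head_dim,
                cache_bytes_per_elem):
    """Largest context fitting a KV-cache budget: per-token cost is
    K+V across all layers at the chosen cache element size."""
    per_token = 2 * layers * kv_heads * head_dim * cache_bytes_per_elem
    return int(mem_budget_bytes // per_token)

# Assumed shape: 36 layers, 8 KV heads, head_dim 128, 2 GB KV budget.
for name, b in [("f16", 2.0), ("q8_0", 1.0), ("q4_0", 0.5)]:
    print(name, max_context(2e9, 36, 8, 128, b))
```

Halving the cache element size doubles the stable context for the same budget, which is why the cache-type choice shows up alongside model choice in the benchmark conclusions.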