Situation
On an edge device, latency is the product. A Discord bot that streams tokens back to users can feel instant or completely broken depending on the tokens/second the hardware can sustain.
On an RK3588 host (Radxa ROCK 5B), I ran a direct comparison between:
- CPU-only inference via Ollama
- NPU-accelerated inference via RKLLM
What The Benchmarks Showed
The same trivial prompt ("What is 2+2?") produced dramatically different throughput:
CPU (Ollama)
- Model: Qwen2.5 1.5B (CPU-only)
- Throughput: ~0.21 tokens/second
- End-to-end: ~37.7s for ~10 tokens
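For reference, a minimal way to reproduce this kind of CPU-only measurement with Ollama; the model tag is an assumption, and `--verbose` prints the eval rate in tokens/second:

```bash
# run the same trivial prompt against the CPU-only model and print timing stats
# (model tag assumed; use whatever tag was actually pulled on the device)
ollama run qwen2.5:1.5b "What is 2+2?" --verbose
```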
NPU (RKLLM)
- Model: Qwen3 4B (NPU)
- Throughput: ~5.5 tokens/second
- Model: DeepSeek-R1 1.5B W8A8 (NPU)
- Throughput: ~11.2 tokens/second
In the measured comparison table, that landed at roughly:
- 26x speedup (Qwen3 4B NPU vs Qwen2.5 1.5B CPU)
- 53x speedup (DeepSeek-R1 1.5B NPU vs Qwen2.5 1.5B CPU)
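Those multipliers are just ratios of the measured throughput figures; a quick sanity check:

```bash
# 5.5 / 0.21 and 11.2 / 0.21, rounded to whole multipliers
awk 'BEGIN { printf "Qwen3 4B NPU:        %.0fx\nDeepSeek-R1 1.5B NPU: %.0fx\n", 5.5/0.21, 11.2/0.21 }'
```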
Separating “NPU Problems” from Model File Problems
Two early failures were not NPU/runtime regressions:
- A Llama 3.1 8B W8A8 RKLLM file was 0 bytes (corrupted/incomplete download).
- A Llama2 13B conversion failed due to a `target_platform` mismatch (converted for the wrong device, not RK3588).
The practical lesson: when RKLLM reports “invalid model” or a “platform mismatch”, treat it as a model-artifact problem (bad download or wrong conversion target) before suspecting the NPU or the runtime.
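A quick pre-flight along these lines would have caught the zero-byte artifact before it ever looked like a runtime bug (the `./models` path is an assumption):

```bash
# flag empty .rkllm artifacts: a 0-byte file means the download or conversion never completed
find ./models -name '*.rkllm' -size 0 -print
# eyeball sizes; a multi-billion-parameter W8A8 model should be several GB
ls -lh ./models/*.rkllm
```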
System + Inference Optimizations That Matter
To make measurements stable and repeatable, the benchmark run documented these baseline controls:
- CPU governor set to `performance` (see the shell sketch after this list)
- system caches cleared (`drop_caches`) before loading large models
- sanity checks on available RAM and thermals during runs
- inference bounds tuned for the benchmark (`max_new_tokens`, `max_context_len`, generous timeouts)
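A minimal shell sketch of those baseline controls, assuming a stock Linux image on the ROCK 5B and the standard sysfs/procfs paths:

```bash
# pin every CPU cluster to the performance governor for the duration of the run
echo performance | sudo tee /sys/devices/system/cpu/cpufreq/policy*/scaling_governor

# flush page cache, dentries, and inodes before loading a large model
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

# sanity-check available RAM and SoC temperature (reported in millidegrees Celsius)
free -h
cat /sys/class/thermal/thermal_zone*/temp
```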
Takeaway
On RK3588, CPU-only inference can be too slow for interactive chat even with small models. NPU acceleration is the enabling factor: without it, the Discord UX degrades into 30+ second response times for trivial questions; with it, you can sustain streaming replies at multi-token-per-second throughput.
Benchmark Comparison Chart
This chart quantifies the RK3588 LLM performance delta. It illustrates the stark contrast between running Qwen2.5 (1.5B) on the CPU via Ollama (an unusable ~0.21 t/s) and offloading to the NPU via RKLLM. The NPU accelerates larger models (like Qwen3 4B and DeepSeek-R1 1.5B) to stable, interactive streaming speeds, reaching up to a 53x throughput multiplier over the CPU baseline.
Post-Specific Engineering Lens
For this post, the primary objective is to balance model quality with deterministic runtime constraints.
Implementation decisions for this case
- Chose a staged approach centered on the RK3588 to avoid high-blast-radius rollouts.
- Used NPU checkpoints to make regressions observable before full rollout.
- Treated RKLLM documentation as part of delivery, not a post-task artifact.
Practical command path
These are representative execution checkpoints relevant to this post:
```bash
# start llama.cpp's llama-server with a bounded context and a quantized KV cache
./llama-server --ctx-size <n> --cache-type-k q4_0 --cache-type-v q4_0
# confirm the server reports healthy before sending benchmark traffic
curl -s http://localhost:8080/health
# run the benchmark with the edge-focused profile
python benchmark.py --profile edge
```
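One way to chain those checkpoints so the benchmark only runs against a healthy server; the background launch and the polling loop are assumptions, not part of the original run:

```bash
# start the server in the background, wait for /health to return success, then benchmark
./llama-server --ctx-size <n> --cache-type-k q4_0 --cache-type-v q4_0 &
until curl -sf http://localhost:8080/health > /dev/null; do sleep 1; done
python benchmark.py --profile edge
```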
Validation Matrix
| Validation goal | What to baseline | What confirms success |
|---|---|---|
| Functional stability | RSS usage, token latency, and context utilization | runtime memory stays under planned ceiling during peak context |
| Operational safety | rollback ownership + change window | decode latency remains stable across repeated runs |
| Production readiness | monitoring visibility and handoff notes | fallback model/profile activates cleanly when pressure increases |
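To make the "runtime memory stays under the planned ceiling" check concrete, a sampling loop along these lines can run beside the benchmark; the process name is an assumption:

```bash
# sample resident memory of the inference server once per second while it is alive
pid=$(pgrep -f llama-server)
while kill -0 "$pid" 2>/dev/null; do
  ps -o rss= -p "$pid" | awk '{ printf "%.0f MiB RSS\n", $1 / 1024 }'
  sleep 1
done
```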
Failure Modes and Mitigations
| Failure mode | Why it appears in this type of work | Mitigation used in this post pattern |
|---|---|---|
| Over-allocated context | Memory pressure causes latency spikes or OOM | Tune ctx + cache quantization from measured baseline |
| Silent quality drift | Outputs degrade while latency appears fine | Track quality samples alongside perf metrics |
| Single-profile dependency | No graceful behavior under load | Define fallback profile and automatic failover rule |
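A sketch of what the automatic failover rule in the last row could look like; the reporting flag, latency budget, and fallback launcher are all hypothetical and would need to match the real benchmark and serving setup:

```bash
# if p95 decode latency breaches the budget, relaunch on the smaller fallback profile
p95_ms=$(python benchmark.py --profile edge --report p95-ms)   # hypothetical reporting flag
if [ "${p95_ms%.*}" -gt 500 ]; then                            # 500 ms budget is illustrative
  ./run_server.sh --profile fallback                           # hypothetical fallback launcher
fi
```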
Recruiter-Readable Impact Summary
- Scope: optimize local inference under strict memory budgets.
- Execution quality: guarded by staged checks and explicit rollback triggers.
- Outcome signal: repeatable implementation that can be handed over without hidden steps.