RK3588 LLM Performance: NPU vs CPU in a Discord Agent

Situation

On an edge device, latency is the product. A Discord bot that streams tokens back to users can feel instant or completely broken depending on the tokens/second the hardware can sustain.

On an RK3588 host (Radxa ROCK 5B), I ran a direct comparison between:

CPU-only inference via Ollama
NPU-accelerated inference via RKLLM

What The Benchmarks Showed

The same trivial prompt ("What is 2+2?") produced dramatically different throughput:

CPU (Ollama)

Model: Qwen2.5 1.5B (CPU-only)
Throughput: ~0.21 tokens/second
End-to-end: ~37.7s for ~10 tokens

NPU (RKLLM)

Model: Qwen3 4B (NPU)
Throughput: ~5.5 tokens/second
Model: DeepSeek-R1 1.5B W8A8 (NPU)
Throughput: ~11.2 tokens/second

In the measured comparison table, that landed at roughly:

26x speedup (Qwen3 4B NPU vs Qwen2.5 1.5B CPU)
53x speedup (DeepSeek-R1 1.5B NPU vs Qwen2.5 1.5B CPU)
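
The multipliers follow directly from the throughput figures above. A minimal sketch of the arithmetic; the constant names and the `speedup` helper are illustrative and not taken from the benchmark scripts:

```python
# Throughput figures (tokens/second) as reported in this post.
CPU_BASELINE_TPS = 0.21  # Qwen2.5 1.5B, CPU-only via Ollama

NPU_TPS = {
    "Qwen3 4B": 5.5,
    "DeepSeek-R1 1.5B W8A8": 11.2,
}


def speedup(tps: float, baseline_tps: float = CPU_BASELINE_TPS) -> float:
    """Return how many times faster `tps` is than the baseline."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return tps / baseline_tps


if __name__ == "__main__":
    for model, tps in NPU_TPS.items():
        print(f"{model}: {speedup(tps):.0f}x over the CPU baseline")
```

The comparison crosses models as well as hardware, so the multipliers describe the two end-to-end setups rather than isolating the NPU's contribution.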

Separating “NPU Problems” from Model File Problems

Two early failures were not NPU/runtime regressions:

A Llama 3.1 8B W8A8 RKLLM file was 0 bytes (corrupted/incomplete download).
A Llama2 13B conversion failed due to target_platform mismatch (converted for the wrong device, not RK3588).

The practical lesson: when RKLLM says “invalid model” or “platform mismatch”, treat it as an artifact problem first.
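
A cheap pre-flight check catches the 0-byte case before the runtime is ever involved. The sketch below is an assumption about how such a check could look; `check_model_artifact` and its `min_bytes` threshold are hypothetical, and it does not parse the RKLLM format or verify the target platform:

```python
from pathlib import Path


def check_model_artifact(path: str, min_bytes: int = 1) -> str:
    """Return "ok" or a short reason the file is not a usable artifact.

    Only catches missing, empty, or suspiciously small files.
    """
    p = Path(path)
    if not p.is_file():
        return "missing"
    size = p.stat().st_size
    if size == 0:
        return "empty (0 bytes): likely an incomplete download"
    if size < min_bytes:
        return f"too small ({size} bytes, expected at least {min_bytes})"
    return "ok"
```

Setting `min_bytes` to a rough lower bound for the expected model size also flags truncated downloads that are not fully empty. A platform mismatch still has to be fixed by reconverting the model for RK3588.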

System + Inference Optimizations That Matter

To make measurements stable and repeatable, the benchmark run documented these baseline controls:

CPU governor set to performance
system caches cleared (drop_caches) before loading large models
sanity checks on available RAM and thermals during runs
inference bounds tuned for the benchmark (max_new_tokens, max_context_len, generous timeouts)
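
The read-only part of these controls can be scripted so every run records its starting conditions. This is a sketch under assumptions: the paths are the usual Linux sysfs/procfs locations, but the thermal zone index varies by board and kernel, and the function names are hypothetical. Setting the governor and writing to /proc/sys/vm/drop_caches need root and are left out.

```python
from pathlib import Path

# Usual Linux locations; the thermal zone index is an assumption.
GOVERNOR_PATH = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"
MEMINFO_PATH = "/proc/meminfo"
THERMAL_PATH = "/sys/class/thermal/thermal_zone0/temp"


def parse_mem_available_mib(meminfo_text: str) -> float:
    """Extract MemAvailable from /proc/meminfo contents, in MiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) / 1024  # value is reported in kB
    raise ValueError("MemAvailable not found")


def parse_temp_celsius(raw: str) -> float:
    """Convert a thermal zone reading (millidegrees C) to degrees C."""
    return int(raw.strip()) / 1000


def baseline_report() -> dict:
    """Read the current governor, available RAM, and temperature."""
    return {
        "governor": Path(GOVERNOR_PATH).read_text().strip(),
        "mem_available_mib": parse_mem_available_mib(Path(MEMINFO_PATH).read_text()),
        "temp_c": parse_temp_celsius(Path(THERMAL_PATH).read_text()),
    }
```

Logging `baseline_report()` before and after each run makes it easier to discard results taken while the governor was not `performance` or while the board was running hot.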

Takeaway

On RK3588, CPU-only inference can be too slow for interactive chat even on small models. NPU acceleration is the enabling factor: without it, the Discord UX degrades into 30+ second response times for trivial questions; with it, the bot can sustain streaming replies at roughly 5 to 11 tokens/second.
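
To check whether a given backend clears the interactive bar inside the bot, throughput can be measured at the point where tokens are consumed. A minimal sketch, assuming the backend client exposes the reply as an iterable of token strings; `measure_stream` and its return keys are hypothetical names:

```python
import time
from typing import Callable, Iterable


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume a token stream and report count, first-token latency, and tokens/second."""
    start = clock()
    first_token_at = None
    count = 0
    for _ in tokens:
        if first_token_at is None:
            first_token_at = clock()  # latency until the first token arrives
        count += 1
    elapsed = clock() - start
    return {
        "tokens": count,
        "time_to_first_token_s": None if first_token_at is None else first_token_at - start,
        "elapsed_s": elapsed,
        "tokens_per_second": count / elapsed if elapsed > 0 else 0.0,
    }
```

Time to first token is reported separately because a user's sense of responsiveness in chat depends on it as much as on sustained throughput.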

Benchmark Comparison Chart

RK3588 NPU vs CPU Inference Benchmark

This chart quantifies the RK3588 LLM performance delta. It contrasts Qwen2.5 (1.5B) on the CPU via Ollama, at an unusable ~0.21 tokens/second, with models offloaded to the NPU via RKLLM. On the NPU, the larger Qwen3 4B reaches ~5.5 tokens/second and DeepSeek-R1 1.5B reaches ~11.2 tokens/second, up to a 53x throughput multiplier over the CPU baseline.

Post-Specific Engineering Lens

For this post, the primary objective is to balance model quality with deterministic runtime constraints.

Implementation decisions for this case

Chose a staged approach centered on the RK3588 to avoid high-blast-radius rollouts.
Used NPU checkpoints to make regressions observable before full rollout.
Treated RKLLM documentation as part of delivery, not a post-task artifact.

Practical command path

These are representative execution checkpoints relevant to this post:

./llama-server --ctx-size <n> --cache-type-k q4_0 --cache-type-v q4_0
curl -s http://localhost:8080/health
python benchmark.py --profile edge

Validation Matrix

Validation goal	What to baseline	What confirms success
Functional stability	RSS usage, token latency, and context utilization	runtime memory stays under planned ceiling during peak context
Operational safety	rollback ownership + change window	decode latency remains stable across repeated runs
Production readiness	monitoring visibility and handoff notes	fallback model/profile activates cleanly when pressure increases
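
"Decode latency remains stable across repeated runs" can be turned into a concrete pass/fail check. The sketch below uses the coefficient of variation (standard deviation divided by mean); the 10% threshold and the function name are assumptions, not values from the benchmark:

```python
from statistics import mean, pstdev


def latency_is_stable(latencies_s: list[float], max_cv: float = 0.10) -> bool:
    """True if the coefficient of variation (stdev / mean) is within max_cv."""
    if len(latencies_s) < 2:
        raise ValueError("need at least two runs to judge stability")
    avg = mean(latencies_s)
    if avg <= 0:
        raise ValueError("latencies must be positive")
    return pstdev(latencies_s) / avg <= max_cv
```

A relative measure is used so the same threshold applies to both the slow CPU runs and the fast NPU runs.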

Failure Modes and Mitigations

Failure mode	Why it appears in this type of work	Mitigation used in this post pattern
Over-allocated context	Memory pressure causes latency spikes or OOM	Tune ctx + cache quantization from measured baseline
Silent quality drift	Outputs degrade while latency appears fine	Track quality samples alongside perf metrics
Single-profile dependency	No graceful behavior under load	Define fallback profile and automatic failover rule
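
The fallback rule in the last row can be as simple as picking the most capable profile that fits in the memory currently available. This is a sketch only: the profile names, context lengths, and memory thresholds are hypothetical placeholders, not the values used on the ROCK 5B.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    max_context_len: int
    min_free_mib: int  # free RAM required before this profile is allowed


# Hypothetical profiles, ordered from most to least demanding.
PROFILES = (
    Profile("primary", max_context_len=4096, min_free_mib=4096),
    Profile("fallback", max_context_len=1024, min_free_mib=1536),
)


def select_profile(free_mib: float, profiles: tuple = PROFILES) -> Profile:
    """Pick the first profile whose memory requirement is met.

    Falls through to the last (smallest) profile when nothing fits,
    so the caller always gets a profile back.
    """
    for profile in profiles:
        if free_mib >= profile.min_free_mib:
            return profile
    return profiles[-1]
```

Evaluating the rule before each model load, rather than once at startup, lets the bot step down when memory pressure rises instead of failing with an OOM.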

Recruiter-Readable Impact Summary

Scope: optimize local inference under strict memory budgets.
Execution quality: guarded by staged checks and explicit rollback triggers.
Outcome signal: repeatable implementation that can be handed over without hidden steps.
