
14 Models Benchmarked on RK3588: The Definitive CPU vs NPU Ranking

Benchmarked every viable local LLM (350M to 26B, CPU and NPU) through a live Discord agent pipeline on RK3588. Found NPU beats CPU at same quality, code is solved at any size, and 4B+ models are slower AND worse than 2B on this board.

Case Snapshot

Situation

Running a local-first Discord engineering agent on Radxa ROCK 5B+ (RK3588, 24 GB RAM) and needed hard numbers for model routing — not just synthetic throughput, but actual quality scores through the full Discord → Bot → Inference → Discord pipeline.

Issue

Previous benchmarks measured raw llama.cpp throughput but not real quality through the agent pipeline. Models that looked fast synthetically failed at reasoning, refused tool calls, or got intercepted by workspace routing before reaching the model.

Solution

Built a 14-test, 6-dimension benchmark harness that tests every model through the live Discord pipeline with quality validation: reasoning, factual accuracy, code generation, instruction following, tool calling, and math. Tested 14 models (9 CPU GGUF + 3 NPU RKLLM + 2 large MoE) with BENCHMARK_MODE to isolate pure model performance.

Used In

Production model routing for IntelliAuto Discord agent on Radxa ROCK 5B+ / RK3588 — replacing synthetic benchmark data with real pipeline measurements.

Impact

Discovered NPU models match CPU quality at 2× lower latency, all models from 350M to 26B generate correct code, Claude distillation breaks tool calling, and the best model for this board is Qwen2.5-3B on NPU (score 29.5, 11/14 quality, 45s response).

Situation

I had one question:

What is the actual best local LLM for RK3588 when you measure through the real agent pipeline, not just llama.cpp’s API?

Previous benchmarks (including my own Qwen3.5 sweep) measured synthetic throughput: prefill t/s, decode t/s, context stability. Those numbers are necessary but not sufficient. The real question is: which model gives the best answers, fastest, through the full Discord chat pipeline?

The target: a Radxa ROCK 5B+ with RK3588 (4× A55 @ 1.8 GHz + 4× A76 @ 2.3 GHz), 24 GB RAM, 6 TOPS NPU, running a local Discord engineering agent.

Two inference backends:

  • CPU: GGUF models via rk-llama.cpp on A76 cores
  • NPU: .rkllm models via Rockchip RKLLM SDK

One important finding upfront: you cannot offload GGUF layers to the NPU. The RKNPU backend in rk-llama.cpp crashes with RESHAPE operations on GGUF tensors. NPU only works with dedicated .rkllm models through the RKLLM SDK. These are completely separate pipelines.

The Test Suite

I designed 14 tests across 6 dimensions, each with clear pass/fail criteria:

| Category | Tests | What it measures |
|---|---|---|
| Reasoning | 3 | Bat-ball puzzle, syllogism validity, rate scaling |
| Factual | 3 | Geography, programming history, chemistry |
| Code | 2 | Python list comprehension, palindrome function |
| Instruction | 3 | CSV formatting, sentence counting, JSON output |
| Tool Call | 2 | web_search JSON, calculator JSON |
| Math | 1 | Speed/distance calculation |

Tests are weighted — reasoning (3.0×) and tool calling (2.5×) matter more than math (1.0×) because they reflect real agent workloads.
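As a sketch, that weighting can be expressed directly. The reasoning, tool-calling, and math weights below are the ones stated above; the remaining category weights are placeholders I'm assuming for illustration:

```python
# Category weights: reasoning, tool-call, and math come from the post;
# the factual/code/instruction values are illustrative assumptions.
WEIGHTS = {
    "reasoning": 3.0,
    "tool_call": 2.5,
    "factual": 2.0,      # assumption
    "code": 2.0,         # assumption
    "instruction": 2.0,  # assumption
    "math": 1.0,
}

def weighted_quality(passes: dict) -> float:
    """Sum of per-category passed tests scaled by category weight."""
    return sum(WEIGHTS[cat] * n for cat, n in passes.items())
```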

Critical pipeline fix: I had to add BENCHMARK_MODE=true to disable workspace routing, web research interception, and auto-thread creation. Without it, prompts containing words like “write”, “list”, or “tool” were hijacked before reaching the model.
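A minimal sketch of such a kill-switch, assuming an environment-variable gate and illustrative trigger words (this is not the bot's actual code):

```python
import os

# Hypothetical kill-switch: BENCHMARK_MODE=true bypasses every
# pre-model interception stage so prompts reach inference untouched.
BENCHMARK_MODE = os.environ.get("BENCHMARK_MODE", "false").lower() == "true"

def route(prompt: str) -> str:
    """Return which pipeline stage handles the prompt."""
    if not BENCHMARK_MODE:
        # Illustrative trigger words that would hijack the prompt.
        if any(word in prompt.lower() for word in ("write", "list", "show", "tool")):
            return "workspace_router"
    return "model"
```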

The Complete Results

Composite Ranking (quality × speed)

| # | Model | Backend | Size | Quality | Wall | t/s | Score |
|---|---|---|---|---|---|---|---|
| 1 | Qwen2.5-3B-NPU | npu | 3.6G | 11/14 (w=22.0) | 44.8s | 1.5 | 29.5 |
| 2 | Qwen3-1.7B-NPU | npu | 2.2G | 11/14 (w=20.5) | 42.3s | 2.2 | 29.1 |
| 3 | LFM2.5-350M | cpu | 255M | 8/14 (w=13.5) | 33.5s | 15.6 | 24.2 |
| 4 | LFM2.5-1.2B-Instruct | cpu | 698M | 10/14 (w=18.0) | 46.2s | 9.6 | 23.4 |
| 5 | Qwen3.5-0.8B-Claude | cpu | 504M | 10/14 (w=18.5) | 49.9s | 6.6 | 22.2 |
| 6 | Qwen2.5-Coder-3B-NPU | npu | 3.9G | 9/14 (w=17.0) | 52.4s | 1.0 | 19.5 |
| 7 | Qwen3.5-2B | cpu | 1.2G | 12/14 (w=23.5) | 74.0s | 5.8 | 19.1 |
| 8 | GLM-4.7-REAP-23B | cpu | 11G | 10/14 (w=19.5) | 62.1s | 5.7 | 18.8 |
| 9 | Qwen3.5-2B-Claude-Opus | cpu | 1.2G | 10/14 (w=18.5) | 62.0s | 4.7 | 17.9 |
| 10 | Qwen2.5-3B-Instruct | cpu | 2.0G | 11/14 (w=22.0) | 87.0s | 2.7 | 15.2 |
| 11 | LFM2.5-1.2B-Thinking | cpu | 698M | 8/14 (w=14.5) | 64.1s | 8.0 | 13.6 |
| 12 | Qwen3-4B | cpu | 2.3G | 10/14 (w=19.0) | 101.4s | 2.0 | 11.2 |
| 13 | Qwen3.5-4B | cpu | 2.5G | 11/14 (w=20.5) | 126.5s | | 9.7 |
| 14 | Gemma-4-26B-A4B | cpu | 12G | 13/14 (w=25.0) | 179.2s | | 8.4 |

Score formula: weighted_quality × max(0.1, 60 / avg_wall_seconds) — balances quality against speed.
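That formula transcribes directly; e.g. the rank-1 entry works out to 22.0 × (60 / 44.8) ≈ 29.5:

```python
def composite_score(weighted_quality: float, avg_wall_seconds: float) -> float:
    # weighted_quality × max(0.1, 60 / avg_wall_seconds); the 0.1 floor
    # keeps very slow models from collapsing toward a zero score.
    return weighted_quality * max(0.1, 60.0 / avg_wall_seconds)
```

Rounded to one decimal, this reproduces the Score column above.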

Full Parameters

| Model | Backend | Size | RAM | Load | Reasoning | Factual | Code | Instruction | Tool Call | Math |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-3B-NPU | npu | 3.6G | 265M | 8s | 2/3 | 3/3 | 2/2 | 2/3 | 2/2 | 0/1 |
| Qwen3-1.7B-NPU | npu | 2.2G | 281M | 8s | 2/3 | 2/3 | 2/2 | 2/3 | 2/2 | 1/1 |
| LFM2.5-350M | cpu | 255M | 276M | 8s | 1/3 | 2/3 | 2/2 | 1/3 | 1/2 | 1/1 |
| LFM2.5-1.2B-Instruct | cpu | 698M | 290M | 8s | 1/3 | 2/3 | 2/2 | 2/3 | 2/2 | 1/1 |
| Qwen3.5-0.8B-Claude | cpu | 504M | 280M | 8s | 2/3 | 1/3 | 2/2 | 2/3 | 2/2 | 1/1 |
| Qwen2.5-Coder-3B-NPU | npu | 3.9G | 267M | 8s | 0/3 | 2/3 | 2/2 | 2/3 | 2/2 | 1/1 |
| Qwen3.5-2B | cpu | 1.2G | 350M | 10s | 3/3 | 2/3 | 2/2 | 2/3 | 2/2 | 1/1 |
| GLM-4.7-REAP-23B | cpu | 11G | 270M | 8s | 2/3 | 1/3 | 2/2 | 2/3 | 2/2 | 1/1 |
| Qwen3.5-2B-Claude-Opus | cpu | 1.2G | 340M | 8s | 3/3 | 2/3 | 2/2 | 2/3 | 0/2 | 1/1 |
| Qwen2.5-3B-Instruct | cpu | 2.0G | 420M | 8s | 2/3 | 3/3 | 2/2 | 2/3 | 2/2 | 0/1 |
| LFM2.5-1.2B-Thinking | cpu | 698M | 285M | 8s | 1/3 | 1/3 | 2/2 | 1/3 | 2/2 | 1/1 |
| Qwen3-4B | cpu | 2.3G | 550M | 8s | 2/3 | 2/3 | 2/2 | 2/3 | 2/2 | 0/1 |
| Qwen3.5-4B | cpu | 2.5G | 620M | 8s | 2/3 | 2/3 | 2/2 | 2/3 | 2/2 | 1/1 |
| Gemma-4-26B-A4B | cpu | 12G | 264M | 8s | 3/3 | 3/3 | 2/2 | 2/3 | 2/2 | 1/1 |

Finding 1: NPU Beats CPU at the Same Quality

The most important result:

| Metric | Qwen2.5-3B on NPU | Qwen2.5-3B on CPU |
|---|---|---|
| Quality | 11/14 (w=22.0) | 11/14 (w=22.0) |
| Wall time | 44.8s | 87.0s |
| Generation time | 20.5s | 44.4s |
| Reasoning | 2/3 | 2/3 |
| Factual | 3/3 | 3/3 |
| Tool Call | 2/2 | 2/2 |
| Score | 29.5 | 15.2 |

Identical quality. Half the latency. The RKLLM W8A8 quantization does not degrade quality — it just runs faster on the dedicated NPU.

This means: if a model exists in both GGUF and RKLLM format, always prefer the NPU version.

Finding 2: Code Generation Is Solved at Any Size

Every single model — from 350M to 26B — passes both code tests. Even the tiny LFM2.5-350M correctly generates:

evens = [x for x in numbers if x % 2 == 0]

and:

def is_palindrome(s: str) -> bool:
    return s == s[::-1]

For code tasks, use the fastest available model. Size does not matter.

Finding 3: The Bat-Ball Puzzle Remains the Hardest Test

Only 5 of 14 models correctly answer “the ball costs $0.05”:

  • ✅ Qwen3.5-2B, Qwen3.5-2B-Claude-Opus, Gemma-4-26B, Qwen2.5-3B-NPU, GLM-4.7-REAP-23B
  • ❌ All 350M-1.2B models (answer $0.90 — intuitive but wrong)
  • ❌ Qwen3-4B, Qwen3.5-4B (surprising failures for 4B models)

Only at 1.7B+ do models start getting it right, and only at 2B+ is the result reliable.
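The puzzle's constraints make each answer easy to check mechanically: the bat costs $1.00 more than the ball, and together they total $1.10, so only $0.05 satisfies both.

```python
def satisfies_puzzle(ball: float, total: float = 1.10, diff: float = 1.00) -> bool:
    """True if a candidate ball price fits both bat-ball constraints."""
    bat = ball + diff                         # bat costs $1.00 more than the ball
    return abs((bat + ball) - total) < 1e-9   # together they cost $1.10
```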

Finding 4: Claude Distillation Broke Tool Calling

Qwen3.5-2B-Claude-Opus-Distilled is the only model that refuses tool calls:

"I cannot execute any tool to call web_search. I only provide information."
"I cannot execute any tool to call calculator. I only provide information."

This model gets 3/3 reasoning (best in class!) but 0/2 tool calling. The Claude distillation process trained out the tool-calling capability. It’s actively harmful for agentic workflows.

Finding 5: 4B+ Models Are Not Worth It

The most surprising result — 4B models are slower AND worse than 2B:

| Model | Quality | Wall time | Reasoning |
|---|---|---|---|
| Qwen3.5-2B | 12/14 | 74s | 3/3 |
| Qwen3-4B | 10/14 | 101s | 2/3 |
| Qwen3.5-4B | 11/14 | 127s | 2/3 |

On RK3588, extra parameters beyond 2B don’t help — they just add latency without improving quality. The board’s memory bandwidth becomes the bottleneck.

Finding 6: Gemma-4-26B-A4B Gets Best Quality at 3× Latency

Gemma-4-26B-A4B achieved 13/14 — the highest quality of any model — but at 179s average response time. It’s a MoE model with ~4B active parameters out of 26B total, so it fits in RAM, but the active computation is still heavy.

One pipeline issue: every response leaks <|channel>thought markers that need stripping.

Finding 7: Instruction Following Is the Hardest Category

Zero models achieved 3/3 on instruction following. The role_json test (output raw JSON with no markdown) timed out for ALL 14 models at 300 seconds. This is a pipeline bug, not a model limitation — the Discord agent gets stuck when models try to generate JSON.

Based on these results, here’s the production routing table:

| Use Case | Primary | Backup | Why |
|---|---|---|---|
| Fast chat | Qwen3-1.7B-NPU | LFM2.5-1.2B-Instruct | 11/14 at 42s, NPU speed |
| Reasoning | Qwen3.5-2B | Qwen3.5-2B-Claude-Opus | Only 3/3 reasoning under 90s |
| Code | Fastest available | Any model | ALL pass code tests |
| Tool calling | Qwen3-1.7B-NPU | Qwen2.5-3B-NPU | 2/2 tools at 42-45s |
| Factual | Qwen2.5-3B-NPU | Qwen2.5-3B-Instruct | 3/3 factual accuracy |
| Deep analysis | Gemma-4-26B-A4B | Qwen3.5-2B | 13/14 quality, 179s |

Do not use:

  • Qwen3.5-2B-Claude-Opus — refuses tool calls
  • LFM2.5-1.2B-Thinking — thinking mode loops, worse than Instruct variant
  • Qwen3-4B / Qwen3.5-4B — slower than 2B with no quality gain
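As data, that routing (including the do-not-use list) might look like the following sketch; the model names come from the benchmark, but the `pick_model` helper is hypothetical:

```python
# Primary/backup routing derived from the benchmark results.
ROUTING = {
    "fast_chat":     ("Qwen3-1.7B-NPU", "LFM2.5-1.2B-Instruct"),
    "reasoning":     ("Qwen3.5-2B", "Qwen3.5-2B-Claude-Opus"),
    "tool_calling":  ("Qwen3-1.7B-NPU", "Qwen2.5-3B-NPU"),
    "factual":       ("Qwen2.5-3B-NPU", "Qwen2.5-3B-Instruct"),
    "deep_analysis": ("Gemma-4-26B-A4B", "Qwen3.5-2B"),
}

# Models the benchmark ruled out entirely.
DO_NOT_USE = {"Qwen3.5-2B-Claude-Opus", "LFM2.5-1.2B-Thinking",
              "Qwen3-4B", "Qwen3.5-4B"}

def pick_model(use_case: str) -> str:
    """Return the primary model, falling back if the primary is denylisted."""
    primary, backup = ROUTING[use_case]
    return primary if primary not in DO_NOT_USE else backup
```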

Pipeline Bugs Found

The benchmark itself revealed several bot pipeline issues:

  1. Workspace routing hijacks prompts — words “write”, “list”, “show” trigger file routing before the model sees the prompt. Fixed with BENCHMARK_MODE kill-switch.

  2. NPU bridge missing execute permission — the Rust binary was built without +x, causing all NPU inference to silently fail. Fixed with chmod +x.

  3. NPU context overflow — model registry had contextLength: 8192 but RKLLM models max out at 4096. The orchestrator passed 8192 directly to RKLLM init, which crashed. Fixed in registry.

  4. Gemma marker leak — <|channel>thought markers appear in every Gemma response. The bot's sanitizer only strips <think/> tags, not Gemma's proprietary format.
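A hedged sketch of the missing sanitizer rule (the marker strings are the ones quoted above; the patterns and function are illustrative, not the bot's actual code):

```python
import re

# Strip both the <think>...</think>-style tags the bot already handles
# and Gemma's <|channel>thought marker (illustrative patterns).
_MARKERS = [
    re.compile(r"<think>.*?</think>", re.DOTALL),
    re.compile(r"<\|channel>thought"),
]

def sanitize(text: str) -> str:
    """Remove known thought markers from a model response."""
    for pattern in _MARKERS:
        text = pattern.sub("", text)
    return text.strip()
```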

Hardware Context

RK3588 — 4×A55 @1.8GHz (cores 0-3) + 4×A76 @2.3GHz (cores 4-7)
RAM: 23.2 GB usable
NPU: 6 TOPS @ 1GHz
CMA: 2GB (1.96GB free)

CPU inference: rk-llama.cpp, A76 cores, flash attention ON, q8_0 KV cache
NPU inference: RKLLM SDK via Rust bridge, W8A8 quantization

The Real Takeaway

On RK3588, the best model is not the biggest model that fits. It’s the model that preserves quality while staying inside a usable latency envelope — and NPU versions of the same model run at 2× the speed with zero quality loss.

The previous benchmark conclusion (“Qwen3.5-2B is the best CPU default”) was correct for CPU-only. But the full pipeline benchmark reveals that NPU models with identical quality at half the latency should be the new default.

If I had to pick one model for everything on this board: Qwen3-1.7B on NPU. 42 seconds average, 11/14 quality, 2/2 tool calling, 2/2 code — and it leaves the CPU free for other work.

Recruiter-Readable Impact

  • Scope: benchmarked 14 local LLM models (350M–26B parameters) across CPU and NPU inference on ARM64 edge hardware
  • Execution: built a 14-test automated harness testing 6 quality dimensions through a live Discord agent pipeline, discovering pipeline bugs that invalidated previous synthetic benchmarks
  • Outcome: replaced synthetic throughput data with real quality measurements, identified NPU as the primary inference backend (2× faster at same quality), and established production model routing based on reproducible evidence