← Back to posts

RX 7800 XT 16GB: Running 35B MoE at 128K Context with llama.cpp + ROCm

Full benchmark data on running MoE and dense LLMs on AMD consumer hardware — quantization comparison, power cap analysis, KV cache tuning, and context limits on 16GB VRAM.

Case Snapshot

Situation

Built a dedicated AI inference node to run local models 24/7 via Tailscale, keeping the desktop free for coding and gaming. The constraint: a single RX 7800 XT 16GB with ROCm as the only viable backend (vLLM and SGLang have no ROCm wheels for gfx1101).

Issue:

Consumer GPUs have hard VRAM ceilings. Running 23-35B parameter models on 16GB requires aggressive quantization, KV cache compression, and precise build flags. The noise-to-signal ratio in online benchmarking is high — most people test on NVIDIA, not AMD RDNA3, and few test MoE architectures with context windows above 32K.

Solution:

Systematically benchmarked 8+ models across 5 quantization levels, swept GPU power caps from 30W to 190W, tested 3 KV cache configurations, and pushed context limits to 256K. Documented the exact llama.cpp build flags and runtime parameters that make 128K inference on 16GB VRAM stable and fast.

Used In:

Dedicated AI inference node (Ryzen 7 3800X + RX 7800 XT) running llama.cpp with HIPBLAS + rocWMMA FlashAttention. Models served via systemd services, accessed remotely through Tailscale.

Impact:

IQ quants outperform K-quants by 3x on RDNA3. Power cap can be reduced by >50% with no throughput loss on MoE models. 256K context proven on 16GB with MoE + IQ2_M. Established a repeatable configuration for 128K always-on inference.

Hardware

  • CPU: AMD Ryzen 7 3800X (8C/16T, 2.8 GHz powersave)
  • GPU: Sapphire RX 7800 XT 16GB (gfx1101, RDNA3, 60 CUs, 623 GiB/s GDDR6)
  • RAM: 16GB DDR4
  • Storage: 1TB NVMe (LVM: 500G models, 100G inference, 100G root, 16G swap)
  • PSU: 850W Gold
  • OS: Ubuntu 24.04.4 LTS, kernel 6.17.0-35-generic
  • Network: Ethernet, Tailscale (100.123.29.5)

Why ROCm and not CUDA

vLLM has no ROCm wheels for gfx1101 on PyPI. SGLang only publishes Docker images for MI300X/MI350. Building vLLM from source with ROCm takes 4+ hours and fails frequently. llama.cpp with HIPBLAS is the only production-viable backend for consumer AMD GPUs. This decision was made after trying and failing with both alternatives.

llama.cpp Build Configuration

Compiled from source targeting gfx1101 only:

cmake -B build -DGGML_HIP=ON \
  -DGPU_TARGETS=gfx1101 \
  -DGGML_HIPBLAS=ON \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DGGML_HIP_GRAPHS=ON \
  -DGGML_HIP_MMQ_MFMA=ON \
  -DGGML_HIP_NO_VMM=ON \
  -DGGML_HIP_EXPORT_METRICS=OFF \
  -DCMAKE_BUILD_TYPE=Release

Each flag:

  • GGML_HIPBLAS=ON — HIPBLAS backend (AMD’s equivalent of cuBLAS)
  • GPU_TARGETS=gfx1101 — compile only for RX 7800 XT architecture
  • GGML_HIP_ROCWMMA_FATTN=ON — FlashAttention via AMD rocWMMA matrix cores
  • GGML_HIP_GRAPHS=ON — HIP graph capture to reduce kernel launch overhead
  • GGML_HIP_MMQ_MFMA=ON — MFMA-based matrix-matrix multiplication for RDNA3
  • GGML_HIP_NO_VMM=ON — disables virtual memory management (avoids overhead on RDNA3)

ROCm version: 6.4.4 (HIP 6.4.43484, AMD clang 19.0.0)

ROCm Environment

The systemd service sets these environment variables:

Environment=ROCM_PATH=/opt/rocm-6.4.4
Environment=HSA_ENABLE_GFX_VERSION=gfx1101
Environment=PATH=/opt/rocm-6.4.4/bin:/usr/local/bin:/usr/bin:/bin
Environment=LD_LIBRARY_PATH=/opt/rocm-6.4.4/lib:/opt/rocm-6.4.4/lib/llvm/lib

Benchmark Server Flags (constant across all runs)

--cache-type-k q8_0 --cache-type-v q4_0 --flash-attn on \
--parallel 1 -b 512 -ub 128 --mlock

This is the discovered sweet spot for 23-28B MoE on this card. The default KV cache flags (q8_0 / q4_0) were used for benchmarking; the daily running service uses q8_0 / q8_0 to maximize quality at the cost of some headroom.

Phase 1 — Model and Quantization Comparison (190W cap)

ModelQuantSizectxVRAM idlePrefill t/sGen t/sLong prefill (180 tok)
GLM-4.7-REAP-23BIQ4_XS12.6 GB32K13.93 GB81.759.8691.7
GLM-4.7-REAP-23BIQ4_NL13.3 GB32K14.60 GB77.654.0692.1
GLM-4.7-REAP-23BUD-IQ3_XXS9.4 GB65K12.32 GB47.852.6614.4
Qwen3.6-REAP-28BIQ3_XXS11 GB32K11.90 GB70.848.7435.3
qwen3.6-27b4bpw13 GB16K14.25 GB35.719.4202.1
qwopus-27bQ3_K_L14 GB32K15.68 GB43.716.8205.0
gemma-4-21BQ4_K_M13.8 GB32K14.85 GB105.933.3320.6
gemma-4-26B-A4BIQ2_M9.4 GB65K11.30 GB103.235.9326.7

Key findings

IQ quants dominate K-quants on RDNA3. Every K-quant tested is slower and uses more VRAM than a smaller IQ variant. Q3_K_L at 14 GB delivers 16.8 t/s generation; UD-IQ3_XXS at 9.4 GB delivers 52.6 t/s. Same GPU, 35% less VRAM, 3x the speed. The IQ decompression kernels for gfx1101 are better optimized in llama.cpp than the K-quant codepaths.

Sweet spot: GLM-4.7-REAP-23B in IQ4_XS. Best generation speed (59.8 t/s), strong prefill (81.7 t/s), 32K context, 14 GB VRAM with 2.4 GB headroom.

Gemma-4-21B Q4_K_M has the fastest prefill (105.9 t/s) — the 4B active parameter count in the MoE pays off for prompt-heavy workloads, but generation drops to 33 t/s.

GLM-4.7 in UD-IQ3_XXS unlocks 65K context at 12.32 GB VRAM (3.7 GB headroom). Best long-context option in the benchmark.

Phase 2 — GPU Power Cap Sweep (GLM-4.7 IQ4_XS)

Power capPrefill t/sGen t/sLong prefillWatts (load)
190 W81.759.8691.780 W
160 W79.159.8680.480 W
130 W85.059.8680.881 W
100 W84.259.6681.281 W

The GPU never exceeds ~81W actual draw during MoE inference. The 190W power cap is far above what MoE workloads demand.

Phase 2b — Ultra-Fine Power Cap Sweep

Power capPrefill t/sGen t/sTTFTWatts (load)T_junc (load)
130 W77.857.90.50s56 W53°C
100 W85.957.80.48s58 W54°C
80 W83.957.80.49s56 W55°C
60 W85.858.00.48s59 W56°C
45 W85.957.80.48s60 W57°C
30 W85.157.80.48s57 W57°C

Below 60W, the GPU never enters a throttling regime — actual load hovers at 56-60W regardless of cap. The realistic safe floor is 60-80W: below that, there is no power savings (the GPU draws 56W minimum during inference) and no performance gain.

For daily operation, 80W is the recommended cap — identical throughput with thermal headroom for transient spikes. The earlier 190W default was wasteful for this workload class.

Phase 2c — Thinking Mode ON vs OFF

Qwen3.6 and GLM-4.7 ship with thinking mode enabled in the chat template. The impact:

ModelModeTTFTGen t/sThink wordsContent wordsWatts (load)
GLM-4.7 IQ4_XS--reasoning off0.48s58.201457 W
GLM-4.7 IQ4_XSthinking ON1.25s59.731085 W
GLM-4.7 UD-IQ3_XXS--reasoning off0.65s51.001457 W
GLM-4.7 UD-IQ3_XXSthinking ON1.49s52.637093 W

With thinking ON, the model burns the entire max_tokens budget on <think/> blocks. The visible content field is empty. A 64-token generation for “write a haiku” produced 31-37 reasoning tokens and 0 actual haiku.

--reasoning off is the correct fix. The -rb 0 / --reasoning-budget 0 flag does NOT work — Qwen3.6 still emits reasoning_content and produces 0 content even with budget 0.

Phase 3 — KV Cache Type Combinations

K typeV typeVRAM idlePrefillGenLong prefill
f16f1614.76 GB78.260.8678.8
q8_0f1613.93 GB85.159.7679.5
q8_0q4_013.93 GB81.759.8691.7
q8_0q8_013.93 GB79.259.5670.3
q4_0q4_013.93 GB68.956.3549.5
q4_0f1613.93 GB60.155.8473.6

q8_0 for K with q4_0 for V is the best throughput/compression tradeoff. FP16 KV uses 0.83 GB more VRAM for no speed benefit. q4_0/q4_0 saves the same VRAM but loses 8-10 t/s.

For daily operation with 128K context and maximum quality, q8_0/q8_0 is used (only 8.6 t/s loss vs the benchmark sweet spot, but full precision on V).

Context Limits on 16GB VRAM

All tests used --flash-attn on --cache-type-k q4_0 --cache-type-v q4_0 --mlock.

ModelFile sizeParams active128K (4 slots)160K (1 slot)192K (1 slot)256K (1 slot)
Gemma-4-26B-A4B IQ2_M~9.3 GB4B (MoE)
Qwen3.6-27B 4bpw~13 GB27B dense❌ OOM❌ OOM
Qwen3-14B-128K Q4_K_S~8 GB14B denselikely ✅
Qwopus-27B Q3_K_L~14 GB27B + MTP❌ OOM
Qwopus-27B Q4_K_S~15 GB27B + MTP❌ OOM

How 256K works on 16GB

256K context is only achievable with:

  1. MoE architecture (Gemma-4: 26B total, 4B active per token) — small per-token KV footprint
  2. Aggressive weight quantization (IQ2_M ≈ 2-bit) — ~9.3GB model file
  3. Q4_0 KV cache — ~6GB KV at 256K
  4. Single slot (--parallel 1) — no duplicated KV
  5. Flash Attention — O(N) memory access

Total: ~9.3GB (weights) + ~6GB (KV) = ~15.3GB / 16GB. Extremely tight.

Why dense 27B fails at 192K

Qwen3.6-27B at 192K failed by only ~505 MiB. The OOM comes from compute buffer allocation, not just KV — llama.cpp reserves additional buffers for attention computation that scale with context length. No workaround except smaller model or lower KV cache bits.

Daily Running Configuration

The always-on model is Qwen3.6-35B-A3B UD-IQ3_XXS:

llama-server \
  --model /srv/models/gguf/Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf \
  --host 0.0.0.0 --port 8080 \
  --parallel 1 --n-gpu-layers 99 \
  --batch-size 512 --ubatch-size 128 \
  --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --ctx-size 131072 \
  --reasoning off \
  --api-key <key>

Why this model

  • 35B MoE with only 3B active params — 35B weights must fit in VRAM (any expert can activate), but per-token compute is light
  • IQ3_XXS is the most aggressive quant that stays coherent for code tasks
  • Q4_K_M and IQ4_XS are too large — not enough KV headroom for 128K context
  • IQ2_XXS degrades noticeably on code generation

Why each flag matters

  • --n-gpu-layers 99 — offloads everything to GPU, CPU handles orchestration only
  • --flash-attn on — O(N) attention, mandatory at 128K context
  • --cache-type-k/v q8_0 — quantized KV cache uses half the VRAM of FP16
  • --ctx-size 131072 — 128K context (model natively supports it)
  • --reasoning off — disables Qwen3.6 thinking mode (see Phase 2c)
  • --batch-size 512 --ubatch-size 128 — microbatch size tuned for RDNA3

Node automation

  • systemd service with Restart=on-failure, RestartSec=5
  • RGB monitor script: RAM sticks + motherboard LED show inference state (red=loaded, blue breathing=active, green=idle, dim gray=offline)
  • Sleep schedule: 01:00 Madrid, RTC wake at 07:30 Madrid (S3 deep sleep via systemd timer)
  • Tailscale recovery timer: checks daemon every 2 minutes, auto-restarts on failure
  • CPU: 2.8 GHz powersave governor
  • GPU: 190W cap via LACT (can be reduced to 80W per benchmark findings)

Speculative Decoding

Tested --spec-type ngram-simple on Qwen3.6 IQ3_XXS:

ModePrefill (req1 / req2)Gen (req1 / req2)
ngram-simple72.7 / 114.555.7 / 57.8
baseline60.2 / 121.456.6 / 57.7

No meaningful speedup. ngram finds too few repetitions in the model’s outputs. For real speculative gains, --md <draft-model> with a model from the same family is needed (e.g. Qwen2.5-3B as draft for Qwen3.6-27B), at the cost of additional VRAM and KV overhead.

Model Catalog

31 GGUF files on the node (423 GB total in /srv/models/gguf/):

ModelQuantSizeType
Qwen3.6-35B-A3BUD-IQ3_XXS13GMoE 35B/3B active
Qwen3.6-35B-A3BUD-IQ2_XXS11GMoE 35B/3B active
Qwen3.6-35B-A3B-MTPUD-IQ2_XXS12GMoE + multi-token prediction
Qwen3.6-27B4bpw13GDense 27B
Qwen3.6-27BIQ4_XS15GDense 27B
Qwen3.6-27BUD-IQ3_XXS12GDense 27B
Qwen3.6-28B-REAP20-A3BIQ3_XXS11GMoE 28B/3B active
Qwen3-Coder-30B-A3BIQ4_XS16GMoE 30B/3B active
Qwen3-Coder-NextIQ4_XS40GToo large for 16GB
Gemma-4-26B-A4BQAT-UD-Q4_K_XL14GMoE 26B/4B active
Gemma-4-26B-A4BIQ2_M9.4GMoE 26B/4B active
Gemma-4-21B-A4BREAP-IQ4_XS11GMoE 21B/4B active
Gemma-4-21B-A4BREAP-Q4_K_M13GMoE 21B/4B active
Gemma-4-12BVarious4.4-12GDense 12B
Gemma-4-12BQAT-UD-Q4_K_XL6.3GDense 12B
Gemma-4 E2B/E4BVarious2.5-4.4GEmbedding/small
GLM-4.7-Flash-REAP-23BIQ4_XS13GMoE 23B/3B active
GLM-4.7-Flash-REAP-23BIQ4_NL12GMoE 23B/3B active
GLM-4.7-Flash-REAP-23BUD-IQ3_XXS9.4GMoE 23B/3B active
Qwopus-27BQ3_K_L / Q4_K_S14-15GFinetune 27B
Qwopus3.6-27B-v2Q3_K_S / mmproj12G + 601MFinetune + vision

Lessons

  1. Build for your exact GPU architecture. GPU_TARGETS=gfx1101 produces faster code than a generic ROCm build. Every flag in the cmake config matters.

  2. IQ quants are better than K-quants on RDNA3. If you’re on AMD, test IQ variants first. The speed difference is not marginal — it’s 3x in some cases.

  3. Power caps are mostly irrelevant for MoE. MoE inference is bandwidth-bound, not compute-bound. The GPU idles at 56-60W during generation regardless of cap.

  4. Thinking mode is not free. On Qwen3.6 and GLM-4.7, it doubles power consumption, increases TTFT 2.5x, and produces empty output for short requests. Always benchmark with --reasoning off unless you specifically need chain-of-thought.

  5. 256K on 16GB is possible but painful. It requires a specific combination of MoE + IQ2_M + Q4_0 KV + single slot + Flash Attention. Useful as a ceiling test, not a daily config.

  6. Dense models hit a hard wall around 160K on 16GB. Qwen3.6-27B OOMs at 192K by only 505 MiB. No amount of tuning will change this — the compute buffer allocation scales with context length.

  7. llama.cpp is the only option for consumer AMD GPUs. Don’t waste time trying vLLM or SGLang on gfx1101. They don’t support it.