Each post in this domain is written in case-study format: situation, issue, solution, usage context, and delivery impact.

10 min read

14 Models Benchmarked on RK3588: The Definitive CPU vs NPU Ranking

Benchmarked every viable local LLM (350M to 26B, CPU and NPU) through a live Discord agent pipeline on RK3588. Found that the NPU beats the CPU at equal quality, code generation is solved at every size tested, and 4B+ models are both slower and worse than 2B on this board.

Local AI
Issue Previous benchmarks measured raw llama.cpp throughput but not real quality through the agent pipeline. Models that looked fast in synthetic tests failed at reasoning, refused tool calls, or had their requests intercepted by workspace routing before ever reaching the model.
Solution Built a 14-test, 6-dimension benchmark harness that tests every model through the live Discord pipeline with quality validation: reasoning, factual accuracy, code generation, instruction following, tool calling, and math. Tested 14 models (9 CPU GGUF + 3 NPU RKLLM + 2 large MoE) with BENCHMARK_MODE to isolate pure model performance.
rk3588, radxa, rock-5b-plus, llama.cpp
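A multi-dimension harness like the one described can be sketched as a table of (dimension, prompt, validator) triples scored per model. This is a minimal illustration, not the post's actual 14-test suite; `run_model` stands in for the real Discord-pipeline call, and the prompts and validators are invented for the example.

```python
# Minimal sketch of a multi-dimension benchmark scorer.
# `run_model` is a stand-in for the real Discord-pipeline call.
from collections import defaultdict

TESTS = [
    # (dimension, prompt, validator) -- illustrative entries only
    ("math", "What is 17 * 23?", lambda out: "391" in out),
    ("instruction_following", "Reply with exactly the word OK.",
     lambda out: out.strip() == "OK"),
    ("factual", "What is the capital of France?", lambda out: "Paris" in out),
]

def score_model(run_model):
    """Return per-dimension pass rates for one model."""
    passed, total = defaultdict(int), defaultdict(int)
    for dim, prompt, validate in TESTS:
        total[dim] += 1
        if validate(run_model(prompt)):
            passed[dim] += 1
    return {dim: passed[dim] / total[dim] for dim in total}

# Example with a trivial stub "model":
stub = lambda p: {"What is 17 * 23?": "391"}.get(p, "Paris")
print(score_model(stub))
```

Validators on the output, rather than raw tokens/s, are what catch the failure modes named above: a model that streams quickly but fails the validator still scores zero on that dimension.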
6 min read

llamacpp-workbench: Remote llama.cpp Control and REAP Model Serving on RK3588

Publishing a practical local-AI control plane for llama.cpp: remote model loading, runtime tuning, streaming chat, and real REAP model serving on a Radxa ROCK 5B+.

Local AI
Issue Most local model UIs either abstract away the runtime details that actually matter on constrained hardware or assume desktop-class GPUs. On RK3588, that makes it harder to tune context, KV cache quantization, reasoning behavior, and model selection credibly.
Solution Built and published `llamacpp-workbench`, a remote llama.cpp workbench with explicit runtime controls, model presets, markdown chat rendering, streaming responses, and benchmark-backed defaults for REAP and dense GGUF models.
llama.cpp, rk3588, radxa, rock-5b-plus
15 min read

Qwen3.5 on RK3588 with llama.cpp: Real Benchmarks from a Radxa ROCK 5B+

An advanced benchmark report for running Qwen3.5 locally on RK3588 with source-built llama.cpp: prefill speed, decode speed, stable context, tool-calling behavior, and the practical model choices that actually work on a Radxa ROCK 5B+.

Local AI
Issue The usual local-AI advice overemphasizes parameter count and underexplains bandwidth, context budget, KV cache policy, and interactive latency. On RK3588, that leads to bad defaults: models that technically load but feel broken in real chat and tool-calling workloads.
Solution I ran a corrected Qwen3.5 sweep on RK3588 using source-built llama.cpp, quantized KV cache, and task-pass validation. Then I compared prefill, decode, stable context, average latency, and tool-calling behavior to determine the right model for each workload.
rk3588, radxa, rock-5b-plus, llama.cpp
15 min read

GPU VRAM, CPU Offload, and llama.cpp: The Real Performance Cliff

An advanced guide to local GPU inference with llama.cpp: why bandwidth matters more than model fit, how hybrid GPU+CPU offload behaves on cards like the RTX 3060 and 5070, what quantization really means mathematically, and how to run it on Linux, Windows, and WSL.

Local AI
Issue Operators lacked a practical framework for choosing quantization, sizing VRAM budgets, deciding when CPU offload is acceptable, and understanding the difference between weight quantization and KV cache quantization. Windows-specific setup questions also created confusion around native builds versus WSL.
Solution Documented the bandwidth-first model, explained hybrid offload behavior for 12 GB and mid-range modern GPUs, compared quantization choices such as Q4_K_M and q4_0 KV cache, and provided concrete llama.cpp launch patterns for Linux, Windows, and WSL.
local-ai, llama.cpp, cuda, vram
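The "what quantization really means mathematically" part can be sketched with the arithmetic behind symmetric 4-bit block quantization, in the spirit of llama.cpp's q4_0 (32-weight blocks, one fp16 scale per block). This is only the arithmetic, not the actual q4_0 bit layout, and the toy weights are invented for illustration.

```python
# Symmetric 4-bit block quantization: each block of 32 weights shares
# one scale; each weight is stored as a signed 4-bit integer.

def quantize_block(weights):               # weights: 32 floats
    scale = max(abs(w) for w in weights) / 7 or 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return scale, q

def dequantize_block(scale, q):
    return [scale * v for v in q]

block = [0.12, -0.4, 0.33, 0.05] * 8       # 32 toy weights
scale, q = quantize_block(block)
restored = dequantize_block(scale, q)
max_err = max(abs(a - b) for a, b in zip(block, restored))

# Storage: 32 4-bit ints + one fp16 scale = 18 bytes -> 4.5 bits/weight
bits_per_weight = (32 * 4 + 16) / 32
print(bits_per_weight, round(max_err, 4))
```

The key takeaway matches the bandwidth-first framing: the value of 4-bit weights is not just that the model fits, but that every decoded token reads roughly 4.5 bits per weight instead of 16, which directly multiplies decode speed on bandwidth-bound hardware.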
3 min read

Implementing Google's TurboQuant: Hybrid KV Cache for Edge LLM Deployment

How I implemented hybrid per-layer KV cache quantization on RK3588 using insights from Google's TurboQuant research, achieving 17% better compression with zero quality loss.

Local AI
Issue Every time you message an AI chatbot, the model stores your conversation in temporary memory called the KV cache. On large models, this cache alone can consume 40GB—more than the model itself. On a constrained edge device, this is the difference between working and broken.
Solution Implemented hybrid per-layer KV cache quantization inspired by Google's TurboQuant (ICLR 2026). By using 8-bit quantization for early transformer layers (where attention quality matters most) and 4-bit quantization for later layers, we achieved 17% better compression without quality loss.
local-ai, edge-ai, turboquant, kv-cache
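The sizing arithmetic behind a hybrid per-layer policy can be sketched directly: 8-bit cache on early layers, 4-bit on the rest, compared against a uniform 8-bit baseline. The model shape and layer split below are assumptions for illustration, not the post's exact configuration, so the saving printed here will differ from the post's 17% figure.

```python
# Illustrative KV-cache sizing for a hybrid per-layer quantization policy.

def kv_cache_bytes(layers, ctx, n_kv_heads, head_dim, bits):
    # K and V each store ctx * n_kv_heads * head_dim values per layer
    return layers * 2 * ctx * n_kv_heads * head_dim * bits // 8

LAYERS, CTX, KV_HEADS, HEAD_DIM = 28, 4096, 8, 128   # assumed model shape
EARLY = 10                                            # 8-bit layers (assumed)

uniform_8bit = kv_cache_bytes(LAYERS, CTX, KV_HEADS, HEAD_DIM, 8)
hybrid = (kv_cache_bytes(EARLY, CTX, KV_HEADS, HEAD_DIM, 8)
          + kv_cache_bytes(LAYERS - EARLY, CTX, KV_HEADS, HEAD_DIM, 4))

saving = 1 - hybrid / uniform_8bit
print(f"uniform 8-bit: {uniform_8bit / 2**20:.0f} MiB, "
      f"hybrid: {hybrid / 2**20:.0f} MiB, saving: {saving:.0%}")
```

The design choice mirrors the post's rationale: early layers keep 8-bit precision where attention quality matters most, and the later layers absorb the 4-bit compression.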
9 min read

RK3588 NPU Router Architecture: What Actually Runs, What Wins, and Why

A benchmark-backed deep dive into the real RK3588 inference stack: llama.cpp CPU winners, NPU roles, KV cache choices, quantization tradeoffs, and how to think about 27B with GPU+CPU offload.

Local AI
Issue Several paths technically loaded but were not practically usable. Large models timed out or delivered poor latency, CPU tuning mattered more than expected, and the product narrative needed to shift from 'many runtimes' to a benchmark-backed llama.cpp-first architecture.
Solution Benchmarked llama.cpp and RKLLM on RK3588, identified the winning CPU configs for Qwen 3.5 4B and 9B, clarified where the NPU helps, documented KV cache and quantization choices, and reframed the architecture as llama.cpp-first with NPU used selectively.
rk3588, npu, llama.cpp, rkllm
8 min read

Azure Provisioned Throughput: When Fixed Costs Beat Pay-Per-Token

Why we moved from Pay-As-You-Go to Provisioned Throughput Units (PTU) for our Azure OpenAI workloads—and how to know if it makes sense for you.

AI
Issue As your application scales, Microsoft's default rate limits can throttle your service, leading to slow responses and inconsistent user experiences. You're essentially stuck in traffic during peak hours.
Solution Think of it like a toll road. Standard use is like paying per mile, but you're stuck in traffic. Azure's Provisioned Throughput (PTU) is like renting your own dedicated express lane. We built a framework to calculate the exact financial break-even point between the two models.
azure, llm, cost-optimization, infrastructure
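The break-even framework mentioned above reduces to simple arithmetic: the monthly token volume at which reserved capacity undercuts per-token billing. The prices below are placeholders, not Azure's actual PTU or pay-as-you-go rates; substitute your own contract numbers.

```python
# Break-even sketch between pay-per-token and provisioned throughput.
# All prices below are placeholders, not real Azure rates.

PAYG_PER_1K_TOKENS = 0.01        # $ per 1K tokens, blended in+out (assumed)
PTU_MONTHLY_COST = 8000.0        # $ per month for reserved capacity (assumed)

def breakeven_tokens_per_month():
    """Monthly token volume above which PTU is cheaper than pay-as-you-go."""
    return PTU_MONTHLY_COST / PAYG_PER_1K_TOKENS * 1000

def cheaper_option(tokens_per_month):
    payg = tokens_per_month / 1000 * PAYG_PER_1K_TOKENS
    return "PTU" if payg > PTU_MONTHLY_COST else "PAYG"

print(f"{breakeven_tokens_per_month():,.0f} tokens/month")
print(cheaper_option(1_000_000_000))   # e.g. 1B tokens/month
```

At these placeholder rates the break-even sits at 800M tokens per month; below that volume the dedicated "express lane" is paying for capacity you never use.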
6 min read

Edge LLM Optimization: Memory Bandwidth and Context Management

Lessons learned running LLMs on constrained hardware—why bandwidth matters more than capacity, how KV cache quantization helps, and context folding for long conversations.

Local AI
Issue Edge devices have hard constraints: limited RAM, no GPU VRAM, and strict latency requirements for interactive applications. The naive approach of 'make the model fit' failed repeatedly—either latency was too high or context windows would overflow during long conversations.
Solution Developed a three-pronged approach: (1) enforce bandwidth-first model selection, (2) use KV cache quantization to reduce memory footprint, and (3) implement hierarchical context folding for long conversations.
local-ai, edge-ai, llama.cpp, kv-cache
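Prong (1), bandwidth-first model selection, follows from a one-line estimate: decode speed is roughly memory bandwidth divided by the bytes read per token, which for dense models is approximately the quantized model size. The bandwidth figure and model sizes below are rough assumptions for illustration, not measured numbers.

```python
# Bandwidth-first sizing: an upper bound on decode speed from memory
# bandwidth and model size. Figures below are assumptions, not measurements.

def decode_tokens_per_sec(bandwidth_gb_s, model_gb, efficiency=0.6):
    """Upper-bound decode rate; `efficiency` covers real-world overheads."""
    return bandwidth_gb_s * efficiency / model_gb

BOARD_BW = 25.0   # GB/s, an LPDDR4x/LPDDR5-class assumption

for name, gb in [("2B @ Q4", 1.5), ("4B @ Q4", 2.5), ("9B @ Q4", 5.5)]:
    print(f"{name}: ~{decode_tokens_per_sec(BOARD_BW, gb):.1f} tok/s")
```

This is why "make the model fit" fails on edge hardware: a model that fits in RAM but needs 5.5 GB per token of reads cannot be interactive on a 25 GB/s bus, no matter how much capacity remains.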
14 min read

Training Custom AI Models for Insurance Document Processing

How to build, train, and deploy custom document intelligence models for extracting structured data from multilingual insurance policies using Azure AI Foundry.

AI
Issue Off-the-shelf OCR solutions couldn't handle the complexity of insurance documents. Different insurers used different layouts, multilingual support was limited, and extracted data needed to conform to a strict canonical schema for downstream systems.
Solution Implemented a custom document intelligence solution using Azure AI Document Intelligence, training models on labeled examples to extract and normalize fields across multiple insurers and languages.
azure-ai, document-intelligence, machine-learning, automation
9 min read

IntelliAuto: AI-Powered Automotive Assistant with Secure Monetization

Building an intelligent car maintenance companion with LLM-powered diagnostics, dynamic affiliate commerce, and defense-in-depth AI security.

Kotlin AI
Issue Existing automotive apps are passive logs. Adding AI creates risks: prompt injection through user input, data privacy concerns, API cost runaway, and potential for incorrect safety-critical advice.
Solution Designed IntelliAuto with AutoMind AI assistant featuring backend proxy architecture, multi-layer prompt injection prevention, dynamic affiliate link generation, and strict safety disclaimers for automotive advice.
android, ai, kotlin, mobile
3 min read

RK3588 LLM Performance: NPU vs CPU in a Discord Agent

Benchmarking local LLM inference on RK3588 and why NPU acceleration (RKLLM) is the difference between real-time chat and unusable latency.

Local AI
Issue CPU-only inference on small models was too slow for interactive UX, and some NPU model runs initially failed for non-runtime reasons (corrupted downloads or wrong target platform conversions).
Solution Benchmarked CPU (Ollama) vs NPU (RKLLM), applied system and inference parameter optimizations, and documented failure modes to distinguish model-file issues from NPU/runtime issues.
rk3588, npu, rkllm, ollama
4 min read

Modernizing Android UX: High Refresh Rates & App Shortcuts

How to request 90Hz/120Hz rendering and implement static deep-linked app shortcuts to improve mobile application usability.

Kotlin AI
Issue The app was locked to standard 60Hz rendering, causing sub-optimal scrolling experiences on devices capable of 90Hz or 120Hz. Additionally, users had to navigate through multiple screens to perform frequent actions.
Solution Detected 90Hz+ display modes and configured window post-processing preferences for smoother rendering, then implemented static XML-based app shortcuts routed via deep links.
android, ux, performance, kotlin
4 min read

Securing and Scaling AI Context in an Automotive Assistant

How to implement rate limiting, context window management, and prompt injection prevention for an LLM-powered mobile application backend.

AI Kotlin
Issue Directly exposing LLMs to users risks massive API costs through spam or unbounded context windows. Furthermore, raw user input is vulnerable to jailbreaks (e.g., 'ignore previous instructions and execute code').
Solution Implemented a multi-tier model routing strategy (chat vs reasoning), robust context truncation, regex-based jailbreak detection, and strict timestamp-based rate limiting.
llm, security, nodejs, architecture
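The strict timestamp-based rate limiting described above can be sketched as a per-user sliding window that keeps recent request timestamps and rejects requests once the window is full. This is a minimal in-memory illustration (the production backend is Node.js; names and limits here are invented).

```python
# Sliding-window rate limiter keyed on per-user request timestamps.
import time
from collections import defaultdict, deque

WINDOW_S = 60        # window length (assumed)
MAX_REQUESTS = 10    # requests allowed per window (assumed)

_history = defaultdict(deque)   # user_id -> timestamps of recent requests

def allow_request(user_id, now=None):
    now = time.monotonic() if now is None else now
    q = _history[user_id]
    while q and now - q[0] > WINDOW_S:   # drop timestamps outside the window
        q.popleft()
    if len(q) >= MAX_REQUESTS:
        return False
    q.append(now)
    return True

# 10 requests pass; the 11th inside the same window is rejected.
results = [allow_request("u1", now=t) for t in range(11)]
print(results.count(True), results[-1])
```

Bounding requests this way caps worst-case API spend per user per window, which is the cost-runaway concern the Issue describes; context truncation then bounds the cost per individual request.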
4 min read

Building a Multilingual AI Backend for Part Recognition

How to handle multi-language AI queries to provide accurate predictions and generate tailored localized search queries in a serverless environment.

AI
Issue The backend AI needed to recognize user intent and categorize vehicle parts accurately regardless of the input language, and subsequently generate both localized predictive maintenance responses and tailored affiliate search queries.
Solution Implemented comprehensive multi-language keyword dictionaries, extracted user language context directly from client requests, and used mapping dictionaries to serve localized response templates.
nodejs, multilingual, llm, architecture
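The mapping-dictionary approach can be sketched as a language-keyed template lookup with an English fallback: detect the request language, then serve both a localized response template and a localized search query. The keys, templates, and language codes below are illustrative stand-ins, not the production dictionaries (which are in Node.js).

```python
# Dictionary-driven localization: language -> response and search templates.

TEMPLATES = {
    "en": {"brake_pads": "Your brake pads may need replacement.",
           "search": "buy brake pads {model}"},
    "de": {"brake_pads": "Ihre Bremsbeläge müssen eventuell ersetzt werden.",
           "search": "Bremsbeläge kaufen {model}"},
}

def localized_response(lang, part_key, model):
    entry = TEMPLATES.get(lang, TEMPLATES["en"])   # fall back to English
    return entry[part_key], entry["search"].format(model=model)

msg, query = localized_response("de", "brake_pads", "Golf VII")
print(query)
```

Keeping the search query a template per language, rather than machine-translating one English query, is what makes the affiliate queries match what users in each locale actually type.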
4 min read

Slashing LLM API Costs with System Prompt Caching

How to structure LLM requests for prompt caching (when supported) to reduce repeated system-prompt input costs.

AI
Issue LLM APIs charge per input token. When you send a 1,000-token system prompt alongside a 50-token user question, you pay for 1,050 tokens every time, even though 95% of the payload never changes between requests.
Solution Restructured the API payload to isolate static system instructions so the backend can take advantage of cached-input pricing or prompt caching features where the provider supports it.
llm, cost-optimization, architecture, caching
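The savings arithmetic is worth making explicit: with cached-input pricing, the static system prefix is billed at a discounted rate on repeat requests, so only the small user turn pays full price. The 90% cache discount and per-token price below are placeholder assumptions; check your provider's actual cached-input pricing.

```python
# Cost arithmetic for cached system prompts (all prices are placeholders).

SYSTEM_TOKENS, USER_TOKENS = 1000, 50
PRICE_PER_TOKEN = 0.00001          # $ per input token (assumed)
CACHE_DISCOUNT = 0.9               # cached tokens cost 10% of normal (assumed)

def cost_without_cache():
    return (SYSTEM_TOKENS + USER_TOKENS) * PRICE_PER_TOKEN

def cost_with_cache():
    cached = SYSTEM_TOKENS * PRICE_PER_TOKEN * (1 - CACHE_DISCOUNT)
    return cached + USER_TOKENS * PRICE_PER_TOKEN

saving = 1 - cost_with_cache() / cost_without_cache()
print(f"per-request saving: {saving:.0%}")
```

The structural requirement is the same across providers that support this: the static instructions must form a byte-identical prefix of the request, so anything dynamic (timestamps, user names) has to move after the cached portion.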