Publishing a practical local-AI control plane for llama.cpp: remote model loading, runtime tuning, streaming chat, and real REAP model serving on a Radxa ROCK 5B+.
Issue
Most local model UIs either abstract away the runtime details that actually matter on constrained hardware or assume desktop-class GPUs. On RK3588, that makes it hard to credibly tune context length, KV cache quantization, reasoning behavior, and model selection.
Solution
Built and published `llamacpp-workbench`, a remote llama.cpp workbench with explicit runtime controls, model presets, markdown chat rendering, streaming responses, and benchmark-backed defaults for REAP and dense GGUF models.
llama.cpp · rk3588 · radxa · rock-5b-plus
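The streaming-chat side of a workbench like this typically consumes llama-server's OpenAI-compatible `/v1/chat/completions` endpoint, which emits server-sent events. A minimal sketch of the client-side delta parsing (the function name and the canned payloads are illustrative, not code from the project):

```python
import json

def sse_deltas(lines):
    """Yield text deltas from OpenAI-style SSE chat-completion lines,
    the shape llama-server streams when stream=true is requested."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip comments / keep-alives
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            yield delta

# Canned server output for illustration:
raw = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    'data: [DONE]',
]
print("".join(sse_deltas(raw)))  # -> Hello
```

Rendering each delta as it arrives is what makes chat feel responsive even at single-digit tokens per second on this class of hardware.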
An advanced benchmark report for running Qwen3.5 locally on RK3588 with source-built llama.cpp: prefill speed, decode speed, stable context, tool-calling behavior, and the practical model choices that actually work on a Radxa ROCK 5B+.
Issue
The usual local-AI advice overemphasizes parameter count and underexplains bandwidth, context budget, KV cache policy, and interactive latency. On RK3588, that leads to bad defaults: models that technically load but feel broken in real chat and tool-calling workloads.
Solution
I ran a corrected Qwen3.5 sweep on RK3588 using source-built llama.cpp, quantized KV cache, and task-pass validation. Then I compared prefill, decode, stable context, average latency, and tool-calling behavior to determine the right model for each workload.
rk3588 · radxa · rock-5b-plus · llama.cpp
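The "bandwidth, not parameter count" point has a simple back-of-envelope form: at batch size 1, each decoded token streams roughly the whole weight tensor through memory, so decode speed is capped by memory bandwidth over model size. A sketch, with assumed (not measured) numbers:

```python
def est_decode_tps(weight_bytes, mem_bw_bytes_per_s):
    """Roofline-style upper bound on decode tokens/s: every generated
    token reads approximately all weights once from memory."""
    return mem_bw_bytes_per_s / weight_bytes

# Assumed example: a ~2.5 GB Q4 GGUF against ~25 GB/s of effective
# bandwidth (illustrative figures, not the report's measurements).
print(round(est_decode_tps(2.5e9, 25e9), 1))  # -> 10.0
```

Real throughput lands below this bound (cache effects, attention compute, KV reads), but the bound already explains why a larger model that "technically loads" can feel broken in interactive chat.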
An advanced guide to local GPU inference with llama.cpp: why bandwidth matters more than model fit, how hybrid GPU+CPU offload behaves on cards like the RTX 3060 and 5070, what quantization really means mathematically, and how to run it on Linux, Windows, and WSL.
Issue
Operators lacked a practical framework for choosing quantization, sizing VRAM budgets, deciding when CPU offload is acceptable, and understanding the difference between weight quantization and KV cache quantization. Windows-specific setup questions also created confusion around native builds versus WSL.
Solution
Documented the bandwidth-first model, explained hybrid offload behavior for 12 GB cards and mid-range modern GPUs, compared quantization choices such as Q4_K_M weights versus q4_0 KV cache, and provided concrete llama.cpp launch patterns for Linux, Windows, and WSL.
local-ai · llama.cpp · cuda · vram
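Sizing the VRAM budget for hybrid offload reduces to simple arithmetic: how many evenly-sized layers fit after reserving room for KV cache, activations, and framework overhead. A sketch of that calculation (function name, reserve size, and the example model shape are assumptions, not figures from the guide):

```python
def gpu_layers(vram_bytes, model_layers, model_bytes,
               reserve_bytes=1_500_000_000):
    """Estimate a value for llama.cpp's -ngl: number of transformer
    layers that fit on the GPU, assuming weights are spread evenly
    across layers and a fixed reserve for KV cache and overhead."""
    per_layer = model_bytes / model_layers
    fit = int((vram_bytes - reserve_bytes) // per_layer)
    return max(0, min(model_layers, fit))

# Assumed example: a 32-layer, ~13 GB quantized model on a 12 GB card.
print(gpu_layers(12e9, 32, 13e9))  # -> 25
```

The remaining layers run on the CPU; because decode is bandwidth-bound, even a handful of CPU-resident layers can dominate per-token latency, which is why the guide treats partial offload as acceptable only within limits.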
How I implemented hybrid per-layer KV cache quantization on RK3588 using insights from Google's TurboQuant research, achieving 17% better compression with zero quality loss.
Issue
Every time you message an AI chatbot, the model stores your conversation in temporary memory called the KV cache. On large models, this cache alone can consume 40 GB, more than the model itself. On a constrained edge device, that overhead is the difference between working and broken.
Solution
Implemented hybrid per-layer KV cache quantization inspired by Google's TurboQuant (ICLR 2026). By using 8-bit quantization for early transformer layers (where attention quality matters most) and 4-bit quantization for later layers, the scheme achieved 17% better compression without quality loss.
local-ai · edge-ai · turboquant · kv-cache
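The sizing behind a per-layer split is straightforward: the KV cache stores K and V tensors for every layer and context token, so mixing ~8-bit early layers with ~4-bit later layers buys a predictable saving. A sketch of that arithmetic (the model shape and 12/24 split are illustrative, not the post's actual configuration, and the byte counts ignore q8_0/q4_0 block-scale overhead):

```python
def kv_bytes(ctx, layers, kv_heads, head_dim, bytes_per_elem):
    # K and V tensors, per layer, per context token
    return 2 * layers * ctx * kv_heads * head_dim * bytes_per_elem

def hybrid_kv_bytes(ctx, layers, kv_heads, head_dim, early,
                    b_early=1.0, b_late=0.5):
    """Early layers at ~8-bit (1 B/elem), later layers at ~4-bit."""
    per_layer = 2 * ctx * kv_heads * head_dim
    return per_layer * (early * b_early + (layers - early) * b_late)

# Assumed shape: 8k context, 36 layers, 8 KV heads, head_dim 128.
ctx, layers, kv_heads, head_dim = 8192, 36, 8, 128
uniform8 = kv_bytes(ctx, layers, kv_heads, head_dim, 1.0)
hybrid = hybrid_kv_bytes(ctx, layers, kv_heads, head_dim, early=12)
print(f"saved vs uniform 8-bit: {1 - hybrid / uniform8:.0%}")
```

The quality argument is separate from the sizing argument: the split only works because later layers tolerate coarser quantization, which is the TurboQuant insight the post builds on.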
A deep dive into the cloud architectures, real-time data streaming capabilities, and Generative AI setups powering Formula 1 and Formula E in 2026.
Issue
Processing millions of high-velocity data points per second for immediate broadcast insights and race strategy required moving beyond traditional databases to highly decoupled, event-driven streaming architectures capable of sub-millisecond HTAP queries and GenAI integration.
Solution
A technical deep dive into F1's AWS 'Track Pulse' architecture utilizing Kinesis sharding and DynamoDB caching, compared with Formula E's GCP HTAP architecture leveraging Pub/Sub, AlloyDB's columnar engine, and Vertex AI for real-time coaching.
system-architecture · data-engineering · aws · google-cloud
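For the Kinesis side, shard count falls out of AWS's documented per-shard write limits: 1 MB/s or 1,000 records/s, whichever binds first. A sketch of the sizing (the telemetry load below is an assumed illustration, not F1's real numbers):

```python
import math

def shards_needed(records_per_s, avg_record_kb):
    """Minimum Kinesis shards for a write load, per the documented
    per-shard limits of 1 MB/s and 1,000 records/s."""
    by_throughput = math.ceil(records_per_s * avg_record_kb / 1024)
    by_records = math.ceil(records_per_s / 1000)
    return max(by_throughput, by_records)

# Assumed load: 50k small telemetry records/s of ~0.5 KB each.
print(shards_needed(records_per_s=50_000, avg_record_kb=0.5))  # -> 50
```

Note that with many small records the 1,000 records/s cap, not raw throughput, drives the shard count, which is one reason telemetry pipelines batch records before ingest.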
A benchmark-backed deep dive into the real RK3588 inference stack: llama.cpp CPU winners, NPU roles, KV cache choices, quantization tradeoffs, and how to think about 27B with GPU+CPU offload.
Issue
Several runtime paths technically loaded models but were not practically usable. Large models timed out or delivered poor latency, CPU tuning mattered more than expected, and the product narrative needed to shift from 'many runtimes' to a benchmark-backed llama.cpp-first architecture.
Solution
Benchmarked llama.cpp and RKLLM on RK3588, identified the winning CPU configs for Qwen 3.5 4B and 9B, clarified where the NPU helps, documented KV cache and quantization choices, and reframed the architecture as llama.cpp-first with NPU used selectively.
rk3588 · npu · llama.cpp · rkllm
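"Stable context" is ultimately a memory equation: given a fixed KV-cache budget, the cache type (llama.cpp's f16, q8_0, or q4_0) sets the per-token cost and therefore the largest context that fits. A sketch with an assumed 4B-class model shape and budget (illustrative, not the benchmark's exact figures; block-scale overhead of the quantized cache types is ignored):

```python
def max_context(mem_budget_bytes, layers, kv_heads, head_dim,
                cache_bytes_per_elem):
    """Largest context fitting a KV-cache budget: per-token cost is
    K+V across all layers at the chosen cache element size."""
    per_token = 2 * layers * kv_heads * head_dim * cache_bytes_per_elem
    return int(mem_budget_bytes // per_token)

# Assumed shape: 36 layers, 8 KV heads, head_dim 128, 2 GB KV budget.
for name, b in [("f16", 2.0), ("q8_0", 1.0), ("q4_0", 0.5)]:
    print(name, max_context(2e9, 36, 8, 128, b))
```

Halving the cache element size doubles the stable context for the same budget, which is why the cache-type choice shows up alongside model choice in the benchmark conclusions.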