Structured Engineering Case Studies

Linux & Virtualization Engineering Portfolio

CURRENT ROLE: Linux & Virtualization Engineer, Deutsche Pfandbriefbank AG · Madrid
PUBLISHED CASES: 46 technical deep dives
LAST UPDATE: Apr 1, 2026 · Archive-first publishing


Production Spotlight

App Store →
● Live on Google Play & Web

📱 IntelliFlow: AI Budget Tracker

A production-grade personal finance application serving real users. Features an AI-powered financial coach, offline-first architecture, and cross-platform syncing.


Featured Projects

View GitHub →

llamacpp-workbench

Local LLM inference workbench for RK3588 and edge devices

Python · JavaScript

IntelliFlow

AI-powered personal finance app with offline-first architecture

Flutter · Production

Ansible Playbooks

Infrastructure automation for enterprise Linux environments

Ansible · YAML

Recent Case Studies

All Posts →
11 min read

Bonsai-8B: Extreme Quantization and the Binary Neural Network Paradigm

A deep technical breakdown of 1-bit quantization in LLMs using the Bonsai-8B model. Exploring binary neural networks, inline dequantization kernels, and achieving 14x compression with minimal quality loss.

Local AI
Situation: Deploying LLMs on edge devices with severe memory constraints. Standard FP16 models require 16GB+ VRAM, pricing most users out of running capable models locally.
Used In: Edge LLM deployment, RK3588 inference, high-throughput serving
local-ai · quantization · bonsai · binary-neural-networks
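
The 14x compression figure can be sanity-checked with back-of-envelope arithmetic. This is a hedged sketch only: the effective ~1.14 bits/weight below is implied by the 14x claim (binary weights plus scale/metadata overhead), not taken from the actual Bonsai-8B packing scheme.

```python
# Illustrative memory-footprint arithmetic for an 8B-parameter model.
# The bits-per-weight values are assumptions, not Bonsai-8B internals.

def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Raw weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

fp16 = model_size_gb(8, 16)         # matches the "16GB+ VRAM" figure for FP16
binary = model_size_gb(8, 16 / 14)  # ~1.14 effective bits/weight -> 14x smaller

print(f"FP16:   {fp16:.1f} GB")
print(f"~1-bit: {binary:.2f} GB  ({fp16 / binary:.0f}x compression)")
```

At roughly a gigabyte of weights, an 8B model fits comfortably in the RAM of a low-cost edge board, which is the deployment target the case study describes.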
5 min read

llamacpp-workbench: Remote llama.cpp Control and REAP Model Serving on RK3588

Publishing a practical local-AI control plane for llama.cpp: remote model loading, runtime tuning, streaming chat, and real REAP model serving on a Radxa ROCK 5B+.

Local AI
Situation: I wanted a serious remote control surface for local GGUF inference on a Radxa ROCK 5B+, instead of one-off shell commands or generic UIs that hide the important llama.cpp knobs.
Used In: Local-first AI serving on a Radxa ROCK 5B+ / RK3588 using source-built llama.cpp and GGUF models, including GLM-4.7-Flash-REAP-23B-A3B.
llama.cpp · rk3588 · radxa · rock-5b-plus
13 min read

Qwen3.5 on RK3588 with llama.cpp: Real Benchmarks from a Radxa ROCK 5B+

An advanced benchmark report for running Qwen3.5 locally on RK3588 with source-built llama.cpp: prefill speed, decode speed, stable context, tool-calling behavior, and the practical model choices that actually work on a Radxa ROCK 5B+.

Local AI
Situation: I was tuning a local-first Discord engineering agent on a Radxa ROCK 5B+ (RK3588, 24 GB RAM) and needed hard data on which Qwen3.5 models were actually practical for CPU inference with llama.cpp.
Used In: Local-first Discord agent runtime on a Radxa ROCK 5B+ / RK3588, built around raw llama.cpp rather than Ollama or LM Studio.
rk3588 · radxa · rock-5b-plus · llama.cpp
15 min read

GPU VRAM, CPU Offload, and llama.cpp: The Real Performance Cliff

An advanced guide to local GPU inference with llama.cpp: why bandwidth matters more than model fit, how hybrid GPU+CPU offload behaves on cards like the RTX 3060 and 5070, what quantization really means mathematically, and how to run it on Linux, Windows, and WSL.

Local AI
Situation: Teams running local models on consumer GPUs often assume that if a model loads, it is production-ready. In practice, once model layers or KV cache spill from VRAM into system RAM, the system hits a bandwidth cliff and throughput collapses.
Used In: Local-first AI engineering runtimes and workstation inference setups using llama.cpp on consumer NVIDIA GPUs.
local-ai · llama.cpp · cuda · vram
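
The bandwidth cliff described above follows from a simple roofline-style model: decode is memory-bound, every token reads all the weights once, and the GPU and CPU passes run sequentially. The bandwidth figures below are assumed ballpark numbers (RTX 3060-class GDDR6 vs dual-channel DDR4), not measurements from the article.

```python
# Hedged roofline sketch of hybrid GPU+CPU offload (illustrative numbers).
# Decode tokens/s is bounded by (bytes of weights touched) / (bandwidth),
# and the slow system-RAM pass dominates as soon as layers spill out of VRAM.

def decode_tps(model_gb: float, vram_frac: float,
               gpu_bw_gbs: float = 360.0, cpu_bw_gbs: float = 50.0) -> float:
    """Upper bound on decode tokens/s with vram_frac of weights in VRAM.
    gpu_bw ~ RTX 3060 class, cpu_bw ~ dual-channel DDR4 (assumed figures)."""
    t_gpu = model_gb * vram_frac / gpu_bw_gbs        # seconds/token, GPU pass
    t_cpu = model_gb * (1 - vram_frac) / cpu_bw_gbs  # seconds/token, CPU pass
    return 1.0 / (t_gpu + t_cpu)

for frac in (1.0, 0.9, 0.5):
    print(f"{frac:.0%} in VRAM: ~{decode_tps(8.0, frac):.0f} tok/s")
```

Even 10% spillover costs far more than 10% of throughput, because the system-RAM pass is an order of magnitude slower per byte; that asymmetry is the "cliff".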
3 min read

Implementing Google's TurboQuant: Hybrid KV Cache for Edge LLM Deployment

How I implemented hybrid per-layer KV cache quantization on RK3588 using insights from Google's TurboQuant research, achieving 17% better compression with zero quality loss.

Local AI
Situation: Running a local-first Discord AI agent (Engram) on a $130 RK3588 single-board computer with 24GB RAM. The challenge: KV cache memory during long conversations would crash the bot or cause severe latency spikes.
Used In: Engram AI Discord bot, RADXA AI Suite
local-ai · edge-ai · turboquant · kv-cache
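
The KV cache pressure described above is easy to quantify: the cache grows linearly with context length, so long conversations eventually dominate memory. This sketch uses an assumed Llama-8B-like GQA configuration, not the actual model or per-layer scheme from the case study, and it ignores quantization scale overhead.

```python
# Hedged sketch: KV cache growth with context length, and what lower
# bit-width caching buys. Shapes are assumptions (GQA, 32 layers,
# 8 KV heads, head_dim 128), not the Engram bot's actual model.

def kv_cache_gb(ctx: int, layers: int = 32, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """K and V tensors, per layer, per token: 2 * kv_heads * head_dim elems."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * ctx / 1e9

fp16 = kv_cache_gb(32_768)                  # FP16 cache at 32k context
q8 = kv_cache_gb(32_768, bytes_per_elem=1)  # ~8-bit cache, roughly half

print(f"32k ctx FP16: {fp16:.2f} GB, ~8-bit: {q8:.2f} GB")
```

Several gigabytes of cache on a 24GB board that is also holding the model weights explains the crashes; halving (or better) the cache footprint per token is exactly where per-layer quantization pays off.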
16 min read

The Architecture of Speed: Real-Time Telemetry and Generative AI in 2026 Motorsport

A deep dive into the cloud architectures, real-time data streaming capabilities, and Generative AI setups powering Formula 1 and Formula E in 2026.

Cloud
Situation: Top-tier motorsport series (F1 and FE) introduced radical new technical regulations in 2026, causing an explosion in telemetry data (over 1.1 million data points per second) that legacy systems couldn't process in real time.
Used In: Researching modern high-throughput IoT edge-to-cloud architectures for autonomous vehicle frameworks.
system-architecture · data-engineering · aws · google-cloud