← Back to posts

Bonsai-8B: Extreme Quantization and the Binary Neural Network Paradigm

A deep technical breakdown of 1-bit quantization in LLMs using the Bonsai-8B model. Exploring binary neural networks, inline dequantization kernels, and achieving 14x compression with minimal quality loss.

Case Snapshot

Situation

Deploying LLMs on edge devices with severe memory constraints. Standard FP16 models require 16GB+ VRAM, pricing out most users from running capable models locally.

Used In

Edge LLM deployment, RK3588 inference, high-throughput serving

The Memory Wall Problem

Every neural network inference has a dirty secret: most of the time is spent waiting for memory, not computing.

Modern GPUs can perform terabytes of matrix operations per second, but they’re starved by memory bandwidth. An FP16 weight tensor consumes 2 bytes per parameter. For an 8B parameter model, that’s 16GB just to store weights—before KV cache, activations, or gradients.

The sequence is brutal:

  1. Load 16GB of weights from system RAM → VRAM (bandwidth limited)
  2. Load weights from VRAM → registers (bandwidth limited)
  3. Perform multiplication (fast!)
  4. Write results back (bandwidth limited)

Steps 1, 2, and 4 dominate. The actual compute in step 3 is a rounding error in the total time.

This is the Memory Wall. And Bonsai-8B punches through it.

What is Bonsai-8B?

Bonsai-8B is not a new architecture. It’s a Binary Neural Network (BNN) quantization wrapper around Qwen3-8B that constrains every weight to two possible values: {α,+α}\{-\alpha, +\alpha\}.

Instead of 65,536 possible FP16 values per weight, Bonsai uses exactly 2 values. The result? A 14x reduction in model size with surprisingly minimal quality degradation.

The Compression Math

Let’s work through the actual numbers for an 8B parameter model:

FormatBits/WeightModel SizeCompression
FP323232 GB1x (baseline)
FP161616 GB2x
Q8_08.1258.1 GB3.9x
Q4_04.1254.1 GB7.8x
Q1_0_g1281.1251.125 GB28.4x vs FP32
14.2x vs FP16

That 1.125 bits per weight? Here’s how it’s calculated:

Total bits = (1 bit per weight for sign) + (overhead for scales)

For group size g = 128:
- Each group of 128 weights shares 1 FP16 scale (16 bits)
- Overhead per weight = 16 / 128 = 0.125 bits
- Total = 1 + 0.125 = 1.125 bits/weight

The Quantization Function

Standard weights WRn×mW \in \mathbb{R}^{n \times m} in FP16. Bonsai maps these to quantized weights W^\hat{W} through a simple but powerful transformation.

Step 1: Block-wise Grouping

Divide weights into blocks of size K=128K = 128. For a block vector wR128\mathbf{w} \in \mathbb{R}^{128}:

w = [w₁, w₂, w₃, ..., w₁₂₈]

Step 2: Scale Calculation

The scale ss minimizes reconstruction error. Two common approaches:

Mean Absolute Value (simple):

s=1Ki=1Kwi=w1Ks = \frac{1}{K} \sum_{i=1}^{K} |w_i| = \frac{\|\mathbf{w}\|_1}{K}

MSE-Optimized (better):

s=argminswssign(w)2s^* = \arg \min_s \|\mathbf{w} - s \cdot \text{sign}(\mathbf{w})\|^2

Solving the optimization:

s=i=1KwiK=wsign(w)Ks^* = \frac{\sum_{i=1}^{K} |w_i|}{K} = \frac{\mathbf{w} \cdot \text{sign}(\mathbf{w})}{K}

Step 3: Binarization

Assign binary values based on sign:

bi=sign(wi)={+1if wi01if wi<0b_i = \text{sign}(w_i) = \begin{cases} +1 & \text{if } w_i \geq 0 \\ -1 & \text{if } w_i < 0 \end{cases}

Step 4: Reconstruction

The dequantized weight:

w^i=sbi\hat{w}_i = s \cdot b_i

Complete Quantization Formula

W^=Q(W)=diag(s)B\hat{W} = Q(W) = \text{diag}(\mathbf{s}) \cdot B

Where:

  • sRn/g\mathbf{s} \in \mathbb{R}^{n/g} is the vector of scales (one per group)
  • B{1,+1}n×mB \in \{-1, +1\}^{n \times m} is the binary weight matrix
  • g=128g = 128 is the group size

Concrete Example

Let’s quantize a real weight block from an MLP layer:

Original weights (FP16):        After binarization:     Scale calculation:
┌──────────────────────┐       ┌──────────────────────┐  ┌─────────────────┐
│ 0.0234  -0.1567      │       │  +1      -1          │  │                 │
│ 0.0089   0.0456      │       │  +1      +1          │  │ sum(|w|) =      │
│ -0.2345  0.1234      │  →    │  -1      +1          │  │ 0.0234 + 0.1567 │
│ 0.0678  -0.0891      │       │  +1      -1          │  │ + 0.0089 + ...  │
│ ...      ...         │       │  ...     ...         │  │ = 12.847        │
└──────────────────────┘       └──────────────────────┘  │                 │
                                                         │ s = 12.847/128  │
                                                         │   = 0.1004      │
                                                         └─────────────────┘

Reconstructed weights (1-bit + scale):
┌────────────────────────┐
│ +0.1004  -0.1004       │
│ +0.1004  +0.1004       │
│ -0.1004  +0.1004       │
│ +0.1004  -0.1004       │
│ ...       ...          │
└────────────────────────┘

Quantization error for first weight:
|0.0234 - 0.1004| = 0.0770

Mean Absolute Error for block:
MAE = (1/128) × Σ|wᵢ - ŵᵢ| = 0.0312

The error seems large per-weight, but the magic happens at scale—across millions of weights, the errors average out during matrix multiplication.

Linear Algebra of Binary Computation

The core operation in transformers is matrix-vector multiplication:

y=Wx\mathbf{y} = W\mathbf{x}

With Bonsai quantization:

y=W^x=diag(s)Bx\mathbf{y} = \hat{W}\mathbf{x} = \text{diag}(\mathbf{s}) \cdot B \cdot \mathbf{x}

Breaking this down for a single output element:

yi=sij=1KBijxjy_i = s_i \sum_{j=1}^{K} B_{ij} x_j

Why This is Revolutionary

In standard FP16:

y = w₁×x₁ + w₂×x₂ + w₃×x₃ + ... + w₁₂₈×x₁₂₈
    ↑ expensive FP16 multiply for each term

In Bonsai 1-bit:

y = s × (b₁×x₁ + b₂×x₂ + b₃×x₃ + ... + b₁₂₈×x₁₂₈)
    where bⱼ ∈ {-1, +1}

    = s × (±x₁ ± x₂ ± x₃ ... ± x₁₂₈)
    ↑ only additions and subtractions!

Multiply-Accumulate (MAC) becomes Accumulate-only:

OperationStandardBonsai
Multiplications128 FP16 multiplies1 FP16 multiply (by s)
Additions128 FP16 adds128 FP16 adds
Hardware complexityFull FP16 multiplierSign selector + adder

The multiplication by ±1\pm 1 is free:

  • Multiply by +1+1 = identity (no-op)
  • Multiply by 1-1 = two’s complement negation (bitwise invert + 1, or just XOR with sign bit)

GGUF Format and Inline Dequantization

The GGUF Container

GGUF (GPT-Generated Unified Format) stores:

  1. Header: Metadata, tokenizer vocab, architecture info
  2. Tensor info: Shape, type, offsets for each tensor
  3. Weight data: Packed binary blob

For Bonsai Q1_0_g128, each tensor is stored as:

[tensor_data] = [scale_0][bits_0_127][scale_1][bits_128_255]...

Each block: 16 bytes (128 bits) + 2 bytes (FP16 scale) = 18 bytes
           = 18 bytes / 128 weights = 1.125 bytes/weight

The Kernel Innovation

Standard quantization flow:

1. Load compressed weights from VRAM
2. Dequantize to FP16 in VRAM (expands 14x!)
3. Load FP16 weights to registers
4. Compute
5. Discard FP16 weights

Problem: Step 2 causes memory explosion. An 8B model becomes 16GB again.

Bonsai’s inline dequantization:

1. Load 128 bits + 1 FP16 scale to registers
2. Unpack on-the-fly: bit j → +s or -s
3. Compute dot product immediately
4. Only FP16 result ever touches VRAM

The model stays compressed in memory throughout inference.

Kernel Pseudocode (Conceptual)

// CUDA-like pseudocode for Bonsai Q1_0 dot product
__device__ float bonsai_dot(const uint8_t* weights_bits,
                            const half scale,
                            const float* input_vec) {
    float sum = 0.0f;

    for (int i = 0; i < 128; i++) {
        // Extract bit i from packed bytes
        int byte_idx = i / 8;
        int bit_idx = i % 8;
        int bit = (weights_bits[byte_idx] >> bit_idx) & 1;

        // Map 0→-s, 1→+s (or vice versa based on convention)
        float weight = bit ? scale : -scale;

        // Accumulate (just an add!)
        sum += weight * input_vec[i];
    }

    return sum;
}

In practice, this is vectorized with SIMD (AVX-512, ARM NEON) or CUDA warps for 32+ parallel operations.

Intelligence Density: A New Metric

Bonsai introduces a powerful efficiency metric:

α=ln(1Score/100)Size (GB)\alpha = \frac{-\ln(1 - \text{Score}/100)}{\text{Size (GB)}}

Where Score is a benchmark average (MMLU, GSM8K, etc.).

Why This Formula?

The numerator ln(1Score/100)-\ln(1 - \text{Score}/100) transforms accuracy into a “capability density”:

ScoreLinearTransformed
50%0.500.693
75%0.751.386
90%0.902.303
95%0.952.996

The logarithmic transform accounts for diminishing returns—improving from 90%→95% is harder than 50%→55%.

Calculated Intelligence Density

Using reported benchmarks:

ModelAvg ScoreSizeα (Intelligence/GB)Relative Efficiency
Qwen3-8B FP1678.5%16.0 GB0.0981.0x (baseline)
Qwen3-8B Q476.2%4.5 GB0.3383.4x
Bonsai-8B71.8%1.125 GB1.06210.8x

Bonsai delivers 10.8x more intelligence per gigabyte than the FP16 baseline.

Worked Calculation for Bonsai

Score = 71.8%
Size = 1.125 GB

α = -ln(1 - 0.718) / 1.125
  = -ln(0.282) / 1.125
  = 1.196 / 1.125
  = 1.062 Intelligence/GB

Compare to FP16:

α = -ln(1 - 0.785) / 16.0
  = -ln(0.215) / 16.0
  = 1.538 / 16.0
  = 0.096 Intelligence/GB

Benchmark Results

Quality Degradation Analysis

BenchmarkQwen3-8B FP16Bonsai-8BDrop
MMLU (0-shot)78.5%71.2%-7.3%
GSM8K82.4%74.8%-7.6%
HumanEval67.2%58.9%-8.3%
TruthfulQA68.9%62.1%-6.8%
Average74.3%66.8%-7.5%

Key insight: A ~7.5% accuracy drop buys you 14x size reduction. For many edge applications, this is an acceptable trade-off.

Performance Metrics

MetricFP16Bonsai Q1_0Improvement
Model Size16.0 GB1.125 GB14.2x smaller
Memory Bandwidth16 GB/load1.125 GB/load14.2x less
Tokens/sec (RTX 4090)*45 t/s85 t/s1.9x faster
Tokens/sec (RK3588)**2.1 t/s8.4 t/s4x faster
Power consumption100%65%35% reduction

*With custom CUDA kernels for Bonsai **ARM NEON optimized, CPU-only inference

The Orchestra Analogy

Imagine an LLM as a symphony orchestra:

Standard FP16: Every musician has a precise volume knob with 65,536 positions. Expensive, heavy equipment. Each knob change requires delicate mechanical movement.

Bonsai 1-bit: Replace every knob with a simple light switch: ON or OFF. But we add section conductors (the scale factors) who control overall volume for groups of 128 musicians.

  • Musician (switch) → Direction (play/rest)
  • Conductor (scale) → Magnitude (loud/soft)

The result: Music that’s 90% as nuanced, but the equipment is 14x cheaper, lighter, and faster to operate.

Implementation Architecture

What Gets Quantized?

In Bonsai-8B, quantization applies to:

ComponentPrecisionNotes
Attention Q projectionQ1_0_g128Largest tensor, biggest impact
Attention K projectionQ1_0_g128Critical for quality
Attention V projectionQ1_0_g128Critical for quality
Attention O projectionQ1_0_g128Output projection
MLP gateQ1_0_g128Gating mechanism
MLP upQ1_0_g128Feedforward expansion
MLP downQ1_0_g128Feedforward compression
EmbeddingsQ1_0_g128Token embeddings
LM HeadQ1_0_g128Output logits
RMSNormFP16Kept high precision (small)
RoPE frequenciesFP32Kept full precision (tiny)

Why Embeddings and LM Head Work in 1-bit

These are typically sensitive to quantization because they sit at the token boundary. Bonsai succeeds here through:

  1. Calibration-aware quantization: Running thousands of sample inputs to find optimal scales
  2. Per-channel scales: Different scales for different embedding dimensions
  3. Outlier preservation: Identifying and separately handling weight outliers

Usage with llama.cpp

# Download the model
wget https://huggingface.co/bonsai-8b/gguf/bonsai-8b-q1_0_g128.gguf

# Run inference
./llama-server \
    -m bonsai-8b-q1_0_g128.gguf \
    -c 4096 \
    -ngl 35 \
    --host 0.0.0.0 \
    --port 8080

# Memory usage: ~1.3GB total (weights + KV cache overhead)

Verification

# Check model info
./llama-quantize --help 2>&1 | grep -i "q1_0"

# Expected output shows Q1_0 support

Trade-offs and When to Use Bonsai

Use Bonsai When:

  • Memory is the bottleneck (edge devices, mobile)
  • Throughput matters more than peak quality (batch processing)
  • Cost per token is critical (cloud inference at scale)
  • Running on CPU (memory bandwidth limited)

Don’t Use Bonsai When:

  • Maximum accuracy required (medical diagnosis, legal analysis)
  • Complex reasoning tasks (math proofs, competitive programming)
  • Context is critical (the quantization noise compounds over long contexts)

Quality vs Size Trade-off Curve

Accuracy

80%├──────────────────── FP16 (16GB)

75%├──────────── Q4_0 (4.5GB)

72%├──── Bonsai Q1_0 (1.125GB) ← Sweet spot for many apps

65%│

50%└────────────────────────────────
      0    4    8    12   16  GB
                 Model Size

The Future of Extreme Quantization

Bonsai-8B proves that 1-bit quantization is viable for production LLMs. The research frontier is pushing even further:

Emerging Directions

  1. Sub-1-bit quantization: Using ternary {-1, 0, +1} or custom codebooks
  2. Mixed-precision: Different bit widths for different layers (like TurboQuant for KV cache)
  3. Learned quantization: Training-aware quantization that optimizes for the specific bit constraints
  4. Activation quantization: Extending extreme quantization to activations (currently kept in FP16)

Hardware Implications

Binary neural networks enable specialized hardware:

  • XNOR-popcount operations: Binary dot product = XNOR + population count
  • In-memory compute: Performing operations where data lives, no movement
  • Analog computing: Memristor-based binary weights

Summary

Bonsai-8B represents a paradigm shift in LLM deployment:

AspectStandardBonsai
Weight values65,536 (FP16)2 {-s, +s}
Model size16 GB1.125 GB
Memory trafficHigh14x lower
ComputeMAC operationsAccumulate-only
Quality100%~92%
Intelligence/GB0.0981.062 (10.8x)

The Memory Wall isn’t destroyed—it’s redefined. By making memory bandwidth 14x less critical, Bonsai shifts the bottleneck back to raw compute, where Moore’s Law still applies.

For edge AI, mobile deployment, and cost-sensitive inference: Bonsai-8B is the new baseline.


Building efficient AI systems, one bit at a time.