Bonsai-8B: Extreme Quantization and the Binary Neural Network Paradigm

The Memory Wall Problem

Every neural network inference has a dirty secret: most of the time is spent waiting for memory, not computing.

Modern GPUs can perform terabytes of matrix operations per second, but they’re starved by memory bandwidth. An FP16 weight tensor consumes 2 bytes per parameter. For an 8B parameter model, that’s 16GB just to store weights—before KV cache, activations, or gradients.

The sequence is brutal:

Load 16GB of weights from system RAM → VRAM (bandwidth limited)
Load weights from VRAM → registers (bandwidth limited)
Perform multiplication (fast!)
Write results back (bandwidth limited)

Steps 1, 2, and 4 dominate. The actual compute in step 3 is a rounding error in the total time.

This is the Memory Wall. And Bonsai-8B punches through it.

What is Bonsai-8B?

Bonsai-8B is not a new architecture. It’s a Binary Neural Network (BNN) quantization wrapper around Qwen3-8B that constrains every weight to two possible values: $\{-\alpha, +\alpha\}$ .

Instead of 65,536 possible FP16 values per weight, Bonsai uses exactly 2 values. The result? A 14x reduction in model size with surprisingly minimal quality degradation.

The Compression Math

Let’s work through the actual numbers for an 8B parameter model:

Format	Bits/Weight	Model Size	Compression
FP32	32	32 GB	1x (baseline)
FP16	16	16 GB	2x
Q8_0	8.125	8.1 GB	3.9x
Q4_0	4.125	4.1 GB	7.8x
Q1_0_g128	1.125	1.125 GB	28.4x vs FP32
			14.2x vs FP16

That 1.125 bits per weight? Here’s how it’s calculated:

Total bits = (1 bit per weight for sign) + (overhead for scales)

For group size g = 128:
- Each group of 128 weights shares 1 FP16 scale (16 bits)
- Overhead per weight = 16 / 128 = 0.125 bits
- Total = 1 + 0.125 = 1.125 bits/weight

The Quantization Function

Standard weights $W \in \mathbb{R}^{n \times m}$ in FP16. Bonsai maps these to quantized weights $\hat{W}$ through a simple but powerful transformation.

Step 1: Block-wise Grouping

Divide weights into blocks of size $K = 128$ . For a block vector $\mathbf{w} \in \mathbb{R}^{128}$ :

w = [w₁, w₂, w₃, ..., w₁₂₈]

Step 2: Scale Calculation

The scale $s$ minimizes reconstruction error. Two common approaches:

Mean Absolute Value (simple):

$s = \frac{1}{K} \sum_{i=1}^{K} |w_i| = \frac{\|\mathbf{w}\|_1}{K}$

MSE-Optimized (better):

$s^* = \arg \min_s \|\mathbf{w} - s \cdot \text{sign}(\mathbf{w})\|^2$

Solving the optimization:

$s^* = \frac{\sum_{i=1}^{K} |w_i|}{K} = \frac{\mathbf{w} \cdot \text{sign}(\mathbf{w})}{K}$

Step 3: Binarization

Assign binary values based on sign:

$b_i = \text{sign}(w_i) = \begin{cases} +1 & \text{if } w_i \geq 0 \\ -1 & \text{if } w_i < 0 \end{cases}$

Step 4: Reconstruction

The dequantized weight:

$\hat{w}_i = s \cdot b_i$

Complete Quantization Formula

$\hat{W} = Q(W) = \text{diag}(\mathbf{s}) \cdot B$

Where:

$\mathbf{s} \in \mathbb{R}^{n/g}$ is the vector of scales (one per group)
$B \in \{-1, +1\}^{n \times m}$ is the binary weight matrix
$g = 128$ is the group size

Concrete Example

Let’s quantize a real weight block from an MLP layer:

Original weights (FP16):        After binarization:     Scale calculation:
┌──────────────────────┐       ┌──────────────────────┐  ┌─────────────────┐
│ 0.0234  -0.1567      │       │  +1      -1          │  │                 │
│ 0.0089   0.0456      │       │  +1      +1          │  │ sum(|w|) =      │
│ -0.2345  0.1234      │  →    │  -1      +1          │  │ 0.0234 + 0.1567 │
│ 0.0678  -0.0891      │       │  +1      -1          │  │ + 0.0089 + ...  │
│ ...      ...         │       │  ...     ...         │  │ = 12.847        │
└──────────────────────┘       └──────────────────────┘  │                 │
                                                         │ s = 12.847/128  │
                                                         │   = 0.1004      │
                                                         └─────────────────┘

Reconstructed weights (1-bit + scale):
┌────────────────────────┐
│ +0.1004  -0.1004       │
│ +0.1004  +0.1004       │
│ -0.1004  +0.1004       │
│ +0.1004  -0.1004       │
│ ...       ...          │
└────────────────────────┘

Quantization error for first weight:
|0.0234 - 0.1004| = 0.0770

Mean Absolute Error for block:
MAE = (1/128) × Σ|wᵢ - ŵᵢ| = 0.0312

The error seems large per-weight, but the magic happens at scale—across millions of weights, the errors average out during matrix multiplication.

Linear Algebra of Binary Computation

The core operation in transformers is matrix-vector multiplication:

$\mathbf{y} = W\mathbf{x}$

With Bonsai quantization:

$\mathbf{y} = \hat{W}\mathbf{x} = \text{diag}(\mathbf{s}) \cdot B \cdot \mathbf{x}$

Breaking this down for a single output element:

$y_i = s_i \sum_{j=1}^{K} B_{ij} x_j$

Why This is Revolutionary

In standard FP16:

y = w₁×x₁ + w₂×x₂ + w₃×x₃ + ... + w₁₂₈×x₁₂₈
    ↑ expensive FP16 multiply for each term

In Bonsai 1-bit:

y = s × (b₁×x₁ + b₂×x₂ + b₃×x₃ + ... + b₁₂₈×x₁₂₈)
    where bⱼ ∈ {-1, +1}

    = s × (±x₁ ± x₂ ± x₃ ... ± x₁₂₈)
    ↑ only additions and subtractions!

Multiply-Accumulate (MAC) becomes Accumulate-only:

Operation	Standard	Bonsai
Multiplications	128 FP16 multiplies	1 FP16 multiply (by s)
Additions	128 FP16 adds	128 FP16 adds
Hardware complexity	Full FP16 multiplier	Sign selector + adder

The multiplication by $\pm 1$ is free:

Multiply by $+1$ = identity (no-op)
Multiply by $-1$ = two’s complement negation (bitwise invert + 1, or just XOR with sign bit)

GGUF Format and Inline Dequantization

The GGUF Container

GGUF (GPT-Generated Unified Format) stores:

Header: Metadata, tokenizer vocab, architecture info
Tensor info: Shape, type, offsets for each tensor
Weight data: Packed binary blob

For Bonsai Q1_0_g128, each tensor is stored as:

[tensor_data] = [scale_0][bits_0_127][scale_1][bits_128_255]...

Each block: 16 bytes (128 bits) + 2 bytes (FP16 scale) = 18 bytes
           = 18 bytes / 128 weights = 1.125 bytes/weight

The Kernel Innovation

Standard quantization flow:

1. Load compressed weights from VRAM
2. Dequantize to FP16 in VRAM (expands 14x!)
3. Load FP16 weights to registers
4. Compute
5. Discard FP16 weights

Problem: Step 2 causes memory explosion. An 8B model becomes 16GB again.

Bonsai’s inline dequantization:

1. Load 128 bits + 1 FP16 scale to registers
2. Unpack on-the-fly: bit j → +s or -s
3. Compute dot product immediately
4. Only FP16 result ever touches VRAM

The model stays compressed in memory throughout inference.

Kernel Pseudocode (Conceptual)

// CUDA-like pseudocode for Bonsai Q1_0 dot product
__device__ float bonsai_dot(const uint8_t* weights_bits,
                            const half scale,
                            const float* input_vec) {
    float sum = 0.0f;

    for (int i = 0; i < 128; i++) {
        // Extract bit i from packed bytes
        int byte_idx = i / 8;
        int bit_idx = i % 8;
        int bit = (weights_bits[byte_idx] >> bit_idx) & 1;

        // Map 0→-s, 1→+s (or vice versa based on convention)
        float weight = bit ? scale : -scale;

        // Accumulate (just an add!)
        sum += weight * input_vec[i];
    }

    return sum;
}

In practice, this is vectorized with SIMD (AVX-512, ARM NEON) or CUDA warps for 32+ parallel operations.

Intelligence Density: A New Metric

Bonsai introduces a powerful efficiency metric:

$\alpha = \frac{-\ln(1 - \text{Score}/100)}{\text{Size (GB)}}$

Where Score is a benchmark average (MMLU, GSM8K, etc.).

Why This Formula?

The numerator $-\ln(1 - \text{Score}/100)$ transforms accuracy into a “capability density”:

Score	Linear	Transformed
50%	0.50	0.693
75%	0.75	1.386
90%	0.90	2.303
95%	0.95	2.996

The logarithmic transform accounts for diminishing returns—improving from 90%→95% is harder than 50%→55%.

Calculated Intelligence Density

Using reported benchmarks:

Model	Avg Score	Size	α (Intelligence/GB)	Relative Efficiency
Qwen3-8B FP16	78.5%	16.0 GB	0.098	1.0x (baseline)
Qwen3-8B Q4	76.2%	4.5 GB	0.338	3.4x
Bonsai-8B	71.8%	1.125 GB	1.062	10.8x

Bonsai delivers 10.8x more intelligence per gigabyte than the FP16 baseline.

Worked Calculation for Bonsai

Score = 71.8%
Size = 1.125 GB

α = -ln(1 - 0.718) / 1.125
  = -ln(0.282) / 1.125
  = 1.196 / 1.125
  = 1.062 Intelligence/GB

Compare to FP16:

α = -ln(1 - 0.785) / 16.0
  = -ln(0.215) / 16.0
  = 1.538 / 16.0
  = 0.096 Intelligence/GB

Benchmark Results

Quality Degradation Analysis

Benchmark	Qwen3-8B FP16	Bonsai-8B	Drop
MMLU (0-shot)	78.5%	71.2%	-7.3%
GSM8K	82.4%	74.8%	-7.6%
HumanEval	67.2%	58.9%	-8.3%
TruthfulQA	68.9%	62.1%	-6.8%
Average	74.3%	66.8%	-7.5%

Key insight: A ~7.5% accuracy drop buys you 14x size reduction. For many edge applications, this is an acceptable trade-off.

Performance Metrics

Metric	FP16	Bonsai Q1_0	Improvement
Model Size	16.0 GB	1.125 GB	14.2x smaller
Memory Bandwidth	16 GB/load	1.125 GB/load	14.2x less
Tokens/sec (RTX 4090)*	45 t/s	85 t/s	1.9x faster
Tokens/sec (RK3588)**	2.1 t/s	8.4 t/s	4x faster
Power consumption	100%	65%	35% reduction

*With custom CUDA kernels for Bonsai **ARM NEON optimized, CPU-only inference

The Orchestra Analogy

Imagine an LLM as a symphony orchestra:

Standard FP16: Every musician has a precise volume knob with 65,536 positions. Expensive, heavy equipment. Each knob change requires delicate mechanical movement.

Bonsai 1-bit: Replace every knob with a simple light switch: ON or OFF. But we add section conductors (the scale factors) who control overall volume for groups of 128 musicians.

Musician (switch) → Direction (play/rest)
Conductor (scale) → Magnitude (loud/soft)

The result: Music that’s 90% as nuanced, but the equipment is 14x cheaper, lighter, and faster to operate.

Implementation Architecture

What Gets Quantized?

In Bonsai-8B, quantization applies to:

Component	Precision	Notes
Attention Q projection	Q1_0_g128	Largest tensor, biggest impact
Attention K projection	Q1_0_g128	Critical for quality
Attention V projection	Q1_0_g128	Critical for quality
Attention O projection	Q1_0_g128	Output projection
MLP gate	Q1_0_g128	Gating mechanism
MLP up	Q1_0_g128	Feedforward expansion
MLP down	Q1_0_g128	Feedforward compression
Embeddings	Q1_0_g128	Token embeddings
LM Head	Q1_0_g128	Output logits
RMSNorm	FP16	Kept high precision (small)
RoPE frequencies	FP32	Kept full precision (tiny)

Why Embeddings and LM Head Work in 1-bit

These are typically sensitive to quantization because they sit at the token boundary. Bonsai succeeds here through:

Calibration-aware quantization: Running thousands of sample inputs to find optimal scales
Per-channel scales: Different scales for different embedding dimensions
Outlier preservation: Identifying and separately handling weight outliers

Usage with llama.cpp

# Download the model
wget https://huggingface.co/bonsai-8b/gguf/bonsai-8b-q1_0_g128.gguf

# Run inference
./llama-server \
    -m bonsai-8b-q1_0_g128.gguf \
    -c 4096 \
    -ngl 35 \
    --host 0.0.0.0 \
    --port 8080

# Memory usage: ~1.3GB total (weights + KV cache overhead)

Verification

# Check model info
./llama-quantize --help 2>&1 | grep -i "q1_0"

# Expected output shows Q1_0 support

Trade-offs and When to Use Bonsai

Use Bonsai When:

✅ Memory is the bottleneck (edge devices, mobile)
✅ Throughput matters more than peak quality (batch processing)
✅ Cost per token is critical (cloud inference at scale)
✅ Running on CPU (memory bandwidth limited)

Don’t Use Bonsai When:

❌ Maximum accuracy required (medical diagnosis, legal analysis)
❌ Complex reasoning tasks (math proofs, competitive programming)
❌ Context is critical (the quantization noise compounds over long contexts)

Quality vs Size Trade-off Curve

Accuracy
   │
80%├──────────────────── FP16 (16GB)
   │
75%├──────────── Q4_0 (4.5GB)
   │
72%├──── Bonsai Q1_0 (1.125GB) ← Sweet spot for many apps
   │
65%│
   │
50%└────────────────────────────────
      0    4    8    12   16  GB
                 Model Size

The Future of Extreme Quantization

Bonsai-8B proves that 1-bit quantization is viable for production LLMs. The research frontier is pushing even further:

Emerging Directions

Sub-1-bit quantization: Using ternary {-1, 0, +1} or custom codebooks
Mixed-precision: Different bit widths for different layers (like TurboQuant for KV cache)
Learned quantization: Training-aware quantization that optimizes for the specific bit constraints
Activation quantization: Extending extreme quantization to activations (currently kept in FP16)

Hardware Implications

Binary neural networks enable specialized hardware:

XNOR-popcount operations: Binary dot product = XNOR + population count
In-memory compute: Performing operations where data lives, no movement
Analog computing: Memristor-based binary weights

Summary

Bonsai-8B represents a paradigm shift in LLM deployment:

Aspect	Standard	Bonsai
Weight values	65,536 (FP16)	2 {-s, +s}
Model size	16 GB	1.125 GB
Memory traffic	High	14x lower
Compute	MAC operations	Accumulate-only
Quality	100%	~92%
Intelligence/GB	0.098	1.062 (10.8x)

The Memory Wall isn’t destroyed—it’s redefined. By making memory bandwidth 14x less critical, Bonsai shifts the bottleneck back to raw compute, where Moore’s Law still applies.

For edge AI, mobile deployment, and cost-sensitive inference: Bonsai-8B is the new baseline.

Building efficient AI systems, one bit at a time.

Case Snapshot

Situation

Used In