Introducing Open-TQ-Metal: Open-Source Fused Attention for Long-Context LLM Inference

Sai Vegasena

Introducing Open-TQ-Metal - an open-source, Metal-native implementation of fused compressed-domain attention, extending Google Research's TurboQuant approach (Zandieh et al., ICLR 2026) to Apple Silicon. It enables Llama 3.1 70B at 128K context on a single 64GB Mac, a configuration none of the major open-source inference frameworks (mlx-lm, llama.cpp, Ollama) can reach on this hardware.

Llama 3.1 70B at 128K context needs 79 GB of memory, so a 64 GB Mac normally can't run it. Open-TQ-Metal uses fused Metal attention kernels that work on a KV cache compressed from 40 GB to 12.5 GB. Now 128K fits - and runs at 6 tok/s.

This is the first implementation of fused compressed-domain attention on Apple Silicon, built across 330 experiments on two model families (Gemma 4 31B and Llama 3.1 70B).

Paper: arxiv.org/abs/2604.16957

Key Contributions

  • First Metal implementation of fused compressed-domain attention. Custom Metal compute shaders compute attention directly on int4-quantized KV cache data, with zero intermediate dequantization buffers. 48x speedup at 128K context over the dequantize-then-attend baseline.
  • Extended the TurboQuant approach from 8B to 70B. The original TurboQuant paper tested on 8B models with 32 layers. We ran 330 experiments across Gemma 4 31B (60 layers) and Llama 3.1 70B (80 layers), identifying which techniques transfer and which break at depth.
  • A cross-architecture quantization finding. A single architectural parameter (attn_scale) determines whether angular quantization methods like PolarQuant succeed or fail - a finding that was not in the original TurboQuant, PolarQuant, or QJL papers.
  • Fully open-sourced. Metal shader code, C++ inference engine, and all benchmarks are public.

Overview

Standard int4 inference dequantizes the full key matrix to FP32 before computing attention - wasting bandwidth and memory on temporary matrices that grow with context length:

Standard vs Fused Attention Pipeline
Standard:
dequantize(K_int4) → K_fp32 [40 GB temp] → Q @ K^T → softmax → @ V → output
Ours:
sdpa_int4(Q, K_int4, V_int4) → output   [0 bytes temp, all in GPU registers]

The fused sdpa_int4 Metal kernel reads packed int4 keys and values directly from device memory, dequantizes per-element via bitwise operations in GPU registers, and computes attention with online softmax - no intermediate matrices ever materialize.
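
The idea in the paragraph above can be illustrated with a minimal pure-Python sketch. It is not the Metal kernel: the packing layout (two 4-bit codes per byte, low nibble first), the `code * scale + bias` dequantization, and every function name here are assumptions chosen for illustration. It shows a single query attending over packed int4 rows with online softmax, and checks the result against a dequantize-then-attend reference.

```python
import math
import random

GROUP = 32  # elements sharing one scale/bias pair (assumed group size)

def unpack_int4(packed, i):
    """Return the i-th 4-bit code; two codes per byte, low nibble first (assumed layout)."""
    byte = packed[i >> 1]
    return (byte >> 4) if (i & 1) else (byte & 0x0F)

def dequant(packed, scales, biases, i):
    """Affine dequantization of element i: code * scale + bias."""
    g = i // GROUP
    return unpack_int4(packed, i) * scales[g] + biases[g]

def fused_attention(q, keys, values, scale):
    """One query against int4 K/V rows, decoding each element where it is used."""
    d = len(q)
    m = -math.inf        # running maximum logit
    l = 0.0              # running softmax denominator
    acc = [0.0] * d      # running weighted sum of V rows
    for k_row, v_row in zip(keys, values):
        logit = scale * sum(q[i] * dequant(*k_row, i) for i in range(d))
        m_new = max(m, logit)
        corr = math.exp(m - m_new)   # rescale earlier state to the new maximum
        p = math.exp(logit - m_new)
        l = l * corr + p
        acc = [a * corr + p * dequant(*v_row, i) for i, a in enumerate(acc)]
        m = m_new
    return [a / l for a in acc]

def reference_attention(q, keys, values, scale):
    """Baseline: materialize full K and V matrices, then attend."""
    d = len(q)
    K = [[dequant(*row, i) for i in range(d)] for row in keys]
    V = [[dequant(*row, i) for i in range(d)] for row in values]
    logits = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
    top = max(logits)
    w = [math.exp(x - top) for x in logits]
    total = sum(w)
    return [sum(w[t] * V[t][i] for t in range(len(V))) / total for i in range(d)]

def random_row(rng, d):
    """Hypothetical test data: random packed codes with random scales and biases."""
    packed = bytes(rng.randrange(256) for _ in range(d // 2))
    scales = [rng.uniform(0.01, 0.1) for _ in range(d // GROUP)]
    biases = [rng.uniform(-0.5, 0.0) for _ in range(d // GROUP)]
    return packed, scales, biases

rng = random.Random(0)
d, n = 64, 50
q = [rng.gauss(0, 1) for _ in range(d)]
keys = [random_row(rng, d) for _ in range(n)]
values = [random_row(rng, d) for _ in range(n)]
fused = fused_attention(q, keys, values, 1 / math.sqrt(d))
ref = reference_attention(q, keys, values, 1 / math.sqrt(d))
max_diff = max(abs(a - b) for a, b in zip(fused, ref))
```

Both paths compute the same attention output up to floating-point rounding. The difference is that the reference builds `K` and `V` in full, while the fused path only ever holds one decoded element plus the running `m`, `l`, and `acc` state.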

Split-K parallelism divides long KV sequences into 512-token chunks processed in parallel across Metal's GPU cores, then reduces partial softmax results via chained MLX Primitives. This solves the Metal dispatch race condition (multiple kernels within a single eval_gpu() execute without barriers) and achieves flat latency scaling from 1K to 128K tokens.
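
The chunk-then-reduce step can be sketched in plain Python. This is an illustration of the general split-K softmax reduction, not the MLX Primitive chain itself: the function names are hypothetical, the chunks run sequentially here rather than in parallel, and logits are assumed to be precomputed.

```python
import math

CHUNK = 512  # tokens per split, as in the text

def partial_softmax(logits, values):
    """Process one chunk: return (max logit, sum of exps, unnormalized output)."""
    m = max(logits)
    w = [math.exp(x - m) for x in logits]
    d = len(values[0])
    acc = [sum(w[t] * values[t][i] for t in range(len(values))) for i in range(d)]
    return m, sum(w), acc

def reduce_partials(partials):
    """Merge per-chunk results into the exact full-sequence softmax output."""
    m_all = max(m for m, _, _ in partials)
    d = len(partials[0][2])
    l_all = 0.0
    out = [0.0] * d
    for m, l, acc in partials:
        corr = math.exp(m - m_all)   # bring each chunk onto the global maximum
        l_all += l * corr
        out = [o + a * corr for o, a in zip(out, acc)]
    return [o / l_all for o in out]

def split_k_attention(logits, values, chunk=CHUNK):
    """Split the KV sequence into chunks, then reduce the partial results."""
    partials = [partial_softmax(logits[s:s + chunk], values[s:s + chunk])
                for s in range(0, len(logits), chunk)]
    return reduce_partials(partials)

# Deterministic toy data: 1300 tokens gives two full chunks and one partial chunk.
n = 1300
logits = [5 * math.sin(t) for t in range(n)]
values = [[math.cos(t * (i + 1)) for i in range(4)] for t in range(n)]
split_out = split_k_attention(logits, values)
full_out = split_k_attention(logits, values, chunk=n)   # single chunk = ordinary softmax
```

Because each chunk carries its own maximum and denominator, the reduction is exact rather than approximate, which is what allows the chunks to be computed independently.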

Key Discoveries

The attn_scale Finding

The most surprising result: the attention scale factor - not model size - determines whether angular quantization methods work.

PolarQuant encodes KV vectors as recursive polar coordinates. It works beautifully on Llama (attn_scale = 1/√d = 0.0884) but produces gibberish on Gemma 4 (attn_scale = 1.0). The reason: Gemma's undampened scale amplifies a 4% directional error 25-100x more per layer. Across 60 layers, this compounds through softmax into incoherent output.

The attn_scale parameter determines whether angular quantization methods succeed or fail

| Method | Llama 70B (α = 0.0884) | Gemma 31B (α = 1.0) |
|---|---|---|
| PolarQuant 4-bit cosine sim | 0.958 | 0.621 |
| PolarQuant 4-bit output | Coherent | Gibberish |
| Int4 (group=32) cosine sim | 0.998 | 0.992 |
| Int4 (group=32) output | Identical | Excellent |

Int4 per-group quantization works on both architectures because its per-element affine errors are uncorrelated and cancel in dot products - unlike PolarQuant's structured angular bias.
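
As a rough illustration of per-group int4 quantization, here is a small sketch assuming a min/max affine scheme with 16 levels per 32-element group. The exact scheme used in Open-TQ-Metal may differ, and the function names are hypothetical. It quantizes a random vector, reconstructs it, and measures the cosine similarity to the original.

```python
import math
import random

def quantize_int4(x, group=32):
    """Asymmetric per-group int4: each group gets its own scale and minimum."""
    codes, scales, mins = [], [], []
    for s in range(0, len(x), group):
        g = x[s:s + group]
        lo, hi = min(g), max(g)
        scale = (hi - lo) / 15 or 1.0   # 16 levels; fall back to 1.0 for flat groups
        codes += [min(15, max(0, round((v - lo) / scale))) for v in g]
        scales.append(scale)
        mins.append(lo)
    return codes, scales, mins

def dequantize_int4(codes, scales, mins, group=32):
    """Reconstruct each element as code * scale + group minimum."""
    return [c * scales[i // group] + mins[i // group] for i, c in enumerate(codes)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

rng = random.Random(0)
k = [rng.gauss(0, 1) for _ in range(128)]       # stand-in for one key vector
codes, scales, mins = quantize_int4(k)
k_hat = dequantize_int4(codes, scales, mins)
sim = cosine(k, k_hat)
max_err = max(abs(a - b) for a, b in zip(k, k_hat))
```

Under this scheme each element's error is bounded by half a quantization step of its own group, so the error has no shared direction across the vector. That is consistent with the high cosine similarity reported in the table, although this toy uses Gaussian data rather than real KV activations.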

Wired Memory Is Mandatory

A 39 GB model on 64 GB hardware causes macOS to page weights to SSD. One line fixes a 12x slowdown: mx::set_wired_limit(max_working_set_size). Decode goes from 0.6 to 6.0 tok/s.

Results

Kernel Speedup (Llama 3.1 70B)

Fused kernel speedup grows with context length, reaching 48x at 128K

| KV Length | Fused Kernel | Baseline (deq + SDPA) | Speedup |
|---|---|---|---|
| 1K | 1.6 ms | 4.6 ms | 3x |
| 16K | 2.7 ms | 54.6 ms | 20x |
| 64K | 5.7 ms | 225.3 ms | 39x |
| 128K | 9.9 ms | 480.6 ms | 48x |

Constant Throughput (Gemma 4 31B)

[Figure: Decode Throughput vs Context Length (Gemma 4 31B). x-axis: context length (tokens), 0 to 1000; y-axis: decode throughput (tok/s). Series: fused sdpa_int4 kernel; dequantize-then-attend baseline. Labeled values: 9.8 tok/s and 7.2 tok/s.]

The fused kernel maintains constant 10 tok/s regardless of context length. The baseline degrades from 10.8 to 7.2 tok/s as dequantization bandwidth increases.

Memory: 128K Context on 64 GB

Int4 KV cache compression enables 128K+ context on 64 GB hardware

| Context | FP16 KV | int4 KV | Total (int4) | Fits 64 GB? |
|---|---|---|---|---|
| 16K | 5.0 GB | 1.6 GB | 40.7 GB | Yes |
| 64K | 20.0 GB | 6.3 GB | 45.4 GB | Yes |
| 128K | 40.0 GB | 12.5 GB | 51.6 GB | Yes |
| 236K | | 23.4 GB | 62.5 GB | Yes (max) |
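
The KV cache sizes above can be reproduced with a short back-of-the-envelope calculation. The attention geometry is that of Llama 3.1 70B (80 layers, 8 KV heads, head dimension 128). Two things in this sketch are assumptions rather than statements from the text: that each 32-element int4 group stores an fp16 scale and an fp16 offset, and that "GB" is read as binary GiB.

```python
GIB = 1024 ** 3

# Llama 3.1 70B attention geometry (grouped-query attention).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

def kv_cache_gib(tokens, bits_per_element):
    """Size of the K and V caches for `tokens` positions, in GiB."""
    elements = 2 * LAYERS * KV_HEADS * HEAD_DIM * tokens   # 2 = keys + values
    return elements * bits_per_element / 8 / GIB

FP16_BITS = 16
# Assumption: 4-bit codes plus an fp16 scale and an fp16 offset per 32-element
# group, i.e. 4 + (16 + 16) / 32 = 5 effective bits per element.
INT4_BITS = 4 + (16 + 16) / 32

WEIGHTS_GB = 39.1  # 4-bit model weights, figure taken from the text

for ctx_k in (16, 64, 128):
    tokens = ctx_k * 1024
    fp16 = kv_cache_gib(tokens, FP16_BITS)
    int4 = kv_cache_gib(tokens, INT4_BITS)
    print(f"{ctx_k}K: FP16 KV {fp16:.2f}, int4 KV {int4:.2f}, "
          f"total with int4 {WEIGHTS_GB + int4:.2f}")
```

Under these assumptions the ratio of 16 to 5 effective bits is also where a 3.2x compression factor would come from.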

End-to-End

Open-TQ-Metal is the only framework that runs Llama 70B at 128K on consumer hardware

| Framework | tok/s | Max Context (64 GB) | 128K? |
|---|---|---|---|
| Open-TQ-Metal | 6.0 | 236K | Yes |
| mlx-lm | 7.3 | 73K | No |
| llama.cpp | ~5 | ~50K | No |
| Ollama | ~5 | ~40K | No |

Open-TQ-Metal trades 18% throughput for 3.2x context capacity - the only framework that runs Llama 70B at 128K on consumer hardware.

[Figure: Total Memory at Each Context Length (Llama 3.1 70B). Bars for FP16 and int4 at 16K (44.1 vs 40.7 GB), 64K (59.1 vs 45.4 GB), 128K (79.1 vs 51.6 GB), and 236K (OOM vs 62.5 GB, int4 max), against the 64 GB limit. Legend: model weights (4-bit, 39.1 GB), FP16 KV cache, Open-TQ-Metal cache.]

Memory breakdown at each context length. The dashed red line is the 64 GB limit. At 128K, FP16 KV needs 79.1 GB (overflows). Open-TQ-Metal fits in 51.6 GB.

What Didn't Work (and Why)

| Approach | Result | Why |
|---|---|---|
| PolarQuant (Gemma 4) | Gibberish | attn_scale=1.0 amplifies angular error |
| QJL 1-bit (both) | Repetition loops | 0.85^80 ≈ 0 compound noise |
| QJL + int4 hybrid | 1.7% savings | Not worth the quality risk |
| Speculative decode (E2B→31B) | 12 tok/s (slower) | 25% acceptance rate (need >60%) |
| int4 KV >950 tok (Gemma) | Quality degrades | Compound error at α=1.0 |
| int8 KV >1K tok (Gemma) | Same failure | Same fundamental limit |
| 2-bit weights | 0.67x slower | MLX 2-bit kernel too slow |
| FP16 intermediates | Slower | Cast overhead |
| async_eval pipelining | Neutral | Mutable KV cache prevents overlap |

Paper Comparison

The original TurboQuant paper (ICLR 2026) proposes PolarQuant + QJL on CUDA for 8B models. Open-TQ-Metal ports it to Metal, scales to 70B, and discovers that int4 with fused attention outperforms the paper's pipeline on real hardware.

| | TurboQuant Paper (ICLR 2026, A100, 8B) | Our Open-TQ-Metal Implementation (M1 Max, 70B) |
|---|---|---|
| Compression Method | QJL + PolarQuant; 2.5-bit, near-lossless | int4 asymmetric; 3.2x compression |
| Fused Kernel | CUDA, standard sizes | Metal, split-K; 48x speedup at 128K |
| KV at 128K (70B) | Not tested at 70B | 12.5 GB (fits 64GB with 11.5GB headroom) |
| QJL 1-bit keys | Works on 8B (32 layers) | Fails on 70B (80 layers); error compounds |
| PolarQuant | Works (4.2x compression) | Works (cosine 0.989) but slower than int4 |
| Model Scale | 7-8B parameter models | 70B parameters; first Apple Silicon port |
| Speed | Not reported for 70B on Apple Silicon | 6.0 tok/s decode (mlx-lm: 7.3 tok/s) |
| Quality | Near-lossless (0.997 recall) | Coherent text 200+ tokens; matches mlx-lm top-1 |

Run It Yourself

The inference engines are in separate repos, each with full build instructions, pretrained weight loaders, and streaming chat interfaces:

| Repo | Model | Speed | Context | Hardware |
|---|---|---|---|---|
| gemma4metal | Gemma 4 31B | 10 tok/s (fused) / 59 tok/s (MoE) | 950 tok | M1/M2/M3/M4, 32 GB+ |
| turboquant-llama3.170B | Llama 3.1 70B | 6 tok/s | 128K tok | M1 Max+, 64 GB |

Both are C++ and Metal with Python only for tokenization. Clone, build MLX from source, cmake && make, and chat.

How We Discovered These Breakthroughs: The Ensue Agent Swarm

The 330-experiment sweep across Gemma 4 31B and Llama 3.1 70B was run by our agent swarm, coordinated over a shared memory layer, Ensue.

A swarm of orchestrator agents worked in parallel, each directing sub-agents on separate Macs. Each relied on custom retrieval systems that sifted through memories stored under one data hierarchy. The memories were indexed by insight, hypothesis, task delegation, and reasoning. The aim was an efficient, asynchronous sweep of the optimization frontier with diverse, creative experiments.

Our sub-agents fell into seven roles: hardware profiler, data diagnostician, model builder, experiment runner, plateau analyst, validator, and reflector. Each produced its own reasoning traces, with complementary agentic loops that strengthened the swarm as a whole. The cross-experiment synthesis, which connects findings across architectures, parameter sweeps, and differing philosophies, culminates in an enriched collective intelligence.

The Ensue memory layer: 697 memories organized across claims, insights, hypotheses, and results

Optimizing ML Systems

The kind of work that produced Open-TQ-Metal (kernel-level optimization, cross-architecture analysis, and agent-coordinated experiment sweeps) is the kind of ML systems work we do for clients.

Check out our previous work: how we compressed months of ML work to 48 hours, how we fixed an LLM inference slowdown, and other case studies.

If you're hitting memory walls, scaling a paper's approach to your stack, or trying to ship inference optimizations, book a call or contact us.