Introducing Open-TQ-Metal: Open-Source Fused Attention for Long-Context LLM Inference

Apr 21, 2026 | Sai Vegasena

Introducing Open-TQ-Metal - an open-source, Metal-native implementation of fused compressed-domain attention, extending Google Research's TurboQuant approach (Zandieh et al., ICLR 2026) to Apple Silicon. It enables Llama 3.1 70B at 128K context on a single 64GB Mac, a configuration none of the major open-source inference frameworks (mlx-lm, llama.cpp, Ollama) can reach on this hardware.

Llama 3.1 70B at 128K context needs 79 GB of memory. Normally, a 64 GB Mac can't do it. Open-TQ-Metal writes fused Metal attention kernels that compress the KV cache from 40 GB to 12.5 GB. Now 128K fits - and runs at 6 tok/s.

This is the first implementation of fused compressed-domain attention on Apple Silicon, built across 330 experiments on two model families (Gemma 4 31B and Llama 3.1 70B).

Paper: arxiv.org/abs/2604.16957

Key Contributions

First Metal implementation of fused compressed-domain attention. Custom Metal compute shaders compute attention directly on int4-quantized KV cache data, with zero intermediate dequantization buffers. 48x speedup at 128K context over the dequantize-then-attend baseline.
Extended the TurboQuant approach from 8B to 70B. The original TurboQuant paper tested on 8B models with 32 layers. We ran 330 experiments across Gemma 4 31B (60 layers) and Llama 3.1 70B (80 layers), identifying which techniques transfer and which break at depth.
A cross-architecture quantization finding. A single architectural parameter (attn_scale) determines whether angular quantization methods like PolarQuant succeed or fail - a finding that was not in the original TurboQuant, PolarQuant, or QJL papers.
Fully open-sourced. Metal shader code, C++ inference engine, and all benchmarks are public.

Overview

Standard int4 inference dequantizes the full key matrix to FP32 before computing attention - wasting bandwidth and memory on temporary matrices that grow with context length:

Standard vs Fused Attention Pipeline

Standard:

dequantize(K_int4) → K_fp32 [40 GB temp] → Q @ K^T → softmax → @ V → output

Ours:

sdpa_int4(Q, K_int4, V_int4) → output [0 bytes temp, all in GPU registers]

The fused sdpa_int4 Metal kernel reads packed int4 keys and values directly from device memory, dequantizes per-element via bitwise operations in GPU registers, and computes attention with online softmax - no intermediate matrices ever materialize.

Split-K parallelism divides long KV sequences into 512-token chunks processed in parallel across Metal's GPU cores, then reduces partial softmax results via chained MLX Primitives. This solves the Metal dispatch race condition (multiple kernels within a single eval_gpu() execute without barriers) and achieves flat latency scaling from 1K to 128K tokens.

Key Discoveries

The attn_scale Finding

The most surprising result: the attention scale factor - not model size - determines whether angular quantization methods work.

PolarQuant encodes KV vectors as recursive polar coordinates. It works beautifully on Llama (attn_scale = 1/√d = 0.0884) but produces gibberish on Gemma 4 (attn_scale = 1.0). The reason: Gemma's undampened scale amplifies a 4% directional error 25-100x more per layer. Across 60 layers, this compounds through softmax into incoherent output.

The attn_scale parameter determines whether angular quantization methods succeed or fail
Method	Llama 70B (α = 0.0884)	Gemma 31B (α = 1.0)
PolarQuant 4-bit cosine sim	0.958	0.621
PolarQuant 4-bit output	Coherent	Gibberish
Int4 (group=32) cosine sim	0.998	0.992
Int4 (group=32) output	Identical	Excellent

Int4 per-group quantization works on both architectures because its per-element affine errors are uncorrelated and cancel in dot products - unlike PolarQuant's structured angular bias.

Wired Memory Is Mandatory

A 39 GB model on 64 GB hardware causes macOS to page weights to SSD. One line fixes a 12x slowdown: mx::set_wired_limit(max_working_set_size) → 0.6 → 6.0 tok/s.

Results

Kernel Speedup (Llama 3.1 70B)

Fused kernel speedup grows with context length, reaching 48x at 128K
KV Length	Fused Kernel	Baseline (deq + SDPA)	Speedup
1K	1.6 ms	4.6 ms	3x
16K	2.7 ms	54.6 ms	20x
64K	5.7 ms	225.3 ms	39x
128K	9.9 ms	480.6 ms	48x

Constant Throughput (Gemma 4 31B)

Decode Throughput vs Context Length (Gemma 4 31B)

The fused kernel maintains constant 10 tok/s regardless of context length. The baseline degrades from 10.8 to 7.2 tok/s as dequantization bandwidth increases.

Memory: 128K Context on 64 GB

Int4 KV cache compression enables 128K+ context on 64 GB hardware
Context	FP16 KV	int4 KV	Total (int4)	Fits 64 GB?
16K	5.0 GB	1.6 GB	40.7 GB	Yes
64K	20.0 GB	6.3 GB	45.4 GB	Yes
128K	40.0 GB	12.5 GB	51.6 GB	Yes
236K	–	23.4 GB	62.5 GB	Yes (max)

End-to-End

Open-TQ-Metal is the only framework that runs Llama 70B at 128K on consumer hardware
Framework	tok/s	Max Context (64 GB)	128K?
Open-TQ-Metal	6.0	236K	Yes
mlx-lm	7.3	73K	No
llama.cpp	~5	~50K	No
Ollama	~5	~40K	No

Open-TQ-Metal trades 18% throughput for 3.2x context capacity - the only framework that runs Llama 70B at 128K on consumer hardware.

Total Memory at Each Context Length (Llama 3.1 70B)

Memory breakdown at each context length. The dashed red line is the 64 GB limit. At 128K, FP16 KV needs 79.1 GB (overflows). Open-TQ-Metal fits in 51.6 GB.

What Didn't Work (and Why)

Approach	Result	Why
PolarQuant (Gemma 4)	Gibberish	attn_scale=1.0 amplifies angular error
QJL 1-bit (both)	Repetition loops	0.85⁸⁰ ≈ 0 compound noise
QJL + int4 hybrid	1.7% savings	Not worth the quality risk
Speculative decode (E2B→31B)	12 tok/s (slower)	25% acceptance rate (need >60%)
int4 KV >950 tok (Gemma)	Quality degrades	Compound error at α=1.0
int8 KV >1K tok (Gemma)	Same failure	Same fundamental limit
2-bit weights	0.67x slower	MLX 2-bit kernel too slow
FP16 intermediates	Slower	Cast overhead
async_eval pipelining	Neutral	Mutable KV cache prevents overlap

Paper Comparison

The original TurboQuant paper (ICLR 2026) proposes PolarQuant + QJL on CUDA for 8B models. Open-TQ-Metal ports it to Metal, scales to 70B, and discovers that int4 with fused attention outperforms the paper's pipeline on real hardware.

	TurboQuant Paper (ICLR 2026, A100, 8B)	Our Open-TQ-Metal Implementation (M1 Max, 70B)
Compression Method	QJL + PolarQuant 2.5-bit, near-lossless	int4 asymmetric 3.2x compression
Fused Kernel	CUDA, standard sizes	Metal, split-K 48x speedup at 128K
KV at 128K (70B)	Not tested at 70B	12.5 GB (fits 64GB with 11.5GB headroom)
QJL 1-bit keys	Works on 8B (32 layers)	Fails on 70B (80 layers) error compounds
PolarQuant	Works (4.2x compression)	Works (cosine 0.989) but slower than int4
Model Scale	7-8B parameter models	70B parameters first Apple Silicon port
Speed	Not reported for 70B on Apple Silicon	6.0 tok/s decode (mlx-lm: 7.3 tok/s)
Quality	Near-lossless (0.997 recall)	Coherent text 200+ tokens matches mlx-lm top-1

Run It Yourself

The inference engines are in separate repos, each with full build instructions, pretrained weight loaders, and streaming chat interfaces:

Repo	Model	Speed	Context	Hardware
gemma4metal	Gemma 4 31B	10 tok/s (fused) / 59 tok/s (MoE)	950 tok	M1/M2/M3/M4, 32 GB+
turboquant-llama3.170B	Llama 3.1 70B	6 tok/s	128K tok	M1 Max+, 64 GB

Both are C++ and Metal with Python only for tokenization. Clone, build MLX from source, cmake && make, and chat.

How We Discovered These Breakthroughs: The Ensue Agent Swarm

The 330-experiment sweep across Gemma 4 31B and Llama 3.1 70B was done with our agent swarm and coordinated over a shared memory layer, Ensue.

A swarm of orchestrator agents worked in parallel conducting sub agents on separate Macs. Each relied on custom retrieval systems sifting through memories under one data hierarchy. The memories were indexed through insights, hypotheses, task delegation, and reasoning. The aim was to asynchronously conduct an efficient sweep of the optimization frontier with diverse, creative experiments.

Our sub agents were categorized under hardware profiler, data diagnostician, model builder, experiment runner, plateau analyst, validator, and reflector. Each provided separate reasoning traces with complementary agentic loops that enhanced the capability of the overall swarm. The cross-experiment synthesis that connects findings across architectures, parameter sweeps, and differing philosophies culminates to an enriched collective intelligence.

Ensue memory graph visualization showing 697 memories across claims, insights, hypotheses, and results categories — The Ensue memory layer: 697 memories organized across claims, insights, hypotheses, and results

Optimizing ML Systems

The kind of work that produced Open-TQ-Metal kernel-level optimization, cross-architecture analysis, and agent-coordinated experiment sweeps is the kind of ML systems work we do for clients.

Check out our previous work on how we compressed months of ML work to 48 hours and fixed LLM inference slowdown and other case studies.

If you're hitting memory walls, scaling a paper's approach to your stack, or trying to ship inference optimizations, book a call or contact us.