Months of ML Work Compressed to 48 Hours: ICLR Paper Implemented and Inference Slowdown Fixed
· Christine Yip
Our agent swarm ran 177 experiments in 48 hours, spanning 14 different optimization approaches. We implemented the TurboQuant paper (ICLR 2026) on Apple Silicon, a first-ever Metal implementation. From that, we built a custom GPU kernel that delivers 37% faster attention, memory savings, and, crucially, speed that stays constant as conversations grow.
LLMs Slow Down the Longer You Talk to Them
Every time a language model generates a word, it looks back at the entire conversation so far. It stores that conversation in a KV cache - a running memory that grows with every token. The longer the conversation, the more data has to be read, and the inference (decode) speed decreases.
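A toy decode loop makes that scaling concrete. This is a minimal NumPy sketch (single head, illustrative dimensions - not Gemma's actual layout): every generated token appends one row of keys and values to the cache, and the next step has to read all of it again.

```python
import numpy as np

def decode_step(q, k_cache, v_cache):
    """One decode step: score the new query against every cached key,
    then mix every cached value. Both reads scale with conversation length."""
    scores = k_cache @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache

head_dim = 64
rng = np.random.default_rng(0)
k_cache = np.empty((0, head_dim), dtype=np.float32)
v_cache = np.empty((0, head_dim), dtype=np.float32)

bytes_read = []
for t in range(5):
    # each new token appends one K row and one V row to the cache...
    k_cache = np.vstack([k_cache, rng.standard_normal((1, head_dim), dtype=np.float32)])
    v_cache = np.vstack([v_cache, rng.standard_normal((1, head_dim), dtype=np.float32)])
    q = rng.standard_normal(head_dim, dtype=np.float32)
    out = decode_step(q, k_cache, v_cache)
    # ...and this step had to read the whole cache accumulated so far
    bytes_read.append(k_cache.nbytes + v_cache.nbytes)
```

After five tokens the per-step read volume is five times what it was for the first token - the linear growth behind the slowdown measured below.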
When we ran Gemma 4 31B, Google's latest 31-billion parameter open-source model, with an int4 KV cache (necessary to fit within consumer hardware memory limits), we found that the decode speed drops from 10.8 tokens/s at the start of a conversation to 7.2 tokens/s a few hundred tokens in. A 33% degradation, and it keeps getting worse as the conversation grows.
The Results
Speed That Doesn't Degrade
Within a weekend, our agent swarm developed a fused int4 attention (SDPA) kernel - an open-source implementation of fused compressed-domain attention on Apple Silicon's Metal, inspired by TurboQuant (ICLR 2026).
Code: github.com/svv232/gemma4metal
While the baseline approach gets noticeably sluggish the longer you talk to the model, the fused int4 SDPA kernel stays essentially flat: 10.4 → 9.8 tok/s, roughly constant regardless of conversation length.
Figure 1: With the fused attention kernel, the inference speed no longer dramatically decreases as conversations grow
The fused int4 attention kernel makes it practical to run bigger models locally at reasonable speeds, and to use larger context windows, since decode speed stays relatively stable as conversations go on.
| Context Length | Standard Path | Open-TQ-Metal | Speedup |
|---|---|---|---|
| 33 tokens | 9.6 tok/s | 10.4 tok/s | +8% |
| 423 tokens | 7.4 tok/s | 10.0 tok/s | +34% |
| 786 tokens | 7.3 tok/s | 10.0 tok/s | +37% |
| 950 tokens | 7.2 tok/s | 9.8 tok/s | +35% |
The Longer the Conversation, the Larger the Advantage of the Fused Kernel
Every time the model generates a token with the standard approach, it decompresses the entire KV cache from int4 to float32, creating a temporary copy roughly 8x larger than the compressed original.
Figure 2: Peak memory saved vs. standard dequantize-attend baseline
On Gemma 4 31B with 50 sliding-attention layers, that temporary allocation is 82 MB at a context length of 100 tokens. At 950 tokens, it hits nearly 780 MB, and at longer contexts it would keep growing.
This isn't stored memory; it's transient: created and destroyed for every single token. But it has real consequences:
- It eats into available memory - On a 64 GB Mac running a 17.4 GB model, 780 MB of temporary overhead per token means less room for longer conversations, other applications, or running larger models. Eliminating it gives you that headroom back.
- It consumes memory bandwidth - The temporary buffer has to be written to memory and read back: 780 MB of data movement that the GPU has to wait for. On Apple Silicon, where CPU and GPU share the same memory pool, this bandwidth consumption competes with everything else on the system.
- It's the reason speed degrades with context - The temporary buffer grows linearly with conversation length. At 100 tokens it's small. At 950 tokens it's 780 MB. This scaling is what makes the standard path get slower as conversations grow.
The fused int4 attention kernel that the agents developed eliminates all of it. Compressed values are unpacked on the fly in GPU registers - the fastest storage on the chip, with effectively zero access latency - and consumed immediately. No temporary buffer created. No memory bandwidth wasted on decompression. No allocation that scales with context length.
Since temporary memory grows as the context length increases, the fused int4 kernel advantage gets bigger the longer the conversation.
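That scaling is simple to model. A back-of-the-envelope sketch: the 50-layer count comes from the post, but the KV head count and head dimension below are our assumptions, chosen for illustration (they happen to land close to the 82 MB and 780 MB figures above).

```python
def dequant_temp_bytes(ctx_len, layers=50, kv_heads=8, head_dim=256):
    """Transient float32 K+V buffer the standard dequantize-attend path
    materializes on every decode step. Head count and head_dim are assumed."""
    bytes_per_elem = 4                      # float32
    kv_tensors = 2                          # one K and one V per layer
    return layers * kv_tensors * kv_heads * head_dim * ctx_len * bytes_per_elem

short = dequant_temp_bytes(100) / 1e6       # ~82 MB
long = dequant_temp_bytes(950) / 1e6        # ~778 MB
growth = long / short                       # 9.5: strictly linear in context length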
What We Did
We pointed an autonomous agent swarm at this problem: make Gemma 4 31B run on Apple Silicon, and by increasing memory efficiency, keep it fast as conversations grow.
A mission briefing defined the research question, the metric to optimize, and what agents could and couldn't touch.
System Architecture
- Search space: the experimental code agents could modify (e.g. QJL kernels: tried, worse)
- Fixed: benchmarks & tests - run after every experiment
- Fixed: model files on disk - weights, tokenizer from Hugging Face
- Fixed: MLX framework - external dependency (agents read its source)
Figure 3: Components within the agents' search space for experimentation and those that are fixed
Agents autonomously claimed experiments, ran them, published results with structured metrics, and posted insights to shared memory on Ensue. One agent's findings informed the next agent's hypothesis.
The swarm ran continuously for 48 hours, with periodic human review of findings and pivots when the evidence warranted it.
| Approach | Status | Key Result |
|---|---|---|
| Fused sdpa_int4 kernel | Shipped | 35% attention speedup, constant throughput |
| Split-K parallelism | Shipped | Flat latency scaling |
| Hybrid prefill/decode | Shipped | 66x faster prefill |
| Wired memory pinning | Shipped | 10x throughput recovery |
| int4 KV cache | Shipped | 6.4x compression |
| PolarQuant | Failed | Gibberish - attn_scale=1.0 amplifies angular error |
| QJL 1-bit correction | Failed | +6.5% perplexity, variance compounds through softmax |
| QJL + int4 hybrid | Failed | 1.7% savings, quality risk |
| Speculative decode | Failed | 25% acceptance rate (need 60%+) |
| int4 KV > 950 tokens | Failed | Compound error across 60 layers |
| int8 KV > 1K tokens | Failed | Same fundamental limit |
| 2-bit weights | Failed | 0.67x - kernel too slow |
| FP16 intermediates | Failed | Cast overhead ate the savings |
| Async eval pipelining | Failed | Mutable KV cache prevents overlap |
Phase 1: Implementing the State of the Art - Weeks of Work in Under 4 Hours
First direction: implement TurboQuant (ICLR 2026), a KV cache compression technique combining PolarQuant (AISTATS 2026) and QJL (AAAI 2025). Within 3.5 hours, the agents had built the first-ever Metal implementation: 10 GPU compute shaders ported from NVIDIA's CUDA, iterated through 5 generations, 65 experiments completed.
An ML engineer would traditionally spend weeks on the literature review, understanding the math, and implementing these algorithms on unfamiliar hardware. The agent swarm did it in an afternoon.
Then they ran it on Gemma 4 31B. It produced gibberish.
The Scientific Discovery: Why It Fails
This finding isn't in the TurboQuant paper, which was tested only on 8B models with 32 layers, where fixed attention scaling hides the error.
Our agents tested it on Gemma 4 31B, 60 layers with learned normalization (QK-norm), where the model learns its own scaling rather than using a fixed 1/sqrt(d) factor. Without that fixed dampening, PolarQuant's angular error passes through softmax at full strength. Softmax amplifies it exponentially, and across 60 layers, the output degrades to noise.
QJL's correction compounds errors across 60 layers rather than fixing them.
The takeaway: when a model uses learned scaling, angular quantization methods break. More details can be found on our GitHub.
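The mechanism is easy to reproduce in miniature: perturb attention logits by the same small amount, then compare softmax outputs under a fixed 1/sqrt(d) scale versus attn_scale=1.0. A toy sketch - the logit values and error size are illustrative, not measurements from Gemma:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 128                                      # head dimension
logits = np.array([10.0, 9.0, 5.0, 1.0])     # raw q·k scores for 4 positions
error = np.array([-0.5, 0.5, 0.0, 0.0])      # stand-in for angular quantization error

def output_error(scale):
    """Total variation between clean and perturbed attention weights."""
    return np.abs(softmax(logits * scale) - softmax((logits + error) * scale)).sum()

damped = output_error(1 / np.sqrt(d))        # fixed scaling shrinks the error first
full_strength = output_error(1.0)            # attn_scale=1.0 feeds it to softmax as-is
```

With the fixed 1/sqrt(d) factor the perturbation barely moves the attention weights; at scale 1.0 the same perturbation shifts them more than an order of magnitude harder - and in a deep model that per-layer error compounds.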
Phase 2: Fused Attention Kernel That Achieved 35% Speedup
The agents pivoted. Profiling showed that 77% of decode time went to weight multiplications (already optimized by Apple - no room for improvement), while 16% went to attention, dominated by decompressing the KV cache to full precision, using it once, and discarding it.
What if you never decompress? The agents read MLX's open-source code, found its qdot pattern for arithmetic on compressed data, and adapted it into a fused attention kernel. Zero temporary memory. Zero decompression overhead.
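The pattern is easy to illustrate in plain Python: keys stay quantized, and each element is dequantized at the moment the dot product consumes it (in the real kernel, inside a register), so a float32 copy of the cache never exists. A NumPy sketch of the idea - not MLX's actual qdot code:

```python
import numpy as np

def quantize_int4(x):
    """Affine int4 quantization: map values onto 16 levels with a scale and zero."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 15 or 1.0
    codes = np.clip(np.round((x - lo) / scale), 0, 15).astype(np.uint8)
    return codes, scale, lo

def dequant_then_dot(codes, scale, zero, v):
    """Standard path: materialize a full float32 copy, then use it once."""
    temp = codes.astype(np.float32) * scale + zero   # the transient buffer
    return float(temp @ v)

def fused_dot(codes, scale, zero, v):
    """Fused path: dequantize element-by-element as the sum consumes it."""
    acc = 0.0
    for c, vi in zip(codes, v):
        acc += (float(c) * scale + zero) * float(vi)  # no temporary array
    return acc

rng = np.random.default_rng(0)
k = rng.standard_normal(64).astype(np.float32)
v = rng.standard_normal(64).astype(np.float32)
codes, scale, zero = quantize_int4(k)
```

Both paths produce the same dot product; the fused one just never allocates the decompressed copy - which is the entire memory and bandwidth win.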
Over 91 experiments, the agents iterated from an initial kernel that was 1.6x slower than baseline (experiment 99) to the final vectorized version that achieved 35% speedup with constant throughput (experiment 165).
| Experiment | Result | Status |
|---|---|---|
| 1-65 | TurboQuant Metal shaders, 5 generations (PolarQuant + QJL, 10 compute shaders, 3.5 hrs) | Built |
| 66-86 | TurboQuant on Gemma 4: gibberish. Discovery: attn_scale=1.0 breaks angular quant; QJL correction: +6.5% perplexity | Failed |
| --- PIVOT: stop compressing, start fusing --- | | |
| 87-95 | Profile pipeline, identify attention bottleneck | Found target |
| 96 | quantized_matmul for attention scoring | 2.2x at 1K |
| 99 | First fused sdpa_int4 kernel | 1.6x slower |
| 163 | Correct kernel, SIMD-parallel | 1.6x slower |
| 165 | Vectorized K+V, final kernel | 35% faster |
| 166-177 | Speculative decode, 2-bit weights, async... | All failed |
What 48 Hours Produced
- 177 experiments across two research phases
- A scientific finding: the attention scale factor determines whether angular quantization methods succeed or fail. This was not in the TurboQuant paper, and it's immediately useful to any team choosing a KV cache compression method
- The first Metal implementation of PolarQuant and QJL: 10 compute shaders ported from CUDA to Apple Silicon (open-sourced, applicable to other model architectures)
- A custom fused int4 attention kernel that eliminates speed degradation as context grows: 37% faster attention, 780 MB memory savings, constant throughput
What Agent Swarms Can Do for You
This is one example of what our autonomous research swarms produce.
The pattern applies broadly: you bring an ML systems problem, e.g. inference too slow, model doesn't fit on your hardware, deployment costs too high, quantization breaking quality.
We point the swarm at it.
Within hours, agents are running experiments. Within days, you have results, a performance report, and production-ready code. No months-long hiring process. No waiting for a specialized engineer to ramp up on your codebase.
We typically do one onboarding call and set up the swarm after that. The swarm starts finding improvements immediately and compounds its own knowledge as it goes: every experiment informs the next.
Get started here, or book a demo call. We'll show you how it works and scope what the swarm can do for your specific setup. Getting started is fast and requires minimal support on your end.