Months of ML Work Compressed to 48 Hours: ICLR Paper Implemented and Inference Slowdown Fixed
· Christine Yip
Our agent swarm ran 177 experiments in 48 hours, spanning 14 different optimization approaches. We implemented the TurboQuant paper (ICLR 2026) on Apple Silicon, a first-ever Metal implementation. From that, we built a custom GPU kernel that delivers 37% faster attention, memory savings, and, crucially, speed that stays constant as conversations grow.
LLMs Slow Down the Longer You Talk to Them
Every time a language model generates a word, it looks back at the entire conversation so far. It stores that conversation in a KV cache - a running memory that grows with every token. The longer the conversation, the more data has to be read, and the inference (decode) speed decreases.
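A toy decode loop makes that scaling concrete. This is a minimal NumPy sketch (single head, illustrative dimensions - not Gemma's actual layout): every generated token appends one row of keys and values to the cache, and the next step has to read all of it again.

```python
import numpy as np

def decode_step(q, k_cache, v_cache):
    """One decode step: score the new query against every cached key,
    then mix every cached value. Both reads scale with conversation length."""
    scores = k_cache @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache

head_dim = 64
rng = np.random.default_rng(0)
k_cache = np.empty((0, head_dim), dtype=np.float32)
v_cache = np.empty((0, head_dim), dtype=np.float32)

bytes_read = []
for t in range(5):
    # each new token appends one K row and one V row to the cache...
    k_cache = np.vstack([k_cache, rng.standard_normal((1, head_dim), dtype=np.float32)])
    v_cache = np.vstack([v_cache, rng.standard_normal((1, head_dim), dtype=np.float32)])
    q = rng.standard_normal(head_dim, dtype=np.float32)
    out = decode_step(q, k_cache, v_cache)
    # ...and this step had to read the whole cache accumulated so far
    bytes_read.append(k_cache.nbytes + v_cache.nbytes)
```

After five tokens the per-step read volume is five times what it was for the first token - the linear growth behind the slowdown measured below.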
When we ran Gemma 4 31B, Google's latest 31-billion parameter open-source model, with an int4 KV cache (necessary to fit within consumer hardware memory limits), we found that the decode speed drops from 10.8 tokens/s at the start of a conversation to 7.2 tokens/s a few hundred tokens in. A 33% degradation, and it keeps getting worse as the conversation grows.
The Results
Speed That Doesn't Degrade
Within a weekend, our agent swarm developed a fused int4 attention (SDPA) kernel - an open-source implementation of fused compressed-domain attention on Apple Silicon's Metal, inspired by TurboQuant (ICLR 2026).
Code: github.com/svv232/gemma4metal
While the baseline approach gets noticeably sluggish the longer you talk to the model, the fused int4 SDPA kernel stays essentially flat: 10.4 → 9.8 tok/s, roughly constant regardless of conversation length.
Figure 1: With the fused attention kernel, the inference speed no longer dramatically decreases as conversations grow
The fused int4 attention kernel makes it practical to run bigger models locally at reasonable speeds, and to use larger context windows, since decode speed stays relatively stable as conversations go on.
| Context Length | Standard Path | Open-TQ-Metal | Speedup |
|---|---|---|---|
| 33 tokens | 9.6 tok/s | 10.4 tok/s | +8% |
| 423 tokens | 7.4 tok/s | 10.0 tok/s | +34% |
| 786 tokens | 7.3 tok/s | 10.0 tok/s | +37% |
| 950 tokens | 7.2 tok/s | 9.8 tok/s | +35% |
The Longer the Conversation, the Larger the Advantage of the Fused Kernel
Every time the model generates a token with the standard approach, it decompresses the entire KV cache from int4 to float32, creating a temporary copy roughly 8x larger than the compressed original.
Figure 2: Peak memory saved vs. standard dequantize-attend baseline
On Gemma 4 31B with 50 sliding-attention layers, that temporary allocation is 82 MB at a context length of 100 tokens. At 950 tokens, it hits nearly 780 MB, and at longer contexts it would keep growing.
This isn't stored memory; it's transient: created and destroyed for every single token. But it has real consequences:
- It eats into available memory - On a 64 GB Mac running a 17.4 GB model, 780 MB of temporary overhead per token means less room for longer conversations, other applications, or running larger models. Eliminating it gives you that headroom back.
- It consumes memory bandwidth - The temporary buffer has to be written to memory and read back: 780 MB of data movement that the GPU has to wait for. On Apple Silicon, where CPU and GPU share the same memory pool, this bandwidth consumption competes with everything else on the system.
- It's the reason speed degrades with context - The temporary buffer grows linearly with conversation length. At 100 tokens it's small. At 950 tokens it's 780 MB. This scaling is what makes the standard path get slower as conversations grow.
The fused int4 attention kernel that the agents developed eliminates all of it. Compressed values are unpacked on the fly in GPU registers - the fastest storage on the chip, with effectively zero access latency - and consumed immediately. No temporary buffer created. No memory bandwidth wasted on decompression. No allocation that scales with context length.
Since temporary memory grows as the context length increases, the fused int4 kernel advantage gets bigger the longer the conversation.
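That scaling is simple to model. A back-of-the-envelope sketch: the 50-layer count comes from the post, but the KV head count and head dimension below are our assumptions, chosen for illustration (they happen to land close to the 82 MB and 780 MB figures above).

```python
def dequant_temp_bytes(ctx_len, layers=50, kv_heads=8, head_dim=256):
    """Transient float32 K+V buffer the standard dequantize-attend path
    materializes on every decode step. Head count and head_dim are assumed."""
    bytes_per_elem = 4                      # float32
    kv_tensors = 2                          # one K and one V per layer
    return layers * kv_tensors * kv_heads * head_dim * ctx_len * bytes_per_elem

short = dequant_temp_bytes(100) / 1e6       # ~82 MB
long = dequant_temp_bytes(950) / 1e6        # ~778 MB
growth = long / short                       # 9.5: strictly linear in context length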
What We Did
We pointed an autonomous agent swarm at this problem: make Gemma 4 31B run on Apple Silicon, and by increasing memory efficiency, keep it fast as conversations grow.
A mission briefing defined the research question, the metric to optimize, and what agents could and couldn't touch.
System Architecture
- Search space: the experimental code agents could modify (e.g. QJL kernels: tried, worse)
- Fixed: benchmarks & tests - run after every experiment
- Fixed: model files on disk - weights, tokenizer from Hugging Face
- Fixed: MLX framework - external dependency (agents read its source)
Figure 3: Components within the agents' search space for experimentation and those that are fixed
Agents autonomously claimed experiments, ran them, published results with structured metrics, and posted insights to shared memory on Ensue. One agent's findings informed the next agent's hypothesis.
The swarm ran continuously for 48 hours, with periodic human review of findings and pivots when the evidence warranted it.
| Approach | Status | Key Result |
|---|---|---|
| Fused sdpa_int4 kernel | Shipped | 35% attention speedup, constant throughput |
| Split-K parallelism | Shipped | Flat latency scaling |
| Hybrid prefill/decode | Shipped | 66x faster prefill |
| Wired memory pinning | Shipped | 10x throughput recovery |
| int4 KV cache | Shipped | 6.4x compression |
| PolarQuant | Failed | Gibberish - attn_scale=1.0 amplifies angular error |
| QJL 1-bit correction | Failed | +6.5% perplexity, variance compounds through softmax |
| QJL + int4 hybrid | Failed | 1.7% savings, quality risk |
| Speculative decode | Failed | 25% acceptance rate (need 60%+) |
| int4 KV > 950 tokens | Failed | Compound error across 60 layers |
| int8 KV > 1K tokens | Failed | Same fundamental limit |
| 2-bit weights | Failed | 0.67x - kernel too slow |
| FP16 intermediates | Failed | Cast overhead ate the savings |
| Async eval pipelining | Failed | Mutable KV cache prevents overlap |
Phase 1: Implementing the State of the Art - Weeks of Work in Under 4 Hours
First direction: implement TurboQuant (ICLR 2026), a KV cache compression technique combining PolarQuant (AISTATS 2026) and QJL (AAAI 2025). Within 3.5 hours, the agents had built the first-ever Metal implementation: 10 GPU compute shaders ported from NVIDIA's CUDA, iterated through 5 generations, 65 experiments completed.
An ML engineer would traditionally spend weeks on the literature review, understanding the math, and implementing these algorithms on unfamiliar hardware. The agent swarm did it in an afternoon.
Then they ran it on Gemma 4 31B. It produced gibberish.
The Scientific Discovery: Why It Fails
This finding isn't in the TurboQuant paper, which was tested only on 8B models with 32 layers, where fixed attention scaling hides the error.
Our agents tested it on Gemma 4 31B, 60 layers with learned normalization (QK-norm), where the model learns its own scaling rather than using a fixed 1/sqrt(d) factor. Without that fixed dampening, PolarQuant's angular error passes through softmax at full strength. Softmax amplifies it exponentially, and across 60 layers, the output degrades to noise.
QJL's correction compounds errors across 60 layers rather than fixing them.
The takeaway: when a model uses learned scaling, angular quantization methods break. More details can be found on our GitHub.
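The mechanism is easy to reproduce in miniature: perturb attention logits by the same small amount, then compare softmax outputs under a fixed 1/sqrt(d) scale versus attn_scale=1.0. A toy sketch - the logit values and error size are illustrative, not measurements from Gemma:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 128                                      # head dimension
logits = np.array([10.0, 9.0, 5.0, 1.0])     # raw q·k scores for 4 positions
error = np.array([-0.5, 0.5, 0.0, 0.0])      # stand-in for angular quantization error

def output_error(scale):
    """Total variation between clean and perturbed attention weights."""
    return np.abs(softmax(logits * scale) - softmax((logits + error) * scale)).sum()

damped = output_error(1 / np.sqrt(d))        # fixed scaling shrinks the error first
full_strength = output_error(1.0)            # attn_scale=1.0 feeds it to softmax as-is
```

With the fixed 1/sqrt(d) factor the perturbation barely moves the attention weights; at scale 1.0 the same perturbation shifts them more than an order of magnitude harder - and in a deep model that per-layer error compounds.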
Phase 2: Fused Attention Kernel That Achieved 35% Speedup
The agents pivoted. Profiling showed that 77% of decode time went to weight multiplications (already optimized by Apple - no room for improvement), while 16% went to attention, dominated by decompressing the KV cache to full precision, using it once, and discarding it.
What if you never decompress? The agents read MLX's open-source code, found its qdot pattern for arithmetic on compressed data, and adapted it into a fused attention kernel. Zero temporary memory. Zero decompression overhead.
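The pattern is easy to illustrate in plain Python: keys stay quantized, and each element is dequantized at the moment the dot product consumes it (in the real kernel, inside a register), so a float32 copy of the cache never exists. A NumPy sketch of the idea - not MLX's actual qdot code:

```python
import numpy as np

def quantize_int4(x):
    """Affine int4 quantization: map values onto 16 levels with a scale and zero."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 15 or 1.0
    codes = np.clip(np.round((x - lo) / scale), 0, 15).astype(np.uint8)
    return codes, scale, lo

def dequant_then_dot(codes, scale, zero, v):
    """Standard path: materialize a full float32 copy, then use it once."""
    temp = codes.astype(np.float32) * scale + zero   # the transient buffer
    return float(temp @ v)

def fused_dot(codes, scale, zero, v):
    """Fused path: dequantize element-by-element as the sum consumes it."""
    acc = 0.0
    for c, vi in zip(codes, v):
        acc += (float(c) * scale + zero) * float(vi)  # no temporary array
    return acc

rng = np.random.default_rng(0)
k = rng.standard_normal(64).astype(np.float32)
v = rng.standard_normal(64).astype(np.float32)
codes, scale, zero = quantize_int4(k)
```

Both paths produce the same dot product; the fused one just never allocates the decompressed copy - which is the entire memory and bandwidth win.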
Over 91 experiments, the agents iterated from an initial kernel that was 1.6x slower than baseline (experiment 99) to the final vectorized version that achieved 35% speedup with constant throughput (experiment 165).
| Experiment | Result | Status |
|---|---|---|
| 1-65 | TurboQuant Metal shaders, 5 generations (PolarQuant + QJL, 10 compute shaders, 3.5 hrs) | Built |
| 66-86 | TurboQuant on Gemma 4: gibberish. Discovery: attn_scale=1.0 breaks angular quant; QJL correction: +6.5% perplexity | Failed |
| --- PIVOT: stop compressing, start fusing --- | | |
| 87-95 | Profile pipeline, identify attention bottleneck | Found target |
| 96 | quantized_matmul for attention scoring | 2.2x at 1K |
| 99 | First fused sdpa_int4 kernel | 1.6x slower |
| 163 | Correct kernel, SIMD-parallel | 1.6x slower |
| 165 | Vectorized K+V, final kernel | 35% faster |
| 166-177 | Speculative decode, 2-bit weights, async... | All failed |
What 48 Hours Produced
- 177 experiments across two research phases
- A scientific finding: the attention scale factor determines whether angular quantization methods succeed or fail. This was not in the TurboQuant paper, and it's immediately useful to any team choosing a KV cache compression method
- The first Metal implementation of PolarQuant and QJL: 10 compute shaders ported from CUDA to Apple Silicon (open-sourced, applicable to other model architectures)
- A custom fused int4 attention kernel that eliminates speed degradation as context grows: 37% faster attention, 780 MB memory savings, constant throughput
What Agent Swarms Can Do for You
This is one example of what our autonomous research swarms produce.
The pattern applies broadly: you bring an ML systems problem, e.g. inference too slow, model doesn't fit on your hardware, deployment costs too high, quantization breaking quality.
We point the swarm at it.
Within hours, agents are running experiments. Within days, you have results, a performance report, and production-ready code. No months-long hiring process. No waiting for a specialized engineer to ramp up on your codebase.
We typically do one onboarding call and set up the swarm after that. The swarm starts finding improvements immediately and compounds its own knowledge as it goes: every experiment informs the next.
Get started here, or book a demo call. We'll show you how it works and scope what the swarm can do for your specific setup. Getting started is fast and requires minimal support on your end.