Solving "Impossible" Hardware Constraints: A Mac Mini Can Now Run a 70B Model at Full 128K Context
Llama 3.1 70B at 128K context needs ~79GB of memory. A top-spec Mac mini has 64GB.
This gap isn't an optimization problem; it's a hard limit. So in practice, you just... can't run it.
Until now.
We built Open-TQ-Metal, a fused attention kernel that operates directly on an int4-compressed KV cache.
It reduces the KV cache at 128K context from 40GB to 12.5GB, bringing total memory down to ~51.6GB, enough to fit.
And because the kernel is fused, it's also fast:
→ 48x faster attention at 128K
→ identical top-1 outputs under greedy decode
Here are the key insights from the 330 experiments across two model families, with full details in the paper and the code:
- Memory Is the Bottleneck, Not Compute
- Fusing Operations Beats Just Compressing Storage
- Consumer Hardware Is Closer to Production Than People Think
- Quantization Methods Are Not Architecture-Universal
- Agents with Collective Intelligence Surfaced the Finding
Why This Is "Impossible"
At 128K context:
- Weights (int4): 39.1GB
- KV cache (FP16): ~40GB
- Total: ~79GB
That's already beyond the physical memory of a 64GB Mac.
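The arithmetic is easy to reproduce. A minimal sketch, assuming the public Llama 3.1 70B configuration (80 layers, 8 KV heads under GQA, head dimension 128) and GiB units, which is what the post's round numbers imply:

```python
# Back-of-envelope KV-cache arithmetic for Llama 3.1 70B.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128   # from the public model config
FP16_BYTES = 2
GIB = 1024 ** 3

def kv_cache_gib(tokens: int, bytes_per_elem: float) -> float:
    # K and V each store layers * kv_heads * head_dim values per token.
    elems_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM
    return tokens * elems_per_token * bytes_per_elem / GIB

WEIGHTS_INT4_GIB = 39.1  # int4-quantized weights, figure from the post
kv_fp16 = kv_cache_gib(128 * 1024, FP16_BYTES)
print(f"KV cache (FP16) at 128K: {kv_fp16:.1f} GiB")    # 40.0
print(f"Total: {WEIGHTS_INT4_GIB + kv_fp16:.1f} GiB")   # 79.1
```

The per-token KV footprint (2 × 80 × 8 × 128 × 2 bytes = 320 KiB) is what makes the cache grow past the weights at long context.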
Every major inference stack hits this wall on this hardware:
mlx-lm → max context of ~73K tokens for Llama 3.1 70B
llama.cpp → max context of ~50K tokens for Llama 3.1 70B
Ollama → max context of ~40K tokens for Llama 3.1 70B
The model was designed for 128K context, but no 64GB Mac could actually run it locally at full capacity.
The bottleneck is memory.
Open-TQ-Metal made this possible for the first time. By compressing the KV cache from 40GB down to 12.5GB, the total drops to 51.6GB, fitting in 64GB with 12.4GB to spare.
| Framework | Max Context (70B, 64GB) | tok/s | KV Format | 128K? |
|---|---|---|---|---|
| Open-TQ-Metal | 236K | 6.0 | int4 | Yes |
| mlx-lm | 73K | 7.3 | FP16 | No |
| llama.cpp | ~50K | ~5 | Mixed | No |
| Ollama | ~40K | ~5 | GGUF | No |
Why It Matters
- It unlocks the model's full capability on hardware people actually own (a Mac with 64GB of memory vs. datacenter-grade hardware). 128K tokens is the context window Llama 3.1 was designed for: roughly a 100-page document, an entire codebase, or a very long conversation. Without it, you are running the model below its capability.
At 40K context, you're constantly losing earlier context. At 128K, the model can reason over everything at once. This is the difference between "I can use this model" and "I can use this model for real work."
- It's the first time anyone has done this on Apple Silicon. All the prior research on compressed-domain attention (TurboQuant, PolarQuant, QJL) targeted NVIDIA GPUs with CUDA. Apple's Metal compute architecture is fundamentally different:
- different memory hierarchies
- different dispatch semantics
- different programming model
This was a reimplementation that had to solve new problems, not a straight port.
- It does this with zero quality loss. The int4-compressed path produces the exact same top-1 token as the FP16 path under greedy decode.
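A minimal sketch of how such an equivalence check can be run (random logits stand in for real model outputs, and `greedy_ids` is an illustrative helper, not the project's API):

```python
import numpy as np

def greedy_ids(logits_per_step):
    # Under greedy decode, the emitted token is just the argmax each step,
    # so two paths are "identical" iff their argmax sequences match.
    return [int(np.argmax(l)) for l in logits_per_step]

# Stand-in logits; in practice, run the same prompt through the FP16
# and int4-fused paths and compare emitted token ids step by step.
rng = np.random.default_rng(0)
fp16_logits = [rng.normal(size=128_256) for _ in range(8)]  # Llama 3 vocab size
# A tiny perturbation models quantization noise that leaves argmax unchanged.
int4_logits = [l + rng.normal(scale=1e-8, size=l.shape) for l in fp16_logits]

assert greedy_ids(fp16_logits) == greedy_ids(int4_logits)
print("top-1 identical across all steps")
```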
Key Insights On How We Made This Breakthrough
1. Memory Is the Bottleneck, Not Compute
Most LLM optimization work focuses on compute: faster matrix multiplications, better attention kernels, more efficient scheduling. These matter. But at long context, the bottleneck shifts.
The KV cache grows linearly with context length. At 128K tokens on Llama 70B, it is larger than the model weights. This memory bottleneck is what makes 128K impossible on 64GB hardware.
In our experiments, the model weights were already quantized to 4-bit, which saves memory; the KV cache was the remaining blocker. Compressing it as well, from FP16 to int4 (a 3.2x reduction in practice, once per-group scale and zero-point metadata are accounted for), brings the total at 128K down to 51.6GB, leaving 12.4GB of headroom.
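Why 3.2x rather than a clean 4x? A sketch of the effective-bits arithmetic; the group size of 32 with an FP16 scale and zero-point per group is an assumption chosen to reproduce the stated ratio, since the post only gives the 3.2x figure:

```python
# Effective bits per element for group-wise int4 quantization.
GROUP_SIZE = 32               # assumed; not stated in the post
BITS_PAYLOAD = 4              # the packed int4 value
BITS_META = 2 * 16            # FP16 scale + FP16 zero-point per group

bits_per_elem = BITS_PAYLOAD + BITS_META / GROUP_SIZE
ratio = 16 / bits_per_elem    # compression vs. FP16 (16 bits/element)
print(bits_per_elem, ratio)   # 5.0 3.2
```

At 5 effective bits per element, the 40GB FP16 cache lands at exactly 40 / 3.2 = 12.5GB.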
The key: knowing where the bottleneck actually lives - compute or memory - is what turns optimization work into results.
2. Fusing Operations Beats Just Compressing Storage
Compressing the KV cache to int4 solves the memory problem. But if you still decompress back to FP16 before computing attention, you lose the gains. The decompression creates a temporary buffer that scales with context length, consuming memory bandwidth that the GPU needs for actual computation.
Open-TQ-Metal's fused kernel eliminates this entirely; it operates directly on compressed data. It reads packed int4 values directly in GPU registers, dequantizes on the fly using vectorized nibble extraction, and accumulates attention scores - never allocating a temporary memory buffer.
As a result, the speedup scales super-linearly with context length.
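The on-the-fly dequantization can be sketched in NumPy; low-nibble-first packing and a single scale/zero-point are illustrative assumptions here, while the real kernel does this per group, in Metal registers:

```python
import numpy as np

def dequant_nibbles(packed: np.ndarray, scale: float, zero: float) -> np.ndarray:
    # Two int4 values live in each byte: extract low and high nibbles,
    # then dequantize immediately instead of materializing an FP16
    # buffer for the whole cache (the fused kernel's key move).
    lo = (packed & 0x0F).astype(np.float32)
    hi = (packed >> 4).astype(np.float32)
    vals = np.empty(packed.size * 2, dtype=np.float32)
    vals[0::2], vals[1::2] = lo, hi        # interleave: low nibble first
    return scale * (vals - zero)

packed = np.array([0x21, 0x43], dtype=np.uint8)      # nibbles 1, 2, 3, 4
print(dequant_nibbles(packed, scale=0.5, zero=0.0))  # [0.5 1.  1.5 2. ]
```

In the fused kernel the dequantized values feed straight into the attention score accumulation, so no intermediate buffer ever exists.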
Takeaway: Compression saves memory; fusion saves time. Both matter, and they're separate problems.
3. Consumer Hardware Is Closer to Production Than People Think
A single Mac mini with 64GB memory - a machine that costs roughly $1.8K - now runs Llama 3.1 70B at its full 128K context window at 6.0 tok/s. That throughput is not interactive-chat fast, but it is usable for long-context workloads where the alternative is not running at all: document analysis, code review over large repositories, batched generation.
The theoretical maximum context on 64GB with int4 KV cache is 236K tokens - 1.8x the designed context window.
On-prem long-context inference on consumer-grade hardware is now a real deployment option for privacy-sensitive workloads.
| Context | FP16 KV | int4 KV | Total (weights + int4 KV) | Fits 64 GB? |
|---|---|---|---|---|
| 16K | 5.0 GB | 1.6 GB | 40.7 GB | Yes |
| 64K | 20.0 GB | 6.3 GB | 45.4 GB | Yes |
| 128K | 40.0 GB | 12.5 GB | 51.6 GB | Yes |
| 236K | – | 23.4 GB | 62.5 GB | Yes (max) |
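The max-context figure can be sanity-checked from the table's own numbers; the 62.5GB budget is taken from the max row, and a small gap from rounding is expected:

```python
# Solve for the largest context that fits a given memory budget, using
# the int4 KV footprint implied by the table (12.5 GiB at 128K tokens).
GIB = 1024 ** 3
KV_BYTES_PER_TOKEN = 12.5 * GIB / (128 * 1024)  # = 102,400 bytes (100 KiB)
WEIGHTS_GIB = 39.1
BUDGET_GIB = 62.5   # total the post reports fitting, leaving OS headroom

kv_budget = (BUDGET_GIB - WEIGHTS_GIB) * GIB
max_tokens = kv_budget / KV_BYTES_PER_TOKEN
# ~240K, in the ballpark of the reported 236K; the residual is rounding
# in the table's one-decimal GB figures.
print(f"{max_tokens / 1024:.0f}K tokens")
```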
4. Quantization Methods Are Not Architecture-Universal
We implemented the most recent quantization research to solve the memory problem, notably Google Research's TurboQuant (ICLR 2026). But the quantization method that worked on Llama failed completely on Gemma. Same code, same config, different output - something the paper didn't cover.
Why?
The difference is learned vs. fixed scaling. Llama uses fixed attention scaling (1/√d); Gemma 4 uses learned scaling, which is more sensitive to small quantization errors. And errors compound with depth: a small error at one layer becomes the next layer's input, amplified at every subsequent layer.
This isn't visible at smaller scales. The original TurboQuant paper's 8B results (32 layers) don't transfer to 70B (80 layers). Full math in the write-up and paper.
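As a toy illustration of depth compounding (not the paper's math): if each layer multiplies a small relative error by (1 + eps), depth matters exponentially:

```python
# Toy model of error compounding with depth (illustrative only;
# eps is a hypothetical per-layer relative quantization error).
eps = 0.01
for layers in (32, 80):   # 8B-class depth vs. 70B depth
    amplification = (1 + eps) ** layers
    print(f"{layers} layers: x{amplification:.2f}")
# 32 layers: x1.37
# 80 layers: x2.22
```

Even under this crude model, the same per-layer error is noticeably larger after 80 layers than after 32, which is why 8B results need not transfer to 70B.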
The implication: you can't pick a quantization method off the shelf and assume it transfers. The scale factor, layer depth, and quantization method interact in ways per-layer metrics don't capture. Architecture-specific testing is critical - something our agent swarms let us do at a fast pace.
5. Agents With Collective Intelligence Surfaced the Finding
This discovery emerged from 330 experiments across models, coordinated by an agent swarm over a shared memory layer, Ensue.
The key moment:
- An agent noticed plateauing results on one model
- It cross-referenced them with another
- It surfaced a pattern humans would likely have missed
That kind of cross-experiment synthesis is what agents with collective intelligence do well, and is what we run for clients who need not just benchmarks, but actual breakthrough findings.
Dive Deeper and Run It Yourself
Open-TQ-Metal write-up: https://ensue.dev/blog/introducing-open-tq-metal/
Open-TQ-Metal paper: https://arxiv.org/abs/2604.16957
Code (incl. inference engines, build instructions, pretrained weight loaders, kernels, and streaming chat interfaces):
- https://github.com/mutable-state-inc/turboquant-llama3.170B
- https://github.com/mutable-state-inc/gemma4m
HuggingFace (attention kernel): https://huggingface.co/EnsueAI/metal-int4-sdpa
Tackling Bottlenecks in ML Systems
This kind of work (kernel optimization, cross-architecture behavior, large-scale experiment search) is becoming less about linear tuning and more about exploring a search space.
That's what produced Open-TQ-Metal and our other case studies, including how we compressed months of ML work into 48 hours and fixed an LLM inference slowdown.
If you're hitting similar limits, scaling a paper's approach to your stack, or trying to ship inference optimizations, book a call or contact us.