Solving "Impossible" Hardware Constraints: A Mac Mini Can Now Run a 70B Model at Full 128K Context
Llama 3.1 70B at 128K context needs ~79GB of memory. A top-spec Mac mini has 64GB.
This gap isn't an optimization problem; it's a hard limit. So in practice, you just... can't run it.
Until now.
We built Open-TQ-Metal, a fused attention kernel that operates directly on an int4-compressed KV cache.
It reduces the KV cache at 128K context from 40GB to 12.5GB, bringing total memory down to ~51.6GB, enough to fit.
And because the kernel is fused, it's also fast:
→ 48x faster attention at 128K
→ identical top-1 outputs under greedy decode
Here are the key insights from the 330 experiments across two model families, with full details in the paper and the code:
- Memory Is the Bottleneck, Not Compute
- Fusing Operations Beats Just Compressing Storage
- Consumer Hardware Is Closer to Production Than People Think
- Quantization Methods Are Not Architecture-Universal
- Agents with Collective Intelligence Surfaced the Finding
Why This Is "Impossible"
At 128K context:
- Weights (int4): 39.1GB
- KV cache (FP16): ~40GB
- Total: ~79GB
That's already beyond the physical memory of a 64GB Mac.
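The arithmetic is easy to reproduce. A minimal sketch, assuming the public Llama 3.1 70B configuration (80 layers, 8 KV heads under GQA, head dimension 128) and GiB units, which is what the post's round numbers imply:

```python
# Back-of-envelope KV-cache arithmetic for Llama 3.1 70B.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128   # from the public model config
FP16_BYTES = 2
GIB = 1024 ** 3

def kv_cache_gib(tokens: int, bytes_per_elem: float) -> float:
    # K and V each store layers * kv_heads * head_dim values per token.
    elems_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM
    return tokens * elems_per_token * bytes_per_elem / GIB

WEIGHTS_INT4_GIB = 39.1  # int4-quantized weights, figure from the post
kv_fp16 = kv_cache_gib(128 * 1024, FP16_BYTES)
print(f"KV cache (FP16) at 128K: {kv_fp16:.1f} GiB")    # 40.0
print(f"Total: {WEIGHTS_INT4_GIB + kv_fp16:.1f} GiB")   # 79.1
```

The per-token KV footprint (2 × 80 × 8 × 128 × 2 bytes = 320 KiB) is what makes the cache grow past the weights at long context.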
Every major inference stack hits this wall on this hardware:
mlx-lm → max context of ~73K tokens for Llama 3.1 70B
llama.cpp → max context of ~50K tokens for Llama 3.1 70B
Ollama → max context of ~40K tokens for Llama 3.1 70B
The model was designed for 128K context, but no 64GB Mac could actually run it locally at full capacity.
The bottleneck is memory.
Open-TQ-Metal made this possible for the first time. By compressing the KV cache from 40GB down to 12.5GB, the total drops to 51.6GB, fitting in 64GB with 12.4GB to spare.
| Framework | Max Context (70B, 64GB) | tok/s | KV Format | 128K? |
|---|---|---|---|---|
| Open-TQ-Metal | 236K | 6.0 | int4 | Yes |
| mlx-lm | 73K | 7.3 | FP16 | No |
| llama.cpp | ~50K | ~5 | Mixed | No |
| Ollama | ~40K | ~5 | GGUF | No |
Why It Matters
- It unlocks the model's full capability on hardware people actually own (a Mac with 64GB of memory vs. datacenter-grade hardware). 128K tokens is the context window Llama 3.1 was designed for: roughly a 100-page document, an entire codebase, or a very long conversation. Without it, you are running the model below its capability.
At 40K context, you're constantly losing earlier context. At 128K, the model can reason over everything at once. This is the difference between "I can use this model" and "I can use this model for real work."
- It's the first time anyone has done this on Apple Silicon. All the prior research on compressed-domain attention (TurboQuant, PolarQuant, QJL) targeted NVIDIA GPUs with CUDA. Apple's Metal compute architecture is fundamentally different:
- different memory hierarchies
- different dispatch semantics
- different programming model
This was a reimplementation that had to solve new problems, not a straight port.
- It does this with zero quality loss. The int4-compressed path produces the exact same top-1 token as the FP16 path under greedy decode.
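A minimal sketch of how such an equivalence check can be run (random logits stand in for real model outputs, and `greedy_ids` is an illustrative helper, not the project's API):

```python
import numpy as np

def greedy_ids(logits_per_step):
    # Under greedy decode, the emitted token is just the argmax each step,
    # so two paths are "identical" iff their argmax sequences match.
    return [int(np.argmax(l)) for l in logits_per_step]

# Stand-in logits; in practice, run the same prompt through the FP16
# and int4-fused paths and compare emitted token ids step by step.
rng = np.random.default_rng(0)
fp16_logits = [rng.normal(size=128_256) for _ in range(8)]  # Llama 3 vocab size
# A tiny perturbation models quantization noise that leaves argmax unchanged.
int4_logits = [l + rng.normal(scale=1e-8, size=l.shape) for l in fp16_logits]

assert greedy_ids(fp16_logits) == greedy_ids(int4_logits)
print("top-1 identical across all steps")
```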
Key Insights On How We Made This Breakthrough
1. Memory Is the Bottleneck, Not Compute
Most LLM optimization work focuses on compute: faster matrix multiplications, better attention kernels, more efficient scheduling. These matter. But at long context, the bottleneck shifts.
The KV cache grows linearly with context length. At 128K tokens on Llama 70B, it is larger than the model weights. This memory bottleneck is what makes 128K impossible on 64GB hardware.
In our experiments, the model weights were already quantized to 4-bit, which saves memory; the KV cache was the remaining blocker. Compressing it as well, from FP16 to int4 (a 3.2x reduction in practice, once per-group scale and zero-point metadata are accounted for), brings the total at 128K down to 51.6GB, leaving 12.4GB of headroom.
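Why 3.2x rather than a clean 4x? A sketch of the effective-bits arithmetic; the group size of 32 with an FP16 scale and zero-point per group is an assumption chosen to reproduce the stated ratio, since the post only gives the 3.2x figure:

```python
# Effective bits per element for group-wise int4 quantization.
GROUP_SIZE = 32               # assumed; not stated in the post
BITS_PAYLOAD = 4              # the packed int4 value
BITS_META = 2 * 16            # FP16 scale + FP16 zero-point per group

bits_per_elem = BITS_PAYLOAD + BITS_META / GROUP_SIZE
ratio = 16 / bits_per_elem    # compression vs. FP16 (16 bits/element)
print(bits_per_elem, ratio)   # 5.0 3.2
```

At 5 effective bits per element, the 40GB FP16 cache lands at exactly 40 / 3.2 = 12.5GB.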
The key: knowing where the bottleneck actually lives - compute or memory - is what turns optimization work into results.
2. Fusing Operations Beats Just Compressing Storage
Compressing the KV cache to int4 solves the memory problem. But if you still decompress back to FP16 before computing attention, you lose the gains. The decompression creates a temporary buffer that scales with context length, consuming memory bandwidth that the GPU needs for actual computation.
Open-TQ-Metal's fused kernel eliminates this entirely; it operates directly on compressed data. It reads packed int4 values directly in GPU registers, dequantizes on the fly using vectorized nibble extraction, and accumulates attention scores - never allocating a temporary memory buffer.
As a result, the speedup scales super-linearly with context length.
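The on-the-fly dequantization can be sketched in NumPy; low-nibble-first packing and a single scale/zero-point are illustrative assumptions here, while the real kernel does this per group, in Metal registers:

```python
import numpy as np

def dequant_nibbles(packed: np.ndarray, scale: float, zero: float) -> np.ndarray:
    # Two int4 values live in each byte: extract low and high nibbles,
    # then dequantize immediately instead of materializing an FP16
    # buffer for the whole cache (the fused kernel's key move).
    lo = (packed & 0x0F).astype(np.float32)
    hi = (packed >> 4).astype(np.float32)
    vals = np.empty(packed.size * 2, dtype=np.float32)
    vals[0::2], vals[1::2] = lo, hi        # interleave: low nibble first
    return scale * (vals - zero)

packed = np.array([0x21, 0x43], dtype=np.uint8)      # nibbles 1, 2, 3, 4
print(dequant_nibbles(packed, scale=0.5, zero=0.0))  # [0.5 1.  1.5 2. ]
```

In the fused kernel the dequantized values feed straight into the attention score accumulation, so no intermediate buffer ever exists.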
Takeaway: Compression saves memory; fusion saves time. Both matter, and they're separate problems.
3. Consumer Hardware Is Closer to Production Than People Think
A single Mac mini with 64GB memory - a machine that costs roughly $1.8K - now runs Llama 3.1 70B at its full 128K context window at 6.0 tok/s. That throughput is not interactive-chat fast, but it is usable for long-context workloads where the alternative is not running at all: document analysis, code review over large repositories, batched generation.
The theoretical maximum context on 64GB with int4 KV cache is 236K tokens - 1.8x the designed context window.
On-prem long-context inference on consumer-grade hardware is now a real deployment option for privacy-sensitive workloads.
| Context | FP16 KV | int4 KV | Total (weights + int4 KV) | Fits 64 GB? |
|---|---|---|---|---|
| 16K | 5.0 GB | 1.6 GB | 40.7 GB | Yes |
| 64K | 20.0 GB | 6.3 GB | 45.4 GB | Yes |
| 128K | 40.0 GB | 12.5 GB | 51.6 GB | Yes |
| 236K | – | 23.4 GB | 62.5 GB | Yes (max) |
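The max-context figure can be sanity-checked from the table's own numbers; the 62.5GB budget is taken from the max row, and a small gap from rounding is expected:

```python
# Solve for the largest context that fits a given memory budget, using
# the int4 KV footprint implied by the table (12.5 GiB at 128K tokens).
GIB = 1024 ** 3
KV_BYTES_PER_TOKEN = 12.5 * GIB / (128 * 1024)  # = 102,400 bytes (100 KiB)
WEIGHTS_GIB = 39.1
BUDGET_GIB = 62.5   # total the post reports fitting, leaving OS headroom

kv_budget = (BUDGET_GIB - WEIGHTS_GIB) * GIB
max_tokens = kv_budget / KV_BYTES_PER_TOKEN
# ~240K, in the ballpark of the reported 236K; the residual is rounding
# in the table's one-decimal GB figures.
print(f"{max_tokens / 1024:.0f}K tokens")
```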
4. Quantization Methods Are Not Architecture-Universal
We implemented the most recent quantization research to solve the memory problem, notably Google Research's TurboQuant (ICLR 2026). But the quantization method that worked on Llama failed completely on Gemma. Same code, same config, different output - something the paper didn't cover.
Why?
The difference is learned vs. fixed scaling. Llama uses fixed attention scaling (1/√d); Gemma 4 uses learned scaling, which is more sensitive to small quantization errors. And errors compound with depth: a small error at one layer becomes the next layer's input, amplified at every subsequent layer.
This isn't visible at smaller scales. The original TurboQuant paper's 8B results (32 layers) don't transfer to 70B (80 layers). Full math in the write-up and paper.
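As a toy illustration of depth compounding (not the paper's math): if each layer multiplies a small relative error by (1 + eps), depth matters exponentially:

```python
# Toy model of error compounding with depth (illustrative only;
# eps is a hypothetical per-layer relative quantization error).
eps = 0.01
for layers in (32, 80):   # 8B-class depth vs. 70B depth
    amplification = (1 + eps) ** layers
    print(f"{layers} layers: x{amplification:.2f}")
# 32 layers: x1.37
# 80 layers: x2.22
```

Even under this crude model, the same per-layer error is noticeably larger after 80 layers than after 32, which is why 8B results need not transfer to 70B.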
The implication: you can't pick a quantization method off the shelf and assume it transfers. The scale factor, layer depth, and quantization method interact in ways per-layer metrics don't capture. Architecture-specific testing is critical - something our agent swarms let us do at a fast pace.
5. Agents With Collective Intelligence Surfaced the Finding
This discovery emerged from 330 experiments across models, coordinated by an agent swarm over a shared memory layer, Ensue.
The key moment:
- An agent noticed plateauing results on one model
- It cross-referenced them with another
- It surfaced a pattern humans would likely have missed
That kind of cross-experiment synthesis is what agents with collective intelligence do well, and is what we run for clients who need not just benchmarks, but actual breakthrough findings.
Dive Deeper and Run It Yourself
Open-TQ-Metal write-up: https://ensue.dev/blog/introducing-open-tq-metal/
Open-TQ-Metal paper: https://arxiv.org/abs/2604.16957
Code (incl. inference engines, build instructions, pretrained weight loaders, kernels, and streaming chat interfaces):
- https://github.com/mutable-state-inc/turboquant-llama3.170B
- https://github.com/mutable-state-inc/gemma4m
HuggingFace (attention kernel): https://huggingface.co/EnsueAI/metal-int4-sdpa
Tackling Bottlenecks in ML Systems
This kind of work (kernel optimization, cross-architecture behavior, large-scale experiment search) is becoming less about linear tuning and more about exploring a search space.
That's what produced Open-TQ-Metal and our other case studies, including how we compressed months of ML work into 48 hours and fixed an LLM inference slowdown.
If you're hitting similar limits, scaling a paper's approach to your stack, or trying to ship inference optimizations, book a call or contact us.