Ollama now runs on Apple's MLX framework. On M5 chips, that means 1,851 tokens per second prefill and 134 tokens per second decode with int4 quantization.
For context, that is a local coding agent responding faster than most cloud inference endpoints.
What changed under the hood:
- The cache is now reused across conversations, so Claude Code and similar tools get faster responses with less memory churn (probed in the first sketch below)
- NVFP4 (NVIDIA's 4-bit floating-point format) support brings production-grade model accuracy to local inference (see the dequantization sketch below)
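
You can observe the cache reuse from the API: Ollama's `/api/chat` response reports `prompt_eval_count`, the number of prompt tokens it actually had to prefill. A rough probe, assuming a local Ollama server on its default port and a model you have already pulled (the model name below is a placeholder, and the cache behavior is the claim being tested, not guaranteed):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default endpoint
MODEL = "qwen3:8b"  # placeholder; substitute any model you have pulled

# A long shared prefix, like a coding agent's system prompt, is what
# cross-conversation cache reuse should benefit.
SYSTEM = "You are a careful coding assistant. " * 50

def chat(user_msg: str) -> dict:
    """Send one non-streaming chat request and return the parsed response."""
    payload = {
        "model": MODEL,
        "stream": False,
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": user_msg},
        ],
    }
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Two separate conversations that share only the system prompt.
first = chat("Write a haiku about caching.")
second = chat("Write a limerick about quantization.")

# Ollama reports how many prompt tokens it actually evaluated; a much
# smaller prompt_eval_count on the second call suggests the shared prefix
# was served from cache instead of being re-prefilled.
for name, r in [("first", first), ("second", second)]:
    decode_tps = r["eval_count"] / (r["eval_duration"] / 1e9)  # ns -> s
    print(name, "prompt_eval_count:", r.get("prompt_eval_count"),
          "decode tok/s:", round(decode_tps, 1))
```

If the second call's `prompt_eval_count` drops to roughly the length of the new user message, the shared prefix is being reused rather than recomputed.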
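
As for why NVFP4 helps accuracy: as NVIDIA describes the format, each weight is a 4-bit E2M1 float, scaled by an FP8 (E4M3) scale per 16-element block plus a per-tensor FP32 scale. The fine-grained floating-point scales are what recover accuracy relative to plain int4. A minimal dequantization sketch of the format as published, not Ollama's implementation (the function and example values here are illustrative):

```python
import numpy as np

# E2M1 magnitude table: 1 sign bit, 2 exponent bits, 1 mantissa bit.
# Codes 0-7 encode these magnitudes; codes 8-15 are the same values negated.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def dequantize_nvfp4_block(codes, block_scale, tensor_scale):
    """Dequantize one 16-element NVFP4 block.

    codes:        16 uint8 values, each holding a 4-bit E2M1 code
    block_scale:  the block's FP8 (E4M3) scale, already decoded to float
    tensor_scale: the per-tensor FP32 scale
    """
    sign = np.where(codes & 0x8, -1.0, 1.0)   # high bit is the sign
    magnitude = E2M1[codes & 0x7]             # low 3 bits index the table
    return sign * magnitude * block_scale * tensor_scale

# Example block: every positive code, then every negative code.
codes = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
                 dtype=np.uint8)
print(dequantize_nvfp4_block(codes, block_scale=0.25, tensor_scale=1.0))
```

Compared with int4, which forces every block onto a uniform integer grid with one coarse scale, the per-16-element floating-point scales let NVFP4 track the local dynamic range of the weights, which is where the accuracy gain comes from.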