Archive overview
📌 Ollama now · 3 records in total
Main post @harishgoswamicse
❤️ 73
Ollama now runs on Apple's MLX framework. On M5 chips, that means 1,851 tokens per second prefill and 134 tokens per second decode with int4.
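
For anyone who wants to check numbers like these on their own machine, Ollama's non-streaming generate response already reports the timing fields needed. A minimal sketch, assuming a local Ollama server on the default port; the model tag and prompt are placeholders:

```python
# Minimal sketch: derive prefill/decode throughput from Ollama's response metadata.
# Assumes an Ollama server on localhost:11434; the model tag is a placeholder.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2.5-coder:32b",   # placeholder, use any locally pulled model
        "prompt": "Explain KV cache reuse in two sentences.",
        "stream": False,
    },
    timeout=600,
).json()

# Ollama reports durations in nanoseconds.
prefill_tps = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
decode_tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"prefill: {prefill_tps:.0f} tok/s, decode: {decode_tps:.0f} tok/s")
```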

For context, that is a local coding agent running faster than most cloud inference endpoints.

What changed under the hood:

- The cache is now reused across conversations, so Claude Code and similar tools get faster responses with less memory burn (see the sketch after this list)
- NVFP4 support brings production-grade model accuracy to local inference
Reply @harishgoswamicse
❤️ 9
- Intelligent checkpoints cut prompt reprocessing on every branch
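
One way to see the cache reuse from the outside, sketched here under the assumption that the model stays loaded between calls and the prefix cache survives across requests (model tag, prefix, and prompts are all placeholders): send two prompts that share a long common prefix and compare how many prompt tokens each request actually evaluates.

```python
# Sketch: observe prefix-cache reuse by comparing prompt_eval_count across two
# requests that share a long common prefix. All names below are placeholders.
import requests

URL = "http://localhost:11434/api/generate"
MODEL = "qwen2.5-coder:32b"
shared_prefix = "You are a coding agent.\n" + ("# project context line\n" * 200)

def ask(question: str) -> dict:
    return requests.post(URL, json={
        "model": MODEL,
        "prompt": shared_prefix + question,
        "stream": False,
    }, timeout=600).json()

first = ask("List the files you would read first.")
second = ask("Now propose a refactor plan.")

# If the server reuses the cached prefix, the second call should report far
# fewer evaluated prompt tokens than the first.
print(first["prompt_eval_count"], second["prompt_eval_count"])
```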

You need a Mac with 32GB+ unified memory. If you have that, a 35B-parameter model now runs locally at speeds that make the cloud optional, not mandatory.
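
The 32GB floor is easy to sanity-check with rough arithmetic that counts only the quantized weights, ignoring the KV cache, runtime buffers, and the OS:

```python
# Rough weight-memory estimate for a 35B-parameter model quantized to 4 bits.
params = 35e9
bytes_per_param = 0.5                                 # int4 = 4 bits = 0.5 bytes
weights_gib = params * bytes_per_param / 1024**3
print(f"~{weights_gib:.1f} GiB for weights alone")    # about 16.3 GiB
# KV cache, runtime buffers, and the OS take the rest, which is why 32GB+
# unified memory is the stated floor.
```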

The line between local and cloud inference just got a lot thinner.
Reply @_keen_
❤️ 0
Call me when it's beyond alpha