Archive Overview
📌 You can no… · 12 records in total
Main post · @artificialintelligenceeee
❤️ 703
You can now run a 70B model on a single 4GB GPU, and it even scales up to the colossal Llama 3.1 405B on just 8GB of VRAM. AirLLM uses "layer-wise inference": instead of loading the whole model, it loads, computes, and flushes one layer at a time.
→ No quantization needed by default
→ Supports Llama, Qwen, and Mistral
→ Works on Linux, Windows, and macOS
100% Open Source.
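For readers wondering how layer-wise inference works, here is a minimal conceptual sketch, not AirLLM's actual code: `load_layer_weights()`, the `layers/layer_<i>.pt` files, and the `apply_layer` callable are hypothetical placeholders. The point is that only one layer's weights sit in VRAM at any moment, so peak GPU memory is roughly one layer plus activations rather than the whole model.

```python
import torch

# Conceptual sketch of layer-wise inference (NOT AirLLM's actual code).
# load_layer_weights(), the "layers/layer_<i>.pt" files, and apply_layer
# are hypothetical placeholders used only for illustration.

def load_layer_weights(layer_idx: int) -> dict:
    """Hypothetical helper: read a single layer's weights from disk."""
    return torch.load(f"layers/layer_{layer_idx}.pt", map_location="cpu")

@torch.no_grad()
def layerwise_forward(hidden: torch.Tensor, num_layers: int, apply_layer):
    """Run hidden states through the model one layer at a time."""
    for i in range(num_layers):
        weights = load_layer_weights(i)                      # disk -> CPU RAM
        weights = {k: v.cuda() for k, v in weights.items()}  # CPU RAM -> GPU
        hidden = apply_layer(hidden, weights)                # compute layer i
        del weights                                          # flush the layer
        torch.cuda.empty_cache()                             # free its VRAM
    return hidden
```

The obvious cost is the disk-to-GPU transfer for every layer on every forward pass, which is why several replies below ask about tokens per second.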
Reply · @shakito999
❤️ 10
How many tokens per second?
Reply · @xseryii
❤️ 17
Finally someone invented this. Loading the whole thing into VRAM was dumb af from the beginning. Hopefully in the near future we'll be able to run LLMs like Kimi 2.5 or DeepSeek R1 on normal hardware, only requiring storage space.
Reply · @barneyp6410
❤️ 4
https://medium.com/@zhamdi/running-airllm-locally-on-apple-silicon-not-so-good-2b48d41cdb7c
Reply · @ahmdrdhwan
❤️ 3
The reason we want intelligent models is to do complicated stuff and save time... If it's slow, it defeats the purpose.

But any innovation toward improvement is always interesting. Hardware manufacturers might even consider catering to innovations they see as feasible.
Reply · @kyori_dev
❤️ 2
TPS on the floor, but still, if that means good-quality responses, maybe it's a good trade.
Reply · @mevmedia.io
❤️ 0
Is this real?
Reply · @lol_baj_rucoyx
❤️ 0
Quickquant and then AirLLM is OP. Maybe now RAM prices will go down?
Reply · @mikan.adhi
❤️ 0
Isn't this hacked?
Reply · @dantesp4rda7
❤️ 0
I believe you're talking about litellm, not this one
Reply · @iammarsyad
❤️ 0
The TPS?
Reply · @papiarsy
❤️ 0
Thanks