Ask HN: Anyone Using a Mac Studio for Local AI/LLM?
josefcub
I've only had it a couple of months, but so far it's proving its worth in the quality of LLM output, even quantized.
I generally run Qwen3-VL 235B at Q4_K_M quantization so that it fits; that still leaves me plenty of RAM for workstation tasks while delivering around 30 tok/s.
I use the smaller Qwen3 models (like Qwen3-Coder) in tandem; they run much faster, of course, and I tend to run them at higher quants, up to Q8, for quality.
The gigantic RAM's biggest boon, I've found, is letting me run the models with full context allocated, which lets me hand them larger and more complicated things than I could before. This alone makes the money I spent worth it, IMO.
I did manage to get glm-4.7 (a 358B model) running at Q3 quantization. Its output is adequate quality-wise, although it only delivers about 15 tok/s, and I had to cut context down to 128k to leave enough room for the desktop.
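For concreteness, the setup is roughly this kind of thing; a minimal llama-cpp-python sketch, assuming a Metal-enabled build, with the file name, context size, and prompt as placeholders rather than my exact config:

    # Load a big quantized GGUF with a large context and full GPU offload.
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/Qwen3-VL-235B-Q4_K_M.gguf",  # placeholder path
        n_ctx=131072,      # a big context is where the unified RAM pays off
        n_gpu_layers=-1,   # offload every layer (Metal on Apple Silicon)
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Summarize this design doc: ..."}],
        max_tokens=512,
    )
    print(out["choices"][0]["message"]["content"])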
If you get something this big, it's a powerhouse, but not nearly as much of a powerhouse as a dedicated nVidia GPU rig. The point is to be able to run them _adequately_, not at production speeds, to get your work done. I found price/performance/energy usage to be compelling at this level and I am very satisfied.
ryan-c
I'm still trying to figure out a good solution for fast external storage. I only went for 1 TB internal, which doesn't go very far with models that have hundreds of billions of parameters.
ProllyInfamous
Acasis makes 40 Gbps external NVMe enclosures. Mine feels quick (for non-LLM tasks).
I also use 10 Gbps TerraMaster 4-bay RAIDs (how I finally retired my Pro5,1).
>energy usage
This thing uses an order of magnitude -less- energy than the computer it replaced, and is faster in almost every aspect.
ProllyInfamous
Obviously it increases your failure rate, but if you're constantly updating the same models (and not creating your own), you don't really need redundancy.
pcf
Can you share more info about quants or whatever is relevant? That's super interesting, since it's such a capable model.
satvikpendem
[0] https://old.reddit.com/r/LocalLLaMA/search?q=mac+studio&rest...
pcf
I'm using LM Studio now for ease of use and simple logging/viewing of previous conversations. Later I'm gonna use my own custom local LLM system on the Mac Studio, probably orchestrated by LangChain and running models with llama.cpp.
My goal has always been to use them in ensembles in order to reduce model bias. The same principle was just introduced as a feature called "model council" in Perplexity Max: https://www.perplexity.ai/hub/blog/introducing-model-council
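A rough sketch of that ensemble idea (not the eventual LangChain version): send the same question to several locally served models through an OpenAI-compatible endpoint and have one of them reconcile the answers. LM Studio's default port and the model names below are assumptions/placeholders:

    # "Model council" sketch: query several local models, then synthesize.
    from openai import OpenAI

    # LM Studio's local server speaks the OpenAI API; adjust base_url if yours differs.
    client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")
    COUNCIL = ["qwen3-14b", "gemma-3-27b", "gpt-oss-20b"]  # whatever is loaded

    def ask(model: str, question: str) -> str:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": question}],
            temperature=0.2,
        )
        return resp.choices[0].message.content

    def council(question: str) -> str:
        answers = {m: ask(m, question) for m in COUNCIL}
        drafts = "\n\n".join(f"[{m}]\n{a}" for m, a in answers.items())
        # Naive aggregation: one model reconciles the drafts and flags disagreements.
        return ask(COUNCIL[0], "Reconcile these answers and note disagreements:\n\n" + drafts)

    print(council("What are the tradeoffs of Q4 vs Q8 quantization?"))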
Chats will be stored in and recalled from a PostgreSQL database with extensions for vectors (pgvector) and graph (Apache AGE).
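The storage side might look roughly like this (a sketch using psycopg and pgvector only; the AGE/graph part is omitted, and the connection string, table layout, and 768-dim embeddings are assumptions):

    # Store chat messages with embeddings, recall nearest neighbors by cosine distance.
    import psycopg

    conn = psycopg.connect("dbname=chats user=postgres")  # placeholder DSN

    with conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
        cur.execute("""
            CREATE TABLE IF NOT EXISTS messages (
                id bigserial PRIMARY KEY,
                role text NOT NULL,
                content text NOT NULL,
                embedding vector(768)
            )
        """)
    conn.commit()

    def to_pgvector(v: list[float]) -> str:
        return "[" + ",".join(str(x) for x in v) + "]"  # pgvector's text input format

    def save(role: str, content: str, embedding: list[float]) -> None:
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO messages (role, content, embedding) VALUES (%s, %s, %s::vector)",
                (role, content, to_pgvector(embedding)),
            )
        conn.commit()

    def recall(query_embedding: list[float], k: int = 5) -> list[str]:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT content FROM messages ORDER BY embedding <=> %s::vector LIMIT %s",
                (to_pgvector(query_embedding), k),
            )
            return [row[0] for row in cur.fetchall()]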
For both sets of tests below, MLX was used when available, but ultimately ran at almost the same speed as GGUF.
I hope this information helps someone!
/////////
Mac Studio M3 Ultra (default w/96 GB RAM, 1 TB SSD, 28C CPU, 60C GPU):
• Gemma 3 27B (Q4_K_M): ~30 tok/s, TTFT ~0.52 s
• GPT-OSS 20B: ~150 tok/s
• GPT-OSS 120B: ~23 tok/s, TTFT ~2.3 s
• Qwen3 14B (Q6_K): ~47 tok/s, TTFT ~0.35 s
(GPT-OSS quants and 20B TTFT info not available anymore)
//////////
MacBook Pro M1 Max 16.2" (64 GB RAM, 2 TB SSD, 10C CPU, 32C GPU):
• Gemma 3 1B (Q4_K): ~85.7 tok/s, TTFT ~0.39 s
• Gemma 3 27B (Q8_0): ~7.5 tok/s, TTFT ~3.11 s
• GPT-OSS 20B (8bit): ~38.4 tok/s, TTFT ~21.15 s
• LFM2 1.2B: ~119.9 tok/s, TTFT ~0.57 s
• LFM2 2.6B (Q6_K): ~69.3 tok/s, TTFT ~0.14 s
• Olmo 3 32B Think: ~11.0 tok/s, TTFT ~22.12 s
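For anyone who wants to reproduce rough numbers like these: a sketch that measures TTFT and generation rate against a local OpenAI-compatible endpoint. LM Studio's default port is assumed, the model name is a placeholder, and counting streamed chunks only approximates tokens:

    # Crude TTFT / tokens-per-second measurement via streaming.
    import time
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

    def bench(model: str, prompt: str, max_tokens: int = 256) -> None:
        start = time.perf_counter()
        first = None
        chunks = 0
        stream = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
            stream=True,
        )
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                if first is None:
                    first = time.perf_counter()
                chunks += 1  # roughly one token per streamed chunk
        end = time.perf_counter()
        ttft = (first or end) - start
        gen = max(end - (first or end), 1e-9)
        print(f"{model}: TTFT ~{ttft:.2f}s, ~{chunks / gen:.1f} tok/s (chunk-based)")

    bench("gemma-3-27b", "Explain KV caching in two sentences.")  # placeholder model id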
hnfong
There are benchmarks on token generation speed out there for some of the large models. You can probably guess the speed for models you're interested in by comparing the sizes (mostly look at the active params).
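A back-of-envelope version of that guess, treating decode as memory-bandwidth-bound (the bandwidth, bits-per-weight, and efficiency figures below are rough assumptions, not measurements):

    # Each generated token roughly streams the active parameters through memory once,
    # so decode speed ~= effective bandwidth / bytes read per token.
    def est_tok_per_s(active_params_b: float, bits_per_weight: float,
                      bandwidth_gb_s: float, efficiency: float = 0.6) -> float:
        gb_per_token = active_params_b * bits_per_weight / 8
        return efficiency * bandwidth_gb_s / gb_per_token

    # M3 Ultra ~800 GB/s; Qwen3 235B-A22B has ~22B active params; Q4_K_M ~4.8 bits/weight.
    print(est_tok_per_s(22, 4.8, 800))   # ~36 tok/s, same ballpark as the ~30 reported above
    # Dense 27B model at Q4 on the same machine:
    print(est_tok_per_s(27, 4.8, 800))   # ~30 tok/s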
Currently the main issue for M1-M4 is prompt "preprocessing" (prefill) speed. In practical terms, if you have a very long prompt, it's going to take a much longer time to process. IIRC this is due to the lack of efficient matrix multiplication operations in the hardware, which I hear is rectified in the M5 architecture. So if you need to process long prompts, don't count on the Mac Studio, at least not with large models.
So in short, if your prompts are relatively short (eg. a couple thousand tokens at most), you need/want a large model, you don't need too much scale/speed, and you need to run inference locally, then Macs are a reasonable option.
For me personally, I got my M3 Ultra somewhat due to geopolitical issues. I'm barred from accessing some of the SOTA models from the US due to where I live, and sometimes the Chinese models are not conveniently accessible either. With the hardware, they can pry DeepSeek R1, Kimi-K2, etc. from my cold dead hands lol.
timothyduong
Edit: of course, the software would need to leverage the neural accelerators, which is another variable: whether the software supports them in the first place.
runjake
Ultra processors are priced high enough that I'd be asking myself whether I'm serious about local LLM work and doing a cost analysis.
kingkongjaffa
So not only are they content charging +$400 or +$600 for RAM, which is in itself ludicrously overpriced, they also force you into a +$1,000-2,000 upgrade to the top CPUs.
It's impossible to spec a MacBook Pro or a Mac mini with a base CPU and a decent amount of RAM. Total scam, since they know people want the RAM for local LLMs.
This was not always the case: when I specced out my MacBook Pro M1 with 16 GB, it was entirely possible to get 32 or 64 GB without any tie-in to CPU upgrades.
I was ready to drop a few grand on a new MacBook Pro M5 or M4 Pro with a decent amount of RAM, but it's currently set up to be an insane price gouge.
To get 32 GB of RAM, it's an M5 chip priced at $1,999.
To get 64 GB of RAM, you are forced to grab the M4 Max CPU, and it's $3,899 on Apple right now. What a scam.