Hacker News
RTX 5080 and RTX 3090 Setup: 80 Tok/s on Qwen 3.6 27B Q8
sieste
I noticed recently that I started to prefer my local Qwen3.6 35B A3B and pi agent over Claude Code.
Both fail at different tasks, and Qwen more so than Claude.
But the way Qwen fails is much more straightforward. In writing tasks, Qwen's hallucinations and bullshitting are much easier to spot because it doesn't have the sleek vocabulary and wordsmithing skills to disguise its ignorance.
In coding tasks that Qwen can't solve, it often just goes into a tool-calling doom loop that the pi harness can catch, whereas Claude attempts ever more convoluted and creative things, making more and more of a mess that takes forever to clean up.
I think part of the story is that the tasks for which I use AI are fairly simple and maybe don't need a frontier model. But I wonder if "proper" developers have had a similar experience?
eurekin
The moment I'm trying something open-ended or ambitious, Claude/ChatGPT clearly take you to the goal quicker.
For things where there's a way to build a knowledge base, though, a local LLM can definitely be a true contender. Plus, with a big context and no worries about filling it over and over, you can get quite far.
I'm writing this literally in between steps of cooking pasta with products that the local LLM ordered for me online. I've built a grocery shopping skill, so it roughly knows what I have in the fridge (loosely), my last 10 representative orders (general preferences plus rich info about the shops and SKUs around me), and actual real-time in-stock info. The lack of that last part has been my personal pet peeve with every product that promised cooking ingredient delivery (that is, ingredients not packaged specifically for that).
This is what every big tech company with an agent has promised us, and now a local LLM has actually solved it for me fully.
ydj
I would like to see the performance of their setup with and without MTP and n-gram speculative decoding, though, as well as parallel decode performance (once llama.cpp MTP plays well with multiple slots).
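For readers unfamiliar with the n-gram variant mentioned above, here is a minimal sketch of the drafting step only, assuming the simple "look up the last few tokens earlier in the context" approach. The function name `ngram_draft` and the parameters `n` and `k` are hypothetical, and this is not llama.cpp's implementation; a real system would also have the main model verify the drafted tokens and keep only the accepted prefix.

```python
from typing import List


def ngram_draft(tokens: List[int], n: int = 3, k: int = 4) -> List[int]:
    """Propose up to k draft tokens (illustrative sketch, not llama.cpp code).

    Finds the most recent earlier occurrence of the last n tokens and
    returns the tokens that followed it. Returns [] when nothing matches.
    """
    if len(tokens) <= n:
        return []
    suffix = tokens[-n:]
    # Scan right to left so the most recent earlier match wins.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            return tokens[start + n:start + n + k]
    return []


# The context repeats the sequence 1, 2, 3, so the draft continues it.
draft = ngram_draft([1, 2, 3, 4, 5, 1, 2, 3], n=3, k=2)
no_match = ngram_draft([1, 2, 3, 4, 5, 6], n=3, k=2)
```

The draft costs almost nothing to produce, which is why it tends to help most on repetitive text such as code edits, and not at all when the context never repeats.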
Being in California, though, the cost of electricity alone makes this non-competitive with just paying a cloud provider.
arjie
Very interesting though, these Tenstorrent chips. Might get one to experiment with.
skhameneh
https://huggingface.co/easiest-ai-shawn/Qwen3.6-27B-ExCal-EX...
https://huggingface.co/easiest-ai-shawn/Qwen3.6-27B-ExCal-Mi...
Do be sure to use dflash and/or mtp for the draft:
triwats
NVIDIA GeForce RTX 5080: https://flopper.io/gpu/nvidia-geforce-rtx-5080-16gb
NVIDIA GeForce RTX 3090: https://flopper.io/gpu/nvidia-geforce-rtx-3090-24gb
stared
On Apple Silicon, with MLX-LM, I am getting 20 tok/s on a MacBook with an M5 Max. Not sure how it compares to llama.cpp performance.
In any case, while it is noticeably slower than this Nvidia RTX setup, being able to run such models on a laptop is wild. It does heat up my laptop rapidly, though.
redfloatplane
I think the thing is, there's an unspoken "for now" at the end of that sentence, and people running this locally are hedging against that "for now". Some people prefer to feel that they own the means rather than rent the means, even if the one they own is worse than the one they can rent. That is especially true with today's Fable news and the harsh realisation that the "for now" depends on very many unpredictable factors, whereas the one you have locally costs you capital today and a relatively predictable run-rate (made more predictable with on-prem solar, for example), but should otherwise work predictably forever.
I'm not saying that you're wrong to do what you're doing, just that many people have their own lines in the sand where renting vs buying makes sense, and it doesn't only boil down to a rational (or irrational) financial decision.
jubilanti
If suddenly the CCP declared a total digital embargo on Alibaba's Qwen models or even if for some reason all of mainland China (and Singapore) was completely unreachable from the rest of the world, the dozen or so companies selling Qwen by the token elsewhere in the world could continue business as usual.
ThunderSizzle
In terms of electricity, if you aren't using it, even with all the VRAM loaded, at most you're wasting about 30 watts or so.
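To put the roughly 30 W idle figure in perspective, here is a back-of-the-envelope sketch. The electricity price of 0.40 USD/kWh and the 30-day month are assumptions chosen for illustration, not numbers from this thread; substitute your own rate.

```python
IDLE_WATTS = 30.0            # idle draw quoted in the comment above
HOURS_PER_MONTH = 24 * 30    # assumption: a 30-day month, idle the whole time
PRICE_USD_PER_KWH = 0.40     # assumption: hypothetical electricity rate


def idle_energy_kwh(watts: float, hours: float) -> float:
    """Energy used at a constant power draw, in kilowatt-hours."""
    return watts * hours / 1000.0


monthly_kwh = idle_energy_kwh(IDLE_WATTS, HOURS_PER_MONTH)
monthly_cost_usd = monthly_kwh * PRICE_USD_PER_KWH

print(f"{monthly_kwh:.1f} kWh per month, about ${monthly_cost_usd:.2f}")
```

Under these assumptions the idle cost is small next to the draw under load, so the running cost depends mostly on how many hours the GPUs spend generating.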
Prompt processing a large uncached context is annoying, which is why I forced a lower context window, but I don't know if it's any worse in performance than the cloud models I've used.
There's a niceness, to me, knowing I don't have to rent it anymore. If you rent it, the terms can change regularly.
Der_Einzige
Openrouter fking sucks and I don't know why people here act like it's so great. Stop using it if you care about local AI, and accept that the cost you'll pay per token is higher than what you'd pay through any cloud provider. That's the price for privacy, control, and better quality via inference-time optimizations that otherwise aren't available.
jubilanti
Openrouter gives you access to whatever the inference provider gives. They're just the middleman. Many providers give logprobs if you ask; it's in their API. And yeah, no PEFT or LoRA, but that's an entirely different product. And some of the inference providers do that directly.
> Openrouter fking sucks and I don't know why people here act like it's so great. Stop using it if you care about local AI
But the whole point of openrouter is that you can run models by the token and you don't have to care about local AI? Sounds like you're more upset that people aren't making the same calculation on privacy and local control vs cost and ease of use.
atq2119
Memory bandwidth of RTX 3090 is listed as 936GB/s. The post isn't fully clear on which model they used and how big it is, but even assuming it perfectly filled the 24GB of that GPU, 30tok/s means the achieved bandwidth is only 720GB/s. There's a bunch of room for improvement here even without MTP, and those improvements should largely stack with MTP.
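The arithmetic behind that estimate can be sketched as follows, assuming decoding is memory-bound and every weight byte is read once per generated token (so KV-cache and activation traffic are ignored). The 24 GB model size is the comment's worst-case assumption, not a measured value.

```python
PEAK_BANDWIDTH_GB_S = 936.0   # listed RTX 3090 memory bandwidth
MODEL_SIZE_GB = 24.0          # assumption from the comment: weights fill the card
TOKENS_PER_S = 30.0           # decode speed discussed in the comment


def achieved_bandwidth_gb_s(model_size_gb: float, tokens_per_s: float) -> float:
    """Bytes streamed per second if all weights are read once per token."""
    return model_size_gb * tokens_per_s


achieved = achieved_bandwidth_gb_s(MODEL_SIZE_GB, TOKENS_PER_S)
utilization = achieved / PEAK_BANDWIDTH_GB_S

print(f"achieved {achieved:.0f} GB/s, {utilization:.0%} of peak")
```

A smaller model at the same token rate would imply a lower achieved bandwidth, which is why 720 GB/s is an upper bound on what this setup is actually using.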
verdverm
I've switched from using the Spark to run one model as well as it can to running several support models for the Markdown knowledge base (md kb) I'm working on.