Hacker News
Good results fine tuning a local LLM like Qwen 3:0.6B to categorize questions
nl
You can train it in under a minute, and it will work perfectly well on embedded devices.
Small LLMs are good choices for text classification in two cases:
- You need to provide in-context examples and classify based on them.
- Your classification goes beyond simple subject-type classifiers. For example, multiple-choice question answering is a classification task where a small LLM will work but traditional ML methods won't.
djsjajah
nl
https://github.com/thelgevold/fine-tuned-classifier/blob/mai...
dev-experiments
In summary: using logistic regression improves not only accuracy but also performance, both at runtime and during training.
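For anyone who wants to see how small the logistic regression baseline can be, here is a minimal sketch in plain Python. It is not the code from the linked repo: the toy questions, the bag-of-words features, and all function names are assumptions made up for illustration, and a real setup would more likely use embeddings plus a library implementation.

```python
import math
from collections import Counter

# Toy training set (made up for illustration): 1 = billing, 0 = technical.
TRAIN = [
    ("how do i update my credit card", 1),
    ("why was i charged twice", 1),
    ("refund my last invoice", 1),
    ("the app crashes on startup", 0),
    ("cannot connect to the server", 0),
    ("error when installing the update", 0),
]

# Fixed vocabulary built from the training questions.
VOCAB = sorted({word for text, _ in TRAIN for word in text.split()})


def featurize(text, vocab):
    """Bag-of-words counts over a fixed vocabulary; unknown words are ignored."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocab]


def train_logreg(X, y, epochs=200, lr=0.5):
    """Binary logistic regression fitted with plain batch gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            err = p - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(text, vocab, w, b):
    """Return 1 (billing) or 0 (technical) for a new question."""
    x = featurize(text, vocab)
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if z > 0 else 0


W, B = train_logreg([featurize(t, VOCAB) for t, _ in TRAIN], [y for _, y in TRAIN])
```

Calling `predict("refund my credit card", VOCAB, W, B)` then classifies a new question with nothing more than a dot product, which is where the runtime advantage comes from.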
IanCal
You can even get fancy and do things like active learning, with the LLM taking the role of the human annotator and sending in trial statements (you can use a cheap model for the bulk generation and a more expensive one for the classification).
I'd be interested in seeing how well LLMs do at writing code for something like what Snorkel AI used to have. There was open source code a while back that I assume is still around somewhere: you wrote code for a set of low-quality classifiers, and it trained a model around those.
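To make the "set of low-quality classifiers" idea concrete, here is a rough sketch of weak labeling in plain Python. It is not Snorkel's API: the labeling functions, keywords, and label names are hypothetical, and a simple majority vote stands in for the label model that would normally be trained over the noisy votes.

```python
from collections import Counter

ABSTAIN = None  # a labeling function returns this when it has no opinion


def lf_billing_keywords(text):
    """Hypothetical heuristic: billing-related keywords."""
    hit = any(word in text.lower() for word in ("price", "invoice", "refund"))
    return "billing" if hit else ABSTAIN


def lf_error_keywords(text):
    """Hypothetical heuristic: error-related keywords."""
    hit = any(word in text.lower() for word in ("error", "crash", "bug"))
    return "technical" if hit else ABSTAIN


def lf_login_keywords(text):
    """Hypothetical heuristic: account-related keywords."""
    hit = any(word in text.lower() for word in ("password", "login"))
    return "account" if hit else ABSTAIN


LABELING_FUNCTIONS = [lf_billing_keywords, lf_error_keywords, lf_login_keywords]


def weak_label(text, lfs=LABELING_FUNCTIONS):
    """Combine noisy votes by majority; abstain on no votes or on a tie."""
    votes = [vote for vote in (lf(text) for lf in lfs) if vote is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting heuristics, leave the example unlabeled
    return ranked[0][0]
```

The weakly labeled examples would then be used to train a proper classifier, and the functions themselves are exactly the kind of code an LLM could be asked to draft.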
zubiaur
Trains quickly and classifies speedily on modern hardware.
Had a lot of fun doing stuff like this years ago, before LLMs were a thing.
brokensegue
deepsquirrelnet
- Zero-shot encoders like tasksource or GliNER
- Natural language inference: https://huggingface.co/blog/dleemiller/nli-xenc-ways-to-use
- GRPO training
- GEPA prompt tuning Qwen 0.6B (or GEPA, then GRPO)
- Use an embedding model and train a classifier (MLP, logistic, svm)
- Use a larger LLM to generate a synthetic dataset (beware of lack of diversity, mine "seed text" from real sources first)
- Synthetically generate "hard examples" where more than one category may be valid and DPO tune your preferred responses
throwaw12
deepsquirrelnet
There are even more options, especially if you go further back toward more traditional methods: static word vectors like GloVe or fastText (or more modern equivalents like WordLlama or Model2Vec), and then there's sklearn-style stuff too. Those can be really small and fast, but they come with bigger accuracy tradeoffs.
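As a rough illustration of the static-vector route, the sketch below mean-pools word vectors into a sentence vector and picks the closest class centroid by cosine similarity. The tiny 2-d vector table is a made-up stand-in for real GloVe or fastText vectors, and all names here are assumptions.

```python
import math

# Made-up 2-d "word vectors" standing in for real static embeddings.
WORD_VECTORS = {
    "refund": (1.0, 0.0), "invoice": (0.9, 0.1), "charge": (0.8, 0.2),
    "crash": (0.0, 1.0), "error": (0.1, 0.9), "install": (0.2, 0.8),
}


def embed(text):
    """Mean-pool the vectors of known words; None if no word is known."""
    vecs = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vecs:
        return None
    return tuple(sum(dim) / len(vecs) for dim in zip(*vecs))


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def centroid(texts):
    """Average sentence vector for a list of example texts."""
    vecs = [embed(t) for t in texts]
    return tuple(sum(dim) / len(vecs) for dim in zip(*vecs))


# Toy labeled examples per class (illustrative only).
CENTROIDS = {
    "billing": centroid(["refund invoice", "charge refund"]),
    "technical": centroid(["crash error", "install error"]),
}


def classify(text):
    """Return the label of the nearest centroid, or None for unknown input."""
    vec = embed(text)
    if vec is None:
        return None
    return max(CENTROIDS, key=lambda label: cosine(vec, CENTROIDS[label]))
```

Swapping the nearest-centroid step for a trained classifier is the usual next step once there is enough labeled data.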
mickael-kerjean
nextaccountic
Can this specific failure mode be solved by providing a grammar that the output must adhere to? (Not sure if Qwen has this feature; it's used, for example, to ensure the output is parseable JSON.)
nl
It's something that is implemented by the thing that runs the model (e.g. llama.cpp) rather than by the model itself.
Note that it is hard to make work if you turn thinking on because the grammar gets complicated quickly (I don't recall if Qwen 0.6B can do thinking).
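For a plain classifier with thinking off, the grammar can be tiny. llama.cpp takes grammars in its GBNF format, so a sketch like the one below (the helper name and the label set are hypothetical) just builds a one-rule grammar that only admits the allowed labels; how the string is handed to the runtime depends on the tool you use.

```python
import re

# Only allow simple labels so no escaping is needed inside the GBNF string.
_SAFE_LABEL = re.compile(r"[A-Za-z0-9_ -]+")


def labels_to_gbnf(labels):
    """Build a GBNF grammar whose only valid outputs are the given labels."""
    if not labels:
        raise ValueError("need at least one label")
    for label in labels:
        if not _SAFE_LABEL.fullmatch(label):
            raise ValueError(f"unsupported characters in label: {label!r}")
    alternatives = " | ".join(f'"{label}"' for label in labels)
    return f"root ::= {alternatives}"


grammar = labels_to_gbnf(["billing", "technical", "account"])
```

With this grammar the sampler can only ever emit one of the three labels, so there is nothing left to parse or validate afterwards.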
doubtfuluser
GardenLetter27
zwaps
kamranjon
pj_mukh
Cool write-up, really appreciate it! Incidentally, how does this categorization help you get better retrieval results?
mettamage
all2
I wonder if one could build a 'mixture of experts' at the model level that leveraged a variety of small models "within" a larger model...
jszymborski
I'm also interested in it as a student for distillation.
electroglyph
Also, you could stick a classifier head on a BERT model as another option.
abhashanand1501
wongarsu
For larger sizes you still can; it just becomes slower and slower. For a simple classification task (small input, tiny output, and you can constrain the output to a couple of tokens) you could even run something like a 4B or 8B model on the CPU.
a96
A GPU with VRAM (or fast unified RAM) is generally the option that is both available and performant, but really small models in particular also run quite well on CPU and system RAM.
avadodin
The advantage is mainly in memory bandwidth. External GPUs' internal memory is slightly faster than DDR attached to your CPU.
Other types of "AI" models do make use of the extra compute in GPUs but not LLMs.
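The bandwidth argument can be turned into a back-of-the-envelope estimate: when generating one token at a time, every weight has to be read from memory once per token, so memory bandwidth divided by model size gives a rough upper bound on tokens per second. The sketch below assumes exactly that and ignores the KV cache, compute, and overheads; the example numbers are illustrative, not measurements.

```python
def max_decode_tokens_per_second(n_params, bytes_per_param, bandwidth_bytes_per_s):
    """Rough upper bound on single-stream decode speed for a memory-bound model.

    Assumes all weights are read once per generated token and nothing else
    limits throughput (an idealization).
    """
    model_bytes = n_params * bytes_per_param
    return bandwidth_bytes_per_s / model_bytes


# Illustrative inputs: a 0.6B-parameter model at 2 bytes per weight,
# on memory with an assumed 60 GB/s of usable bandwidth.
estimate = max_decode_tokens_per_second(600_000_000, 2, 60_000_000_000)
```

Under those assumptions the bound is 50 tokens per second, and it scales linearly with bandwidth, which is why the same small model is still usable from system RAM.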
throwa356262
Half of the time I ask Qwen 0.6B "what is 1 + 2?", it ends up in a thinking loop of "but wait, the user is asking me to ..."
rhdunn
providers:
  - # llama-server
    id: openai:chat:qwen
    config:
      apiBaseUrl: http://localhost:7876
      apiKey: "..."
      passthrough:
        chat_template_kwargs:
          enable_thinking: false
The looping may be due to quantization -- I've seen it on locally quantized Q6_K Qwen 3.5/3.6 models. I recall seeing somewhere (here or r/LocalLlama) that Qwen models are sensitive to quantization of the keys, though I haven't yet experimented with/looked into fixing this. (I've been building up my promptfoo tests/infrastructure to detect looping, etc. on Qwen and other models.)
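One simple way to flag looping output in a test harness is to look for the same run of words showing up several times. The sketch below is a hypothetical heuristic in plain Python, not anything promptfoo ships; the function name and the thresholds are assumptions you would tune against real transcripts.

```python
def has_repeating_loop(text, ngram_size=6, min_repeats=3):
    """Return True if any word n-gram occurs at least `min_repeats` times."""
    words = text.split()
    counts = {}
    for i in range(len(words) - ngram_size + 1):
        gram = tuple(words[i:i + ngram_size])
        counts[gram] = counts.get(gram, 0) + 1
        if counts[gram] >= min_repeats:
            return True
    return False


# Example inputs: a looping "thinking" trace and a normal short answer.
looping_output = "but wait, the user is asking me to add " * 4
normal_output = "1 + 2 equals 3"
```

`has_repeating_loop(looping_output)` flags the first string and leaves the second alone; short legitimate answers never reach the n-gram size, so they cannot trigger it.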