Astro Hacker News - Ornith-1.0: Self-scaffolding LLMs for agentic coding

SwellJoe |next [-]

I added this to a benchmark I've been doing of how well agents find security bugs, specifically security bugs originally found by Mythos. It performs poorly with only read/grep/ls tools, but in a follow-up test with a full shell and Python, it doubled its findings (still a poor showing, but it does at least indicate it is doing what it says on the tin: making tools to help it solve problems). It also did worse than Qwen AgentWorld, another recent post-train of Qwen 3.6 MoE intended for agentic use.

https://swelljoe.com/post/will-it-mythos/

hedgehog |root |parent |next [-]

It would be really interesting to see how the Qwen 3.6 35B model compares to the 27B on your benchmark.

kordlessagain |root |parent |previous [-]

Good to know. Thanks for the research!

Balinares |next |previous [-]

I'd have expected this to get more HN attention. Qwen 3.6 35B capability in a 9B model is a bonkers claim.

Balinares |root |parent |next [-]

Ok, I gave the 35B MoE weights a shot at Q6. It failed my go-to code reasoning benchmark (local models usually do), but delivered the most advanced output I've seen so far for my (non-agentic) coding execution benchmark. It was also full of bugs, though more in the form of coding mistakes than the deep logic bugs I usually see in local models. The code was also a bit haphazard, and some of those bugs could have been prevented entirely with a better code structure.

It was fairly good at diagnosing the bugs once informed of their symptoms. However, if I mischaracterized the symptoms, it would weigh my input too heavily and reject its own (correct) hunch about the root cause.

So it's an interesting one. There's definitely some latent capability in there that arguably exceeds Qwen 3.6, which is absolutely no small feat. But that capability seems to come in a somewhat erratic package.

It's probably worth benchmarking it unquantized if you can. I've grown to suspect that quantization damages small models more than perplexity and KL divergence accurately reflect.
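
To make the quantization point concrete, here is a minimal sketch of how an averaged metric can look fine while individual token choices still change. Everything in it is an assumption for illustration: the logits are made up, and `compare`, `REF`, and `QUANT` are hypothetical names, not output from Ornith or any real quantization run.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; q is clamped to avoid division by zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


def perplexity(target_probs):
    """exp of the mean negative log-likelihood of the correct tokens."""
    nll = -sum(math.log(p) for p in target_probs) / len(target_probs)
    return math.exp(nll)


def compare(ref_logits, quant_logits):
    """Per-position comparison of a reference model and a quantized one."""
    kls, flips = [], 0
    for ref, quant in zip(ref_logits, quant_logits):
        p, q = softmax(ref), softmax(quant)
        kls.append(kl_divergence(p, q))
        # Count positions where the most likely token changed.
        if p.index(max(p)) != q.index(max(q)):
            flips += 1
    n = len(kls)
    return {
        "mean_kl": sum(kls) / n,
        "max_kl": max(kls),
        "top1_flip_rate": flips / n,
    }


# Made-up logits over a 3-token vocabulary at 3 positions (illustrative only).
REF = [[4.0, 1.0, 0.0], [2.0, 1.9, 0.0], [3.0, 0.5, 0.2]]
QUANT = [[3.9, 1.1, 0.0], [1.9, 2.0, 0.0], [3.1, 0.4, 0.2]]

report = compare(REF, QUANT)
print(report)
```

In this toy data the mean KL stays small, yet the top token flips at one of the three positions, because the reference distribution there was nearly tied. Under greedy decoding that single flip changes the output, which is the kind of damage an averaged number can hide.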

I'll also give the 9B weights a shot when I can.

juliangoldsmith |root |parent |next |previous [-]

It looks like they're comparing Ornith 9B to Qwen 3.5 35B, not Qwen 3.6. I guess it kind of makes sense since it's a finetune of 3.5, but I totally missed it until I looked closely.

In my brief tests, Ornith 35B performed quite well. It won't replace DeepSeek V4 Flash for me, but if it was fast and cheap enough it might.

I don't remember being super impressed with Ornith 9B, but I could see it being on par with Qwen 3.5 35B.

Balinares |root |parent [-]

A 9B model with the capability of the 35B SOTA from last February is too good to be true, and a wild claim to make IMO even if there's a newer 35B SOTA. I'll need to make time to take it for a test drive and see how it holds up.

juliangoldsmith |root |parent [-]

The claim isn't so wild when it's a generalist versus a finetune trained specifically on the tasks being benchmarked.

chid |root |parent |previous [-]

I thought so too when I read the headline, but I expect it's basically Qwen3.5-9B.

nzach |previous [-]

Instead of training the model to answer questions directly, we trained the model to always write and execute the code that would solve the question?

If that is the case, isn't this just a fancy way to perform prompt optimization?