Hacker News
Show HN: Sup AI, a confidence-weighted ensemble (52.15% on Humanity's Last Exam)
I started working on this because no single AI model is right all the time, but different models' errors don't strongly correlate — each model tends to make unique mistakes relative to the others. So I run multiple models in parallel and synthesize the outputs, weighting segments by confidence. Low entropy in the output token probability distributions correlates with accuracy; high entropy is often where hallucinations begin.
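To make the entropy-weighting idea concrete, here's a minimal sketch (my own illustrative code, not Sup's implementation): compute the Shannon entropy of each token's probability distribution, map low entropy to high confidence, and prefer the candidate segment with the highest average confidence. The `max_entropy` clipping threshold is an assumed parameter.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one token's probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def confidence_weight(probs, max_entropy=3.0):
    """Map entropy to a [0, 1] weight: low entropy -> high confidence.
    max_entropy is a hypothetical clipping threshold, not a value from the post."""
    h = token_entropy(probs)
    return max(0.0, 1.0 - h / max_entropy)

def weighted_vote(segments):
    """Pick the candidate segment whose tokens have the highest mean confidence.
    segments: list of (text, list_of_token_probability_distributions)."""
    def score(item):
        _text, dists = item
        weights = [confidence_weight(d) for d in dists]
        return sum(weights) / len(weights)
    return max(segments, key=score)[0]
```

A peaked distribution like `[0.97, 0.01, 0.01, 0.01]` has low entropy and wins out over a flat `[0.25, 0.25, 0.25, 0.25]` one.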
My dad Scott (AI Research Scientist at TRI) is my research partner on this. He sends me papers at all hours, we argue about whether they actually apply and what modifications make sense, and then I build and test things. The entropy-weighting approach came out of one of those conversations.
In our eval on Humanity's Last Exam, Sup scored 52.15%. The best individual model in the same evaluation run got 44.74%. The relative gap is statistically significant (p < 0.001).
Methodology, eval code, data, and raw results:
- https://sup.ai/research/hle-white-paper-jan-9-2026
- https://github.com/supaihq/hle
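For anyone who wants to sanity-check the significance claim: with 1,369 questions, even an unpaired two-proportion z-test (the white paper presumably uses a stronger paired test, since both systems answered the same questions) puts the 52.15% vs 44.74% gap past p < 0.001. A back-of-envelope sketch:

```python
import math

def two_proportion_z(p1, p2, n1, n2):
    """Unpaired two-proportion z-test; conservative here because the
    underlying samples are actually paired (same question set)."""
    p = (p1 * n1 + p2 * n2) / (n1 + n2)               # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))   # pooled standard error
    z = (p1 - p2) / se
    # two-sided p-value from the normal CDF via erf
    pval = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, pval

z, pval = two_proportion_z(0.5215, 0.4474, 1369, 1369)
```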
Limitations:
- We evaluated 1,369 of the 2,500 HLE questions (details in the above links)
- Not all APIs expose token logprobs; we use several methods to estimate confidence when they don't
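When a provider doesn't expose logprobs, one common proxy — a sketch of one possible method, not necessarily what Sup actually uses — is self-consistency: sample the same prompt several times and take the agreement fraction of the modal answer as a confidence estimate.

```python
from collections import Counter

def self_consistency_confidence(answers):
    """Estimate confidence without logprobs: the fraction of sampled
    answers that agree with the most common answer. Normalization here
    (strip + lowercase) is a simplification for illustration."""
    if not answers:
        return 0.0
    counts = Counter(a.strip().lower() for a in answers)
    _top, n = counts.most_common(1)[0]
    return n / len(answers)
```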
We tried offering free access and it got abused so badly it nearly killed us. Right now the sustainable option is a $5 starter credit with card verification (no auto-charge). If you don't want to sign up, drop a prompt in the comments and I'll run it myself and post the result.
Try it at https://sup.ai. My dad Scott (@scottmu) is in the thread too. Would love blunt feedback, especially where this really works for you and where it falls short.
Here's a short demo video: https://www.youtube.com/watch?v=DRcns0rRhsg
scottmu
There is interesting research on the correlation of entropy with accuracy and hallucinations:
- https://www.nature.com/articles/s41586-024-07421-0
- https://arxiv.org/abs/2405.19648
- https://arxiv.org/abs/2509.04492 (when only a small number of probabilities are available, which is something we frequently deal with)
- https://arxiv.org/abs/2603.18940
- tons more; happy to chat about it if interested
mememememememo
Maybe this insight is why hallucinations feel much rarer to me in the last 12 months on top models. Are they being detected before they get sent out?
scottmu
Hallucinations may seem rarer for a few reasons. First, models have become more accurate on many prompts. Second, models are more convincing when they do hallucinate: they may get the overall idea right but hallucinate the details. Hallucinations are still a major problem and are fundamental to how modern LLMs work.
hello12343214
Also, discovering HLE was great... scrolling through some of the questions brings back memories of college organic chem.
algolint
supai
- Like OpenRouter, we measure the latency of the different providers to make sure we always get the fastest results
- Users can cancel a single model stream if it's taking too long
- The orchestrator is pretty good at choosing which models to use for which task. The confidence scoring and synthesis at the end is the hard part that you can't do naively, but the orchestrator plays the biggest role in optimizing cost and speed. I've made sure we don't exceed 25% extra in cost or time for the vast majority of queries, compared to equivalent prompts in ChatGPT/Gemini/etc.
This is viable, IMO, because you can run multiple less-intelligent models at lower thinking efforts and beat a single more-intelligent model at a large thinking effort. Reducing thinking effort speeds up each prompt dramatically.
The sequential steps are then:
1. Ensemble RAG
2. Orchestrator
3. Models in parallel
4. Synthesizer
Plus retries for low-confidence outputs (though that's pretty optimized, with selective retries of portions of the answer).
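The four sequential stages can be sketched roughly like this (all function names and the stub stages are illustrative stand-ins, not Sup's actual API):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the four stages described above.
def ensemble_rag(prompt):
    """Stage 1: attach retrieved context to the prompt."""
    return {"prompt": prompt, "context": "retrieved passages"}

def orchestrate(task):
    """Stage 2: pick which models to run for this task."""
    return ["model-a", "model-b", "model-c"]

def run_model(model, task):
    """Stage 3: one model call; confidence is a placeholder value."""
    return {"model": model, "answer": f"answer from {model}", "confidence": 0.8}

def synthesize(outputs, threshold=0.6):
    """Stage 4: keep high-confidence outputs (low-confidence ones would be
    selectively retried in the real system)."""
    kept = [o for o in outputs if o["confidence"] >= threshold]
    return " | ".join(o["answer"] for o in kept)

def pipeline(prompt):
    task = ensemble_rag(prompt)
    models = orchestrate(task)
    with ThreadPoolExecutor() as pool:      # stage 3 runs in parallel
        outputs = list(pool.map(lambda m: run_model(m, task), models))
    return synthesize(outputs)
```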
mememememememo
E.g., you get 3 replies at 80% confidence. At 80% you're fairly satisfied, but happy to wait 5 seconds for completion / 500 ms for time to first token. If either is breached, you return the current answer.
But if you're at 5%, you wait up to 60 s total / 2 s for a token, since the upside from that still-pending model is much higher.
Basically, wagering time for quality in a dynamic prediction market in front of the LLM.
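That policy can be sketched as a simple interpolation between the two anchor points above (only the endpoint numbers come from the comment; the linear schedule in between is my own assumption):

```python
def interp(c, c_lo, v_lo, c_hi, v_hi):
    """Linearly interpolate between (c_lo, v_lo) and (c_hi, v_hi), clamped."""
    c = min(max(c, c_lo), c_hi)
    t = (c - c_lo) / (c_hi - c_lo)
    return v_lo + t * (v_hi - v_lo)

def deadlines(confidence):
    """Wager time for quality: the lower the current confidence, the longer
    we wait for slower models. Anchors from the comment above:
    80% confidence -> 5 s total / 0.5 s to first token,
     5% confidence -> 60 s total / 2 s to first token."""
    total = interp(confidence, 0.05, 60.0, 0.80, 5.0)
    ttft = interp(confidence, 0.05, 2.0, 0.80, 0.5)
    return total, ttft
```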