Hacker News
Universal Reasoning Model (53.8% pass@1 on ARC-1 and 16.0% on ARC-2)
marojejian
Decent comment via x: https://x.com/r0ck3t23/status/2002383378566303745
I continue to be fascinated by these architectures that:
- Build recurrence / inference-time scaling into transformers more natively.
- Don't use full recurrent gradient traces, and succeed not just despite that, but because of it.
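A minimal sketch of one way to skip the full recurrent gradient trace (my own toy names, in the spirit of the HRM/TRM-style truncated training used by this family of models, not necessarily this paper's exact recipe): run most recurrence steps without building a graph and backpropagate only through the final step.

    import torch
    import torch.nn as nn

    class RecurrentBlock(nn.Module):
        """Toy weight-tied block applied repeatedly to a latent state (hypothetical)."""
        def __init__(self, dim: int):
            super().__init__()
            self.ff = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

        def forward(self, z, x):
            # One refinement step: update the latent z conditioned on the input embedding x.
            return z + self.ff(z + x)

    def truncated_recurrence(block, x, n_steps=8):
        z = torch.zeros_like(x)
        # Most iterations run without autograd: no full backprop-through-time.
        with torch.no_grad():
            for _ in range(n_steps - 1):
                z = block(z, x)
        # Only the final step contributes gradients (a one-step truncation).
        return block(z.detach(), x)

    block = RecurrentBlock(64)
    x = torch.randn(2, 10, 64)
    out = truncated_recurrence(block, x)
    out.pow(2).mean().backward()  # gradients flow through the last step only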
amluto
But suppose an extra attention head were added that queried the KV data from lower layers. At the very least, I imagine this might cleanly solve the STRAWBERRY problem: whatever layer has figured out that the prompt wants to count instances of R could attend to lower layers that actually perceive those Rs.
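A rough sketch of that idea under my own assumptions about shapes and naming (not from the paper or any existing model): an extra head in a later layer whose keys and values come from a lower layer's hidden states, so the later layer can look back at representations closer to the raw tokens.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CrossLayerHead(nn.Module):
        """Attention head that queries from a late layer's state but reads
        keys/values from an earlier layer's state (hypothetical)."""
        def __init__(self, dim, head_dim=64):
            super().__init__()
            self.q = nn.Linear(dim, head_dim)
            self.k = nn.Linear(dim, head_dim)
            self.v = nn.Linear(dim, head_dim)
            self.out = nn.Linear(head_dim, dim)

        def forward(self, h_late, h_early):
            q = self.q(h_late)                       # queries from the current (late) layer
            k, v = self.k(h_early), self.v(h_early)  # keys/values from a lower layer
            attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
            return h_late + self.out(attn @ v)       # residual update with the retrieved info

    head = CrossLayerHead(256)
    h_early = torch.randn(1, 12, 256)   # e.g. layer-2 activations
    h_late = torch.randn(1, 12, 256)    # e.g. layer-20 activations
    print(head(h_late, h_early).shape)  # torch.Size([1, 12, 256])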
yorwba
None of this helps with the strawberry problem, where the very first layer already gets a tokenized representation, so there is no layer that "actually perceives those Rs."
cainxinth
nl
The “Rs in strawberry” problem is literally "count the token R" in the word "strawberry".
One could argue that the learnt tokenization model, where it gets split into 3 tokens (see https://platform.openai.com/tokenizer), is problematic, but one solution is to double down on that and learn tokenization as part of the end-to-end training instead of separately.
If you mean that the problem is the current tokenization model being entirely fixed, then I agree.
(I'm not entirely sure how multi-modal models function in this regard - they must have some idea of the bytestream, but I'm not familiar enough with that to comment intelligently.)
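For reference, a quick way to see the split being discussed (using the open tiktoken library as a stand-in; the exact pieces depend on which encoding you pick, so treat this as illustrative):

    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # a GPT-3.5/4-era encoding
    ids = enc.encode("strawberry")
    pieces = [enc.decode_single_token_bytes(i).decode() for i in ids]
    print(ids, pieces)
    # The model only ever sees these subword IDs, never individual letters,
    # which is why "count the Rs" is awkward for it.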
idiotsecant
krackers
Isn't this sort of similar to latent looping? E.g. [1]. But actually as [2] argues, even that wasn't a good experiment because it used the very last hidden state, which is too close to the logits and loses most of the rich embedding structure. Perhaps you don't even need access to the state of anything except the penultimate hidden layer, since based on my vague reading of [3] the residual stream doesn't "lose information" as it passes deeper down the attention layers, so each block maybe manipulates a different subspace of the residual stream.
[1] https://arxiv.org/abs/2412.06769
[2] https://snimu.github.io/2025/03/30/multi-layer-language-head...
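A very rough sketch of the latent-looping idea in [1] as I understand it (toy module names, not the paper's code): instead of sampling a token, the final hidden state is fed back in place of the next token's embedding for a few "thought" steps.

    import torch
    import torch.nn as nn

    class ToyLM(nn.Module):
        """Stand-in for a decoder-only LM; the GRU is a placeholder backbone."""
        def __init__(self, vocab=1000, dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab, dim)
            self.backbone = nn.GRU(dim, dim, batch_first=True)
            self.lm_head = nn.Linear(dim, vocab)

        def hidden_states(self, embs):
            out, _ = self.backbone(embs)
            return out  # (batch, seq, dim) final-layer hidden states

    def latent_loop(model, prompt_ids, n_thoughts=4):
        embs = model.embed(prompt_ids)                     # start from token embeddings
        for _ in range(n_thoughts):
            h_last = model.hidden_states(embs)[:, -1:, :]  # last position's hidden state
            embs = torch.cat([embs, h_last], dim=1)        # feed it back as the next "embedding"
        return model.lm_head(model.hidden_states(embs)[:, -1, :])  # predict after "thinking"

    model = ToyLM()
    prompt = torch.randint(0, 1000, (1, 5))
    print(latent_loop(model, prompt).shape)  # torch.Size([1, 1000])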
amluto
I imagine that conventional transformers kind of force this. If you train a transformer such that it needs to learn the ability to do tasks like “Repeat the following words: apple banana cat” then the model is sort of forced to internally propagate the input far enough along to be able to perform the task. But maybe if you pre-trained from scratch with an architecture where later layers get direct access to earlier layers and/or the raw input, then the model wouldn’t need to propagate information.
Or maybe it would all fall apart and something would go wrong with the gradients.
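A sketch of the kind of architecture being imagined (purely hypothetical, my own naming): give every block a direct path to the raw token embeddings, so nothing has to be carried forward "by hand" through the residual stream.

    import torch
    import torch.nn as nn

    class BlockWithInputAccess(nn.Module):
        """Transformer-style block that also sees the raw input embeddings (hypothetical)."""
        def __init__(self, dim):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
            self.reinject = nn.Linear(dim, dim)  # direct path from the raw embeddings
            self.ff = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

        def forward(self, h, x_embed):
            h = h + self.attn(h, h, h, need_weights=False)[0]
            h = h + self.reinject(x_embed)       # later layers never lose the raw input
            return h + self.ff(h)

    dim = 128
    blocks = nn.ModuleList([BlockWithInputAccess(dim) for _ in range(6)])
    x_embed = torch.randn(1, 16, dim)  # raw token embeddings
    h = x_embed
    for blk in blocks:
        h = blk(h, x_embed)
    print(h.shape)  # torch.Size([1, 16, 128])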
andy12_
Hard to say whether something scales or not from a couple dozen million parameters to an actual billion-sized model, but I have the impression that the nature of the residual stream and its high dimensionality allows any layer to access information from previous layers if the transformer needs it.
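A back-of-the-envelope illustration of that intuition (toy numbers, not anything from the paper): in a wide residual stream the contributions each layer adds are nearly orthogonal to the original embedding, so a later layer can still read the early information out with a simple projection.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n_layers = 4096, 48               # roughly LLM-scale width and depth (assumed)

    x0 = rng.standard_normal(d)          # "early" information written into the stream
    stream = x0.copy()
    for _ in range(n_layers):
        stream += rng.standard_normal(d)  # each layer adds its own contribution

    # A later layer "reading out" x0 by projecting onto its direction:
    direction = x0 / np.linalg.norm(x0)
    print(f"true norm {np.linalg.norm(x0):.1f}, recovered {stream @ direction:.1f}")
    # In high dimensions the interference from the other layers stays relatively small,
    # so the early signal survives.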
ImHereToVote
I feel simple transformers simply don't get access to those modalities that a human would use. I can't use my "talking" centers to count letters in words either.
You just need to pay attention to notice that you don't use your language skills to count letters.
Moosdijk
Instead of big models that “brute force” the right answer by knowing a lot of possible outcomes, this model seems to come to results with less knowledge but more wisdom.
Kind of like the difference between having a database of most possible frames in a video game and blending between them, versus actually rendering the scene.
omneity
ctoa
The notion of a context window applies to the sequence, and the recurrence doesn't really affect that: each iteration sees and attends over the whole sequence.
omneity
> UTs combine the parallelizability and global receptive field of feed-forward sequence models like the Transformer with the recurrent inductive bias of RNNs.
Very interesting, it seems to be an “old” architecture that is only now being leveraged to a promising extent. Curious what made it an active area again (with the work from Samsung and Sapient and now this one); perhaps diminishing returns on regular transformers?
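For anyone skimming, the recurrence in that quote is over depth, not over the sequence: one weight-tied layer is applied repeatedly to the whole sequence, which is also why, as noted above, the context window is unaffected. A minimal sketch (my own toy layer, not the UT paper's code, which also adds per-step embeddings and adaptive halting):

    import torch
    import torch.nn as nn

    class UniversalTransformerToy(nn.Module):
        """One weight-tied transformer layer applied n_steps times over the full sequence."""
        def __init__(self, dim=128, n_steps=6):
            super().__init__()
            self.layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            self.n_steps = n_steps

        def forward(self, x):
            h = x
            for _ in range(self.n_steps):  # recurrence in depth: same weights each step
                h = self.layer(h)          # every step still attends over every position
            return h

    model = UniversalTransformerToy()
    x = torch.randn(2, 32, 128)
    print(model(x).shape)  # torch.Size([2, 32, 128])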
mysterEFrank
in-silico
Here you go: https://arxiv.org/abs/2502.05171