Astro Hacker News - Matrix Orthogonalization Improves Memory in Recurrent Models

imurray |next [-]

Here is a pytorch optimizer that can maintain a matrix as orthogonal throughout optimization:

https://github.com/adrianjav/pogo — POGO: A Proximal One-step Geometric Orthoptimizer

https://arxiv.org/abs/2602.14656 — An Embarrassingly Simple Way to Optimize Orthogonal Matrices at Scale; Adrián Javaloy, Antonio Vergari

big-chungus4 |root |parent [-]

That's useful, but wouldn't help with this particular experiment because they orthogonalize activations, not weights

BirbSingularity |next |previous [-]

I can't help but think of orthogonal frequency-division multiplexing and its use in encoding data on multiple carrier frequencies, and it makes me wonder what other parallels we will discover between digital transmission technology and cross-domain stuff like this.

dapperdrake |root |parent |next [-]

Not even cross-domain. (Nor cross-co-domain.)

Trigonometric polynomials are also polynomials. And linear spaces are all "the same". That is what the definition is for. Even the transpose-mapping is linear.

chimpanzee2 |root |parent |next |previous [-]

I have this strange sensation that I can't put into words that somehow we are on the brink of unveiling an entirely new paradigm of AIs or perhaps even of combining AI with classical algorithms in a way to rapidly iterate between each other (and sensor data) that will instantly 10x or 100x current capabilities.

Anyone else feel this?

digdugdirk |root |parent |next [-]

I think part of it is the feeling of false understanding that comes from using llms regularly. They let you operate at a higher conceptual level, and they paper over enough of the actual details that your conceptual model might not actually be correct.

I'm a mechanical engineer by training, and have similar vibes with the similarities I see between llm training and metallurgy. I could probably put together a formal concept for these vibes at this point, but is there actually a "there" there? I have no idea. And it would take me years to actually dive in and learn everything to gain the deep understanding that would be required to know if I'm just experiencing my own brand of AI psychosis or not.

It's a brave new world, that's for sure.

seanhunter |root |parent [-]

Andrej Karpathy said something along the lines of “while you can use llms to outsource some of your thinking, you can’t use them to outsource your understanding”.

duped |root |parent |next |previous [-]

> that will instantly 10x or 100x current capabilities.

In the 1920s we had legions of very smart, highly trained people (arguably better trained in mathematics) basically chucking relays and vacuum tubes together with reckless abandon to build the most valuable and complicated systems mankind had ever come up with (telephony, radio, radar, etc). They had no idea how they worked and only ad-hoc rules of thumb to construct them.

It took the insight of a handful of these people both in and outside of industry to formalize the theory of operation of most of what people were already building and then use that theory to establish formal design practices.

The people before these theories were realized were exceptionally smart and good at what they did, it's just they didn't have better design tools to reason about the things they were building.

And once they had those tools they didn't 10x or 100x overnight.

cyanydeez |root |parent |previous [-]

no. we're approaching a sigmoid. AI is a bloated carcass and we're tweaking the size of the models and the speed they'll run at on smaller hardware.

I think to feel what you're feeling, you've bought into "all we need is more context". I think evolution demonstrates that's not really true.

geysersam |root |parent |next [-]

They said "there are algorithmic changes that remain to be discovered" and you said they bought into the idea that "all we need is more context". Seems like opposites to me.

chimpanzee2 |root |parent |previous [-]

would you really bet that this is it? there is nothing beyond this?

reminds me of the famous anecdote of a 19th century physics professor who said "there is nothing left to be discovered in physics, only minor corrections"

then came Einstein...

seanhunter |root |parent |next [-]

That wasn’t just a physics professor, that was William Thomson aka Lord Kelvin (the dude the temperature unit is named after and one of the most important mathematical physicists of the 19th century [1]), who also said that heavier than air flight was physically impossible only a couple of weeks before the Wright Brothers (and presumably in spite of having at least once in his lifetime seen a bird). Proof that you can be both very smart and simultaneously a bit of a jackass.

[1] https://en.wikipedia.org/wiki/Lord_Kelvin

cyanydeez |root |parent [-]

I love these arguments "You know, we thought we couldn't cross the ocean, and now we did!"

This means we can just jump over to mars, then explore other planets, etc, etc.

We know tons of regimes where there is non-continuous progress. Finding a smart dude with an anecdote does not invalidate the breadth and width of all human experience with non-continuous systems.

Some dude thought all fluid was newtonian, and then we discovered non-newtonian fluid. It does exactly what you don't expect. Which basically demos physics is complex but that still doesn't mean progress is fluid.

seanhunter |root |parent [-]

Definitely. It’s a lesson for me in remaining humble and not making too many confident predictions.

cyanydeez |root |parent [-]

while cute, that doesn't address the size and magnitude of the "AGI" and Singularity that AI proponents claim, and definitely not the person with anxiety that they're somehow going to be put into the "permanent underclass"

Another good line to look at is how people believe in ghosts: people with established religions without "ghosts" are less likely to believe in ghosts than people with atheism, even when they'd supposedly be skeptics of superstitious claims.

Having functional paradigms is important, and being confident that there isn't a magical extrapolation into AGI is healthier than there being some magical exponential increase that you have to ride the dragon.

Sorry man, we're not solipsistic here. There are reasonable beliefs that are justifiable, instructive and then there are ones that require cherry picking technology indistinguishable from magic without reference to reality and physics.

wwweston |root |parent [-]

I’ve totally lost the thread here but this is interesting.

Who are the religions without ghosts?

And is the overall point “just because we’ve made big leaps of progress doesn’t mean every challenge is tractable let alone a moonshot sprint away especially those we have solid theoretical limits on”? That’s a point I’m certainly amenable to; I just want to make sure I’m not missing something more subtle or sophisticated.

cyanydeez |root |parent |previous [-]

see, I don't need to "bet this"; the inverse is true: the people placing large bets are either going to get their AGI, or fail miserably.

I don't need to bet anything. I'm not a sociopath who thinks the AI god needs to be built, appeased, etc. That's the torment nexus.

So, it's pretty easy to see realistically if you are satisfied with local models and how they affect what you actually do.

I can see the POV of a software engineer that isn't specialized to any specific topic being replaced by various models.

But again, I see the sigmoid, not the "AGI" or the "this baby has grown very big in 1 year, surely it'll become a giant in 5".

hgoel |root |parent |previous [-]

I feel like this is an inverted interpretation? Transmission tech uses those methods because the math shows the desired properties.

Linear algebra is used everywhere, orthogonalization, SVD, eigenvalues etc are valuable because the resulting properties are very useful in many places.

BirbSingularity |root |parent [-]

Yea, I could have used a better word choice. I was thinking about the domains here in the generalized sense such as signal processing and wireless communication being applicable to the domain of artificial intelligence. In reality, you are correct that it's all tied together under the domain of applied maths or computer science.

hasley |next |previous [-]

I suspect that by "orthogonalization" they mean finding vectors that form an orthogonal basis (of the same subspace) for the vectors in the source matrix.

I wonder what would be the result if they used a matrix that is orthogonal and closest to the source matrix. Usually one uses the Frobenius norm (root of the sum of all squared matrix entries). Maybe, one could even try another norm that gives a sparser matrix.
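
For reference, the Frobenius-nearest orthogonal matrix has a closed form through the SVD: keep the singular vectors and replace every singular value with 1. A minimal NumPy sketch, assuming a full-rank input; the function name `nearest_orthogonal` and the random test matrix are made up for illustration:

```python
import numpy as np

def nearest_orthogonal(a):
    """Closest matrix with orthonormal columns to `a` in Frobenius norm (assumes full rank)."""
    # Thin SVD: a = u @ diag(s) @ vt
    u, _, vt = np.linalg.svd(a, full_matrices=False)
    # Dropping diag(s), i.e. setting every singular value to 1, gives the polar factor.
    return u @ vt

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 4))   # hypothetical "source matrix"
q = nearest_orthogonal(a)

print(np.allclose(q.T @ q, np.eye(4)))   # q is orthogonal
print(np.linalg.norm(a - q))             # Frobenius distance from a to q
```

Any other orthogonal matrix should sit at least as far from `a` as `q` does.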

aesthesia |root |parent |next [-]

The Newton-Schulz iteration they use approximates setting all singular values of the matrix to 1. That computes the nearest orthogonal matrix under the Frobenius norm.
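
A rough sketch of the idea, using the classic cubic Newton-Schulz update rather than whatever coefficients the article actually uses (that part is an assumption, as are the function name, step count and test matrix). Each step acts on the singular values only, pushing them toward 1 without ever computing an SVD:

```python
import numpy as np

def newton_schulz(a, steps=30):
    """Approximate the polar factor U @ V^T of `a` with the cubic Newton-Schulz iteration."""
    # Dividing by the Frobenius norm puts all singular values in (0, 1],
    # which is inside the convergence region (0, sqrt(3)) of this update.
    x = a / np.linalg.norm(a)
    for _ in range(steps):
        # Each singular value s is mapped to 1.5*s - 0.5*s**3, which moves it toward 1.
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

# Hypothetical test matrix with known singular values 3, 2, 1, 0.5.
rng = np.random.default_rng(0)
q1, _ = np.linalg.qr(rng.standard_normal((4, 4)))
q2, _ = np.linalg.qr(rng.standard_normal((4, 4)))
a = q1 @ np.diag([3.0, 2.0, 1.0, 0.5]) @ q2.T

x = newton_schulz(a)
u, _, vt = np.linalg.svd(a)
print(np.allclose(x, u @ vt))            # compare against the SVD-based answer
print(np.allclose(x.T @ x, np.eye(4)))   # result is orthogonal
```

Very small singular values grow by only about 1.5x per step at first, so badly conditioned inputs need more steps than this sketch uses.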

hasley |root |parent [-]

Interesting, thanks!

CamperBob2 |root |parent |previous [-]

3D graphics and kinematics people dodge the need for periodic orthonormalization by using quaternions. When they need a rotation matrix, they create it on demand rather than having to maintain it incrementally.

I wonder if there's a similar shortcut representation that we will eventually realize we should be using for ML. I suppose if there is one, it won't have native GPU support, so no one will bother looking for it.
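
To make the quaternion trick concrete, here is a minimal sketch assuming the scalar-first (w, x, y, z) convention; `quat_to_rotmat` is a made-up name. The only "re-orthonormalization" ever needed is normalizing a 4-vector, and the matrix built from it is orthogonal by construction:

```python
import numpy as np

def quat_to_rotmat(q):
    """Build a 3x3 rotation matrix on demand from a quaternion (w, x, y, z)."""
    # Normalizing the quaternion is cheap and removes any accumulated drift.
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

q = np.array([0.9, 0.1, -0.3, 0.2])   # arbitrary, deliberately not unit length
r = quat_to_rotmat(q)

print(np.allclose(r.T @ r, np.eye(3)))      # orthogonal
print(np.isclose(np.linalg.det(r), 1.0))    # proper rotation, no reflection
```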

phkahler |next |previous [-]

If it can be made orthogonal, can you go a step further and diagonalize it? The storage and performance improvement from that would be huge.

big-chungus4 |root |parent |next [-]

You can take the output of the matrix LSTM, which is going to be a matrix for each token, and compute the SVD. To get better storage, we want U and V to be the same for all tokens, so that we can operate on the diagonal S matrix. But the LSTM is likely highly nonlinear, so U and V will be vastly different for different tokens.

bee_rider |root |parent |next |previous [-]

I don’t know AI, but, weight matrices aren’t square in general, right? My first guess for something like this would be to take the SVD instead, since you can always do that, but I’m sure that’s been tried already.

phkahler |root |parent [-]

But orthogonal matrices are square.

bee_rider |root |parent [-]

Ah, good point. I assumed they were coming up with some orthogonal basis vectors for a (potentially) non-square matrix. Edit: actually I’m not sure, this Newton-Schulz process seems to work for non-square matrices as well. Generally I see “orthogonalization” refer to the process of coming up with those orthogonal basis vectors but it could be a domain-specific lingo thing.

impossiblefork |root |parent |previous [-]

I wouldn't say that making the matrix diagonal in some basis is some further step.

If we have a singular value decomposition, M=USV^*, the columns of U are orthonormal (so linearly independent) and form a basis for the space M maps things into, the columns of V are likewise a basis for the space it maps things from, and with B and B' denoting these two bases, [M]_{BB'} = S.
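
A quick numerical check of that statement, as a sketch on a made-up real 5x3 matrix (so the conjugate transpose is just the transpose). Changing basis with U on the output side and V on the input side leaves exactly the rectangular diagonal matrix S:

```python
import numpy as np

rng = np.random.default_rng(0)
m = rng.standard_normal((5, 3))   # hypothetical non-square matrix

# Full SVD: m = u @ s_full @ vh, with u 5x5 and vh 3x3.
u, s, vh = np.linalg.svd(m, full_matrices=True)

# Express m in the bases formed by the columns of V (inputs) and U (outputs).
m_in_bases = u.conj().T @ m @ vh.conj().T

# The expected result: singular values on the diagonal, zeros elsewhere.
s_full = np.zeros((5, 3))
s_full[:3, :3] = np.diag(s)

print(np.allclose(m_in_bases, s_full))
```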

harveyrook |next |previous [-]

Now I’m wondering what is the eigenspace of an LLM? If I take a set of LLM’s with the same number of parameters, then what are the eigenvectors? Do they have different personalities?

bee_rider |root |parent [-]

Neural networks are non-linear, so I think you wouldn’t be able to compute typical eigenvalues. You could compute the eigenvalues and/or singular values of the individual weight matrices (I’m sure this has been studied). SVDs are very conventional for making low-rank approximations, so it must have been studied.

The concept of nonlinear eigenvalues exists, but it is a bit more exotic.
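
As an illustration of the low-rank use of the SVD, a minimal sketch on a random stand-in for a weight matrix (the shape, the rank `k` and the name `low_rank` are all made up). Keeping the top `k` singular triplets gives the best rank-`k` approximation in Frobenius norm, and the error is exactly the energy in the discarded singular values:

```python
import numpy as np

def low_rank(w, k):
    """Best rank-k approximation of `w` in Frobenius norm via truncated SVD."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    # Scale the first k left singular vectors by their singular values, then recombine.
    return (u[:, :k] * s[:k]) @ vt[:k, :]

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 32))   # stand-in for a weight matrix, not a real model
k = 8
w_k = low_rank(w, k)

s = np.linalg.svd(w, compute_uv=False)
err = np.linalg.norm(w - w_k)
# Eckart-Young: the error equals the root of the sum of squared discarded singular values.
print(np.isclose(err, np.sqrt(np.sum(s[k:] ** 2))))
```

Storing the two factors takes 64*8 + 8*32 numbers instead of 64*32, which is where the compression comes from.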

dapperdrake |root |parent [-]

I saw a presentation about this in 2022.

Someone found a way to get "something like" a tri-diagonal matrix that was equivalent to the LLM they were studying in 2022.

Apologies for being informal and hand-wavey. Been a long time and I probably forgot a few important points.
