Hacker News
Data Compression Explained (2012)
rurban
|next
[-]
Also, you could say the same for the related data search problem. How to prepare data, so that it can most efficiently searched. Smallest encoding vs fastest search. Databases are mostly very, very stupid compared to more data-specific tuned algorithms. Like factor 1000 slower and bigger.
dang
|next
|previous
[-]
Data Compression Explained (2011) - https://news.ycombinator.com/item?id=40631931 - June 2024 (1 comment)
Data Compression Explained - https://news.ycombinator.com/item?id=5931493 - June 2013 (14 comments)
Data Compression Explained by Matt Mahoney - https://news.ycombinator.com/item?id=1179242 - March 2010 (1 comment)
Bnjoroge
|next
|previous
[-]
usernametaken29
|next
|previous
[-]
adrian_b
|root
|parent
|next
[-]
However, I believe that is less useful to think that AI "finds an universal compression", than to think that the training of an AI model has the purpose to find a specific lossy data compression method, which is close to optimal for the input data that constitutes the training data set.
One could consider the training algorithm as a universal lossy data compression method, but this view is not useful in practice, because unlike a traditional lossy data compression algorithm, used e.g. for movie or picture compression, which you can use every day to compress many data sets, even in real time on an incoming data stream, the training of an AI model is a very long and expensive operation, which can be done only infrequently and which makes sense only for special important datasets, from which data will be frequently extracted by querying (i.e. AI inference) for a long time, to make worthwhile the compression, i.e. training, cost.
Moreover, for the best compression results the training of a new improved AI model does not consist only in determining the values of parameters (weights) of a fixed inference algorithm, but the structure of the inference algorithms is also tweaked for each new generation of models.
This is an additional reason that makes impractical to think about training as a universal compression algorithm (instead of a method for searching specific compression algorithms, which work for a given training set), because it is not a fixed algorithm, but a family of algorithms that evolves continuously, at least for now.
hyperpape
|root
|parent
|next
|previous
[-]
It's true that LLMs do something that looks very compression like in their weights, but it is lossy, and it has to be--if you're not lossy, you've overfitted the corpus, and that's bad. Post-training takes this even further, because you're not doing anything that looks like training on a specific corpus, you're exploring in a wider space of text. That text doesn't even concretely exist until you start exploring it.
I'm sure there must be a serious attempt to pursue this analogy that isn't just handwaving, but I haven't seen it.
miki123211
|root
|parent
[-]
You can use the fact that LLMs predict P(next token | existing tokens) to losslessly and efficiently compress arbitrary token sequences. This idea is closely related to arithmetic coding.
hyperpape
|root
|parent
[-]
Many things about the process are similar, so there's some analogy, but it just isn't the same.
fuglede_
|root
|parent
|next
|previous
[-]
briansm
|root
|parent
|next
|previous
[-]
A modern version of the book would include an extra section in the 'Lossy compression' chapter - 'Text' (alongside Images/Video/Audio) that would discuss LLM's.
eru
|root
|parent
[-]
An LLM can give you a probability distribution for the next token. You can pair that with arithmetic coding to get a lossless compression/decompression algorithm. See https://en.wikipedia.org/wiki/Arithmetic_coding
adrian_b
|root
|parent
[-]
In the latter applications, you do queries which aim to extract information from the training data set, but which may return hallucinated content instead of correct content.
If you use an LLM just to provide an estimation for the frequencies of tokens in an input data stream, and then you use the estimated frequencies to encode the input data, then you do not care about which were the tokens predicted by the LLM, because they are not used. The worst effect of any wrong predictions by the LLM is a slightly worse data compression ratio than the optimum.
When it is said that LLMs do a lossy data compression, that refers to the compression from the training data set to sequences of output tokens.
eru
|root
|parent
|next
|previous
[-]
woadwarrior01
|root
|parent
|next
[-]
[1]: https://arxiv.org/abs/2105.13626
energy123
|root
|parent
|previous
[-]
jeremyjh
|root
|parent
[-]
https://en.wikipedia.org/wiki/Lossless_compression#Limitatio...
firesteelrain
|next
|previous
[-]
wps
|next
|previous
[-]
NooneAtAll3
|previous
[-]
I remember hearing a lot about "compression is a lot about prediction", but I don't remember reading any practical result
Karliss
|root
|parent
|next
[-]
sltkr
|root
|parent
|previous
[-]
Since LLM are inherently token predictors, that makes using them for losless compression almost trivial. For something close to the state of the art see e.g. Fabrice Bellard (of course) ts_zip: https://bellard.org/ts_zip/
I think some of the confusion comes from the fact that there is a pretty big difference between the techniques employed by compressors that optimize compression ratio at the cost of nearly everything else, like ts_zip above, and practical tools that intend to balance compression ratio with limitation on CPU speed / memory, like zstd.
When optimizing for compression ratio, the prediction + entropy coding paradigm dominates. Practical tools, even modern ones like zstd, are mostly based around sliding window compression à la LZ77 (unzip/deflate), with the main selling point of more modern tools being that they scale up to larger window sizes and run really really fast. Some of these (like LZO) don't even have an entropy coding step to save time. zstd has both Huffman coding and FSE: Huffman coding is suboptimal but presumably it's an option because it's faster, and on lower compression levels it's preferable to be fast.
Anyway, the bottom line is: don't get confused between the state of the art in terms of compression ratio, and practical tools. Those are quite different things.