Hacker News
Fc, a lossless compressor for floating-point streams
pella
The new OpenZL SDDL2 (Simple Data Description Language) supports several floating-point types. It would be worthwhile to contribute some of the FC project's experience to OpenZL. The types OpenZL currently supports:
| Type            | Size    | Endian |
|-----------------|---------|--------|
| `Int8`          | 1 byte  | N/A    |
| `UInt8`         | 1 byte  | N/A    |
| `Int16LE/BE`    | 2 bytes | Yes    |
| `UInt16LE/BE`   | 2 bytes | Yes    |
| `Int32LE/BE`    | 4 bytes | Yes    |
| `UInt32LE/BE`   | 4 bytes | Yes    |
| `Int64LE/BE`    | 8 bytes | Yes    |
| `UInt64LE/BE`   | 8 bytes | Yes    |
| `Float16LE/BE`  | 2 bytes | Yes    |
| `Float32LE/BE`  | 4 bytes | Yes    |
| `Float64LE/BE`  | 8 bytes | Yes    |
| `BFloat16LE/BE` | 2 bytes | Yes    |
| `Bytes(n)`      | n bytes | N/A    |
Some links:
- https://github.com/facebook/openzl/releases/tag/v0.2.0
- https://openzl.org/getting-started/introduction/
userbinator
This is, for lack of a better term, a "metacompressor". It will be interesting to see which of the choices end up dominating; in my past experience with metacompression, one algorithm usually ends up consistently ahead.
whizzter
Floating point data is a mess to compress, but I think the idea here is to apply different transforms (and perhaps back-end codecs) on data and see if one fits the data so perfectly that you magically get a lot of compression.
Say you have audio with a sawtooth wave: it's a linear gradient, but if the peaks are "random" values like 1.245 and pi, the mantissa bits across the interpolated range will look fairly random to a classic compressor. This compressor, by contrast, can test for linear (or near-linear) gradient spans, store the gradient, and dump out only the "difference" bits for a regular compressor.
Or take 3D coordinates for (non-stripified) 3D models: plenty of repeating 8-byte doubles that look like garbage to a classic compressor and don't help it much. Building a float-aware dictionary and using it would easily bring the data down by quite a few percent.
(I don't agree with GP: one method might win out for certain workloads, but the idea here seems to be a pluggable utility that can give a wide range of developers something "for free".)
enduku
It is not trying to replace zstd or lz4. The idea is narrower: take blocks of doubles, try a set of float-specific predictors/transforms/coders, and emit whichever representation is smallest for that block.
It is aimed at time-series, scientific, simulation, and analytics data where the numbers often have structure: smooth curves, repeated values, fixed increments, periodic signals, predictable deltas, or low-entropy mantissas.
The API is intentionally small: `fc_enc`, `fc_dec`, a config struct, and a few counters to inspect which modes won. Decode is parallel and meant to be fast; encode spends more CPU searching for a better representation.
Current caveats: x86-64 only for now, tuned for IEEE-754 doubles, research-grade rather than production-hardened.
jiggawatts
Please run it through your preferred AI once or twice with instruction to look for bugs. The version of Fc in the main branch has at least a few memory safety bugs that attacker-controlled inputs could exploit.
I'd link a chat history but the tool I used has that feature blocked for some weird reason, and the locals round these parts don't take kindly to copy-pasted AI content...