Modern Perfect Hashing
adzm
Though honestly this post really needed some numbers and benchmarks.
Sesse__
The blog post was mainly for curious readers; I'm surprised it hit HN at all. :-)
eschaton
It’s a very worthwhile thing to keep in your back pocket if you’re going to deal with hashes whose keys are 100% knowable at runtime but not before.
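A minimal sketch of that runtime case in Python (hypothetical helper names, not from the post): for a small key set you can brute-force a seed that makes a salted hash collision-free over exactly the keys you just learned.

    import zlib

    def find_perfect_seed(keys, max_seed=1_000_000):
        """Search for a seed making crc32(key, seed) % n a bijection onto 0..n-1."""
        n = len(keys)
        for seed in range(max_seed):
            if len({zlib.crc32(k, seed) % n for k in keys}) == n:
                return seed  # n keys hit n distinct slots: a minimal perfect hash
        raise RuntimeError("no perfect seed found; widen the search or the table")

    # Keys only known at runtime, e.g. just parsed from a config file.
    keys = [b"GET", b"PUT", b"POST", b"HEAD", b"DELETE"]
    seed = find_perfect_seed(keys)
    slots = {k: zlib.crc32(k, seed) % len(keys) for k in keys}
    assert sorted(slots.values()) == list(range(len(keys)))

This brute-force search is only viable for small sets (the expected number of seeds tried grows roughly like e^n for a minimal table); practical constructions avoid the blow-up by first hashing keys into small buckets.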
hinkley
I remember Cliff Click presenting a hash table that was lockless and, if I'm recalling correctly, where capacity-growth operations could happen in parallel with reads. And possibly writes.
rurban
https://github.com/rurban/gperf/tree/autotools or some other branch. Forgot which.
https://github.com/rurban/cmph/tree/compressed_o (lotsa improvements)
https://github.com/rurban/pthash (compile to static C++)
I've also extended nbperf for those features.
rurban
They give the hash sizes without the needed ordering tables; they don't even produce those separate tables. Because academic competition. So in reality you'd need O(2n) more time and O(n) space. Without random key access it gets slow. Then there are no checks for false positives; you'd need fast random access to the keys for that. And they could not compile to static source code.
All this is pure fantasy for real products. gperf, nbperf are practical. They store the keys, and give a proper search function. They produce C code. pthash, ptrhash, cmph not. It's complicated because the keys can be in a file, a stream, a database, with just linear access or random access. I'm adding options to either store the ordering table, or emit an ordered keys file. And I compile to C or C++ code. And I allow several hash algos, not just random seeds. Jenkins hash is mostly good, but there are better, depending on the keys.
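A sketch of the false-positive point (hypothetical names, reusing the brute-forced-seed trick from above): a perfect hash maps any input to some slot, so a gperf-style lookup also stores the keys and compares on every query to reject unknown inputs.

    import zlib

    def make_lookup(keys, seed):
        n = len(keys)
        stored = [None] * n
        for k in keys:
            stored[zlib.crc32(k, seed) % n] = k

        def lookup(key):
            s = zlib.crc32(key, seed) % n
            # Without this comparison an alien key would silently alias a real one.
            return s if stored[s] == key else None

        return lookup

    keys = [b"GET", b"PUT", b"POST"]
    seed = next(s for s in range(10**6)
                if len({zlib.crc32(k, s) % len(keys) for k in keys}) == len(keys))
    lookup = make_lookup(keys, seed)
    assert lookup(b"GET") is not None and lookup(b"BREW") is None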
rob-p
The fact that minimal perfect hashes do not reject alien keys is not an "academic" limitation; it is implicit in the definition of the problem being solved. In fact, the relevant space bounds cannot be satisfied if you either (1) must reject elements not in the key set at the time of construction or (2) require that a prescribed order be assigned to the specific keys.
However, even given your misapprehensions about the purpose of minimal perfect hash functions, your claim that they are not useful in practice is simply false. You are describing uses for classical hash tables that store their key set explicitly and in its entirety; those serve an entirely different purpose from minimal perfect hash functions.
Minimal perfect hash functions are particularly useful when you know the key set ahead of time and are certain that only valid keys will be queried. In this case, the values can be stored in a flat array (in the order prescribed by the MPHF) and the cost of the hash itself is only ~2 bits/key. So the total storage is the minimal storage for the values + a small overhead to allow constant-time access to the value for a given key. If you need to reject alien keys, then you can of course store those as well. In that case, the MPHF costs the storage cost for the key set + value set + ~2 bits/key; still smaller than most existing dynamic hash tables.
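A minimal sketch of that flat-array pattern (`mphf` below is a hypothetical stand-in for any minimal perfect hash function over the fixed key set): values sit in a dense array at the index the MPHF assigns, no keys are stored, and queries are only safe if every queried key is known to be valid.

    def build_value_table(keys, values, mphf):
        table = [None] * len(keys)           # dense: exactly one slot per key
        for key, value in zip(keys, values):
            table[mphf(key)] = value
        return table

    def get(table, key, mphf):
        return table[mphf(key)]              # undefined for keys outside the set

    # Stand-in for a real MPHF, just to exercise the pattern.
    keys = ["apple", "banana", "cherry"]
    mphf = {k: i for i, k in enumerate(keys)}.__getitem__
    table = build_value_table(keys, [3, 1, 4], mphf)
    assert get(table, "banana", mphf) == 1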
In many cases, you don't need an ordering table. If you are allowed to permute the keys and all you require is O(1) access to the corresponding index for a key, then the additional ordering table is unnecessary overhead. On the other hand, if you need, for some reason, to remap the keys to be in a specific order, then this is easy simply by iterating over the keys and observing what indices they yield.
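If an externally prescribed order really is needed, a sketch of the remapping step described above (hypothetical names): iterate the keys once and record which index each one yields.

    def ordering_table(keys, mphf):
        """Map each MPHF slot to the key's rank in the desired external order."""
        order = [0] * len(keys)
        for rank, key in enumerate(keys):    # keys already in the desired order
            order[mphf(key)] = rank
        return order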
Many times, people wish to use MPHFs for non-static purposes (i.e. where the key set is not known at compile time). Consider e.g. in network mapping (https://dl.acm.org/doi/abs/10.1109/tnet.2018.2820067), or in taxonomic classification (https://academic.oup.com/bioinformatics/article/34/1/171/393...) or in k-mer indexing (https://academic.oup.com/bioinformatics/article/34/13/i169/5... or https://academic.oup.com/bioinformatics/article/38/Supplemen...). Moreover, MPHFs form the basis for near-optimal static filters, by storing fingerprints of the keys in the indexed spots (https://www.sciencedirect.com/science/article/pii/S0166218X1...).
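A sketch of that last fingerprint idea (hypothetical names; assuming an 8-bit fingerprint, so alien keys slip through with probability about 1/256): store a fingerprint of each key at the slot its MPHF assigns, and compare on query.

    import zlib

    keys = [b"apple", b"banana", b"cherry", b"damson"]
    n = len(keys)
    # Stand-in MPHF for a small set: brute-force a collision-free seed.
    seed = next(s for s in range(10**6)
                if len({zlib.crc32(k, s) % n for k in keys}) == n)
    mphf = lambda k: zlib.crc32(k, seed) % n
    fp = lambda k: zlib.crc32(k, 0xFEED) & 0xFF   # 8-bit fingerprint

    fps = bytearray(n)   # every slot gets filled, since the MPHF is minimal
    for k in keys:
        fps[mphf(k)] = fp(k)

    def maybe_contains(key: bytes) -> bool:
        """Never wrong for stored keys; wrong for aliens ~1/256 of the time."""
        return fps[mphf(key)] == fp(key)

    assert all(maybe_contains(k) for k in keys)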
Simply because a particular MPHF construction doesn't suit the problem you have in mind doesn't mean it's not useful or that it's "academic". MPHFs have plenty of uses, some without the need for a key-routing or key-reordering table. Other times one needs the keys in a particular order, and even then MPHFs are very useful: if the key set is static, the MPHF plus the reordering table is often still smaller than alternative dynamic hash table approaches while being similarly fast to query. They can be built in a parallel, external-memory, and cache-efficient manner.
The argument that MPHFs are "academic" and not useful because they don't adhere to constraints they were never intended to satisfy is like arguing that a hammer is not useful because you cannot use it to tighten Phillips-head screws.
rurban
With some algos you can export the ordering tables trivially, but nobody does. Because someone else's problem.
Algofish
If you do need to preserve key order, that's a different problem entirely, and one that's already well-studied. You'd use an order-preserving MPHF (OMPHF), not a standard MPHF. There's a large body of literature on this (e.g. "Order Preserving Minimal Perfect Hashing", VTechWorks).
You seem to conflate the two problems — building an efficient MPHF versus building an OMPHF. Both are useful, but for different goals. Dismissing something as “academic” because it solves a different problem efficiently isn’t a fair critique — it’s misunderstanding what it’s optimizing for. It’s like saying “Bloom filters aren’t practical because they don’t store the original data.”
hinkley
There are other hash tables that use double hashing but they don't guarantee a lack of collisions. They just make them highly improbable.
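For contrast, a minimal sketch of classic double hashing (not from the thread; names hypothetical): the second hash chooses the probe stride, so keys that collide on the first hash usually diverge on the next probe. Collisions remain possible; they're resolved, not ruled out.

    import zlib

    SIZE = 11  # prime, so any nonzero stride eventually visits every slot

    def probes(key: bytes):
        h1 = zlib.crc32(key) % SIZE
        h2 = 1 + zlib.crc32(key, 0xBEEF) % (SIZE - 1)   # stride in 1..SIZE-1
        for i in range(SIZE):
            yield (h1 + i * h2) % SIZE

    def insert(table, key, value):
        for slot in probes(key):
            if table[slot] is None or table[slot][0] == key:
                table[slot] = (key, value)
                return
        raise RuntimeError("table full; a real table would resize here")

    table = [None] * SIZE
    insert(table, b"GET", 1)
    insert(table, b"PUT", 2)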
anitil
It was just for fun, but in the end, even ignoring the startup costs of all of that, the default performance of Python's dictionary was better.