Minimal perfect hash functions.
Package Specification
This package provides a number of state-of-the-art implementations of static (i.e., immutable)
minimal perfect hash functions, and, more generally, of static functions from objects to integers.
The classes can be gathered in three broad groups:
- General functions (e.g., {@link it.unimi.dsi.sux4j.mph.MWHCFunction} and {@link it.unimi.dsi.sux4j.mph.TwoStepsMWHCFunction}). They can be used to associate arbitrary values with a set of objects,
so, in particular, they can be used to implement order-preserving minimal perfect hashing (elements are mapped
to the position in which they were provided, independently of their lexicographical order). They are also essential
building blocks for all other classes.
- Minimal perfect hash functions (e.g., {@link it.unimi.dsi.sux4j.mph.MinimalPerfectHashFunction}); they map a set
of n objects to the set n = { 0, 1, …, n − 1 }.
- Monotone minimal perfect hash functions (e.g., {@link it.unimi.dsi.sux4j.mph.LcpMonotoneMinimalPerfectHashFunction},
{@link it.unimi.dsi.sux4j.mph.PaCoTrieDistributorMonotoneMinimalPerfectHashFunction}, {@link it.unimi.dsi.sux4j.mph.HollowTrieMonotoneMinimalPerfectHashFunction}
and {@link it.unimi.dsi.sux4j.mph.HollowTrieDistributorMonotoneMinimalPerfectHashFunction}). These
functions require keys to be prefix-free and provided in lexicographical order; they map each key back to its position using
a very small number of bits per element, providing different space/time tradeoffs (in what follows,
l is the maximum string length):
- {@link it.unimi.dsi.sux4j.mph.LcpMonotoneMinimalPerfectHashFunction} is very fast, as it just has to evaluate three {@link it.unimi.dsi.sux4j.mph.MWHCFunction}s
(so if the length of the strings is a constant multiple of the machine word, it is actually constant time); however, it uses
2.14 + log log n + log l bits per element. {@link it.unimi.dsi.sux4j.mph.TwoStepsLcpMonotoneMinimalPerfectHashFunction} gains
a few bits by performing some additional compression, but it is usually slightly slower (albeit always constant time).
- {@link it.unimi.dsi.sux4j.mph.PaCoTrieDistributorMonotoneMinimalPerfectHashFunction} is slower, as it uses a {@linkplain it.unimi.dsi.sux4j.mph.PaCoTrieDistributor partial compacted trie}
(which requires linear time to be accessed) to distribute keys between buckets; theoretically it uses
2.14 + log(l - log n) bits per element, but the partial compacted trie is very efficient at exploiting data redundancy, so the actual
occupancy is in general about half that of the previous function.
- {@link it.unimi.dsi.sux4j.mph.HollowTrieMonotoneMinimalPerfectHashFunction} is rather slow as it has to traverse a succinct trie on the whole key set;
it uses just 4 + log l + log log l bits per element, and in practice it
is the monotone minimal perfect hash function that uses the least space.
- {@link it.unimi.dsi.sux4j.mph.HollowTrieDistributorMonotoneMinimalPerfectHashFunction} is slow, as it
uses a {@linkplain it.unimi.dsi.sux4j.mph.HollowTrieMonotoneMinimalPerfectHashFunction enriched hollow trie as a distributor}, but it is faster than a hollow trie,
and it has the (quite surprising) property of using 3.21 + 1.23 log log l bits per element (note the double log). In practice,
it will use less than a byte per element for strings of length up to a billion bits.
- {@link it.unimi.dsi.sux4j.mph.ZFastTrieDistributorMonotoneMinimalPerfectHashFunction} is faster than
{@link it.unimi.dsi.sux4j.mph.HollowTrieDistributorMonotoneMinimalPerfectHashFunction}, but occupies in practice more space,
even if, from an asymptotic viewpoint, the space required is the same. Presently it is mainly of theoretical interest, but
it has the best behaviour when scaling to very large data sets (billions of strings).
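The theoretical bounds quoted in the list above can be compared for a given key set with a few lines of arithmetic. The following is only a sketch: the class and method names are hypothetical, and it assumes that "log" denotes the base-2 logarithm, that n is the number of keys, and that l is the maximum key length in bits.

```java
/** Evaluates the bits-per-element formulas quoted above (hypothetical helper, not part of Sux4J). */
final class MonotoneSpaceBounds {
    private MonotoneSpaceBounds() {}

    /** Base-2 logarithm; the base is an assumption, as the text just writes "log". */
    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    /** LcpMonotoneMinimalPerfectHashFunction: 2.14 + log log n + log l. */
    static double lcp(long n, long l) {
        return 2.14 + log2(log2(n)) + log2(l);
    }

    /** PaCoTrieDistributorMonotoneMinimalPerfectHashFunction: 2.14 + log(l - log n). */
    static double paCoTrieDistributor(long n, long l) {
        return 2.14 + log2(l - log2(n));
    }

    /** HollowTrieMonotoneMinimalPerfectHashFunction: 4 + log l + log log l. */
    static double hollowTrie(long l) {
        return 4 + log2(l) + log2(log2(l));
    }

    /** HollowTrieDistributorMonotoneMinimalPerfectHashFunction: 3.21 + 1.23 log log l. */
    static double hollowTrieDistributor(long l) {
        return 3.21 + 1.23 * log2(log2(l));
    }

    public static void main(String[] args) {
        long n = 1L << 32; // number of keys
        long l = 1L << 8;  // maximum key length in bits
        System.out.println("LCP:                      " + lcp(n, l));
        System.out.println("PaCo trie distributor:    " + paCoTrieDistributor(n, l));
        System.out.println("Hollow trie:              " + hollowTrie(l));
        System.out.println("Hollow trie distributor:  " + hollowTrieDistributor(l));
    }
}
```

With n = 2^32 keys of at most l = 2^8 bits, the formulas evaluate to about 15.14, 9.95, 15.00 and 6.90 bits per key, respectively. These are theoretical figures only; as noted in the list, the occupancy observed in practice can differ considerably.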
{@link it.unimi.dsi.sux4j.mph.LcpMonotoneMinimalPerfectHashFunction} and {@link it.unimi.dsi.sux4j.mph.ZFastTrieDistributorMonotoneMinimalPerfectHashFunction}
were introduced by Djamal Belazzougui, Paolo Boldi, Rasmus Pagh and Sebastiano Vigna
in “Monotone Minimal Perfect Hashing: Searching a Sorted Table with O(1) Accesses”,
Proc. of the 20th Annual ACM–SIAM Symposium on Discrete Algorithms (SODA), ACM Press, 2009.
{@link it.unimi.dsi.sux4j.mph.TwoStepsLcpMonotoneMinimalPerfectHashFunction}, {@link it.unimi.dsi.sux4j.mph.PaCoTrieDistributorMonotoneMinimalPerfectHashFunction},
{@link it.unimi.dsi.sux4j.mph.HollowTrieMonotoneMinimalPerfectHashFunction} and {@link it.unimi.dsi.sux4j.mph.HollowTrieDistributorMonotoneMinimalPerfectHashFunction} were introduced
by the same authors in “Theory and Practise of Monotone Minimal Perfect Hashing” (the class {@link it.unimi.dsi.sux4j.mph.MWHCFunction} implements a compacted version
of the classical {@linkplain it.unimi.dsi.sux4j.mph.HypergraphSorter 3-hypergraph-based structure} introduced therein).
Usage
Functions in this package implement the {@link it.unimi.dsi.fastutil.objects.Object2LongFunction} interface. However,
the underlying machinery manipulates {@linkplain it.unimi.dsi.bits.BitVector bit vectors} only. To bring your own data
into the bit vector world, each constructor requires you to specify a {@linkplain it.unimi.dsi.bits.TransformationStrategy transformation strategy}
that maps your objects into bit vectors. For instance, {@link it.unimi.dsi.bits.TransformationStrategies#utf16()},
{@link it.unimi.dsi.bits.TransformationStrategies#prefixFreeUtf16()}, {@link it.unimi.dsi.bits.TransformationStrategies#iso()},
and {@link it.unimi.dsi.bits.TransformationStrategies#prefixFreeIso()} are ready-made strategies that can be used with character sequences.
Note that if you plan to use monotone hashing, you must provide objects in an order such that the corresponding bit vectors
are lexicographically ordered. For instance, {@link it.unimi.dsi.bits.TransformationStrategies#utf16()} obtains this
result by concatenating the reversed 16-bit representation of each character.
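To see why the bit order matters, the sketch below models a bit vector as a plain boolean[] and compares two layouts of the same characters. It is an illustration of the idea, not the code of {@link it.unimi.dsi.bits.TransformationStrategies}: all names are hypothetical, "reversed" is read here as "most significant bit first" (an assumption about the library's bit numbering), and the 16-zero-bit terminator used for the prefix-free variant is likewise an assumption.

```java
import java.util.Arrays;

/** Toy model of string-to-bits transformations (hypothetical, not the dsiutils implementation). */
final class Utf16BitsSketch {
    private Utf16BitsSketch() {}

    /** Emits the 16 bits of each character, most significant bit first. */
    static boolean[] msbFirst(CharSequence s) {
        boolean[] bits = new boolean[s.length() * 16];
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            for (int b = 0; b < 16; b++) bits[i * 16 + b] = ((c >>> (15 - b)) & 1) != 0;
        }
        return bits;
    }

    /** Emits the 16 bits of each character, least significant bit first (no reversal). */
    static boolean[] lsbFirst(CharSequence s) {
        boolean[] bits = new boolean[s.length() * 16];
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            for (int b = 0; b < 16; b++) bits[i * 16 + b] = ((c >>> b) & 1) != 0;
        }
        return bits;
    }

    /** Assumed prefix-free variant: appends 16 zero bits; the input must not contain NUL. */
    static boolean[] prefixFree(CharSequence s) {
        // Arrays.copyOf pads the extra 16 positions with false, i.e., zero bits.
        return Arrays.copyOf(msbFirst(s), s.length() * 16 + 16);
    }

    /** Lexicographic comparison of bit sequences: false sorts before true, a proper prefix sorts first. */
    static int compareBits(boolean[] x, boolean[] y) {
        int m = Math.min(x.length, y.length);
        for (int i = 0; i < m; i++) {
            if (x[i] != y[i]) return x[i] ? 1 : -1;
        }
        return Integer.compare(x.length, y.length);
    }

    /** Returns true if x is a prefix of y. */
    static boolean isPrefix(boolean[] x, boolean[] y) {
        if (x.length > y.length) return false;
        for (int i = 0; i < x.length; i++) {
            if (x[i] != y[i]) return false;
        }
        return true;
    }
}
```

In this model, compareBits(msbFirst("a"), msbFirst("b")) is negative, matching the order of the strings, whereas the same comparison on lsbFirst is positive, so sorted strings would no longer yield sorted bit vectors. Similarly, msbFirst("a") is a prefix of msbFirst("ab"), while prefixFree("a") is not a prefix of prefixFree("ab").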
Signing functions
All functions in this package will return a value in their range for most of the keys that are not in their domain. In other
words, they will produce false positives; in the few cases in which it is possible to detect a negative, you will get the default return
value.
If you are interested in getting a more precise behaviour (e.g., you are migrating from the deprecated SignedMinimalPerfectHash
class that was distributed with MG4J), you can sign a function, that is, you can record
a signature for each key and use it to filter false positives. A signing class for character sequences is provided
by the DSI utilities class ShiftAddXorSignedStringMap. By
creating a function using one of the implementations provided with Sux4J and signing it using the above class, you can obtain the same functionality as
the old signed classes, but you can choose the size of the signature, whether to require monotonicity, and also the space/time tradeoff
of your function. Alternatively, by signing with LiterallySignedStringMap
you will get a two-way function (i.e., a full StringMap implementation).
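The idea of signing can be sketched independently of the library. The code below is not ShiftAddXorSignedStringMap: every name is hypothetical, the unsigned function is simulated with a HashMap (the real structures do not store their keys), String.hashCode() stands in for the signature hash, and -1 is assumed as the default return value. Keys are assumed to be distinct and the key list non-empty.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy illustration of signing: one signature per key filters false positives. */
final class SignedFunctionSketch {

    /** Stand-in for an unsigned function: it always answers inside [0, n). */
    static final class ToyFunction {
        private final Map<String, Integer> position = new HashMap<>();
        private final int n;

        ToyFunction(List<String> keys) {
            for (String k : keys) position.put(k, position.size());
            n = keys.size();
        }

        long getLong(String key) {
            Integer p = position.get(key);
            // Unknown keys still receive an in-range answer, i.e., a false positive.
            return p != null ? p : Math.floorMod(key.hashCode(), n);
        }
    }

    private final ToyFunction function;
    private final long[] signature;
    private final long mask;

    /** Builds the signed function; width is the signature size in bits (1 to 32 in this sketch). */
    SignedFunctionSketch(List<String> keys, int width) {
        function = new ToyFunction(keys);
        mask = width >= 32 ? 0xFFFFFFFFL : (1L << width) - 1;
        signature = new long[keys.size()];
        // Record the signature of each key at the position the function assigns to it.
        for (String k : keys) signature[(int) function.getLong(k)] = sign(k);
    }

    private long sign(String key) {
        return key.hashCode() & mask;
    }

    /** Returns the position of the key, or -1 if the stored signature does not match. */
    long getLong(String key) {
        long i = function.getLong(key);
        return signature[(int) i] == sign(key) ? i : -1;
    }
}
```

A wider signature lowers the chance that a key outside the domain passes the check (roughly one in 2^width for a well-mixed hash), at the cost of width additional bits per key.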