Skip to content

Instantly share code, notes, and snippets.

@bool-dev
Forked from badboy/inthash.md
Created November 12, 2016 13:29
Show Gist options
  • Save bool-dev/d7f6a0d5c9adeb7d81c3f5dc5127887a to your computer and use it in GitHub Desktop.
Save bool-dev/d7f6a0d5c9adeb7d81c3f5dc5127887a to your computer and use it in GitHub Desktop.

Revisions

  1. @badboy badboy revised this gist Aug 19, 2013. 1 changed file with 4 additions and 0 deletions.
    4 changes: 4 additions & 0 deletions inthash.md
    Original file line number Diff line number Diff line change
    @@ -1,3 +1,7 @@
    Original link: http://www.concentric.net/~Ttwang/tech/inthash.htm
    Taken from: http://web.archive.org/web/20071223173210/http://www.concentric.net/~Ttwang/tech/inthash.htm
    Reformatted using pandoc

    Integer Hash Function
    =====================

  2. @badboy badboy revised this gist Aug 19, 2013. 1 changed file with 0 additions and 4 deletions.
    4 changes: 0 additions & 4 deletions inthash.md
    Original file line number Diff line number Diff line change
    @@ -343,8 +343,4 @@ improve the quality of a hash value.

    * * * * *

    [![HOME](/web/20071223173210im_/http://www.concentric.net/~Ttwang/homeicon.gif)](/web/20071223173210/http://www.concentric.net/~Ttwang/index.html)
    \
    \

    Give feedbacks to [[email protected]](mailto:[email protected])
  3. @badboy badboy renamed this gist Aug 19, 2013. 1 changed file with 0 additions and 0 deletions.
    File renamed without changes.
  4. @badboy badboy created this gist Aug 19, 2013.
    350 changes: 350 additions & 0 deletions gistfile1.txt
    Original file line number Diff line number Diff line change
    @@ -0,0 +1,350 @@
    Integer Hash Function
    =====================

    Thomas Wang, Jan 1997

    last update Mar 2007

    Version 3.1

    * * * * *

    Abstract
    --------

    An integer hash function accepts an integer hash key, and returns an
    integer hash result with uniform distribution. In this article, we will
    be discussing the construction of integer hash functions.

    Introduction
    ------------

    Hash table is an important data structure. All elementary data structure
    text books contain some algorithms of hash table. However, all too often
    the treatment of hash function is discussed as an after-thought. Thus
    examples abound in systems where the poor choice of the hash function
    resulted in inferior performance.

    Certainly the integer hash function is the most basic form of the hash
    function. The integer hash function transforms an integer hash key into
    an integer hash result. For a hash function, the distribution should be
    uniform. This implies when the hash result is used to calculate hash
    bucket address, all buckets are equally likely to be picked. In
    addition, similar hash keys should be hashed to very different hash
    results. Ideally, a single bit change in the hash key should influence
    all bits of the hash result.

    Hash Function Construction Principles
    -------------------------------------

    A good mixing function must be reversible. A hash function has form h(x)
    -\> y. If the input word size and the output word size are identical,
    and in addition the operations in h() are reversible, then the following
    properties are true.

    1. If h(x1) == y1, then there is an inverse function h\_inverse(y1) == x1
    2. Because the inverse function exists, there cannot be a value x2 such that x1 != x2, and h(x2) == y1.

    The case of h(x1) == y1, and h(x2) == y1 is called a collision. Using
    only reversible operations in a hash function makes collisions
    impossible. There is an one-to-one mapping between the input and the
    output of the mixing function.

    Beside reversibility, the operations must use a chain of computations to
    achieve avalanche. Avalanche means that a single bit of difference in
    the input will make about 1/2 of the output bits be different. At a
    point in the chain, a new result is obtained by a computation involving
    earlier results.

    For example, the operation a = a + b is reversible if we know the value
    of 'b', and the after value of 'a'. The before value of 'a' is obtained
    by subtracting the after value of 'a' with the value of 'b'.

    Knuth's Multiplicative Method
    -----------------------------

    In Knuth's "The Art of Computer Programming", section 6.4, a
    multiplicative hashing scheme is introduced as a way to write hash
    function. The key is multiplied by the golden ratio of 2\^32
    (2654435761) to produce a hash result.

    Since 2654435761 and 2\^32 has no common factors in common, the
    multiplication produces a complete mapping of the key to hash result
    with no overlap. This method works pretty well if the keys have small
    values. Bad hash results are produced if the keys vary in the upper
    bits. As is true in all multiplications, variations of upper digits do
    not influence the lower digits of the multiplication result.

    Robert Jenkins' 96 bit Mix Function
    -----------------------------------

    [Robert Jenkins](/web/20071223173210/http://www.burtleburtle.net/bob/hash/doobs.html)
    has developed a hash function based on a sequence of subtraction,
    exclusive-or, and bit shift.

    All the sources in this article are written as Java methods, where the
    operator '\>\>\>' represents the concept of unsigned right shift. If the
    source were to be translated to C, then the Java 'int' data type should
    be replaced with C 'uint32\_t' data type, and the Java 'long' data type
    should be replaced with C 'uint64\_t' data type.

    The following source is the mixing part of the hash function.

    int mix(int a, int b, int c)
    {
    a=a-b; a=a-c; a=a^(c >>> 13);
    b=b-c; b=b-a; b=b^(a << 8);
    c=c-a; c=c-b; c=c^(b >>> 13);
    a=a-b; a=a-c; a=a^(c >>> 12);
    b=b-c; b=b-a; b=b^(a << 16);
    c=c-a; c=c-b; c=c^(b >>> 5);
    a=a-b; a=a-c; a=a^(c >>> 3);
    b=b-c; b=b-a; b=b^(a << 10);
    c=c-a; c=c-b; c=c^(b >>> 15);
    return c;
    }

    Variable 'c' contains the input key. When the mixing is complete,
    variable 'c' also contains the hash result. Variable 'a', and 'b'
    contain initialized random bits. Notice the total number of internal
    state is 96 bits, much larger than the final output of 32 bits. Also
    notice the sequence of subtractions rolls through variable 'a' to
    variable 'c' three times. Each row will act on one variable, mixing in
    information from the other two variables, followed by a shift operation.

    Subtraction is similar to multiplication in that changes in upper bits
    of the key do not influence lower bits of the addition. The 9 bit shift
    operations in Robert Jenkins' mixing algorithm shifts the key to the
    right 61 bits in total, and shifts the key to the left 34 bits in total.
    As the calculation is chained, each exclusive-or doubles the number of
    states. There are at least 2\^9 different combined versions of the
    original key, shifted by various amounts. That is why a single bit
    change in the key can influence widely apart bits in the hash result.

    The uniform distribution of the hash function can be determined from the
    nature of the subtraction operation. Look at a single bit subtraction
    operation between a key, and a random bit. If the random bit is 0, then
    the key remains unchanged. If the random bit is 1, then the key will be
    flipped. A carry will occur in the case where both the key bit and the
    random bit are 1. Subtracting the random bits will cause about half of
    the key bits to be flipped. So even if the key is not uniform,
    subtracting the random bits will result in uniform distribution.

    32 bit Mix Functions
    --------------------

    Based on an original suggestion on Robert Jenkin's part in 1997, I have
    done some research for a version of the integer hash function. This is
    my latest version as of January 2007. The specific value of the bit
    shifts are obtained from running the accompanied search program.

    public int hash32shift(int key)
    {
    key = ~key + (key << 15); // key = (key << 15) - key - 1;
    key = key ^ (key >>> 12);
    key = key + (key << 2);
    key = key ^ (key >>> 4);
    key = key * 2057; // key = (key + (key << 3)) + (key << 11);
    key = key ^ (key >>> 16);
    return key;
    }

    (\~x) + y is equivalent to y - x - 1 in two's complement representation.

    By taking advantages of the native instructions such as 'add
    complement', and 'shift & add', the above hash function runs in 11
    machine cycles on HP 9000 workstations.

    Having more rounds will strengthen the hash function by making the
    result more random looking, but performance will be slowed down
    accordingly. Simulation seems to prefer small shift amounts for inner
    rounds, and large shift amounts for outer rounds.

    Robert Jenkins' 32 bit integer hash function
    --------------------------------------------

    uint32_t hash( uint32_t a)
    {
    a = (a+0x7ed55d16) + (a<<12);
    a = (a^0xc761c23c) ^ (a>>19);
    a = (a+0x165667b1) + (a<<5);
    a = (a+0xd3a2646c) ^ (a<<9);
    a = (a+0xfd7046c5) + (a<<3);
    a = (a^0xb55a4f09) ^ (a>>16);
    return a;
    }

    This version of integer hash function uses operations with integer
    constants to help producing a hash value. I suspect the actual values of
    the magic constants are not very important. Even using 16 bit constants
    may still work pretty well.

    These magic constants open up the construction of perfect integer hash
    functions. A test program can vary the magic constants until a set of
    perfect hashes are found.

    Using Multiplication for Hashing
    --------------------------------

    Using multiplication requires a mechanism to transport changes from high
    bit positions to low bit positions. Bit reversal is best, but is slow to
    implement. A viable alternative is left shifts.

    Using multiplication presents some sort of dilemma. Certain machine
    platforms supports integer multiplication in hardware, and an integer
    multiplication can be completed in 4 or less cycles. But on some other
    platforms an integer multiplication could take 8 or more cycles to
    complete. On the other hand, integer hash functions implemented with bit
    shifts perform equally well on all platforms.

    A compromise is to multiply the key with a 'sparse' bit pattern, where
    on machines without fast integer multiplier they can be replaced with a
    'shift & add' sequence. An example is to multiply the key with (4096 + 8
    + 1), with an equivalent expression of (key + (key \<\< 3)) + (key \<\<
    12).

    On most machines a bit shift of 3 bits or less, following by an addition
    can be performed in one cycle. For example, Pentium's 'lea' instruction
    can be used to good effect to compute a 'shift & add' in one cycle.

    Function hash32shiftmult() uses a combination of bit shifts and integer
    multiplication to hash the input key.

    public int hash32shiftmult(int key)
    {
    int c2=0x27d4eb2d; // a prime or an odd constant
    key = (key ^ 61) ^ (key >>> 16);
    key = key + (key << 3);
    key = key ^ (key >>> 4);
    key = key * c2;
    key = key ^ (key >>> 15);
    return key;
    }

    64 bit Mix Functions
    --------------------

    public long hash64shift(long key)
    {
    key = (~key) + (key << 21); // key = (key << 21) - key - 1;
    key = key ^ (key >>> 24);
    key = (key + (key << 3)) + (key << 8); // key * 265
    key = key ^ (key >>> 14);
    key = (key + (key << 2)) + (key << 4); // key * 21
    key = key ^ (key >>> 28);
    key = key + (key << 31);
    return key;
    }

    The longer width of 64 bits require more mixing than the 32 bit version.

    64 bit to 32 bit Hash Functions
    -------------------------------

    One such use for this kind of hash function is to hash a 64 bit virtual
    address to a hash table index. Because the output of the hash function
    is narrower than the input, the result is no longer one-to-one.

    Another usage is to hash two 32 bit integers into one hash value.

    public int hash6432shift(long key)
    {
    key = (~key) + (key << 18); // key = (key << 18) - key - 1;
    key = key ^ (key >>> 31);
    key = key * 21; // key = (key + (key << 2)) + (key << 4);
    key = key ^ (key >>> 11);
    key = key + (key << 6);
    key = key ^ (key >>> 22);
    return (int) key;
    }

    Parallel Operations
    -------------------

    If a CPU can dispatch multiple instructions in the same clock cycle, one
    can consider adding more parallelism in the formula.

    For example, for the following formula, the two shift operations can be
    performed in parallel. On certain platforms where there are multiple
    ALUs but a single shifter unit, this idea does not offer a speed
    increase.

    key ^= (key << 17) | (key >>> 16);

    For 32 bit word operations, only certain pairs of shift amounts will be
    reversible. The valid pairs include: (17,16) (16,17) (14,19) (19,14)
    (13,20) (20,13) (10,23) (23,10) (8,25) (25,8)

    Multiplication can be computed in parallel. Any multiplication with odd
    number is reversible.

    key += (key << 3) + (key << 9); // key = key * (1 + 8 + 512)

    On certain machines, bit rotation can be performed in one cycle. Any odd
    number bits rotation can be made reversible when exclusive-or is applied
    to the un-rotated key with one particular bit set to 1 or 0.

    key = (key | 64) ^ ((key >>> 15) | (key << 17));

    However, on certain machine and compiler combinations, this code
    sequence can run as slow as 4 cycles. 2 cycles: a win, 3 cycles: tie,
    more than 3 cycles: a loss.

    Pseudo Random Usages
    --------------------

    There has been queries whether these mix functions can be used for
    pseudo random purposes. Although the out does look random to the naked
    eye, the official recommendation is to use a real pseudo random number
    generator instead, such as the [Mercenne Twister](/web/20071223173210/http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt.html).

    The hash functions listed in this article were only tested for hashing
    purpose. No tests of randomness were performed.

    Test Program
    ------------

    This is a [test program](/web/20071223173210/http://www.concentric.net/~Ttwang/tech/testchange.java)
    testing the choices of the shift amounts with regard to the resulting
    avalanche property. The program detects if a certain bit position has
    both changes and no changes, based on a single bit flip. Promising
    candidates are further tested to verify the percentage chance of bit
    flip is sufficiently close to 50% for all input and output bit pairs.

    The test program prints out the name of the algorithm under test,
    followed by the list of shift amounts that pass the avalanche test.

    Power of 2 Hash Table Size
    --------------------------

    Programmer uses hash table size that is power of 2 because address
    calculation can be performed very quickly. The integer hash function can
    be used to post condition the output of a marginal quality hash function
    before the final address calculation is done.

    addr = inthash(marginal_hash_value) & (tablesize - 1);

    Using the inlined version of the integer hash function is faster than
    doing a remaindering operation with a prime number! An integer remainder
    operation may take up to 18 cycles or longer to complete, depending on
    machine architecture.

    Conclusions
    -----------

    In this article, we have examined a number of hash function construction
    algorithms. Knuth's multiplicative method is the simplest, but has some
    known defects. Robert Jenkins' 96 bit mix function can be used as an
    integer hash function, but is more suitable for hashing long keys. A
    dedicated hash function is well suited for hashing an integer number.

    We have also presented an application of the integer hash function to
    improve the quality of a hash value.

    * * * * *

    [![HOME](/web/20071223173210im_/http://www.concentric.net/~Ttwang/homeicon.gif)](/web/20071223173210/http://www.concentric.net/~Ttwang/index.html)
    \
    \

    Give feedbacks to [[email protected]](mailto:[email protected])