Hash Policies

This subsection describes hash policies. It is organized as follows:

  1. The General Terms Section describes some general terms.
  2. The Range-Hashing Functions Section describes range-hasing functions.
  3. The Ranged-Hash Functions Section describes ranged-hash functions.
  4. The Implementation in pb_assoc Section describes the implementation of these concepts in pb_assoc.

General Terms

There are actually three functions involved in transforming a key into a hash-table's position (see Figure Hash runctions, ranged-hash functions, and range-hashing functions ):

  1. A ranged-hash function, which maps keys into an interval of the non-negative integrals. This is the function actually required by the hash-table algorithm.
  2. A hash function, which maps keys into non-negative integral types. This is typically specified by the writer of the key class.
  3. A range-hashing function, which maps non-negative integral types into an interval of non-negative integral types.
no image Hash runctions, ranged-hash functions, and range-hashing functions.

Let U be a domain (e.g., the integers, or the strings of 3 characters). A hash-table algorithm needs to map elements of U "uniformly" into the range [0,..., m - 1] (where m is a non-negative integral value, and is, in general, time varying). I.e., the algorithm needs a ranged-hash function

f : U × Z+ → Z+ ,

such that for any u in U ,

0 ≤ f(u, m) ≤ m - 1 ,

and which has "good uniformity" properties [knuth98sorting]. One common solution is to use the composition of the hash function

h : U → Z+ ,

which maps elements of U into the non-negative integrals, and

g : Z+ × Z+ → Z+,

which maps a non-negative hash value, and a non-negative range upper-bound into a non-negative integral in the range between 0 (inclusive) and the range upper bound (exclusive), i.e., for any r in Z+,

0 ≤ g(r, m) ≤ m - 1 .

The resulting ranged-hash function, is

f(u , m) = g(h(u), m) (1) .

From the above, it is obvious that given g and h, f can always be composed (however the converse is not true).

The above describes the case where a key is to be mapped into a single position within a hash table, e.g., in a collision-chaining table. In other cases, a key is to be mapped into a sequence of poisitions within a table, e.g., in a probing table.

Similar terms apply in this case: the table requires a ranged probe function, mapping a key into a sequence of positions withing the table. This is typically acheived by composing a hash function mapping the key into a non-negative integral type, a probe function transforming the hash value into a sequence of hash values, and a range-hashing function transforming the sequence of hash values into a sequence of positions.

Range-Hashing Functions

Some common choices for range-hashing functions are the division, multiplication, and middle-square methods [knuth98sorting], defined as

g(r, m) = r mod m (2) ,

g(r, m) = ⌈ u/v ( a r mod v ) ⌉ ,

and

g(r, m) = ⌈ u/v ( r2 mod v ) ⌉ ,

respectively, for some positive integrals u and v (typically powers of 2), and some a. Each of these range-hashing functions works best for some different setting.

The division method (2) is a very common choice. However, even this single method can be implemented in two very different ways. It is possible to implement (2) using the low level % (modulo) operation (for any m), or the low level & (bit-mask) operation (for the case where m is a power of 2), i.e.,

g(r, m) = r % m (3) ,

and

g(r, m) = r & m - 1, ( m = 2k for some k) (4) ,

respectively.

The % (modulo) implementation (3) has the advantage that for m a prime far from a power of 2, g(r, m) is affected by all the bits of r (minimizing the chance of collision). It has the disadvantage of using the costly modulo operation. This method is hard-wired into SGI's implementation [sgi_stl].

The & (bit-mask) implementation (4) has the advantage of relying on the fast bitwise and operation. It has the disadvantage that for g(r, m) is affected only by the low order bits of r. This method is hard-wired into Dinkumware's implementation [dinkumware_stl].

Ranged-Hash Functions

Although rarer, there are cases where it is beneficial to allow the client to directly specify a ranged-hash hash function. It is true, that the writer of the ranged-hash function cannot rely on the values of m having specific numerical properties suitable for hashing (in the sense used in [knuth98sorting]), since the values of m are determined by a resize policy with possibly orthogonal considerations [austern98segmented]. The values of m can be used in some cases, though, to estimate the "general" number of distinct values required.

Let

s = [ s0,..., st - 1]

be a string of t characters, each of which is from domain S. Consider the following ranged-hash function:

f1(s, m) = ∑ i = 0t - 1 si ai mod m (5) ,

where a is some non-negative integral value. This is the standard string-hashing function used in SGI's implementation (with a = 5) [sgi_stl]. Its advantage is that it takes into account all of the characters of the string.

Now assume that s is the string representation of a of a long DNA sequence (and so S = {'A', 'C', 'G', 'T'}). In this case, scanning the entire string might be prohibitively expensive. A possible alternative might be to use only the first k characters of the string, where

k |S| ≥ m ,

i.e., using the hash function

f2(s, m) = ∑ i = 0k - 1 si ai mod m , (6)

requiring scanning over only

k = log4( m )

characters.

Other more elaborate hash-functions might scan k characters starting at a random position (determined at each resize), or scanning k random positions (determined at each resize), i.e., using

f3(s, m) = ∑ i = r0r0 + k - 1 si ai mod m ,

or

f4(s, m) = ∑ i = 0k - 1 sri ari mod m ,

respectively, for r0,..., rk-1 each in the (inclusive) range [0,...,t-1].

Implementation in pb_assoc

Containers based on collision-chaining hash tables in pb_assoc are parameterized by the functors Hash_Fn, and Comb_Hash_Fn.

If such a container is instantiated with any hash functor and range-hashing functor, the container will synthesize a ranged-hash functor automatically. For example, Figure Insert hash sequence diagram shows an insert sequence diagram. The user inserts an element (point A), the container transforms the key into a non-negative integral using the hash functor (points B and C), and transforms the result into a position using the combining functor (points D and E).

no image
Insert hash sequence diagram.

If such a container is instantiated with the null policy hash functor, null_hash_fn, and a combining-hash functor, the container treats the combining hash functor as a ranged-hash function. For example, Figure Insert hash sequence diagram with a null combination policy shows an insert sequence diagram. The user inserts an element (point A), the container transforms the key into a position using the combining functor (points B and C).

no image
Insert hash sequence diagram with a null combination policy.

pb_assoc contains the following hash-related policies:

  1. direct_mask_range_hashing and direct_mod_range_hashing are range-hashing functions based on a bit-mask and a modulo operation, respectively.
  2. linear_probe_fn and quadratic_probe_fn are probe classes based on linear and quadratic increment, respectively.
  3. null_hash_fn and null_probe_fn are null policy classes for creating ranged-hash and ranged-probe functions directly (i.e., not through composition).

pb_assoc does not provide any hash functions (it relies on those of the STL).