Hash function: Difference between revisions
Jump to navigation
Jump to search
| [unchecked revision] | [unchecked revision] |
imported>Kacc26353815 Link suggestions feature: 3 links added. |
→Uniformity: fixed some spacing and removed the descriptor "last" from the first paragraph |
||
| Line 3: | Line 3: | ||
{{Redirect|Hash code|the programming competition|Hash Code (programming competition)}} | {{Redirect|Hash code|the programming competition|Hash Code (programming competition)}} | ||
{{about|a computer programming construct|other meanings of "hash" and "hashing"|Hash (disambiguation)}} | {{about|a computer programming construct|other meanings of "hash" and "hashing"|Hash (disambiguation)}} | ||
{{More citations needed|date=July 2010}} | {{More citations needed|section|date=July 2010}} | ||
[[File:Hash table 4 1 1 0 0 1 0 LL.svg|thumb|240px|right|A hash function that maps names to integers from 0 to 15. There is a collision between keys "John Smith" and "Sandra Dee".]] | [[File:Hash table 4 1 1 0 0 1 0 LL.svg|thumb|240px|right|A hash function that maps names to integers from 0 to 15. There is a [[hash collision|collision]] between keys "John Smith" and "Sandra Dee".]] | ||
A '''hash function''' is any [[Function (mathematics)|function]] that can be used to map [[data | A '''hash function''' is any [[Function (mathematics)|function]] that can be used to map [[Digital data|data]] of arbitrary size to fixed-size values, though there are some hash functions that support variable-length output.<ref>{{cite conference |last1=Aggarwal |first1=Kirti |last2=Verma |first2=Harsh K. |date=March 19, 2015 |title=Hash_RC6 — Variable length Hash algorithm using RC6 |doi=10.1109/ICACEA.2015.7164747 |conference=2015 International Conference on Advances in Computer Engineering and Applications (ICACEA) }}</ref> The values returned by a hash function are called ''hash values'', ''hash codes'', (''hash/message'') ''digests'',<ref> | ||
*{{cite web|url=https://csrc.nist.gov/glossary/term/hash_digest|title=hash digest|publisher=[[NIST]] |website=Computer Security Resource Center - Glossary}} | *{{cite web|url=https://csrc.nist.gov/glossary/term/hash_digest|title=hash digest|publisher=[[NIST]] |website=Computer Security Resource Center - Glossary}} | ||
*{{cite web |url=https://csrc.nist.gov/glossary/term/message_digest |title=message digest |publisher=[[NIST]] |website=Computer Security Resource Center - Glossary}}</ref> or simply ''hashes''. The values are usually used to index a fixed-size table called a ''[[hash table]]''. Use of a hash function to index a hash table is called ''hashing'' or ''scatter-storage addressing''. | *{{cite web |url=https://csrc.nist.gov/glossary/term/message_digest |title=message digest |publisher=[[NIST]] |website=Computer Security Resource Center - Glossary}}</ref> or simply ''hashes''. The values are usually used to index a fixed-size table called a ''[[hash table]]''. Use of a hash function to index a hash table is called ''hashing'' or ''scatter-storage addressing''. | ||
Hash functions and their associated hash tables are used in data storage and retrieval applications to access data in a small and nearly constant time per retrieval. They require an amount of storage space only fractionally greater than the total space required for the data or records themselves. Hashing is a | Hash functions and their associated hash tables are used in data storage and retrieval applications to access data in a small and nearly constant time per retrieval. They require an amount of storage space only fractionally greater than the total space required for the data or records themselves. Hashing is a way to access data quickly and efficiently. Unlike lists or trees, it provides near-constant access time. It also uses much less storage than trying to store all possible keys directly, especially when keys are large or variable in length. | ||
Use of hash functions relies on statistical properties of key and function interaction: worst-case behavior is intolerably bad but rare, and average-case behavior can be nearly optimal (minimal [[hash collision|collision]]).<ref name="knuth-1973">{{cite book |last=Knuth |first=Donald E. |author-link=Donald Knuth |date=1973 |title=The Art of Computer Programming, Vol. 3, Sorting and Searching |publisher=[[Addison-Wesley]] |location=Reading, MA., United States|isbn=978-0-201-03803-3|bibcode=1973acp..book.....K }}</ref>{{rp|page=527}} | Use of hash functions relies on statistical properties of key and function interaction: worst-case behavior is intolerably bad but rare, and average-case behavior can be nearly optimal (minimal [[hash collision|collision]]).<ref name="knuth-1973">{{cite book |last=Knuth |first=Donald E. |author-link=Donald Knuth |date=1973 |title=The Art of Computer Programming, Vol. 3, Sorting and Searching |publisher=[[Addison-Wesley]] |location=Reading, MA., United States|isbn=978-0-201-03803-3|bibcode=1973acp..book.....K }}</ref>{{rp|page=527}} | ||
| Line 17: | Line 17: | ||
== Overview == | == Overview == | ||
{{Unreferenced section|date=March 2026}} | |||
In a hash table, a hash function takes a key as an input, which is associated with a datum or record and used to identify it to the data storage and retrieval application. The keys may be fixed-length, like an integer, or variable-length, like a name. In some cases, the key is the datum itself. The output is a hash code used to index a hash table holding the data or records, or pointers to them. | In a hash table, a hash function takes a key as an input, which is associated with a datum or record and used to identify it to the data storage and retrieval application. The keys may be fixed-length, like an integer, or variable-length, like a name. In some cases, the key is the datum itself. The output is a hash code used to index a hash table holding the data or records, or pointers to them. | ||
| Line 43: | Line 44: | ||
=== Uniformity === | === Uniformity === | ||
A good hash function should map the expected inputs as evenly as possible over its output range. | A good hash function should map the expected inputs as evenly as possible over its output range. That is, every hash value in the output range should be generated with roughly the same [[probability]]. The reason for this requirement is that the cost of hashing-based methods goes up sharply as the number of ''collisions'' — pairs of inputs that are mapped to the same hash value — increases. If some hash values are more likely to occur than others, then a larger fraction of the lookup operations will have to search through a larger set of colliding table entries. | ||
This criterion only requires the value to be ''uniformly distributed'', not ''random'' in any sense. A good randomizing function is (barring computational efficiency concerns) generally a good choice as a hash function, but the converse need not be true. | This criterion only requires the value to be ''uniformly distributed'', not ''random'' in any sense. A good randomizing function is (barring computational efficiency concerns) generally a good choice as a hash function, but the converse need not be true. | ||
| Line 56: | Line 57: | ||
When testing a hash function, the uniformity of the distribution of hash values can be evaluated by the [[chi-squared test]]. This test is a goodness-of-fit measure: it is the actual distribution of items in buckets versus the expected (or uniform) distribution of items. The formula is | When testing a hash function, the uniformity of the distribution of hash values can be evaluated by the [[chi-squared test]]. This test is a goodness-of-fit measure: it is the actual distribution of items in buckets versus the expected (or uniform) distribution of items. The formula is | ||
<math>\frac{\sum_{j=0}^{m-1}(b_j)(b_j+1)/2}{(n/2m)(n+2m-1)},</math> | <math>\frac{\sum_{j=0}^{m-1}(b_j)(b_j+1)/2}{(n/2m)(n+2m-1)},</math>{{Dubious|date=March 2026|reason=This is not a Chi squared test.}}{{Citation needed|date=March 2026|reason=seemingly wrong formula with no source}} | ||
where {{Mvar|n}} is the number of keys, {{Mvar|m}} is the number of buckets, and {{Math|''b''<sub>''j''</sub>}} is the number of items in bucket {{Mvar|j}}. | where {{Mvar|n}} is the number of keys, {{Mvar|m}} is the number of buckets, and {{Math|''b''<sub>''j''</sub>}} is the number of items in bucket {{Mvar|j}}. | ||
A ratio within one confidence interval (such as 0.95 to 1.05) is indicative that the hash function evaluated has an expected uniform distribution. | A ratio within one [[confidence interval]] (such as 0.95 to 1.05) is indicative that the hash function evaluated has an expected uniform distribution. | ||
Hash functions can have some technical properties that make it more likely that they will have a uniform distribution when applied. One is the [[strict avalanche criterion]]: whenever a single input bit is complemented, each of the output bits changes with a 50% probability. The reason for this property is that selected subsets of the keyspace may have low variability. For the output to be uniformly distributed, a low amount of variability, even one bit, should translate into a high amount of variability (i.e. distribution over the tablespace) in the output. Each bit should change with a probability of 50% because, if some bits are reluctant to change, then the keys become clustered around those values. If the bits want to change too readily, then the mapping is approaching a fixed XOR function of a single bit. Standard tests for this property have been described in the literature.<ref>{{cite journal |first1=Julio Cesar Hernandez |last1=Castro |first2=José María |last2=Sierra |first3=Andre |last3=Seznec |first4=Antonio |last4=Izquierdo |first5=Arturo |last5=Ribagorda |display-authors=1 |date=3 February 2005 |title=The strict avalanche criterion randomness test |journal=Mathematics and Computers in Simulation |volume=68 |issue=1 |pages=1–7 |publisher=[[Elsevier]] |doi=10.1016/j.matcom.2004.09.001|s2cid=18086276 }}</ref> The relevance of the criterion to a multiplicative hash function is assessed here.<ref name="fibonacci-hashing">{{cite web |first=Malte |last=Sharupke |title=Fibonacci Hashing: The Optimization that the World Forgot |url=https://probablydance.com/2018/06/16/fibonacci-hashing-the-optimization-that-the-world-forgot-or-a-better-alternative-to-integer-modulo/ |website=Probably Dance |date=16 June 2018}}</ref> | Hash functions can have some technical properties that make it more likely that they will have a uniform distribution when applied. One is the [[strict avalanche criterion]]: whenever a single input bit is complemented, each of the output bits changes with a 50% probability. The reason for this property is that selected subsets of the keyspace may have low variability. For the output to be uniformly distributed, a low amount of variability, even one bit, should translate into a high amount of variability (i.e. distribution over the tablespace) in the output. Each bit should change with a probability of 50% because, if some bits are reluctant to change, then the keys become clustered around those values. If the bits want to change too readily, then the mapping is approaching a fixed XOR function of a single bit. Standard tests for this property have been described in the literature.<ref>{{cite journal |first1=Julio Cesar Hernandez |last1=Castro |first2=José María |last2=Sierra |first3=Andre |last3=Seznec |first4=Antonio |last4=Izquierdo |first5=Arturo |last5=Ribagorda |display-authors=1 |date=3 February 2005 |title=The strict avalanche criterion randomness test |journal=Mathematics and Computers in Simulation |volume=68 |issue=1 |pages=1–7 |publisher=[[Elsevier]] |doi=10.1016/j.matcom.2004.09.001|s2cid=18086276 }}</ref> The relevance of the criterion to a multiplicative hash function is assessed here.<ref name="fibonacci-hashing">{{cite web |first=Malte |last=Sharupke |title=Fibonacci Hashing: The Optimization that the World Forgot |url=https://probablydance.com/2018/06/16/fibonacci-hashing-the-optimization-that-the-world-forgot-or-a-better-alternative-to-integer-modulo/ |website=Probably Dance |date=16 June 2018}}</ref> | ||
| Line 71: | Line 72: | ||
Because collisions should be infrequent, and cause a marginal delay but are otherwise harmless, it is usually preferable to choose a faster hash function over one that needs more computation but saves a few collisions. | Because collisions should be infrequent, and cause a marginal delay but are otherwise harmless, it is usually preferable to choose a faster hash function over one that needs more computation but saves a few collisions. | ||
Division-based implementations can be of particular concern because a division requires multiple cycles on nearly all processor [[microarchitecture]]s. Division ([[Modulo operation|modulo]]) by a constant can be inverted to become a multiplication by the word-size multiplicative-inverse of that constant. This can be done by the programmer, or by the compiler. Division can also be reduced directly into a series of shift-subtracts and shift-adds, though minimizing the number of such operations required is a daunting problem; the number of machine-language instructions resulting may be more than a dozen and swamp the pipeline. If the microarchitecture has [[hardware multiply]] [[functional unit]]s, then the multiply-by-inverse is likely a better approach. | Division-based implementations can be of particular concern because a division requires multiple cycles on nearly all processor [[microarchitecture]]s. Division ([[Modulo operation|modulo]]) by a constant can be inverted to become a multiplication by the word-size multiplicative-inverse of that constant. This can be done by the programmer, or by the [[compiler]]. Division can also be reduced directly into a series of shift-subtracts and shift-adds, though minimizing the number of such operations required is a daunting problem; the number of machine-language instructions resulting may be more than a dozen and swamp the pipeline. If the microarchitecture has [[hardware multiply]] [[functional unit]]s, then the multiply-by-inverse is likely a better approach. | ||
We can allow the table size {{math|''n''}} to not be a power of 2 and still not have to perform any remainder or division operation, as these computations are sometimes costly. For example, let {{math|''n''}} be significantly less than {{math|2<sup>''b''</sup>}}. Consider a [[pseudorandom number generator]] function {{math|''P''(key)}} that is uniform on the interval {{math|[0, 2<sup>''b''</sup> − 1]}}. A hash function uniform on the interval {{math|[0, ''n'' − 1]}} is {{math|''n'' ''P''(key) / 2<sup>''b''</sup>}}. We can replace the division by a (possibly faster) right [[bit shifting|bit shift]]: {{math|''n P''(key) >> ''b''}}. | We can allow the table size {{math|''n''}} to not be a power of 2 and still not have to perform any remainder or division operation, as these computations are sometimes costly. For example, let {{math|''n''}} be significantly less than {{math|2<sup>''b''</sup>}}. Consider a [[pseudorandom number generator]] function {{math|''P''(key)}} that is uniform on the interval {{math|[0, 2<sup>''b''</sup> − 1]}}. A hash function uniform on the interval {{math|[0, ''n'' − 1]}} is {{math|''n'' ''P''(key) / 2<sup>''b''</sup>}}. We can replace the division by a (possibly faster) right [[bit shifting|bit shift]]: {{math|''n P''(key) >> ''b''}}. | ||
| Line 79: | Line 80: | ||
=== Universality === | === Universality === | ||
{{main|Universal hashing}} | {{main|Universal hashing}} | ||
A ''universal hashing'' scheme is a [[randomized algorithm]] that selects a hash function {{math|''h''}} among a family of such functions, in such a way that the probability of a collision of any two distinct keys is {{math|1/''m''}}, where {{math|''m''}} is the number of distinct hash values desired—independently of the two keys. Universal hashing ensures (in a probabilistic sense) that the hash [[function application]] will behave as well as if it were using a random function, for any distribution of the input data. It will, however, have more collisions than perfect hashing and may require more operations than a special-purpose hash function. | A ''universal hashing'' scheme is a [[randomized algorithm]] that selects a hash function {{math|''h''}} among a family of such functions, in such a way that the probability of a collision of any two distinct keys is {{math|1/''m''}}, where {{math|''m''}} is the number of distinct hash values desired—independently of the two keys. Universal hashing ensures (in a probabilistic sense) that the hash [[function application]] will behave as well as if it were using a random function, for any distribution of the input data. It will, however, have more collisions than perfect hashing and may require more operations than a special-purpose hash function. | ||
| Line 162: | Line 164: | ||
Multiplicative hashing is susceptible to a "common mistake" that leads to poor diffusion—higher-value input bits do not affect lower-value output bits.<ref> | Multiplicative hashing is susceptible to a "common mistake" that leads to poor diffusion—higher-value input bits do not affect lower-value output bits.<ref> | ||
{{cite web |url= | {{cite web |url=https://www.cs.cornell.edu/courses/cs3110/2008fa/lectures/lec21.html |title=CS 3110 Lecture 21: Hash functions |at=Section "Multiplicative hashing"}}</ref> A transmutation on the input which shifts the span of retained top bits down and XORs or ADDs them to the key before the multiplication step corrects for this. The resulting function looks like:<ref name="fibonacci-hashing"/> | ||
<syntaxhighlight lang="c"> | <syntaxhighlight lang="c"> | ||
unsigned hash(unsigned K) { | unsigned hash(unsigned K) { | ||
| Line 180: | Line 182: | ||
=== Zobrist hashing === | === Zobrist hashing === | ||
{{main | Tabulation hashing|Zobrist hashing}} | {{main|Tabulation hashing|Zobrist hashing}} | ||
Zobrist hashing was originally introduced as a means of compactly representing chess positions in computer game-playing programs. A unique random number was assigned to represent each type of piece (six each for black and white) on each space of the board. Thus a table of 64×12 such numbers is initialized at the start of the program. The random numbers could be any length, but 64 bits was natural due to the 64 squares on the board. A position was transcribed by cycling through the pieces in a position, indexing the corresponding random numbers (vacant spaces were not included in the calculation) and XORing them together (the starting value could be 0 (the identity value for XOR) or a random seed). The resulting value was reduced by modulo, folding, or some other operation to produce a hash table index. The original Zobrist hash was stored in the table as the representation of the position. | ''Zobrist hashing'', named after [[Albert Lindsey Zobrist|Albert Zobrist]], is a form of [[tabulation hashing]], which is a method for constructing universal families of hash functions by combining table lookup with XOR operations. This algorithm has proven to be very fast and of high quality for hashing purposes (especially hashing of integer-number keys).<ref>{{citation|first=Albert L.|last=Zobrist|author-link= Albert Lindsey Zobrist|url=https://www.cs.wisc.edu/techreports/1970/TR88.pdf|title=A New Hashing Method with Application for Game Playing|series=Tech. Rep. 88|publisher=Computer Sciences Department, University of Wisconsin|location=Madison, Wisconsin|date=April 1970}}.</ref> | ||
Zobrist hashing was originally introduced as a means of compactly representing chess positions in computer game-playing programs. A unique random number was assigned to represent each type of piece (six each for black and white) on each space of the board. Thus a table of 64×12 such numbers is initialized at the start of the program. The random numbers could be any length, but 64 bits was natural due to the 64 squares on the board. A position was transcribed by cycling through the pieces in a position, indexing the corresponding random numbers (vacant spaces were not included in the calculation) and XORing them together (the starting value could be 0 (the identity value for XOR) or a [[random seed]]). The resulting value was reduced by modulo, folding, or some other operation to produce a hash table index. The original Zobrist hash was stored in the table as the representation of the position. | |||
Later, the method was extended to hashing integers by representing each byte in each of 4 possible positions in the word by a unique 32-bit random number. Thus, a table of 2<sup>8</sup>×4 random numbers is constructed. A 32-bit hashed integer is transcribed by successively indexing the table with the value of each byte of the plain text integer and XORing the loaded values together (again, the starting value can be the identity value or a random seed). The natural extension to 64-bit integers is by use of a table of 2<sup>8</sup>×8 64-bit random numbers. | Later, the method was extended to hashing integers by representing each byte in each of 4 possible positions in the word by a unique 32-bit random number. Thus, a table of 2<sup>8</sup>×4 random numbers is constructed. A 32-bit hashed integer is transcribed by successively indexing the table with the value of each byte of the plain text integer and XORing the loaded values together (again, the starting value can be the identity value or a random seed). The natural extension to 64-bit integers is by use of a table of 2<sup>8</sup>×8 64-bit random numbers. | ||
| Line 214: | Line 217: | ||
=== Rolling hash === | === Rolling hash === | ||
{{Main|Rolling hash}} | {{Main|Rolling hash}} | ||
{{See also|Linear congruential generator}} | {{See also|Linear congruential generator}} | ||
In some applications, such as [[string searching algorithm|substring search]], one can compute a hash function {{math|''h''}} for every {{math|''k''}}-character [[substring]] of a given {{math|''n''}}-character string by advancing a window of width {{math|''k''}} characters along the string, where {{math|''k''}} is a fixed integer, and {{Math|''n'' > ''k''}}. The straightforward solution, which is to extract such a substring at every character position in the text and compute {{math|''h''}} separately, requires a number of operations proportional to {{math|''k''·''n''}}. However, with the proper choice of {{math|''h''}}, one can use the technique of rolling hash to compute all those hashes with an effort proportional to {{math|''mk'' + ''n''}} where {{math|''m''}} is the number of occurrences of the substring.<ref>{{Cite book |last=Singh |first=N. B. |url=https://books.google.com/books?id=ALIMEQAAQBAJ&dq=rolling+hash&pg=PT102 |title=A Handbook of Algorithms |publisher=N.B. Singh |language=en}}</ref>{{fix|text=what is the choice of h?}} | In some applications, such as [[string searching algorithm|substring search]], one can compute a hash function {{math|''h''}} for every {{math|''k''}}-character [[substring]] of a given {{math|''n''}}-character string by advancing a window of width {{math|''k''}} characters along the string, where {{math|''k''}} is a fixed integer, and {{Math|''n'' > ''k''}}. The straightforward solution, which is to extract such a substring at every character position in the text and compute {{math|''h''}} separately, requires a number of operations proportional to {{math|''k''·''n''}}. However, with the proper choice of {{math|''h''}}, one can use the technique of rolling hash to compute all those hashes with an effort proportional to {{math|''mk'' + ''n''}} where {{math|''m''}} is the number of occurrences of the substring.<ref>{{Cite book |last=Singh |first=N. B. |url=https://books.google.com/books?id=ALIMEQAAQBAJ&dq=rolling+hash&pg=PT102 |title=A Handbook of Algorithms |publisher=N.B. Singh |language=en}}</ref>{{fix|text=what is the choice of h?}} | ||