Why is it best to use a prime number as a mod in a hashing function?

Question Detail: 

If I have a list of key values from 1 to 100 and I want to organize them in an array of 11 buckets, I've been taught to form a mod function

$$ H = k \bmod \ 11$$

Now all the values will be placed one after another in 9 rows. For example, in the first bucket there will be $0, 11, 22 \dots$. In the second, there will be $1, 12, 23 \dots$ etc.

Let's say I decided to be a bad boy and use a non-prime as my hashing function - take 12. Using the Hashing function

$$ H = k \bmod \ 12$$

would result in a hash table with values $0, 12, 24 \dots $ in the first bucket, $1, 13, 25 \dots$ etc. in the second and so on.

Essentially they are the same thing. I didn't reduce collisions and I didn't spread things out any better by using the prime number hash code and I can't see how it is ever beneficial.

Asked By : CodyBugstein
Best Answer from StackOverflow

Question Source : http://cs.stackexchange.com/questions/11029

Answered By : Mario Cervera

Consider the set of keys $K=\{0,1,...,100\}$ and a hash table where the number of buckets is $m=12$. Since $3$ is a factor of $12$, the keys that are multiples of $3$ will be hashed to buckets that are multiples of $3$:

  • Keys $\{0,12,24,36,...\}$ will be hashed to bucket $0$.
  • Keys $\{3,15,27,39,...\}$ will be hashed to bucket $3$.
  • Keys $\{6,18,30,42,...\}$ will be hashed to bucket $6$.
  • Keys $\{9,21,33,45,...\}$ will be hashed to bucket $9$.

If $K$ is uniformly distributed (i.e., every key in $K$ is equally likely to occur), then the choice of $m$ is not so critical. But, what happens if $K$ is not uniformly distributed? Imagine that the keys that are most likely to occur are the multiples of $3$. In this case, all of the buckets that are not multiples of $3$ will be empty with high probability (which is really bad in terms of hash table performance).

This situation is more common that it may seem. Imagine, for instance, that you are keeping track of objects based on where they are stored in memory. If your computer's word size is four bytes, then you will be hashing keys that are multiples of $4$. Needless to say that choosing $m$ to be a multiple of $4$ would be a terrible choice: you would have $3m/4$ buckets completely empty, and all of your keys colliding in the remaining $m/4$ buckets.

In general:

Every key in $K$ that shares a common factor with the number of buckets $m$ will be hashed to a bucket that is a multiple of this factor.

Therefore, to minimize collisions, it is important to reduce the number of common factors between $m$ and the elements of $K$. How can this be achieved? By choosing $m$ to be a number that has very few factors: a prime number.

No comments

Powered by Blogger.