Introduction To Hashing and the HashTable Class: Part 2

In previous articles I provided a partial introduction to Hashing and Hashtables so here is the next part of the article.
 
Searching For Data In A Hash Table
 
To search for data in a hash table, we need to compute the hash value of the key and then access that element in the array.
 
It is that simple. Here's the function:
  1. static bool InHash(string s, string[] arr) {  
  2. int hval = BetterHash(s, arr);  
  3. if (arr[hval] == s)  
  4.    return true;  
  5. else  
  6.    return false;  
  7. }  
This function returns True if the item is in the hash table and False otherwise. We don't even need to compare the time this function runs versus a sequential search of the array since this function clearly runs in less time, unless of course the data item is somewhere close to the beginning of the array.
 
Handling Collision
 
When working with hash tables, it is inevitable that you will encounter situations where the hash value of a key works out to a value that is already storing another key. This is called a collision and there are several techniques you can use when a collision occurs. These techniques include bucket hashing, open addressing and double hashing (among others).   (We will briefly cover each of these techniques.) 
Bucket Hashing  
 
When we originally defined a hash table, we stated that it is preferred that only one data value resides in a hash table element. This works great if there are no collisions, but if a hash function returns the same value for two data items, we have a problem.
                                                                   
One solution to the collision problem is to implement the hash table using buckets.
"A bucket is a simple data structure stored in a hash table element that can store multiple items."
 
In most implementations, this data structure is an array, but in our implementation we'll use an arraylist that will allow us not to worry about running out of space and to allocate more space. In the end, this will make our implementation more efficient.
 
To insert an item, we first use the hash function to determine which arraylist to store the item into. Then we check to see if the item is already in the arraylist. If it is we do nothing, if it's not, then we call the Add method to insert the item into the arraylist. 
 
To remove an item from a hash table, we again first determine the hash value of the item to be removed and go to that arraylist. We then check to ensure the item is in the arraylist and if it is, we remove it. 
 
Here's the code for a BucketHash class that includes a Hash function, an Add method and a Remove method: 
  1. public class BucketHash    
  2. {    
  3.     private const int SIZE = 101;    
  4.     ArrayList[] data;    
  5.     public BucketHash()    
  6.     {    
  7.         data = new ArrayList[SIZE];    
  8.         for (int i = 0; i <= SIZE - 1; i++)    
  9.         data[i] = new ArrayList(4);    
  10.     }    
  11.   
  12.     public int Hash(string s)     
  13.     {    
  14.         long tot = 0;    
  15.         char[] charray;    
  16.         charray = s.ToCharArray();    
  17.         for(int i = 0; i <= s.Length-1; i++)    
  18.         tot += 37 ∗ tot + (int)charray[i];    
  19.         tot = tot % data.GetUpperBound(0);    
  20.         if (tot < 0)    
  21.         tot += data.GetUpperBound(0);    
  22.         return (int)tot;    
  23.    }    
  24.     public void Insert(string item)     
  25.     {    
  26.               int hash_value;    
  27.                hash_value = Hash(value);    
  28.               if (data[hash_value].Contains(item))    
  29.                 data[hash_value].Add(item);    
  30.     }    
  31.       public void Remove(string item)    
  32.       {    
  33.       int hash_value;    
  34.       hash_value = Hash(item);    
  35.       if (data[hash_value].Contains(item))    
  36.       data[hash_value].Remove(item);    
  37.       }    
  38. }  
When using bucket hashing, the most important thing you can do is to keep the number of arraylist elements used as low as possible. This minimizes the extra work that must be done when adding items to or removing items from the hash table. In the preceding code, we minimize the size of the arraylist by setting the initial capacity of each arraylist to 1 in the constructor call. Once we have a collision, the arraylist capacity becomes 2 and then the capacity continues to double every time the arraylist fills up. With a good hash function, though, the arraylist shouldn't get too large.
 
The ratio of the number of elements in the hash table to the table size is called the load factor. Studies have shown that hash table performance is best when the load factor is 1.0, or when the table size exactly equals the number of elements.
 
Open Addressing 
 
Separate chaining decreases the performance of your hash table by using arraylists. An alternative to separate chaining for avoiding collisions is open addressing. An open addressing function looks for an empty cell in the hash table array to place an item. If the first cell tried is full, the next empty cell is tried and so on until an empty cell is eventually found.
 
There are the following two separate strategies for open addressing:
  1. Linear probing  (Linear probing uses a linear function to determine the array cell to try for an insertion.)
  2. Quadratic probing  (A quadratic function is used to determine which cell to attempt.)
Double Hashing 
 
This simple collision-resolution strategy is exactly what it says it is. If a acollision is found, the hash function is applied a second time and then probe at the distance sequence hash (item), 2hash (item), 4hash(item) and so on until an empty cell is found.
 
To make this probing technique work correctly, a few conditions must be met. First, the hash function chosen must not ever evaluate to zero that would lead to disastrous results (since multiplying by zero produces zero). Second, the table size must be prime. If the size isn't prime, then all the array cells will not be probed, again leading to chaotic results.
 
Double hashing is an interesting collision resolution strategy, but it has been shown in practice that quadratic probing usually leads to better performance.
 
We are now finished examining custom hash table implementations. For most applications using C#, you are better off using the built-in Hashtable class that is part of the .NET Framework library. We begin our discussion of this class in the next section.
 
Until then Happy Coding.


Similar Articles