Hash table is a data structure that combines an array with a linked list. It takes the random access ability of an array and combines it with dynamism of linked list.
If implemented well we are basically taking all the advantages of both the data structure to get insertion,deletion and search in constant time.
Hash table is a combination of two things:
- Hash function: it returns a positive integer that is called hash code.
- An Array: it is capable of storing data of the type we want to store
The basic idea of hash table is we take our data and run that data through the hash function, so the data is processed and return a number (called hash code) and with that number we just store the data we want to store in the data structure.
But how can we define our hash function? There is no one particular way to define a hash function. There are lot of good and bad hash function. And as we’re trying to get the operation close to constant time we’ll have to implement good hash function.
A good hash function should have:
- Use only the data being hashed -> we don’t want to incorporate anything else other than the data
- Use all the data being hashed -> we want to use all of the data
- Be deterministic -> every time we pass the same data we should get the same hash code
- Uniformly distribute data -> data should be spread out through out the able
- Generate very different hash code for very similar data
In the picture above we are taking a string ans adding ASCII value of each of the character and returning the sum mod the size of hashTable(i.e HASH_MAX) and that is our hash code and we are saving the string to that position of hash table.
What about Collision?
But if there is two input first- “aaop” and second- “paoa” both of them will return the hash code 7 if we use the above hash function. So, which ever data will save in the hash table that will be overwritten by the later one. We call this problem collision.
Collision occurs when two pieces of data run through the same hash function and returns the same hash code.
But we want to save both the data into the hash table. So, how do we do it? one way can be linear probing.
In linear probing, if we have a collision, we try to place data in the next consecutive element in the array until we find a vacancy.So, if we can not put any of the two strings in hash code 7 then we’ll try to put it in 8 and if that is also not vacant we’ll increase hash code until we find it. That means we are stretching away from constant time and leaning toward order of n with this process, and this problem is called clustering.
So, linear probing is subject to a problem called clustering. That means once there is a miss, two adjacent cells will contain data, making more likely that in the future cluster will grow.
The other problem is in our code we are still have room for specific amount of string(i.e HASH_MAX) that means we are still limited. We can only store as much data as we have location in the array.
So, how can we solve these problems? This is where chaining comes in play. And this is where we bring the linked list back to the picture. What if each element of array holding one piece of data, it held multiple pieces of data?
But that doesn’t make sense, we know each element of array can only one piece of data of a data type. But what if that data type is a linked list?So, what if every element of an array is a pointer to the head of a linked list? Then we can build these linked list and grow them arbitrarily.
So, we start to grow these chains out of the array locations. Now we can store an arbitrary amount of data into the hash table without ever running into collision.
And we know that when we insert something in the front of the linked list, that is a O(1) operation.
But we know search or find function in linked list takes O(n) time. But instead of having one linked list, we now have HASH_MAX(i.e size of array) number of linked list where HASH_MAX can be 10 or 1000 whatever the size of array is, then time complexity is O(n/10) or O(n/1000).
We know theoretically we disregard constants when we measure complexity, but in real world these things matter. We will notice this happens to run 10 times or 1000 times faster, because we’re distributing one long chain (linked list) across the 1000 smaller chains.So, each time we search through one of those chains we can ignore the other 999 chains we don’t care about.
As we are focusing on only one linked list which on average going to be thousand times shorter. So, we still are tending toward the average case of being constant time.
How Chaining works
We take the hash code and make it point to the node which contains the string. And in collision i.e if we try to put “paoa” in the hash table:
We make a new node with that data and put it in end of the list that was pointed by 7.
Implementation of Hash Table with Chaining
First thing we’ll do is — create a class called HashTable.
In this class at first we’ll declare a private variable called HASH_MAX that will be the size of hash table. Next we’ll create a class called linkedList. And after this will declare the hashTable array that is linkedList type.
Now will declare the constructor and functions we need. First function is the hash function that I’ve talked about earlier. Next we’ll declare the insertItem() function through which we’ll insert our hash table values. And the last two functions are numberOfItemsInIndex() that returns size of every index and printTable() will print the values of the hash table.
Now let’s see in details how these constructors and functions work.
In this constructor we’re just putting default values to every index of the hashTable. First we are creating a new linkedList and assigning it in hashTable’s every index, then we are giving a default data to the linked list of that index and giving default value to the next pointer.
This is the same hashFunction we used earlier. we’re just adding up all the ASCII values of the string and returning sum mod HASH_MAX.
For insertion, first we’ll pass the string through the hashFucntion and initialize the returned value in the index variable. Then we’ll check if any value is saved in that index or not. If the data element of the first element is equal to “empty”(that is out default value) then there is no value saved in this index, so we’ll just initialize the data of the first element of that list as the string we provided. But if there is one or more values saved in that linked list, then we’ll create a pointer called tmp and initialize it with hashTable[index] linkedList and then create a new node for this linked list called x and will initialize data of this node with the provided string and the next pointer of this node with NULL. After that we’ll run a loop till end of the list and when we’ll go to the last element, we’ll just initialize it’s next pointer with the new node (i.e. x) we created.
Thais fuction takes a index number as parameter and returns how many strings are saved in that linked list. First we check if the list is empty of not, if it is empty we’ll return 0 else we’ll increase the count then declare a linkedList type pointer called tmp and initialize it with linkedList of the provided index. Then we’ll just run loop till end of the list and increase the count variable and then return the value.