Magic of Hashtables

Magic of Hashtables

Hashtables, also known as dictionaries, are fundamental data structures used in programming for efficient storage and retrieval of key-value pairs. In this article, we will delve into the inner workings of hashtables, exploring how they are implemented and how they provide fast access to data. We will also cover important concepts such as hashing, collision resolution, and the time complexity of ash table operations. Throughout the article, we will provide code examples in Python to illustrate the concepts discussed. full code

What is a Hashtable:

Hash tables, also known as dictionaries, are fundamental data structures used in programming for efficient storage and retrieval of key-value pairs. If you want to look up a word fast, hashtables are your best friends, they are mainly used in spell checkers, compilers, code editors, dictionaries and etc. Operations such as insertion, deletion, and looking-up will run in O(1) which is amazing. but in worst-case scenarios, they can happen in O(n), but since that scenario rarely happens, we agree on O(1), for the initial code let's write it like this

class HashTable:
    def __init__(self, size=5):
        self.array = [None] * size        

the HashTable class is going to be initialized with a fixed-size array containing only None values

Are Hashmaps The Same as Hashtables?

No! A hash map is a more abstract concept that refers to a data structure that maps keys to values using hash functions. It does not specify the internal implementation details. Hash maps can be implemented using various underlying data structures, such as hash tables, balanced trees, or other techniques.

How Does a Hashtable Work:

a hash table consists of two parts, a key, and a value, the values are stored in an array, and the keys will point to the index in which their value is stored, and that's why they are so fast, when you know the index of an array, looking it up will take O(1). as an example consider this key-value pair name: MJ (the name is the key and MJ is the value), this name will somehow be resolved to an index, and MJ will be stored in an array in that index we got from the key name.

name:MJ

name -> 2

[None, None, MJ, None, None]        

but how does the key turn into an index? well, that's what a hash function does.

Hash Functions:

generally, hash functions get a key and return an index based on it.

but how should this function work, let's say we have this key-value pair 123456: MJ the first idea is to return the key itself like this

def __hash(self, key):
  return key         

but there is a problem, in this approach, we need an array of length 123456 to store a single item!!!, how should we fix it? let's say we want a hash table of length 10 and want to store this pair 123456: MJ, a better approach would be to calculate 123456 % 10 as the key, which is always less than 10

def __hash(self, key):
  return key % len(self.array)        

but there is another problem, what if we want to have strings like "name" or "title" as keys, we cant apply the same logic here. consider this pair name: MJ

well to solve this we can count the sum of each char's Unicode as the key, and then calculate key % array_size

def __hash(self, key):
      return sum([ord(i) for i in str(key)]) % len(self.array)        

this is perfectly fine now, and can give you a valid index in the proper range.

let's test it, consider these key-value pairs, for a hashtable of length 10


"name": "MJ"
"birth": 1999
123: "other"        

our __hash function will return these for each of them:

__hash("name") -> 7
__hash("birth") -> 7
__hash(123) -> 0        

and based on what our hash function returned, we will make this array

["other", None, None, None, None, None, None, "MJ", None, None, None]        

but where should 1999 go? it was also on index 7, won't that override "MJ"? yes it does, that's what we call a collision

Collisions:

collisions happen when two keys point to the same index of the array, to fix them we have two methods chaining and open addressing


Chaining:

in this method, we don't save values directly in the array, but we do it in some linked lists each connected to an index of the array, this way if for example the hash function returns 7 for two different keys, they will easily be stored in two nodes of the linked list connected to the index 7 of the array

for this we change the __init__ magic method like this:

    def __init__(self, size=5)
        self.array = [LinkedList() for i in range(size)]:        

now we are adding empty LinkedLists instead of None values, for simplicity we skip the implementation of the LinkedList Class, but it will be provided in the source code, you could also go this way:

    def __init__(self, size=5)
        self.array = [[] for i in range(size)]:        

but I'm not in the mood for this right now (I recommend you to do it as a challenge)

Insert:

class HashTable:
    ...
    def __setitem__(self, key, value):
        index = self.__hash(key)
        bucket = self.array[index]
        bucket.add_or_replace(key, value)         

this __setitem__ magic method gets a key and a value, first, it grabs the index from our hash function and stores the linked list connected to it in a variable called bucket then adds the value and the key in the linked list or replace if the key already exists

since the process of getting the bucket is going to be repeated multiple times, I would like to make a private function for it

class HashTable:
    ...
    def __get_bucket(self, key):
        index = self.__hash(key)
        bucket = self.array[index]
        return bucket        

and refactor my put function like this:

class HashTable:
    ....

    def __get_bucket(self, key):
        index = self.__hash(key)
        bucket = self.array[index]
        return bucket

    def __setitem__(self, key, value):
        bucket = self.__get_bucket(key)
        bucket.add_or_replace(key, value)        


Get:

in order to get a value by the key, we go like this:

class HashTable:
    ....

    def __getitem__(self,  key)
        bucket = self.__get_bucket(key)
        return bucket.get_value_by_key(key):        

this is pretty simple, we just get the linked list in the bucket variable and search inside of it with the help of the get_value_by_key method for the key and return the result

Delete:

class HashTable:
    ....

    def delete(self, key):
        bucket = self.__get_bucket(key)
        bucket.delete(key)        

it seems so clear, we are fetching the node by the key and deleting it, and of course, the underlying logic is being handled in the delete method.

Full HashTable Class:

this hash table uses the chaining algorithm

class HashTable:
    def __init__(self, size=5):
        self.array = [LinkedList() for i in range(size)]
    
    def __hash(self, key):
        return sum([ord(i) for i in str(key)]) % len(self.array)

    def __get_bucket(self, key):
        index = self.__hash(key)
        bucket = self.array[index]
        return bucket

    def __setitem__(self, key, value):
        bucket = self.__get_bucket(key)
        bucket.add_or_replace(key, value)

    def __getitem__(self,  key):
        bucket = self.__get_bucket(key)
        return bucket.get_value_by_key(key)

    def delete(self, key):
        bucket = self.__get_bucket(key)
        bucket.delete(key)        

and we can test it like this:

my_dct = HashTable()
my_dct["name"] = "Mohammad Javad"
my_dct["age"] = 23
my_dct["age"] = 24
my_dct["score"] = 100
my_dct["family name"] = "Ramezanpour"
my_dct["weight"] = 80
print(my_dct["score"])
my_dct.delete("score")
print(my_dct["age"])
print(my_dct["score"])        

this will result in:

100
24
None        

which prints the score 100 and the age 24 followed by a None, which is normal because we deleted the score (you can raise an exception if the key doesn't exist in __getitem__)

please consider that the LinkedList class is not implemented and it is beyond the scope of this article but I provided the full code in GitHub at the beginning of this article.

When Does O(n) Happen:

this happens when all of the keys we add end up in the same index, that way we will only have a long linked list and all operations like finding, insertion deletion and etc will take O(n) but as I told you because this rarely happens, we don't need to worry about it.

Other Ways to Handle Collisions:

Besides chaining we got another way to handle collisions which is called open addressing, which we are going to discuss now

Linear Probing:

this algorithm is one of the open addressing algorithms that starts from the index that the __hash method provides and iterates over the array's cells and it will store the value at the first empty cell it faces.

class HashTable:
    def __init__(self, size=10):
        self.size = size
        self.array = [None] * self.size
    
    ...
        
    def __linear_probe(self, key):
        index = self.__hash(key)

        for i in range(self.size):
            current_index = (index + i) % self.size

            if self.array[current_index] is None:
                return current_index

            if self.array[current_index][0] == key:
                return current_index
 
       raise Exception("Hash table is full")          

this __linear_probe as I explained first grabs the index and then iterates over the array by the step of:

(index + i) % self.size        

from the index to find an empty cell and returns the index, that % self.size is to guarantee that the step is always a valid index and doesn't pass the size of the array

 class HashTable:
    ...  
 
    def __setitem__(self, key, value)
        index = self.__linear_probe(key)
        self.array[index] = (key, value)
    
    def __getitem__(self, key):
        index = self.__linear_probe(key)
        if self.array[index] is not None:
            return self.array[index][1]
        else:
            raise KeyError("Key not found")

    def delete(self, key):
        index = self.__linear_probe(key)
        if self.array[index] is not None:
            self.array[index] = None
        else:
            raise KeyError("Key not found"):        

these three methods are straightforward, they just get the proper index and do the thing that needs to be done

here is the full HashTable class with linear probing

class HashTable:
    def __init__(self, size=10):
        self.size = size
        self.array = [None] * self.size

    def __hash(self, key):
        return sum([ord(i) for i in str(key)]) % self.size
    
    def __linear_probe(self, key):
        index = self.__hash(key)

        for i in range(self.size):
            current_index = (index + i) % self.size

            if self.array[current_index] is None:
                return current_index

            if self.array[current_index][0] == key:
                return current_index

        raise Exception("Hash table is full") 

    def __setitem__(self, key, value):
        index = self.__linear_probe(key)
        self.array[index] = (key, value)

    def __getitem__(self, key):
        index = self.__linear_probe(key)
        if self.array[index] is not None:
            return self.array[index][1]
        else:
            raise KeyError("Key not found")

    def delete(self, key):
        index = self.__linear_probe(key)
        if self.array[index] is not None:
            self.array[index] = None
        else:
            raise KeyError("Key not found")        

and you can test it the way we tested the HashTable class with the chaining algorithm.

but this method has a problem, after doing it several times we end up having some values beside each other called a cluster, as this cluster grows our search will take longer and longer, the next algorithm is coming to help us to solve this problem.

Quadratic Probing:

this is like linear probing but instead of iterating one by one, we iterate over the slots with the step of i to the power of 2, this way we will end up with wider range of full cells so we have more space for other keys. this is the full class:

class HashTable:
    def __init__(self, size=10):
        self.size = size
        self.array = [None] * self.size


    def __hash(self, key):
        return sum([ord(i) for i in str(key)]) % self.size

    def __quadratic_probe(self, key):
        index = self.__hash(key)

        i = 0
        while (
            self.array[(index + i**2) % self.size] is not None
            and self.array[(index + i**2) % self.size][0] != key
        ):
            i += 1

        return (index + i**2) % self.size

    def __setitem__(self, key, value):
        index = self.__quadratic_probe(key)
        self.array[index] = (key, value)

    def __getitem__(self, key):
        index = self.__quadratic_probe(key)
        if self.array[index] is not None:
            return self.array[index][1]
        else:
            raise KeyError("Key not found")

    def delete(self, key):
        index = self.__quadratic_probe(key)
        if self.array[index] is not None:
            self.array[index] = None
        else:
            raise KeyError("Key not found")        

but this one has a problem as well, as you know, in order to find the cell we use the % operator to have a circular iteration over the array. this can end up in an infinite loop!! we should be careful about that

Double Hashing:

In the double hashing algorithm, we add another hashing function named __hash2, this __hash2 uses the largest prime number smaller than the table size to return an integer less than that prime number. then that integer will be used in the __double_hash_probe to calculate the step size over the slots with this expression hash1(key) + (i * hash2(key)) % table_size to find an empty slot or the retrieving key. Using a prime number helps to distribute the keys more evenly across the hash table and reduces the likelihood of clustering or collisions. The choice of the prime number can depend on various factors and may require some experimentation to find an optimal value for your specific use case. (I can't write anymore, LinkedIn article writer seems buggy) I hope it was helpful, feel free to ask any questions. Mohammad Javad

class HashTable:
    def __init__(self, size=10):
        self.size = size
        self.array = [None] * self.size

    def __get_largest_prime(self):
        def is_prime(num):
            if num < 2:
                return False
            for i in range(2, int(num ** 0.5) + 1):
                if num % i == 0:
                    return False
            return True

        for i in range(self.size - 1, 1, -1):
            if is_prime(i):
                return i

        return 2

    def __hash(self, key):
        return sum([ord(i) for i in str(key)]) % self.size

    def __hash2(self, key):
        the_prime = self.__get_largest_prime()
        return the_prime - (self.__hash(key) % the_prime)

    def __double_hash_probe(self, key):
        index = self.__hash(key)
        hash2 = self.__hash2(key)

        i = 0
        while (
            self.array[(index + i * hash2) % self.size] is not None
            and self.array[(index + i * hash2) % self.size][0] != key
        ):
            i += 1

        return (index + i * hash2) % self.size

    def __setitem__(self, key, value):
        index = self.__double_hash_probe(key)
        self.array[index] = (key, value)

    def __getitem__(self, key):
        index = self.__double_hash_probe(key)
        if self.array[index] is not None:
            return self.array[index][1]
        else:
            raise KeyError("Key not found")

    def delete(self, key):
        index = self.__double_hash_probe(key)
        if self.array[index] is not None:
            self.array[index] = None
        else:
            raise KeyError("Key not found")        

要查看或添加评论,请登录

Mohammad Javad Ramezanpour的更多文章

社区洞察

其他会员也浏览了