Hashing with Hashmap & Consistent Hashing

Devashish
8 min read · Sep 27, 2021

If you are a computer programmer or work in a related field, you have probably heard of hashing. So what is hashing? Let’s find out.


Hashing

Hashing is a mechanism that converts an arbitrary string into a numeric value using a hash function. A hash function returns a hash value in constant time, which makes hashing an ideal choice for storing and retrieving values in constant time.
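
For instance, a simple hash function can fold the characters of a string into an integer; this is the same multiply-and-add scheme that Java’s String.hashCode() uses:

```java
// A toy hash function: folds the characters of a string into an int,
// using the same multiply-and-add scheme as Java's String.hashCode().
static int hash(String key) {
    int h = 0;
    for (int i = 0; i < key.length(); i++) {
        h = 31 * h + key.charAt(i);   // combine each character into the running hash
    }
    return h;
}
```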

So how does it work in a hashmap?

The user wants to store a value against a meaningful key, so a hash is generated for the key. But the hash value can fall anywhere in a very wide range, so using it directly as the index/bucket identifier is not practical: the default initial capacity of a hashmap is only 16. So we apply a modulo operation.
In a hashmap, the values are stored in a hash table, and the bucket index is the hash value modulo the table size (the map capacity).

Let’s take an example: we have the key “Cat” and we want to store the value “meow”.
We pass the key to the hash function and get back some hash, say 234.
Using 234 directly as an index is not possible, since the default capacity of the map is only 16, so we calculate the modulo: 234 % 16 = 10.
We then store the key-value pair at index 10.
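
In code, the bucket selection from this example looks roughly like this (234 is the assumed hash value from above):

```java
int hash = 234;              // assumed hash returned for the key "Cat"
int capacity = 16;           // default initial capacity of a hashmap
int index = hash % capacity; // 234 % 16 == 10
// the ("Cat", "meow") pair is stored in bucket 10.
// (A real Java HashMap computes the index as (capacity - 1) & hash,
// which is equivalent to the modulo for power-of-two capacities
// and non-negative hashes.)
```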

Collisions

While storing values into buckets, there may be a case where we calculate a hash (e.g. 330) whose modulo gives an index (10) where a value is already stored. How do we store a value where one already exists? In that case, instead of storing the value directly in the bucket, we store a pointer to the head of a linked list; on a collision, the new value is appended to that list.
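
A toy sketch of this separate-chaining idea (this is an illustration, not HashMap’s actual implementation; the class name is made up and put simply appends without overwriting existing keys):

```java
import java.util.LinkedList;

// Toy separate chaining: each bucket holds a linked list of entries,
// so colliding keys simply append to the same list.
class ToyChainedMap {
    static class Entry {
        final String key;
        final String value;
        Entry(String key, String value) { this.key = key; this.value = value; }
    }

    @SuppressWarnings("unchecked")
    private final LinkedList<Entry>[] buckets = new LinkedList[16];

    void put(String key, String value) {
        int index = (key.hashCode() & 0x7fffffff) % buckets.length;
        if (buckets[index] == null) buckets[index] = new LinkedList<>();
        buckets[index].add(new Entry(key, value));   // on a collision the chain grows
    }

    String get(String key) {
        int index = (key.hashCode() & 0x7fffffff) % buckets.length;
        if (buckets[index] == null) return null;
        for (Entry e : buckets[index]) {
            if (e.key.equals(key)) return e.value;   // walk the chain to find the key
        }
        return null;
    }
}
```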

Hash Map

Since Java 8, the TREEIFY_THRESHOLD constant (default value 8) has been used to convert a bucket’s linked list into a balanced binary tree, which improves lookup performance once the number of elements in that bucket exceeds the threshold.

A good hash algorithm distributes elements evenly across the buckets. So if 32 elements are stored across 16 buckets, each bucket holds about 2 elements and retrieving a value takes about 2 lookups. Similarly, if we increase the element count to 10,000, a retrieval requires about 625 lookups, and the growing count keeps degrading the map’s performance. So we have to increase the capacity as the number of elements grows, which is done based on the load factor. Let’s understand the load factor.

Load Factor

The load factor is the ratio of the number of elements to the capacity of the hashmap. It acts as a threshold: once the ratio crosses it, the capacity of the hashmap is doubled to keep the map’s operations O(1). For example, with the default capacity of 16 and load factor 0.75, the map resizes once it holds more than 12 entries.

The default load factor of a Java HashMap is 0.75f, while the default maximum load factor of the C++ STL unordered_map is 1.0.

The initial capacity and the load factor can both be provided when initializing the HashMap, as shown below.
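
A minimal snippet passing both parameters to the Java constructor:

```java
import java.util.HashMap;
import java.util.Map;

public class HashMapInit {
    public static void main(String[] args) {
        // initial capacity 32, load factor 0.75f -> the map resizes only
        // after roughly 24 entries (32 * 0.75) have been inserted
        Map<String, String> map = new HashMap<>(32, 0.75f);
        map.put("Cat", "meow");
        System.out.println(map.get("Cat"));   // meow
    }
}
```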

Rehashing

Once the capacity of the hashmap is increased, it brings the overhead of rearranging all the key-value pairs in the hash table. This is because we use the modulo of the map capacity to derive the bucket index from the hash, so when the capacity changes, most of the previously stored pairs end up in different buckets and would no longer be found at their old locations.

e.g. for the key ‘xyz’, suppose the calculated hash is 413. With capacity 16 the bucket index was 413 % 16 == 13, but if the capacity is increased to 32, then 413 % 32 gives 29.
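
The same arithmetic, spelled out (413 is the assumed hash value from the example above):

```java
int hash = 413;             // assumed hash for the key 'xyz'
int oldIndex = hash % 16;   // 13 -> bucket before the resize
int newIndex = hash % 32;   // 29 -> bucket after the capacity doubles
// during rehashing the entry must be moved from bucket 13 to bucket 29
```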

So once the number of entries pushes the element-to-capacity ratio past the load factor, the map doubles its capacity and rehashes the entries to maintain O(1) lookup complexity.

Use case of Hashing in distributed systems

Hashing is also used in distributed database systems to store and query data on a particular server based on the hash of some key.

Suppose we have 3 database servers and data is distributed across them based on the value of some key in the schema, also called the data key (when we partition databases, we pick one key from the schema on which the partition is selected). When a user wants to save data to a table, the hash of the data key’s value is calculated, and then the modulo is taken with the number of servers, 3 in this case. Based on that modulo, the request is forwarded to the specific server, as in the sketch below.
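
A minimal sketch of this modulo-based routing, where the server names and the data key are hypothetical:

```java
// Hypothetical modulo-based routing to one of 3 database servers.
String[] servers = { "db-0", "db-1", "db-2" };   // assumed server names
String dataKey = "user-42";                      // value of the chosen data key
int serverIndex = (dataKey.hashCode() & 0x7fffffff) % servers.length;
// the read/write for "user-42" is forwarded to servers[serverIndex]
```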

Now consider the scenario where we need to scale the system up or down: we again hit the rehashing problem, and the majority of the data has to be rehashed and shuffled across the databases whenever the number of servers changes. In a hashmap this was feasible because everything happened on the same machine and in memory, but in a distributed architecture it is not. To overcome this problem we use consistent hashing. The rough simulation below shows how much data a modulo-based scheme forces to move.
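
A rough, hypothetical simulation (the key names are made up) of how many keys a modulo-based scheme would force to migrate when a fourth server is added:

```java
// With modulo-based routing, how many of 10,000 keys land on a
// different server when the cluster grows from 3 to 4 nodes?
int moved = 0;
for (int i = 0; i < 10_000; i++) {
    int h = ("key-" + i).hashCode() & 0x7fffffff;
    if (h % 3 != h % 4) moved++;   // this key would have to migrate
}
System.out.println(moved + " of 10000 keys would move");   // roughly 75%
```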

Consistent Hashing

Consistent hashing is a distributed hashing mechanism that does not depend on the number of servers; instead, it maintains a hash table of server positions and assigns each server a position on a hash ring. This allows servers to be added or removed without affecting the overall system.

When we implement distributed databases, instead of taking a modulo we use consistent hashing: we calculate a hash for each server from its name/IP, using the same hash function (or another hash function with the same output range), and place the servers on the hash ring.
Requests can be assigned on the ring using either convention, clockwise or anti-clockwise. When a new request comes in, the hash of its data key is calculated and, following the chosen convention, the closest server on the ring is selected and the request is sent to that server.

Now if any server gets excessive load, or we need to scale the system up, we add a new server that splits the range currently owned by the most loaded server. This way we do not modify or migrate data on all the servers; we only redistribute data from one or two of them onto the new server.

[Figure: hash table of server hash values]

In the examples that follow, we will use the anti-clockwise convention, so the hash table is maintained in descending order of hash values. That means, for a request, we select the server by iterating the hash table and picking the first server whose hash value is smaller than the hash of the request. If no such server exists, the request goes to the first server in the list (the ring wrap-around), as in the sketch below.
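
A minimal sketch of this lookup, assuming the server positions are kept in a sorted map keyed by their hash values (the hash values themselves come from whatever hash function the ring uses):

```java
import java.util.Map;
import java.util.TreeMap;

// Minimal lookup matching the convention above: pick the server with the
// largest hash that is still <= the request hash; if none exists, wrap
// around to the server with the highest hash (ring behaviour).
class HashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();   // server hash -> server

    void addServer(String name, long hash) { ring.put(hash, name); }

    String route(long requestHash) {
        Map.Entry<Long, String> entry = ring.floorEntry(requestHash);
        if (entry == null) entry = ring.lastEntry();   // wrap around the ring
        return entry.getValue();
    }
}
```

Keeping the positions in a TreeMap means a lookup costs O(log n) in the number of servers rather than a scan of the whole table.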

Now let’s understand the same with picture [A.1]. A request (say [hash#1722]) needs to be sent to the closest server in the anti-clockwise direction. We iterate the hash table comparing each server’s hash with the request hash, and “Server A” is selected because the request hash is greater than “Server A”’s hash only. Similarly, for request [hash#510] we iterate through the table but find no match, so we select “Server D”.

From picture [A.1], we can see that “Server A” carries a higher load, so to spread that load we can add a new server.

In picture [A.2] we have added a new Server E, placed after evaluating the servers’ loads and capacities so that the load gets distributed more evenly. We also added a new entry for Server E to the hash table, so all requests with a hash value greater than 1650 will now be handled by the new server. In this case, rehashing is limited to “Server A” only, and the rest of the system is unaffected.

This is how consistent hashing solves the rehashing problem. But there is another issue: if the servers’ hash values are not distributed evenly around the ring, or a server goes down or is removed, some servers can end up with a skewed load.

Even distribution of load

For such cases, we use the trick of placing a number of labels for each actual server on the ring. The number of labels is decided by a weight factor, which depends on the situation and the capacity of the server. If the servers are equally capable, the weight factor (and hence the number of labels) is kept the same for all servers.

So suppose we use the same weight factor for every server, say 3. Then the labels for all servers would be [A0, A1, A2, B0, B1, B2, C0, C1, C2, D0, D1, D2].

Based on these labels, we calculate a hash value for each label and distribute them around the hash ring, and we maintain a hash table where each label is mapped to its actual server. Any request landing on a label is served by the corresponding server. This way the load is spread across the ring, and the load released by a failure is shared roughly equally, as in the sketch below.
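
A hedged sketch of how labels might be placed on the ring, extending the earlier HashRing idea; the “#i” label format and the placeholder hashOf function are assumptions made for illustration:

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch: each physical server is placed on the ring several times, once per
// label; every label maps back to the real server that handles the request.
class WeightedRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();   // label hash -> server

    // placeholder hash; any function with a consistent output range would do
    private long hashOf(String s) { return s.hashCode() & 0x7fffffffL; }

    void addServer(String server, int weight) {
        for (int i = 0; i < weight; i++) {
            ring.put(hashOf(server + "#" + i), server);   // e.g. "Server A#0", "Server A#1"
        }
    }

    String route(String dataKey) {
        Map.Entry<Long, String> e = ring.floorEntry(hashOf(dataKey));
        if (e == null) e = ring.lastEntry();              // wrap around the ring
        return e.getValue();
    }
}
```

Removing a server then just means removing all of its label hashes from the ring, so its keys are redistributed among the neighbouring labels of the remaining servers.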

In picture [B.1], we can see that the load is distributed evenly across the servers after adding the server labels to the hash ring. If any server fails or is taken out, only that server’s data is rehashed, and its load is spread across the remaining servers instead of falling on one particular server.

Let’s see the same in action in picture [B.2] below: we offloaded “Server A” and removed the server and its labels from the hash ring.

Here we can see that the load from “Server A” is not skewed entirely onto “Server D”; instead it is distributed across “Server C” and “Server D”. This way, consistent hashing avoids a skewed load on any particular server and minimizes rehashing when scaling the system up or down.

Thanks for staying with me this far. Please share your thoughts and let me know if you found this article interesting, and if you spot any misinformation or mistakes, I’ll try to correct them as soon as possible.

References:

https://www.youtube.com/watch?v=zaRkONvyGr8
https://www.toptal.com/big-data/consistent-hashing
