Hashing is a fundamental operation in database management and plays a key role in the implementation of core database data structures and algorithms. Hashing can also produce collisions, however, which reduce query performance. There are many known schemes for handling collisions, such as chaining, probing, and cuckoo hashing.
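As a rough illustration of the first of those schemes, chaining handles collisions by letting each slot hold a list of every entry that hashes to it. A minimal sketch (the class and method names are illustrative, not from any particular database):

```python
# Minimal sketch of collision handling by chaining: each bucket holds a
# list of (key, value) pairs whose hash values map to the same slot.
class ChainedHashTable:
    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key):
        # Python's built-in hash() stands in for a real hash function.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # key already present: overwrite
                return
        bucket.append((key, value))

    def get(self, key):
        # A longer chain means a longer scan: this is the cost of collisions.
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable()
table.put("alice", 1)
table.put("bob", 2)
```

Probing and cuckoo hashing avoid the per-bucket lists by instead relocating colliding keys to other slots of the table.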

Hashing and hashing-based algorithms and data structures have many applications in computer science, such as machine learning, computer graphics, bioinformatics, and compilers.

A hash function generates codes that stand in for data inputs. Because these codes are shorter than the actual data and usually have a fixed, predetermined length, it is easier to locate and retrieve the original data.

Due to the random nature of standard hash functions, two pieces of data can sometimes be assigned the same hash value. Collisions occur when a user searches for one item and is directed to multiple data items with the same hash value; because finding the right one takes much longer, the result is slower searches and poorer performance.

Researchers at MIT and elsewhere set out to use machine learning to build better hash functions for these many applications. Perfect hash functions are designed to place data in a way that avoids collisions, but they must be constructed specifically for each data set and take longer to compute than traditional hash functions. The researchers found that using learned models instead of traditional hash functions can cut the number of collisions in half. Their experiments also showed that learned models were often more computationally efficient than perfect hash functions.
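The core idea behind a learned hash function, as in the learned-index line of work this research builds on, is to model the cumulative distribution function (CDF) of the keys and map each key to a slot proportional to its estimated rank: if the model is accurate, keys spread nearly evenly across slots. A minimal sketch using an empirical CDF over a sample in place of a trained model (the function names and sampling scheme are illustrative, not the authors' implementation):

```python
import bisect

def train_cdf_model(sample, num_buckets):
    """Return a key -> bucket function based on the sample's empirical CDF."""
    sorted_sample = sorted(sample)

    def learned_hash(key):
        # Approximate CDF(key) by the key's rank within the sample,
        # then scale the estimate to the number of available buckets.
        rank = bisect.bisect_left(sorted_sample, key)
        cdf = rank / len(sorted_sample)
        return min(int(cdf * num_buckets), num_buckets - 1)

    return learned_hash

keys = list(range(0, 1000, 7))  # toy dataset with a regular gap
h = train_cdf_model(keys, num_buckets=100)
slots = [h(k) for k in keys]    # keys land in slots in sorted order
```

Unlike a random hash, this mapping is order-preserving, and how well it avoids collisions depends on how well the model captures the key distribution.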

These findings, which will be presented at the International Conference on Very Large Databases, show how a hash function can be designed to significantly speed up queries across a large database. This method could speed up computer systems that scientists use to store and analyze DNA, amino acid sequences and other biological data.

Ibrahim Sabek, a postdoc in the MIT Data Systems Group at the Computer Science and Artificial Intelligence Laboratory (CSAIL), said: “What we found in this work is that in some situations we can make a better trade-off between the computation of the hash function and the collisions we face. We can slightly increase the computation time for the hash function, but at the same time we can, in certain situations, greatly reduce collisions.”

He explains, “We might have a huge number of data entries, and each one has a different gap between it and the next, so it’s kind of hard to learn that.”

He also said, “At a certain threshold of submodels, you get enough information to build the approximation you need for the hash function. But after that it won’t lead to any more improvement in collision reduction.”

A traditional hash function generates a random number, or code, that corresponds to the slot where the key will be stored. Still, there is a good chance that two keys will end up in the same slot, resulting in collisions.
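A toy demonstration of how readily random slot assignment collides: drawing 100 pseudo-random slots for 100 keys, many keys end up sharing a slot by the birthday effect (the slot and key counts here are illustrative, not from the paper):

```python
import random

# Assign 100 keys to 100 slots uniformly at random, as a stand-in for a
# traditional hash function, and count how many land on an occupied slot.
random.seed(0)  # fixed seed so the run is reproducible
slots = [random.randrange(100) for _ in range(100)]
distinct = len(set(slots))
collided = 100 - distinct  # keys that hit an already-used slot
```

Even with as many slots as keys, a random assignment almost never uses every slot exactly once, which is why collision handling is unavoidable for traditional hash tables.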

Building on this research, the researchers want to use learned models to design hash functions for other data types. They also plan to investigate learned hashing for databases in which data can be inserted or deleted. When data is updated in this way, the model must be updated as well, and modifying the model while maintaining accuracy is a difficult problem.

In the experiments, learned models reduced the share of keys in a dataset that collide from 30% to 15%, and they achieved higher throughput while reducing run time by more than 30%. The researchers also found that, among the parameters they investigated, the number of submodels had the greatest impact on the number of collisions.

The researcher said, “We want to encourage the community to use machine learning within more fundamental data structures and operations. Any kind of core data structure gives us the opportunity to use machine learning to capture data properties and achieve better performance. There is still a lot we can investigate.”

Journal reference:

  1. Ibrahim Sabek, Kapil Vaidya, et al. Can learned models replace hash functions? PVLDB. DOI: 10.14778/3570690.3570702