Properties of a smilarity preserving hash function and their realization in sdhash

back to overview


Breitinger, F., & Baier, H. (2012). Properties of a smilarity preserving hash function and their realization in sdhash. Paper presented at the Information Security for South Africa (ISSA).

Publication type

Paper in Conference Proceedings


Finding similarities between byte sequences is a complex task and necessary in many areas of computer science, e.g., to identify malicious files or spam. Instead of comparing files against each other, one may apply a similarity preserving compression function (hash function) first and do the comparison for the hashes. Although we have different approaches, there is no clear definition / specification or needed properties of such algorithms available. This paper presents four basic properties for similarity pre- serving hash functions that are partly related to the properties of cryptographic hash functions. Compression and ease of computation are borrowed from traditional hash functions and define the hash value length and the performance. As every byte is expected to influence the hash value, we introduce coverage. Similarity score describes the need for a comparison function for hash values. We shortly discuss these properties with respect to three existing approaches and finally have a detailed view on the promising approach sdhash. However, we uncovered some bugs and other peculiarities of the implementation of sdhash. Finally we conclude that sdhash has the potential to be a robust similarity preserving digest algorithm, but there are some points that need to be improved.


Organizational Units

  • Institute of Information Systems
  • Hilti Chair for Data and Application Security

Original Source URL