Minhash LSH in Python and some SHA1 strangeness
Categories: programming
Tags: data-science lsh
Been exploring ekzhu/datasketch as a mechanism to accelerate fuzzy matching of input strings. Internally the MinHash algorithm defaults to using SHA1 for its hashing function.
I find this confusing since, from my understanding, SHA1 should produce wildly different output for even a single character being off.
The hash function is intended to reduce an arbitrary set of bytes into a 32-bit integer. It does so by utilizing the Python standard library's hashlib sha1 to generate a binary digest. Interestingly, the GIL is held for shorter inputs and released once the data exceeds 2047 bytes (as of CPython 3.10). The rightmost 4 bytes are interpreted as a little-endian 32-bit integer.
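As a sketch of that description (my reading, not necessarily the library's exact code; in particular, which end of the digest gets sliced just follows the wording above):

import hashlib
import struct

def sha1_hash32(data):
    # SHA1 the bytes, then reinterpret the rightmost 4 bytes of the digest
    # as a little-endian unsigned 32-bit integer.
    return struct.unpack('<I', hashlib.sha1(data).digest()[-4:])[0]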
MinHash.update/1 takes two pre-generated values, labeled a and b, from the permutations field. These are generated by MinHash._init_permutations if not specified when creating the object. The permutations field contains a numpy array sized to the number of permutations.
Each element is a tuple of two unsigned 64-bit integers, randomly generated: the first between 1 and 2^61 exclusive, the second between 0 and 2^61 exclusive.
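A minimal sketch of how I picture those pairs being generated; the 2^61 - 1 modulus and the use of numpy's RandomState are my assumptions from the bounds above:

import numpy as np

num_perm = 128
prime = (1 << 61) - 1  # assumption: the 2^61 bound hints at this Mersenne prime
rng = np.random.RandomState(1)

# One (a, b) tuple per permutation, with the bounds described above.
permutations = np.array(
    [(rng.randint(1, prime, dtype=np.uint64),
      rng.randint(0, prime, dtype=np.uint64))
     for _ in range(num_perm)],
    dtype=np.uint64,
)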
numpy.ndarray.T transposes the dimensions of the array, which had me thinking I must be missing something, as a single-dimension array should not be transposable. Maybe the tuples do something I don't expect?
>>> import numpy
>>> numpy.array([(1.,2.),(3.,4.)], dtype=numpy.uint64)
array([[1, 2],
       [3, 4]], dtype=uint64)
>>> numpy.array([(1.,2.),(3.,4.)], dtype=numpy.uint64).T
array([[1, 3],
       [2, 4]], dtype=uint64)
>>> numpy.array([(1.,2.),(3.,4.),(5.,6.)], dtype=numpy.uint64).T
array([[1, 3, 5],
       [2, 4, 6]], dtype=uint64)
Well, that makes more sense: the list of tuples builds a two-dimensional array, so transposing swaps rows and columns. Unpacking with a comma (a, b = ...) then extracts each row as a separate array. The algorithm makes a bit more sense now!
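To double-check that unpacking behavior in the REPL:

>>> a, b = numpy.array([(1.,2.),(3.,4.),(5.,6.)], dtype=numpy.uint64).T
>>> a
array([1, 3, 5], dtype=uint64)
>>> b
array([2, 4, 6], dtype=uint64)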
I am assuming each operation is broadcast to each element in the array when given a scalar as an operand. We then clamp our numbers back down to 32 bits. These serve as the permuted hash values, and the element-wise minimum of the previous and current values is kept. For the initial value, the entire hash array is set to _max_hash, or all bits 0..31 set.
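Here is a sketch of my reading of that update step; the names and the 2^61 - 1 modulus are my assumptions, not datasketch's actual code:

import numpy as np

prime = np.uint64((1 << 61) - 1)     # assumed modulus, per the 2^61 bounds above
max_hash = np.uint64((1 << 32) - 1)  # _max_hash: all of bits 0..31 set

def update(hashvalues, a, b, hv):
    # hv is the 32-bit hash of the incoming bytes; a and b broadcast across
    # every permutation, and the bitwise AND clamps back down to 32 bits.
    permuted = np.bitwise_and((a * np.uint64(hv) + b) % prime, max_hash)
    # Keep the element-wise minimum of the previous and current values.
    return np.minimum(permuted, hashvalues)

# Initial value: every slot starts with all 32 bits set.
hashvalues = np.full(128, max_hash, dtype=np.uint64)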
I still feel like I am missing something, as even after going through the permutations the scheme seems to suggest SHA1 hashes have neighbors?
Taking a look from the query side, I've been using MinHashLSH.query/1 to search for similar matches. On closer inspection there seem to be just a few key parts:
- Traversal of all buckets and ranges. Upon creation, each bucket is assigned a contiguous range small enough to satisfy the false positive and false negative tolerances using elements of the permutations. If not specified, this will be brute forced in _optimal_param to figure out the correct size for the buckets.
- Querying the table to detect if a subset of the hash is there. For each entry within the bucket the algorithm will accumulate matching keys (sketched just after this list).
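To convince myself of those two parts, here is a toy version of the banding scheme; the structure is illustrative only, not datasketch's internals:

def make_index(num_perm, bands):
    # Each bucket table owns a contiguous, equal-sized range of the signature.
    rows = num_perm // bands
    ranges = [(i * rows, (i + 1) * rows) for i in range(bands)]
    tables = [{} for _ in range(bands)]
    return tables, ranges

def insert(tables, ranges, key, signature):
    for table, (start, end) in zip(tables, ranges):
        table.setdefault(tuple(signature[start:end]), set()).add(key)

def query(tables, ranges, signature):
    # Accumulate every key whose stored band matches the query band exactly.
    candidates = set()
    for table, (start, end) in zip(tables, ranges):
        candidates |= table.get(tuple(signature[start:end]), set())
    return sorted(candidates)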
Testing again
Trying again with much simpler fuzzy matching, let us see what we get!
from datasketch import MinHash, MinHashLSH

permutations = 128

def build_hash(input):
    # Hash the entire string as a single update call.
    output = MinHash(num_perm=permutations)
    output.update(input.encode("utf-8"))
    return output

def main():
    index = MinHashLSH(num_perm=permutations, threshold=0.25)
    index.insert('test1', build_hash("food"))
    index.insert('test2', build_hash("foods"))
    index.insert('test3', build_hash("fool"))
    index.insert('test4', build_hash("fool"))
    index.insert('test5', build_hash("fool"))
    index.insert('test6', build_hash("fool"))
    index.insert('test7', build_hash("fool"))

    result = index.query(build_hash("foo"))
    result.sort()
    print(result)

main()
Hmm… in retrospect this shows the hashes as being fairly far apart. These do not easily match. My guess is that updating each MinHash once with the entire string makes it a one-element set, so distinct strings share nothing. I might have to revisit my utilization of this algorithm for fuzzy matching.