Minhash LSH in Python and some SHA1 strangeness
Categories: programming
Tags: data-science lsh
Been exploring ekzhu/datasketch as a mechanism to accelerate fuzzy matching of input strings. Internally the MinHash algorithm defaults to using SHA1 for its hashing function.
I find this confusing since, from my understanding, SHA1 should produce wildly different output for even a single character being off.
The hash function is intended to reduce an arbitrary set of bytes into a 32-bit integer. It does so by utilizing the Python standard library's hashlib sha1 to generate a binary digest. Interestingly, the GIL is held for shorter inputs and released once the data exceeds 2047 bytes (as of CPython 3.10). The rightmost 4 bytes are interpreted as a little-endian 32-bit integer.
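As a sketch of that description (my reading, not necessarily the library's exact code; in particular, which end of the digest gets sliced just follows the wording above):

import hashlib
import struct

def sha1_hash32(data):
    # SHA1 the bytes, then reinterpret the rightmost 4 bytes of the digest
    # as a little-endian unsigned 32-bit integer.
    return struct.unpack('<I', hashlib.sha1(data).digest()[-4:])[0]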
MinHash.update/1 takes two pre-generated values, labeled a and b, from the permutations field. These are generated by MinHash._init_permutations if not specified when creating the object. The permutations field contains a numpy array sized to the number of permutations.
Each element is a tuple of two unsigned 64-bit integers, randomly generated: the first between 1 and 2^61 exclusive, the second between 0 and 2^61 exclusive.
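A minimal sketch of how I picture those pairs being generated; the 2^61 - 1 modulus and the use of numpy's RandomState are my assumptions from the bounds above:

import numpy as np

num_perm = 128
prime = (1 << 61) - 1  # assumption: the 2^61 bound hints at this Mersenne prime
rng = np.random.RandomState(1)

# One (a, b) tuple per permutation, with the bounds described above.
permutations = np.array(
    [(rng.randint(1, prime, dtype=np.uint64),
      rng.randint(0, prime, dtype=np.uint64))
     for _ in range(num_perm)],
    dtype=np.uint64,
)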
numpy.ndarray.T transposes the dimensions of the array, which had me thinking I must be missing something, as a single-dimension array should not be transposable. Maybe the tuples do something I don't expect?
>>> import numpy
>>> numpy.array([(1.,2.),(3.,4.)], dtype=numpy.uint64)
array([[1, 2],
       [3, 4]], dtype=uint64)
>>> numpy.array([(1.,2.),(3.,4.)], dtype=numpy.uint64).T
array([[1, 3],
       [2, 4]], dtype=uint64)
>>> numpy.array([(1.,2.),(3.,4.),(5.,6.)], dtype=numpy.uint64).T
array([[1, 3, 5],
       [2, 4, 6]], dtype=uint64)
Well, that makes more sense: the list of tuples builds a two-dimensional array, so transposing swaps rows and columns. Unpacking with a comma (a, b = ...) then extracts each row as a separate array. The algorithm makes a bit more sense now!
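To double-check that unpacking behavior in the REPL:

>>> a, b = numpy.array([(1.,2.),(3.,4.),(5.,6.)], dtype=numpy.uint64).T
>>> a
array([1, 3, 5], dtype=uint64)
>>> b
array([2, 4, 6], dtype=uint64)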
I am assuming each operation is broadcast to each element in the array when given a scalar as an operand. We then clamp our numbers back down to 32 bits. These serve as the permuted hash values, and the element-wise minimum of the previous and current values is kept. For the initial value, the entire hash array is set to _max_hash, or all bits 0..31 set.
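Here is a sketch of my reading of that update step; the names and the 2^61 - 1 modulus are my assumptions, not datasketch's actual code:

import numpy as np

prime = np.uint64((1 << 61) - 1)     # assumed modulus, per the 2^61 bounds above
max_hash = np.uint64((1 << 32) - 1)  # _max_hash: all of bits 0..31 set

def update(hashvalues, a, b, hv):
    # hv is the 32-bit hash of the incoming bytes; a and b broadcast across
    # every permutation, and the bitwise AND clamps back down to 32 bits.
    permuted = np.bitwise_and((a * np.uint64(hv) + b) % prime, max_hash)
    # Keep the element-wise minimum of the previous and current values.
    return np.minimum(permuted, hashvalues)

# Initial value: every slot starts with all 32 bits set.
hashvalues = np.full(128, max_hash, dtype=np.uint64)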
I still feel like I am missing something, as even after going through the permutations the scheme seems to suggest SHA1 hashes have neighbors?
Taking a look from the query side, I've been using MinHashLSH.query/1 to search for similar matches. On closer inspection there seem to be just a few key parts:
- Traversal of all buckets and ranges. Upon creation, each bucket is assigned a contiguous range small enough to satisfy the false positive and false negative tolerances using elements of the permutations. If not specified, this will be brute forced in _optimal_param to figure out the correct size for the buckets.
- Querying the table to detect if a subset of the hash is there. For each entry within the bucket the algorithm will accumulate matching keys (sketched just after this list).
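To convince myself of those two parts, here is a toy version of the banding scheme; the structure is illustrative only, not datasketch's internals:

def make_index(num_perm, bands):
    # Each bucket table owns a contiguous, equal-sized range of the signature.
    rows = num_perm // bands
    ranges = [(i * rows, (i + 1) * rows) for i in range(bands)]
    tables = [{} for _ in range(bands)]
    return tables, ranges

def insert(tables, ranges, key, signature):
    for table, (start, end) in zip(tables, ranges):
        table.setdefault(tuple(signature[start:end]), set()).add(key)

def query(tables, ranges, signature):
    # Accumulate every key whose stored band matches the query band exactly.
    candidates = set()
    for table, (start, end) in zip(tables, ranges):
        candidates |= table.get(tuple(signature[start:end]), set())
    return sorted(candidates)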
Testing again
Trying again with much simpler fuzzy matching, let us see what we get!
from datasketch import MinHash, MinHashLSH

permutations = 128

def build_hash(input):
    # Hash the entire string as a single update call.
    output = MinHash(num_perm=permutations)
    output.update(input.encode("utf-8"))
    return output

def main():
    index = MinHashLSH(num_perm=permutations, threshold=0.25)
    index.insert('test1', build_hash("food"))
    index.insert('test2', build_hash("foods"))
    index.insert('test3', build_hash("fool"))
    index.insert('test4', build_hash("fool"))
    index.insert('test5', build_hash("fool"))
    index.insert('test6', build_hash("fool"))
    index.insert('test7', build_hash("fool"))

    result = index.query(build_hash("foo"))
    result.sort()
    print(result)

main()
Hmm… in retrospect this shows the hashes as being fairly far apart. These do not easily match. My guess is that updating each MinHash once with the entire string makes it a one-element set, so distinct strings share nothing. I might have to revisit my utilization of this algorithm for fuzzy matching.