In my last article I wrote about min hashing, and how we can calculate a similarity between two records, whether they are strings of names or blurbs of words on a page. Computing the Jaccard similarity between every pair among a million min-hashed records, however, is not feasible. That’s where locality sensitive hashing comes in. Locality sensitive hashing reduces the dimensionality of your search and will dramatically increase the speed of your searches.
Locality Sensitive Hashing is magical in the sense that you simply hash the data, collect the hash values into buckets, and then any similar data pairs should hash to the same bucket. Let’s summarize how to actually execute this algorithm. Suppose you had one million records of addresses, and you wanted to find any duplicate addresses contained in the list. For example:
123 Somewhere Road..
1-23 Somewhere Street …
16 Ivy Court..
16 Ivy Court Circle
203 Home St. #1
203 Home Street Apt. 1
From looking at the addresses above, you can see variations of just one address. The first step is to shingle each record: choose a value k for the length of your shingles, then build a set of shingles from each record. Once you have a set representation of each record, you can build a min hash signature for it and store the signatures in an array. These signatures should be kept in main memory for fast access. Store them in a two-dimensional array, where the first dimension indexes the records and the second dimension holds each record’s min hash values. There is an excellent open-source C# implementation of both LSH and min hash available online.
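The shingling and min hashing steps above can be sketched as follows. This is a minimal illustration, not a tuned implementation: the shingle length k=3, the 100 hash functions, and the use of seeded MD5 as a cheap family of hash functions are all my own illustrative choices.

```python
import hashlib

def shingles(text, k=3):
    """Build the set of k-character shingles for a record."""
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(shingle_set, num_hashes=100):
    """For each of num_hashes seeded hash functions, keep the
    smallest hash value seen over the shingle set."""
    signature = []
    for seed in range(num_hashes):
        smallest = min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set
        )
        signature.append(smallest)
    return signature

records = ["203 Home St. #1", "203 Home Street Apt. 1"]
# The 2-D structure described above: one row per record,
# one column per hash function, kept in main memory.
signatures = [minhash_signature(shingles(r)) for r in records]
```

Because similar records share many shingles, their signatures agree in many positions, which is what the banding step below exploits.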
Each record will then have its own min hash signature, and this is where the magic begins. For each record, you hash the signature values and store the record in a list of buckets. The one caveat is that you hash only bands of the signature (partial min hash values) rather than the whole thing, because this gives similar records a higher probability of hashing to the same bucket. A word of warning: you will get false positives, meaning two records can hash to the same bucket without actually being similar. So once LSH gives you the candidate pairs, you should compute the Jaccard similarity on each pair to get a similarity ratio and evaluate whether they really are the same record. For example, two records could hash to the same bucket while their Jaccard similarity is only 20%; in that case, you know the pair is a false positive. You also have the potential for false negatives, meaning records which are similar but are never compared. Both depend on the number of bands you split your min hash values into: more bands (with fewer rows per band) catch less similar pairs but produce more false positives, while fewer bands do the opposite. When building your algorithm, you can fine-tune it using the number of bands and the number of hash functions. LSH gives you the candidate pairs that you need to test for similarity; it won’t give you exact matches, only approximate ones.
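The banding trick can be sketched like this. The band count, the bucket keying, and the toy signatures are illustrative assumptions; in practice the signatures would come from the min hash step, and you would pick the number of bands to match your target similarity threshold.

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidates(signatures, bands):
    """Split each signature into bands and hash each band to a bucket.
    Any two records sharing a bucket become a candidate pair."""
    rows = len(signatures[0]) // bands
    buckets = defaultdict(set)
    for rec_id, sig in enumerate(signatures):
        for b in range(bands):
            # Key on the band index too, so band 0 of one record
            # never collides with band 1 of another.
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(rec_id)
    candidates = set()
    for ids in buckets.values():
        candidates.update(combinations(sorted(ids), 2))
    return candidates

def jaccard(a, b):
    """Exact Jaccard similarity of two shingle sets, used to
    verify candidates and weed out false positives."""
    return len(a & b) / len(a | b)

# Toy signatures: the first two are near-duplicates, the third is not.
sigs = [
    [1, 2, 3, 4, 5, 6, 7, 8],
    [1, 2, 3, 4, 5, 6, 9, 9],
    [7, 7, 8, 8, 9, 9, 0, 0],
]
pairs = lsh_candidates(sigs, bands=4)  # 4 bands of 2 rows each
```

After collecting `pairs`, you would compute the true Jaccard similarity of each candidate pair’s shingle sets and keep only those above your chosen threshold.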
LSH is one of my favorite algorithms for data mining, and I have used it on various occasions. Currently, I am researching a way to apply LSH to the stock market to find similarities between different time periods and forecast high-probability outcomes.