Over the course of my career as a Computer Scientist, one of my favorite subjects in the field has been data analysis.  I love finding ways of representing data in a meaningful and useful way to determine a conclusion.  A problem I have constantly come across is trying to find similarities between two similar text strings.  There are many ways to search and come up with possible matches using SQL like, or even charIndex.  Although these are somewhat efficient, I’d like to introduce to you of another way of using statistics to compare similarity.  The searching will come in future posts.  For example, let’s say we have the following 2 strings:

A: “The cat jumped.”

B: “The jumped cat!”

Let’s compute the Jaccard coefficient. From wikipedia: The Jaccard coefficient measures similarity between finite sample sets, and is defined as the size of theintersection divided by the size of the union of the sample sets:

Let’s try and find the similarity between the sets I provided above by treating A and B as sets. But how do we do that? The first thing you need to do is create a set of shingles.  Your set of shingles will have a fixed size k, where k <= size of (A).  I can pick k = 2, but you can pick k to be whatever you want depending on the size of your set.  If your set A represented a web page, then k = 2 would be too small, so you would be safer going with k = 9 or 10.

The cat jumped.

A = { {Th},{he},{e },{ c},{ca},{at}, {t },{ j},{ju},{um},{mp},{pe},{ed},{d.} }

The jumped cat!

B= { {Th},{he},{e },{ j},{ju},{um},{mp},{pe},{ed},{d },{ c},{ca},{at},{t!} }

I am going to coin this term “do the shingle”.  Because I am using k = 2, our shingle sets are 2 characters long.   To build our sets, you’ll notice I used bigrams.  Notice I left the punctuation in there on purpose.  Now that we have our shingles, the hard part is over. We can now calculate the jaccard coefficient. In simple terms, the Jaccard Coefficient is equal to (A Intersect B) / (A Union B). These shingles have an intersection size of 12 and a union size of 14, which results in 12/14 =   .8571.

In conclusion, this tells us that sets A and B, have a similarity of 85%. By using the Jaccard coefficient, the ordering of a string doesn’t matter. This is important when analyzing plagiarism, for example, if you switch a couple of paragraphs around, depending on the size of your shingles, your Jaccard index won’t be affected. This isn’t a magic formula, because you would still need some analysis as to how close the similarity is.

Before I end this post, I pose the question, what if I had to search for duplicate strings within a set of 1 billion rows.  So for example, we start at row 1, and have a string, we compute the jaccard index for each record after and keep any records with a jaccard index greater than .80. Great, now I go to row 2 and have to start at row 1 again to compute the jaccard index, and then after.  Even with the computing power we have today a list of a million rows would take a very long time to complete.  I have an answer for this, but you will have to tune in to find out how.

Determining Similarity – Jaccard Index