Jaro Winkler – String Similarity Measurement for short strings

Jaro Winkler is a commonly used measure of similarity between strings. For a better understanding of all the available methods, this post from joyofdata is super helpful and informative, as is cirrius.

Python: jellyfish

R: strcmp – RecordLinkage

Before using any string similarity measurement, we need to talk about the underlying metric that quantifies the similarity. The most commonly used one is the Levenshtein distance (1965), while Jaro Winkler builds on a different one called the Jaro distance.

In a nutshell, the Levenshtein distance is the minimum number of single-character operations (substitute, insert, delete) needed to transform one string into another. The Jaro distance is harder to put into words, so it is easiest to start from its definition, d_jaro, below.
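Before moving on to Jaro, here is a minimal sketch of the Levenshtein distance as described above, using the usual row-by-row dynamic programming approach. The function name `levenshtein` is just my choice for illustration, not something taken from a particular library.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning s1 into s2."""
    # previous[j] = distance between the first i-1 characters of s1
    # and the first j characters of s2 (starting with i-1 = 0).
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning the first i characters of s1 into "" takes i deletions
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # delete c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # substitute, free when the characters agree
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
```

Note that the result is a count of edits, so it grows with the length of the strings rather than staying on a fixed scale.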

d_jaro = 1/3 * ( m/|s1| + m/|s2| + (m-t)/m )

Here |s1| and |s2| are the lengths of the two strings. Two characters, one from each string, match when they are the same letter and their positions differ by no more than floor(max(|s1|, |s2|)/2) - 1. m is the number of matching characters, and t is the number of transpositions: half the number of matching characters that appear in a different order in the two strings (like TE vs ET).
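To make the definition concrete, here is a rough Python sketch of it. It assumes the common greedy way of pairing characters (each character of s1 takes the first unused equal character of s2 inside the window), and the function name `jaro` is my own. Treat it as an illustration of the formula, not as the code behind any of the packages mentioned above.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro score of s1 and s2, between 0 and 1 (1 means identical)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Equal characters only count as a match if their positions
    # differ by at most floor(max(|s1|, |s2|) / 2) - 1.
    window = max(max(len1, len2) // 2 - 1, 0)

    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0

    # Read the matched characters of each string in order; t is half the
    # number of positions where the two sequences disagree.
    seq1 = [c for c, hit in zip(s1, matched1) if hit]
    seq2 = [c for c, hit in zip(s2, matched2) if hit]
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2

    return (m / len1 + m / len2 + (m - t) / m) / 3


print(round(jaro("MARTHA", "MARHTA"), 4))
```

For MARTHA vs MARHTA all six characters match (m = 6) and the swapped TH/HT pair gives t = 1, so the score is 1/3 * (6/6 + 6/6 + 5/6) = 17/18.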

The Jaro Winkler distance then builds on top of the Jaro distance by adding some extra weight when the two strings share the same prefix.

d_jaro_winkler = d_jaro + L * p * (1-d_jaro)

where L is the length of the common prefix at the beginning of the strings, up to a maximum of 4, and p is a scaling factor that should not exceed 1/4, so that the result never goes above 1. The standard value for p is 0.1. I basically read through the explanation from Wikipedia here and tried to repeat it in my own words.
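The prefix adjustment is small enough to sketch on its own. The helper below takes an already computed Jaro score and applies the formula above; the name `winkler_boost` and the `max_prefix` parameter are hypothetical, chosen only for this example.

```python
def winkler_boost(s1: str, s2: str, jaro_score: float,
                  p: float = 0.1, max_prefix: int = 4) -> float:
    """Apply d_jaro + L * p * (1 - d_jaro) to a precomputed Jaro score.

    `winkler_boost` and `max_prefix` are illustrative names, not a library API.
    """
    if not 0 <= p <= 0.25:
        raise ValueError("p should not exceed 1/4, otherwise the score can go above 1")

    # L = length of the common prefix, capped at max_prefix characters.
    L = 0
    for a, b in zip(s1, s2):
        if a != b or L == max_prefix:
            break
        L += 1

    return jaro_score + L * p * (1 - jaro_score)


# MARTHA vs MARHTA: Jaro score 17/18, common prefix "MAR" (L = 3).
print(round(winkler_boost("MARTHA", "MARHTA", 17 / 18), 4))
```

With L = 3 and p = 0.1 the boost closes 30% of the gap between the Jaro score and 1, which is why strings that agree at the start get rewarded.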

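According to the same source, the Jaro Winkler distance is designed for and best suited to short strings such as person names, and it has a fixed scale from 0 (no similarity) to 1 (perfect match). So despite the name "distance", a higher score means the strings are more similar.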

2 thoughts on “Jaro Winkler – String Similarity Measurement for short strings”

I wonder what the difference is between “JARO WINKLER – STRING DISTANCE” and the edit distance algorithm developed by Levenshtein in 1966?

datafireball says:

October 30, 2014 at 3:02 am

Hi Rich, that is a great question. I added some content to my post, and in a nutshell, their scales are different and JW is better suited for shorter strings. I also happened to come across this blog from joyofdata, which is probably worth a look.
