To start with any string similarity measurement, we need to talk about the basis of metric that we gonna use to quantify the similarities. The most commonly used one is Levenshtein distance(1965), and the distance behind Jaro Winkler needs the understanding of a different one called Jaro Distance.
In a nutshell, the Levenshtein distance is the number of operations(substitute, insert, delete) to transform one string into another. And the Jaro distance cannot be easily communicated without looking at the definition below.
d_jaro = 1/3 * ( m/|s1| + m/|s2| + (m-t)/m )
When we say two characters from each string matches, we mean they are the same letter and the difference between position is no more than floor(max(|s1|, |s2|)/2) – 1. Therefore, m is the number of matching characters and t is half the number of matching characters (but different sequence order like TE vs ET).
Then Jaro Winkler distance built a logic on top of Jaro distance which added some weight if they have the same prefix.
d_jaro_winkler = d_jaro + L * p * (1-d_jaro)
where L is the length of common prefix at the beginning of the string up to 4. p is a scaling factor not exceed 1/4. And the standard value is 0.1. I basically read through the explanation from Wikipedia here and try to repeat it in my own word.
They claimed that the Jaro Winkler distance is designed and best suited for short strings such as person names and it has a fixed scale from 0 to 1(perfect match).