Levenshtein Distance

From Dedupe

Jump to: navigation, search

Contents

Definition

The Levenshtein distance is defined as the minimal number of characters you have to replace, insert or delete to transform str1 into str2. Different costs/weights can be assigned to a replace, insert or delete operation.

Levenshtein distance is also known as edit distance.

Damerau-Levenshtein Distance

Damerau-Levenshtein distance is an extension of Levenshtein distance that counts transposition as a single edit operation. Strictly speaking, the Damerau-Levenshtein distance is equal to the minimal number of insertions, deletions, substitutions and transpositions needed to transform one string into the other. Damerau in his seminal paper not only distinguished these four edit operations but also stated that they correspond to more than 80% of all human misspellings. It is worth noting that Damerau concentrated on single-character misspellings.

Code

The levenshtein function can be used to compare alpha, numeric or alphanumeric strings!

PHP

From version 3.0 upwards php implemented the levenshtein() function in the form;

// $str1 and $str2 are the strings being compared
$str1 = "Forecast";
$str2 = "orcast";
// the levenshtein function will return a numeric 
// value representing the number of edit operations
// required to transpose str1 into str2
$ld = levenshtein($str1, $str2);

Other

The result

The numeric result from the Levenshtein distance function respresents the total cost of edits required to transpose str1 into str2. Consider the following examples:

str1 = Computational
str2 = Computxrbonel
levenshtein distance (str1, str2) = 4

str1 = hello
str2 = tests
levenshtein distance (str1, str2) = 4

Our brain tells us the first example is much more likely to be a match because str1 and str2 have more characters in common. If we divide the levenshtein distance by the length of the longest string we obtain a ratio of different : same characters. The problem is, we want 1 to represent a match and 0 to represent no match (currently 0 represents a match and 1 represents no match). To tackle this we invert the value- mathematically speaking:

Let n equal the length of str1 or str2, whichever is longest
Let d equal the levenshtein distance (str1, str2)
d will be a number between 0... n
d/n will be between 0... 1 (0 being an exact match, 1 being no match)
1 - (d/n) will be between 0... 1 (but now 0 is no match and 1 is an exact match)

See Also

References and papers

Personal tools
google ads