Levenshtein Distance
From Dedupe
Contents |
Definition
The Levenshtein distance is defined as the minimal number of characters you have to replace, insert or delete to transform str1 into str2. Different costs/weights can be assigned to a replace, insert or delete operation.
Levenshtein distance is also known as edit distance.
Damerau-Levenshtein Distance
Damerau-Levenshtein distance is an extension of Levenshtein distance that counts transposition as a single edit operation. Strictly speaking, the Damerau-Levenshtein distance is equal to the minimal number of insertions, deletions, substitutions and transpositions needed to transform one string into the other. Damerau in his seminal paper not only distinguished these four edit operations but also stated that they correspond to more than 80% of all human misspellings. It is worth noting that Damerau concentrated on single-character misspellings.
Code
The levenshtein function can be used to compare alpha, numeric or alphanumeric strings!
PHP
From version 3.0 upwards php implemented the levenshtein() function in the form;
// $str1 and $str2 are the strings being compared $str1 = "Forecast"; $str2 = "orcast"; // the levenshtein function will return a numeric // value representing the number of edit operations // required to transpose str1 into str2 $ld = levenshtein($str1, $str2);
Other
The result
The numeric result from the Levenshtein distance function respresents the total cost of edits required to transpose str1 into str2. Consider the following examples:
str1 = Computational str2 = Computxrbonel levenshtein distance (str1, str2) = 4 str1 = hello str2 = tests levenshtein distance (str1, str2) = 4
Our brain tells us the first example is much more likely to be a match because str1 and str2 have more characters in common. If we divide the levenshtein distance by the length of the longest string we obtain a ratio of different : same characters. The problem is, we want 1 to represent a match and 0 to represent no match (currently 0 represents a match and 1 represents no match). To tackle this we invert the value- mathematically speaking:
Let n equal the length of str1 or str2, whichever is longest Let d equal the levenshtein distance (str1, str2) d will be a number between 0... n d/n will be between 0... 1 (0 being an exact match, 1 being no match) 1 - (d/n) will be between 0... 1 (but now 0 is no match and 1 is an exact match)
See Also
- Levenshtein distance at php.net
References and papers
- A technique for computer detection and correction of spelling errors by Fred J Damerau Image:Acm march1964.pdf

