Speed Quality
From Dedupe
| Revision as of 19:10, 15 June 2006 85.176.102.134 (Talk) (Added note about metric trees) ← Previous diff |
Revision as of 18:23, 17 June 2006 Solexx (Talk | contribs) (→Metric Trees) Next diff → |
||
| Line 46: | Line 46: | ||
| ==Metric Trees== | ==Metric Trees== | ||
| - | There are a lot of attempts at speeding up similarity search with specific tree structures. The oldest one is the Burkhard-Keller tree, others are (Multi) Vantage Point tree, Bisector tree and M-Tree. Drawback: You have to use a similarity function that fulfills a few conditions (the most important one being the triangular inequality). Probably the best function for this task is pure Levenshtein (only insertion, deletion and replacement single characters allowed, no transposition, every operation has equal cost). In my tests you can find strings with a length between 10 and 30 in a BK-tree of 300,000 other names of that length with at most 2 errors in about the tenth of a second. The runtime is dependant exponentially on the number of allowed errors, though. | + | There are a lot of attempts at speeding up similarity search with specific tree structures. The oldest one is the Burkhard-Keller tree, others are (Multi) Vantage Point tree, Bisector tree and M-Tree. Drawback: You have to use a similarity function that fulfills a few conditions (the most important one being the triangular inequality). Probably the best function for this task is pure Levenshtein (only insertion, deletion and replacement single characters allowed, no transposition, every operation has equal cost). In my tests you can find strings with a length between 10 and 30 in a BK-tree of 300,000 other names of that length with at most 2 errors in about the tenth of a second. The runtime is dependant exponentially on the number of allowed errors, though. ([[User:solexx]]) |
Revision as of 18:23, 17 June 2006
Exhaustively comparing two records is resource intensive, hence comparing a large number of records can be extremely time consuming.
For each comparison operation we need to check a number of fields for similarity using phonetic algorithms. Consider the following data:
id fname sname road -- ----- ----- ---- 01 Jim Bob Main 02 Dave Jones High 03 David Jonez Hi
Let n equal the number of records and f equal the number of fields. If we chose to compare every record there would be 30 operations (((n^2 - n) / 2) * (3f + 1))
- 3 record comparisons (record 1 with record 2 & 3, and record 2 with record 3)
((n^2 - n) / 2) - for each comparison there would be 10 operations
(3f + 1)- for each record we generate 3 phonetic keys
(2f) - for each field we compare the phonetic keys
(f) - we generate a match score for the pair of records based on the phonetic comparison results
(1)
- for each record we generate 3 phonetic keys
If we had say 10 records and were looking at 10 phonetic fields this would rocket to 1395 operations.
Keys
If we wish to speed up the process we can reduce the number of operations required per comparison. In order to do this we generate keys once for every record and store them for later use. Consider the same data we used previously but this time 9 operations were carried out initialy to produce keys
id fname sname road fkey skey rkey -- ----- ----- ---- ---- ---- ---- 01 Jim Bob Main xxxx xxxx xxxx 02 Dave Jones High xxxx xxxx xxxx 03 David Jonez Hi xxxx xxxx xxxx
Let n equal the number of records and f equal the number of fields. If we chose to compare every record there would be 21 operations ((nf) + (((n^2 - n) / 2) * (f + 1)))
- for each record there would be 3 phonetic keys produced - fkey, skey and rkey
(nf) - 3 record comparisons (record 1 with record 2 & 3, and record 2 with record 3)
((n^2 - n) / 2) - for each comparison there would be 4 operations
(f + 1)- for each field we compare the phonetic keys
(f) - we generate a match score for the pair of records based on the phonetic comparison results
(1)
- for each field we compare the phonetic keys
Now if we had 10 records and were looking at 10 phonetic fields this number would only increase to 280 operations.
--Ltickett 11:18, 20 April 2006 (BST)
- I haven't checked the above calculations, maths, logic etc and it needs tidying up!!
Potential Matches
If we wish to speed up the process we can reduce the number of comparisons. I can see two primary ways in which this could be achieved:
- Skip record comparisons where x field differs - as we know straight away these will not be a match (this could be gender, country, email address, or any field(s) you choose - if either record has a blank in that field the comparison would still probably need to take place)
- Only compare records where x fields are equal - as we know records where these fields differ will not be a match
Metric Trees
There are a lot of attempts at speeding up similarity search with specific tree structures. The oldest one is the Burkhard-Keller tree, others are (Multi) Vantage Point tree, Bisector tree and M-Tree. Drawback: You have to use a similarity function that fulfills a few conditions (the most important one being the triangular inequality). Probably the best function for this task is pure Levenshtein (only insertion, deletion and replacement single characters allowed, no transposition, every operation has equal cost). In my tests you can find strings with a length between 10 and 30 in a BK-tree of 300,000 other names of that length with at most 2 errors in about the tenth of a second. The runtime is dependant exponentially on the number of allowed errors, though. (User:solexx)

