Speed Quality

From Dedupe

(Difference between revisions)
Jump to: navigation, search
Revision as of 19:10, 15 June 2006
85.176.102.134 (Talk)
(Added note about metric trees)
← Previous diff
Revision as of 18:23, 17 June 2006
Solexx (Talk | contribs)
(Metric Trees)
Next diff →
Line 46: Line 46:
==Metric Trees== ==Metric Trees==
-There are a lot of attempts at speeding up similarity search with specific tree structures. The oldest one is the Burkhard-Keller tree, others are (Multi) Vantage Point tree, Bisector tree and M-Tree. Drawback: You have to use a similarity function that fulfills a few conditions (the most important one being the triangular inequality). Probably the best function for this task is pure Levenshtein (only insertion, deletion and replacement single characters allowed, no transposition, every operation has equal cost). In my tests you can find strings with a length between 10 and 30 in a BK-tree of 300,000 other names of that length with at most 2 errors in about the tenth of a second. The runtime is dependant exponentially on the number of allowed errors, though.+There are a lot of attempts at speeding up similarity search with specific tree structures. The oldest one is the Burkhard-Keller tree, others are (Multi) Vantage Point tree, Bisector tree and M-Tree. Drawback: You have to use a similarity function that fulfills a few conditions (the most important one being the triangular inequality). Probably the best function for this task is pure Levenshtein (only insertion, deletion and replacement single characters allowed, no transposition, every operation has equal cost). In my tests you can find strings with a length between 10 and 30 in a BK-tree of 300,000 other names of that length with at most 2 errors in about the tenth of a second. The runtime is dependant exponentially on the number of allowed errors, though. ([[User:solexx]])

Revision as of 18:23, 17 June 2006

Exhaustively comparing two records is resource intensive, hence comparing a large number of records can be extremely time consuming.

For each comparison operation we need to check a number of fields for similarity using phonetic algorithms. Consider the following data:

id fname sname road
-- ----- ----- ----
01 Jim   Bob   Main
02 Dave  Jones High
03 David Jonez Hi

Let n equal the number of records and f equal the number of fields. If we chose to compare every record there would be 30 operations (((n^2 - n) / 2) * (3f + 1))

  1. 3 record comparisons (record 1 with record 2 & 3, and record 2 with record 3) ((n^2 - n) / 2)
  2. for each comparison there would be 10 operations (3f + 1)
    1. for each record we generate 3 phonetic keys (2f)
    2. for each field we compare the phonetic keys (f)
    3. we generate a match score for the pair of records based on the phonetic comparison results (1)

If we had say 10 records and were looking at 10 phonetic fields this would rocket to 1395 operations.

Keys

If we wish to speed up the process we can reduce the number of operations required per comparison. In order to do this we generate keys once for every record and store them for later use. Consider the same data we used previously but this time 9 operations were carried out initialy to produce keys

id fname sname road fkey skey rkey
-- ----- ----- ---- ---- ---- ----
01 Jim   Bob   Main xxxx xxxx xxxx
02 Dave  Jones High xxxx xxxx xxxx
03 David Jonez Hi   xxxx xxxx xxxx

Let n equal the number of records and f equal the number of fields. If we chose to compare every record there would be 21 operations ((nf) + (((n^2 - n) / 2) * (f + 1)))

  1. for each record there would be 3 phonetic keys produced - fkey, skey and rkey (nf)
  2. 3 record comparisons (record 1 with record 2 & 3, and record 2 with record 3) ((n^2 - n) / 2)
  3. for each comparison there would be 4 operations (f + 1)
    1. for each field we compare the phonetic keys (f)
    2. we generate a match score for the pair of records based on the phonetic comparison results (1)

Now if we had 10 records and were looking at 10 phonetic fields this number would only increase to 280 operations.

--Ltickett 11:18, 20 April 2006 (BST)

I haven't checked the above calculations, maths, logic etc and it needs tidying up!!

Potential Matches

If we wish to speed up the process we can reduce the number of comparisons. I can see two primary ways in which this could be achieved:

  • Skip record comparisons where x field differs - as we know straight away these will not be a match (this could be gender, country, email address, or any field(s) you choose - if either record has a blank in that field the comparison would still probably need to take place)
  • Only compare records where x fields are equal - as we know records where these fields differ will not be a match

Metric Trees

There are a lot of attempts at speeding up similarity search with specific tree structures. The oldest one is the Burkhard-Keller tree, others are (Multi) Vantage Point tree, Bisector tree and M-Tree. Drawback: You have to use a similarity function that fulfills a few conditions (the most important one being the triangular inequality). Probably the best function for this task is pure Levenshtein (only insertion, deletion and replacement single characters allowed, no transposition, every operation has equal cost). In my tests you can find strings with a length between 10 and 30 in a BK-tree of 300,000 other names of that length with at most 2 errors in about the tenth of a second. The runtime is dependant exponentially on the number of allowed errors, though. (User:solexx)

google ads