Existing software
Microsoft SQL Server 2005
I understand Microsoft have introduced a fuzzy matching system to their latest release of SQL Server. You can read the full article at microsoft.com.
There are a few points which I find interesting:
- The mechanism is purely token-driven. Microsoft aren't using any phonetic algorithms; should we take a leaf from their book?
- The consideration of relative frequencies. This sounds like a great idea, especially if you're dealing with a specific market (in IT, for example, you'll probably find that most company names contain words such as Systems, Computers or Technologies, but you wouldn't want these to cause false positives).
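To make the relative-frequency idea concrete, here is a minimal sketch in Python. It is not Microsoft's implementation: the function names, the inverse-frequency formula and the sample company names are all assumptions made up for illustration. The point is only that a token appearing in many records (such as Systems) should count for less than a rare one when two names are compared.

```python
import math
import re
from collections import Counter

def tokenize(name):
    """Lowercase a name and split it on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^a-z0-9]+", name.lower()) if t]

def idf_weights(names):
    """Weight each token by how rare it is across the whole list of names."""
    doc_freq = Counter()
    for name in names:
        doc_freq.update(set(tokenize(name)))
    total = len(names)
    # Assumed weighting: log(1 + N / df), so common tokens get small weights.
    return {tok: math.log(1 + total / df) for tok, df in doc_freq.items()}

def weighted_similarity(a, b, weights):
    """Weighted Jaccard: shared token weight divided by combined token weight."""
    ta, tb = set(tokenize(a)), set(tokenize(b))
    shared = sum(weights.get(t, 0.0) for t in ta & tb)
    combined = sum(weights.get(t, 0.0) for t in ta | tb)
    return shared / combined if combined else 0.0

# Hypothetical sample data: "Systems" is common, "Acme" is comparatively rare.
companies = [
    "Acme Systems", "Borland Systems", "Cray Systems",
    "Delta Systems", "Acme Computers",
]
weights = idf_weights(companies)

same_rare_word = weighted_similarity("Acme Systems", "Acme Computers", weights)
same_common_word = weighted_similarity("Acme Systems", "Borland Systems", weights)
```

With this sample data, sharing the rarer word Acme scores higher than sharing the common word Systems, which is the behaviour you would want in order to avoid the false positives mentioned above.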
A few further quotes of possible interest:
The matching seems to be based on the ngram function.
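For anyone unfamiliar with n-grams, the following is a generic Python sketch of character n-gram comparison. It is an assumption about what "ngram" means here, not a description of what SQL Server does internally; the function names and the choice of n = 3 are mine.

```python
def ngrams(text, n=3):
    """Return the set of overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_similarity(a, b, n=3):
    """Jaccard overlap of the two n-gram sets (0.0 = disjoint, 1.0 = identical)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0

score = ngram_similarity("dentist", "dentistry")
```

Here dentist yields five trigrams and dentistry yields the same five plus two more, so the score is 5/7. A comparison at the character n-gram level can therefore see that the two words are related even though they are different tokens.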
I've taken a few screenshots of the simplest data flow I could set up in the SQL Server Business Intelligence Development Studio (now there's a mouthful). Upon examining the results I was a little disappointed (I may have been under the impression it was doing something a little cleverer than it actually was).
Upon further investigation it seems SQL Server works on a customisable list of separators (;,.# etc.) and determines tokens based on these separators. Whilst this is obviously pretty good for matching N.Wales with N Wales, I don't think it's much use for matching dentist with dentistry, for example (I may be wrong?).
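A rough Python sketch of separator-based tokenising shows why. The separator list and function name below are assumptions for illustration, not SQL Server's actual defaults or API.

```python
import re

# Assumed separator list, loosely modelled on the characters mentioned above.
SEPARATORS = " ;,.#"

def split_tokens(text, separators=SEPARATORS):
    """Split lowercased text into tokens wherever a separator character occurs."""
    pattern = "[" + re.escape(separators) + "]+"
    return [t for t in re.split(pattern, text.lower()) if t]

wales_dotted = split_tokens("N.Wales")
wales_spaced = split_tokens("N Wales")
shared = set(split_tokens("dentist")) & set(split_tokens("dentistry"))
```

Both spellings of N Wales reduce to the tokens n and wales, so they match exactly. dentist and dentistry each remain a single, different token, so a comparison that only looks for identical tokens finds nothing in common between them.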
I suspect that to harness the full potential of the fuzzy grouping in SQL Server, a more complex flow could be set up (to tidy up records etc.?). I may revisit this again sometime in the future...
References & Papers
- Eliminating Fuzzy Duplicates in Data Warehouses by Rohit Ananthakrishna (of Cornell University), Surajit Chaudhuri and Venkatesh Ganti (of Microsoft) Image:Vldb02 delphi.pdf
- Robust and Efficient Fuzzy Match for Online Data Cleaning by Rajeev Motwani (of Stanford University), Surajit Chaudhuri, Kris Ganjam and Venkatesh Ganti (of Microsoft) Image:Sig03 FM.pdf
TBC
Uniserv
I think Uniserv is one of the most successful players in the field of address cleansing and merging. As far as I know, their licence terms make you pay per "hit" or something like that.
Silversmith Refiner
This product was formerly commercial and is now GPL. It is no longer maintained by the original author. It is based on MS Access, awk, agrep, and shell scripts. See the archived web site for a detailed description: Silversmith Refiner (on the author's home computer, which is sometimes turned off).
Sources and reference materials generally useful for dedupe projects: refiner.tar.bz2
Unfortunately, in order to protect the formerly proprietary code, convoluted encryption and obfuscation schemes were applied to the scripts. It also has a license activation feature that needs to be removed or circumvented in order to get full functionality.

