Existing software

From Dedupe

Jump to: navigation, search

Contents

Microsoft SQL Server 2005

Fuzzy Data Flow
Enlarge
Fuzzy Data Flow
Fuzzy Config
Enlarge
Fuzzy Config
Job Log
Enlarge
Job Log

I understand Microsoft have introduced a fuzzy matching system to their latest release of SQL Server. You can read the full article at microsoft.com.

  • Fuzzy Lookup and Fuzzy Grouping use a custom, domain-independent distance function that takes into account the edit distance (for example, "hits" is distance 2 from "bit"), the number of tokens, token order, and relative frequencies. As a result, Fuzzy Lookup and Fuzzy Grouping achieve much finer discrimination than full-text searches because they capture a more detailed structure of the data.
  • Because they are purely token-driven, Fuzzy Lookup and Fuzzy Grouping do not have a language-dependent component like soundex.
  • Because they use more than just edit distance, Fuzzy Lookup and Fuzzy Grouping are not as easily misled by transpositions and can detect higher level patterns than an approach that uses only edit distance.
  • Fuzzy Lookup and Fuzzy Grouping are tightly integrated with SSIS, which makes them easily usable with little or no custom programming for ETL tasks with SQL Server 2005.

There are a points which I find interesting;

  • The mechanism is purely token-drive - Microsoft aren't using any phonetic algorithms, should we take a leaf from their book?
  • The consideration of relative frequencies - This sounds like a great idea, especially if you're dealing with a specific market (for example IT, you'll probably find most of the company name's contain the words Systems, Computers, Technologies etc but you wouldn't want these to cause false positives.

A further few quotes of possible interest;

  • Be aware, also, that Fuzzy Lookup indexes only a subset of the all the possible q-grams in a given record for efficiency reasons. Fuzzy Lookup may fail to find a match due to this sampling process, although matches will be found with a high degree of probability if the records contain many q-grams. For datasets which have attributes whose values are predominantly a single short token, one alternative, if Fuzzy Lookup is having trouble finding matches which you think it should find, is to set the Exhaustive component property to True. This will cause Fuzzy Lookup to ignore the ETI and instead directly compare the input record to each and every record in the reference table.
  • The token frequencies from the reference table. Highly frequent tokens are generally considered to give little information about the goodness of the match. Relatively rare tokens are considered characteristic of the row in which they occur.

The matching seems to be based on the ngram function.

I've taken a few screenshots of the simplest data flow I could setup in the SQL Server Business Intelligence Development Server (now there's a mouth full). Upon examining the results I was a little disappointed (I may have been under the impression it was doing something a little cleverer then it actually was).

Upon further investigation it seems SQL Server works on a customisable list of seperators (;,.# etc) and determines tokens based on these seperators (Whilst this is obviously pretty good for matching N.Wales with N Wales, I don't think it's much use for matching dentist for dentistry for example (I may be wrong?).

I suspect to harness the full potential of the fuzzy grouping in SQL Server a more complex flow could be setup (to tidy up records etc?) - I may revisit this again sometime in the future...

References & Papers

  • Eliminating Fuzzy Duplicates in Data Warehouses by Rohit Ananthakrishna (of Cornell University), Surajit Chaudhuri and Venkatesh Ganti (of Microsoft) Image:Vldb02 delphi.pdf
  • Robust and Efficient Fuzzy Match for Online Data Cleaning by Rajeev Motwani (of Stanford University), Surajit Chaudhuri, Kris Ganjam and Venkatesh Ganti (of Microsoft) Image:Sig03 FM.pdf

TBC


Uniserv

I think Uniserv is one of the most successful players in the field of address cleansing and merging. As far as I know, their licence terms make you pay per "hit" or something like that.

Silversmith Refiner

This product was formerly commercial, now GPL. No longer maintained by original author. Based on MS Access, awk, agrep, and shell scripts. See archived web site for a detailed description: Silversmith Refiner (on the author's home computer, which is sometimes turned off).

Sources and reference materials generally useful for dedupe projects: refiner.tar.bz2

Unfortunately, in order to protect the formerly proprietary code, convoluted encryption and obsfucation schemes were applied to the scripts. It also has a license activation feature that needs to be removed/circumvented in order to get the full function.

Personal tools
google ads