Main Page

PHP Dedupe App (prototype by ltickett@gmail.com)

News

The PHP dedupe web app is really coming along; I've now posted a screenshot of the manual processing web GUI. --Ltickett 11:30, 5 July 2006 (BST)

After some comments and thought, I've decided Visual Basic is probably NOT the way forward! This isn't to stop you or me writing functions/routines, or even a final product, using the language, but it does mean the focus will not be in that direction (initially). See more discussion about the chosen/available Programming languages. --81.6.246.41 22:02, 18 June 2006 (BST)

Up until now I don't think I've done a very good job of keeping on track, primarily due to the size and nature of this project! I think the only way to overcome this is by breaking the project down into smaller parts and setting achievable goals (at the moment one thing seems to lead to another and there is no real progression toward a final result). Once I've finished developing a few functions currently underway:

  • a dedupe PHP/MySQL script using edit distance and a customisable key
  • a country extraction routine

... (I think that's all), I will try to cut the project up and determine which angle I want to tackle first (quite probably the deduplication of contact/company data). Once the main framework is in place, hopefully it will be somewhat easy to progress. Any input greatly appreciated! --Ltickett 01:22, 18 June 2006 (BST)

The 3rd newsletter went out a couple of days ago; see sourceforge.net --Ltickett 01:14, 18 June 2006 (BST)

Some discussion around a potential Web GUI has begun! --195.112.9.100 21:08, 12 June 2006 (BST)

Added a Vb_functions page to describe and discuss the Visual Basic functions which have been drafted... --Ltickett 15:37, 1 June 2006 (BST)

Newsletter #2 has now been sent and is available at sourceforge.net. NT has translated the VB module for tidying up contact numbers into PHP (thanks!); this too is now available on Subversion. --Ltickett 16:51, 31 May 2006 (BST)

I've posted a draft Visual Basic module for tidying up contact numbers on Subversion; there is more info on the Data Types page. I've also started some investigation into the fuzzy matching ability of SQL Server 2005 on the Existing software page. --Ltickett 16:56, 25 May 2006 (BST)

The latest update is in the form of a mailing-list newsletter, which you can view at sourceforge.net --ltickett 00:30, 22 May 2006 (BST)

View older news in the News Archive

Project: Dedupe - Fuzzy Matching & Deduplication

The project will be looking at data (the intention is to begin looking at customer name/address data but this may widen over time) and ways to intelligently detect duplicates using fuzzy matching methods and algorithms.

There appear to be only a select few applications currently available (most of them at very high cost) that remove duplicate data using a fuzzy mechanism. This may be considered a niche?

Information on the subject appears somewhat limited and very disjointed. The intention is to gather as much information on the subject as possible, and to discuss and analyse the potential pros / cons of different techniques / algorithms (providing source code to allow them to be trialled, tweaked etc), with the end result potentially being documentation along with one (or a number of) scripts / applications.

One of the main points of focus is likely to be dirty data, which requires work before it can be compared. This is made more complex by the potential of working with international data, where different formatting rules may apply.
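As a rough illustration of the kind of clean-up dirty data might need before comparison, the sketch below normalises a free-text name or address field. The function name `normalise_field` and the particular rules (lower-casing, stripping accents and punctuation, collapsing whitespace) are assumptions made for this example, not an agreed project standard; international data would probably need locale-specific rules on top.

```python
import re
import unicodedata


def normalise_field(value):
    """Return a simplified form of a text field, for comparison only.

    Hypothetical helper: the rules below are illustrative assumptions.
    Note that letters outside a-z that do not decompose to a-z
    (for example Cyrillic or Greek) are dropped by this simple version.
    """
    # Decompose accented characters and drop the combining marks (e.g. é -> e).
    decomposed = unicodedata.normalize("NFKD", value)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lower-case, then turn every run of non-alphanumeric characters into a space.
    cleaned = re.sub(r"[^a-z0-9]+", " ", no_accents.lower())
    # Collapse whitespace and trim the ends.
    return " ".join(cleaned.split())


print(normalise_field("  ACME  Ltd., 12-14 High St. "))
print(normalise_field("Café  Müller"))
```

The normalised value is only meant for matching; the original field should be kept for display and export.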

This project has now been activated on SourceForge.net

Mail me to get involved - ltickett@gmail.com

Routemap

If the truth be told, the direction of this project is still open to discussion. The way I see it... currently:

  • the wiki is being used to document as much as possible about phonetic algorithms and approximate string matching
  • code is being drafted for these algorithms and routines
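To make the two families of technique mentioned above concrete, here is a minimal Python sketch of one phonetic algorithm (American Soundex) and one approximate string matching measure (Levenshtein edit distance). It is an illustration only and is not the code drafted for this project; the function names are chosen for the example.

```python
# Soundex digit for each consonant; vowels, Y, H and W carry no digit.
SOUNDEX_CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(name):
    """Return the 4-character American Soundex code, or "" if no letters."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets this, so a repeated code counts again
    return (result + "000")[:4]


def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


print(soundex("Robert"), soundex("Rupert"))  # both names share one code
print(levenshtein("kitten", "sitting"))
```

Soundex groups names that sound alike regardless of spelling, while edit distance measures how many keystrokes apart two strings are, so the two tend to catch different kinds of error.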

Going forward there needs to be:

  • further discussion/analysis around which algorithms serve best under which circumstances and for which data
  • manually checked datasets to use as reference when tweaking/comparing automatic matching
  • code optimisations/bug fixes
  • a documented summary of the wiki
  • a form of front-end produced to tie all the code together
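One possible way to use a manually checked dataset as a reference is to score the automatic matcher's output against it. The sketch below computes precision (how many reported duplicates are real) and recall (how many real duplicates were found) over pairs of record IDs. The function names, the pair-of-IDs representation, and the choice to return 0.0 when a set is empty are all assumptions for illustration.

```python
def pair_key(a, b):
    """Order a pair of record IDs so that (1, 2) and (2, 1) compare equal."""
    return tuple(sorted((a, b)))


def evaluate_matches(predicted_pairs, reference_pairs):
    """Return (precision, recall) of predicted duplicate pairs.

    reference_pairs is the manually checked list of true duplicates.
    Returning 0.0 for an empty set is a convention chosen for this sketch.
    """
    predicted = {pair_key(a, b) for a, b in predicted_pairs}
    reference = {pair_key(a, b) for a, b in reference_pairs}
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Made-up record IDs, purely to show the call.
automatic = [(1, 2), (3, 4), (5, 6)]
checked_by_hand = [(2, 1), (3, 4), (7, 8), (9, 10)]
print(evaluate_matches(automatic, checked_by_hand))
```

Tracking both numbers while tweaking thresholds shows whether a change finds more duplicates at the cost of more false alarms, or the reverse.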

Challenges & Phases

The way I see it, the project can be broken into a number of problems/issues which it may OR may not be possible to tackle separately:

  • Which Programming languages should we use?
  • The End-to-end process (how do we join everything together?)
  • Fuzzy matching methods & algorithms (weigh up the pros/cons)
  • Rules & exceptions: Ltd = Limited, UK = United Kingdom... (can we overcome the adverse impact of abbreviation & acronym use?)
  • Speed/quality tradeoff (from my initial research it seems that computing time climbs steeply as the data grows, since comparing every record against every other means the number of comparisons grows roughly with the square of the record count - how can we tackle this?)
  • Data Types (personal contact information, search terms?)
  • Dirty data (where records lack common format how are we to compare?)
  • Matching levels/weights: Person, Company, Address... (hopefully this will be an easy one BUT a little discussion is needed to understand user needs)
  • Beyond name/address data: Product Catalogue, Search Engine... (can we apply similar algorithms/methods to other types of data?)
  • Data I/O: CSV, MDB, SQL... (this isn't really a problem, and there's no need to reinvent the wheel BUT eventually a system will need implementing to enable import/export of as many file formats as is humanly possible)
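The "Rules & exceptions" item above could, for instance, be handled with a simple lookup table applied before comparison. The sketch below is one way to do that in Python; the two rules come from the examples in the list (Ltd = Limited, UK = United Kingdom), while the names `ABBREVIATIONS` and `expand_abbreviations` are hypothetical.

```python
import re

# Lookup table of abbreviation -> full form, keyed in lower case.
# Only the two examples from the list above are included here.
ABBREVIATIONS = {
    "ltd": "limited",
    "uk": "united kingdom",
}


def expand_abbreviations(text, rules=ABBREVIATIONS):
    """Lower-case the text, drop punctuation and expand known abbreviations."""
    # Remove full stops first so that "U.K." becomes the single token "uk".
    without_dots = text.lower().replace(".", "")
    tokens = re.findall(r"[a-z0-9]+", without_dots)
    return " ".join(rules.get(token, token) for token in tokens)


print(expand_abbreviations("Widgets Ltd UK"))
print(expand_abbreviations("Widgets Limited, U.K."))
```

Both calls produce the same string, so the two spellings would compare as equal. Context-free rules like these have limits: "St", for example, can mean Street or Saint, so some abbreviations would need the exceptions handling the list item asks about.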

--Xbootnek 23:25, 2 June 2006 (BST)

I'm curious about how the USPS CASS files might be used to standardize US addresses. USPS makes the Address DB available to software developers at an annual cost of around $2K. These guys in turn sell their package for $5K and up, or charge by the transaction.

I'm also in favor of having some manual control of data transformations: perhaps having an option to add lookup tables for specific column test text, and a dashboard for visual inspection and correction.

In the short term, *.xls or *.csv files would be easiest in terms of I/O. I have plenty of data to test with.

--84.12.176.60 11:33, 3 June 2006 (BST)

I guess this is similar to the UK PAF file? There's a little discussion on the Data Types page. Obviously fee-based services aren't an option for open source software, but I would like to think there are some free (albeit weaker) alternatives?

Each of the above elements needs considerable discussion and analysis as well as coding!
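On the speed/quality tradeoff in particular: comparing every record with every other needs n(n-1)/2 comparisons for n records. One commonly used way to cut this down is "blocking", where records are grouped by a cheap key and only records sharing a key are compared. The Python sketch below shows the idea with a customisable key function; the record layout and the names `candidate_pairs` and `first_two_letters` are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, key):
    """Yield index pairs of records that share the same blocking key."""
    blocks = defaultdict(list)
    for index, record in enumerate(records):
        blocks[key(record)].append(index)
    for members in blocks.values():
        # Only records inside the same block are ever compared.
        yield from combinations(members, 2)


def first_two_letters(record):
    """Hypothetical blocking key: first two letters of the name, lower-cased."""
    return record["name"][:2].lower()


records = [
    {"name": "Tickett"},
    {"name": "Ticket"},
    {"name": "Smith"},
    {"name": "Smyth"},
    {"name": "Jones"},
]

pairs = list(candidate_pairs(records, first_two_letters))
print(pairs)  # 2 candidate pairs instead of the 10 a full comparison needs
```

The cost is quality: two true duplicates whose keys differ (say a typo in the first letter) are never compared, so the choice of key is exactly where speed is traded against accuracy.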

Alternative Applications

Phonetic algorithms and fuzzy matching techniques can be applied (and already are) in many situations and environments. For example:

  • Spell checkers
  • Search engines
  • Voice recognition systems

BrianM suggests the use of fuzzy matching techniques in mp3 id tags.

Can you think of any other situations where fuzzy matching could prove useful?

  • One of the origins of this field is signal processing (that's where Levenshtein distance comes from, afaik)
  • optical character recognition (OCR)
  • computational biology (searching for / alignment of similar proteins/DNA). (User:solexx)

Existing Fuzzy Matching Software

There are applications already available to process data (fuzzy searching, deduping etc) from which we can hopefully learn. Below I've named just a few:

  • Microsoft SQL Server 2005
  • FEBRL: an open source initiative from Australia, written in Python
  • TBC

I intend to test some of these products and detail any learnings on the existing software page. If you've used any please add the details!

Contact

Lee (ltickett)

email: ltickett@gmail.com

msn: ltickett@totalise.co.uk
