Data Types
From a brief read, the Microsoft documentation suggests that SQL Server 2005 does not take the data type into consideration when performing fuzzy matching or deduplication. See Existing software. More details will be provided once we've reviewed it.
Personal Contact Data
The Person
If dealing with country specific contact data we have a number of advantages...
The Address
If dealing with country-specific contact data we have a number of advantages, for example:
- We probably have a relatively small, finite reference set (that is, a fixed number of states, counties, towns, roads, streets or whatever you want to call them)
- We probably have a pre-defined address layout (see example under UK Data)
- Our data is probably all in the same language (hopefully our language)
The opposite applies to multi-national data (see below).
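As a rough illustration of why a finite reference set helps, the sketch below snaps a misspelt town name to the closest entry in a known list using Python's standard difflib module. The town list, the function name and the 0.8 cutoff are all assumptions made for the example; a real reference set would come from a postal file.

```python
import difflib

# Hypothetical reference set, for illustration only.
REFERENCE_TOWNS = ["Birmingham", "Bristol", "Brighton", "Bradford", "Bolton"]


def closest_town(raw, reference=REFERENCE_TOWNS, cutoff=0.8):
    """Return the best reference match for raw, or None if nothing is close."""
    # Compare in lower case, but return the reference spelling.
    lookup = {name.lower(): name for name in reference}
    matches = difflib.get_close_matches(
        raw.strip().lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None


if __name__ == "__main__":
    print(closest_town("Birmingam"))
```

The cutoff is a trade-off: a lower value catches more typos but risks merging distinct towns with similar names, so it would need tuning against real data.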
UK Data
Information relating to the PAF (Postcode Address File) and to UK postcodes is available on Wikipedia.
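Because a UK postcode has a fixed shape (an outward code, a space, then a three-character inward code), it can be brought into one layout before matching. The sketch below is only a shape check using a simplified pattern. It does not confirm that a postcode exists in the PAF, and the function name is hypothetical.

```python
import re

# Simplified shape: 1-2 letters, a digit, an optional letter or digit,
# a space, then a digit and two letters. This is an approximation only.
_POSTCODE_SHAPE = re.compile(r"^[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][A-Z]{2}$")


def normalise_uk_postcode(raw):
    """Return the postcode as 'OUTWARD INWARD', or None if the shape is wrong."""
    compact = re.sub(r"\s+", "", raw).upper()
    if len(compact) < 5:
        return None
    # The inward code is always the last three characters.
    candidate = compact[:-3] + " " + compact[-3:]
    return candidate if _POSTCODE_SHAPE.match(candidate) else None


if __name__ == "__main__":
    print(normalise_uk_postcode("sw1a1aa"))
```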
US Data
Further information relating to US ZIP Codes is available on Wikipedia.
Multi-National Data
Multi-national data lacks all of the advantages of country-specific data, so common fuzzy techniques may prove less effective. Difficulties may include:
- Dealing with foreign characters
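One common way to reduce the impact of accented characters is to fold text to a plain comparison key before matching. The sketch below uses Python's standard unicodedata module; the function name is hypothetical, and the folded value is meant for comparison only, not for storage or display.

```python
import unicodedata


def fold_for_matching(text):
    """Return a case-folded, accent-stripped key for comparing names."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


if __name__ == "__main__":
    print(fold_for_matching("Zürich"))
```

This only handles characters that decompose into a base letter plus marks. Letters such as ł or ø have no such decomposition and pass through unchanged, and non-Latin scripts would need transliteration, which is not attempted here.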
Wikipedia has a number of articles relating to postal addresses across the world (including ZIP and postcode formats for most countries), including the current list of ISO 3166-1 country codes.
Work has begun on a Visual Basic function to extract the country from a given file; see vb_country.bas on Subversion (you will need internal.mdb in addition to the function).
The Contact Number(s)
Phone and fax numbers come in all shapes and sizes, so a function will often be required to normalise them. I've been using a function within Microsoft Access for some time, which I've published to Subversion. It is very specific to my requirements; ideally we should modify it to allow further customisation without needing a complete rewrite. Nt has translated this into PHP. Again, this needs some work but has potential (also published to Subversion).
The contact numbers may need normalising to allow matching and deduplication or it may just make the data smarter.
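To give an idea of what such a normalising function might look like, here is a minimal sketch. It is not the Access or PHP function mentioned above. The function name, the default country code of 44, the trunk prefix 0 and the international prefix 00 are all assumptions, exposed as parameters where practical so they can be changed without a rewrite.

```python
import re


def normalise_phone(raw, default_country_code="44", trunk_prefix="0"):
    """Return the number as '+<country code><national number>', or None if empty."""
    text = raw.strip()
    # Drop a bracketed trunk digit such as the "(0)" in "+44 (0)20 ...".
    text = re.sub(r"\(\s*0\s*\)", "", text)
    has_plus = text.startswith("+")
    digits = re.sub(r"\D", "", text)
    if not digits:
        return None
    if has_plus:
        # Already international: keep the digits as given.
        return "+" + digits
    if digits.startswith("00"):
        # Assumed international dialling prefix "00".
        return "+" + digits[2:]
    if trunk_prefix and digits.startswith(trunk_prefix):
        # National format: drop the trunk prefix before adding the country code.
        digits = digits[len(trunk_prefix):]
    return "+" + default_country_code + digits


if __name__ == "__main__":
    print(normalise_phone("020 7946 0000"))
```

With every number in one format, two records can be compared with a plain equality test. Extensions (for example "x123") are not handled and would be merged into the digits, so they would need stripping first.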
A list of international dialling codes can be found on Wikipedia.
Other Data
Information about working with other data can go in here...

