Programming languages
From Dedupe
Discussion
Visual Basic
--Ltickett 14:05, 25 April 2006 (BST) My preferred language at the time of writing this article. I've already released a few functions over at SourceForge. An obvious limitation is that VB runs only on a handful of operating systems and, to my knowledge, is mainly a client-side language.
Visual Basic 5 is available as a free download from microsoft.com.
So far the following VB functions have been released:
| function | link |
|---|---|
| levenshtein | vb_levenshtein.bas |
| soundex | vb_soundex.bas |
| caverphone | vb_caverphone.bas |
| ngram | vb_ngram.bas |
| replace | vb_replace.bas |
| recur_replace | vb_replace_recur.bas |
| tidy_number | vb_tidynumber.bas |
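As an illustration of what an n-gram comparison does, here is a minimal Python sketch. It is not the released vb_ngram.bas code, whose details are not shown on this page; the function names `ngrams` and `dice_similarity`, the default of bigrams, and the choice of the Dice coefficient are all assumptions made for the example.

```python
def ngrams(text, n=2):
    """Return the overlapping character n-grams of text, lower-cased."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def dice_similarity(a, b, n=2):
    """Dice coefficient over the sets of n-grams of a and b (0.0 to 1.0)."""
    grams_a, grams_b = set(ngrams(a, n)), set(ngrams(b, n))
    if not grams_a and not grams_b:
        # Both strings are shorter than n: fall back to plain equality.
        return 1.0 if a.lower() == b.lower() else 0.0
    shared = len(grams_a & grams_b)
    return 2 * shared / (len(grams_a) + len(grams_b))


print(ngrams("night"))                    # ['ni', 'ig', 'gh', 'ht']
print(dice_similarity("night", "nacht"))  # 0.25 (only 'ht' is shared)
```

A score near 1.0 suggests the two strings share most of their n-grams. Where to set the threshold for a "probable duplicate" depends on the data.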
PHP
--Ltickett 14:05, 25 April 2006 (BST) Probably my preferred server-side language at the time of writing this article. Definite benefits include the built-in phonetic and string-matching functions, such as soundex(), metaphone(), levenshtein() and similar_text().
MySQL
As mentioned elsewhere, it may be useful to store the phonetic information in an extended table or database to save computation time.
MySQL is an open-source database that can be extended with User Defined Functions (UDFs), which are written in C or C++. Prototypes written in PHP or Visual Basic could later be reimplemented in C as UDFs, both for speed and to avoid the overhead of transferring whole datasets from SQL to PHP.
Some UDFs can be found in the MySQL UDF Registry.
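The two ideas above, storing a precomputed phonetic key and calling a user-defined function from SQL, can be sketched in Python as follows. This is only an illustration under several assumptions: SQLite (via the standard `sqlite3` module) stands in for MySQL, the function is registered from Python rather than compiled as a C UDF, and the table `people`, the column `surname_key` and the names `soundex`, `py_soundex` and `candidates` are all made up for the example. The `soundex` function follows the common American Soundex rules and is not guaranteed to match the output of PHP's or MySQL's own implementations.

```python
import sqlite3

# Letter-to-digit table for American Soundex.
_CODES = {}
for digit, letters in (("1", "BFPV"), ("2", "CGJKQSXZ"), ("3", "DT"),
                       ("4", "L"), ("5", "MN"), ("6", "R")):
    for ch in letters:
        _CODES[ch] = digit


def soundex(name):
    """Return a 4-character Soundex key, or '' if name has no letters."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so a repeated code is kept
    return (result + "000")[:4]


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL statements can call it.
conn.create_function("py_soundex", 1, soundex)
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, "
             "surname TEXT, surname_key TEXT)")
conn.execute("CREATE INDEX idx_people_key ON people (surname_key)")
conn.executemany("INSERT INTO people (surname) VALUES (?)",
                 [("Smith",), ("Smyth",), ("Schmidt",), ("Robert",)])
# Compute the key once and store it, instead of on every comparison.
conn.execute("UPDATE people SET surname_key = py_soundex(surname)")


def candidates(conn, surname):
    """Return stored surnames whose precomputed key matches surname's key."""
    rows = conn.execute("SELECT surname FROM people "
                        "WHERE surname_key = ? ORDER BY id",
                        (soundex(surname),))
    return [row[0] for row in rows]


print(candidates(conn, "Smithe"))  # ['Smith', 'Smyth', 'Schmidt']
```

Because the key is stored and indexed, a lookup only has to compute the phonetic code of the incoming name, not of every row in the table.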
C/C#/C++
This is my first entry here, but I've been deduping with custom, self-written software for a while. For large data sets (say half a million names/addresses), a lot of the dynamic-programming (DP) algorithms, such as Damerau-Levenshtein (DL) distance, run in O(n*m) time per comparison, where n and m are the lengths of the two strings. When you're comparing each record with every other record, or even with a subset of them, you want something that runs quickly. C/C++ code is portable and could be wrapped in a simple set of functions exposed as a DLL, a *nix shared object, or even a PHP module (which will run significantly faster than a PHP script). C# could be wrapped as a web service through Mono on *nix or the MS .NET Framework on Windows.
Most importantly though, the underlying algorithms should run as quickly as possible.
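To make the O(n*m) cost concrete, here is a minimal Python prototype of the restricted Damerau-Levenshtein distance (the "optimal string alignment" variant, in which no substring is edited more than once). It is a sketch for illustration, not the software described above; the function name `osa_distance` is an assumption, and a production version would presumably be ported to C/C++ as suggested.

```python
def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Fills a (len(a)+1) x (len(b)+1) table, so time and memory are O(n*m).
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            # Transposition of two adjacent characters.
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("smith", "smiht"))     # 1 (one transposition)
print(osa_distance("kitten", "sitting"))  # 3
```

Comparing every record with every other record in a set of 500,000 means 500,000 × 499,999 / 2, or about 1.25 × 10^11, calls to a function like this, which is why the speed of the inner loop matters so much.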
Connection Strings
Some common ODBC connection strings are listed below (the full list is available at connectionstrings.com):
SQL Server: "Driver={SQL Server};Server=Aron1;Database=pubs;Uid=sa;Pwd=asdasd;"
SQL Server 2005: "Driver={SQL Native Client};Server=Aron1;Database=pubs;UID=sa;PWD=asdasd;"
Access: "Driver={Microsoft Access Driver (*.mdb)};Dbq=C:\mydatabase.mdb;Uid=Admin;Pwd=;"
Oracle: "Driver={Microsoft ODBC for Oracle};Server=OracleServer.world;Uid=Username;Pwd=asdasd;"
MySQL: "DRIVER={MySQL ODBC 3.51 Driver};SERVER=data.domain.com;PORT=3306;DATABASE=myDatabase;USER=myUsername;PASSWORD=myPassword;OPTION=3;"
Excel: "Driver={Microsoft Excel Driver (*.xls)};DriverId=790;Dbq=C:\MyExcel.xls;DefaultDir=c:\mypath;"
Text: "Driver={Microsoft Text Driver (*.txt; *.csv)};Dbq=c:\txtFilesFolder\;Extensions=asc,csv,tab,txt;"
FoxPro (via the dBASE driver): "Driver={Microsoft dBASE Driver (*.dbf)};DriverID=277;Dbq=c:\mydbpath;"
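All of the strings above share the same shape: a `Driver={...}` entry followed by `key=value` pairs separated by semicolons. A small Python helper can assemble them from parts. This is a sketch only; the function name `build_odbc_connection_string` is hypothetical, and it does no escaping, so values that contain `;` or `}` would need extra handling.

```python
def build_odbc_connection_string(driver, **attributes):
    """Join a driver name and key=value attributes into an ODBC string.

    Attributes appear in the order they are passed (Python 3.7+).
    """
    parts = ["Driver={%s}" % driver]
    parts += ["%s=%s" % (key, value) for key, value in attributes.items()]
    return ";".join(parts) + ";"


conn_str = build_odbc_connection_string(
    "SQL Server", Server="Aron1", Database="pubs", Uid="sa", Pwd="asdasd")
print(conn_str)
# Driver={SQL Server};Server=Aron1;Database=pubs;Uid=sa;Pwd=asdasd;

# With the third-party pyodbc module installed (not imported here),
# the string could then be passed to pyodbc.connect(conn_str).
```

Building the string from named parts keeps credentials and server names out of hard-coded literals, so they can be read from configuration instead.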

