Dirty data
From Dedupe
The term Dirty Data refers to data which doesn't conform to standards and/or contains a lot of noise. Consider the below data extract;
"Jones, Jamie","106 St James Terrace" "Mr J Jones","106 Saint James Terrace" "Jamie Jones","106, St James Terrace"
It's likely the three records are the same yet they canot be considered equal as they've not been standardised. There are several issues here which need to be addressed separately.
Contents |
Names & Naming Conventions
Returning to my previous example;
"Jones, Jamie" "Mr J Jones" "Jamie Jones"
For obvious reasons we cannot extract the first name from "Mr J Jones" nor can we extract the sex from "Jones, Jamie" or "Jamie Jones". However we can try to normalise the data by;
- Removing special characters (anything non-alpha)
- Removing conventions and titles (e.g. Mrs, Dr, Sir, of, from)
- Rearranging the names alphabetically*
*Bizzarre as this may sound I think this overcomes a lot of problems. I've yet to provide any evidence as to the success of this method so would appreciate any feedback/comments.
Acronyms & Abbreviations
Is there a way to overcome the problems caused by acronyms and abbreviations without the use of a lookup table?
It has to be considered that company names, sometimes consist of abbreviations. So it might be of genuine meaning, if there isn't other information to match. Otherwise abbreviations might be simply skipped, as information less than e.g. three characters doesn't represent a meaningful information.
Optical Character Errors
When dealing with data such as part numbers or license plates we might need to consider alternate coding methods. For example any occurrence of; A,4,7 becomes A 8,B,S,5 becomes B 0,C,O becomes C
Now if we had two number plates AX02 YBC and 4XO2 Y8C both would be coded as AXC YBC and hence matched... Once two or more records are matched it would probably be a good idea to see how many characters are true matches as opposed to visual matches to give an indication of the likelihood of them actually being matches.
I just thought this up on the spot... so any comments/thoughts discussion etc?
Use a similarity function based on techniques from OCR error correction... Google may help here. I bet a quick hack based on comparing text and its 1337-speak representation would be adequate for most uses. For OCR errors in general, like rn versus m we need to find or create a database. There aren't many previous efforts to create such a list.

