When working with data quality, and not least data matching, an ever-recurring issue is compound words. We even have the issue when talking about terms related to data quality: is it called “meta data” or “metadata”, and is it called “multi-domain MDM” or “multidomain MDM”? With MDM my spell checker likes the first option, but Gartner (the analyst firm) likes the last option.
In an international context the issue with compound words becomes much more frequent. Some languages, such as the Germanic languages other than English, use compound words much more. For example, a street name such as “Main Street” will be “Hauptstrasse” in German and “Hovedgade” in Danish.
If your first language has many compound words (like mine), you tend to use (and overuse) compound words even in English. I stumbled upon that when I was helping a family member look at search trends for “hair extensions”.
If you look at the regional interest in Google Insights, the interest in “hair extensions” (figure 1) is big mostly in countries with English as first language, while the interest in “hairextensions” (figure 2) is big mostly in countries with English as a second or third language.
This is a really good point.
Companies often struggle with this issue when selecting a Master Account Name or other important data points in three facets:
1) Making a decision on a convention and standard (e.g. “hair extensions” or “hairextensions”)
2) Consolidating all instances of these scenarios under the selected convention using various matching techniques.
3) Enforcing an ongoing methodology that ensures those conventions and standards are maintained.
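As a rough illustration of the second facet, one simple matching technique is to compare values on a key that ignores spacing, hyphens and case, so that compound and non-compound spellings end up in the same group. This is only a minimal sketch: the function names `compound_key` and `group_variants` are made up for the example, and such a key will sometimes group values that should stay apart, so the groups are candidates for review rather than confirmed matches.

```python
import re
import unicodedata
from collections import defaultdict


def compound_key(text: str) -> str:
    """Build a match key that ignores spacing, hyphens, punctuation and case."""
    # casefold() is a more aggressive lower(); it also maps the German "ß" to "ss".
    folded = unicodedata.normalize("NFKC", text).casefold()
    # Remove every run of characters that are not letters or digits.
    return re.sub(r"[\W_]+", "", folded)


def group_variants(values):
    """Group spellings that share the same compound-insensitive key."""
    groups = defaultdict(list)
    for value in values:
        groups[compound_key(value)].append(value)
    return dict(groups)


if __name__ == "__main__":
    names = ["hair extensions", "hairextensions", "Hair-Extensions", "metadata", "meta data"]
    for key, variants in group_variants(names).items():
        print(key, variants)
```

Once a convention is chosen, the spelling that follows it can be kept as the standard for each group, and the same key can be checked at data entry to flag new variants.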
I would find it very interesting to hear your thoughts not only on deciding upon an ideal convention but also on how best to enforce that methodology.
Josh, thanks a lot for the comment.
One way is to rely on a trusted external source, if there is one. With company names it could be the publicly registered name found in a business directory; with street names, an address directory. But I also have seen several versions of the truth between such directories.
Ultimately I think data governance is the way forward. I don’t like the term data owner, but a data steward for the domain in question decides based on balancing input from all data providers and data consumers.
However, we still need a lot of data matching, as not all data are born within your jurisdiction.
I couldn’t agree with you more. Data governance is the way forward.
Many organizations that I speak with on a daily basis mention that the difficulty in data governance is gaining buy-in from the data providers and consumers to the degree where real change can be made and enforced.
Often, a combination of that buy-in and support through technology provides an effective balance for all.
>But I also have seen several versions of the truth between such directories.
Henrik, I certainly second that. For example, helpIT systems ltd in the UK is down as “Help It Systems” (two words) on Royal Mail PAF but correct in Companies House data. Often you need to draw on one external dataset for one data item, e.g. name, and another for other data, e.g. address, even if they both contain name and address.
You’re right, Steve. With company names there aren’t really any rules, and you may have to settle for the best source for each attribute (for each country).
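Settling for the best source per attribute could be sketched roughly as below. Everything here is an assumption for illustration: the source labels, the priority order, the function name `build_golden_record` and the address value are made up, and a real setup would typically hold one such priority list per country.

```python
# Hypothetical priority order: which source to trust first for each attribute.
SOURCE_PRIORITY = {
    "name": ["companies_house", "royal_mail_paf"],
    "address": ["royal_mail_paf", "companies_house"],
}


def build_golden_record(records_by_source, priority=SOURCE_PRIORITY):
    """Pick each attribute from the highest-priority source that has a value."""
    golden = {}
    for attribute, sources in priority.items():
        for source in sources:
            value = records_by_source.get(source, {}).get(attribute)
            if value:
                # Keep the chosen value together with where it came from.
                golden[attribute] = (value, source)
                break
    return golden


if __name__ == "__main__":
    # Illustrative records only; the address is a placeholder, not real data.
    records = {
        "companies_house": {"name": "helpIT systems ltd", "address": None},
        "royal_mail_paf": {"name": "Help It Systems", "address": "1 Example Road"},
    }
    print(build_golden_record(records))
```

Keeping the source next to each chosen value makes it easier for a data steward to see why a given version of the truth won.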