This morning I noticed this lightbulb joke in a tweet from @mortensax:
Besides finding it amusing I also related to it since I have used an example with light bulbs in a webinar about data matching as seen here:
The use of synonyms in Search Engine Optimization (SEO) is very similar to the techniques we use in data matching.
Here the problem is that, for example, these two product descriptions may have a fairly high edit distance (they are very different character by character) yet describe the same product:
- Light bulb, A 19, 130 Volt long life, 60 W
- Incandescent lamp, 60 Watt, A19, 130V
while these two product descriptions have an edit distance of only a single character substitution, yet are not the same product (though they belong to the same category):
- Light bulb, 60 Watt, A 19, 130 Volt long life
- Light bulb, 40 Watt, A 19, 130 Volt long life
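To make the contrast concrete, here is a minimal Levenshtein edit distance sketch (my own illustration, not something from the original post) applied to the two pairs above:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

same_product = ("Light bulb, A 19, 130 Volt long life, 60 W",
                "Incandescent lamp, 60 Watt, A19, 130V")
different_products = ("Light bulb, 60 Watt, A 19, 130 Volt long life",
                      "Light bulb, 40 Watt, A 19, 130 Volt long life")

print(levenshtein(*same_product))        # large distance, yet the same product
print(levenshtein(*different_products))  # distance 1, yet different products
```

So a raw character-level distance threshold would get both pairs wrong, which is exactly why the techniques below matter.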
Working with product data matching is indeed very enlightening.
Always a challenge. I think the key to this is standardization before matching. You have to help the algorithms when you can. The standard description should be either light bulb or incandescent lamp, for example. A reference table would help standardize this and the other attributes. If the data can’t be changed for some reason, it’s not so uncommon to create new attributes that will only be used for matching.
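As a rough sketch of what Steve describes, standardization via a reference table could derive new attributes used only for matching. Everything here (the synonym table, the regular expressions, the `match_key` helper) is my own illustrative assumption, not a real package or the commenter's actual implementation:

```python
import re

# Hypothetical reference table mapping raw product-type terms to a standard term.
TYPE_SYNONYMS = {"incandescent lamp": "light bulb",
                 "light bulb": "light bulb"}

def match_key(description: str) -> tuple:
    """Derive a standardized matching key (type, shape, watts, volts)."""
    text = description.lower()
    # Standardize the product type via the reference table.
    ptype = next((std for raw, std in TYPE_SYNONYMS.items() if raw in text), None)
    # Shape code, tolerating an optional space: "A 19" or "A19".
    shape = re.search(r"\ba\s?(\d+)\b", text)
    # Units in short or long form: "60 W", "60 Watt", "130V", "130 Volt".
    watts = re.search(r"(\d+)\s?(?:w\b|watt)", text)
    volts = re.search(r"(\d+)\s?(?:v\b|volt)", text)
    return (ptype,
            "a" + shape.group(1) if shape else None,
            int(watts.group(1)) if watts else None,
            int(volts.group(1)) if volts else None)

# The two "same product" descriptions now collapse to the same key,
# while the 40 W bulb gets a different key.
print(match_key("Light bulb, A 19, 130 Volt long life, 60 W"))
print(match_key("Incandescent lamp, 60 Watt, A19, 130V"))
```

With keys like these, matching becomes an equality test on standardized attributes instead of a fuzzy comparison of raw strings.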
Thanks Steve. I agree. These are exactly the techniques I know are used for solving challenges in product master data matching.
Building databases containing compound fields such as the examples you have given here is extremely poor design. It is the type of effort that might be expected from a 1st year computing student, one that would give the lecturer the opportunity to point out the plethora of errors it contained and why any good database designer should, and could, avoid them.
Attempting to work on and sort such data is also 1st year student level. It brings to mind the wonderful Irish saying: "If I wanted to get to Dublin, I wouldn't start from here!"
All of the embedded attributes need to be identified and extracted into properly structured tables, where each column represents a single attribute.
The next step is to define which attributes represent the Unique Identifier of the entity and then sort on these to highlight duplicate rows. In a more sophisticated solution, many of these columns will be foreign keys to other reference and domain entities.
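A minimal sketch of that duplicate-highlighting step, assuming the embedded attributes have already been extracted into columns; the sample rows, column layout, and the choice of unique-identifier columns are illustrative assumptions of mine:

```python
from collections import defaultdict

# Rows after extraction: (source_system, product_type, shape, watts, volts).
# Sample data invented for illustration.
rows = [
    ("ERP",     "light bulb", "A19", 60, 130),
    ("webshop", "light bulb", "A19", 60, 130),  # same product from another system
    ("ERP",     "light bulb", "A19", 40, 130),  # different wattage -> distinct entity
]

# Treat (product_type, shape, watts, volts) as the unique identifier;
# the source system is not part of the entity's identity.
groups = defaultdict(list)
for row in rows:
    groups[row[1:]].append(row)

# Groups with more than one row are candidate duplicates to review/merge.
duplicates = {key: rws for key, rws in groups.items() if len(rws) > 1}
for key, rws in duplicates.items():
    print(key, "->", len(rws), "rows")
```

In a real solution the grouping would typically be a `GROUP BY ... HAVING COUNT(*) > 1` over the structured table, with the reference and domain attributes held as foreign keys, as John notes.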
Jim Harris has done an excellent video showing how to start this process at http://www.youtube.com/watch?v=bdzD4Kp7weg
Thanks for the example, Henrik.
Thanks John. I agree about the data modeling practice we need in handling this kind of data.
But I have come across multiple situations where even commercial software packages and home-grown solutions built by veterans handle these data in compound fields. I have also been involved in defining the requirements for some of the best product master data hubs in the world. Not because the hub doesn't handle the breakdown – it does. But data is coming in from outside your jurisdiction, in great numbers, from the solutions mentioned above (and from spreadsheets and flat files of all kinds).
And oh yes, I like Jim’s video too.