In master data management the two most prominent domains are parties and products.
In the quest to find representations of parties that are actually the same real-world party, and representations of products that are actually the same real-world product, we typically execute fuzzy data matching of:
- Party names, such as person names and company names
- Product descriptions
However, I have often seen party names being an integral part of matching products.
A product is most often regarded as distinct based not only on the description but also on the manufacturer. So besides being sharp on matching product descriptions for light bulbs, you must also consider whether, for example, the following manufacturer company names are the same or not:
- Koninklijke Philips Electronics N.V.
- Philips Electronic
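A rough sketch of how such a comparison could be approached, using only the Python standard library. The noise word list and the function names are my own assumptions for illustration, and a real matching tool would use far richer reference data:

```python
import re
from difflib import SequenceMatcher

# Assumed, deliberately incomplete list of legal forms and honorifics
# that carry little identifying information in a company name.
NOISE_WORDS = {"nv", "bv", "inc", "ltd", "gmbh", "koninklijke", "royal"}


def normalize_company(name):
    """Lowercase the name, collapse 'N.V.' to 'nv' and drop noise words."""
    cleaned = name.lower().replace(".", "")
    cleaned = re.sub(r"[^a-z0-9 ]", " ", cleaned)
    tokens = [t for t in cleaned.split() if t not in NOISE_WORDS]
    return " ".join(tokens)


def company_similarity(a, b):
    """Similarity between 0 and 1 on the normalized names."""
    return SequenceMatcher(
        None, normalize_company(a), normalize_company(b)
    ).ratio()


score = company_similarity(
    "Koninklijke Philips Electronics N.V.", "Philips Electronic"
)
print(round(score, 2))
```

Where to put the threshold between "same" and "not the same" is a business decision. The score only suggests candidates for a match.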
A book is a product. The title of the book is the description. But the author’s person name counts as well. So how do we collect the entire works of the author:
- Hans Christian Andersen
- Andersen, Hans Christian
- H. C. Andersen
as all three representations are superb bad data?
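One simple way to bring the three forms together is to reduce each name to a key of surname plus given-name initials. The sketch below is only an illustration: the function name is hypothetical, and it assumes the surname is either the part before a comma or the last word.

```python
def name_key(name):
    """Reduce a person name to a (surname, initials) match key."""
    if "," in name:
        # "Andersen, Hans Christian": surname comes first
        surname, given = [part.strip() for part in name.split(",", 1)]
    else:
        # "Hans Christian Andersen": assume the last word is the surname
        *given_parts, surname = name.split()
        given = " ".join(given_parts)
    # "Hans Christian" and "H. C." both become "hc"
    initials = "".join(word[0] for word in given.replace(".", " ").split())
    return surname.lower(), initials.lower()


names = ["Hans Christian Andersen", "Andersen, Hans Christian", "H. C. Andersen"]
print({name_key(n) for n in names})
```

All three representations end up with the same key. Mind that such a key also merges different people sharing surname and initials, so it is better used for finding match candidates than for deciding that two authors are the same.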
A certain kind of teddy bear has a product description like “Plush magenta teddy bear”. But each bear may have a pet name like “Lots-O’-Huggin’ Bear” or just “Lotso” for short, as seen in the film “Toy Story 3”. And seriously: In real business I have worked with building a bear data model and the related data matching.
PS: For those who have seen Toy Story 3: Is that Lotso one or two real world entities?
When working with data and information quality we often use words such as rubbish, poor and bad, and other negative words, when describing data that needs to be enhanced in order to achieve better data quality. However, what is bad may have been good in the context where a particular set of data originated.
Right now I have some fun with author names.
An example of good and bad is an author I have used several times on this blog, namely the late fairy tale writer whose full name is:
Hans Christian Andersen
When gazing through data you will meet his name represented this way:
Andersen, Hans Christian
This representation is fit for purpose, for example, when looking for a book by this author at a library where the fiction is sorted by the surname of the author.
The question is then: Do you want to have the one representation, the other representation or both?
You may also meet his name in another form in another field than the name field. For example there is a main street in Copenhagen called:
H. C. Andersens Boulevard
This is the representation of the real-world name of the street, which holds a common form of the author’s name with only initials.
During the existence of this blog I have come to use two tags several times: the fairy tale author Hans Christian Andersen, as an inspiration for data quality related subjects, and happy databases, as a counterweight to our tendency to talk too much about all the bad data quality around.
In embracing these two tags, the fairy tale The Snow Queen also starts at the very bad end.
An evil troll makes a magic mirror that has the power to distort the appearance of things reflected in it. It fails to reflect all the good and beautiful aspects of people and things while it magnifies all the bad and ugly aspects so that they look even worse than they really are; for example, it makes the loveliest landscapes look like “boiled spinach.” I think every child understands that metaphor.
We tend to do the same in the data quality realm. In order to make a case for data and information quality improvement we like to tell about train wrecks, like those on the site edited by IAIDQ. And for the record, I am as guilty as everyone else of reading, laughing and contributing to the mobbing when someone else makes a mistake within data management.
I have earlier used the fairy tales of Hans Christian Andersen on this blog. This time it is the story about the princess on the pea.
The story tells of a prince who wants to marry a princess, but is having difficulty finding a suitable wife. Something is always wrong with those he meets, and he cannot be certain they are real princesses. One stormy night (always a harbinger of either a life-threatening situation or the opportunity for a romantic alliance in Andersen’s stories), a young woman drenched with rain seeks shelter in the prince’s castle. She claims to be a princess, so the prince’s mother decides to test their unexpected guest by placing a pea in the bed she is offered for the night, covered by 20 mattresses and 20 featherbeds. In the morning the guest tells her hosts—in a speech colored with double entendres—that she endured a sleepless night, kept awake by something hard in the bed; which she is certain has bruised her. The prince rejoices. Only a real princess would have the sensitivity to feel a pea through such a quantity of bedding. The two are married, and the pea is placed in the Royal Museum.
Buying a data quality tool is just as hard as it was for a prince to find a real princess in the good old days. How can you be certain that the tool is able to help you find the difficult, non-obvious flaws hidden in your already stored data or in the data streams coming in?
I think performing a test like the queen did in Andersen’s story is a must, and like the queen didn’t, don’t tell the vendor about the pea. Wait and see if the tool gets black and blue all over by the pea.
The short story (or fairy tale) The Little Match Girl (or The Little Match Seller) by Hans Christian Andersen is a sad story with a bad ending, so it shouldn’t actually belong here on this blog where I will try to tell success stories about data quality improvement resulting in happy databases.
However, if I look at the industry of making data matching tools (and data matching technology is a large part of data quality tools) I wonder if the future has ever been that bright.
There are many tools for data matching out there.
Some tool vendors have been acquired by big players in the data management realm, such as:
- IBM acquired Ascential Software
- SAS Institute acquired DataFlux
- Informatica acquired Similarity Systems and Identity Systems
- Microsoft acquired Zoomix
- SAP acquired Fuzzy Informatik and Business Objects, which had acquired FirstLogic
- Experian acquired QAS
- Tibco acquired Netrics
(the list may not be complete, just what immediately comes to my mind).
The rest of the pack is struggling with selling matches in the cold economic winter.
There is another fairy tale similar to The Little Match Girl called The Star Money, collected by the Brothers Grimm. This story has a happy ending. Here the little girl gives her remaining stuff away for free and is rewarded with money falling down from above. Perhaps this is like The Coming of Age of Open Source as told in a recent Talend blog post?
Well, open source is not expected to break the ice in the Frozen Quadrant until 2012.
Since engaging in the social media community around data and information quality I have noticed quite a lot of mobbing going on pointed at data quality tools. The sentiment seems to be that data quality tools are no good and will play only a very little role, if any, in solving the data and information quality conundrum.
I like to think of data quality tools as being like the cygnet (the young swan) in the fairy tale “The Ugly Duckling” by Hans Christian Andersen. An immature clumsy flapper in the barnyard. And sure, until now tools have generally not been ready to fly, but have mostly been situated in the downstream corner of the landscape.
Since last September I have been involved in making a new data quality tool. The tool is based on the principles described in the post Data Quality from the Cloud.
We have now seen the first test flights in the real world and I am absolutely thrilled about the testimonials. Examples:
- “It (the tool) is lean”. I like that since lean is a production practice that considers the expenditure of resources for any goal other than the creation of value for the end customer to be wasteful.
- “It is gold”. I like to consider that as a calculated positive business case.
- “It is the best thing happened in my period of employment”. I think happy people are essential to data quality.
Paraphrasing Andersen: I never dreamed there could be so much happiness, when I was working with ugly ducklings.
The title of the fairy tale “The Ugly Duckling” by Hans Christian Andersen was originally supposed to be the more positive “The Young Swan” (or “The Cygnet”), but as Andersen did not want to spoil the element of surprise in the protagonist’s transformation, he discarded it for “The Ugly Duckling”.
In a blog post called “Why Isn’t Our Data Quality Worse?” posted today (or last night local Iowa time) Jim Harris examines the psychology term “negativity bias” that explains how bad evokes a stronger reaction than good in the human mind.
Surely, data quality improvement evangelism is most often based on the strong force of badness. Always describing how bad data is everywhere. Bashing executives who don’t get it. Only at the end, as a nice positive surprise, do we tell how our product/consultancy will transform the ugly duckling into a beautiful swan.
My latest blog post with a truly positive angle called “What a Lovely Day” is almost 2 months old. So I promise myself the next post will have the title “The Young Swan” (or “The Cygnet”) and will be extremely positive about data quality improvement.