Lots of Product Names

In master data management the two most prominent domains are:

  • Parties and
  • Products

In the quest to find representations of parties that are actually the same real world party, and representations of products that are actually the same real world product, we typically execute fuzzy data matching of:

  • Party names as person names and company names
  • Product descriptions

However, I have often seen party names being an integral part of matching products.

Some examples:

Manufacturer Names:

A product is most often regarded as distinct based not only on the description but also on the manufacturer. So besides being sharp on matching product descriptions for light bulbs, you must also consider whether, for example, the following manufacturer company names are the same or not:

  • Koninklijke Philips Electronics N.V.
  • Phillips
  • Philips Electronic
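
As a minimal sketch of how such a comparison could look, the Python below normalizes the names and scores them with the standard library's difflib. The noise token list and the idea of dropping words like "Koninklijke" and "Electronics" are my illustrative assumptions, not a rule from any particular tool, and real data would need a much longer list and a tuned threshold.

```python
import re
from difflib import SequenceMatcher

# Hypothetical noise tokens (legal forms and generic words); an assumption for this sketch.
NOISE_TOKENS = {"nv", "inc", "ltd", "gmbh", "koninklijke", "royal",
                "electronics", "electronic"}

def normalize(name: str) -> str:
    """Lowercase, strip punctuation and drop noise tokens."""
    tokens = re.sub(r"[^a-z0-9 ]", "", name.lower()).split()
    kept = [t for t in tokens if t not in NOISE_TOKENS]
    # Fall back to all tokens if everything was classified as noise.
    return " ".join(kept or tokens)

def similarity(a: str, b: str) -> float:
    """Return a score between 0 and 1 for two manufacturer names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

if __name__ == "__main__":
    names = ["Koninklijke Philips Electronics N.V.", "Phillips", "Philips Electronic"]
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            print(f"{a!r} vs {b!r}: {similarity(a, b):.2f}")
```

With this normalization the first and third names reduce to the same string, while the misspelled "Phillips" still scores high without being identical, so a human or a threshold has to make the final call.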

Author Names:

A book is a product. The title of the book is the description. But also the author’s person name counts. So how do we collect the entire works made by the author:

  • Hans Christian Andersen
  • Andersen, Hans Christian
  • H. C. Andersen

as all three representations are superb bad data?
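
One way to bring the three forms together, sketched here in Python under simple assumptions, is to reduce each name to a key made of the surname and the initials of the given names. The function name_key is hypothetical and assumes Western-style names where the surname is either in front of a comma or is the last word.

```python
import re
from typing import Tuple

def name_key(name: str) -> Tuple[str, str]:
    """Return (surname, given-name initials) for a Western-style person name."""
    name = name.strip()
    if "," in name:
        # "Surname, Given Names", as used in library catalogues.
        surname, given = (part.strip() for part in name.split(",", 1))
    else:
        # "Given Names Surname": assume the last word is the surname.
        *given_parts, surname = name.split()
        given = " ".join(given_parts)
    # Take the first letter of every run of letters in the given names.
    initials = "".join(token[0] for token in re.findall(r"[^\W\d_]+", given))
    return surname.lower(), initials.lower()

if __name__ == "__main__":
    for form in ["Hans Christian Andersen",
                 "Andersen, Hans Christian",
                 "H. C. Andersen"]:
        print(form, "->", name_key(form))
```

All three forms give the key ('andersen', 'hc'). The price is that a key this coarse would also match another H. C. Andersen, so it is better suited to finding candidates than to making the final decision.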

Bear Names:

A certain kind of teddy bear has a product description like “Plush magenta teddy bear”. But each bear may have a pet name like “Lots-O’-Huggin’ Bear” or just “Lotso” for short, as seen in the film “Toy Story 3”. And seriously: in real business I have worked on building a bear data model and the related data matching.

PS: For those who have seen Toy Story 3: Is that Lotso one or two real world entities?  

Superb Bad Data

When working with data and information quality we often use words such as rubbish, poor and bad, along with other negative words, when describing data that needs to be enhanced in order to achieve better data quality. However, what is bad may have been good in the context where a particular set of data originated.

Right now I have some fun with author names.

An example of good and bad could be an author I have used several times on this blog, namely the late fairy tale writer whose full name is:

Hans Christian Andersen

When gazing through data you will meet his name represented this way:

Andersen, Hans Christian

This representation is fit for purpose, for example, when looking for a book by this author at a library where fiction is sorted by the author’s surname.

The question is then: Do you want to have the one representation, the other representation or both?

You may also meet his name in another form in a field other than the name field. For example, there is a main street in Copenhagen called:

H. C. Andersens Boulevard

This is the real world name of the street, which holds a common form of the author’s name with initials only.

The Snow Queen

During the existence of this blog I have come to use two tags several times: the fairy tale author Hans Christian Andersen, as an inspiration for data quality related subjects, and happy databases, as a counterweight to our tendency to talk too much about all the bad data quality around.

Embracing both tags, the fairy tale The Snow Queen also starts at the very bad end.

An evil troll makes a magic mirror that has the power to distort the appearance of things reflected in it. It fails to reflect all the good and beautiful aspects of people and things while it magnifies all the bad and ugly aspects so that they look even worse than they really are; for example, it makes the loveliest landscapes look like “boiled spinach.” I think every child understands that metaphor.

We tend to do the same in the data quality realm. In order to make a case for data and information quality improvement we like to tell about trainwrecks like those on the site edited by IAIDQ. And for the record, I am as guilty as everyone else of reading, laughing and contributing to the mobbing when someone else makes a mistake within data management.

The Princess and the Pea

I have earlier used the fairy tales of Hans Christian Andersen on this blog. This time it is the story about the princess on the pea.

The story tells of a prince who wants to marry a princess, but is having difficulty finding a suitable wife. Something is always wrong with those he meets, and he cannot be certain they are real princesses. One stormy night (always a harbinger of either a life-threatening situation or the opportunity for a romantic alliance in Andersen’s stories), a young woman drenched with rain seeks shelter in the prince’s castle. She claims to be a princess, so the prince’s mother decides to test their unexpected guest by placing a pea in the bed she is offered for the night, covered by 20 mattresses and 20 featherbeds. In the morning the guest tells her hosts—in a speech colored with double entendres—that she endured a sleepless night, kept awake by something hard in the bed; which she is certain has bruised her. The prince rejoices. Only a real princess would have the sensitivity to feel a pea through such a quantity of bedding. The two are married, and the pea is placed in the Royal Museum.

Buying a data quality tool is just as hard as it was for a prince to find a real princess in the good old days. How can you be certain that the tool is able to help you find the difficult, non-obvious flaws hidden in your already stored data or in the data streams coming in?

I think performing a test like the queen did in Andersen’s story is a must, and like the queen didn’t, don’t tell the vendor about the pea. Wait and see if the tool gets black and blue all over by the pea.
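
A pea test could be sketched in Python like this: plant a record that you know is a hard duplicate of an existing record and check whether the tool pairs the two. Everything here is hypothetical, including the matcher interface (a function taking records and returning pairs of ids) and the exact_matcher stand-in; a real tool would have to be wrapped to fit.

```python
from typing import Callable, Iterable, List, Tuple

Record = Tuple[str, str]  # (record id, name)
Matcher = Callable[[List[Record]], Iterable[Tuple[str, str]]]

def feels_the_pea(matcher: Matcher, records: List[Record],
                  pea: Record, twin_id: str) -> bool:
    """Plant `pea` among `records` and report whether the matcher pairs it with `twin_id`."""
    seeded = list(records) + [pea]
    found = {frozenset(pair) for pair in matcher(seeded)}
    return frozenset((pea[0], twin_id)) in found

def exact_matcher(records: List[Record]) -> List[Tuple[str, str]]:
    """Stand-in for a weak tool: pairs only records whose names are identical."""
    pairs = []
    for i, (id_a, name_a) in enumerate(records):
        for id_b, name_b in records[i + 1:]:
            if name_a == name_b:
                pairs.append((id_a, id_b))
    return pairs

if __name__ == "__main__":
    records = [("1", "Hans Christian Andersen"), ("2", "Jacob Grimm")]
    pea = ("99", "Andersen, Hans Christian")  # a known, non-obvious duplicate of record 1
    print(feels_the_pea(exact_matcher, records, pea, twin_id="1"))
```

The exact matcher sleeps soundly on this pea and the check returns False. A tool worth marrying should come back black and blue.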

The Little Match Girl

The short story (or fairy tale) The Little Match Girl (or The Little Match Seller) by Hans Christian Andersen is a sad story with a bad ending, so it shouldn’t actually belong here on this blog where I will try to tell success stories about data quality improvement resulting in happy databases.

However, if I look at the industry of making data matching tools (and data matching technology is a large part of data quality tools) I wonder if the future has ever been that bright.

There are many tools for data matching out there.

Some tool vendors have been acquired by big players in the data management realm, for example:

  • IBM acquired Ascential Software
  • SAS Institute acquired DataFlux
  • Informatica acquired Similarity Systems and Identity Systems
  • Microsoft acquired Zoomix
  • SAP acquired Fuzzy Informatik and Business Objects, which had acquired FirstLogic
  • Experian acquired QAS
  • Tibco acquired Netrics

(the list may not be complete, just what immediately comes to my mind).

The rest of the pack is struggling with selling matches in the cold economic winter.

There is another fairy tale similar to The Little Match Girl called The Star Money, collected by the Brothers Grimm. This story has a happy ending. Here the little girl gives her remaining stuff away for free and is rewarded with money falling down from above. Perhaps this is like The Coming of Age of Open Source as told in a recent Talend blog post?

Well, open source is not expected to break the ice in the Frozen Quadrant until 2012.

Data Quality Tools: The Cygnets in Information Quality

Since engaging in the social media community around data and information quality I have noticed quite a lot of mobbing directed at data quality tools. The sentiment seems to be that data quality tools are no good and will play only a very small role, if any, in solving the data and information quality conundrum.

I like to think of data quality tools as being like the cygnet (the young swan) in the fairy tale “The Ugly Duckling” by Hans Christian Andersen. An immature clumsy flapper in the barnyard. And sure, until now tools have generally not been ready to fly, but been mostly situated in the downstream corner of the landscape.

Since last September I have been involved in making a new data quality tool. The tool is based on the principles described in the post Data Quality from the Cloud.

We have now seen the first test flights in the real world and I am absolutely thrilled about the testimonials. Examples:

  • “It (the tool) is lean”. I like that since lean is a production practice that considers the expenditure of resources for any goal other than the creation of value for the end customer to be wasteful.
  • “It is gold”. I like to consider that as a calculated positive business case.
  • “It is the best thing happened in my period of employment”. I think happy people are essential to data quality.

Paraphrasing Andersen: I never dreamed there could be so much happiness, when I was working with ugly ducklings.

The Ugly Duckling

The title of the fairy tale “The Ugly Duckling” by Hans Christian Andersen was originally supposed to be the more positive “The Young Swan” (or “The Cygnet”), but as Andersen did not want to spoil the element of surprise in the protagonist’s transformation, he discarded it for “The Ugly Duckling”.

In a blog post called “Why Isn’t Our Data Quality Worse?” posted today (or last night local Iowa time) Jim Harris examines the psychological term “negativity bias”, which explains how bad evokes a stronger reaction than good in the human mind.

Surely, data quality improvement evangelism is most often based on the strong force of badness. Always describing how bad data is everywhere. Bashing executives who don’t get it. Only at the end, as a nice positive surprise, do we tell how our product/consultancy will transform the ugly duckling into a beautiful swan.

My latest blog post with a truly positive angle called “What a Lovely Day” is almost 2 months old. So I promise myself the next post will have the title “The Young Swan” (or “The Cygnet”) and will be extremely positive about data quality improvement.

Data Quality and Common Sense

My favourite story is the fairy tale “The Emperor’s New Clothes” by Hans Christian Andersen.

In this tale an emperor hires two swindlers (aka consultants) who offer him the finest dress from the most beautiful cloth. This cloth, they tell him, is invisible to anyone who is either stupid or unfit for his position. In fact there is no cloth at all, but no one (except, at the end, a little child) dares to say so.

The Data Quality discipline is tormented by belonging to both the business side and the technology side of practice. This means that we have to live with the buzzwords and the smartness coming from both the management consultants and the technology consultants and vendors – including myself.

So you really have to believe in a lot of things and terms said in order not to look stupid or unfit for your position.

A way to cope with this is to look behind all the fine terms and recognize that most things said and presented are just another way of expressing common sense. Some examples:

Business Process: What you do at work – e.g. selling some stuff and putting data about it into a database so it’s ready for invoicing.

Referential Integrity Error: When you sold something not in the database. You may pick another item from the current list.

Bad Change Management: When someone tells you to do it in another way. Now.

Organisational Resistance: When you find that way completely ridiculous because no one tells you why.

Fuzzy logic: This is about the common nature of most questions in life. Statements are not absolutely true or absolutely false but somewhere in between depending on the angle from where you observe.

Business Intelligence: When someone puts your data along with some other data into a new context visualised in a graph in order to replace human gut feeling.

Poor Enterprise Wide Data Quality: The invoicing went well. The decision made from the graph didn’t. 

Data Governance: Meetings and documents about what went wrong with the data and how we can do better.

My experience is that the most successful data quality improvements are made when they are guided by common sense and expressed as being that. From there you may find great inspiration and practical skills and tools in each area of expertise.