The Letter Ø

This blog is written in English. Therefore the letters used are normally restricted to A to Z.  

The English alphabet is one of many alphabets using Latin (or Roman) letters. Other alphabets like the Russian uses Cyrillic letters. Then there are other script systems in the world which besides alphabets are abjads, abugidas, syllabic scripts and symbol scripts. Learn more about these in the post Script Systems.

My last surname is “Sørensen”. This word contains the lower case letter “ø” which is “Ø” in upper case. This letter is part of two alphabets: The Danish/Norwegian and the Faroese. Sometimes data has to be transformed into the English alphabet. Then the letter “ø” may be transformed to either “o” or “oe”. So my last surname will be either “Sorensen” or “Soerensen” in the English alphabet.

The town part of my address is “København”. The word “København” is what we call an endonym, which is the local word for place or a person. The opposite of endonym is exonym. The English exonym for “København” is “Copenhagen” which of course only has letters from the English alphabet. The Swedish exonym for “København” is “Köpenhamn”. Here we have a variant of “Ø” being “Ö”. The letter “Ö” exists in a lot of alphabets as Swedish, German, Hungarian and Turkish.

Usually “Ø” is transformed to “Ö” between Danish/Norwegian and these alphabets. The other way we usually accept the letter “Ö” in Danish/Norwegian master data.

These issues are of course a problem area in data quality, data matching and master data management. And with the complexity only between alphabets using Latin characters there is of course much more land to cover when including Cyrillic and Greek letters and then the other scripts systems with their hierarchical elements.

Bookmark and Share

Electronic Data Processing

A comment on my last blog post took me back to the days when I started working with Information Technology (IT). At that time our métier actually wasn’t called IT but EDP (Electronic Data Processing) – at least that was the case in my home country Denmark where we used the local TLA being EDB (Elektronisk Data Behandling).

I have earlier touched the long standing discussion about if “data quality” should be rebranded as “information quality” for example in the post called new blog name, as this should also require a new name for this blog.

The words data and information are indeed used very randomly around. In MDM (Master Data Management) we have two main domains being Customer Data Integration (CDI) and Product Information Management (PIM). Wonder if customer data is old school and product information is new school?

Bookmark and Share

Business and Pleasure

The data quality and master data management (MDM) realm has many wistful songs about unrequited love with “the business”.

This morning I noticed yet a tweet on twitter expressing the pain:

Here Gartner analyst Ted Friedman foresees the doom of MDM if we don’t get at least the traction from “the business” that BI (Business Intelligence) is getting.

In my eyes everything we do in Information Technology is about “the business”. Even computer games and digital entertainment is a core part of the respective industries. I also believe that IT is part of “the business”.

“The rest of the business” does see that some disciplines belong in the IT realm. This goes for database management, programming languages and network protocols. These disciplines are not doomed at all because it is so. “The rest of the business” couldn’t work today without these things around.

Certainly I have seen some IT based disciplines and related tools emerged and then been doomed during my years in the IT business. Anyone remembers case tools?   

With case tools I remember great expectations about business involvement in application design. But according to Wikipedia the main problems with case tools are (were): Inadequate standardization, unrealistic expectations, slow implementation and weak repository controls.

In other words: “The rest of the business” never really got in touch with the case tools because they didn’t work as supposed.

The business traction we see around BI (and the enabling tools) now is in my eyes very much about that the tools have matured, actually works, have become more user friendly and seems to create useful results for “the rest of the business”.

Data quality tools and MDM tools must continue to follow that direction too, because for sure: Data Quality tools and MDM tools does not solve any severe problems internally in the IT part of “the business”.

It’s my pleasure being part of that.

Bookmark and Share

Survival of the Fit Enough

When working with data quality and master data management at the same time you are constantly met with the challenge that data quality is most often defined as data being fit for the purpose of use, but master data management is about using the same data for multiple purposes at the same time.

Finding the right solution to such a challenge within an organization isn’t easy, because it despite all good intentions is difficult to find someone in the business with an overall answer to that kind of problems as explained in the blog post by David Loshin called Communications Gap? Or is there a Gap between Chasms?

An often used principle for overcoming these issues may (based on Darwin) be seen as “survival of the fittest”. You negotiate some survivorship rules between “competing” data providers and consumers and then the data being the fittest measured by these rules wins. All other data gets the KISS of death. Most such survivorship rules are indeed simple often based on a single dimension as timeliness, completeness or provenance.

Recently the phrase “survival of the fittest” in evolution theory has been suggested to be changed to “survival of the fit enough” because it seems that many times specimens haven’t competed but instead found a way into empty alternate spaces.

It seems that master data management and related data quality is going that way too. Data that is fit enough will survive in the master data hub in alternate spaces where the single source of truth exists in perfect symbioses with multiple realities.

Bookmark and Share

Product Placement

This wasn’t actually meant as a blog post series about the place entity in multi-domain master data management. But I think I have been carried away by my work, so now it is.

Places probably are most common related to the party domain as seen in the previous post called A Place in Time. But places certainly also have multiple relations to the product domain then forming a P trinity of parties, products and places in multi-domain master data management as seen in the post Your Place or My Place?

As with most things in the product domain also the product-place relations usually are very industry specific.

Some of the product-place relations I have worked with come from these industries:

Insurance

The fees you have to pay for some insurance products are related to the place where you live. In order to having the right fees (and for a lot of other reasons) an insurance company needs to analyze data based on the product-place relations. This may by the way go very wrong as told in the post A Really Bad Address.

Hospitality

Your product is a place where the selling attributes includes both the properties belonging to the place itself and the properties of the places being nearby.

Real Estate

Do I have to say more than three words: Location, Location, Location.

Your product-place relations

Tell me about what product-place relations you have worked with?

Bookmark and Share

Your Place or My Place?

We, and that’s including myself, often talk about multi-domain master data management as a marriage between party master data management (also called Customer Data Integration abbreviated as CDI) and Product Master Data Management (also called Product Information Management abbreviated as PIM).

The third most common master data domain is locations (or places). I like the term place, because then we have a P trinity: Parties, Products and Places. However there may be a fourth P involved, as I read a post today by Steven Jones of Capgemini telling that multi-domain MDM is a Pointless question.

The Premise of the Pointlessness is that Party and Product is an IT Perspective. The rest of the business sees the world from mainly either a customer centric perspective or a supply centric perspective.

I agree about that these perspectives exists too and actually made a blog post recently on sell side vs buy side master data quality.

I don’t agree about that this is an (pointless) IT versus business question, obviously also because I have a hard time recognizing the great divide between IT and business. From my perspective is IT a part of the business just like sales, marketing and purchase is it too. And from a product vendor perspective in the MDM realm you actually address the conjunction of business and technological needs a bit opposite to either being a database manager vendor aimed mostly at the IT part of business or a CRM or SCM vendor aimed mostly at the sales or purchase part of business.

Multi-domain MDM isn’t in my perspective a pointless place, but a meeting place between IT and all the other places in business and the core business entities being parties, products and places.

Bookmark and Share

We Will Become More Open

Yesterday I read a post called Taking Stock Of DQ Predictions For 2011 by Clarke Patterson of Informatica Corporation. Informatica is a well established vendor within data integration, data quality and master data management. The post is based on post called Six Data Management Predictions for 2011 by Steve Sarsfield of Talend. Talend is an open source vendor within data integration, data quality and master data management.

One of the six predictions for 2011 is: Data will become more open.

Steves (open source based) take on this is:

“In the old days good quality reference data was an asset kept in the corporate lockbox. If you had a good reference table for common misspellings of parts, cities, or names for example, the mind set was to keep it close and away from falling into the wrong hands.  The data might have been sold for profit or simply not available.  Today, there really is no “wrong hands”.  Governments and corporations alike are seeing the societal benefits of sharing information. More reference data is there for the taking on the internet from sites like data.gov and geonames.org.  That trend will continue in 2011.  Perhaps we’ll even see some of the bigger players make announcements as to the availability of their data. Are you listening Google?”

Clarkes (propriety software based) take is as follows:

“As data becomes more open, data quality tools will need to be able to handle data from a greater number of sources used for a broader number of purposes.  Gone are the days of single domain data manipulation.  To excel in this new, open market, you’ll need a data quality tool that can profile, cleanse and monitor data regardless of domain, that is also locale-aware and has pre-built rules and reference data.”

I agree with both views which by the way are on each of The Two Sides To The IT Coin – Data Centric IT vs Process Centric IT as explained by Robin Bloor in another recent post on the blog by data integration vendor Pervasive Software.

Steves and Clarkes perspectives are also close to me as my 2011 to do list includes:

  • Involvement in a solution called iDQ (instant Data Quality). The solution is about how we can help system users doing data entry by adding some easy to use technology that explores the cloud for relevant data related to the entry being done.
  • Helping enhancing a hot MDM hub solution with further data quality and multi-domain capabilities.

Bookmark and Share

Citizen ID and Biometrics

As I have stated earlier on this blog: The solution to the single most frequent data quality problem being party master data duplicates is actually very simple: Every person (and every legal entity) gets a unique identifier which is used everywhere by everyone.

Some countries, like Denmark where I live, has a unique Citizen ID (National identification number). Some countries are on the way like India with the Aadhaar project. But some of the countries with the largest economies in the world like United Kingdom, Germany and United States don’t seem to getting it in the near future.

I think United Kingdom was close lately, but as I understand it the project was cancelled. As seen in a tweet from a discussion on twitter today the main obstacles were privacy considerations and costs:

A considerable cost in the suggested project in United Kingdom, and also as I have seen in discussions for a US project, may be that an implementation today should also include biometric technology.

The question is however if that is necessary.

If we look at the systems in force today for example in Scandinavia they were implemented +40 years ago, and the Swedish citizen ID was actually implemented without digitalization in 1947. There are discussions going on about biometrics also as this is inevitable for issuing passports anyway. In the mean time the systems however continues to make a lot of data quality prevention and party master data management a lot easier than else around the world without having biometrics as a component.

No doubt about that biometrics will solve some problems related to fraud and so. But these are rare exceptions. So the cost/benefit analysis for enhancing an existing system with biometrics seems to be negative.

I guess the alleged need for biometric may have something to do with privacy considerations in a strange way: Privacy considerations are often overruled by the requirements for fighting terrorism – and here you need biometrics in identity resolution.

Bookmark and Share

Superb Bad Data

When working with data and information quality we often use words as rubbish, poor, bad and other negative words when describing data that need to be enhanced in order to achieve better data quality. However, what is bad may have been good in the context where a particular set of data originated.

Right now I have some fun with author names.

An example of good and bad could be with an author I have used several times on this blog, namely the late fairy tale writer called in full name:

Hans Christian Andersen

When gazing through data you will meet his name represented this way:

Andersen, Hans Christian

This representation is fit for purpose of use for example when looking for a book by this author at a library, where you sort the fictional books by the surname of the author.

The question is then: Do you want to have the one representation, the other representation or both?

You may also meet his name in another form in another field than the name field. For example there is a main street in Copenhagen called:

H. C. Andersens Boulevard

This is the representation of the real world name of the street holding a common form of the authors name with only initials.

Bookmark and Share

Matching Light Bulbs

This morning I noticed this lightbulb joke in a tweet from @mortensax:

Besides finding it amusing I also related to it since I have used an example with light bulbs in a webinar about data matching as seen here:

The use of synonyms in Search Engine Optimization (SEO) is very similar to the techniques we use in data matching.

Here the problem is that for example these two product descriptions may have a fairly high edit distance (very different character by character), but are the same:

  • Light bulb, A 19, 130 Volt long life, 60 W
  • Incandescent lamp, 60 Watt, A19, 130V

while these two product descriptions have an edit distance of only one substitution of a character, but are not the same product (though being same category):

  • Light bulb, 60 Watt, A 19, 130 Volt long life
  • Light bulb, 40 Watt, A 19, 130 Volt long life

Working with product data matching is indeed very enlightening.

Bookmark and Share