Valuable Inaccuracy

These days I am involved in an activity where you might say that we are improving information quality by creating data of questionable quality.

The business case is within public transit. In this particular solution passengers use chip cards when boarding buses, but do not use the cards when alighting. This is a cheaper and smoother solution than the alternative in electronic ticketing, where you have both check-in and check-out. A major drawback, however, is the missing information about where passengers alighted, which is very useful in business intelligence.

So what we do is, where possible, infer where the passenger alighted. If the passenger (seen as a chip card) boarded another bus within a given time frame at a stop point that is on, or near, a succeeding stop point on the previous route, then we assume the passenger alighted at that stop point, even though it was never recorded.
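In Python, the inference rule could look something like the minimal sketch below. The field names and the two master data lookup helpers (succeeding_stops and nearby_stops) are hypothetical illustrations, not the actual implementation:

```python
from datetime import timedelta

MAX_GAP = timedelta(hours=2)  # assumed time frame for linking two boardings

def infer_alighting(boarding, next_boarding, succeeding_stops, nearby_stops):
    """Infer the alighting stop for `boarding` from the same card's next boarding.

    `boarding` and `next_boarding` are dicts with card_id, time, route_id and stop_id.
    `succeeding_stops(route_id, stop_id)` yields the stop points after stop_id on the route.
    `nearby_stops(stop_id)` returns stop points close to a given stop point.
    """
    if next_boarding["card_id"] != boarding["card_id"]:
        return None  # not the same passenger (chip card)
    if next_boarding["time"] - boarding["time"] > MAX_GAP:
        return None  # too long between boardings to link them
    for stop in succeeding_stops(boarding["route_id"], boarding["stop_id"]):
        # The next boarding happens at, or near, a later stop on the previous route:
        # assume the passenger alighted there.
        if next_boarding["stop_id"] == stop or next_boarding["stop_id"] in nearby_stops(stop):
            return stop
    return None  # no plausible alighting stop found, leave it unrecorded
```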

Two real-life examples are where the passenger makes an interchange, and where the passenger later in the day returns from work, school or another regular activity.

An important prerequisite, however, is good data quality regarding stop point locations, route assignments and other master data and their relations.


Script Systems

This Friday my blog post was called Follow Friday diversity. In my hope of reaching more equal worldwide interaction, I wonder whether writing in English with Roman (Latin) characters is enough?

Take a look at the diversity in script systems around the world:

Alphabets

In an alphabet, each letter corresponds to a sound. These are also referred to as phonographic scripts. Examples of Alphabets: Roman (Latin); Cyrillic; Greek

Abjads

Abjads consist exclusively of consonants. Vowels are omitted from most words, because they are obvious to native speakers, and are simply inserted when speaking. In addition, Abjads are normally written from right to left. Examples of Abjads: Hebrew; Arabic

Abugidas

Abugidas are characteristic of scripts in India and Ethiopia. In this style, only the consonants are normally written, and standard vowels are assumed. If a different vowel is required, it is indicated with a special mark. Abugidas form an intermediate level between alphabetic and syllabic scripts. Examples of Abugidas: Hindi (Devanagari); Singhalese

Syllabic Scripts

Like alphabets, syllabic scripts are another type of phonographic script. In a syllabic script, each character stands for a syllable. Examples of Syllabic Scripts: Japanese (Hiragana, Katakana); Cherokee

Symbol Scripts

In symbolic scripts, each character is an ideogram standing for a complete word. Compound terms or concepts are composed of multiple symbols. Symbolic scripts are also called logographic scripts. Examples of Symbolic Scripts: Chinese; Japanese (Kanji)

Source: Worldmatch® Comparing International Data by Omikron Data Quality – full version here.
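In a data quality setting, a quick way to see which of these scripts a text value actually contains is to look at the Unicode character names. Below is a minimal sketch in Python using the standard unicodedata module; reading the script off the character name is a rough heuristic of mine, not a full script classifier:

```python
import unicodedata

def guess_scripts(text: str) -> set:
    """Very rough script detection: read the script name off the Unicode character name
    (e.g. 'LATIN SMALL LETTER A', 'CYRILLIC CAPITAL LETTER BE')."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split(" ")[0])
    return scripts

print(guess_scripts("Hello Привет שלום مرحبا こんにちは"))
# e.g. {'LATIN', 'CYRILLIC', 'HEBREW', 'ARABIC', 'HIRAGANA'}
```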


Quality Data Integration

As late as yesterday I was involved in yet another data quality issue, one not caused by the truth being unknown, but by the truth not being known in all the different databases within the enterprise, and of course (thanks, Murphy) precisely not by the application that needed that information due to a new requirement. Yep, the column was there alright, but it wasn’t updated, because until yesterday it didn’t need to be.

The data architecture in most enterprises is far from perfect. Throughout an enterprise’s information technology history many different systems have been deployed, ranging from core operational applications to data warehouses and, lately, web front ends.

It’s not that we don’t know how master data management can help, how service oriented architecture (principles) is a must and how important it is to document the data flows within the enterprise. But gee, even for a modest-sized organization this is huge, and even if we strive to do it right, by the time we succeed the real world will have moved on.

Well, back to business. What do we do? I think we will:

  • Make a quick fix that solves the business problem to the delight of the business users
  • Perhaps raise the priority of the sustainable technical solution we planned a while ago

Have a nice day everyone. I think it is going to be just fine.


Data Quality Tools: The Cygnets in Information Quality

Since engaging in the social media community around data and information quality I have noticed quite a lot of mobbing going on, aimed at data quality tools. The sentiment seems to be that data quality tools are no good and will play only a very small role, if any, in solving the data and information quality conundrum.

I like to think of data quality tools as being like the cygnet (the young swan) in the fairy tale “The Ugly Duckling” by Hans Christian Andersen. An immature, clumsy flapper in the barnyard. And sure, until now tools have generally not been ready to fly, but have mostly been situated in the downstream corner of the landscape.

Since last September I have been involved in making a new data quality tool. The tool is based on the principles described in the post Data Quality from the Cloud.

We have now seen the first test flights in the real world, and I am absolutely thrilled about the testimonials. Examples:

  • “It (the tool) is lean”.  I like that since lean is a production practice that considers the expenditure of resources for any goal other than the creation of value for the end customer to be wasteful.
  • “It is gold”. I like to consider that as a calculated positive business case.
  • “It is the best thing happened in my period of employment”. I think happy people are essential to data quality.

Paraphrasing Andersen: I never dreamed there could be so much happiness, when I was working with ugly ducklings.


Out of Facebook

Some while ago it was announced that Facebook signed up member number 500,000,000.

If you are working with customer data management you will know that this doesn’t mean that 500,000,000 distinct individuals are using Facebook. Like any customer table the Facebook member table will suffer from a number of different data quality issues like:

  • Some individuals are signed up more than once using different profiles.
  • Some profiles are not an individual person, but a company or other form of establishment.
  • Some individuals who created a profile are not among us anymore.

Nevertheless, the Facebook member table is a formidable collection of external reference data representing the real-world objects that many companies are trying to master when doing business-to-consumer activities.

For companies doing business-to-business activities, a similar representation of real-world objects is the 70,000,000+ profiles on LinkedIn, plus profiles in other social business networks around the world, which may act as external reference data for the business contacts in master data hubs, CRM systems and so on.
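As a small, hedged illustration of what matching internal contacts against such external reference data could look like, here is a minimal sketch in Python. The record layout, helper function and thresholds are all hypothetical, and real identity resolution would of course use far more robust matching than plain string similarity:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Simple string similarity between 0 and 1."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def match_contact(contact, external_profiles, threshold=0.85):
    """Return the best-matching external profile for an internal contact,
    or None if nothing scores above the threshold."""
    best, best_score = None, 0.0
    for profile in external_profiles:
        name_score = similarity(contact["name"], profile["name"])
        # Require the company names to be reasonably similar as well.
        if name_score > best_score and similarity(contact["company"], profile["company"]) > 0.7:
            best, best_score = profile, name_score
    return best if best_score >= threshold else None

# Hypothetical data
crm_contact = {"name": "Jon Smith", "company": "Acme Corp"}
profiles = [
    {"name": "John Smith", "company": "ACME Corporation"},
    {"name": "Jane Smyth", "company": "Other Ltd"},
]
print(match_contact(crm_contact, profiles))  # matches the John Smith profile
```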

Customer Master Data sources will expand to embrace:

  • Traditional data entry from field work like a sales representative entering prospect and customer master data as part of Sales Force Automation.
  • Data feed and data integration with traditional external reference data like using a business directory. Such integration will increasingly take place in the cloud and the trend of governments releasing public sector data will add tremendously to this activity.
  • Self registration by prospects and customers via webforms.
  • Social media master data captured during social CRM and probably harvested in more and more structured ways as a new wave of exploiting external reference data.

Doing “Social Master Data Management” will become an integrated part of customer master data management offering both opportunities for approaching a “single version of the truth” and some challenges in doing so.

Of course privacy is a big issue. Norms vary between countries, and so do the legal rules. Norms vary between individuals, and for the same individual as a private person and as a business contact. Norms vary between industries and from company to company.

But the fact that 500,000,000 profiles have been created on Facebook in very few years by people from all over the world shows that people are willing to share, and that much information can be collected in the cloud. However, no one wants to be spammed for sharing, and indeed there have been some controversies around how data in Facebook is handled.

Anyway, I have no doubt that we will see fewer data entry clerks entering the same information in each company’s separate customer tables, and that we will increasingly share our own master data attributes in the cloud.


Out-of-Africa

Besides being a memoir by Karen Blixen (or her literary double Isak Dinesen), Out-of-Africa is a hypothesis about the origin of modern humans (Homo sapiens). Of course there is a competing scientific hypothesis called the Multiregional Origin of Modern Humans. Besides that there are, of course, religious beliefs.

The Out-of-Africa hypothesis suggests that modern humans emerged in Africa 150,000 years ago or so. A small group migrated to Eurasia about 60,000 years ago. Some made it across the Bering Strait to America maybe 40,000 years ago, or maybe 15,000 years ago. The Vikings said hello to the Native Americans 1,000 years ago, but cross-Atlantic movement only gained pace from 500 years ago, when Columbus discovered America once again.

Half a year ago (or so) I wrote a blog post called Create Table Homo_Sapiens. The comment follow-up added to the nerdish angle by discussing subjects such as mutating tables versus intelligent design and MAX(GEEK) counting.

But on the serious side, comments also touched on the intended subject of making data models reflect real-world individuals.

Tables with persons are the most common entity type in databases around the world. As in the Out-of-Africa hypothesis, they could all have had a single, globally shared structural origin. But that is not the way of the world. Some of the basic differences practiced in modeling the person entity are:

  • Cultural diversity: Names, addresses, national IDs and other basic attributes are formatted differently country by country, and to some degree within countries. Most data models with a person entity are built on the format(s) of the country where they were designed.
  • Intended purpose of use: Person master data are often stored in tables made for specific purposes like a customer table, a subscriber table, a contact table and so on. Therefore the data identifying the individual is directly linked with attributes describing a specific role of that individual.
  • “Impersonal” use: Person data is often stored in the same table as other party master data types such as business entities, projects, households et cetera.

Many, many data quality struggles around the world are caused by how we have modeled real-world – old world and new world – individuals.
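As a small illustration of the “intended purpose of use” point above, here is a minimal sketch (not from the original Create Table Homo_Sapiens post) of a party-style model where the identity of the individual is kept separate from role-specific attributes; the names and fields are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Party:
    """The real-world individual, independent of any role."""
    party_id: int
    given_name: str
    family_name: str
    country_code: str              # drives country-specific format rules for names, addresses, IDs
    national_id: Optional[str] = None

@dataclass
class CustomerRole:
    """Role-specific attributes, referencing the party instead of copying the person data."""
    party_id: int
    customer_number: str
    segment: str

@dataclass
class SubscriberRole:
    party_id: int
    subscription_plan: str
```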


360° Share of Wallet View

I have found this definition of Share of Wallet on Wikipedia:

Share of Wallet is the percentage (“share”) of a customer’s expenses (“of wallet”) for a product that goes to the firm selling the product. Different firms fight over the share they have of a customer’s wallet, all trying to get as much as possible. Typically, these different firms don’t sell the same but rather ancillary or complementary product.
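Taking that definition literally, the share of wallet figure is simply our revenue from a customer divided by that customer’s total spend in the category. A minimal sketch with hypothetical numbers:

```python
# Hypothetical figures: the customer's total spend in the category, and our revenue from that customer
customers = {
    "C001": {"category_spend": 10_000.0, "our_revenue": 2_500.0},
    "C002": {"category_spend": 4_000.0, "our_revenue": 3_600.0},
}

def share_of_wallet(our_revenue: float, category_spend: float) -> float:
    """Our share of the customer's wallet for the category."""
    return our_revenue / category_spend if category_spend else 0.0

for cid, c in customers.items():
    print(f"{cid}: {share_of_wallet(c['our_revenue'], c['category_spend']):.0%}")
# C001: 25%  C002: 90%
```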

Measuring your share of given wallets – and your performance in increasing it – is a multi-domain master data management exercise as you have to master both a 360° view of customers and a 360° view of products.

With customer master data you are forced to handle uniqueness (consolidate duplicates) of customers and handle hierarchies of customers, which is further explained in the post 360° Business Partner View.

With product master data you are not only forced to categorize your own products and handle the hierarchies within, but you also need to adapt to external categorizations in order to get access to external data on spending, probably at a high level for a segment of customers, but sometimes even down to the single customer.

Location master data may be important here for geographical segmentations and identification.

My educated guess is that companies will increasingly rely on better data quality and master data management processes and infrastructure in order to measure precise shares of wallets, and thereby gain advantages in stiff competition, rather than relying on gut feelings and best guesses.


Linked Data Quality

The concept of linked data within the semantic web is in my eyes a huge opportunity for getting data and information quality improvement done.

The premises for that are described on the page Data Quality 3.0.

Until now data quality has been largely defined as: Fit for purpose of use.

The problem, however, is that most data – not least master data – have multiple uses.

My thesis is that, as more and more purposes are included, there is a break-even point where it becomes less cumbersome to reflect the real-world object than to align fitness for all known purposes.

If we look at the different types of master data and what possibilities may arise from linked data, this is what initially comes to mind:

Location master data

Location data is already among the data types used the most on the web. Linking a hotel, a company, a house for sale and so on to a map is an immediately appealing visual feature for most people. Many databases, however, have poor location data, for example inadequate postal addresses. The demand for making these data “mappable” will become nearly unavoidable, but fortunately services built on linked data will help.

Hopefully increased open government data will help solve the data supply issue here.
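As a small illustration of how such a web-based service can make an address “mappable”, here is a minimal sketch in Python. It assumes the public OpenStreetMap Nominatim search endpoint and its JSON response format; a real implementation would have to respect the service’s usage policy and handle errors and ambiguous matches:

```python
import requests

def geocode(address: str):
    """Try to make a postal address "mappable" by looking it up in a public web service.
    Sketch only: assumes the OpenStreetMap Nominatim search endpoint."""
    response = requests.get(
        "https://nominatim.openstreetmap.org/search",
        params={"q": address, "format": "json", "limit": 1},
        headers={"User-Agent": "linked-data-quality-demo"},
        timeout=10,
    )
    response.raise_for_status()
    results = response.json()
    if not results:
        return None  # the address could not be resolved to a location
    return float(results[0]["lat"]), float(results[0]["lon"])

print(geocode("10 Downing Street, London"))
```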

Party master data

Linking party master data to external data sources is not new at all, but unfortunately not as widespread as it could be. The main obstacle until now has been the lack of smooth integration into business processes.

Having linked data describing real world entities on the web will make this game a whole lot easier.

Actually I’m working on implementations in this field right now.

Product master data

Traditionally the external data sources available for describing product master data have been few – and hard to find. But surely, a lot of data is already out there waiting to be found, categorized, matched and linked.


Going Upstream in the Circle

One of the big trends in data quality improvement is going from downstream cleansing to upstream prevention. So let’s talk about Amazon. No, not the online (book)store, but the river. Also because I am a bit tired of almost any mention of innovative IT being about that eShop.

A map showing the Amazon River drainage basin reveals what may turn out to be a huge challenge in going upstream and solving the data quality issues at the source: there may be a lot of sources. Okay, the Amazon is the world’s largest river (because it carries more water to the sea than any other river), so this may be a picture of the data streams in a very large organization. But even more modest organizations have many sources of data, just as more modest rivers also have several sources.

By the way: The Amazon River also shares a source with the Orinoco River through the natural Casiquiare Canal, just as many organizations also share sources of data.

Some sources are not so easy to reach, the most distant source of the Amazon being a glacial stream on a snowcapped 5,597 m (18,363 ft) peak called Nevado Mismi in the Peruvian Andes.

Now, as I promised that the trend on this blog should be positivity and success in data quality improvement, I will not dwell on the amount of work in going upstream and preventing dirty data from every source.

I say: go to the clouds. The clouds are the sources of the water in the river. Also, I think cloud services will help a lot in making data quality improvement easier, as explained in a recent post called Data Quality from the Cloud.

Finally, the clouds over the Amazon’s sources are made from water evaporated from the Amazon and many other waters as part of the water cycle. In the same way, data has a cycle of being derived as information and created in a new form as a result of the actions taken from using that information.

I think data quality work in the future will embrace the full data cycle: Downstream cleansing, upstream prevention and linking in the cloud.
