Five Flavors of Big Data

We often talk about big data as if it were one kind of data, while in fact we need separate approaches to handling, for example, data quality issues with different sorts of big data.


In the following I will go through some different types of big data and share some observations related to data quality.

Social data

The most mentioned type of big data is probably social data, and the opportunity to listen to Twitter streams and Facebook status updates in order to gain better customer insight is an often stated business case for analyzing big data.

However, everyone who listens to that data will be aware of the tremendous data quality problems in doing so, as told in the post Crap, Damned Crap and Big Data.

Sensor data

Another often mentioned type of big data is sensor data. As examined in the post Social Data vs Sensor Data, sensor data are somewhat different from social data, with less complex data quality issues, but they are not at all free of data quality flaws, as reported in the post Going in the Wrong Direction.

Web logs

Following the clicks of people surfing the internet is a third type of big data. Web logs share characteristics with both social data and sensor data: they are human generated like social data, but more fact oriented like sensor data.

Big transaction data

Even traditional transaction data in huge volumes are treated as big data, and they of course inherit the same data quality challenges as all transaction data: even though the data are structured, we may have trouble establishing the right relations to the who, what, where and when of the transactions. And that doesn't get easier with large volumes.

Big reference data

When reference data grow big, we also meet big complexity. Try, for example, to build a reference data set with all the valid postal addresses in the world. Several standardization bodies are having a hard time making a common model for that right now. Learn about other examples of big reference data and the related complexity in the post Big Reference Data Musings.


The Good, Better and Best Way of Avoiding Duplicates

Having duplicates in databases is the most prominent data quality issue around, and not least, duplicates in party master data are often pain number one when assessing the impact of data quality flaws.

A duplicate in the data quality sense is two or more records that do not contain exactly the same characters but refer to the same real-world entity. I have worked with these three different approaches to when the duplicate problem is fixed:

  • Downstream data matching
  • Real time duplicate check
  • Search and mash-up of internal and external data

Downstream Data Matching

The good old way of dealing with duplicates in databases is having data matching engines periodically scan through the databases, highlighting possible duplicates in order to facilitate merge/purge processes.

Finding the duplicates after they have lived their own lives in databases and already have different kinds of transactions attached is indeed not optimal, but sometimes it's the only option, as explained in the post Top 5 Reasons for Downstream Cleansing.

Real Time Duplicate Check

The better way is to make the match at data entry where possible. This approach is often orchestrated as a data entry process where each single element or range of elements is checked when entered. For example, the address may be checked against reference data, and a phone number may be checked for an adequate format for the country in question. Finally, when a properly standardized record is submitted, it is checked whether a possible duplicate exists in the database.
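As a rough illustration of that final duplicate check, here is a small Python sketch; the field names, the simple standardization and the 0.85 threshold are my own simplified assumptions, not a recipe from any particular tool:

```python
import difflib

def normalize(record):
    """Standardize entered elements before matching: uppercase and
    collapse whitespace (illustrative standardization only)."""
    return {k: " ".join(str(v).upper().split()) for k, v in record.items()}

def possible_duplicates(new_record, existing_records, threshold=0.85):
    """Compare a standardized new record against existing records using
    a simple fuzzy name similarity plus an exact postcode check.
    The threshold of 0.85 is an assumption for this sketch."""
    candidate = normalize(new_record)
    hits = []
    for rec in existing_records:
        other = normalize(rec)
        score = difflib.SequenceMatcher(
            None, candidate["name"], other["name"]).ratio()
        if score >= threshold and candidate.get("postcode") == other.get("postcode"):
            hits.append((rec, round(score, 2)))
    return hits

existing = [{"name": "John Smith", "postcode": "SW1A 1AA"}]
print(possible_duplicates({"name": "Jon  Smith", "postcode": "sw1a 1aa"}, existing))
```

In a real data entry process the threshold would of course be tuned per purpose, and the candidate list would come back to the user for confirmation rather than being merged automatically.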

Search and Mash-Up of Internal and External Data

The best way, in my eyes, is a process that avoids re-entering most of the data that already exists in internal databases, and takes advantage of data that already exists on the internet in external reference data sources.


The instant Data Quality concept I currently work with requires the user to enter as little data as possible, for example through rapid address entry, a Google-like search for a name, simply typing a national identification number or, in the worst case, combining some known facts. After that, the system makes a series of fuzzy searches in internal and external databases and presents the results as a compact mash-up.

The advantages are:

  • If the real world entity already exists, you avoid the duplicate and avoid entering data again. You may at the same time evaluate accuracy against external reference data.
  • If the real world entity doesn't exist in internal data, you may pick most of the data from external sources, thereby avoiding typing too much while at the same time ensuring accuracy.


The Relocation Event

When maintaining party master data, one of the challenges is keeping the data about the physical address, and sometimes the physical addresses, of a registered party up to date.

You may learn that your customer, supplier, employee or whatever party you keep on record has moved in many ways. The most common are:

  • The person or organization in question is so kind as to tell you so. For some purposes, for example in the utility sector, this is a future event that triggers a whole workflow of actions.
  • You get the message via a subscription to external reference data, for example using available National Change of Address (NCOA) services and services related to business directories and citizen registries.
  • Your mail to a person or organization is returned by the postal service, often with no information about the new address, which means investigation work ahead.

The capability to handle this important issue in party master data management (MDM), embracing all the scenarios mentioned above, is essential for many enterprises, and doing it on an international scale with the different sources and services available in different countries is indeed a daunting task.

Handling the relocation event is a core functionality in the master data service (iDQ™ MDM Edition) I'm currently working with. There's a lot to do in this quest, so I had better move on.


What Should be Driving Data Quality: Fear or Greed?

Today I attended a nice little event at the British Computer Society. The event was called "Data Surgery" and had sessions combining presentations and discussions around data management. Among the presenters were Julian Schwarzenbach with his beavers and squirrels from the data zoo and Martin "Johari" Doyle of DQ Global discussing data quality.

In the data quality session I attended, the good old subject of selling data quality was touched upon, and not surprisingly the fear factor was mentioned as a way to go.

While I agree that fear of failure in the form of bad reputation and financial loss is a working concept, I have also seen that data quality initiatives based on fear don't stick for long. Similar thoughts were expressed in the Data Quality Pro post called Taking The 'Fear' Factor Out Of Data Quality by Duane Smith. Herein Duane says:

“Selling your data quality initiative based on fear may have a short-term pay back, but I believe it will ultimately fail in the longer term.”

The opposite approach to relying on fear is counting on greed, that is, making better profit by improving data quality. It's a more sustainable way, I think, but predicting ROI from a data quality initiative is indeed very hard, as examined on the blog page called ROI.

So, most often we fear counting on greed and fall back to greeting the fear.


MDM is all about Software Brands

LinkedIn is a great social service for professionals. I often read descriptions of LinkedIn with the sentiment that LinkedIn is a recruitment platform. However, in my opinion LinkedIn is much more than that. To me LinkedIn is more about networking, knowledge sharing, social marketing and social selling.

But that said, recruiters are certainly very active on LinkedIn. I guess it happens to me every week that I'm contacted on LinkedIn by a recruiter with an MDM (Master Data Management) job.

The opening is practically always like this:

“We are looking for a candidate with experience with <brand>….”, where <brand> is Informatica, Oracle, IBM, SAP and other well known brands in the MDM sphere.

As I don’t guess the recruiters make up the top requirement themselves, this number one requirement probably comes from the hiring organization. So to users of MDM, MDM is all about the software brand. Never mind people and processes. That’s easy. Technology is the hard part, not at least mastering the master data technology that was bought after a thorough selection process.


Matching for Multiple Purposes

A recent post on the InfoTrellis blog takes on the good old question in data matching of Deterministic Matching versus Probabilistic Matching.

The post has a good walk-through of the topic and reaches this conclusion:

“So, which is better, Deterministic Matching or Probabilistic Matching?  The question should actually be: ‘Which is better for you, for your specific needs?’  Your specific needs may even call for a combination of the two methodologies instead of going purely with one.”
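To illustrate the distinction, here is a minimal Python sketch of the two methodologies; the fields, the weights and the threshold are illustrative assumptions of mine, not the approach of any particular matching engine:

```python
import difflib

def deterministic_match(a, b):
    """Deterministic: exact agreement on key fields. Fast and
    predictable, but misses spelling variants."""
    return a["name"].upper() == b["name"].upper() and a["dob"] == b["dob"]

def probabilistic_match(a, b, threshold=0.8):
    """Probabilistic: a weighted similarity score. Catches variants at
    the cost of tuning; weights and threshold here are illustrative."""
    name_score = difflib.SequenceMatcher(
        None, a["name"].upper(), b["name"].upper()).ratio()
    dob_score = 1.0 if a["dob"] == b["dob"] else 0.0
    return 0.7 * name_score + 0.3 * dob_score >= threshold

a = {"name": "Marian Itorralba", "dob": "1980-01-01"}
b = {"name": "Mariana Torralba", "dob": "1980-01-01"}
print(deterministic_match(a, b))   # the exact comparison rejects the variant
print(probabilistic_match(a, b))   # the fuzzy comparison accepts it
```

A combination, as the quoted conclusion suggests, would typically run the deterministic rules first and only invoke the probabilistic scoring for the remaining candidates.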

On a side note, the author of the post is MARIANITORRALBA. I had to use my combined probabilistic and deterministic, in-word-parsing-supported and social-media-connected data matching capability to match this concatenated name with the LinkedIn profile of an InfoTrellis employee called Marian Itorralba.

This little exercise brings me to an observation about data matching: matching party master data, not least when you do it for several purposes, ultimately is identity resolution, as discussed in the post The New Year in Identity Resolution.

For that we need what could be called hierarchical data matching.

The reason we need hierarchical data matching is that as more and more organizations look into master data management, they realize that the classic name and address matching rules do not necessarily fit when party master data are going to be used for multiple purposes. What constitutes a duplicate in one context, like sending a direct mail, doesn't necessarily make a duplicate in another business function, and vice versa. Duplicates come in hierarchies.

One example is a household. You probably don't want to send two sets of the same material to a household, but you might want to engage in a 1-to-1 dialogue with the individual members. Another example is that you might do some very different kinds of business with the same legal entity: financial risk management sees the legal entity as one, but different sales or purchase processes may require very different views.
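The household example can be pictured with a small Python sketch, where the match key encodes what counts as "the same party" for a given purpose; the surname-plus-address household key is of course a simplification for illustration:

```python
from collections import defaultdict

def group_by(records, key_func):
    """Group party records under a match key; the key function encodes
    what constitutes a duplicate for a given business purpose."""
    groups = defaultdict(list)
    for rec in records:
        groups[key_func(rec)].append(rec)
    return dict(groups)

records = [
    {"name": "Anna Jensen", "surname": "Jensen", "address": "Main Grove 1"},
    {"name": "Bo Jensen",   "surname": "Jensen", "address": "Main Grove 1"},
]

# Direct-mail purpose: one record per household (simplified household key)
households = group_by(records, lambda r: (r["surname"], r["address"]))

# 1-to-1 dialogue purpose: each individual is distinct
individuals = group_by(records, lambda r: (r["name"], r["address"]))

print(len(households), len(individuals))  # 1 2
```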

This matter is discussed in the post called Hierarchical Data Matching, and not least in its comments.


The Postal Address Hierarchy

Using postal addresses is a core element in many data quality improvement and master data management (MDM) activities.

As touched upon many times on this blog, postal addresses are formatted very differently around the world. However, they may all be arranged in a sort of hierarchy with up to six general levels:

  • Country
  • Region
  • City or district
  • Thoroughfare (street) or block
  • Building number
  • Unit within building

In addition to that, the postal code (postcode or zip code) is part of many address formats. Seen in this hierarchical light, the postal code is a tricky concept, as it may identify a city, a district, a thoroughfare, a single building or even a given unit within or section of a building. The latter is true for my company address in the United Kingdom, which has a very granular postcode system.
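A simple way to picture the hierarchy is as a record type with optional levels; this Python sketch is an illustrative model of the levels described above, not a standard address format:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PostalAddress:
    """The six general levels of the postal address hierarchy; optional
    fields reflect that not every format uses every level."""
    country: str
    region: Optional[str] = None           # e.g. a US state; absent in German addresses
    city_or_district: Optional[str] = None
    thoroughfare_or_block: Optional[str] = None
    building_number: Optional[str] = None  # kept as text: "21 A", "21-23"
    unit: Optional[str] = None
    postcode: Optional[str] = None         # may identify anything from a district to a unit

uk = PostalAddress(country="GB", city_or_district="London",
                   thoroughfare_or_block="Main Grove", building_number="1",
                   postcode="SW1A 1AA")
de = PostalAddress(country="DE", city_or_district="Berlin",
                   thoroughfare_or_block="Hauptstraße", building_number="21",
                   postcode="10115")
print(uk.region, de.region)  # both legitimately empty in these formats
```

Note that the building number is modelled as text precisely because of the formats discussed further down.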

Country

As discussed in the post The Country List even the top level of a postal address hierarchy isn’t a simple list fit for every purpose. Some issues are:

  • Different sources have different perceptions of which countries there are on this planet
  • What we regard as countries comes in hierarchies
  • Several coding systems are available

Region

The region is an element in some address formats, like the states in the United States and the provinces in Canada, while other countries, like Germany, which is divided into quite independent Länder, do not have the region as a required part of the postal address. The same goes for the Swiss cantons.

City or district

I once read that if you used the label city in a web form in Australia, you would get a lot of values like: “I do not live in a city”.

Anyway, this level is often (but, as mentioned, certainly not always) where the postal code is applied. A postal code district may be a single town with its surroundings, several villages or a district within a big city.

Thoroughfare (street) or block

Most countries use thoroughfares: streets, roads, lanes, avenues, mews, boulevards and whatever else they are called around the world. Beware that the same street may have several spellings and even several names.

Japan is a counterexample of the use of thoroughfares, as here it’s the blocks between the thoroughfares that are part of the postal address.

Building number

Usually this element will be an integer. However formats with a letter behind the integer (example: 21 A) or a range of integers (example: 21-23) are most annoying. And then this British classic: One Main Grove. OMG.
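A hedged Python sketch of recognizing those variants follows; the pattern handles plain integers, the letter suffix and the range, while written-out numbers like "One" would still need a lookup table:

```python
import re

# Recognizes "21", "21 A" and "21-23"; anything else (like "One") is
# left for manual investigation. Illustrative pattern, not exhaustive.
BUILDING_NUMBER = re.compile(r"^(\d+)\s*(?:([A-Za-z])|-\s*(\d+))?$")

def parse_building_number(text):
    m = BUILDING_NUMBER.match(text.strip())
    if not m:
        return None  # e.g. "One" in "One Main Grove"
    number, letter, range_end = m.groups()
    return {"number": int(number),
            "letter": letter.upper() if letter else None,
            "range_end": int(range_end) if range_end else None}

print(parse_building_number("21 A"))   # {'number': 21, 'letter': 'A', 'range_end': None}
print(parse_building_number("21-23"))  # {'number': 21, 'letter': None, 'range_end': 23}
print(parse_building_number("One"))    # None
```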

Unit within a building

This element may or may not be present in a postal address, depending on whether the building is a single-family house or single-company site, whether the postal delivery service sees it as such, or whether you actually indicate where within the building the delivery, or you, should go. The ups and downs of this level are examined in the post A Universal Challenge.


Think global from day one

The title of this post is taken from a blog post by Hans Peter Bech. The post is called Entering a Foreign Market – The 9 Steps to Success for Software Companies.


In the post Hans Peter says:

“German software companies having access to 7% of world demand and US based companies with a domestic market representing 38% of world demand often ignore the global perspective until forced to face the challenge. That’s very fortunate for the smaller companies from the smaller countries!”

This observation from the software market in general certainly also applies to software for data quality improvement and master data management as examined in the post 255 Reasons for Data Quality Diversity.

If you are a software company in the data management space, thinking global may apply to various activities, such as:

  • How the product is designed in respect to handling data from all over the world. Here thinking global from day one is crucial.
  • How the product is marketed to a world-wide audience. Here the global approach could wait a bit.

On the latter matter I have teased one of the magic quadrant data quality tool vendors, Trillium Software, for having used a date format only used in the United States on their blog. Maybe it's a small matter and it's just me who is sensitive to this common glitch. Anyway, I'm pleased to congratulate Trillium Software on their new blog design with a date format that fits a world-wide audience. Check out the blog, which is a good one indeed, here.
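For what it's worth, the fix is usually as simple as preferring the unambiguous ISO 8601 format over a locale-specific one, as this little Python sketch shows:

```python
from datetime import date

d = date(2013, 9, 5)

# US-style formatting is ambiguous to a world-wide audience:
us_style = d.strftime("%m/%d/%Y")  # "09/05/2013" - 9 May or 5 September?

# ISO 8601 is unambiguous everywhere:
iso_style = d.isoformat()          # "2013-09-05"

print(us_style, iso_style)
```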


Know Your Fan

A variant of the saying "Know Your Customer" for a football club would be "Know Your Fan", and indeed fans are customers when they buy tickets. If they can.


FC Copenhagen cruised into stormy waters when they apparently cancelled all ticket purchases for the upcoming Champions League (European soccer's paramount club tournament) clashes against Real Madrid, Juventus and Galatasaray if the purchasers didn't have a Danish-sounding name. The reason was to prevent mixing fans of the different clubs, but surely this poorly thought-out screening method wasn't well received among the FC Copenhagen fans not called Jensen, Nielsen or Sørensen.

The story is told in English here on the Times of India.

Actually methods of verifying identities are available and cheap in Denmark so I’m surprised to see FC Copenhagen caught offside in this situation.


Time to Turn Your Product Master Data Management Social?

Yesterday’s post on this blog had the title Time To Turn Your Customer Master Data Management Social? In a true Multi-Domain MDM spirit it is of course also timely to ask if it is time to turn your product master data management social.

Here are a few ways to go when thinking social into product master data management:

Making product data lively

Kimmo Kontra had a blog post called With Tiger’s clubs, you’ll golf better – and what it means to Product Information Management. Herein Kimmo examined how stories around products help with selling products. Kimmo concluded that within master data management there is going to be a need for storing and managing stories.

So while traditional product master data management is about having the right hard facts about products consistent across multiple channels, and having the right images and other rich media consistent as well, in the social era you will also need to include the right and consistent stories when the multiple channels embrace social media.

Sharing product data

How do we ensure that we share the same product information, including the same stories, across the ecosystem of product manufacturers, distributors, retailers and end users?

Recently I have been following a new cloud service called Actualog. Actualog is aiming at doing exactly that, with emphasis on the daunting task of sharing product data in an international environment with different measurement systems, languages, alphabets and script systems.

Listening to big data

As discussed in the post Big Data and Multi-Domain Master Data Management, a prerequisite for making sense of analyzing social data (and other big data sources) is that you not only have a consistent view of the product data related to the products you sell yourself, but also a consistent view of competing products and how they relate to your products.

Therefore, social product master data management requires you to extend the range of products handled by your product information management solution, probably in alternate product hierarchies.
