Small Data with Big Impact

There is an ongoing discussion on LinkedIn with some good points on the question: How important is data quality for big data compared to data quality for small data?

A repeated sentiment in the comments is that data quality for small data is going to be more important with the rise of big data.

The small data we are talking about here is first and foremost master data.

Master Data Challenges with Big Data

As with traditional transaction data, master data also describes the who, what, where and when of big data.

If we have issues with completeness, timeliness and uniqueness in our master data, any prediction based on big data matched with master data is going to be as chaotic as a weather forecast.

We also need to expand the range of entities embraced by our master data management implementations, as exemplified in the post Social MDM and Future Competitive Intelligence.

Matching Big Data with Master Data

Some of the issues I have stumbled upon in matching big data with master data are:

  • Who: How do we link the real world entities reflected in our traditional systems of record with the real world entities behind who is talking in systems of engagement? This question was touched upon in the post Making Sense with Social MDM.
  • What: How do we manage our product hierarchies and product descriptions so they fulfill both (different) internal purposes and external usage? More on this in the post Social PIM.
  • Where: How do we identify a given place? If you think this is easy, why not read the post Where is the Spot?
  • When: Date and time come in many formats, and relating events to the wrong schedule may have us Going in the Wrong Direction (a small normalization sketch follows below).

How: You may for example follow this blog. Subscription is in the upper right corner 🙂
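
On the when challenge: here is a minimal sketch, in Python and with a made-up list of formats, of how timestamps arriving in different shapes could be normalized before events are related to a schedule:

```python
from datetime import datetime

# Candidate formats seen in different sources (an illustrative list, not exhaustive).
KNOWN_FORMATS = [
    "%Y-%m-%dT%H:%M:%S",   # ISO-like: 2011-06-21T14:30:00
    "%d-%m-%Y %H:%M",      # European: 21-06-2011 14:30
    "%m/%d/%Y %I:%M %p",   # US: 06/21/2011 02:30 PM
]

def normalize_timestamp(raw: str) -> datetime:
    """Try each known format in turn; raise if none of them fits."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw, fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date/time format: {raw!r}")

print(normalize_timestamp("21-06-2011 14:30"))  # -> 2011-06-21 14:30:00
```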


Timeliness of Times

One of my several current engagements is within public transit.

I have earlier written about Real World Alignment issues in public transit (in my culture) as well as the special Multi-Entity Master Data Quality challenges there are in this specific industry.

Usually we talk about party master data and product master data as the most common domains of master data, and sometimes we add places (locations) as the third domain in a P trinity of “parties, products and places” or perhaps a W trinity of “who, what and where”.

The when dimension, the times when events take place, is most often seen as belonging to the transaction side of life in the databases.

However, in public transit you certainly also have timetables as an important master data domain. The service provided by a public transit authority or operator is described as belonging to a certain timeframe in which a given combination of services is valid. An example is the “Summer Schedule 2011”.
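
A minimal sketch of how such a timetable version could be modelled as master data might look like the following Python snippet (the class, field names and dates are my own, purely for illustration):

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Timetable:
    """A timetable version treated as master data."""
    name: str        # e.g. "Summer Schedule 2011"
    valid_from: date
    valid_to: date

def timetable_for(day: date, versions: list[Timetable]) -> Optional[Timetable]:
    """Return the timetable version valid on a given day, if any."""
    for t in versions:
        if t.valid_from <= day <= t.valid_to:
            return t
    return None

versions = [
    Timetable("Winter Schedule 2010/11", date(2010, 12, 12), date(2011, 6, 18)),
    Timetable("Summer Schedule 2011",    date(2011, 6, 19),  date(2011, 12, 10)),
]
print(timetable_for(date(2011, 7, 1), versions).name)  # -> Summer Schedule 2011
```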

Another industry with a time-dependent master data domain I have seen is education, where the services provided (lessons) usually are described as belonging to a semester.

I wonder if you have met other master data types that belong more to the “when” domain than to the “who, what and where” domains? Have you had any problems with the timeliness of times?


Multi-Entity Master Data Quality

Master data consists of the core entities that describe the ongoing activities in an organization:

  • Business partners (who)
  • Products (what)
  • Locations (where)
  • Timetables (when)

Many Master Data Management and Data Quality initiatives are at first focused only on a single entity type, but sooner or later you are faced with dealing with all entity types and the data quality issues that arise from combining data from each of them.

In my experience business partner data quality issues are in many ways similar across all industry verticals, while product master data challenges may differ in many ways when comparing companies in various industry verticals. The importance of location data quality varies greatly, and so do the questions about timetable data quality.

A journey in a multi-entity master data world

My latest experience in multi-entity master data quality comes from public transportation.

The most frequent business partner role here is of course the passenger. By the way (so to speak): a passenger may be a direct customer, but the payer may also be someone else. Whether the passenger is defined as a customer or not doesn’t really change the need for data quality, though; regardless, you will have to solve problems with uniqueness and real world alignment.

The product sold to a passenger is in the first place a travel document like a single ticket or an electronic card holding a season pass. But the service of value to the passenger is a ride from point A to point B, which in many cases is delivered as a trip consisting of a series of rides from point A via point C (and D…) to point B. Having consistent hierarchies in reference data is a must when making data fit for multiple purposes of use in disciplines such as fare collection, scheduling and so on.
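
To make the trip/ride hierarchy concrete, here is a minimal sketch in Python (the class names and the consistency rule are my own simplification, not a real fare collection model):

```python
from dataclasses import dataclass

@dataclass
class Ride:
    """One leg of a journey, e.g. from point A to point C."""
    origin: str
    destination: str

@dataclass
class Trip:
    """A journey sold as one service but delivered as a series of rides."""
    rides: list[Ride]

    def is_consistent(self) -> bool:
        """Each ride must start where the previous one ended."""
        return all(
            prev.destination == nxt.origin
            for prev, nxt in zip(self.rides, self.rides[1:])
        )

trip = Trip([Ride("A", "C"), Ride("C", "D"), Ride("D", "B")])
print(trip.is_consistent())  # -> True
```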

Locations are mainly stop points including those at the start and end of the rides. These are identified both by a name and by geocoding – either as latitude and longitude on a round globe or by coordinates in a flat representation suitable for a map (on a screen). The distance between stops is important for grouping stops in areas suitable for interchange, e.g. bus stops on each side of a road or bus stops and platforms at a rail station. Working with the precision dimension of data quality is a key to accuracy here.
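
As an illustration of working with distance and precision, here is a small Python sketch (the coordinates and the 100 metre threshold are made up for the example) for deciding whether two stops could be grouped into one interchange area:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two points on a round globe."""
    r = 6_371_000  # mean Earth radius in metres
    p1, p2 = radians(lat1), radians(lat2)
    dp, dl = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dp / 2) ** 2 + cos(p1) * cos(p2) * sin(dl / 2) ** 2
    return 2 * r * asin(sqrt(a))

# Two hypothetical bus stops on each side of a road.
stop_a = (55.6761, 12.5683)
stop_b = (55.6767, 12.5690)

if haversine_m(*stop_a, *stop_b) < 100:  # arbitrary interchange radius
    print("Group these stops into one interchange area")
```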

Timetables change over time. It is essential to keep track of timetable validity in offline flyers, websites with passenger information, back office systems and on-board bus computers. Timeliness is as ever vital here.

Matching the transactions made by drivers and passengers on numerous on-board computers, by employees in back office systems and coming from external sources correctly with the various master data entities that describe them is paramount for an effective daily operation, and it is the foundation for exploiting the data in order to make the right decisions about future services.


Double Falshood

Always remember to include Shakespeare in a blog, right?

Now, it is actually disputable whether Shakespeare has anything to do with the title of this blog post. Double Falshood is the (first part of the) title of a play claimed to be based on a lost play by Shakespeare (and someone else). The only fact that seems to be true in this story is that the plot of the play(s) is based on an episode in Don Quixote by Cervantes. “The Ingenious Hidalgo Don Quixote of La Mancha”, which is the full name of the novel, is probably best known for the attack on the windmills by Don Quijote (the Spanish version of the name).

All this confusion about sorting out who, what, when and where, and the feeling of tilting at windmills, seems familiar from the daily work of trying to fix master data quality.

And indeed “double falsehood” may be a good term for the classic challenge in the data quality discipline of deduplication, which is avoiding false positives and false negatives at the same time.
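
As a toy illustration (the names, the similarity measure and the threshold are all made up for the example), a single match threshold can easily let both kinds of error through at once:

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]; a stand-in for a real matching engine."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

THRESHOLD = 0.90  # an arbitrary cut-off for "same party"

# Hypothetical pairs: the first could be two different persons,
# the second could be one person spelled in two ways.
pairs = [
    ("John Smith", "Jon Smith"),
    ("Katherine Johansen", "Cathrine Johansson"),
]

for a, b in pairs:
    score = name_similarity(a, b)
    verdict = "match" if score >= THRESHOLD else "no match"
    print(f"{a!r} vs {b!r}: {score:.2f} -> {verdict}")
# The first pair may score above the threshold (a false positive),
# while the second may score below it (a false negative).
```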

Now, back to work.

Master Data Quality: The When Dimension

Often we use the who, what and where terms when defining master data as opposed to transaction data, like saying:

  • Transaction data accurately identifies who, what, where and when and
  • Master data accurately describes who, what and where

Who is easily related to our business partners, what to the products we sell, buy and use, and where to the locations of the events.

In some industries when is also easily related to master data entities, like a timetable valid for a given period in public transportation. A fiscal year in financial reporting also belongs to the when side of things.

But when is also a factor in improving data quality and preventing data quality issues related to our business partners, products, locations and their assigned categories, because the descriptions of these entities do change over time.

This fact is known as “slowly changing dimensions” when building data warehouses and attempting to make sense of data with business intelligence.

But the “when” dimension also matters in matching, deduplication and identity resolution. Having data with the finest actuality doesn’t necessarily lead to a good match, as you may be comparing with data that doesn’t have the same actuality. Here history tracking is a solution: storing former names, addresses, phone numbers, e-mail addresses, descriptions, roles and relations.
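
A minimal sketch of such history tracking in matching might look like this (the model, names and dates are made up; a real matching environment would of course be far richer):

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class NameVersion:
    """One historical version of a party name."""
    name: str
    valid_from: date
    valid_to: Optional[date]  # None means "still current"

def name_at(history: list[NameVersion], as_of: date) -> Optional[str]:
    """Return the name that was valid at a given point in time."""
    for v in history:
        if v.valid_from <= as_of and (v.valid_to is None or as_of <= v.valid_to):
            return v.name
    return None

history = [
    NameVersion("Jane Miller", date(2005, 1, 1), date(2009, 6, 30)),
    NameVersion("Jane Olsen",  date(2009, 7, 1), None),
]

# A record captured in 2008 should be compared with "Jane Miller",
# not with the most current name.
print(name_at(history, date(2008, 3, 15)))  # -> Jane Miller
```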

Such complexity is often not handled in the master data containers around – and even less so in matching environments.

My guess is that the future will bring publicly accessible reference data in the cloud, describing our master data entities with a rich complexity including the when – the time – dimension, and capable matching environments around it.
