Big Data Fitness

A man with one watch knows what time it is, but a man with two watches is never quite sure. This old saying could be modernized: a person with one smart device knows the truth, but a person with two smart devices is never quite sure.

An example from my own life is measuring my daily steps in order to motivate myself to be more fit. Currently I have two data streams coming in: one managed by the Google Fit app and one managed by the S Health app (from Samsung).

This morning, a snapshot taken at the same time in both apps looked like this:

Google Fit:

[screenshot: Google Fit step count]

S Health:

[screenshot: S Health step count]

So, how many steps did I take this morning? 2,047 or 2,413?

The steps are presented on the same device: a smartphone. They are, though, measured on two different devices. Google Fit data are measured on the smartphone itself, while S Health data are measured on a connected smartwatch. Therefore, I may not be wearing these devices in exactly the same way. For example, I am the kind of Luddite who does not bring the phone to the loo.

With the rise of the Internet of Things (IoT) and the expected intensive use of the big data streams coming from all kinds of smart devices, we will face heaps of similar cases, where we have two or more sets of data telling the same story in a different way.

A key to utilizing these data in the best way is to understand what devices they come from and where those devices were. That knowledge is achieved through modern Master Data Management (MDM).
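As a minimal sketch of what that could look like, the snippet below attaches device provenance to each step count so a downstream consumer can apply an explicit policy instead of just averaging the disagreement away. The record layout, the wearing-position field and the assignment of the two counts to the two apps are my own illustrative assumptions, not anything the apps actually expose.

```python
from dataclasses import dataclass

@dataclass
class StepReading:
    """A step count together with its provenance (hypothetical layout)."""
    steps: int
    app: str           # e.g. "Google Fit" or "S Health"
    measured_on: str   # master data about the measuring device
    worn_on: str       # where the device is worn - affects coverage

# This morning's two streams, with device context attached
# (which count belongs to which app is assumed for illustration).
readings = [
    StepReading(steps=2047, app="Google Fit", measured_on="smartphone", worn_on="pocket"),
    StepReading(steps=2413, app="S Health", measured_on="smartwatch", worn_on="wrist"),
]

# With provenance attached we can choose a policy instead of guessing,
# e.g. prefer the device that is worn more consistently during the day:
coverage = {"wrist": 2, "pocket": 1}
best = max(readings, key=lambda r: coverage.get(r.worn_on, 0))
print(f"Preferred count: {best.steps} steps (from {best.app})")
```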

At Product Data Lake we, in all humbleness, support that by sharing data about the product models for smart devices and, in the future, by sharing data about each individual device, as told in the post Adding Things to Product Data Lake.

1st Party, 2nd Party and 3rd Party Master Data

Until now, much of the methodology and technology in the Master Data Management (MDM) world has been about how to optimize the use of what can be called first party master data. This is master data already collected within your organization, and the approaches to MDM and the MDM solutions offered have revolved around federating internal silos and obtaining a single source of truth within the corporate walls.

Besides that, third party data has been around for many years, as described in the post Third-Party Data and MDM. Use of third party data in MDM has mainly been about enriching customer and supplier master data from business directories and, to some degree, utilizing standardized pools of product data in various solutions.

Using third party data for customer and supplier master data seems to be a very good idea, as exemplified in the post Using a Business Entity Identifier from Day One. This is because customer and supplier master data look pretty much the same to every organization. With product master data this is not the case, and that is why third party sources for product master data may not be fully effective.

Second party data is data you get directly from the external source. With customer and supplier master data we see that approach in self-registration services. My recommendation is to combine self-registration and third party data in customer and supplier on-boarding processes. With product master data, leaning mostly on second party connections in business ecosystems seems like the best way forward. There is more on that in a discussion in the LinkedIn MDM – Master Data Management Group.


Using a Business Entity Identifier from Day One

One of the ways to ensure data quality for customer – or rather party – master data when operating in a business-to-business (B2B) environment is to on-board new entries using an externally defined business entity identifier.

By doing that, you tackle some of the most challenging data quality dimensions, such as the following (a sketch of the flow is shown after this list):

  • Uniqueness, by checking whether a business with that identifier already exists in your internal master data. This approach is superior to data matching, as explained in the post The Good, Better and Best Way of Avoiding Duplicates.
  • Accuracy, by having names, addresses and other information defaulted from a business directory and thus avoiding the spelling mistakes that usually riddle party master data.
  • Conformity, by inheriting additional data such as line-of-business codes and descriptions from a business directory.
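Here is a minimal sketch of that on-boarding flow, assuming a hypothetical directory lookup (fetch_directory_record) and an in-memory stand-in for the internal MDM store; it is not the API of any real business directory.

```python
def fetch_directory_record(business_id: str) -> dict:
    """Pretend call to a business directory such as Dun & Bradstreet
    (invented response shape, for illustration only)."""
    return {
        "id": business_id,
        "name": "Example Corp A/S",
        "address": "Sample Street 1, Copenhagen",
        "line_of_business": "47.91 - Retail sale via internet",
    }

def onboard(business_id: str, mdm_store: dict) -> dict:
    # Uniqueness: refuse to create a duplicate of a known identifier.
    if business_id in mdm_store:
        raise ValueError(f"{business_id} already exists - duplicate on-boarding")
    # Accuracy and conformity: default names, addresses and
    # line-of-business codes from the directory record.
    record = fetch_directory_record(business_id)
    mdm_store[business_id] = record
    return record

mdm_store: dict = {}
print(onboard("123456789", mdm_store))   # first on-boarding succeeds
```

The key design point is that the external identifier, not a fuzzy name-and-address match, is the primary key for deciding whether the party is already known.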

Having an external business identifier stored with your party master data helps a lot with maintaining data quality, as pondered in the post Ongoing Data Maintenance.

When selecting an identifier there are different options, such as national IDs, the LEI, the DUNS Number and others, as explained in the post Business Entity Identifiers.

At the Product Data Lake service I am working on right now, we have decided to use an external business identifier from day one. I know this may be something a typical start-up will consider much later, if and when the party master data population has grown. But, besides being optimistic about our service, I think it will be a win not to have to fight data quality issues later at guaranteed higher costs.

For the identifier to use we have chosen the DUNS Number from Dun & Bradstreet. The reason is that this is currently the only business identifier with worldwide coverage. Also, Dun & Bradstreet offers some additional data that fits our business model, including consistent line-of-business information and worldwide company family trees.


Starting up at the age of 56

It is never too late to start up, I have heard. So although I usually brag about having 35+ years of experience at the intersection of business and IT and a huge been-done list in Data Quality and Master Data Management (MDM), which can get me nice consultancy engagements, a certain need in the market has been puzzling me for some time.

Before that, when someone asked me what to do in the MDM space, I told them to create something around sharing master data between organizations. Most MDM solutions are sold to a given organization to cover its internal processes. There are not many solutions out there that cover what is going on between organizations.

But why not do that myself? – with the help of some younger people.

You may have noticed that during the last year I have been writing about something called the Product Data Lake. Until recently this has mostly been a business concept that could be presented on PowerPoint slides: so-called slideware. But now it is becoming real software deployed in the cloud.

Right now a gifted team in Vietnam, where I am this week, is building the solution. We aim to have it ready for the first trial subscribers in August 2016. We will also be exhibiting the solution in London in late September, at the Start-up Alley in the combined Customer Contact, eCommerce and Technology for Marketing exhibition.

At home in Denmark, some young people are also working on our solution as well as on the related launch activities and social media build-up. This includes a LinkedIn company page. For continuing stories about our start-up, please follow the Product Data Lake page on LinkedIn here.


Did You Mean Potato or Potahto?

As told in the post Where the Streets have Two Names, one aspect of address validation is the fact that, in some parts of the world, a given postal address can be presented in more than one language.

I experienced that today when using Google Maps for directions to a Master Data Management (MDM) conference in Helsinki, Finland. When typing in the address I got this message:

[screenshot: Google Maps proposing two versions of the Helsinki address]

The case is that the two addresses proposed by Google Maps are exactly the same address, just spelled in Swedish and Finnish, the two official languages used in this region.

I think Google Maps is an example of a splendid worldwide service. But even the best worldwide services sometimes don’t match locally tailored services. This is, in my experience, the case when it comes to address management solutions such as address validation and assistance, whether they come as an integrated part of a Master Data Management (MDM) solution, a stand-alone data quality tool or a general service like Google Maps.
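As an illustration of what a locally tailored service adds, here is a minimal sketch: a lookup table that maps both official language forms of a street to one canonical address key, so the two spellings do not end up as duplicates. The street name pair is a real Helsinki example, but the keys and the lookup structure are invented for illustration; a real solution would sit on top of the national address registry.

```python
# Illustrative only: both official language forms of the same
# Helsinki street resolve to one canonical address key.
CANONICAL_STREETS = {
    "mannerheimintie": "FI-HEL-0001",   # Finnish form (key is made up)
    "mannerheimvägen": "FI-HEL-0001",   # Swedish form of the same street
}

def canonical_address_key(street_name: str) -> str | None:
    """Return the canonical key for a street name in either language."""
    return CANONICAL_STREETS.get(street_name.strip().lower())

# The two spellings are the same address, not two different ones:
assert canonical_address_key("Mannerheimintie") == canonical_address_key("Mannerheimvägen")
```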

Using a Data Lake for Reference Data

TechTarget has recently published a definition of the term data lake.

In the explanation it is mentioned that the term data lake is being accepted as a way to describe any large data pool in which the schema and data requirements are not defined until the data is queried. The explanation also states that: “While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data.”

A data lake is an approach to handling the well-known big data characteristics of volume, velocity and variety, where the last one, variety, is probably the most difficult to handle with a traditional data warehouse approach.
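To make the schema-on-read idea from the TechTarget definition concrete, here is a minimal sketch: heterogeneous records land in the lake as raw JSON lines with no upfront schema, and the schema requirements are only expressed in the query that reads them. The record shapes and field names are invented for illustration.

```python
import json

# Ingest: dump heterogeneous records as raw JSON lines - no schema enforced.
raw_lake = [
    json.dumps({"type": "sensor", "device": "watch-1", "steps": 2413}),
    json.dumps({"type": "social", "user": "someone", "text": "Hello MDM"}),
]

# Query: the schema requirements are defined only now, at read time.
def query_steps(lake):
    for line in lake:
        record = json.loads(line)
        if record.get("type") == "sensor" and "steps" in record:
            yield record["device"], record["steps"]

print(list(query_steps(raw_lake)))   # [('watch-1', 2413)]
```

This is what lets the lake absorb variety: the social record and the sensor record sit side by side in the same flat store, and each query decides for itself what shape of data it cares about.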

If we look at traditional ways of using data warehouses, these have revolved around storing internal transaction data linked to internal master data. With the rise of big data there will be a shift towards encompassing more and more external data. One kind of external data is reference data: data that typically is born outside a given organization and that serves many different purposes.

Sharing data with the outside world must be a part of your big data approach. This goes for the traditional flavours of big data, such as social data and sensor data, as well as what we may call big reference data: pools of global data and bilateral data, as explained on this blog on the page called Data Quality 3.0. The data lake approach may very well work for big reference data as it may for other flavours of big data.

The BrightTalk community on Big Data and Data Management has a formidable collection of webinars and videos on big data and data management topics. I am looking forward to contributing there on 25th June 2015 with a webinar about Big Reference Data.


Is big data all about analytics?

My answer to the question in the title of this blog post is NO. In my eyes big data is not just data warehouse 3.0. It is also data quality 3.0.

The concept of the data lake is growing in popularity in the big data world, and so is the number of warnings about your data lake becoming a data swamp, a data marsh or a data cesspool. Doing analytic work on a nice data lake sounds great. Doing it in a huge swamp, a large marsh or a giant cesspool does not sound so nice.

In nature a lake stays fresh by having a good upstream supply of water as well as a downstream system. In much the same way, your data lake should not be a closed system or a dump within your organization.

Sharing data with the outside world must be a part of your big data approach. This goes for the traditional flavours of big data, such as social data and sensor data, as well as what we may call big reference data: pools of global data and bilateral data, as explained on this blog on the page called Data Quality 3.0.

The BrightTalk community on Big Data and Data Management has a formidable collection of webinars and videos on big data and data management topics. I am looking forward to contributing there on 25th June 2015 with a webinar about Big Reference Data.
