Marco Polo and Data Provenance

Besides being a data geek, I am also interested in pre-modern history, so it’s always nice when I can combine data management and history.

A recurring subject among historians is the suspicion that the explorer Marco Polo never actually went to China.

As an article in The Telegraph puts it: “It is more likely that the Venetian merchant adventurer picked up second-hand stories of China, Japan and the Mongol Empire from Persian merchants whom he met on the shores of the Black Sea – thousands of miles short of the Orient”.

When working to improve data quality, a frequent challenge is that some data wasn’t captured by the data consumer, or even by the organization using the data. Some of the data stored in company databases are second-hand data, and in some cases the overwhelming majority of the data is captured outside the organization.

As with “Description of the World”, the book about Marco Polo’s (alleged) travels, this doesn’t mean that you can’t trust anything. But maybe some data are mixed up a bit, and maybe some obvious data are missing.
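
To make the idea concrete, here is a minimal sketch in Python, assuming each stored record carries a note of who supplied it and whether the organization itself first entered it. The `Record` class, its field names, and the sample values are hypothetical illustrations and are not taken from any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """A stored value plus a note of where it came from (hypothetical fields)."""
    value: str
    source: str               # who supplied the value to us
    captured_in_house: bool   # True if our own organization first entered it


def is_second_hand(record: Record) -> bool:
    """A record is second-hand when it was first entered outside the organization."""
    return not record.captured_in_house


# Made-up sample records, for illustration only.
records = [
    Record("Venice", source="own field notes", captured_in_house=True),
    Record("Japan", source="Persian merchants", captured_in_house=False),
]

# Second-hand values are not discarded, only flagged for a closer look.
to_review = [r.value for r in records if is_second_hand(r)]
print(to_review)
```

Keeping the source next to the value means that second-hand data can still be used, but anyone checking it later knows which records deserve a second look.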

I have touched on this subject earlier in the post Outside Your Jurisdiction, where I identified second-hand data as one of the Top 5 Reasons for Downstream Cleansing.

2 thoughts on “Marco Polo and Data Provenance”

  1. John Owens Dunedin 10th August 2011 / 09:03

    Hi Henrik

    My assertion is that ALL data within an enterprise is “captured” by the enterprise!

    While it may be true that all data might not originate within an enterprise, as soon as an enterprise decides to bring that data in-house, it is in effect “capturing” it and, as such, must ensure that it has in place all of the necessary means of assuring the quality of that data.

    If an enterprise is collecting questionable data from external sources it would be totally irresponsible to import that data into its operational applications.

    Any data that does not meet the data quality requirements of the enterprise must be cleansed and sanitised before it is deemed fit for use. If it cannot be, then it should be dumped.

    It does not matter where the data comes from, the enterprise is responsible for ensuring that it meets all data quality requirements before it “captures” it. If it does not, it should set it free!


    • Henrik Liliendahl Sørensen 10th August 2011 / 09:33

      Thanks John. You are right, maybe “entered” is a better word than “captured” for describing where data was originally born.

      I also had a current engagement in mind, where we are setting up rules and automated processes for cleansing large volumes of incoming product master data. The aim is to make the data fit for purpose within the organization that wants to store the data and expose it to online customers.
