Data Warehouse vs Data Lake, Take 2

The differences between a data warehouse and a data lake have been discussed a lot, for example here and here.

To summarize, the main point in my eyes is this: in a data warehouse, the purpose and structure of the data are determined before the data is uploaded, while in a data lake the purpose and structure can be determined before the data is downloaded. As a result, a data warehouse is characterized by rigidity and a data lake by agility.

Agility is a good thing, but of course you have to put some control on top of it, as reported in the post Putting Context into Data Lakes.

Furthermore, there are some great opportunities in extending the use of the data lake concept beyond the traditional use of a data warehouse. You should think beyond using a data lake within a given organization and envision how you can share a data lake within your business ecosystem. Moreover, you should consider not only using the data lake for analytical purposes but also embarking on a mission to utilize a data lake for operational purposes.

The venture I am working on right now has this second take on a data lake. The Product Data Lake exists in the context of sharing product information between trading partners in an agile and process-driven way. The providers of product information, typically manufacturers and upstream distributors, upload product information according to the data management maturity level of their organization. This information may very well, for now, be stored according to traditional data warehouse principles. The receivers of product information, typically downstream distributors and retailers, download product information according to the data management maturity level of their organization. This information may very well, for now, end up in a data store organized by traditional data warehouse principles.

The other approaches I have seen for sharing product information between trading partners are built on having a data-warehouse-like solution between the trading partners, with a high degree of consensus around purpose and structure. Such solutions are, in my eyes, only successful when narrowly restricted to a given industry, probably within a given geography and for a given span of time.

By utilizing the data lake concept in the exchange zone between trading partners you can share information according to your own pace of maturing in data management and take advantage of data sharing where it fits in your roadmap to digitalization. The business ecosystems where you participate are great sources of data for both analytical and operational purposes and we cannot wait until everyone agrees on the same purpose and structure. It only takes two to start the tango.

Connecting Product Information

In our current work with the Product Data Lake cloud service, we are introducing a new way to connect product information that is stored at two different trading partners.

When doing that we deal with three kinds of product attributes:

  • Product identification attributes
  • Product classification attributes
  • Product features

Product identification attributes

The most commonly used product identification attribute today is the GTIN (Global Trade Item Number). This numbering system has developed from the UPC (Universal Product Code), which is most popular in North America, and the EAN (International Article Number, formerly European Article Number).

Besides this generally used system, there are heaps of industry-specific and geography-specific product identification systems.

In principle, every product in a given product data store should have a unique value in a product identification attribute.

When identifying products in practice, attributes such as the model number at a given manufacturer and a product description are used too.

Product classification attributes

A product classification attribute says something about what kind of product we are talking about. Thus, a range of products in a given product data store will have the same value in a product classification attribute.

As with product identification, there is no single commonly used standard. Some popular cross-industry classification standards are UNSPSC (United Nations Standard Products and Services Code®) and eCl@ss, but many other standards exist too, as described in the post The World of Reference Data.

Besides the variety of standards, a further complexity is that these standards are published in versions over time. Even if two trading partners use the same standard, they may not use the same version, and they may have used various versions depending on when the product was on-boarded.

Product features

A product feature says something about a specific characteristic of a given product. Examples are general characteristics such as height, weight and colour, and specific characteristics within a given product classification, such as voltage for a power tool.

Again, there are competing standards for how to define, name and identify a given feature.

The Product Data Lake tagging approach

In the Product Data Lake we use a tagging system to typify product attributes. This tagging system helps with:

  • Linking products stored at two trading partners
  • Linking attributes used at two trading partners

A product identification attribute can be tagged starting with = followed by the system and optionally the variant of the system used. Examples are ‘=GTIN’ for a Global Trade Item Number and ‘=GTIN-EAN13’ for a 13-digit EAN number. An industry- and geography-specific tag could be ‘=DKVVS’ for a Danish plumbing catalogue number (VVS nummer). ‘=MODEL’ is the tag of a model number and ‘=DESCRIPTION’ is the tag of the product description.

A product classification tag starts with a #. ‘#UNSPSC’ is for a United Nations Standard Products and Services Code, where ‘#UNSPSC-19’ indicates a given main version.

A product feature is tagged with the feature id, an @ and the feature (sometimes called property) standard. ‘EF123456@ETIM’ will be a specific feature in ETIM (an international standard for technical products). ‘ABC123@ECLASS’ is a reference to a property in eCl@ss.
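To make the three tag shapes concrete, here is a minimal Python sketch of how such tags could be parsed. It is only an illustration inferred from the examples above and is not the actual Product Data Lake implementation. The names `Tag` and `parse_tag` and the allowed character set (upper-case letters and digits) are assumptions.

```python
import re
from typing import NamedTuple, Optional


class Tag(NamedTuple):
    kind: str                 # "identification", "classification" or "feature"
    system: str               # e.g. "GTIN", "UNSPSC", "ETIM"
    qualifier: Optional[str]  # variant, version or feature id, if any


# Patterns inferred from the examples in this post only (an assumption).
_ID_OR_CLASS = re.compile(r"^([=#])([A-Z0-9]+)(?:-([A-Z0-9]+))?$")
_FEATURE = re.compile(r"^([A-Z0-9]+)@([A-Z0-9]+)$")


def parse_tag(tag: str) -> Tag:
    """Split a tag into its kind, system and optional qualifier."""
    m = _ID_OR_CLASS.match(tag)
    if m:
        kind = "identification" if m.group(1) == "=" else "classification"
        return Tag(kind, m.group(2), m.group(3))
    m = _FEATURE.match(tag)
    if m:
        # For features the id comes first and the standard after the @.
        return Tag("feature", m.group(2), m.group(1))
    raise ValueError(f"Unrecognized tag: {tag!r}")
```

With this sketch, two attributes at two trading partners could be considered candidates for linking when their parsed tags share the same kind and system, for example ‘=GTIN’ at one partner and ‘=GTIN-EAN13’ at the other.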

Putting Context into Data Lakes

The term data lake has become popular along with the rise of big data. A data lake is a new way of storing data that is more agile than what we have been used to in data warehouses. This is mainly based on the principle that you should not have to think through every way of consuming data before storing the data.

This agility is also the main reason for fear around data lakes. A possible lack of control and standardization leads to warnings that a data lake will quickly develop into a data swamp.

In my eyes, we need solutions built on the data lake concept if we want business agility – and we do want that. But I also believe that we need to put data in data lakes in context.

Fortunately, there are many examples of movements in that direction. A recent article called The Informed Data Lake: Beyond Metadata by Neil Raden has a lot of good arguments around a better context driven approach to data lakes.

As reported in the post Multi-Domain MDM 360 and an Intelligent Data Lake, the data management vendor Informatica is on that track too.

In all humility, my vision for data lakes is that a context-driven data lake can serve purposes beyond analytical use within a single company and become a driver for business agility within business ecosystems like cross-company supply chains, as expressed in the LinkedIn Pulse post called Data Lakes in Business Ecosystems.

A Quick Tour around the Product Data Lake

The Product Data Lake is a cloud service for sharing product data in the ecosystems of manufacturers, distributors, retailers and end users of product information.

As an upstream provider of product data, being a manufacturer or upstream distributor, you have these requirements:

  • When you introduce new products to the market, you want to make the related product data and digital assets available to your downstream partners in a uniform way
  • When you win a new downstream partner you want the means to immediately and professionally provide product data and digital assets for the agreed range
  • When you add new products to an existing agreement with a downstream partner, you want to be able to provide product data and digital assets instantly and effortlessly
  • When you update your product data and related digital assets, you want a fast and seamless way of pushing it to your downstream partners
  • When you introduce a new product data attribute or digital asset type, you want a fast and seamless way of pushing it to your downstream partners.

The Product Data Lake facilitates these requirements by letting you push your product data into the lake in your in-house structure, which may or may not be fully or partly compliant with an international standard.

As an upstream provider, you may want to push product data and digital assets from several different internal sources.

The Product Data Lake tackles this requirement by letting you operate several upload profiles.

As a downstream receiver of product data, being a downstream distributor, retailer or end user, you have these requirements:

  • When you engage with a new upstream partner you want the means to quickly and seamlessly link and transform product data and digital assets for the agreed range from the upstream partner
  • When you add new products to an existing agreement with an upstream partner, you want to be able to link and transform product data and digital assets in a fast and seamless way
  • When your upstream partners update their product data and related digital assets, you want to be able to receive the updated product data and digital assets instantly and effortlessly
  • When you introduce a new product data attribute or digital asset type, you want a fast and seamless way of pulling it from your upstream partners
  • If you have a backlog of product data and digital asset collection with your upstream partners, you want a fast and cost-effective approach to backfill the gap.

The Product Data Lake facilitates these requirements by letting you pull your product data from the lake in your in-house structure, which may or may not be fully or partly compliant with an international standard.

In the Product Data Lake, you can take the role of being an upstream provider and a downstream receiver at the same time by being a midstream subscriber to the Product Data Lake. Thus, the Product Data Lake covers the whole supply chain from manufacturing to retail and even the requirements of B2B (Business-to-Business) end users.

The Product Data Lake uses the data lake concept for big data by letting the transformation and linking of data between many structures be done when the data are to be consumed for the first time. The goal is that the workload in this system resembles an iceberg, where 10 % of the ice is above water and 90 % is under water. In the Product Data Lake, manually setting up the links and transformation rules should be 10 % of the work, and the remaining 90 % will be automated in the exchange zones between trading partners.
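The idea of linking and transforming at consumption time can be illustrated with a small Python sketch. It is not the actual service: the field names, the sample values and the `pull` function are hypothetical, and the link rules are set up by hand here, which corresponds to the visible 10 % of the iceberg.

```python
# Raw records are stored exactly as the provider uploaded them,
# in the provider's in-house structure (made-up field names and values).
lake = [
    {"ItemNo": "A-100", "EAN": "5701234567899", "Desc": "Cordless drill"},
    {"ItemNo": "A-200", "EAN": "5701234567905", "Desc": "Angle grinder"},
]

# The receiver's link rules: receiver field -> provider field.
receiver_links = {"gtin": "EAN", "name": "Desc"}


def pull(records, links):
    """Apply the receiver's link rules when the data is consumed."""
    return [
        {target: record.get(source) for target, source in links.items()}
        for record in records
    ]


# The receiver gets the data in its own in-house structure.
received = pull(lake, receiver_links)
```

Nothing is transformed when the provider uploads. A second receiver with a different in-house structure would simply bring its own link rules to the same raw records.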

Excellence vs Excel

We all use Excel though we know it is bad. It is a user-friendly and powerful tool, but there are plenty of stories out there where Excel has caused a lot of trouble, like this one from Computerworld in 2008, when the credit crunch struck.

I guess all people who work in data management curse Excel. Data kept in Excel is a pain – you know where – as it is hard to share, you never know if you have the latest version, nice informative colouring disappears when transforming, narrow columns turn into rubbish, different formatting usually makes it practically impossible to combine two sheets, and there are heaps of other not so nice behaviours.

Even so, Excel is still the most used tool for many crucial data management purposes, as for example reported in the post The True Leader in Product MDM.

Excel is still a very frequently used option when it comes to exchanging data, as touched upon by Monica McDonnell of Informatica in a recent blog post on Four Technology Approaches for IDMP Data Management.

The use of Excel as a means to exchange data between organizations is probably the field where it will be most difficult to eliminate the dangerous use of Excel. The problem is that the alternative usually is far too rigid. The task of achieving consensus between many organizations on naming, formatting and all the other tedious stuff makes us turn to Excel.

When working with data quality within data management we may wrongly strive for perfection. We should rather strive for excellence, which is something better than the ordinary. In this case Excel is the ordinary. As Harriet Braiker said: “Striving for excellence motivates you; striving for perfection is demoralizing.”

In order to be excellent, though not perfect, in data sharing, we must develop solutions that are better than Excel without being too rigid. Right now, I am working on a solution of that kind for sharing product data. The service is called the Product Data Lake.

The World of Reference Data

Reference Data Management (RDM) is an evolving discipline within data management. When organizations mature in the reference data management realm, we often see a shift from relying on internally defined reference data to relying on externally defined reference data. This is based on the good old saying about not reinventing the wheel, and also on the fact that externally defined reference data are usually better at fulfilling multiple purposes of use, whereas internally defined reference data tend to cater only for the most important purpose of use within your organization.

Then, which standard to use tends to be a matter of where in the world you are. Let’s look at three examples from the location domain, the party domain and the product domain.

Location reference data

If you read articles in English about reference data and ensuring accuracy and other data quality dimensions for location data, you often meet remarks such as “be sure to check validity against US Postal Services” or “make sure to check against the Royal Mail PAF File”. This is all great if all your addresses are in the United States or the United Kingdom. If all your addresses are in another country, there will in many cases be similar services for the given country. If your addresses are spread around the world, you have to look further.

There are some Data-as-a-Service offerings for international addresses out there. When it comes to having your own copy of location reference data, the Universal Postal Union has an offering called the Universal POST*CODE® DataBase. You may also look into open data solutions such as GeoNames.

Party reference data

Within party master data management for Business-to-Business (B2B) activities, you want to classify your customers, prospects, suppliers and other business partners according to what they do. For that, there are some frequently used coding systems in areas where I have been:

  • Standard Industrial Classification (SIC) codes, the four-digit numerical codes assigned by the U.S. government to business establishments.
  • The North American Industry Classification System (NAICS).
  • NACE (Nomenclature of Economic Activities), the European statistical classification of economic activities.

As important economic activities change over time, these systems change to reflect the real world. As an example, my Danish company registration has changed NACE code three times since 1998 while I have been doing the same thing.

This doesn’t make conversion services between these systems any easier.

Product reference data

There is also a good choice of standardized and standardised classification systems for product data out there. To name a few:

  • The United Nations Standard Products and Services Code® (UNSPSC®), managed by GS1 US™ for the UN Development Programme (UNDP).
  • eCl@ss, which presents itself as: “THE cross-industry product data standard for classification and clear description of products and services that has established itself as the only ISO/IEC compliant industry standard nationally and internationally”. eCl@ss has its main support in Germany (the home of the Mercedes E-Class).

In addition to cross-industry standards, there are heaps of industry-specific international, regional and national standards for product classification.

The Shortcut to Lapland

11th of November and it’s time for the first x-mas post on this blog this year. My London gym is to blame for this early start.

Santa’s residence is disputed. As told in the post Multi-Domain MDM, Santa Style, one option is Lapland.

Yesterday this yuletide challenge was included in an email in my inbox:

[Image: Lapland]

Nice. Lapland is in Northern Scandinavia. Scandinavia belongs to that half of the world where the comma is used as the decimal mark, as shown in the post Your Point, My Comma.

So while the UK-born gym members will be near fainting doing several thousands of kilometers, I will claim the prize after an easy 3 kilometers and 546 meters on the cross trainer.
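The trick can be reproduced in a few lines of Python. The figure "3,546" is an assumption inferred from the 3 kilometers and 546 meters above, and `parse_distance_km` is a hypothetical helper, not something from the gym's email.

```python
def parse_distance_km(text: str, decimal_mark: str) -> float:
    """Parse a number written with either '.' or ',' as the decimal mark."""
    if decimal_mark == ",":
        # Comma is the decimal mark; a point would be a thousands separator.
        normalized = text.replace(".", "").replace(",", ".")
    elif decimal_mark == ".":
        # Point is the decimal mark; a comma would be a thousands separator.
        normalized = text.replace(",", "")
    else:
        raise ValueError("decimal_mark must be ',' or '.'")
    return float(normalized)


uk_reading = parse_distance_km("3,546", decimal_mark=".")            # 3546.0 km
scandinavian_reading = parse_distance_km("3,546", decimal_mark=",")  # 3.546 km
```

The same characters give two distances a factor of one thousand apart, so the convention has to travel with the number.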

Putting Two Things in One Field

A very common data quality issue is when a field in a data record is populated with more than one piece of information.

Sometimes this is done as a workaround, because we have a piece of information, but we don’t have a field with that distinct purpose of use. Then we find a more or less related existing field into which we can squeeze this additional piece of information.

But we also have some very common cases where this bad habit is required by external business rules or widespread tradition.

Legal Form in Company Names

This example is examined in the post Legal Forms from Hell.

One should think that it is time to change the bad (legally demanded) practice of mixing legal forms with company names and serve the original purpose in another, more data quality friendly way.

An Address Line

An address line will typically hold a couple of elements, such as a street (thoroughfare) name, a house number and maybe some kind of unit identification.

By the way, the order of street name and house number is reversed between two roughly equal halves of the world, with the exception of places where numbering within blocks between streets is the standard.
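Taking the two things apart again could look like the Python sketch below. It is a deliberately naive illustration that handles both orders of street name and house number. It ignores unit identification and block numbering, and real address parsing needs country-specific rules. The function name and the patterns are assumptions.

```python
import re
from typing import Optional, Tuple

# House number first, as in "10 Downing Street".
_NUMBER_FIRST = re.compile(r"^(\d+\w*)\s+(.+)$")
# Street name first, as in "Vestergade 12".
_NUMBER_LAST = re.compile(r"^(.+?)\s+(\d+\w*)$")


def split_address_line(line: str) -> Tuple[str, Optional[str]]:
    """Return (street, house_number); house_number is None if not found."""
    line = line.strip()
    m = _NUMBER_FIRST.match(line)
    if m:
        return m.group(2), m.group(1)
    m = _NUMBER_LAST.match(line)
    if m:
        return m.group(1), m.group(2)
    return line, None
```

For example, `split_address_line("10 Downing Street")` and `split_address_line("Vestergade 12")` both return the street name first and the house number second, whatever the order was in the field.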

Education in Person Name

You can put professor in front of your name and even MBA – Master of Business Administration!! – after your name in the name field.

In the next few days I will put AFCM (Accidental Field Content Misuser) after my name.

Fitness, Data Quality, Big Data and IT Projects

This weekend I’m in Copenhagen where I, unlike when in London, enjoy a bicycle ride.

In the old days I had a small cycle computer that gave you a few key performance indicators about your ride, such as riding time, distance covered, and average and maximum speed. Today you can use an app on your smartphone and along the way have current figures displayed on your smartwatch.

As explained in the post American Exceptionalism in Data Management, the first thing I do when installing an app is to change Fahrenheit to Celsius, the date format to a usable one and, in this context not least, miles to kilometers.

The cool thing is that the user interface on my smartwatch reports my usual speed in kilometers per hour as miles per hour, making me 60 % faster than I used to be. So next year I will join the Tour de France, making Jens Voigt (aka Der Alte) look like a youngster.
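The 60 % can be checked with a few lines of Python. The speed of 25 km/h is a made-up example, and `apparent_speedup` is a hypothetical helper.

```python
KM_PER_MILE = 1.609344  # exact, by the international definition of the mile


def apparent_speedup(speed_kmh: float) -> float:
    """Relative gain when a km/h figure is displayed with an mph label."""
    true_mph = speed_kmh / KM_PER_MILE
    shown_mph = speed_kmh  # same number, wrong unit label
    return shown_mph / true_mph - 1.0


gain = apparent_speedup(25.0)  # about 0.61, regardless of the speed chosen
```

The gain is the same for any speed, since it only depends on the ratio between the two units.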

A Viking tour around Roskilde and Vallø Borgring. Click for report with a wonderful mixup of date formats.

Using such an app is also a good example of why we have big data today. The app tracks a lot of data, such as a detailed route on a map with x, y and z coordinates, split speed per kilometer and other useful stuff. Analyzing these data tells me the Tour de France maybe isn’t a good idea. After what I thought was 100 miles, but was 100 kilometers, my speed went from slow to grandpa.

That’s a bit like IT projects, by the way. Regardless of timeframe, they slow down in progress after 80 % of the plan has been covered.

American Exceptionalism in Data Management

The term American exceptionalism was born in the political realm but certainly also applies to other areas, including data management.

As a lot of software, and today cloud services, are made in the USA, the rest of the world struggles somewhat with data standards that apply only, or to a high degree, to the United States.

Some of the common ones are:

Fahrenheit

In the United States, Fahrenheit is the unit of temperature. The rest of the world (with a few exceptions) uses Celsius. Fortunately, many applications have the ability to switch between the two, but it certainly happens to me once in a while that I uninstall a new exciting app because it only shows temperature in Fahrenheit, and to me 30 degrees is very hot weather.

Month-Day-Year

The Month-Day-Year date format is another American exceptionalism in data management. When dates are kept in databases there is no problem, as databases internally use a counter for a date. But as soon as the date slips into a text format and is used in an international context, no one can tell if 10/9/2014 is the 10th of September, as it is read outside the United States, or the 9th of October, as it is read inside the United States. For example, it took LinkedIn years before the service handled the date format in accordance with its international spread, and there are still mix-ups.
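The ambiguity is easy to show with Python’s standard `datetime` module. This is a small sketch and the variable names are only illustrative.

```python
from datetime import datetime

text = "10/9/2014"

# The same text parsed under the two conventions.
as_day_first = datetime.strptime(text, "%d/%m/%Y").date()    # 10 September 2014
as_month_first = datetime.strptime(text, "%m/%d/%Y").date()  # 9 October 2014

# ISO 8601 (year-month-day) leaves no room for doubt when a date is exchanged as text.
unambiguous = as_day_first.isoformat()  # "2014-09-10"
```

Both parses succeed without any warning, which is exactly why the mix-ups go unnoticed until someone shows up a month early or late.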

State

Having a state as part of a postal address is mandatory in the United States and only shared with a few other countries such as Australia and Canada, though the Canadians call the similar concept a province. The use of a mandatory state field with only US states present is especially funny when registering online for a webinar about an international data quality solution.
