A common technique used when assessing data quality is data profiling. For example, you may compute measures such as the number of fields in a table that have null or blank values, the distribution of filled lengths of a certain field, average values, highest values, lowest values and so on.
If we look at the most prominent entity types in master data management, customers and products, you may certainly also profile your customer tables and product tables, and indeed many data profiling tutorials use these common sorts of tables as examples.
However, in real life, profiling an entire customer table or product table will often be quite meaningless. You need to dig into the hierarchies in these data domains to get meaningful measures for your data quality assessment.
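As a rough illustration of the measures listed above, here is a minimal sketch in plain Python. The function name `profile_field`, the sample rows and the field names are all made up for the example; a real profiling tool will offer far more than this.

```python
from collections import Counter
from statistics import mean


def is_blank(value):
    """True for None and for strings that are empty or only whitespace."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def profile_field(rows, field):
    """Compute simple profiling measures for one field in a list of dict rows."""
    values = [row.get(field) for row in rows]
    filled = [v for v in values if not is_blank(v)]
    # Only real numbers take part in average / highest / lowest.
    numbers = [v for v in filled
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    return {
        "rows": len(values),
        "null": sum(1 for v in values if v is None),
        "blank": sum(1 for v in values if isinstance(v, str) and v.strip() == ""),
        "length_distribution": dict(Counter(len(str(v)) for v in filled)),
        "average": mean(numbers) if numbers else None,
        "highest": max(numbers) if numbers else None,
        "lowest": min(numbers) if numbers else None,
    }


# Hypothetical sample data, for illustration only.
customers = [
    {"name": "Acme Ltd", "postal_code": "2100", "credit_limit": 5000},
    {"name": "", "postal_code": None, "credit_limit": 1200},
    {"name": "Bo Jensen", "postal_code": "DK-2100", "credit_limit": None},
]

postal_profile = profile_field(customers, "postal_code")
credit_profile = profile_field(customers, "credit_limit")
```

Note that null and blank are counted separately here, since the two often have different causes in the source systems.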
Customer master data
In profiling customer master data you must consider the different types of party master data, such as business entities, department entities, consumer entities and contact entities, as the demands for completeness will be different for each type. If your raw data don’t have a solid categorization in place, a prerequisite for data profiling will often be to make such a categorization before going any further.
If your customer data model isn’t too simple, as explained in the post A Place in Time, your location data (like shipping addresses, billing addresses, visiting addresses) will be separated from your customer naming and identification data. This hierarchical structure must be considered in your data profiling.
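To make the point about party types concrete, here is a small sketch that measures completeness per party type instead of across the whole table. The party type codes, the field names and the `REQUIRED_FIELDS` rules are assumptions invented for the example; your own rules will depend on your data model and business.

```python
from collections import defaultdict

# Hypothetical completeness rules: which fields matter for which party type.
REQUIRED_FIELDS = {
    "business": ["company_name", "registration_id"],
    "department": ["company_name", "department_name"],
    "consumer": ["first_name", "last_name"],
    "contact": ["first_name", "last_name", "employer"],
}


def is_filled(value):
    """A value counts as filled if it is not None and not blank."""
    return value is not None and str(value).strip() != ""


def completeness_by_party_type(records):
    """Return {party_type: {field: share of records with the field filled}}."""
    counts = defaultdict(int)
    filled = defaultdict(lambda: defaultdict(int))
    for record in records:
        # Records without a category are reported separately, not guessed at.
        party_type = record.get("party_type") or "uncategorized"
        counts[party_type] += 1
        for field in REQUIRED_FIELDS.get(party_type, []):
            if is_filled(record.get(field)):
                filled[party_type][field] += 1
    return {
        party_type: {
            field: filled[party_type][field] / counts[party_type]
            for field in REQUIRED_FIELDS.get(party_type, [])
        }
        for party_type in counts
    }


# Hypothetical sample records.
parties = [
    {"party_type": "business", "company_name": "Acme Ltd", "registration_id": ""},
    {"party_type": "business", "company_name": "Beta A/S", "registration_id": "12345678"},
    {"party_type": "consumer", "first_name": "Bo", "last_name": "Jensen"},
    {"first_name": "Unknown"},
]

result = completeness_by_party_type(parties)
```

The uncategorized bucket comes back with no measures at all, which is the sketch's way of saying that those records need a categorization before completeness can be judged.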
For international customer data there will also be different demands and possibilities for completeness of customer data elements.
Depending on your industry and way of doing business there may also be different demands for customer data related to different industry verticals, demographic groups and data sourced in different channels. However, this may be slippery ground, as current and not least future requirements for multiple uses of the same master data may change the picture.
Product master data
For most businesses the requirements for completeness and other data profiling measures will be very different depending on the product type.
Some requirements will only apply to a small range of products; other requirements apply to a broader range of products.
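One way to picture requirements that apply to narrower or broader ranges of products is to attach them to nodes in the product hierarchy and let them be inherited downwards. The sketch below assumes a made-up hierarchy (`PARENT`) and made-up attribute rules (`REQUIRED`); it is not taken from any particular PIM product.

```python
# Hypothetical product hierarchy: child category -> parent category.
PARENT = {
    "power_drills": "power_tools",
    "power_tools": "tools",
    "hand_tools": "tools",
    "tools": None,
}

# Hypothetical requirements attached at different levels of the hierarchy.
REQUIRED = {
    "tools": ["description", "unit_of_measure"],  # broad range of products
    "power_tools": ["voltage"],                   # narrower range
    "power_drills": ["chuck_size"],               # small range
}


def required_attributes(category):
    """Collect requirements from the category and all its ancestors, root first."""
    attributes = []
    while category is not None:
        attributes = REQUIRED.get(category, []) + attributes
        category = PARENT.get(category)
    return attributes


def missing_attributes(product):
    """List the required attributes that are null or blank for this product."""
    missing = []
    for attribute in required_attributes(product["category"]):
        value = product.get(attribute)
        if value is None or str(value).strip() == "":
            missing.append(attribute)
    return missing


# Hypothetical products.
drill = {"category": "power_drills", "description": "Cordless drill",
         "unit_of_measure": "each", "voltage": 18}
hammer = {"category": "hand_tools", "description": "Claw hammer",
          "unit_of_measure": ""}
```

Here `missing_attributes(drill)` reports only the chuck size, while the hammer is never asked for a voltage at all.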
All in all, the data profiling requirements are an integral part of hierarchy management for product master data, which makes a very strong case for having data profiling capabilities implemented as part of a product information management (PIM) solution.
Multi-Domain Master Data Management
For master data management solutions embracing both customer data integration (CDI) and product information management (PIM), integrated capabilities for profiling customer master data, location master data and product master data as part of hierarchy management make a lot of sense.
Just as improving data quality isn’t a one-off activity but a continuous program, so is measuring the completeness of your master data of any kind.
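Treating measurement as a continuous program could, in its simplest form, mean comparing each run of completeness measures with agreed thresholds and with the previous run. The following is only a sketch; the measure names, thresholds and figures are invented for the example.

```python
def check_completeness(current, thresholds, previous=None):
    """Return findings for measures that are below threshold or have declined."""
    findings = []
    for measure, minimum in thresholds.items():
        value = current.get(measure)
        if value is None:
            findings.append(f"{measure}: not measured")
            continue
        if value < minimum:
            findings.append(
                f"{measure}: {value:.0%} is below the {minimum:.0%} threshold")
        if previous and measure in previous and value < previous[measure]:
            findings.append(
                f"{measure}: dropped from {previous[measure]:.0%} to {value:.0%}")
    return findings


# Hypothetical figures from two consecutive profiling runs.
thresholds = {"business.registration_id": 0.90, "consumer.last_name": 0.95}
last_run = {"business.registration_id": 0.88, "consumer.last_name": 0.99}
this_run = {"business.registration_id": 0.82, "consumer.last_name": 0.99}

findings = check_completeness(this_run, thresholds, previous=last_run)
```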
A very interesting post indeed. Gave me lots of ideas and a couple of new perspectives on profiling.
One thing that continuously strikes me though is that we in the DQ business aren’t very good at distinguishing between the two quite different approaches to applying data profiling: 1) Exploration and analysis and 2) Monitoring and quality assurance. Way too often the requirements of both activities are presented as a single use case, which I don’t think they are.
For example, you mention that categorization is a prerequisite to doing data profiling. I don’t think that’s true, because profiling is also the activity of making the categorization. On the other hand it IS true that categorization is a prerequisite to doing monitoring and quality assurance. The first scenario is applying data profiling to get to know “what and how”, the second scenario is applying data profiling to ensure some level of quality.
Do you agree with that point, I wonder? Other than that I found your post on profiling very exciting – a recurring subject for me and something I’m of course very interested in 🙂
Kasper, thanks a lot for commenting and not least for asking questions.
I agree that, as in most disciplines, there are poles, and I think you pointed out what they are in data profiling. However, from both a business perspective and a technological perspective we also see a lot of things going on between those poles. But it is fair to say that my pitch on data profiling and master data management is at the monitoring and quality assurance end of things.
For the case of categorizing types of party master data, profiling may be applied as part of the investigation. However, I have often seen that, for example, if the field “company name” is filled with characters, that doesn’t necessarily mean that it is the name of a company in there. So again it’s a matter of whether this is solved by using what may be called data profiling (with a lot of synonyms and other stuff involved) or whether we call it identity resolution.
Thanks again for an excellent comment.
Definitely agree. And identity resolution is one of the more tricky “categories” that most profiling tools lack… For now at least, but I think it’s a perfect candidate to add into the profiling mix.
By category I more or less presumed that you were also talking about e.g. product or customer categories such as product price ranges, product types, contact nationality and so on – details that can be discovered easily using profiling and thus used as a useful insight into “what are we dealing with” for the rest of the DQ project.
Thanks David Loshin for following up on the subject.
Henrik, I appreciate that dialog pertaining to PIM as it will broaden the discussions. PIM MDM as a segment is also multi-domain in that we manage supplier / vendor business entities and contact entities in addition to managing product classification, technical attribute descriptions, unit of measure, lead time, warranty, terms of warranty, and other industry standards such as UNSPSC or ECCN.
Thanks for joining, Jackie. True, and I have enjoyed getting my PIM whereabouts revitalized over the last couple of months (in a Multi-Domain scope).