Product Data Quality

The data quality tool industry has always had a hard time offering capabilities for solving the data quality issues that relates to product data.

Customer data quality issues has always been the challenges addressed as examined in the post The Future of Data Quality Tools, where the current positioning from the analyst firm Information Difference was discussed. The leaders as Experian Data Quality, Informatica and Trillium (now part of Syncsort) always promote their data quality tools with use cases around customer data.

Back some years Oracle did have a go for product data quality with their Silver Creek Systems acquisition as mentioned by Andrew White of Gartner in this post. The approach from Silver Creek to product data quality can be seen in this MIT Information Quality Industry Symposium presentation from the year before. However, today Oracle is not even present in the industry report mentioned above.

Multi-Domain MDM and Data Quality DimensionsWhile data quality as a discipline with the methodology and surrounding data governance may be very similar between customer data and product data, the capabilities needed for tools supporting data cleansing, data quality improvement and prevention of data quality issues are somewhat different.

Data profiling is different, as it must be very tightly connected to product classification. Deduplication is useful, but far from in same degree as with customer data. Data enrichment must be much more related to second party data than third party data, which is most useful for customer and other party master data.

Regular readers of this blog will know, that my suggestion for data quality tool vendors is to join Product Data Lake.

Encompassing Relational, Document and Graph the Best Way

The use of graph technology in Master Data Management (MDM) has been a recurring topic on this blog as the question about how graph approaches fits with MDM keeps being discussed in the MDM world.

Multi-Domain MDM GraphRecently Salah Kamel, the CEO at the agile MDM solution provider Semarchy, wrote a blog post called Does MDM Need Graph?

In here Salah states: “A meaningful graph query language and visualization of graph relationships is an emerging requirement and best practice for empowering business users with MDM; however, this does not require the massive redesign, development, and integration effort associated with moving to a graph database for MDM functionality”.

In his blog post Salah discusses how relationships in the multi-domain MDM world can be handled by graph approaches not necessarily needing a graph database.

At Product Data Lake, which is a business ecosystem wide product information sharing service that works very well besides Semarchy MDM inhouse solutions, we are on the same page.

Currently we are evaluating how graph approaches are best delivered on top of our document database technology (using MongoDB). The current use cases in scope are exploiting related products in business ecosystems and how to find a given product with certain capabilities in a business ecosystem as examined in the post Three Ways of Finding a Product.

The Future of Data Quality Tools

When looking at the data quality tool market it is interesting to observe, that the tools available does pretty much the same and that all of them are pretty good at what they do today.

A visualization of this is the vendor landscape in the latest Information Difference Data Quality Landscape:

Data Quality Landscape 2017

As you see, the leaders as Experian Data Quality, Informatica, Trillium and others are assembling at the right edge. But that is due to market strength. Else the bunch is positioned pretty much equal.

This report does in my eyes also mention some main clues about where the industry is going.

One aspect is that: “Some data quality products are stand-alone, while others link to separate master data or data governance tools with varying degrees of smoothness.”

Examples among the leaders are Informatica, with data quality, MDM, PIM and other data management tools under the same brand, and Trillium with their partnership with the top data governance vendor Collibra. We will see more of that.

Another aspect is that: “Although name and address is the most common area addressed in data quality, product data is another broad domain requiring different approaches.”

I agree with Andy Hayler of Information Difference about that product data needs a different treatment as discussed in the post Data Quality for the Product Domain vs the Party Domain.

Three Ways of Finding a Product

One goal of Product Information Management (PIM) is to facilitate that consumers of product information can find a product they are looking for. Facilitating that includes feasible functionality and optimal organization of data.

Search

There is a whole industry making software that helps with searching for products as touched in the post Search and if you are lucky you will find.

However, even the best error tolerant and super elastic search engines are dependent on the data to search on and are challenged by differences in the taxonomy used by the one who searches and the taxonomy used in the product data.

As we are being better at providing more and more data about products that also makes issues in searching, as we are getting more and more hits of which many are irrelevant for the intention of a given search.

Drill down

You can start by selecting in what main group of products you are looking for something and then drill down through a more and more narrow classification.

Again, this approach is challenged by different perspectives of product grouping and even if we are looking for standards, there are too many of them as described in the post Five Product Classification Standards.

Traverse

The term traverse has (or will) become trendy with the introduction of graph technology. By using graph technology in Product Information Management (PIM) you will have a way of overcoming the challenges related to using search or drill down when looking for a product.

Big HammerFinding a product has in many use cases the characteristic of that we know some pieces of information and want to find a product that match those pieces of information, but often expressed in a different way. This fit very well with the way graph technology works by having a given set of root nodes from where we traverse through edges and nodes (also called vertices) until we end at reachable nodes of the wanted type.

In doing that we will be able to translate between different wording, classifications and languages.

At Product Data Lake we are currently exploring – or should I say traversing – this space. I will very much welcome your thoughts on this subject.

MDM Summit Europe 2017 Preview

Next week we have the Master Data Management (and Data Governance) Summit Europe 2017 in London. I am looking forward to be there.

MDMDG2017The Sponsors

Some of the sponsors I am excited to catch up with are:

  • Semarchy, as they have just released their next version multi-domain (now promoted as multi-vector) MDM (now promoted as xDM) offering emphasizing on agility, smartness, intelligence and being measurable.
  • Uniserv, as they specialize in hosted customer MDM on emerging technology infused with their proven data quality capabilities and at the same time are open to coexistence with other multi-domain MDM services.
  • Experian Data Quality, as they seem to be a new entry into the MDM world coming from very strong support for party and location data quality, however with a good foundation for supporting the whole multi-domain MDM space.

The Speakers

This year there are a handful of Danish speakers. Can’t wait to listen to:

  • Michael Bendixen of Grundfos pumping up the scene with his Data Governance Keynote on Key Factors in Successful Data Governance
  • Charlotte Gerlach Sylvest of Coloplast on taking care of Implementing Master Data Governance in Large Complex Organisations
  • Birgitte Yde and Louise Pagh Covenas of ATP telling how they watch after my pension money while being on a Journey Towards a New MDM System
  • Erika Bendixen of Bestseller getting us dressed up for Making Master Data Fashionable by Transforming Information Chaos into a Governance-Driven Culture.

10 Analyst Firms in the MDM Space

When working with Master Data Management (MDM) it is always valuable to follow the analyst firms that are active on this subject and the related subjects as data quality, data governance and data management in general. You can learn from their insights – and disagreements – on the matters. Here are 10 analyst firms I follow:

Gartner, the large analyst firm known for their magic quadrants, hype cycles and cool vendor lists. There is a lot of brain power in this firm and they have never been caught in admitting a mistake. Quite a lot of posts on this blog mentions Gartner.

Forrester, another firm with heaps of analysts. Forrester has though been less prominent in the MDM world since Robert Karel left for Informatica. However, there are lots of wider insights to gain from as mentioned in the post Ecosystems are The Future of Digital and MDM.

The MDM Institute, which basically is Aaron Zornes, known as the Father Christmas of MDM. Aaron Zornes was the inspirational source in my recent post called MDM as Managed Service.

The Information Difference, headed by Andy Hayler. They publish a yearly MDM landscape report latest referenced on this blog in the post Emerging Database Technologies for Master Data.

Bloor Group has occasionally made reports about MDM latest mentioned on this blog in the post The MDM Market Wordle.

Ventana Research has been especially active around Product Information Management (PIM) as seen in the recent press release on their Product Information Management Research.

Intelligent Business Strategies, run by Mike Ferguson. No nonsense, plain English insights from the around the UK Midlands. Home page here.

Constellation Research, the Silicon Valley perspective. Home page here.

The Group of Analysts has published a series of interviews with MDM and PIM notabilities as for example this one with Richard Hunt of Agility Multichannel on Content Gravity.

Aberdeen Group, a company you as a MDM vendor can hire to put numbers on your blog as for example Stibo Systems did here.

Analysts

5 Data Management Mistakes to Avoid during Data Integration Projects

mistake-876597_1920

I am very pleased to welcome today’s guest blogger. Canada based Maira Bay de Souza of Product Data Lake Technologies shares her view on data integration and the mistakes to avoid doing that:

Throughout my 5 years of working with Data Integration, Data Migration and Data Architecture, I’ve noticed some common (but sometimes serious) mistakes related to Data Management and Software Quality Management. I hope that by reading about them you will be able to avoid them in your future Data Integration projects.

 1 Ignoring Data Architecture

Defining the Data Architecture in a Data Integration project is the equivalent of defining the Requirements in a normal (non-data-oriented) software project. A normal software application is (most of the times) defined by its actions and interactions with the user. That’s why, in the first phase of software development (the Requirements Phase), one of the key steps is creating Use-Cases (or User Stories). On the other hand, a Data Integration application is defined by its operations on datasets. Interacting with data structures is at the core of its functionality. Therefore, we need to have a clear picture of what these data structures look like in order to define what operations we will do on them.

 It is widely accepted in normal software development that having well-defined requirements is key to success. The common saying “If you don’t know where you’re going, any road will get you there” also applies for Data Integration applications. When ETL developers don’t have a clear definition of the Data Architecture they’re working with, they will inevitably make assumptions. Those assumptions might not always be the same as the ones you, or worse, your customer made.

(see here and here for more examples on the consequences of not finding software bugs early in the process due to by badly defined requirements)

 Simple but detailed questions like “can this field be null or not?” need to be answered. If the wrong decision is made, it can have serious consequences. Most senior Java programmers like me are well aware of the infamous “Null Pointer Exception“. If you feed a null value to a variable that doesn’t accept null (but you don’t know that that’s the case because you’ve never seen any architecture specification), you will get that error message. Because it is a vague message, it can be time-consuming to debug and find the root cause (especially for junior programmers): you have to open your ETL in the IDE, go to the code view, find the line of code that is causing the problem (sometimes you might even have to run the ETL yourself), then find where that variable is located in the design view of your IDE, add a fix there, test it to make sure it’s working and then deploy it in production again. That also means that normally, this error causes an ETL application to stop functioning altogether (unless there is some sort of error handling). Depending on your domain that can have serious, life-threatening consequences (for example, healthcare or aviation), or lead to major financial losses (for example, e-commerce).

 Knowing the format, boundaries, constraints, relationships and other information about your data is imperative to developing a high quality Data Integration application. Taking the time to define the Data Architecture will prevent a lot of problems down the road.

2 Doing Shallow Data Profiling

Data profiling is another key element to developing good Data Integration applications.

 When doing data profiling, most ETL developers look at the current dataset in front of them, and develop the ETL to clean and process the data in that dataset. But unfortunately that is not enough. It is important to also think about how the dataset might change over time.

 For example, let’s say we find  a customer in our dataset with the postal code in the city field. We then add an instruction in the ETL for when we find that specific customer’s data, to extract the postal code from the city field and put it in the postal code field. That works well for the current dataset. But what if next time we run the ETL another customer has the same problem? (it could be because the postal code field only accepts numbers and now we are starting to have Canadian customers, who have numbers and letters in the postal code, so the user started putting the postal code in the city field)

Not thinking about future datasets means your ETL will only work for the current dataset. However, we all know that data can change over time (as seen in the example above) – and if it is inputted by the user, it can change unpredictably. If you don’t want to be making updates to your ETL every week or month, you need to make it flexible enough to handle changes in the dataset. You should use data profiling not only to analise current data, but also to deduce how it might change over time.

Doing deep data profiling in the beginning of your project means you will spend less time making updates to the Data Cleaning portion of your ETL in the future.

 3 Ignoring Data Governance

 This point goes hand-in-hand with my last one.

 A good software quality professional will always think about the “what if” situations when designing their tests (as opposed to writing tests just to “make sure it works”). In my 9 years of software testing experience, I can’t tell you how many times I asked a requirements analyst “what if the user does/enters [insert strange combination of actions/inputs here]?” and the answer was almost always “the user will never do that“. But the reality is that users are unpredictable, and there have been several times when the user did what they “would never do” with the applications I’ve tested.

The same applies to data being inputted into an ETL. Thinking that “data will never come this way” is similar to saying “the user will never do that“. It’s better to be prepared for unexpected changes in the dataset instead of leaving it to be fixed later on, when the problem has already spread across several different systems and data stores. For example, it’s better to add validation steps to make sure that a postal code is in the right format, instead of making no validation and later finding provinces in the postal code field. Depending on your data structures, how dirty the data is and how widespread the problem is, the cost to clean it can be prohibitive.

This also relates to my first point: a well-defined Data Architecture is the starting point to implementing Data Governance controls.

 When designing a high quality Data Integration application, it’s important to think of what might go wrong, and imagine how data (especially if it’s inputted by a human) might be completely different than you expect. As demonstrated in the example above, designing a robust ETL can save hours of expensive manual data cleaning in the future.

 4 Confusing Agile with Code-And-Fix

 A classic mistake in startups and small software companies (especially those ran by people without a comprehensive education or background in Software Engineering) is rushing into coding and leaving design and documentation behind. That’s why the US Military and CMU created the CMMI: to measure how (dis)organized a software company is, and help them move from amateur to professional software development. However, the compliance requirements for a high maturity organization are impractical for small teams. So things like XP, Agile, Scrum, Lean, etc have been used to make small software teams more organized without getting slowed down by compliance paperwork.

Those techniques, along with iterative development, proved to be great for startups and innovative projects due to their flexibility. However, they can also be a slippery slope, especially if managers don’t understand the importance of things like design and documentation. When the deadlines are hanging over a team’s head, the tendency is always to jump into coding and leave everything else behind. With time, managers start confusing agile and iterative development with code-and-fix.

 Throughout my 16 years of experience in the Software Industry, I have been in teams where Agile development worked very well. But I have also been in teams where it didn’t work well at all – because it was code-and-fix disguised as Agile. Doing things efficiently is not the same as skipping steps.

Unfortunately, in my experience this is no different in ETL development. Because it is such a new and unpopular discipline (as opposed to, for example, web development), there aren’t a lot of software engineering tools and techniques around it. ETL design patterns are still in their infancy, still being researched and perfected in the academic world. So the slippery slope from Agile to code-and-fix is even more tempting.

 What is the solution then? My recommendation is to use the proven, existing software engineering tools and techniques (like design patterns, UML, etc) and adapt them to ETL development. The key here is to do something. The fact that there is a gap in the industry’s body of knowledge is no excuse for skipping requirements, design, or testing, and jumping into “code-and-fix disguised as Agile“. Experiment, adapt and find out which tools, methodologies and techniques (normally used in other types of software development) will work for your ETL projects and teams.

5 Not Paying Down Your Technical Debt

The idea of postponing parts of your to-do list until later because you only have time to complete a portion of them now is not new. But unfortunately, with the popularization of agile methodologies and incremental development, Technical Debt has become an easy way out of running behind schedule or budget (and masking the root cause of the problem which was an unrealistic estimate).

As you might have guessed, I am not the world’s biggest fan of Technical Debt. But I understand that there are time and money constraints in every project. And even the best estimates can sometimes be very far from reality – especially when you’re dealing with a technology that is new for your team. So I am ok with Technical Debt, when it makes sense.

However, some managers seem to think that technical debt is a magic box where we can place all our complex bugs, and somehow they will get less complex with time. Unfortunately, in my experience, what happens is the exact opposite: the longer you owe technical debt (and the more you keep adding to it), the more complex and patchy the application becomes. If you keep developing on top of – or even around – an application that has a complex flaw, it is very likely that you will only increase the complexity of the problem. Even worse, if you keep adding other complex flaws on top of – or again, even around – it, the application becomes exponentially complex. Your developers will want to run away each time they need to maintain it. Pretty soon you end up with a piece of software that looks more like a Frankenstein monster than a clean, cohesive, elegant solution to a real-world problem. It is then only a matter of time (usually very short time) before it stops working altogether and you have no choice but to redesign it from scratch.

This (unfortunately) frequent scenario in software development is already a nightmare in regular (non-data-oriented) software applications. But when you are dealing with Data Integration applications, the impact of dirty data or ever-changing data (especially if it’s inputted by a human), combined with the other 4 Data Management mistakes I mentioned above, can quickly escalate this scenario into a catastrophe of epic proportions.

So how do you prevent that from happening? First of all, you need to have a plan for when you will pay your technical debt (especially if it is a complex bug). The more complex the required change or bug is, the sooner it should be dealt with. If it impacts a lot of other modules in your application or ecosystem, it is also important to pay it off sooner rather than later. Secondly, you need to understand why you had to go into technical debt, so that you can prevent it from happening again. For example, if you had to postpone features because you didn’t get to them, then you need to look at why that happened. Did you under-estimate another feature’s complexity? Did you fail to account for unknown unknowns in your estimate? Did sales or your superior impose an unrealistic estimate on your team? The key is to stop the problem on its tracks and make sure it doesn’t happen again. Technical Debt can be helpful at times, but you need to manage it wisely.

 I hope you learned something from this list, and will try to avoid these 5 Data Management and Software Quality Management mistakes on your next projects. If you need help with Data Management or Software Quality Management, please contact me for a free 15-min consultation.

Maira holds a Bsc in Computer Science, 2 software quality certifications and over 16 years of experience in the Software Industry. Her open-mindedness and adaptability have allowed her to thrive in a multidisciplinary career that includes Software Development, Quality Assurance and Project Management. She has taken senior and consultant roles at Fortune 20 companies (IBM and HP), as well as medium and small businesses. She has spent the last 5 years helping clients manage and develop software for Data Migration, Data Integration, Data Quality and Data Consistency. She is a Product Data Lake Ambassador & Technology Integrator through her startup Product Data Lake Technologies.