Data Modelling and Data Quality

There are intersections between data modelling and data quality. In examining those we can use a data quality mind map published recently on this blog:

[Mind map: Data Modelling and Data Quality]

Data Modelling and Data Quality Dimensions:

Some data quality dimensions are closely related to data modelling, and a given data model can directly impact them. This is the case for:

  • Data integrity, as the relationship rules in a traditional entity-relationship-based data model foster the integrity of the data controlled in databases. The weak sides are that sometimes these rules are too rigid to describe actual real-world entities and that integrity across several databases is not covered. To uncover the latter, we may use data profiling methods.
  • Data validity, as field definitions and relationship rules control that only data considered valid can enter the database, as sketched in the example after this list.
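
To make this concrete, here is a minimal sketch, assuming a simple customer/order model, of how field definitions and relationship rules enforce validity and integrity at the database level. The table and column names are illustrative, not taken from any particular system.

```python
# Minimal sketch: field definitions and relationship rules in SQL,
# driven from Python via the standard library's sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FK rules when asked

conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,                              -- completeness rule
    country     TEXT NOT NULL CHECK (length(country) = 2)   -- validity rule
);
CREATE TABLE sales_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer (customer_id)  -- integrity rule
);
""")

conn.execute("INSERT INTO customer VALUES (1, 'Acme A/S', 'DK')")
conn.execute("INSERT INTO sales_order VALUES (10, 1)")  # OK: customer 1 exists

try:
    conn.execute("INSERT INTO sales_order VALUES (11, 99)")  # no such customer
except sqlite3.IntegrityError as e:
    print("Rejected by relationship rule:", e)
```

As the last statement shows, the relationship rule keeps orphaned orders out of the database, but only within that one database; integrity across several databases still has to be checked by other means, such as data profiling.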

Some other data quality dimensions must be solved with extended data models, alternative methodologies, or both. This is the case for:

  • Data completeness:
    • A common scenario is that a data model born in the United States will set the state field within an address as mandatory and probably accept only a value from a reference list of 50 states. This will not work in the rest of the world. So, to avoid getting garbage, or no data at all, you will either need to extend the model or loosen the model and control completeness in other ways.
    • With product data, the big pain is that different groups of products require different data elements. This can be solved with a very granular data model, with possible performance issues, or a very customized data model, with scalability and other issues as a result.
  • Data uniqueness: A common scenario here is that names and addresses can be spelled in many ways even though they reflect the same real-world entity. We can use identity resolution (and data matching) to detect this, as sketched below, and then model how we link data records reflecting real-world duplicates together in a looser or tighter way.
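
Here is a minimal sketch of what identity resolution can look like, assuming a crude name-and-city similarity score built on Python's standard library; real data matching engines use far more sophisticated parsing, normalization and scoring, and the records below are made up.

```python
# Toy identity resolution: score record pairs and flag candidate duplicates.
from difflib import SequenceMatcher

records = [
    {"id": 1, "name": "John Smith",  "city": "New York"},
    {"id": 2, "name": "Jon Smith",   "city": "New York"},
    {"id": 3, "name": "Maria Lopez", "city": "Madrid"},
]

def similarity(a, b):
    """Crude similarity of two records based on normalized name + city."""
    key_a = f"{a['name']} {a['city']}".lower()
    key_b = f"{b['name']} {b['city']}".lower()
    return SequenceMatcher(None, key_a, key_b).ratio()

THRESHOLD = 0.85  # pairs above this are candidate real-world duplicates

for i, a in enumerate(records):
    for b in records[i + 1:]:
        score = similarity(a, b)
        if score >= THRESHOLD:
            print(f"Possible duplicate: {a['id']} and {b['id']} (score {score:.2f})")
```

The pairs flagged this way can then be linked together in a looser or tighter way, depending on how confident we are that they reflect the same real-world entity.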

Emerging technologies:

Some of the emerging technologies in the data storage realm present new ways of solving the challenges we have with data quality and traditional entity-relationship-based data models.

Graph databases and document databases allow for describing and operating data models better aligned with the real world. This topic was examined in the post Encompassing Relational, Document and Graph the Best Way.

In the Product Data Lake venture I am working with right now, we are also aiming to solve the data integrity, data validity and data completeness issues with product data (or product information, if you like) using these emerging technologies. This includes solving issues with geographical diversity and varying completeness requirements through a granular data model that is scalable, not only within a given company but also across a whole business ecosystem encompassing many enterprises belonging to the same (data) supply chain.

Three Remarkable Observations about Reltio

The latest entry on The Disruptive Master Data Management Solutions List is Reltio. I have been following Reltio for more than 5 years and have had the chance to do some hands-on work lately.

In doing that, I have made three observations that make the Reltio Cloud solution a remarkable MDM offering.

More than Master Data

While the Reltio solution emphasizes master data, the platform can include the data that revolves around master data as well. That means you can bring transactions and big data streams to the platform and apply analytics, machine learning, artificial intelligence and those other shiny new things, moving these disciplines from a purely analytical world to exploiting these data and capabilities in the operational world.

The thinking behind this approach is that you cannot get a 360-degree view of customers, vendors and other party roles, or a 360-degree view of products, by only having a snapshot compound description of the entity in question. You also need the raw history, the relationships between entities and access to details for various use cases.

In fact, Reltio provides not just operational MDM: through a module called Reltio IQ it also brings continuously mastered data and correlated transactions into an Apache Spark environment for analytics and machine learning. This eliminates the traditional friction of synchronizing data models between MDM and analytical environments. It also allows aggregated results to be synchronized back into the MDM profiles by storing them as analytical attributes. These attributes are then available for use in an operational context, such as marketing segmentation, sales recommendations, GDPR exposure and more.
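
As a rough illustration of that round trip, here is a minimal PySpark sketch that aggregates transactions per mastered entity into analytical attributes. The paths and column names are my own assumptions for illustration, not Reltio's actual API.

```python
# Sketch: aggregate mastered transactions into analytical attributes
# that can be synchronized back onto the MDM profiles.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("analytical-attributes").getOrCreate()

# Continuously mastered transactions, keyed by the golden record id.
transactions = spark.read.parquet("s3://example-bucket/mastered_transactions")

# One row per entity: aggregated values usable in an operational context.
analytical_attributes = (
    transactions.groupBy("entity_id")
    .agg(
        F.sum("amount").alias("lifetime_value"),
        F.max("transaction_date").alias("last_purchase_date"),
    )
)

# In a real setup these would flow back into the MDM profiles; here we
# just persist them to illustrate the round trip.
analytical_attributes.write.mode("overwrite").parquet(
    "s3://example-bucket/analytical_attributes"
)
```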

Multiple Storage Capabilities

There is an ongoing debate in the MDM community these days about whether you should use relational database technology, NoSQL technology or graph technology. Reltio utilizes all three of them, each for the purposes where it makes the most sense.

Reference data are handled as relational data. The entities are kept in a wide column store, a technique that combines the scalability known from pure column stores with some of the structure known from relational databases. Finally, the relationships are handled using graph techniques, which have been a recurring subject on this blog.

Reltio calls this multi-model polyglot persistence, and they embrace the latest technologies from multiple clouds such as AWS and Google Cloud Platform (GCP) under the covers.
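
To illustrate the idea of multi-model polyglot persistence in miniature, here is a toy Python sketch that routes reference data, entities and relationships to three in-memory stand-ins; in a real platform these would be backed by an RDBMS, a wide column store and a graph database respectively, and the data shown is made up.

```python
# Toy polyglot persistence: one facade, three storage models.
from dataclasses import dataclass, field

@dataclass
class PolyglotStore:
    reference_data: dict = field(default_factory=dict)  # relational-style lookup tables
    entities: dict = field(default_factory=dict)        # wide rows: id -> sparse columns
    relationships: list = field(default_factory=list)   # graph edges: (from, type, to)

    def add_reference(self, table, code, label):
        self.reference_data.setdefault(table, {})[code] = label

    def upsert_entity(self, entity_id, **columns):
        # Wide column semantics: each entity carries its own sparse set of columns.
        self.entities.setdefault(entity_id, {}).update(columns)

    def relate(self, from_id, rel_type, to_id):
        self.relationships.append((from_id, rel_type, to_id))

store = PolyglotStore()
store.add_reference("country", "DK", "Denmark")           # relational
store.upsert_entity("C1", name="Acme A/S", country="DK")  # wide column
store.upsert_entity("C2", name="Acme GmbH", country="DE")
store.relate("C1", "PARENT_OF", "C2")                     # graph
```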

Survival of the Fit Enough

One thing that MDM solutions do is make a golden record from different systems of record where the same real-world entity is described in many ways, and the records are therefore considered duplicates. Identifying those records is hard enough. But then comes the task of merging the conflicting values, so that the most accurate values survive in the golden record.

Reltio does that very elegantly by actually not doing it. Survivorship rules can be set up based on all the needed parameters, such as recency, provenance and more, and you may also allow more than one value to survive, as touched on in the post about the principle of Survival of the Fit Enough.

In Reltio there is no purge of the values that do not immediately survive. The golden record is not stored physically. Instead, Reltio keeps one (or even more than one) virtual golden record by letting the original source records stay. Therefore, you can easily roll back or update the single view of the truth.
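
A minimal sketch of that virtual golden record idea, assuming survivorship by recency per field; the source systems, fields and values are made up, and a real survivorship engine would also weigh provenance and more.

```python
# The golden record is computed on the fly; the source records stay,
# so the single view of the truth can be rolled back or recomputed.
from operator import itemgetter

source_records = [
    {"source": "CRM", "updated": "2018-03-01", "phone": "555-0100", "email": None},
    {"source": "ERP", "updated": "2018-06-15", "phone": None, "email": "a@example.com"},
    {"source": "Web", "updated": "2018-05-20", "phone": "555-0199", "email": "old@example.com"},
]

def golden_view(records, fields):
    """For each field, the most recent non-empty value survives."""
    newest_first = sorted(records, key=itemgetter("updated"), reverse=True)
    return {f: next((r[f] for r in newest_first if r[f]), None) for f in fields}

print(golden_view(source_records, ["phone", "email"]))
# {'phone': '555-0199', 'email': 'a@example.com'}
```

Changing the ruleset, for example to prefer a given source over recency, just changes the computation; nothing has been purged.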

The Reltio platform allows survivorship rules to be customized in rulesets for an unlimited number of roles and personas, in effect supporting multiple personalized versions of the truth. In an operational MDM context this allows sales, marketing, compliance and other teams to see the data values that they care about most, while collaborating continuously in what Reltio calls the Self-Learning Enterprise.

[Image: Going beyond operational MDM]

Trending Topic: Graph and MDM

Using graph data stores and their related capabilities has become a trending topic in the Master Data Management (MDM) space. This opportunity was first examined 5 years ago here on the blog in the post Will Graph Databases become Common in MDM? It seems so.

Recently David Borean, Chief Data Science Officer at the disruptive MDM vendor AllSight, wrote the blog post The real reason why Master Data Management needs Graph. In it, David confirms the commonly known understanding that graph databases are superior to relational databases when it comes to handling relationships within master data. But David also brings up how graph databases can support multiple versions of the truth.

Several other vendors, such as Semarchy and Reltio, are emphasizing graph in MDM in their market messaging.

Aaron Zornes of The MDM Institute is another proponent of using graph technology within MDM as mentioned over at The Disruptive MDM Solutions blog in the post MDM Fact or Fiction: Who Knows?

What do you think: Will graph databases really break through in MDM soon? Will it be as stand-alone graph technology (as for example from neo4j) or embedded in MDM vendor portfolios?

Encompassing Relational, Document and Graph the Best Way

The use of graph technology in Master Data Management (MDM) has been a recurring topic on this blog, as the question of how graph approaches fit with MDM keeps being discussed in the MDM world.

Recently Salah Kamel, the CEO of the agile MDM solution provider Semarchy, wrote a blog post called Does MDM Need Graph?

In it, Salah states: “A meaningful graph query language and visualization of graph relationships is an emerging requirement and best practice for empowering business users with MDM; however, this does not require the massive redesign, development, and integration effort associated with moving to a graph database for MDM functionality”.

In his blog post Salah discusses how relationships in the multi-domain MDM world can be handled by graph approaches without necessarily needing a graph database.

At Product Data Lake, which is a business-ecosystem-wide product information sharing service that works very well alongside Semarchy MDM in-house solutions, we are on the same page.

Currently we are evaluating how graph approaches are best delivered on top of our document database technology (using MongoDB). The current use cases in scope are exploiting related products in business ecosystems and finding a given product with certain capabilities in a business ecosystem, as examined in the post Three Ways of Finding a Product.
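
As a sketch of what a graph approach on top of a document database can look like, here is a minimal pymongo example using MongoDB's $graphLookup aggregation stage to follow related-product links; the collection and field names are assumptions for illustration, not Product Data Lake's actual schema.

```python
# Follow "related product" links recursively within one collection.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
products = client["pdl"]["products"]

pipeline = [
    {"$match": {"_id": "SKU-1001"}},
    {"$graphLookup": {
        "from": "products",           # traverse within the same collection
        "startWith": "$related_to",   # ids of directly related products
        "connectFromField": "related_to",
        "connectToField": "_id",
        "as": "related_products",
        "maxDepth": 2,                # initial lookup plus two more hops
    }},
]

for doc in products.aggregate(pipeline):
    print(doc["_id"], "->", [p["_id"] for p in doc["related_products"]])
```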

Three Ways of Finding a Product

One goal of Product Information Management (PIM) is to ensure that consumers of product information can find the product they are looking for. Facilitating that requires both suitable functionality and optimal organization of data.

Search

There is a whole industry making software that helps with searching for products, as touched on in the post Search and if you are lucky you will find.

However, even the best error-tolerant and super-elastic search engines depend on the data they search on and are challenged by differences between the taxonomy used by the one who searches and the taxonomy used in the product data.

As we get better at providing more and more data about products, searching also suffers: we get more and more hits, many of which are irrelevant to the intention of a given search.

Drill down

You can start by selecting the main group of products in which you are looking for something and then drill down through an increasingly narrow classification.

Again, this approach is challenged by different perspectives on product grouping, and even if we look to standards, there are too many of them, as described in the post Five Product Classification Standards.
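
A toy sketch of the drill-down mechanics, assuming a made-up classification hierarchy:

```python
# Narrow a classification tree one level at a time down to the products.
classification = {
    "Tools": {
        "Hand tools": {"Hammers": ["SKU-1001"], "Screwdrivers": ["SKU-3003"]},
        "Power tools": {"Drills": ["SKU-4004"]},
    },
}

def drill_down(tree, path):
    """Follow a path of ever narrower groups."""
    node = tree
    for group in path:
        node = node[group]
    return node

print(drill_down(classification, ["Tools", "Hand tools", "Hammers"]))
# ['SKU-1001']
```

The sketch also shows the weakness: a searcher who thinks in a different grouping than the one encoded in the tree has no path to follow.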

Traverse

The term traverse has become (or will become) trendy with the introduction of graph technology. By using graph technology in Product Information Management (PIM) you will have a way of overcoming the challenges related to using search or drill-down when looking for a product.

Finding a product has, in many use cases, the characteristic that we know some pieces of information and want to find a product that matches those pieces of information, though often expressed in a different way. This fits very well with the way graph technology works: from a given set of root nodes we traverse through edges and nodes (also called vertices) until we end at reachable nodes of the wanted type.

In doing that we will be able to translate between different wordings, classifications and languages.
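
A minimal sketch of such a traversal, assuming a tiny made-up product graph in which terms in different languages lead to the same classification and on to products:

```python
# Breadth-first traversal from root nodes to reachable nodes of the wanted type.
from collections import deque

graph = {  # adjacency lists: node -> connected nodes
    "hammer (en)": ["hand tools"],
    "Hammer (de)": ["hand tools"],  # German term pointing to the same class
    "hand tools": ["SKU-1001", "SKU-2002"],
    "SKU-1001": [], "SKU-2002": [],
}
node_type = {"SKU-1001": "product", "SKU-2002": "product",
             "hand tools": "class", "hammer (en)": "term", "Hammer (de)": "term"}

def traverse(roots, wanted_type):
    seen, queue, found = set(roots), deque(roots), []
    while queue:
        node = queue.popleft()
        if node_type.get(node) == wanted_type:
            found.append(node)
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return found

print(traverse(["Hammer (de)"], "product"))  # ['SKU-1001', 'SKU-2002']
```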

At Product Data Lake we are currently exploring – or should I say traversing – this space. I very much welcome your thoughts on this subject.

Will Graph Databases become Common in MDM?

One of my pet peeves in data quality for CRM and ERP systems is the often-used way of looking at entities, not least party entities, in a flat data model, as told in the post A Place in Time.

Party master data, and related location master data, will eventually be modeled in very complex models, and surely we see more and more examples of that. For example, I remember working long ago with the ERP system that later became Microsoft Dynamics AX. Back then I had issues with its simplistic and role-unaware data model. As I am currently working on a project using the AX 2012 Address Book, it is good to see that things have certainly developed.

This blog has quite a few posts on hierarchy management in Master Data Management (MDM) and even Hierarchical Data Matching. But I have to admit that even complex relational data models and hierarchical approaches in fact don’t align completely with the real world.

In a comment to the post Five Flavors of Big Data Mike Ferguson asked about graph data quality. In my eyes using graph databases in master data management will indeed bring us closer to the real world and thereby deliver a better data quality for master data.

I remember that at this year’s MDM Summit Europe Aaron Zornes suggested that a graph database will be the best choice for reflecting the most basic reference dataset: the country list. Oh yes, and in master data too, you may think, though I doubt that relational databases and hierarchy management will be out of fashion for a while.

So it would be good to know if you have seen or worked with graph databases in master data management beyond representing a static analysis result as a graph.

[Image: property graph example, from the Wikipedia article on graph databases]
