Data discovery is a term most often mentioned in relation to business intelligence and data science. In this context, data discovery can be seen as a more experimental and preliminary activity that, once hidden data sources, relationships and patterns are identified, can lead to a more continuous and integrated form of reporting and predictive analysis.
With the increasing awareness of data security, data protection and data privacy, and the regulatory compliance enforced in this space, it is crucial for organizations to know what kind of data flows through and is stored within the organization. While you may argue that this should be available in already existing documentation, I have yet to meet an organization where this is the case. And I get around a lot.
Data discovery is also a component of test data management. Tool vendors package their offerings in this space with capabilities for data masking, data subsetting and data discovery in order to answer questions such as:
Where are the data elements that should be masked when using production data in test scenarios without violating data privacy regulations?
How can you subset (minimize) test data sets derived from production (covering several databases) and still keep the relationships between records intact?
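The subsetting question can be illustrated with a minimal sketch. It assumes two hypothetical in-memory tables, `customers` and `orders`, linked by a `customer_id` foreign key; the function name `subset_with_parents` is made up for illustration, and a real tool would work against actual databases and follow many relationships, not just one.

```python
# Hypothetical sample data: parent rows keyed by id, child rows with a foreign key.
customers = {1: {"name": "A"}, 2: {"name": "B"}, 3: {"name": "C"}}
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},
    {"order_id": 12, "customer_id": 3},
    {"order_id": 13, "customer_id": 2},
]


def subset_with_parents(child_rows, parent_table, fk, keep):
    """Keep the child rows selected by `keep` plus every parent row they reference.

    Also returns the foreign key values that have no parent row, so broken
    relationships are reported instead of silently copied into the test set.
    """
    kept_children = [row for row in child_rows if keep(row)]
    needed_keys = {row[fk] for row in kept_children}
    kept_parents = {k: v for k, v in parent_table.items() if k in needed_keys}
    missing = needed_keys - set(kept_parents)
    return kept_children, kept_parents, missing


kept_orders, kept_customers, orphans = subset_with_parents(
    orders, customers, "customer_id", lambda row: row["order_id"] >= 12
)
print(kept_orders)
print(kept_customers)
print(orphans)
```

The point of the design is that the subset is driven from the child rows, so the test data never contains an order whose customer was left behind in production.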
Within Data Quality Management, Data Governance and Master Data Management (MDM), data discovery plays a role similar to its role in data reporting. We can use data discovery to map data lineage and to find potential data relationships where data matching, data cleansing and/or data stewardship might help with ensuring data quality and business process improvement. We can also use it to explore where the same data have different labels (metadata) attached, or where the same labels are used for different data types.
During my engagements in selecting and working with the major data management tools on the market, I have repeatedly found that they lack support for specialized data management needs in smaller markets.
Two such areas I have been involved with as a Denmark-based consultant are:
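One simple way to spot the same data hiding under different labels is to compare the values of columns across systems. The following is a minimal sketch, assuming hypothetical `crm` and `billing` tables held in memory and using Jaccard similarity of the distinct values as the overlap measure; the threshold of 0.5 is an arbitrary assumption, and real profiling tools use richer statistics.

```python
def jaccard(a, b):
    """Share of distinct values that two columns have in common."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_columns(tables, threshold=0.5):
    """Compare columns across different tables and report high value overlap.

    `tables` maps table name -> {column name -> list of values}.
    """
    cols = [
        (table, column, values)
        for table, columns in tables.items()
        for column, values in columns.items()
    ]
    matches = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            t1, c1, v1 = cols[i]
            t2, c2, v2 = cols[j]
            if t1 == t2:
                continue  # only cross-table candidates are interesting here
            score = jaccard(v1, v2)
            if score >= threshold:
                matches.append((f"{t1}.{c1}", f"{t2}.{c2}", round(score, 2)))
    return matches


# Hypothetical sample: the customer number is labeled differently in each system.
tables = {
    "crm": {"cust_no": [101, 102, 103, 104], "city": ["Aarhus", "Odense"]},
    "billing": {"debtor_id": [102, 103, 104, 105], "town": ["Odense", "Aarhus"]},
}
print(similar_columns(tables))
```

Here `cust_no` and `debtor_id` surface as candidates for being the same data element despite their different labels, which is the kind of lead a data steward would then verify.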
The authorities in Denmark offer free-of-charge access to very up-to-date, granular and accurate address data. Besides the envelope form of an address, the data also come with a data management friendly key (usually referred to as KVHX) on the unit level for each residential and business address within the country. Besides the existence of the address, you also have access to what kind of activity takes place at the address, for example whether it is a single-family house, a nursing home or a campus, along with other useful information for verification, matching and other data management activities.
If you want to verify addresses with the major international data management tools I have come across, many of these goodies are gone. For example:
Address reference data are refreshed only once per quarter
The key and the access to more information is not available
A price tag for data has been introduced
In Denmark (and other Scandinavian countries) we have a national identification number (known as personnummer) that is used much more intensively than the national IDs known from most other countries, as told in the post Citizen ID within seconds.
The data masking capabilities in major data management solutions come with pre-built functions for national IDs, but these only cover major markets, such as the United States Social Security Number, the United Kingdom NINO and the kinds of national ID in use in a few other large Western countries.
So, GDPR compliance is just a little bit harder here even when using a major tool.
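When a tool has no pre-built function for a national ID, a custom rule has to fill the gap. Below is a minimal sketch for the Danish number, assuming the common written form of six birth date digits (DDMMYY) followed by a four-digit sequence number, with an optional hyphen. The function names are hypothetical, the date check is only a coarse plausibility test, and no checksum is verified, so treat it as an illustration rather than a complete detector.

```python
import re

# Ten digits, optionally written as DDMMYY-SSSS, not embedded in a longer number.
CPR_CANDIDATE = re.compile(r"(?<!\d)(\d{2})(\d{2})(\d{2})-?(\d{4})(?!\d)")


def looks_like_cpr(day, month):
    """Coarse plausibility check on the date part (assumption: no checksum test)."""
    return 1 <= int(day) <= 31 and 1 <= int(month) <= 12


def mask_cpr(text, mask_char="X"):
    """Replace the digits of plausible ID numbers in free text with `mask_char`."""

    def replace(match):
        day, month, _year, _sequence = match.groups()
        if not looks_like_cpr(day, month):
            return match.group(0)  # leave other ten-digit numbers untouched
        return re.sub(r"\d", mask_char, match.group(0))

    return CPR_CANDIDATE.sub(replace, text)


masked = mask_cpr("Patient 010190-1234 called; order 9912345678 shipped.")
print(masked)  # Patient XXXXXX-XXXX called; order 9912345678 shipped.
```

The same pattern doubles as a discovery rule: running it over sampled column values shows where such numbers live before any masking is applied.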