Now, I am not going to write about the importance of location when selling real estate. Instead, I am going to give three examples of why knowing about the location matters when you are doing data matching, such as trying to find duplicates in names and addresses.
Location uniqueness
Let’s say we have these two records:
- Stefani Germanotta, Main Street, Anytown
- Stefani Germanotta, Main Street, Anytown
The data is character by character exactly the same. But:
- There is a very high probability that it is the same real-world individual only if there is just one address on Main Street in Anytown.
- If there are only a few addresses on Main Street in Anytown, you will still have a fair probability that this is the same individual.
- But if there are hundreds of addresses on Main Street in Anytown, the probability that this is the same individual will be below the threshold for many matching purposes.
Of course, if you are sending a direct marketing letter, it is pointless to send both letters, as:
- Either they will both be delivered to the same mailbox.
- Or both will be returned by the postal service.
So this example highlights a major point in data quality. If you are matching for a single purpose of use, like direct marketing, you may apply simple processing. But if you are matching for multiple purposes of use, like building a master data hub, you cannot avoid some kind of complexity.
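The reasoning above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the cut-off for "a few" addresses (`few_limit`), and the purpose labels are all hypothetical and not taken from any real matching tool.

```python
def uniqueness_confidence(addresses_on_street: int, few_limit: int = 5) -> str:
    """Grade how likely two identical name-and-street records are the same person.

    few_limit is an assumed cut-off for "only a few addresses"; tune it to your data.
    """
    if addresses_on_street < 1:
        raise ValueError("a street in the data must have at least one address")
    if addresses_on_street == 1:
        return "high"   # only one possible address on the street
    if addresses_on_street <= few_limit:
        return "fair"   # a few candidate addresses
    return "low"        # many addresses: likely below a matching threshold


def treat_as_duplicate(addresses_on_street: int, purpose: str) -> bool:
    """Decide whether to merge two character-identical records for a given purpose."""
    if purpose == "direct_marketing":
        # Both letters end up in the same mailbox or are both returned,
        # so one letter is enough regardless of the confidence.
        return True
    # Assumption: stricter purposes (e.g. a master data hub) need high confidence.
    return uniqueness_confidence(addresses_on_street) == "high"
```

For example, `treat_as_duplicate(300, "direct_marketing")` merges the records, while `treat_as_duplicate(300, "master_data")` keeps them apart.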
Location enrichment
Let’s say we have these two records:
- Alejandro Germanotta, 123 Main Street, Anytown
- Alejandro Germanotta, 123 Main Street, Anytown
If you know that 123 Main Street in Anytown is a single-family house, there is a high probability that this is the same real-world individual.
But if you know that 123 Main Street in Anytown is a building used as a nursing home or a campus, or that this entrance has many apartments or other kinds of units, then it is not so certain that these records represent the same real-world individual (at least not if the name is John Smith).
So this example highlights the importance of using external reference data in data matching.
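As a minimal sketch of this idea, the snippet below looks up an address in a small reference table of property types and grades the match accordingly. The reference table, the property type labels, and the function name are hypothetical; real external reference data will have its own structure.

```python
# Assumed labels for buildings that typically house many unrelated people.
MULTI_OCCUPANCY_TYPES = {"nursing_home", "campus", "apartment_building"}


def enrichment_confidence(address: str, property_reference: dict) -> str:
    """Grade a same-name, same-address match using external property data.

    property_reference maps an address string to an assumed property type label.
    """
    property_type = property_reference.get(address)
    if property_type == "single_family_house":
        return "high"       # one household: probably the same individual
    if property_type in MULTI_OCCUPANCY_TYPES:
        return "uncertain"  # many units behind one address
    return "unknown"        # no usable reference data for this address


# Hypothetical reference data for illustration only.
reference = {
    "123 Main Street, Anytown": "single_family_house",
    "125 Main Street, Anytown": "nursing_home",
}
```

With this table, `enrichment_confidence("123 Main Street, Anytown", reference)` grades the match as high, while the nursing home address comes back as uncertain.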
Location geocoding
Let’s say we have these two records:
- Gaga Real Estate, 1 Main Street, Anytown
- L. Gaga Real Estate, Central Square, Anytown
If you match using the street address, the match is not that close.
But if you assign a geocode to each of the two addresses, then the two addresses may turn out to be very close (just around the corner), and your match will then be pretty confident.
Assigning geocodes usually serves other purposes than data matching. So this example highlights how enhancing your data may have several positive impacts.
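To illustrate the geocode comparison, here is a small sketch that measures the great-circle distance between two latitude/longitude pairs with the haversine formula. The coordinates, the 100 metre cut-off, and the function names are assumptions made up for this example, and the Earth is treated as a sphere.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (spherical approximation)


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Return the great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def is_around_the_corner(geocode_a, geocode_b, max_distance_m: float = 100.0) -> bool:
    """Treat two (lat, lon) geocodes as the same location if within an assumed cut-off."""
    return haversine_m(*geocode_a, *geocode_b) <= max_distance_m


# Hypothetical geocodes for "1 Main Street" and "Central Square" in Anytown.
main_street_1 = (55.6761, 12.5683)
central_square = (55.6765, 12.5690)
```

Here `is_around_the_corner(main_street_1, central_square)` would flag the two addresses as close, which can then raise the confidence of the name match.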
Very good points, Henrik, and on the hot topic of ‘geo-location’ too.
Another good post Henrik and nice lateral thinking!
This highlights the importance of getting your data right and complete before data matching occurs. Obviously you need to ensure data is standardised and as clean as possible to match your address data to the geocode in order to gain the best results, which inherently improves data matching capabilities of the standardised data.
Very good example of why geocoding is important! I’m glad I stopped by to read this!
Hello again, Henrik
Excellent topic (again!) and one I would like to add an observation to…
We use faceted classification throughout our DQ work and one of the things we noticed is an extension of your observations. Once you establish a geo-code (or lat/long) for a given address, it becomes persistent in a collection. Now, if Julie’s Massage and Dog Wash moves into 1 Main Street, Anytown (or if Gaga Real Estate changes its name) you don’t have to worry about re-working that organization’s location – you already have it.
Another way of putting this from a data relationship perspective: the building has a geo-code (and as you mentioned a bunch of other facets) not the organization.
Cheers.
John O’
Great post.
I’m increasingly recommending that organisations use lat-long enrichment. We’re actually doing a webcast about this in a few weeks on Data Quality Pro with a data visualization specialist. There are so many data quality issues that can be detected this way that are impossible to find using traditional profiling.
Thanks Monis, Daryl, William, John and Dylan for the comments.
It seems like geocoding (assigning latitude and longitude to addresses) is a hot topic in the data quality realm.