The classic data quality business case is avoiding sending promotion letters and printed materials to duplicate prospects and customers.
Even as e-commerce moves forward and more complex data quality business cases, such as those related to multi-purpose master data management, become more important, I would like to take a look at the classic business case by examining some different choices of data quality tool.
As you may be used to all kinds of different currencies, such as EUR, USD, AUD, GBP and so on, I will use the fictitious currency SSB (Simple Stupid Bananas).
Let’s say we have a direct marketing campaign with these facts:
- 100,000 names and addresses, half of them also with a phone number
- Cost per mail is 3 SSB
- Response is 4,500 orders with an average profit of 100 SSB
From investigating a sample we know that 10% of the names and addresses are duplicates with slightly different spellings.
So from these figures we know that the cost of a false negative (an actual duplicate that is not found) is 3 SSB, one wasted mailing. The saving from a true positive is then also 3 SSB.
The cost of a false positive (a found duplicate that actually isn’t a duplicate) is a possibly missed order worth: 4,500 / (100,000 * 90%) * 100 SSB = 5 SSB.
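To make that concrete, here is a minimal sketch in Python deriving these per-record amounts from the campaign figures above (the variable names are my own):

```python
# Campaign figures from the example above (all amounts in SSB)
mail_cost = 3          # cost per mail sent
orders = 4_500         # responses per campaign
avg_profit = 100       # average profit per order
recipients = 100_000   # names and addresses on the list
dup_rate = 0.10        # share of the list that are duplicates

# A false negative (a missed duplicate) costs one wasted mail,
# and a true positive saves that same mail.
false_negative_cost = mail_cost                       # 3 SSB
true_positive_saving = mail_cost                      # 3 SSB

# A false positive removes a genuine recipient, so we lose that
# recipient's expected profit: the response rate on the deduplicated
# list times the average profit per order.
unique_recipients = recipients * (1 - dup_rate)       # 90,000
false_positive_cost = orders / unique_recipients * avg_profit

print(false_negative_cost, false_positive_cost)       # 3 5.0
```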
Now let’s examine 3 options for tools for finding duplicates:
A: We already have Excel
B: Buying the leader of the pack data quality tool
C: Buying an algorithm based dedupe tool
A: We already have Excel
You may first sort 100,000 rows by address and look for duplicates that way. Say you find 2,000 duplicates. Then sort the remaining 98,000 rows by surname and look for duplicates. Say you find 1,000 duplicates. Then sort 97,000 rows by given name. Say you find 1,000 duplicates. Finally sort the 48,000 rows that have a phone number by that number. Say you find 1,000 duplicates.
If a person can look for duplicates in 1,000 rows per hour (without making false positives) we will browse a total of 343,000 sorted rows in 343 hours.
Say you hire a student for that and have the Subject Matter Expert spend 15 hours explaining, controlling and verifying the process.
Costs are:
- 343 student hours at 15 SSB each = 5,145 SSB
- 15 SME hours at 50 SSB each = 750 SSB
Total costs are 5,895 SSB.
Total savings are 5,000 true positives at 3 SSB each = 15,000 SSB, making a positive ROI of 9,105 SSB in each campaign.
The only thing is that it will take one student more than two months (without quitting) to do the job.
B: Buying the leader of the pack data quality tool
Such a tool may have all kinds of data quality monitoring features, may integrate smoothly with ETL functionality and so on. For data matching it may use so-called match codes. With that approach we may expect the tool to find 7,500 duplicates, where 7,000 are true positives and 500 are false positives.
Costs may be:
- Tool license fee is 50,000 SSB
- Training fee is 7,000 SSB
- 80 hours of external consultancy at 125 SSB each = 10,000 SSB
- 60 IT hours for training and installation at 50 SSB each = 3,000 SSB
- 100 SME hours for training and configuration at 50 SSB each = 5,000 SSB
Total costs are 75,000 SSB.
Savings per campaign are 7,000 * 3 SSB – 500 * 5 SSB = 18,500 SSB.
A positive ROI will show up after the 5th campaign.
C: Buying an algorithm based dedupe tool
By using algorithm based data matching, such a tool may, depending on the threshold setting, find 9,100 duplicates, where 9,000 are true positives and 100 are false positives.
Costs may be:
- Tool license fee is 5,000 SSB
- 8 hours of external consultancy for a workshop at 125 SSB each = 1,000 SSB
- 15 SME hours for training, configuration and pushing the button at 50 SSB each = 750 SSB
Total costs are 6,750 SSB.
Savings per campaign are 9,000 * 3 SSB – 100 * 5 SSB = 26,500 SSB.
A remarkable ROI will show up in the 1st campaign.
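For readers who like to check the arithmetic, here is a small Python sketch recomputing the costs, savings per campaign and the number of campaigns needed to recoup the costs for all three options (note that option A's cost is labour that recurs for every campaign, while B and C are mostly one-off investments):

```python
# Recompute the three options from the figures above (all amounts in SSB).
TP_SAVING = 3   # saving per true positive (one avoided mail)
FP_COST = 5     # cost per false positive (one missed order, on average)

options = {
    # name: (costs as stated above, true positives found, false positives)
    "A: Excel + student": (343 * 15 + 15 * 50, 5_000, 0),
    "B: leader of the pack DQ tool": (50_000 + 7_000 + 80 * 125 + 60 * 50 + 100 * 50, 7_000, 500),
    "C: algorithm based dedupe tool": (5_000 + 8 * 125 + 15 * 50, 9_000, 100),
}

for name, (cost, tp, fp) in options.items():
    savings = tp * TP_SAVING - fp * FP_COST     # savings per campaign
    campaigns_to_recoup = -(-cost // savings)   # ceiling division
    print(f"{name}: costs {cost:,}, savings per campaign {savings:,}, "
          f"costs recouped after {campaigns_to_recoup} campaign(s)")
```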
So C is the correct answer? 😉
Dario, I think so if you are going for deduplication and consolidation of master data.
Option D:
1. Export the Excel rows to a CSV file.
2. Find a Linux or Perl or Python or Ruby hacker, or anyone with a Macintosh. You probably know a teenager that can do this.
3. Give them ten bucks. 😉
You need *millions* of rows before a commercial data quality tool, algorithm-based or otherwise, is competitive with tools like “bash / sort -u” or scripting languages on this simple de-duplication task. In fact, if your 100,000 rows came out of a relational database, I’d fire the guy who gave them to you *with* duplicates, since it’s a simple matter to use a DISTINCT operator on your query. 😉
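To illustrate, here is a minimal sketch in Python (file name and column layout are made up for the example) that drops exact duplicates the same way “sort -u” or a SELECT DISTINCT would:

```python
import csv

# Drop rows that are identical (ignoring case and surrounding whitespace)
# across all columns - the "sort -u" / SELECT DISTINCT class of duplicates.
# File names and column layout are made up for this example.
seen = set()
with open("prospects.csv", newline="") as src, \
     open("prospects_deduped.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        key = tuple(field.strip().lower() for field in row)
        if key not in seen:
            seen.add(key)
            writer.writerow(row)
```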
In reply to Mr M. Edward (Ed) Borasky:
I like the humorous approach to solving this problem!
On a more serious note, I’ve actually met many people in the business who solve similar problems with “homemade tools” like scripts… see below.
What fans of hand coding often forget is that the commercial DQ tools have a deep functionality stack needed to find fuzzy duplicates. It would take a lot of effort to hand code similar functionality.
As always it boils down to the business case where there are many more parameters than functionality. In Henrik’s example above I would also consider timeliness and the value of each duplicate found.
A marketing department would want results as fast as possible so as not to miss any ongoing campaign deadlines. They don’t want to wait for even a small development project to be finished.
A commercial DQ tool would deliver high quality results fast.
The second parameter I would consider in this case is the monetary value of each record. The extra cost of a commercial tool may well be covered by the value each record delivers when utilized in a business context.
I’m working with a bank that is having data quality issues with its contact information. For each contact they can’t reach because of faulty contact data they know they are losing around 3,000 euros. They can’t take any risks by using a homemade script-based solution.
So whether a commercial data quality tool is competitive or not depends on more factors than the number of rows. For some it’s competitive at 10M rows and for others at 50K rows.
Sorry for the long and rather serious answer! 🙂
Yes, commercial tools do have value in a large enterprise. And the game has changed – rather than “direct mail”, we have “digital body language”, “social CRM” and “business intelligence” – integrated tool sets with data quality as a component.
But a 100,000-row database with just names, addresses and phone numbers for a simple direct mail campaign is not an enterprise-scale problem. It’s something a couple of people can deal with in an afternoon.
There are 500 million Facebook users. Can you imagine both the joys and challenges of data quality in that environment? And can you further imagine hundreds of enterprises all sharing “the same Rolodex?” 😉
Thanks Ed for the comment.
The duplicates that hurt are those that can’t be solved with the SQL DISTINCT feature like:
Ed Smith, 1 Main Street, Newcastle
Edward Smith, 1 Main Str, New Castle
But you are right; you may come a long way with purging the exact ones.
In my calculations I counted only those that are not exact duplicates.
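To show why such records escape DISTINCT but can still be caught by algorithm based matching, here is a rough sketch using Python’s standard difflib; the normalization and the threshold value are illustrative only, not a recommendation:

```python
from difflib import SequenceMatcher

a = "Ed Smith, 1 Main Street, Newcastle"
b = "Edward Smith, 1 Main Str, New Castle"

def normalize(record: str) -> str:
    # Crude normalization: lowercase and keep only letters and digits.
    return "".join(ch for ch in record.lower() if ch.isalnum())

similarity = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
print(round(similarity, 2))   # a high similarity (about 0.87) for this pair

if similarity > 0.75:         # illustrative threshold
    print("possible duplicate - route to review or merge rules")
```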
1. I’m not by any stretch of the imagination an SQL guru, but I think your example above could be done in a “modern” relational database like PostgreSQL with the extended SQL languages these databases support.
2. Your example has all three fields differing between rows. A Perl script could easily flag duplicates in only one field for further human processing.
3. A “real” address database would have *standardized* addresses *with* postal codes! You wouldn’t have “1 Main Street, Newcastle” on one row and “1 Main Str, New Castle” on the second row!
By the way, as you may have guessed, I’m a Perl and PostgreSQL fan. It turns out that you can actually write database functions and stored procedures in PostgreSQL *with* Perl, so you can build arbitrarily complex data quality processing into your database, for the most part eliminating any need for an external tool. I’m not sure what’s available commercially using these technologies, though – I build my own tools. 😉
Ed, your wise suggestions are certainly worth an option D. What I have found is that the variations in what to do with this and that make the homegrown fixes a never-ending story. But again, you may come a long way. Thanks again for engaging.
Interesting discussion indeed! For the record I’m not interested in a technology battle, but I’m a little curious about what you actually can do with extended SQL. Do you have any examples of how to handle parsing and deduplication?
In response to 2: Any multiple cross-record and cross-field linkage (does that word exist?!) would require extensive human processing, and as we all know human processing is the number one reason for data quality problems in the first place, so we want to eliminate the human factor as much as possible. 🙂
In response to 3: Our Swedish addresses may be considered the best structured addresses in the world (my humble guess). You wouldn’t believe some of the messed-up name’n’address databases I’ve seen in my life, so you would definitely find addresses with conformity issues.
I’ve met some customers with in-house developed data quality tools. Many times they have done a great job, but when I’ve looked at the solution they have only about 20% of the functionality a commercial DQ tool has. But again: the tools may have been good enough for their purposes.
Sorry, last post (I hope!) 🙂
Deduping 100K rows is most likely not done in an afternoon. Maybe if you’re 99% sure that your address database is highly standardized and conformant with address standards, with no spelling errors whatsoever, but then you wouldn’t have many duplication issues either. You could solve that with SQL quite easily.
To know the state of the data you would have to profile it first. Afterwards you would know what challenges you have. Now, if the database turns out to be a mess, finding those fuzzy duplicates will take many, many afternoons. Another point is that it would be difficult to assign two resources to the task of manual deduplication. If they worked with 50K rows each they would run the risk of having two or several records linking between the two datasets. It’s hard enough finding fuzzy duplicates manually. How would they be able to keep track of fuzzy duplicates between themselves?
Involving even more human resources would be like pouring gasoline on the fire. 🙂
I would say that the cost/value of a commercial data quality tool is not related to the size of the company but to the value of its data. Some small companies have solid business cases for a commercial DQ solution and some large enterprises will never require any data quality initiatives at all.
By the way Ed: Maybe a little OT but your tweets are so funny!
Cheers//D