When to cleanse in Data Migration?

About a year ago I posted a question on DataMigrationPro about when it is the best time for executing data cleansing in a migration project. Is it:

  • Before the migration?
  • During the migration?
  • After the migration?

In a recent excellent blog post by Dalton Cervo he explains some of the points considered to this question in a particular MDM migration project.

As I am going to prepare a speech including this subject I will be very pleased to receive additional considerations made on this matter. Please comment here or on DataMigrationPro.


Bookmark and Share

5 thoughts on “When to cleanse in Data Migration?

  1. Tibor Bossanyi 2nd September 2009 / 22:44


    in my opinion: cleansing in the legacy system means a separate project. –> more cost + effort
    during the migration: here I would put into 2 parts: before the mapping or after the mapping. After my experience a hibrid solution is also acceptable: the mapping can filter the metadata problems (simple syntax checking), and after the mapping we can clean the data which is already in the new structure.
    That means the iterations enhances not only the mapping quality, but also the data (semantic) quality.

  2. Olivier Mathurin 3rd September 2009 / 10:01

    In a Customer-Hub service-oriented implementation using a matching engine, we simply migrated the data using the production interface, meaning that we passed legacy data one-by-one to the Hub through a messaging interface. It was a good opportunity for:

    1) Saving costs and time by:
    a- Not implementing a dedicated migration architecture that would have been thrown away after the initial load of the CDH (licenses+design+infrastructure+implementation+etc.)

    b- Not having to implement twice validation and matching rules in two different tools (ETL and Match Engine).

    2) Testing and optimizing our validation rules and our matching rules with real and tricky duplicate records, enhancing the overall quality controls of the Hub,

    3) Testing the infrastructure on high volumes,

    Our volumetry was less than 1M recordsengaging 3 legacy-systems, approx. 15% of duplicate records, and we had to test our procedure several times. But it worked well.

  3. Dylan Jones 3rd September 2009 / 10:31

    Thanks for reminding me about this Henrik, I’ll add some comments on the forum too.

    The simple answer is – it depends.

    The key here is that you simply cannot take a universal standpoint and say “We will cleanse in the target (after the migration”. This sounds bizarre but I see it in so many migrations, companies are unwilling to foot the bill for DQ but they end up paying far more in delays or failure.

    The point is that irrespective of where you resolve DQ defects you still need to assess data quality thoroughly right at the start of the project.

    So many people view DQ as data migration testing, they create their mapping logic and then do some basic testing close to runtime, only when they execute live do they hit the problems.

    By performing a DQ assessment up front you will be able to classify the various defects as…

    a) Must be resolved in the legacy system
    b) Cannot be resolved in legacy so must be resolved in a staging area
    c) Can be resolved “in-flight” using DQ software or ETL tools etc.
    d) Can be resolved in the target or some further downstream processing

    Remember also that some data defects mean that the data won’t physically migrate at all, another danger point.

    Hope that helps.

  4. Alan Snow 4th September 2009 / 14:53

    In my view, as Dylan says, ‘It depends’, however I have been involved in several legacy to ERP migrations and I know what often happens. In many cases, responsibility for data quality (let alone defining what is ‘acceptable’ DQ), is not explicitly assigned. The result is that Migration Teams, who usually have to migrate to a strict timetable, don’t allow enough time for assessing DQ and making a plan on how to address the issues uncovered. The result is that they ‘do the best they can in the allotted time’, but this usually means that data is incorrectly migrated (or not migrated at all). It may be several weeks or months before the knowledge workers raise the alarm (that data in the new system is missing or incorrect), and by that time the old legacy system may have been switched out, and its data is not easily accessible. Result: a new system with lower data quality than the old one. All you succeed in doing is getting the wrong answer faster! The moral is: NEVER wait until after the migration, its too late then to recover.
    Should data always be ‘corrected at source’? I don’t think so, sometimes it is not viable due to unclear business rules or relationships in legacy systems, and if you are migrating soon, what’s the point? It will be a lot of effort and expense for a short-lived benefit. Ultimately a pragmatic balance has to be found between the reletive cost and practicality of addressing DQ at source, and addressing it further downstream.

  5. Henrik Liliendahl Sørensen 2nd September 2010 / 21:23

    One year later (shame on me): Thanks for the input.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s