Thursday, June 30, 2011

How to handle Legacy Key of entity while migrating data into MDM Hub? (Linked In)

How to handle Legacy Key of entity while migrating data into MDM Hub? (Linked In)
My contribution to this discussion:

Steve has made some very good suggestions (best practices). I'd like to give my support to:
  1. Keep the source (You said "legacy"? Does this mean the source system is being retired?) system's Primary Keys (PK) in the MDM.
  2. Create a PK in the MDM using a surrogate value "surrogate key".
  3. No new rows or attribute values are to be created in the MDM that did not originate from the source system; that violates the principle of MDM.
  4. If an additional column in the MDM table for storing the source system's PK is not desirable, then you can, as Steve suggests, use another cross-reference table to store the source-to-MDM PK relationships. This can get messy if you need to do this for multiple sources or if the source system reuses its PK values - I have seen this happen.
I'd like to add that the MDM is NOT the place to do data cleansing. For example, if the "new" attribute values being created in the MDM are corrections for values received from the source, that is a bad practice. Corrections should be made in the source and then captured in the MDM either as complete overwrites or in a tracking (history) table of what the old values were and when they were effective.
Also, "downstream" components using the MDM data should only use whatever attributes they would use if getting their data directly from the source system. If the source system is using a surrogate PK value, then it is unlikely that users would know what it is. They should be using some "natural key" attribute or attributes that uniquely identify the rows. If they do know and use the source's PK values, that is the attribute they should continue using; the MDM's PK (if a surrogate value) should not be known to them. Since column and table names are sure to be different in the MDM than they are in the source, create views through which the users can access data using the source systems' table and column names.
For a dissertation on "surrogate keys", you might want to read

No comments:

Post a Comment