Thursday, June 30, 2011

How to handle Legacy Key of entity while migrating data into MDM Hub? (Linked In)

How to handle Legacy Key of entity while migrating data into MDM Hub? (Linked In)
My contribution to this discussion:

Steve has made some very good suggestions (best practices). I'd like to give my support to:
  1. Keep the source (You said "legacy"? Does this mean the source system is being retired?) system's Primary Keys (PK) in the MDM.
  2. Create a PK in the MDM using a surrogate value "surrogate key".
  3. No new rows or attribute values are to be created in the MDM that did not originate from the source system; that violates the principle of MDM.
  4. If an additional column in the MDM table for storing the source system's PK is not desirable, then you can, as Steve suggests, use another cross-reference table to store the source-to-MDM PK relationships. This can get messy if you need to do this for multiple sources or if the source system reuses its PK values - I have seen this happen.
I'd like to add that the MDM is NOT the place to do data cleansing. For example, if the "new" attribute values being created in the MDM are corrections for values received from the source, that is a bad practice. Corrections should be made in the source and then captured in the MDM either as complete overwrites or in a tracking (history) table of what the old values were and when they were effective.
Also, "downstream" components using the MDM data should only use whatever attributes they would use if getting their data directly from the source system. If the source system is using a surrogate PK value, then it is unlikely that users would know what it is. They should be using some "natural key" attribute or attributes that uniquely identify the rows. If they do know and use the source's PK values, that is the attribute they should continue using; the MDM's PK (if a surrogate value) should not be known to them. Since column and table names are sure to be different in the MDM than they are in the source, create views through which the users can access data using the source systems' table and column names.
For a dissertation on "surrogate keys", you might want to read http://itpro420.blogspot.com/2011/06/what-is-difference-between-surrogate.html.

Tuesday, June 28, 2011

Type 4 Slowly Changing Dimensions Design - Toolbox for IT Groups

Type 4 Slowly Changing Dimensions Design - Toolbox for IT Groups
I was just wondering in a type 4 SCD does the surrogate key ever change between the current table and the history table.
My response to this question is:

First of all, a good understanding of what a "surrogate key" really is is needed. For that, see my blog site at http://itpro420.blogspot.com/2011/06/what-is-difference-between-surrogate.html.
As I state in my blog, there really is no such thing as a "surrogate key"; the term is used to describe a Primary Key (PK) that uses a surrogate value.
If you follow C. J. Date's rules, every table would have a PK using a surrogate value, ergo a "surrogate key". But each table would also have a "natural key", or a column or group of columns that uniquely identify a row (enforced by a unique index).

If the source database system uses a "surrogate key" (PK in the source "current database") that is brought into the data warehouse (DW - a "temporal database"), that column's value will never change. However, the row in the DW should also have a PK that uses a surrogate value. In order to keep track of changes to specific columns in that row, audit columns must be used such as "start date", "end date", "current".
Read a good, IMHO, definition of "surrogate key" at http://en.wikipedia.org/wiki/Surrogate_key.
Now, an understanding of Slowly Changing Dimensions is needed. Since you mentioned Type 4 (SCD4), let me explain that this is nothing more than a Type 3 (SCD3) kept in a separate table. Therefore, SCD4 requires a "current dimension table" that contains only currently valid records and separate "historical dimension table" that contains the history of changes; SCD3 does this in one table. I have rarely seen SCD4 used because it requires more database space, management, and views to use by users to find the value that is either current or in use at a previous time. But, in either case, the DW PK that uses a surrogate value and the source's PK using a surrogate value will never change.

One question: in the IT field do you receive a job opportunity because of who you know or what you know? | LinkedIn

One question: in the IT field do you receive a job opportunity because of who you know or what you know? | LinkedIn

My response to this question:


Both "who" and "what" are important; it is not an "either or" situation. A person (friend, colleague, network contact, ...) - the "who" - can give you a "heads up" or maybe get your resume to a hiring manager, but it seems that most large companies (where the jobs are) now use recruiting firms to screen applicants. But if you don't meet the job's requirements - the "what" -, that won't get you the job and it will make your "who" look bad. Entry-level jobs in the US are few and far between. It seems most jobs (contract or FTE) are looking for mid-level experience (around 5 years) in specific skill sets. Certifications without work experience will only be better for you if you are competing against someone without either certifications or work experience - you have an advantage for an entry-level job. To get work experience, you can try to do pro bono work for charities, a friend's business, small businesses that need some small help; you can also try temp agencies, but they may require work experienced, too. This pro bono and/or temp work could lead to long-term contract or FTE recommendations. Good luck.

Thursday, June 23, 2011

What is the difference between Surrogate Key and Primary Key? | LinkedIn

What is the difference between Surrogate Key and Primary Key? | LinkedIn

Introduction

There is no difference to the meanings below due to the data model; these are relational database management system (RDBMS) concepts. Whether you are using a normalized model or a dimensional model, the "primary key", "surrogate key", "natural key", and "alternate key" terms all have the same meanings.
A Google search turned up lots of places to read up on keys, including these:

Primary Key

A Primary Key (PK) in a database table uniquely identifies a row in the table. There can be only one PK per table. Depending on the database engine being used, a PK may require the explicit creation of a unique index or the database engine may automatically create such an index for the column or columns identified as the PK column(s).
A primary key can consist of a single column using a "surrogate" value or be made up of one or more columns in the table that will uniquely identify a row (also called a "natural" or "meaningful" key).

Surrogate Key, aka Meaningless Key

The term "Surrogate Key" describes a particular type of Primary Key (PK). There really is no such thing as a "Surrogate Key". There are Primary Keys that use surrogate values. A PK that uses a surrogate value always consists of only one column.
The surrogate value is generally a system-generated value that nobody would know what it is to search for from a query. This PK may then be used as a foreign key (FK) in other tables and used for join conditions.
A Primary Key that uses a surrogate value is also known as a "meaningless" key. A table with only a "surrogate key" (primary key using an arbitrary value) must have an alternate key (AK) (see below). Without an AK, it is very likely that there will be "duplicate" rows in the table. An explanation of this follows below.

Natural Key, aka Meaningful Key

The other type of PK is called a "natural key". A natural (or meaningful) primary key is made up of one or more columns in a row that uniquely identify that row. The column or columns used for a natural key contain values that have meaning to the business.
If the natural key is the PK, then there can be no surrogate key. If the natural key is not the PK, it is an alternate key.

Alternate Key

While there can be only one primary key in a table, there may be more than one "alternate" keys in a table. These alternate keys are controlled by one or more unique indexes. A table with only a "surrogate key" (primary key using an arbitrary number) and no unique index for an alternate key is useless - you are doomed to have duplicates in rows for every column EXCEPT the surrogate key which does not help you find a unique row at all.
Generally, the columns used for the alternate key(s) are the columns that people will use in their queries to find a unique row, but the joins to other tables will be done using the single PK column comprising the "meaningless key" with the PK column being used as a FK column in those other tables.
The use of surrogate key values prevents the need to duplicate multiple columns in "child" tables just to be able to find the children of the "parent" record in a JOIN (or vice versa).

Impact of Primary Key Design

Let's examine the impact and implications of designing a table's primary key. Table_1, below, is a table that has a PK made up of a surrogate value, hence "Surrogate Key", and no AK.
Table_1: This table has a surrogate primary key (PK_ID) but no alternate key.
T1_PK_IDT1_COL1T1_COL2T1_COL3
1 A B C
2 A B C
3 A D E

While the three rows above are unique in the PK value (PK_ID column), rows 1 and 2 are really not unique rows, are they.
A table with only a "surrogate key" (PK using an arbitrary value) and no unique index for an alternate key (or the natural key) is useless - you are doomed to have duplicates in rows for every column EXCEPT the surrogate key which does not help you find a unique row at all.
If Table_1 were created with a unique index on the columns that make up an alternate key, it would look like Table_2 below.
Table_2: This table has a surrogate primary key (T2_PK_ID) plus an alternate key (T2_AK1, T2_AK2) enforced by a unique index.
T2_PK_ID T2_AK1 T2_AK2 T2_COL3
1 A B C
2 A D E
Generally, the alternate key columns are the columns that people will use in their queries to find a row, but the join to another table will be done using the single primary key column comprising the "meaningless key" which is a FK in the table being joined to. I'll show an example of this later.
A primary key can consist of a single "surrogate" value or be made up of one or more columns in the table that will uniquely identify a row (also called a meaningful key). If Table_1 were recreated using the natural key, it would look like Table_3 below.
Table_3: This table has a meaningful primary key (T3_PK1, T3_PK2).
T3_PK1 T3_PK2 T3_COL3
A B C
A D E
If Table_3 was joined to another table, that table would have to include all the columns making up the primary key in Table_3; these would be the foreign key (FK) columns. The use of surrogate key values as in Table_2 prevents the need to duplicate multiple columns in "child" tables just to be able to find the unique "parent" records in a JOIN, and vice versa.
As an example of what I stated above, let's assume the following table structure.
Table_4: This table has a surrogate primary key (T4_PK), a natural key (T4_AK1, T4_AK2), and foreign key columns (T3_PK1, T3_PK2) needed to join this table to Table_3 ."
T4_PK T4_AK1 T4_AK2 T4_COL3 T3_PK1 T3_PK2
100 AA BA CA A B
101 AA BB CB A D
A query to find a row in Table_4 and join it to Table_3 to get another column would look like this:
SELECT
  T4.T4_PK,
  T4.T4_AK1,
  T4.T4_AK2,
  T4.T4_COL3,
  T3.T3_COL3
FROM
  Table_4 "T4"
  INNER JOIN Table_3 "T3" ON
    T4.T3_PK1 = T3.T3_PK1 AND
    T4.T3_PK2 = T3.T3_PK2
WHERE
  T4.T4_AK1 = 'AA' AND
  T4.T4_AK2 = 'BA'
The result from the query above would be:
T4_PK                  T4_AK1 T4_AK2 T4_COL3 T3_COL3 
---------------------- ------ ------ ------- ------- 
100                    AA     BA     CA      C     
If this same query were modified to be run against Table_1, assuming T1_COL1 is the same as T3_PK1 and T1_COL2 is the same as T3_PK2, the query would look like this:
SELECT
  T4.T4_PK,
  T4.T4_AK1,
  T4.T4_AK2,
  T4.T4_COL3,
  T1.T1_COL3
FROM
  Table_4 "T4"
  INNER JOIN Table_1 "T1" ON
    T4.T3_PK1 = T1.T1_COL1 AND
    T4.T3_PK2 = T1.T1_COL2
WHERE
  T4.T4_AK1 = 'AA' AND
  T4.T4_AK2 = 'BA'
The result from the query above would be:
T4_PK                  T4_AK1 T4_AK2 T4_COL3 T1_COL3 
---------------------- ------ ------ ------- ------- 
100                    AA     BA     CA      C       
100                    AA     BA     CA      C 
Instead of a single row being returned, we now get two rows.
Remember we created a modified version of Table_1 giving it an AK as Table_2. If we run the same query above against Table_2, changing the table and column names in the JOIN condition, we see that the AK in Table_2 prevented the "duplicate" row and we get only one row returned.
SELECT
  T4.T4_PK,
  T4.T4_AK1,
  T4.T4_AK2,
  T4.T4_COL3,
  T2.T2_COL3
FROM
  Table_4 "T4"
  INNER JOIN Table_2 "T2" ON
    T4.T3_PK1 = T2.T2_AK1 AND
    T4.T3_PK2 = T2.T2_AK2
WHERE
  T4.T4_AK1 = 'AA' AND
  T4.T4_AK2 = 'BA'

Results:
T4_PK                  T4_AK1 T4_AK2 T4_COL3 T2_COL3 
---------------------- ------ ------ ------- ------- 
100                    AA     BA     CA      C       
The table below, Table_5, has a surrogate primary key (T5_PK), a natural key (T5_AK1, T5_AK2), and foreign key columns needed to join this table to Table_1 and Table_4. Since T1_PK_ID is a surrogate (meaningless) key, it is not expected that one would know what it is. Therefore, searches for a row would be based on other columns thought to uniquely identify a row, in this case T1_COL1, T1_COL2, and T1_COL3.
Table_5:
T5_PK T5_AK1 T5_AK2 T5_COL4 T4_PK T1_COL1 T1_COL2 T1_COL3
100 AA BA CA 100 A B C
101 AA BB CB 101 A D E
In the query below, we will try to get primary key value from Table_1; let's say we need to use it in another query. The expectation is a return of a single row, but, because the rows are not truly unique based on the other columns in Table_1, two rows are returned:
SELECT 
  Table_1.T1_PK_ID
FROM Table_5
INNER JOIN Table_4 ON
  Table_5.T4_PK = Table_4.T4_PK
INNER JOIN table_1 ON
  Table_5.t1_col1 = table_1.t1_col1
  AND Table_5.t1_col2 = Table_1.t1_col2
WHERE
Table_5.t1_col1 = 'A'
AND Table_5.t1_col2 = 'B'

Results:
T1_PK_ID               
---------------------- 
1                      
2                  
If we wanted to get the value of Table_1.T1_COL3 for the "unique" values in the supposed "key" values in T1_COL1 and T1_COL2 using the SQL below, we will get an error.
SELECT 
  T1_COL3
FROM 
  Table_1
WHERE
  t1_pk_id = (select t1_pk_id
    from table_1
    where
    t1_col1 = 'A'
    AND t1_col2 = 'B')


Error starting at line 1 in command:
SELECT 
  T1_COL3
FROM 
  Table_1
WHERE
  t1_pk_id = (select t1_pk_id
    from table_1
    where
    t1_col1 = 'A'
    AND t1_col2 = 'B')
Error report:
SQL Error: ORA-01427: single-row subquery returns more than one row
01427. 00000 -  "single-row subquery returns more than one row"
*Cause:    
*Action:

Conclusion

While this started out with a question asking about the difference between a "surrogate key" and a "primary key", it evolved into a dissertaion on primary keys and how they can be designed. You should have learned:
  1. There really is no such thing as a "surrogate key; there are primary keys using surrogate values."
  2. A "primary key" can be made up from
    • a single column using a surrogate value
    • one or more columns that have unique values and known business meanings
  3. A table with a PK using a single column with a surrogate value MUST have an alternate key (AK)
  4. There are impacts to SQL writing if the PK is designed wrong!
I could go on about the impact of primary keys in a data warehouse environment where multiple rows with the same primary key must be maintained for historical purposes (slowly changing dimension tables), but I won't! Aren't you happy?! This disssertation should give you a "basic" understanding of primary keys and I leave it up to you to apply this understanding in your data model designs.