Tuesday, July 12, 2011

Is the relational database doomed?

Have you ever been asked about document-oriented or key value (KV) data stores / databases?  Read the article at the link above for a good background on them.

Start with data warehouse or datamart?

This discussion on LinkedIn presents many ideas about how to start a Business Intelligence (BI) project when there is no existing data warehouse or data mart.  Few companies are without some sort of BI solution already, so I don't think this question will need to be answered very often.

But, in the case where a brand new BI solution is being developed, here are my preferences:
  1. Design the BI solution for the future.
    This means planning ahead so that whatever you start with accommodates future enterprise data standards, naming conventions, global uniqueness, data quality, and data governance.  Do not operate in a business silo.
  2. ROI is important.
    While it is difficult to prove a return on investment for a non-tangible product such as BI, most business funders of a BI solution have a niche perspective and want results for it quickly.  Since building an Enterprise Data Warehouse (EDW) averages about 18 months, deliver the quick wins with focused business data marts (DMs), which can be completed in six months or less.
  3. Build both DW (to become EDW) and DM.
    When starting a new BI solution, don't treat it as an "either / or" decision.  A DM should get its data from the data warehouse (DW), which should be designed to become the future EDW.  So, bring the master and transactional data from the source-of-record (SOR) database systems into a small DW first.  This small DW may cover only a few subject areas, yet it is still the beginning of the future EDW.  From the small DW, which should be designed as a normalized model in at least 3NF, create the DM as a star (dimensional) model.  The advantage of this approach is that the SOR systems only need to populate a single DW (the future EDW), and every DM created is sourced from that same source of truth.  All the DMs therefore start with the same data, which reduces the chances of different metrics being reported from different DMs.  A minimal sketch of this layering follows the list.
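
Below is a minimal sketch of that layering. All of the table and column names are hypothetical; the point is simply that the normalized (3NF) DW tables are fed from the SOR, and the star-schema DM is fed only from the DW.

-- Small DW: normalized (3NF) tables fed from the source-of-record (SOR) systems.
CREATE TABLE dw_customer (
  customer_id     NUMBER        PRIMARY KEY,        -- DW surrogate PK
  sor_customer_no VARCHAR2(20)  NOT NULL UNIQUE,    -- natural key carried from the SOR
  customer_name   VARCHAR2(100) NOT NULL
);

CREATE TABLE dw_order (
  order_id     NUMBER        PRIMARY KEY,
  customer_id  NUMBER        NOT NULL REFERENCES dw_customer (customer_id),
  order_date   DATE          NOT NULL,
  order_amount NUMBER(12,2)  NOT NULL
);

-- DM: star (dimensional) model sourced only from the DW, never directly from the SOR.
CREATE TABLE dm_customer_dim (
  customer_key    NUMBER        PRIMARY KEY,        -- DM surrogate key
  sor_customer_no VARCHAR2(20)  NOT NULL,
  customer_name   VARCHAR2(100) NOT NULL
);

CREATE TABLE dm_sales_fact (
  customer_key NUMBER       NOT NULL REFERENCES dm_customer_dim (customer_key),
  order_date   DATE         NOT NULL,
  order_amount NUMBER(12,2) NOT NULL
);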

Friday, July 1, 2011

Testing data warehouse - Toolbox for IT Groups

My response to the posted question concerning QA testing of Data Warehouse ETL is:


Rather than just using independently developed SQL statements, I recommend writing separate ETL programs in another programming or scripting language, such as Perl, that mimic what the ETL programs are supposed to be doing.

Since the same effort will be needed from both the ETL development team and the QA development team, QA cannot be left to the end as a last-minute "necessary evil", as it is in many projects. Both teams will need to work independently and, hopefully, arrive at the same results.

For example, I was the QA lead on a project where the ETL programs were written in Pro*C/C++, moving data from a mainframe DB2 extract file set to a UNIX-based Oracle enterprise database. Using the same design specifications that the Pro*C developers used, I wrote Perl scripts to create UNIX-based "tables" (file stores) that mimicked the Oracle target tables.

A complete set of data extract files was created from the DB2 source. Using the same DB2 extract files, the developers ran their programs to load the target tables and I ran mine to load the UNIX files. Then I extracted the data loaded into the Oracle tables and saved it to another set of UNIX files in the same format as my extract files. Using the UNIX diff command, I compared the two sets of files.

By doing this, I revealed two glaring errors that the developers said would not have been caught using just a small sample set of data extracts:

First, the Pro*C programs were doing batch commits every 2,000 rows, but the program was not resetting the counter in the FOR loop. Thus, only the first 2,000 rows were getting written to the Oracle tables. If the source-to-target mapping predicted a million rows, a result of 2,000 rows was a big problem!

After this was fixed, the next glaring error my Perl scripts caught was that the variable array in Pro*C was not being cleared properly between rows. Therefore, some fields from the previous row were not cleared and were carried over to subsequent rows. Thus, customer records that did not supply certain fields, say an email address, got the data in those fields from the previous customer record.

Since these programs were written from the same design specifications, but independently, one of the checks was to exchange code: I checked their Pro*C code and they checked my Perl code to see where each problem was.

Simply sending a few hundred rows of mocked-up test data to the Oracle target and checking that it matched expectations was not adequate testing and would not have revealed these serious bugs.
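
As a side note, if writing parallel programs and diffing files is not feasible, a similar row-level reconciliation can be done in SQL once both the expected and actual data are in the database. This is not the file-diff approach described above, just a sketch of the same idea, assuming a hypothetical QA-loaded staging table (QA_EXPECTED_CUSTOMER) that mirrors the structure of the Oracle target (CUSTOMER):

-- Rows the ETL loaded that QA did not expect
SELECT * FROM customer
MINUS
SELECT * FROM qa_expected_customer;

-- Rows QA expected that the ETL never loaded
-- (a truncated load, like the 2,000-row commit bug, shows up here)
SELECT * FROM qa_expected_customer
MINUS
SELECT * FROM customer;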

Thursday, June 30, 2011

How to handle Legacy Key of entity while migrating data into MDM Hub? (Linked In)

My contribution to this discussion:

Steve has made some very good suggestions (best practices). I'd like to give my support to:
  1. Keep the source system's Primary Keys (PK) in the MDM. (You said "legacy"? Does this mean the source system is being retired?)
  2. Create a PK in the MDM using a surrogate value (a "surrogate key").
  3. No new rows or attribute values are to be created in the MDM that did not originate from the source system; that violates the principle of MDM.
  4. If an additional column in the MDM table for storing the source system's PK is not desirable, then you can, as Steve suggests, use another cross-reference table to store the source-to-MDM PK relationships. This can get messy if you need to do this for multiple sources or if the source system reuses its PK values - I have seen this happen.
I'd like to add that the MDM is NOT the place to do data cleansing. For example, if the "new" attribute values being created in the MDM are corrections for values received from the source, that is a bad practice. Corrections should be made in the source and then captured in the MDM either as complete overwrites or in a tracking (history) table of what the old values were and when they were effective.
Also, "downstream" components using the MDM data should only use whatever attributes they would use if getting their data directly from the source system. If the source system is using a surrogate PK value, then it is unlikely that users would know what it is. They should be using some "natural key" attribute or attributes that uniquely identify the rows. If they do know and use the source's PK values, that is the attribute they should continue using; the MDM's PK (if a surrogate value) should not be known to them. Since column and table names are sure to be different in the MDM than they are in the source, create views through which the users can access data using the source systems' table and column names.
For a dissertation on "surrogate keys", you might want to read http://itpro420.blogspot.com/2011/06/what-is-difference-between-surrogate.html.

Tuesday, June 28, 2011

Type 4 Slowly Changing Dimensions Design - Toolbox for IT Groups

I was just wondering, in a Type 4 SCD, does the surrogate key ever change between the current table and the history table?
My response to this question is:

First of all, you need a good understanding of what a "surrogate key" really is. For that, see my blog site at http://itpro420.blogspot.com/2011/06/what-is-difference-between-surrogate.html.
As I state in my blog, there really is no such thing as a "surrogate key"; the term is used to describe a Primary Key (PK) that uses a surrogate value.
If you follow C. J. Date's rules, every table would have a PK using a surrogate value, ergo a "surrogate key". But each table would also have a "natural key", or a column or group of columns that uniquely identify a row (enforced by a unique index).

If the source database system uses a "surrogate key" (the PK in the source "current database") and that column is brought into the data warehouse (DW - a "temporal database"), that column's value will never change. However, the row in the DW should also have its own PK that uses a surrogate value. To keep track of changes to specific columns in that row, audit columns such as "start date", "end date", and a "current" flag must be used.
Read a good, IMHO, definition of "surrogate key" at http://en.wikipedia.org/wiki/Surrogate_key.
Now, an understanding of Slowly Changing Dimensions (SCD) is needed. Since you mentioned Type 4 (SCD4), let me explain that this is nothing more than a Type 3 (SCD3) kept in a separate table. SCD4 requires a "current dimension table" that contains only the currently valid records and a separate "historical dimension table" that contains the history of changes; SCD3 does this in one table. I have rarely seen SCD4 used because it requires more database space, more management, and views for users to find the value that is either current or that was in effect at a previous time. But, in either case, the DW PK that uses a surrogate value and the source's PK that uses a surrogate value will never change.
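To make the current/history pairing concrete, here is a minimal sketch with hypothetical table and column names, using the audit columns mentioned above. Note that the surrogate key value is the same in both tables; only the audit columns track the changes:

-- Current dimension table: one row per entity, currently valid values only
CREATE TABLE customer_dim_current (
  customer_key    NUMBER        PRIMARY KEY,      -- DW PK using a surrogate value; never changes
  src_customer_id NUMBER        NOT NULL UNIQUE,  -- source system's PK; never changes
  email_address   VARCHAR2(100)
);

-- Historical dimension table: one row per change, same surrogate key value
CREATE TABLE customer_dim_history (
  customer_key  NUMBER        NOT NULL REFERENCES customer_dim_current (customer_key),
  start_date    DATE          NOT NULL,
  end_date      DATE,
  current_flag  CHAR(1)       NOT NULL,
  email_address VARCHAR2(100),
  CONSTRAINT pk_customer_dim_history PRIMARY KEY (customer_key, start_date)
);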

One question: in the IT field do you receive a job opportunity because of who you know or what you know? | LinkedIn

My response to this question:


Both "who" and "what" are important; it is not an "either or" situation. A person (friend, colleague, network contact, ...) - the "who" - can give you a "heads up" or maybe get your resume to a hiring manager, but it seems that most large companies (where the jobs are) now use recruiting firms to screen applicants. But if you don't meet the job's requirements - the "what" -, that won't get you the job and it will make your "who" look bad. Entry-level jobs in the US are few and far between. It seems most jobs (contract or FTE) are looking for mid-level experience (around 5 years) in specific skill sets. Certifications without work experience will only be better for you if you are competing against someone without either certifications or work experience - you have an advantage for an entry-level job. To get work experience, you can try to do pro bono work for charities, a friend's business, small businesses that need some small help; you can also try temp agencies, but they may require work experienced, too. This pro bono and/or temp work could lead to long-term contract or FTE recommendations. Good luck.

Thursday, June 23, 2011

What is the difference between Surrogate Key and Primary Key? | LinkedIn

Introduction

The meanings below do not change with the data model; these are relational database management system (RDBMS) concepts. Whether you are using a normalized model or a dimensional model, the terms "primary key", "surrogate key", "natural key", and "alternate key" all have the same meanings.
A Google search will turn up lots of places to read more about keys.

Primary Key

A Primary Key (PK) in a database table uniquely identifies a row in the table. There can be only one PK per table. Depending on the database engine being used, a PK may require the explicit creation of a unique index or the database engine may automatically create such an index for the column or columns identified as the PK column(s).
A primary key can consist of a single column using a "surrogate" value or be made up of one or more columns in the table that will uniquely identify a row (also called a "natural" or "meaningful" key).

Surrogate Key, aka Meaningless Key

The term "Surrogate Key" describes a particular type of Primary Key (PK). There really is no such thing as a "Surrogate Key". There are Primary Keys that use surrogate values. A PK that uses a surrogate value always consists of only one column.
The surrogate value is generally a system-generated value that nobody would know to search for in a query. This PK may then be used as a foreign key (FK) in other tables and used in join conditions.
A Primary Key that uses a surrogate value is also known as a "meaningless" key. A table with only a "surrogate key" (primary key using an arbitrary value) must have an alternate key (AK) (see below). Without an AK, it is very likely that there will be "duplicate" rows in the table. An explanation of this follows below.

Natural Key, aka Meaningful Key

The other type of PK is called a "natural key". A natural (or meaningful) primary key is made up of one or more columns in a row that uniquely identify that row. The column or columns used for a natural key contain values that have meaning to the business.
If the natural key is the PK, then there can be no surrogate key. If the natural key is not the PK, it is an alternate key.

Alternate Key

While there can be only one primary key in a table, there may be more than one "alternate" key in a table. These alternate keys are enforced by one or more unique indexes. A table with only a "surrogate key" (a primary key using an arbitrary value) and no unique index for an alternate key is useless - you are doomed to have duplicates across every column EXCEPT the surrogate key, which does not help you find a unique row at all.
Generally, the columns used for the alternate key(s) are the columns that people will use in their queries to find a unique row, but the joins to other tables will be done using the single PK column comprising the "meaningless key" with the PK column being used as a FK column in those other tables.
The use of surrogate key values prevents the need to duplicate multiple columns in "child" tables just to be able to find the children of the "parent" record in a JOIN (or vice versa).
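As a rough illustration, here is some hypothetical DDL (the table and column names are made up) showing a PK using a surrogate value, an alternate key enforced by a unique constraint, and a child table that carries only the single surrogate column as its FK:

CREATE TABLE product (
  product_id   NUMBER        PRIMARY KEY,   -- PK using a surrogate value ("meaningless key")
  product_code VARCHAR2(20)  NOT NULL,
  vendor_code  VARCHAR2(20)  NOT NULL,
  product_name VARCHAR2(100),
  CONSTRAINT ak_product UNIQUE (product_code, vendor_code)   -- the alternate (natural) key
);

CREATE TABLE order_line (
  order_line_id NUMBER PRIMARY KEY,
  product_id    NUMBER NOT NULL REFERENCES product (product_id),  -- single FK column, no duplicated AK columns
  quantity      NUMBER NOT NULL
);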

Impact of Primary Key Design

Let's examine the impact and implications of designing a table's primary key. Table_1, below, is a table that has a PK made up of a surrogate value, hence "Surrogate Key", and no AK.
Table_1: This table has a surrogate primary key (T1_PK_ID) but no alternate key.
T1_PK_ID T1_COL1 T1_COL2 T1_COL3
1 A B C
2 A B C
3 A D E

While the three rows above are unique in the PK value (the T1_PK_ID column), rows 1 and 2 are really not unique rows, are they?
A table with only a "surrogate key" (PK using an arbitrary value) and no unique index for an alternate key (or the natural key) is useless - you are doomed to have duplicates in rows for every column EXCEPT the surrogate key which does not help you find a unique row at all.
If Table_1 were created with a unique index on the columns that make up an alternate key, it would look like Table_2 below.
Table_2: This table has a surrogate primary key (T2_PK_ID) plus an alternate key (T2_AK1, T2_AK2) enforced by a unique index.
T2_PK_ID T2_AK1 T2_AK2 T2_COL3
1 A B C
2 A D E
Generally, the alternate key columns are the columns that people will use in their queries to find a row, but the join to another table will be done using the single primary key column comprising the "meaningless key" which is a FK in the table being joined to. I'll show an example of this later.
A primary key can consist of a single "surrogate" value or be made up of one or more columns in the table that will uniquely identify a row (also called a meaningful key). If Table_1 were recreated using the natural key, it would look like Table_3 below.
Table_3: This table has a meaningful primary key (T3_PK1, T3_PK2).
T3_PK1 T3_PK2 T3_COL3
A B C
A D E
If Table_3 was joined to another table, that table would have to include all the columns making up the primary key in Table_3; these would be the foreign key (FK) columns. The use of surrogate key values as in Table_2 prevents the need to duplicate multiple columns in "child" tables just to be able to find the unique "parent" records in a JOIN, and vice versa.
As an example of what I stated above, let's assume the following table structure.
Table_4: This table has a surrogate primary key (T4_PK), a natural key (T4_AK1, T4_AK2), and foreign key columns (T3_PK1, T3_PK2) needed to join this table to Table_3.
T4_PK T4_AK1 T4_AK2 T4_COL3 T3_PK1 T3_PK2
100 AA BA CA A B
101 AA BB CB A D
A query to find a row in Table_4 and join it to Table_3 to get another column would look like this:
SELECT
  T4.T4_PK,
  T4.T4_AK1,
  T4.T4_AK2,
  T4.T4_COL3,
  T3.T3_COL3
FROM
  Table_4 "T4"
  INNER JOIN Table_3 "T3" ON
    T4.T3_PK1 = T3.T3_PK1 AND
    T4.T3_PK2 = T3.T3_PK2
WHERE
  T4.T4_AK1 = 'AA' AND
  T4.T4_AK2 = 'BA'
The result from the query above would be:
T4_PK                  T4_AK1 T4_AK2 T4_COL3 T3_COL3 
---------------------- ------ ------ ------- ------- 
100                    AA     BA     CA      C     
If this same query were modified to be run against Table_1, assuming T1_COL1 is the same as T3_PK1 and T1_COL2 is the same as T3_PK2, the query would look like this:
SELECT
  T4.T4_PK,
  T4.T4_AK1,
  T4.T4_AK2,
  T4.T4_COL3,
  T1.T1_COL3
FROM
  Table_4 "T4"
  INNER JOIN Table_1 "T1" ON
    T4.T3_PK1 = T1.T1_COL1 AND
    T4.T3_PK2 = T1.T1_COL2
WHERE
  T4.T4_AK1 = 'AA' AND
  T4.T4_AK2 = 'BA'
The result from the query above would be:
T4_PK                  T4_AK1 T4_AK2 T4_COL3 T1_COL3 
---------------------- ------ ------ ------- ------- 
100                    AA     BA     CA      C       
100                    AA     BA     CA      C 
Instead of a single row being returned, we now get two rows.
Remember we created a modified version of Table_1 giving it an AK as Table_2. If we run the same query above against Table_2, changing the table and column names in the JOIN condition, we see that the AK in Table_2 prevented the "duplicate" row and we get only one row returned.
SELECT
  T4.T4_PK,
  T4.T4_AK1,
  T4.T4_AK2,
  T4.T4_COL3,
  T2.T2_COL3
FROM
  Table_4 "T4"
  INNER JOIN Table_2 "T2" ON
    T4.T3_PK1 = T2.T2_AK1 AND
    T4.T3_PK2 = T2.T2_AK2
WHERE
  T4.T4_AK1 = 'AA' AND
  T4.T4_AK2 = 'BA'

Results:
T4_PK                  T4_AK1 T4_AK2 T4_COL3 T2_COL3 
---------------------- ------ ------ ------- ------- 
100                    AA     BA     CA      C       
The table below, Table_5, has a surrogate primary key (T5_PK), a natural key (T5_AK1, T5_AK2), and foreign key columns needed to join this table to Table_1 and Table_4. Since T1_PK_ID is a surrogate (meaningless) key, it is not expected that one would know what it is. Therefore, searches for a row would be based on other columns thought to uniquely identify a row, in this case T1_COL1, T1_COL2, and T1_COL3.
Table_5:
T5_PK T5_AK1 T5_AK2 T5_COL4 T4_PK T1_COL1 T1_COL2 T1_COL3
100 AA BA CA 100 A B C
101 AA BB CB 101 A D E
In the query below, we will try to get the primary key value from Table_1; let's say we need to use it in another query. The expectation is that a single row will be returned, but, because the rows in Table_1 are not truly unique based on its other columns, two rows are returned:
SELECT
  Table_1.T1_PK_ID
FROM
  Table_5
  INNER JOIN Table_4 ON
    Table_5.T4_PK = Table_4.T4_PK
  INNER JOIN Table_1 ON
    Table_5.T1_COL1 = Table_1.T1_COL1 AND
    Table_5.T1_COL2 = Table_1.T1_COL2
WHERE
  Table_5.T1_COL1 = 'A' AND
  Table_5.T1_COL2 = 'B'

Results:
T1_PK_ID               
---------------------- 
1                      
2                  
If we want to get the value of Table_1.T1_COL3 for the supposedly "unique" values in the presumed "key" columns T1_COL1 and T1_COL2 using the SQL below, we get an error.
SELECT 
  T1_COL3
FROM 
  Table_1
WHERE
  t1_pk_id = (select t1_pk_id
    from table_1
    where
    t1_col1 = 'A'
    AND t1_col2 = 'B')


Error starting at line 1 in command:
SELECT 
  T1_COL3
FROM 
  Table_1
WHERE
  t1_pk_id = (select t1_pk_id
    from table_1
    where
    t1_col1 = 'A'
    AND t1_col2 = 'B')
Error report:
SQL Error: ORA-01427: single-row subquery returns more than one row
01427. 00000 -  "single-row subquery returns more than one row"
*Cause:    
*Action:
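
For contrast, the same lookup against Table_2, where the alternate key is enforced by a unique index, succeeds because the subquery can only return a single row:

SELECT 
  T2_COL3
FROM 
  Table_2
WHERE
  t2_pk_id = (select t2_pk_id
    from table_2
    where
    t2_ak1 = 'A'
    AND t2_ak2 = 'B')

Results:
T2_COL3 
------- 
C       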

Conclusion

While this started out with a question asking about the difference between a "surrogate key" and a "primary key", it evolved into a dissertation on primary keys and how they can be designed. You should have learned:
  1. There really is no such thing as a "surrogate key"; there are primary keys using surrogate values.
  2. A "primary key" can be made up from
    • a single column using a surrogate value
    • one or more columns that have unique values and known business meanings
  3. A table with a PK using a single column with a surrogate value MUST have an alternate key (AK)
  4. There are impacts to SQL writing if the PK is designed wrong!
I could go on about the impact of primary keys in a data warehouse environment where multiple rows with the same natural (business) key must be maintained for historical purposes (slowly changing dimension tables), but I won't! Aren't you happy?! This dissertation should give you a "basic" understanding of primary keys, and I leave it up to you to apply this understanding in your data model designs.