Tuesday, July 12, 2011
Is the relational database doomed?
Have you ever been asked about document-oriented or key value (KV) data stores / databases? Read the article at the link above for a good background on them.
Start with data warehouse or datamart?
But, in the case where a brand new BI solution is being developed, here are my preferences:
- Design the BI solution for the future.
This means planning ahead so that whatever you start with, you are planning for future enterprise data standards, naming conventions, global uniqueness, data quality, and data governance. Do not operate in a business silo. - ROI is important.
While it is difficult to prove a return on investment for a non-tangible product such as BI, most business funders of a BI solution have a niche perspective they want results for quickly. Since building an Enterprise Data Warehouse (EDW) averages about 18 months, focused business data marts (DM) can be completed relatively quickly in six months or less. - Build both DW (to become EDW) and DM.
When starting a new BI solution, don't consider an "either / or" solution. A DM should get its data from the data warehouse (DW), which should be designed for the future EDW. So, bring the master and transactional data from the source-of-record (SOR) database system into a small DW first. This small DW could have only a few subject areas yet still be the beginning of the future EDW. From the small DW, which should be designed as a normalized model in at least 3NF, create the DM that is designed as star (dimensional) model. The advantage of this approach is that the SOR systems only need to populate a single DW (to become future EDW) and all DM created will be sourced from the same source of truth. Thus they will all start with the same data and this will reduce the chances of different metrics being reported from different DM.
Friday, July 1, 2011
Testing data warehouse - Toolbox for IT Groups
Testing data warehouse - Toolbox for IT Groups
My response to the posted question concerning QA testing of Data Warehouse ETL is:
More than just using independently developed SQL statements, I recommend writing separate ETL programs using another programming / scripting language such as Perl to mimic what the ETL programs are supposed to be doing.
Since the same effort will be needed by both the ETL development team and the QA development team, QA cannot be left to the end as, in many projects, a "necessary evil" at the last minute. Both development teams will need to be working independently to, hopefully, achieve the same results.
For example, I had a QA lead project where the ETL programs were written in Pro-C++ moving data from a mainframe DB2 extracted file set to an Oracle UNIX-based enterprise database. Using the same design specifications that the Pro-C developers used, I wrote Perl scripts to create UNIX-based "tables" (file stores) that were to mimic the Oracle target tables.
A complete set of data extract files were created from the DB2 source. Using the same DB2 extract files, the developers ran their programs to load the target tables and I ran mine to load the UNIX files. Then I extracted the data loaded from the Oracle tables and saved the extract to another set of UNIX files in the same format as my extract files. Using UNIX diff command, I compared the two sets of files.
By doing this, I revealed two glaring errors that the developers said would not have been caught using just a small sample set of data extracts:
First, the Pro-C programs were doing batch commits every 2,000 rows, but the program was not resetting the counter in the FOR loop. Thus, only the first 2,000 rows were getting written to the Oracle tables. If the source-to-target mapping predicted a million rows, the resulting 2,000 rows was a big problem!
After this was fixed, the next glaring error my Perl scripts caught was that the variable array in Pro-C was not being cleared properly between rows. Therefore, some fields from the previous row were not cleared and were duplicated for subsequent rows. Thus some customers records that did not supply certain fields, say an email address, got the data in those fields from the previous customer record.
Since these programs were written using the same design specifications, but independently, one of the checks was to give the code used to each other, so I checked their Pro-C code and they checked my Perl code to see where the problem was.
Simply sending a few hundred rows of mocked-up test data to the Oracle target and checking that it matched expectations was not adequate testing and would not have revealed these serious bugs.
Thursday, June 30, 2011
How to handle Legacy Key of entity while migrating data into MDM Hub? (Linked In)
My contribution to this discussion:
Steve has made some very good suggestions (best practices). I'd like to give my support to:
- Keep the source (You said "legacy"? Does this mean the source system is being retired?) system's Primary Keys (PK) in the MDM.
- Create a PK in the MDM using a surrogate value "surrogate key".
- No new rows or attribute values are to be created in the MDM that did not originate from the source system; that violates the principle of MDM.
- If an additional column in the MDM table for storing the source system's PK is not desirable, then you can, as Steve suggests, use another cross-reference table to store the source-to-MDM PK relationships. This can get messy if you need to do this for multiple sources or if the source system reuses its PK values - I have seen this happen.
Also, "downstream" components using the MDM data should only use whatever attributes they would use if getting their data directly from the source system. If the source system is using a surrogate PK value, then it is unlikely that users would know what it is. They should be using some "natural key" attribute or attributes that uniquely identify the rows. If they do know and use the source's PK values, that is the attribute they should continue using; the MDM's PK (if a surrogate value) should not be known to them. Since column and table names are sure to be different in the MDM than they are in the source, create views through which the users can access data using the source systems' table and column names.
For a dissertation on "surrogate keys", you might want to read http://itpro420.blogspot.com/2011/06/what-is-difference-between-surrogate.html.
Tuesday, June 28, 2011
Type 4 Slowly Changing Dimensions Design - Toolbox for IT Groups
I was just wondering in a type 4 SCD does the surrogate key ever change between the current table and the history table.
My response to this question is:
First of all, a good understanding of what a "surrogate key" really is is needed. For that, see my blog site at http://itpro420.blogspot.com/2011/06/what-is-difference-between-surrogate.html.
As I state in my blog, there really is no such thing as a "surrogate key"; the term is used to describe a Primary Key (PK) that uses a surrogate value.
If you follow C. J. Date's rules, every table would have a PK using a surrogate value, ergo a "surrogate key". But each table would also have a "natural key", or a column or group of columns that uniquely identify a row (enforced by a unique index).
If the source database system uses a "surrogate key" (PK in the source "current database") that is brought into the data warehouse (DW - a "temporal database"), that column's value will never change. However, the row in the DW should also have a PK that uses a surrogate value. In order to keep track of changes to specific columns in that row, audit columns must be used such as "start date", "end date", "current".
Read a good, IMHO, definition of "surrogate key" at http://en.wikipedia.org/wiki/Surrogate_key.
Now, an understanding of Slowly Changing Dimensions is needed. Since you mentioned Type 4 (SCD4), let me explain that this is nothing more than a Type 3 (SCD3) kept in a separate table. Therefore, SCD4 requires a "current dimension table" that contains only currently valid records and separate "historical dimension table" that contains the history of changes; SCD3 does this in one table. I have rarely seen SCD4 used because it requires more database space, management, and views to use by users to find the value that is either current or in use at a previous time. But, in either case, the DW PK that uses a surrogate value and the source's PK using a surrogate value will never change.
One question: in the IT field do you receive a job opportunity because of who you know or what you know? | LinkedIn
My response to this question:
Both "who" and "what" are important; it is not an "either or" situation. A person (friend, colleague, network contact, ...) - the "who" - can give you a "heads up" or maybe get your resume to a hiring manager, but it seems that most large companies (where the jobs are) now use recruiting firms to screen applicants. But if you don't meet the job's requirements - the "what" -, that won't get you the job and it will make your "who" look bad. Entry-level jobs in the US are few and far between. It seems most jobs (contract or FTE) are looking for mid-level experience (around 5 years) in specific skill sets. Certifications without work experience will only be better for you if you are competing against someone without either certifications or work experience - you have an advantage for an entry-level job. To get work experience, you can try to do pro bono work for charities, a friend's business, small businesses that need some small help; you can also try temp agencies, but they may require work experienced, too. This pro bono and/or temp work could lead to long-term contract or FTE recommendations. Good luck.
Thursday, June 23, 2011
What is the difference between Surrogate Key and Primary Key? | LinkedIn
Introduction
There is no difference to the meanings below due to the data model; these are relational database management system (RDBMS) concepts. Whether you are using a normalized model or a dimensional model, the "primary key", "surrogate key", "natural key", and "alternate key" terms all have the same meanings.A Google search turned up lots of places to read up on keys, including these:
- http://en.wikipedia.org/wiki/Unique_key
- http://en.wikipedia.org/wiki/Surrogate_key
- http://www.dbdebunk.com/page/page/622337.htm
- http://databases.about.com/cs/administration/g/primarykey.htm
- Methods for primary and secondary key selection (Data base management)
Primary Key
A Primary Key (PK) in a database table uniquely identifies a row in the table. There can be only one PK per table. Depending on the database engine being used, a PK may require the explicit creation of a unique index or the database engine may automatically create such an index for the column or columns identified as the PK column(s).A primary key can consist of a single column using a "surrogate" value or be made up of one or more columns in the table that will uniquely identify a row (also called a "natural" or "meaningful" key).
Surrogate Key, aka Meaningless Key
The term "Surrogate Key" describes a particular type of Primary Key (PK). There really is no such thing as a "Surrogate Key". There are Primary Keys that use surrogate values. A PK that uses a surrogate value always consists of only one column.The surrogate value is generally a system-generated value that nobody would know what it is to search for from a query. This PK may then be used as a foreign key (FK) in other tables and used for join conditions.
A Primary Key that uses a surrogate value is also known as a "meaningless" key. A table with only a "surrogate key" (primary key using an arbitrary value) must have an alternate key (AK) (see below). Without an AK, it is very likely that there will be "duplicate" rows in the table. An explanation of this follows below.
Natural Key, aka Meaningful Key
The other type of PK is called a "natural key". A natural (or meaningful) primary key is made up of one or more columns in a row that uniquely identify that row. The column or columns used for a natural key contain values that have meaning to the business.If the natural key is the PK, then there can be no surrogate key. If the natural key is not the PK, it is an alternate key.
Alternate Key
While there can be only one primary key in a table, there may be more than one "alternate" keys in a table. These alternate keys are controlled by one or more unique indexes. A table with only a "surrogate key" (primary key using an arbitrary number) and no unique index for an alternate key is useless - you are doomed to have duplicates in rows for every column EXCEPT the surrogate key which does not help you find a unique row at all.Generally, the columns used for the alternate key(s) are the columns that people will use in their queries to find a unique row, but the joins to other tables will be done using the single PK column comprising the "meaningless key" with the PK column being used as a FK column in those other tables.
The use of surrogate key values prevents the need to duplicate multiple columns in "child" tables just to be able to find the children of the "parent" record in a JOIN (or vice versa).
Impact of Primary Key Design
Let's examine the impact and implications of designing a table's primary key. Table_1, below, is a table that has a PK made up of a surrogate value, hence "Surrogate Key", and no AK.T1_PK_ID | T1_COL1 | T1_COL2 | T1_COL3 |
---|---|---|---|
1 | A | B | C |
2 | A | B | C |
3 | A | D | E |
While the three rows above are unique in the PK value (PK_ID column), rows 1 and 2 are really not unique rows, are they.
A table with only a "surrogate key" (PK using an arbitrary value) and no unique index for an alternate key (or the natural key) is useless - you are doomed to have duplicates in rows for every column EXCEPT the surrogate key which does not help you find a unique row at all.
If Table_1 were created with a unique index on the columns that make up an alternate key, it would look like Table_2 below.
T2_PK_ID | T2_AK1 | T2_AK2 | T2_COL3 |
---|---|---|---|
1 | A | B | C |
2 | A | D | E |
A primary key can consist of a single "surrogate" value or be made up of one or more columns in the table that will uniquely identify a row (also called a meaningful key). If Table_1 were recreated using the natural key, it would look like Table_3 below.
T3_PK1 | T3_PK2 | T3_COL3 |
---|---|---|
A | B | C |
A | D | E |
As an example of what I stated above, let's assume the following table structure.
T4_PK | T4_AK1 | T4_AK2 | T4_COL3 | T3_PK1 | T3_PK2 |
---|---|---|---|---|---|
100 | AA | BA | CA | A | B |
101 | AA | BB | CB | A | D |
SELECT T4.T4_PK, T4.T4_AK1, T4.T4_AK2, T4.T4_COL3, T3.T3_COL3 FROM Table_4 "T4" INNER JOIN Table_3 "T3" ON T4.T3_PK1 = T3.T3_PK1 AND T4.T3_PK2 = T3.T3_PK2 WHERE T4.T4_AK1 = 'AA' AND T4.T4_AK2 = 'BA'The result from the query above would be:
T4_PK T4_AK1 T4_AK2 T4_COL3 T3_COL3 ---------------------- ------ ------ ------- ------- 100 AA BA CA CIf this same query were modified to be run against Table_1, assuming T1_COL1 is the same as T3_PK1 and T1_COL2 is the same as T3_PK2, the query would look like this:
SELECT T4.T4_PK, T4.T4_AK1, T4.T4_AK2, T4.T4_COL3, T1.T1_COL3 FROM Table_4 "T4" INNER JOIN Table_1 "T1" ON T4.T3_PK1 = T1.T1_COL1 AND T4.T3_PK2 = T1.T1_COL2 WHERE T4.T4_AK1 = 'AA' AND T4.T4_AK2 = 'BA'The result from the query above would be:
T4_PK T4_AK1 T4_AK2 T4_COL3 T1_COL3 ---------------------- ------ ------ ------- ------- 100 AA BA CA C 100 AA BA CA CInstead of a single row being returned, we now get two rows.
Remember we created a modified version of Table_1 giving it an AK as Table_2. If we run the same query above against Table_2, changing the table and column names in the JOIN condition, we see that the AK in Table_2 prevented the "duplicate" row and we get only one row returned.
SELECT T4.T4_PK, T4.T4_AK1, T4.T4_AK2, T4.T4_COL3, T2.T2_COL3 FROM Table_4 "T4" INNER JOIN Table_2 "T2" ON T4.T3_PK1 = T2.T2_AK1 AND T4.T3_PK2 = T2.T2_AK2 WHERE T4.T4_AK1 = 'AA' AND T4.T4_AK2 = 'BA' Results: T4_PK T4_AK1 T4_AK2 T4_COL3 T2_COL3 ---------------------- ------ ------ ------- ------- 100 AA BA CA CThe table below, Table_5, has a surrogate primary key (T5_PK), a natural key (T5_AK1, T5_AK2), and foreign key columns needed to join this table to Table_1 and Table_4. Since T1_PK_ID is a surrogate (meaningless) key, it is not expected that one would know what it is. Therefore, searches for a row would be based on other columns thought to uniquely identify a row, in this case T1_COL1, T1_COL2, and T1_COL3.
T5_PK | T5_AK1 | T5_AK2 | T5_COL4 | T4_PK | T1_COL1 | T1_COL2 | T1_COL3 |
---|---|---|---|---|---|---|---|
100 | AA | BA | CA | 100 | A | B | C |
101 | AA | BB | CB | 101 | A | D | E |
SELECT Table_1.T1_PK_ID FROM Table_5 INNER JOIN Table_4 ON Table_5.T4_PK = Table_4.T4_PK INNER JOIN table_1 ON Table_5.t1_col1 = table_1.t1_col1 AND Table_5.t1_col2 = Table_1.t1_col2 WHERE Table_5.t1_col1 = 'A' AND Table_5.t1_col2 = 'B' Results: T1_PK_ID ---------------------- 1 2If we wanted to get the value of Table_1.T1_COL3 for the "unique" values in the supposed "key" values in T1_COL1 and T1_COL2 using the SQL below, we will get an error.
SELECT T1_COL3 FROM Table_1 WHERE t1_pk_id = (select t1_pk_id from table_1 where t1_col1 = 'A' AND t1_col2 = 'B') Error starting at line 1 in command: SELECT T1_COL3 FROM Table_1 WHERE t1_pk_id = (select t1_pk_id from table_1 where t1_col1 = 'A' AND t1_col2 = 'B') Error report: SQL Error: ORA-01427: single-row subquery returns more than one row 01427. 00000 - "single-row subquery returns more than one row" *Cause: *Action:
Conclusion
While this started out with a question asking about the difference between a "surrogate key" and a "primary key", it evolved into a dissertaion on primary keys and how they can be designed. You should have learned:- There really is no such thing as a "surrogate key; there are primary keys using surrogate values."
- A "primary key" can be made up from
- a single column using a surrogate value
- one or more columns that have unique values and known business meanings
- A table with a PK using a single column with a surrogate value MUST have an alternate key (AK)
- There are impacts to SQL writing if the PK is designed wrong!