
Wednesday, April 18, 2012

Comparative EDW Architecture Seminar by Bill Inmon

I attended a "Comparative EDW Architecture" seminar put on by Bill Inmon in Golden, CO, from 4/4/2012 to 4/6/2012.  The purpose of the seminar was to explain the different DW philosophies of Inmon and Kimball and how they can not only be used together, but actually complement each other.

Sadly, no one from the "official" Kimball organization accepted Bill's invitation to present at the conference.  Scott Hirst of Ultra Solutions, Inc., did attend and presented the Kimball DW architecture philosophy very well.  Thanks, Scott.

The seminar presented Inmon's latest DW 2.0 architecture philosophy and Kimball's latest Dimensional architecture, and discussed how the two can be used to complement each other.

The DW 2.0 architecture was compared to a city's architecture, and Kimball's Dimensional model was compared to a house's architecture.  The DW 2.0 architecture (normalized to at least third normal form, 3NF) provides the enterprise data warehouse (EDW) foundation upon which the Dimensional architecture is used to build the data marts needed for OLAP and reporting.

Of course, the best part of any seminar like this is the chance to network with others and learn from shared experiences.


Tuesday, July 12, 2011

Start with a data warehouse or a data mart?

This discussion on LinkedIn presents many ideas about how to start a Business Intelligence (BI) project when there is no existing data warehouse or data mart.  Most companies already have some sort of BI solution in place, so this question will not need to be answered very often.

But, in the case where a brand new BI solution is being developed, here are my preferences:
  1. Design the BI solution for the future.
    This means planning ahead so that whatever you start with, you are planning for future enterprise data standards, naming conventions, global uniqueness, data quality, and data governance.  Do not operate in a business silo.
  2. ROI is important.
    While it is difficult to prove a return on investment for a non-tangible product such as BI, most business funders of a BI solution have a niche perspective they want results for quickly.  Building an Enterprise Data Warehouse (EDW) averages about 18 months, whereas a focused business data mart (DM) can be completed relatively quickly, in six months or less.
  3. Build both DW (to become EDW) and DM.
    When starting a new BI solution, don't consider it an "either / or" decision.  A DM should get its data from the data warehouse (DW), which should be designed to become the future EDW.  So, bring the master and transactional data from the source-of-record (SOR) database system into a small DW first.  This small DW may have only a few subject areas, yet it is still the beginning of the future EDW.  From the small DW, which should be designed as a normalized model in at least 3NF, create the DM, which is designed as a star (dimensional) model (a minimal sketch follows this list).  The advantage of this approach is that the SOR systems only need to populate a single DW (the future EDW), and all DMs created will be sourced from the same source of truth.  They will all start with the same data, which reduces the chances of different metrics being reported from different DMs.
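
To make that concrete, here is a minimal sketch, in Python with SQLite and purely hypothetical table names, of SOR data landing in a small normalized DW and a star-schema DM then being derived from that DW rather than directly from the source:

    # Hypothetical sketch: a normalized (3NF) DW is loaded from the SOR,
    # and a star-schema data mart is derived from the DW alone.
    import sqlite3

    dw = sqlite3.connect(":memory:")   # stands in for the small DW / future EDW

    # Normalized (3NF) DW tables, loaded from the source-of-record system
    dw.executescript("""
    CREATE TABLE dw_customer (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE dw_order    (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
                              order_date TEXT, amount REAL,
                              FOREIGN KEY (customer_id) REFERENCES dw_customer(customer_id));
    """)
    dw.executemany("INSERT INTO dw_customer VALUES (?,?,?)",
                   [(1, "Acme", "West"), (2, "Globex", "East")])
    dw.executemany("INSERT INTO dw_order VALUES (?,?,?,?)",
                   [(10, 1, "2011-07-01", 100.0), (11, 2, "2011-07-02", 250.0)])

    # Star-schema data mart sourced only from the DW, the single source of truth
    dw.executescript("""
    CREATE TABLE dim_customer AS SELECT customer_id, name, region FROM dw_customer;
    CREATE TABLE fact_sales   AS SELECT order_id, customer_id, order_date, amount FROM dw_order;
    """)
    print(dw.execute("""SELECT d.region, SUM(f.amount) FROM fact_sales f
                        JOIN dim_customer d USING (customer_id)
                        GROUP BY d.region""").fetchall())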

Friday, July 1, 2011

Testing data warehouse - Toolbox for IT Groups

My response to the posted question concerning QA testing of Data Warehouse ETL is:


More than just using independently developed SQL statements, I recommend writing separate ETL programs in another programming or scripting language, such as Perl, that mimic what the production ETL programs are supposed to be doing.

Since the same effort will be needed by both the ETL development team and the QA development team, QA cannot be left to the end as a last-minute "necessary evil", as it is in many projects. Both teams need to work independently and, ideally, arrive at the same results.

For example, I was the QA lead on a project where the ETL programs were written in Pro*C, moving data from a set of files extracted from a mainframe DB2 system to an Oracle enterprise database on UNIX. Using the same design specifications that the Pro*C developers used, I wrote Perl scripts to create UNIX-based "tables" (flat-file stores) that mimicked the Oracle target tables.

A complete set of data extract files was created from the DB2 source. Using the same DB2 extract files, the developers ran their programs to load the target tables and I ran mine to load the UNIX files. Then I extracted the data loaded into the Oracle tables and saved it to another set of UNIX files in the same format as my files. Using the UNIX diff command, I compared the two sets of files.
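
The project used Perl, but the idea can be sketched in a few lines of Python (file names and the mapping rule here are hypothetical): an independent program applies the same source-to-target rules to the same extract, and the two result sets are then compared line by line, the equivalent of running diff on the two UNIX file sets.

    # Simplified sketch of the parallel-testing idea; not the original scripts.
    import csv, difflib

    def transform(row):
        # Apply the same source-to-target mapping rules the ETL spec defines,
        # e.g. trim the name and uppercase the country code (illustrative only).
        return [row[0].strip(), row[1].strip().upper()]

    with open("db2_extract.csv", newline="") as src:            # hypothetical input
        expected = sorted(",".join(transform(r)) for r in csv.reader(src))

    with open("oracle_target_extract.csv", newline="") as tgt:  # data pulled back from the target
        actual = sorted(",".join(r) for r in csv.reader(tgt))

    # The equivalent of diff-ing the two UNIX file sets
    for line in difflib.unified_diff(expected, actual, "expected", "actual", lineterm=""):
        print(line)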

By doing this, I revealed two glaring errors that the developers said would not have been caught using just a small sample set of data extracts:

First, the Pro*C programs were doing batch commits every 2,000 rows, but the program was not resetting the counter in the FOR loop. Thus, only the first 2,000 rows were getting written to the Oracle tables. If the source-to-target mapping predicted a million rows, ending up with 2,000 rows was a big problem!
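
The pattern of that first bug, sketched in Python rather than the original Pro*C: because the counter is never reset after the first commit, the commit condition fires only once and only the first batch is ever made permanent.

    BATCH_SIZE = 2000
    rows = range(1_000_000)          # stands in for the DB2 extract rows
    pending, committed = [], []

    count = 0
    for row in rows:
        pending.append(row)          # stands in for the Pro*C INSERT
        count += 1
        if count == BATCH_SIZE:      # never true again because count keeps growing
            committed.extend(pending)
            pending.clear()
            # count = 0              # <-- the missing reset that caused the bug

    print(len(committed))            # 2000, not 1,000,000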

After this was fixed, the next glaring error my Perl scripts caught was that the variable array in the Pro*C code was not being cleared properly between rows. Therefore, some fields from the previous row were carried over and duplicated in subsequent rows. Thus, customer records that did not supply certain fields, say an email address, got the data in those fields from the previous customer record.
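
The pattern of that second bug, again as a Python sketch with made-up data: the record buffer is reused between rows and never cleared, so fields missing from the current source row keep the previous row's values.

    source_rows = [
        {"id": 1, "name": "Alice", "email": "alice@example.com"},
        {"id": 2, "name": "Bob"},                # no email supplied
    ]

    record = {}                                  # reused buffer, like the Pro*C variable array
    loaded = []
    for src in source_rows:
        # record.clear()                         # <-- the missing step
        record.update(src)
        loaded.append(dict(record))

    print(loaded[1])  # {'id': 2, 'name': 'Bob', 'email': 'alice@example.com'}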

Since the programs were written independently from the same design specifications, one of the checks was to exchange code: I reviewed their Pro*C code and they reviewed my Perl code to determine where each problem was.

Simply sending a few hundred rows of mocked-up test data to the Oracle target and checking that it matched expectations was not adequate testing and would not have revealed these serious bugs.

Thursday, June 30, 2011

How to handle Legacy Key of entity while migrating data into MDM Hub? (LinkedIn)

My contribution to this discussion:

Steve has made some very good suggestions (best practices). I'd like to give my support to:
  1. Keep the source system's Primary Keys (PK) in the MDM. (You said "legacy"; does this mean the source system is being retired?)
  2. Create a PK in the MDM using a surrogate value (a "surrogate key").
  3. No new rows or attribute values are to be created in the MDM that did not originate from the source system; that violates the principle of MDM.
  4. If an additional column in the MDM table for storing the source system's PK is not desirable, then you can, as Steve suggests, use another cross-reference table to store the source-to-MDM PK relationships. This can get messy if you need to do this for multiple sources or if the source system reuses its PK values - I have seen this happen.
I'd like to add that the MDM is NOT the place to do data cleansing. For example, if the "new" attribute values being created in the MDM are corrections for values received from the source, that is a bad practice. Corrections should be made in the source and then captured in the MDM either as complete overwrites or in a tracking (history) table of what the old values were and when they were effective.
Also, "downstream" components using the MDM data should only use whatever attributes they would use if getting their data directly from the source system. If the source system is using a surrogate PK value, then it is unlikely that users would know what it is. They should be using some "natural key" attribute or attributes that uniquely identify the rows. If they do know and use the source's PK values, that is the attribute they should continue using; the MDM's PK (if a surrogate value) should not be known to them. Since column and table names are sure to be different in the MDM than they are in the source, create views through which the users can access data using the source systems' table and column names.
For a dissertation on "surrogate keys", you might want to read http://itpro420.blogspot.com/2011/06/what-is-difference-between-surrogate.html.

Tuesday, June 28, 2011

Type 4 Slowly Changing Dimensions Design - Toolbox for IT Groups

The question asked was: "In a Type 4 SCD, does the surrogate key ever change between the current table and the history table?"
My response to this question is:

First of all, you need a good understanding of what a "surrogate key" really is. For that, see my blog post at http://itpro420.blogspot.com/2011/06/what-is-difference-between-surrogate.html.
As I state in my blog, there really is no such thing as a "surrogate key"; the term is used to describe a Primary Key (PK) that uses a surrogate value.
If you follow C. J. Date's rules, every table would have a PK using a surrogate value, ergo a "surrogate key". But each table would also have a "natural key", or a column or group of columns that uniquely identify a row (enforced by a unique index).

If the source database system uses a "surrogate key" (the PK in the source "current database") that is brought into the data warehouse (DW, a "temporal database"), that column's value will never change. However, the row in the DW should also have its own PK that uses a surrogate value. In order to keep track of changes to specific columns in that row, audit columns such as "start date", "end date", and a "current" flag must be used.
Read a good, IMHO, definition of "surrogate key" at http://en.wikipedia.org/wiki/Surrogate_key.
Now, an understanding of Slowly Changing Dimensions is needed. Since you mentioned Type 4 (SCD4), let me explain that this is nothing more than a Type 3 (SCD3) kept in a separate table. Therefore, SCD4 requires a "current dimension table" that contains only currently valid records and a separate "historical dimension table" that contains the history of changes; SCD3 does this in one table. I have rarely seen SCD4 used because it requires more database space, more management, and views for users to find the value that is either current or that was in effect at a previous time. But, in either case, the DW PK that uses a surrogate value and the source's PK that uses a surrogate value will never change.
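
To illustrate, here is a minimal sketch (Python with SQLite, hypothetical names) of a dimension that carries both the DW's surrogate PK and the source's surrogate PK plus the audit columns. The source PK stays the same across versions, and an SCD4-style "current" table is simply the current rows kept separately.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
    CREATE TABLE dim_customer_hist (
        dim_customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- DW PK using a surrogate value
        src_customer_id  INTEGER NOT NULL,                   -- source PK, never changes
        customer_name    TEXT,
        start_date       TEXT,
        end_date         TEXT,
        is_current       INTEGER
    );
    -- The customer changed names; the source PK (42) is identical in both versions.
    INSERT INTO dim_customer_hist (src_customer_id, customer_name, start_date, end_date, is_current)
    VALUES (42, 'Smith Co',  '2010-01-01', '2011-06-27', 0),
           (42, 'Smith LLC', '2011-06-28', '9999-12-31', 1);

    -- An SCD4-style current table is just the current rows kept separately.
    CREATE TABLE dim_customer_current AS
        SELECT * FROM dim_customer_hist WHERE is_current = 1;
    """)
    print(db.execute("SELECT src_customer_id, customer_name FROM dim_customer_current").fetchall())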

Wednesday, December 8, 2010

Kimball or Inmon?

The debate goes on as to whether Kimball's or Inmon's philosophy on data warehouses is best. But here is a fresh perspective, by Bill Inmon himself.

A TALE OF TWO ARCHITECTURES

Monday, November 29, 2010

STAR schema and transactions

I recently replied to a question in ITtoolbox's Data Warehouse forum regarding star schema data marts.
The original post had several questions that others answered, but one question had not been addressed, so that was the basis for my reply.

"Is it sound industry practice to use a star schema to hold individual records rather than totals?"

In order to create a fact table that aggregates transaction data, there must be a dimension table that contains the transaction detail. The granularity of that dimension table needs to be at the level needed to perform the drill down or slice and dice necessary. For some industries, such as cell phone carriers, the granularity might need to be down to the millisecond level. For others, daily granularity might be OK.
These transactional dimension tables also need to be updated, so you must determine what type of slowly changing dimension (SCD) table you need. (The "slowly" part of the name was a bad choice of words when transaction changes can occur by the second.) The three most commonly used types are SCD1, SCD2, and SCD3: SCD2 is the most common, SCD1 is next, and SCD3 is rarely used.
A time dimension table is critical for any data warehouse / data mart. The granularity of the time dimension needs to be at least as fine as that of the transactional dimension table.
For reporting purposes, I usually suggest using an aggregate fact table or view as the source for the reporting engine rather than having the reporting engine do all the aggregating. This way, the algorithms used for the aggregations will be consistent. If they are left to the reporting engine(s) and true ad-hoc queries, the results might differ, and there will be lots of complaints and troubleshooting to determine why the results are not consistent.
Whether a table or a view is used depends on how frequently the report needs to be updated and the length of time needed to return results.
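
A small sketch of that suggestion, in Python with SQLite and hypothetical names: the aggregation logic lives in one shared fact view at a daily grain, so every reporting engine uses the same algorithm instead of re-implementing it (and disagreeing) in ad-hoc queries.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
    CREATE TABLE fact_call (call_ts TEXT, customer_id INTEGER, seconds INTEGER);
    INSERT INTO fact_call VALUES
        ('2010-11-29 08:00:01', 1, 120),
        ('2010-11-29 08:05:30', 1,  60),
        ('2010-11-29 09:10:00', 2, 300);

    -- One shared definition of the aggregation, at daily grain
    CREATE VIEW fact_call_daily AS
        SELECT date(call_ts) AS call_date, customer_id,
               COUNT(*) AS calls, SUM(seconds) AS total_seconds
        FROM fact_call
        GROUP BY date(call_ts), customer_id;
    """)
    print(db.execute("SELECT * FROM fact_call_daily ORDER BY customer_id").fetchall())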

Tuesday, March 16, 2010

Company Acquisition and Data Integration / Migration for EDW

Scenario

Company A purchases Company B. Company A needs to integrate Company B's operational systems into its Enterprise Data Warehouse, and possibly into many of A's data marts; Company B's EDW will be retired. Company A and Company B have different application systems that do the same jobs: General Ledger, Order Management, Human Resources, etc.

Enterprise Data Warehouse

This post will focus on the Enterprise Data Warehouse (EDW). There are many more data stores that will need to be considered, but we'll start with this.

Planning

What needs to be planned for? What order do tasks need to be done in? What risk factors need to be considered?

  1. Understand Company A's source data architecture and metadata.
  2. Understand Company B's source data architecture and metadata.
  3. Company B may use a different database system (application software and database management system) than Company A does for the same business function. All businesses need to track their financial picture, and most will have some General Ledger (GL) application to do that. If A and B do not use the same GL application, we need to understand how B's system works and stores data and how that relates to A's system. We need to understand the metadata.

    To do this, we will need to work with Company B's subject matter experts (SME). We will need B's source system's entity relationship diagram(s) (ERD) and data dictionaries. We also need to know what data items are brought into the acquisition's EDW (assuming they have one). Just because they bring something into their EDW doesn't tell us why they bring it in. So we will need to understand their business requirements for bringing data into their EDW.

    So, the list of documents / information we have to get for the source system is:

    1. Database ERD (Physical Data Model (PDM))
    2. Data dictionary
    3. Source-to-target mappings from source database to their EDW
    4. Business rules for data brought to EDW. Why does the business need it?

  4. What data from A's source goes into A's EDW
    1. Why?
    2. How used?
    3. Transformations?
  5. What data from B's source goes into B's EDW
    1. Why?
    2. How used?
    3. Transformations?
  6. Gap analysis
    1. What data from A's source to EDW is missing from B's source to EDW?
    2. What data from B's source to EDW is missing from A's source to EDW?
    3. Why does B need "extra" data?
    4. Why doesn't B use A's "extra" data?
    5. Why doesn't A use B's "extra" data?
  7. Data profiling
    1. Which tables and columns in B's source will be coming to A's EDW?
    2. Which tables and columns in A's source hold the same data as in B's source?
    3. Which columns in B's source have different data types, lengths, values, etc., than those found in A's source? (A small profiling sketch appears at the end of this post.)
  8. Data mart analysis (assume data marts are sourced from EDW)
    1. What data marts does B have that A doesn't have?
    2. What data marts does A have that B doesn't have?
    3. Will existing data marts in A support B's needs?
    4. What new data marts will A need to support B?
  9. Report analysis

Expand on these or add more.
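
Here is the small, hypothetical profiling sketch mentioned in item 7: compare the two companies' data dictionaries column by column to flag differences in data type or length before mapping B's source into A's EDW. The dictionaries and values are made up for illustration.

    a_columns = {  # (table, column) -> (type, length), from Company A's data dictionary
        ("customer", "cust_name"): ("VARCHAR", 50),
        ("customer", "cust_type"): ("CHAR", 2),
    }
    b_columns = {  # the equivalent columns from Company B's data dictionary
        ("customer", "cust_name"): ("VARCHAR", 80),   # longer than A's
        ("customer", "cust_type"): ("VARCHAR", 10),   # different type and length
    }

    for key in sorted(set(a_columns) & set(b_columns)):
        if a_columns[key] != b_columns[key]:
            print(f"{key}: A={a_columns[key]} vs B={b_columns[key]}")
    for key in sorted(set(b_columns) - set(a_columns)):
        print(f"{key}: present only in B's source")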

Thursday, January 28, 2010

Data Warehousing: Remembering the Basics

A discussion was started in one of the LinkedIn groups I belong to about "Data Warehousing: Remembering the Basics".

There is a link to an article that was to be used as the catalyst for the discussion, but it is broken. Therefore, I'm not sure what the focus of the article would have been. But the reference to Inmon's classic approach in the discussion originator's comments leads me to believe the article would favor the relational schema.

(Link to the article being referenced is now at http://tdwi.org/articles/2010/01/28/remembering-the-basics.aspx.)

Many original data warehouses followed Kimball's philosophy and were designed in the Star (Dimensional) schema, each satisfying a specific business need.  After multiple departments within a company each had their own data warehouse, it became evident that there was no single source of truth.  Different versions of truth evolved depending on what business unit created the rules for the data warehouse and its ETL.

Following Kimball's philosophy, when an effort was made to merge these diverse data warehouses into a single enterprise data warehouse using the Star schema, the complexity of doing so became readily apparent. Many relatively small dimensional schemas proved quite difficult, if not impossible, to merge into a single source of truth. Setting aside the politics (whose naming convention and data type was "right", not to mention the "right" definitions), deciding on what should be in this single source of truth became fraught with power struggles.  And because the granularity of the fact tables in these many Dimensional data warehouses was often not at the atomic level, drilling down to the lowest level of detail behind the "facts" was frequently not possible.

The Star (Dimensional) schema requires much denormalization, and the term denormalization implies there was a normalized database to start with.  Since dimensionally modeled data warehouses are, by definition, not normalized, an Enterprise Data Warehouse cannot be built from them by further denormalization.  The Star schema is not, in my and many others' opinion, workable for an enterprise data warehouse (EDW).

In most cases, I feel that Inmon's philosophy, that a data warehouse should use a relational data model in at least 3NF, is most appropriate for an EDW.  Following Inmon's philosophy, the EDW is neither business- nor application-focused. The EDW is sourced from the original operational database systems; the data marts are sourced from the EDW, the single source of truth.

In my recent experiences, the original "data warehouses" that were application / business specific became "data marts".

This is a good topic to discuss.  Visit the LinkedIn discussion and post your comments there or, if you aren't a member of LinkedIn, post your comments here.