Thursday, January 28, 2010

Data Warehousing: Remembering the Basics

A discussion was started in one of the LinkedIn groups I belong to about "Data Warehousing: Remembering the Basics".

There is a link to an article that was to be used as the catalyst for the discsussion, but it is broken. Therefore, I'm not sure what the focus of the article would have been. But the reference to Inmon's classic approach in the discussion originator's comments leads me to believe the article would favor the relational schema.

(Link to the article being referenced is now at http://tdwi.org/articles/2010/01/28/remembering-the-basics.aspx.)

Many original data warehouses followed Kimball's philosophy and were designed in the Star (Dimensional) schema, each satisfying a specific business need.  After multiple departments within a company each had their own data warehouse, it became evident that there was no single source of truth.  Different versions of truth evolved depending on what business unit created the rules for the data warehouse and its ETL.

Following Kimball's philosophy, when an effort was made to merge these diverse data warehouses into a single enterprise data warehouse using the Star schema, the complexity of this became readily apparent. Many relatively small dimensional schemas became quite difficult, if not impossible, to merge into a single source of truth, Forgetting the political correctness needed (whose naming convention and data type was "right", not to mention the "right" definitions), deciding on what should be in this single source of truth became fraught with power struggles.  And the granularity derived from these many Dimensional data warehouses was often not enough to be able to "drill down" to the detail behind the "facts". Quite often, the granularity of the fact tables was not at the atomic level and, therefore, drill down to the lowest level of detail was not possible. 

The Star (Dimensional) schema requires much denormalization.  The term denormalization implies there was a normalized database to start with.  Since dimensional modeled data warehouses are, by definition, not normalized, building an Enterprise Data Warehouse from them does not allow denormalization.  The Star schem is not, in my and many other's opinion, workable for an enterprise data warehouse (EDW).

In most cases, I feel that Inmon's philosophy that a data warehouse should use a relational data model in at least 3NF is most appropriate for an EDW.  Following Inmon's philosophy, the EDW is not business nor application focused. The EDW is sourced from the original operational database systems; the data marts are sourced from the EDW - the single source of truth.

In my recent experiences, the original "data warehouses" that were application / business specific became "data marts".

This is a good topic to discuss.  Visit the LinkedIn discussion and post your comments there or, if you aren't a member of LinkedIn, post your comments here.

2 comments:

  1. Link to the article being referenced is now at http://tdwi.org/articles/2010/01/28/remembering-the-basics.aspx

    ReplyDelete
  2. I agree with most of what Michael A. Schiff says in his article, especially with the need to maintain timestamped transaction history for time-variant collection of data.

    Schiff quotes Inmon's old "classic definition that "a data warehouse is a subject oriented, integrated, nonvolatile, time variant collection of data in support of management's decisions" (Building the Data Warehouse, John Wiley & Sons, Inc., 1992)."

    Inmon has developed a new definition for data warehousing. You can read his new definition at http://inmoncif.com/news/pdf/dw20FN.pdf (you may have to register with his site in order to read it). He has trademarked his new term in order to control how it is used.

    According to Inmon's new defintion (DW2), data arrives in DW2 from the interactive sector - database applications capturing real-time transactions.

    Real-time data analysis using tools to integrate and analyze operational systems seem to fit into Inmon's new definition's "interactive sector" which is the data source into his "integrated sector". These tools do not replace the need for data warehouses that maintain timestamped historical transactional data.

    Inmon now divides the data warehouse into three time-based sectors: Integrated (one day to two or three years old), Near line (six months to 10 years old), and Archival (five years to infinity). Real-time data is not part of the data warehouse and, therefore, tools integrating and analyzing operational systems (interactive sector) cannot replace data warehouses.

    ReplyDelete