Thursday, January 28, 2010

Data Warehousing: Remembering the Basics

A discussion was started in one of the LinkedIn groups I belong to about "Data Warehousing: Remembering the Basics".

There is a link to an article that was to be used as the catalyst for the discsussion, but it is broken. Therefore, I'm not sure what the focus of the article would have been. But the reference to Inmon's classic approach in the discussion originator's comments leads me to believe the article would favor the relational schema.

(Link to the article being referenced is now at http://tdwi.org/articles/2010/01/28/remembering-the-basics.aspx.)

Many original data warehouses followed Kimball's philosophy and were designed in the Star (Dimensional) schema, each satisfying a specific business need.  After multiple departments within a company each had their own data warehouse, it became evident that there was no single source of truth.  Different versions of truth evolved depending on what business unit created the rules for the data warehouse and its ETL.

Following Kimball's philosophy, when an effort was made to merge these diverse data warehouses into a single enterprise data warehouse using the Star schema, the complexity of this became readily apparent. Many relatively small dimensional schemas became quite difficult, if not impossible, to merge into a single source of truth, Forgetting the political correctness needed (whose naming convention and data type was "right", not to mention the "right" definitions), deciding on what should be in this single source of truth became fraught with power struggles.  And the granularity derived from these many Dimensional data warehouses was often not enough to be able to "drill down" to the detail behind the "facts". Quite often, the granularity of the fact tables was not at the atomic level and, therefore, drill down to the lowest level of detail was not possible. 

The Star (Dimensional) schema requires much denormalization.  The term denormalization implies there was a normalized database to start with.  Since dimensional modeled data warehouses are, by definition, not normalized, building an Enterprise Data Warehouse from them does not allow denormalization.  The Star schem is not, in my and many other's opinion, workable for an enterprise data warehouse (EDW).

In most cases, I feel that Inmon's philosophy that a data warehouse should use a relational data model in at least 3NF is most appropriate for an EDW.  Following Inmon's philosophy, the EDW is not business nor application focused. The EDW is sourced from the original operational database systems; the data marts are sourced from the EDW - the single source of truth.

In my recent experiences, the original "data warehouses" that were application / business specific became "data marts".

This is a good topic to discuss.  Visit the LinkedIn discussion and post your comments there or, if you aren't a member of LinkedIn, post your comments here.

Wednesday, January 6, 2010

Data Architect versus Database Architect

Introduction

It seems that many people don't understand the difference between a Data Architect and a Database Architect and the two titles are used interchangeably.  I, however, view them as separate skill sets and positions.
These web sites are examples of how these terms are used.
http://www.databaseguides.com/database-architecture
This definition of database architecture seems to start out by describing the physical infrastructure of a database system, but then switches to the architecture of the physical data model or data schema (the structure of how the data will be stored).
http://www.wisegeek.com/what-is-a-data-architect.htm
This person's definition of a Data Architect is consistent with mine.

Confusion

Perhaps the confusion begins with the term database.  Is a database the structure (schema) of how the data will be stored (Hierarchical, Object, Relational, Dimensional, CODASYL, etc.) or what I call the "engine"? 

The database engine is the software application, or the Database Management System (DBMS), that allows us to store the data.  The most common database engines today are used for Relational Database Management Systems (RDBMS).  These include Teradata, Oracle, and MS SQL Server.  Within each DBMS, there may be multiple schemata, or collections of data. (Each individual collection of data within a DBMS is called a schemata; the plural is schema, but the plural form is often used in the singular.)

The term database seems to be freely used for both the schema and the DBMS (engine) and the distinction only becomes apparent through the context the word is used in.  Therefore, I feel there should be a clear delineation between what a Data Architect does and what a Database Architect does.

Data Architect

A Data Architect (DA) is someone who focuses first on the data itself and how it will be used and maybe later on how it will be stored (schema) on the chosen DBMS.  It is the DA who strives to ensure that the "Garbage" in "Garbage In, Garbage Out" (GIGO) won't exist.  The DA should start working at the very beginning of a project in the Plan and Analyze phases.

At this point, the DA will be working with the Business Analyst (BA) and the client stakeholders to get the meta data for the data items (entities and attributes) that will be needed for the solution.  It is very important to understand early in the project what the business means by Customer, Order Date, Sale Date, etc.  The DA should be considering the big picture, i.e. the enterprise (global) use of the data, so the same data item won't have different definitions (meta data) and data types used in other data stores.

Typical deliverables in the plan and analyze phases are the Conceptual and Logical Data Models.  These documents are not concerned with what database engine (Teradata, Oracle, MS SQL Server, etc.) the data will be stored on.  Once there is consensus on what data items are needed and how they are related, then the DA may pass the Logical Data Model off to a Data Modeler or Data Analyst to create the Physical Data Model or schema; the DA may do this model, too.

This is generally done in the Design phase of the project even though it is really in the DA's Build and Deploy phases.  This is because there are others who can't do their work until the data store's structure has been completed and published.  At this point, the DA enters a supporting role to those who will be using the data.

Once the Logical data model is completed and agreed to with the business stakeholders, the database engine (probably a RDBMS) should be chosen.  Based on data volume metrics for initial and future storage capacity needed, maximum expected number of concurrent users, batch processing needed, and other usage factors, a decision should be made as to which database engine is the best.  Quite often, however, this decision has already been made because the company has invested money, time, and staff to one database engine already and does not want to invest more into a second database engine even if the existing database engine will have scalability issues.

Database Architect

When the data-centric project gets to the point of choosing what DBMS will be used to store the data, this is the person who guides the selection of a DBMS and architects the infrastructure for it.  Things like how many servers will be needed, how many tiers there will be, how much disk space is needed initially and three years later, how many concurrent users must be supported, and so forth are what the Database Architect is concerned with and will be creating a solution for.

The Database Architect is generally someone who is expert in the chosen DBMS.  At first, there may be several of these people, each representing a competing DBMS provider.  Vendors usually provide these people to design, estimate, and (if chosen) be the company's DBMS solution provider.  Once the DBMS is chosen by the program / project managers and team members and the CIO/CFO approve the selection, then a Database Architect who specializes in that DBMS is needed.

The Database Architect could be the same person who will be the Database Adminstrator (DBA); it would be better, in my opinion, to have a consultant act as the Database Architect because once the database system (DBMS and hardware) is designed and installed, there won't be much need for a Database Architect until the system needs to be upgraded, modified, or expanded.  Generally, the Database Architect should be provided by the vendor of the DBMS since they will have the greatest and most current knowledge of their DBMS.  A DBA should work along side of the Database Architect so there can be knowledge transfer of the infrastructure that the DBMS will reside on and the DBA will be monitoring.

Database Administrator

I included this third role in this dissertation even though it was not my original intent.  It seems that most people understand the operational role of the Database Administrator (DBA), but may not understand the DBA's role in modeling the data store.

The Data Architect (or Modeler or Analyst) will work with the DBA when the Logical Data Model is transformed into the Physical Data Model (schema).  It is very rare that a Utopian Logical Data Model is transformed directly to a Physical Data Model.  What is great in theory (or from a logical business perspective) is not always best in reality.  The DBA should be the one guiding the person doing the Physical Data Model as to what will work best with the chosen DBMS.  The DBA needs to understand the relationships between the entities, constraints and data validation needed, and what attributes will need indexing for SQL performance.  Once the Data Architect has gone on to other projects, the DBA will generally be the first contact for developers and users who need to understand how to get data out of the schema contained within the DBMS.  The DBA may need to consult with the Data Architect, but the users will generally contact the DBA first.

Sometimes, the DBA is also the Physical Data Modeler, but I don't think this is a good approach.  The DBA focuses on the performance and ease of maintenance of the DBMS; the Data Architect / Modeler / Analyst focuses on the schema, data quality, data structure, data relationships, and data use (queries).  These are definitely interrelated aspects, but it is hard for one person to advocate them equally.  What the modeler sees as a best solution may not be what the DBA sees as providing the best performance and least disk space consumption.  It is best when two or more people can provide input for both sides of the issue and come up with a solution that is agreeable to both.