I attended a "Comparative EDW Architecture" seminar put on by Bill Inmon in Golden, CO, from 4/4/2012 to 4/6/2012. The purpose of the seminar was to explain the different DW philosophies of Inmon and Kimball and how they can not only be used together but actually complement each other.
Sadly, no one from the "official" Kimball organization accepted Bill's invite to present at the conference. Scott Hirst, Ultra Solutions, Inc., did attend and presented the Kimball DW architecture philosophy very well. Thanks, Scott.
The seminar presented Inmon's latest DW 2.0 architecture philosophy and Kimball's latest Dimensional architecture. Also discussed was how the two of them can complement each other.
The DW 2.0 architecture was compared to a city's architecture, and Kimball's Dimensional model was compared to a house's architecture. The DW 2.0 architecture (normalized to at least third normal form, 3NF) provides the enterprise data warehouse (EDW) foundation, upon which the Dimensional architecture is used to build the data marts needed for OLAP and reporting.
Of course, the best part of any seminar like this is the chance to network with others and learn from shared experiences.
Wednesday, April 18, 2012
Tuesday, July 12, 2011
Start with data warehouse or data mart?
This discussion on LinkedIn presents many ideas about how to start a Business Intelligence (BI) project when there is no existing data warehouse or data mart. Most companies already have some sort of BI solution, so this question won't need to be answered very often.
But, in the case where a brand new BI solution is being developed, here are my preferences:
- Design the BI solution for the future. This means planning ahead so that, whatever you start with, you are planning for future enterprise data standards, naming conventions, global uniqueness, data quality, and data governance. Do not operate in a business silo.
- ROI is important. While it is difficult to prove a return on investment for a non-tangible product such as BI, most business funders of a BI solution have a niche perspective they want results for quickly. Since building an Enterprise Data Warehouse (EDW) averages about 18 months, focused business data marts (DMs) can be completed relatively quickly, in six months or less.
- Build both the DW (to become the EDW) and the DM. When starting a new BI solution, don't consider it an "either / or" decision. A DM should get its data from the data warehouse (DW), which should be designed to become the future EDW. So, bring the master and transactional data from the source-of-record (SOR) database system into a small DW first. This small DW might have only a few subject areas, yet it is still the beginning of the future EDW. From the small DW, which should be designed as a normalized model in at least 3NF, create the DM, which is designed as a star (dimensional) model. The advantage of this approach is that the SOR systems only need to populate a single DW (the future EDW), and every DM created will be sourced from the same source of truth. Thus they will all start with the same data, which reduces the chances of different metrics being reported from different DMs.
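To make the flow concrete, here is a minimal sketch of the "small DW first, then DM" approach using SQLite. All table and column names are hypothetical, chosen just for illustration: normalized 3NF tables stand in for the small DW, and the star-schema fact is derived from the DW rather than from the SOR directly.

```python
# Sketch of "load the DW first, derive the DM from it" using SQLite.
# Table and column names are illustrative, not prescriptive.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Normalized (3NF) DW tables, loaded from the source-of-record system.
cur.executescript("""
CREATE TABLE dw_customer (
    customer_id INTEGER PRIMARY KEY,   -- DW PK using a surrogate value
    source_customer_id TEXT NOT NULL,  -- natural key from the SOR
    customer_name TEXT NOT NULL
);
CREATE TABLE dw_sale (
    sale_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES dw_customer(customer_id),
    sale_date TEXT NOT NULL,
    amount REAL NOT NULL
);
""")
cur.execute("INSERT INTO dw_customer VALUES (1, 'C-100', 'Acme Corp')")
cur.execute("INSERT INTO dw_sale VALUES (1, 1, '2011-07-01', 250.0)")
cur.execute("INSERT INTO dw_sale VALUES (2, 1, '2011-07-02', 100.0)")

# The DM's star-schema fact is derived from the DW, not from the SOR,
# so every DM starts from the same source of truth.
cur.execute("""
CREATE TABLE dm_fact_daily_sales AS
SELECT c.customer_id, s.sale_date, SUM(s.amount) AS total_amount
FROM dw_sale s JOIN dw_customer c USING (customer_id)
GROUP BY c.customer_id, s.sale_date
""")
print(cur.execute(
    "SELECT sale_date, total_amount FROM dm_fact_daily_sales ORDER BY sale_date"
).fetchall())   # [('2011-07-01', 250.0), ('2011-07-02', 100.0)]
```

Because every DM reads from the same DW tables, two different DMs can never disagree about what the source data was.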
Labels: Data Architecture, Data Mart, data modeling, Data Warehouse, New Project
Tuesday, June 28, 2011
Type 4 Slowly Changing Dimensions Design - Toolbox for IT Groups
"I was just wondering, in a type 4 SCD, does the surrogate key ever change between the current table and the history table?"
My response to this question is:
First of all, you need a good understanding of what a "surrogate key" really is. For that, see my blog post at http://itpro420.blogspot.com/2011/06/what-is-difference-between-surrogate.html.
As I state in my blog, there really is no such thing as a "surrogate key"; the term is used to describe a Primary Key (PK) that uses a surrogate value.
If you follow C. J. Date's rules, every table would have a PK using a surrogate value, ergo a "surrogate key". But each table would also have a "natural key", or a column or group of columns that uniquely identify a row (enforced by a unique index).
If the source database system uses a "surrogate key" (the PK in the source's "current database") that is brought into the data warehouse (DW, a "temporal database"), that column's value will never change. However, each row in the DW should also have its own PK that uses a surrogate value. To keep track of changes to specific columns in that row, audit columns such as "start date", "end date", and "current" must be used.
Read a good, IMHO, definition of "surrogate key" at http://en.wikipedia.org/wiki/Surrogate_key.
Now, an understanding of Slowly Changing Dimensions is needed. Since you mentioned Type 4 (SCD4), let me explain that this is nothing more than a Type 3 (SCD3) kept in a separate table. Therefore, SCD4 requires a "current dimension table" that contains only the currently valid records and a separate "historical dimension table" that contains the history of changes; SCD3 does this in one table. I have rarely seen SCD4 used, because it requires more database space, more management, and views for users to find the value that is either current or was in effect at a previous time. But, in either case, neither the DW PK that uses a surrogate value nor the source's PK that uses a surrogate value will ever change.
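To answer the original question directly, here is a minimal sketch of the current-table / history-table pattern described above, using plain Python structures. The names (`src_id`, `dw_pk`, `apply_change`) are made up for illustration; the point is that the source's surrogate-value PK never changes, while each history row gets its own DW surrogate value plus start/end audit columns.

```python
# Sketch of a "current" table plus a "history" table with audit columns.
# The source's PK (src_id) never changes; each history row gets its own
# DW surrogate value (dw_pk). All names are illustrative only.
from datetime import date

current = {}        # src_id -> current attribute values
history = []        # rows with start_date / end_date audit columns
next_dw_pk = 1      # DW primary key using a surrogate value

def apply_change(src_id, attrs, change_date):
    global next_dw_pk
    if src_id in current:
        # Close out the prior version in the history table.
        for row in history:
            if row["src_id"] == src_id and row["end_date"] is None:
                row["end_date"] = change_date
    current[src_id] = attrs
    history.append({"dw_pk": next_dw_pk, "src_id": src_id,
                    "start_date": change_date, "end_date": None, **attrs})
    next_dw_pk += 1

apply_change("C-100", {"region": "East"}, date(2011, 1, 1))
apply_change("C-100", {"region": "West"}, date(2011, 6, 1))

print(current["C-100"]["region"])   # West
print(len(history))                 # 2 versions, same src_id throughout
```

Notice that `src_id` is identical in both history rows and in the current table; only the DW-side surrogate values (`dw_pk`) differ from version to version.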
Labels: Data Architecture, Data Mart, Data Warehouse
Wednesday, December 8, 2010
Kimball or Inmon?
The debate goes on as to whether Kimball's or Inmon's philosophy on data warehouses is best. But here is a fresh perspective, by Bill Inmon himself.
Labels: Data Architecture, Data Mart, Data Warehouse
Monday, November 29, 2010
STAR schema and transactions
I recently replied to a question in ITtoolbox's Data Warehouse forum regarding star schema data marts.
The original post had several questions that others answered, but one question had not been addressed, so that was the basis for my reply.
"Is it sound industry practice to use a star schema to hold individual records rather than totals?"
In order to create a fact table that aggregates transaction data, there must be a dimension table that contains the transaction detail. The granularity of that dimension table needs to be at the level required for the drill-down or slice-and-dice analysis you want to support. For some industries, such as cell phone carriers, the granularity might need to go down to the millisecond level; for others, daily granularity might be fine.

These transactional dimension tables also need to be updated, so you must determine what type of slowly changing dimension (SCD) table you need. (The "slowly" part of the name was a bad choice of words, since transaction changes can occur by the second.) There are three types: SCD1, SCD2, and SCD3. SCD2 is the most common, SCD1 is next, and SCD3 is rarely used.
A time dimension table is critical for any data warehouse / data mart. The granularity of the time dimension needs to be at least as fine as that of the transactional dimension table.
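A time dimension is usually generated rather than loaded from a source system. Here is a minimal sketch at daily grain; the column choices (`date_key`, `day_of_week`, and so on) are illustrative, and a real time dimension would carry many more attributes (fiscal periods, holidays, etc.).

```python
# Sketch: generating a daily-grain time dimension as rows of attributes.
# Column names and choices are illustrative, not prescriptive.
from datetime import date, timedelta

def build_time_dim(start, end):
    rows, d = [], start
    while d <= end:
        rows.append({
            "date_key": int(d.strftime("%Y%m%d")),  # e.g. 20110701
            "full_date": d.isoformat(),
            "year": d.year,
            "month": d.month,
            "day_of_week": d.strftime("%A"),
        })
        d += timedelta(days=1)
    return rows

dim = build_time_dim(date(2011, 7, 1), date(2011, 7, 3))
print([r["date_key"] for r in dim])   # [20110701, 20110702, 20110703]
```

If the transactional dimension is at a finer grain (hours, seconds), the generator loop simply steps by that interval instead of by day.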
For reporting purposes, I usually suggest using a fact table or view as the source for the reporting engine rather than having the reporting engine do all the aggregating. This way, the algorithms used to do the aggregations will be consistent. If they are left to the reporting engine(s) and true ad hoc queries, the results might differ, and there will be lots of complaints and troubleshooting to determine why the results are not consistent.
Whether a table or a view is used depends on how frequently the report needs to be updated and the length of time needed to return results.
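The consistency argument above can be sketched in a few lines: compute the aggregate once in the warehouse layer, and let every reporting tool read the same pre-aggregated numbers instead of each tool re-aggregating raw rows its own way. The data and function names here are made up for illustration.

```python
# Sketch: one shared aggregation in the DW layer, so all reporting tools
# read identical totals. Sample data and names are hypothetical.
from collections import defaultdict

transactions = [
    {"day": "2010-11-01", "amount": 10.0},
    {"day": "2010-11-01", "amount": 5.0},
    {"day": "2010-11-02", "amount": 7.5},
]

def build_daily_fact(rows):
    """One consistent aggregation algorithm, applied once for everyone."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["day"]] += r["amount"]
    return dict(totals)

fact = build_daily_fact(transactions)
print(fact["2010-11-01"])   # 15.0
print(fact["2010-11-02"])   # 7.5
```

Whether `fact` is materialized as a table (fast reads, must be refreshed) or exposed as a view (always current, slower queries) is exactly the table-versus-view trade-off noted above.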
Labels: Data Architecture, Data Mart, data modeling, Data Warehouse, schema