Wednesday, April 18, 2012
I attended a "Comparative EDW Architecture" seminar put on by Bill Inmon in Golden, CO, from 4/4/2012 to 4/6/2012. The purpose of the seminar was to explain the different DW philosophies of Inmon and Kimball and how they can not only be used together, but complement each other.
Sadly, no one from the "official" Kimball organization accepted Bill's invite to present at the conference. Scott Hirst, Ultra Solutions, Inc., did attend and presented the Kimball DW architecture philosophy very well. Thanks, Scott.
The seminar presented Inmon's latest DW 2.0 architecture philosophy and Kimball's latest Dimensional architecture. Also discussed was how the two of them can complement each other.
The DW 2.0 architecture was compared to a city's architecture, and Kimball's Dimensional model was compared to a house's architecture. The DW 2.0 architecture (in at least third normal form, 3NF) provides the enterprise data warehouse (EDW) foundation upon which the Dimensional architecture builds the data marts needed for OLAP and reporting.
Of course, the best part of any seminar like this is the chance to network with others and learn from shared experiences.
Tuesday, July 12, 2011
Is the relational database doomed?
Have you ever been asked about document-oriented or key value (KV) data stores / databases? Read the article at the link above for a good background on them.
Start with data warehouse or datamart?
This discussion on LinkedIn presents many ideas about how to start a Business Intelligence (BI) project when there is no existing data warehouse or data mart. Most companies already have some sort of BI solution, so this question won't need to be answered very often.
But, in the case where a brand new BI solution is being developed, here are my preferences:
- Design the BI solution for the future. This means planning ahead so that whatever you start with, you are planning for future enterprise data standards, naming conventions, global uniqueness, data quality, and data governance. Do not operate in a business silo.
- ROI is important. While it is difficult to prove a return on investment for a non-tangible product such as BI, most business funders of a BI solution have a niche perspective they want results for quickly. Building an Enterprise Data Warehouse (EDW) averages about 18 months, whereas focused business data marts (DM) can be completed relatively quickly, in six months or less.
- Build both the DW (to become the EDW) and the DM. When starting a new BI solution, don't treat it as an "either / or" decision. A DM should get its data from the data warehouse (DW), which should be designed to become the future EDW. So, bring the master and transactional data from the source-of-record (SOR) database systems into a small DW first. This small DW could have only a few subject areas yet still be the beginning of the future EDW. From the small DW, which should be designed as a normalized model in at least 3NF, create the DM as a star (dimensional) model. The advantage of this approach is that the SOR systems only need to populate a single DW (the future EDW), and all DMs created will be sourced from the same source of truth. Thus they will all start with the same data, which reduces the chances of different metrics being reported from different DMs.
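To make the last point concrete, here is a minimal sketch of what the DW-to-DM flow could look like. All table, column, and sequence names (DW_CUSTOMER, DW_ORDER, DM_CUSTOMER_DIM, DM_ORDER_FACT, and the sequence) are illustrative assumptions, not a prescribed design, and Oracle-style sequence syntax is assumed; the point is only that the normalized DW feeds the dimensional DM.

    -- Assumed 3NF tables in the small DW: DW_CUSTOMER (CUST_ID PK, CUST_NAME)
    -- and DW_ORDER (ORDER_ID PK, CUST_ID FK, ORDER_DT, ORDER_AMT).

    -- Load the DM customer dimension from the normalized DW.
    INSERT INTO DM_CUSTOMER_DIM (CUSTOMER_KEY, CUST_ID, CUST_NAME)
    SELECT DM_CUSTOMER_SEQ.NEXTVAL,  -- DM PK using a surrogate value
           c.CUST_ID,                -- business key kept for traceability back to the DW / SOR
           c.CUST_NAME
      FROM DW_CUSTOMER c;

    -- Load the DM fact table at order grain, resolving the dimension's surrogate key.
    INSERT INTO DM_ORDER_FACT (CUSTOMER_KEY, ORDER_DT, ORDER_AMT)
    SELECT d.CUSTOMER_KEY,
           o.ORDER_DT,
           o.ORDER_AMT
      FROM DW_ORDER o
      JOIN DM_CUSTOMER_DIM d
        ON d.CUST_ID = o.CUST_ID;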
Labels:
Data Architecture,
Data Mart,
data modeling,
Data Warehouse,
New Project
Thursday, June 30, 2011
How to handle Legacy Key of entity while migrating data into MDM Hub? (Linked In)
My contribution to this discussion:
Steve has made some very good suggestions (best practices). I'd like to give my support to:
- Keep the source system's Primary Keys (PK) in the MDM. (You said "legacy" - does this mean the source system is being retired?)
- Create a PK in the MDM using a surrogate value (a "surrogate key").
- No new rows or attribute values are to be created in the MDM that did not originate from the source system; that violates the principle of MDM. For example, if the "new" attribute values being created in the MDM are corrections for values received from the source, that is a bad practice. Corrections should be made in the source and then captured in the MDM either as complete overwrites or in a tracking (history) table of what the old values were and when they were effective.
- If an additional column in the MDM table for storing the source system's PK is not desirable, then you can, as Steve suggests, use another cross-reference table to store the source-to-MDM PK relationships. This can get messy if you need to do this for multiple sources or if the source system reuses its PK values - I have seen this happen.
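As a minimal sketch of the cross-reference approach (all names here - MDM_CUSTOMER, MDM_CUSTOMER_XREF, and their columns - are hypothetical), the mapping table might look something like this, with a source-system column so multiple sources, or reused source PK values, can be told apart:

    -- Maps each source system's PK value to the MDM's surrogate PK.
    CREATE TABLE MDM_CUSTOMER_XREF (
        MDM_CUSTOMER_ID  NUMBER        NOT NULL,  -- the MDM PK (surrogate value)
        SOURCE_SYSTEM    VARCHAR2(30)  NOT NULL,  -- e.g. 'CRM' or 'BILLING'
        SOURCE_PK_VALUE  VARCHAR2(100) NOT NULL,  -- the PK value as it exists in that source
        EFFECTIVE_DATE   DATE          NOT NULL,  -- helps if a source ever reuses a PK value
        CONSTRAINT PK_MDM_CUSTOMER_XREF
            PRIMARY KEY (SOURCE_SYSTEM, SOURCE_PK_VALUE, EFFECTIVE_DATE),
        CONSTRAINT FK_XREF_MDM_CUSTOMER
            FOREIGN KEY (MDM_CUSTOMER_ID) REFERENCES MDM_CUSTOMER (MDM_CUSTOMER_ID)
    );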
Also, "downstream" components using the MDM data should only use whatever attributes they would use if getting their data directly from the source system. If the source system is using a surrogate PK value, then it is unlikely that users would know what it is. They should be using some "natural key" attribute or attributes that uniquely identify the rows. If they do know and use the source's PK values, that is the attribute they should continue using; the MDM's PK (if a surrogate value) should not be known to them. Since column and table names are sure to be different in the MDM than they are in the source, create views through which the users can access data using the source systems' table and column names.
For a dissertation on "surrogate keys", you might want to read http://itpro420.blogspot.com/2011/06/what-is-difference-between-surrogate.html.
Labels:
Data Architecture,
data modeling,
Data Warehouse,
MDM
Tuesday, June 28, 2011
Type 4 Slowly Changing Dimensions Design - Toolbox for IT Groups
"I was just wondering, in a type 4 SCD, does the surrogate key ever change between the current table and the history table?"
My response to this question is:
First of all, you need a good understanding of what a "surrogate key" really is. For that, see my blog site at http://itpro420.blogspot.com/2011/06/what-is-difference-between-surrogate.html.
As I state in my blog, there really is no such thing as a "surrogate key"; the term is used to describe a Primary Key (PK) that uses a surrogate value.
If you follow C. J. Date's rules, every table would have a PK using a surrogate value, ergo a "surrogate key". But each table would also have a "natural key", or a column or group of columns that uniquely identify a row (enforced by a unique index).
If the source database system uses a "surrogate key" (the PK in the source "current database") and that key is brought into the data warehouse (DW - a "temporal database"), that column's value will never change. However, the row in the DW should also have its own PK that uses a surrogate value. In order to keep track of changes to specific columns in that row, audit columns such as "start date", "end date", and "current" must be used.
Read a good, IMHO, definition of "surrogate key" at http://en.wikipedia.org/wiki/Surrogate_key.
Now, an understanding of Slowly Changing Dimensions is needed. Since you mentioned Type 4 (SCD4), let me explain that this is nothing more than a Type 3 (SCD3) kept in a separate table. Therefore, SCD4 requires a "current dimension table" that contains only currently valid records and a separate "historical dimension table" that contains the history of changes; SCD3 does this in one table. I have rarely seen SCD4 used because it requires more database space, more management, and views for users to find the value that is either current or was in use at a previous time. But, in either case, the DW PK that uses a surrogate value and the source's PK using a surrogate value will never change.
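To illustrate (a minimal sketch only - the table, column, and data type choices are assumptions, not a standard design), an SCD4 pair might look like this; notice that the surrogate key values never change, only the tracked attribute and the audit columns do:

    -- Current dimension table: one row per customer, current values only.
    CREATE TABLE CUSTOMER_DIM_CURRENT (
        CUSTOMER_KEY  NUMBER       NOT NULL,  -- DW PK using a surrogate value; never changes
        SRC_CUST_ID   NUMBER       NOT NULL,  -- PK value carried over from the source system; never changes
        CUST_SEGMENT  VARCHAR2(30),           -- attribute whose changes are tracked
        CONSTRAINT PK_CUST_DIM_CURRENT PRIMARY KEY (CUSTOMER_KEY)
    );

    -- Separate history table: one row per version of the tracked attribute.
    CREATE TABLE CUSTOMER_DIM_HISTORY (
        CUSTOMER_KEY  NUMBER       NOT NULL,  -- same surrogate value as in the current table
        CUST_SEGMENT  VARCHAR2(30),
        START_DATE    DATE         NOT NULL,  -- audit columns: when this version was in effect
        END_DATE      DATE,
        CONSTRAINT PK_CUST_DIM_HISTORY PRIMARY KEY (CUSTOMER_KEY, START_DATE)
    );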
Labels:
Data Architecture,
Data Mart,
Data Warehouse
Thursday, June 23, 2011
What is the difference between Surrogate Key and Primary Key? | LinkedIn
Introduction
The meanings below do not depend on the data model; these are relational database management system (RDBMS) concepts. Whether you are using a normalized model or a dimensional model, the terms "primary key", "surrogate key", "natural key", and "alternate key" all have the same meanings.

A Google search turned up lots of places to read up on keys, including these:
- http://en.wikipedia.org/wiki/Unique_key
- http://en.wikipedia.org/wiki/Surrogate_key
- http://www.dbdebunk.com/page/page/622337.htm
- http://databases.about.com/cs/administration/g/primarykey.htm
- Methods for primary and secondary key selection (Data base management)
Primary Key
A Primary Key (PK) in a database table uniquely identifies a row in the table. There can be only one PK per table. Depending on the database engine being used, a PK may require the explicit creation of a unique index, or the database engine may automatically create such an index for the column or columns identified as the PK column(s).

A primary key can consist of a single column using a "surrogate" value or be made up of one or more columns in the table that will uniquely identify a row (also called a "natural" or "meaningful" key).
Surrogate Key, aka Meaningless Key
The term "Surrogate Key" describes a particular type of Primary Key (PK). There really is no such thing as a "Surrogate Key". There are Primary Keys that use surrogate values. A PK that uses a surrogate value always consists of only one column.The surrogate value is generally a system-generated value that nobody would know what it is to search for from a query. This PK may then be used as a foreign key (FK) in other tables and used for join conditions.
A Primary Key that uses a surrogate value is also known as a "meaningless" key. A table with only a "surrogate key" (primary key using an arbitrary value) must have an alternate key (AK) (see below). Without an AK, it is very likely that there will be "duplicate" rows in the table. An explanation of this follows below.
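As a minimal sketch of what this looks like in practice (Oracle-style sequence syntax assumed; other engines use IDENTITY or AUTO_INCREMENT columns, and all object names here are illustrative):

    CREATE SEQUENCE PRODUCT_SEQ;

    CREATE TABLE PRODUCT (
        PRODUCT_ID    NUMBER        NOT NULL,  -- single-column PK holding a system-generated surrogate value
        PRODUCT_NAME  VARCHAR2(100) NOT NULL,
        CONSTRAINT PK_PRODUCT PRIMARY KEY (PRODUCT_ID)
    );

    -- Nobody will ever search for 1, 2, 3...; the value exists only to identify the row
    -- and to be carried into other tables as a foreign key for joins.
    INSERT INTO PRODUCT (PRODUCT_ID, PRODUCT_NAME)
    VALUES (PRODUCT_SEQ.NEXTVAL, 'Widget');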
Natural Key, aka Meaningful Key
The other type of PK is called a "natural key". A natural (or meaningful) primary key is made up of one or more columns in a row that uniquely identify that row. The column or columns used for a natural key contain values that have meaning to the business.

If the natural key is the PK, then there can be no surrogate key. If the natural key is not the PK, it is an alternate key.
Alternate Key
While there can be only one primary key in a table, there may be more than one "alternate" key in a table. These alternate keys are enforced by one or more unique indexes. A table with only a "surrogate key" (primary key using an arbitrary number) and no unique index for an alternate key is useless - you are doomed to have duplicate rows for every column EXCEPT the surrogate key, which does not help you find a unique row at all.

Generally, the columns used for the alternate key(s) are the columns that people will use in their queries to find a unique row, but the joins to other tables will be done using the single PK column comprising the "meaningless key", with that PK column being used as a FK column in those other tables.
The use of surrogate key values prevents the need to duplicate multiple columns in "child" tables just to be able to find the children of the "parent" record in a JOIN (or vice versa).
Impact of Primary Key Design
Let's examine the impact and implications of designing a table's primary key. Table_1, below, is a table that has a PK made up of a surrogate value, hence a "Surrogate Key", and no AK.

T1_PK_ID | T1_COL1 | T1_COL2 | T1_COL3 |
---|---|---|---|
1 | A | B | C |
2 | A | B | C |
3 | A | D | E |
While the three rows above are unique in the PK value (T1_PK_ID column), rows 1 and 2 are really not unique rows, are they?
A table with only a "surrogate key" (PK using an arbitrary value) and no unique index for an alternate key (or the natural key) is useless - you are doomed to have duplicates in rows for every column EXCEPT the surrogate key which does not help you find a unique row at all.
If Table_1 were created with a unique index on the columns that make up an alternate key, it would look like Table_2 below.
T2_PK_ID | T2_AK1 | T2_AK2 | T2_COL3 |
---|---|---|---|
1 | A | B | C |
2 | A | D | E |
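As a minimal sketch (Oracle-style syntax and data types assumed, since the error message later in this post is Oracle's), Table_2 could be declared with the PK on the surrogate column plus a unique constraint that enforces the alternate key:

    CREATE TABLE Table_2 (
        T2_PK_ID  NUMBER       NOT NULL,   -- PK using a surrogate value
        T2_AK1    VARCHAR2(10) NOT NULL,
        T2_AK2    VARCHAR2(10) NOT NULL,
        T2_COL3   VARCHAR2(10),
        CONSTRAINT PK_Table_2 PRIMARY KEY (T2_PK_ID),
        CONSTRAINT AK_Table_2 UNIQUE (T2_AK1, T2_AK2)  -- the alternate key; prevents the "duplicate" rows seen in Table_1
    );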
A primary key can consist of a single "surrogate" value or be made up of one or more columns in the table that will uniquely identify a row (also called a meaningful key). If Table_1 were recreated using the natural key, it would look like Table_3 below.
T3_PK1 | T3_PK2 | T3_COL3 |
---|---|---|
A | B | C |
A | D | E |
If Table_3 were joined to another table, that table would have to include all the columns making up the primary key in Table_3; these would be the foreign key (FK) columns. The use of surrogate key values as in Table_2 prevents the need to duplicate multiple columns in "child" tables just to be able to find the unique "parent" records in a JOIN, and vice versa.

As an example of what I stated above, let's assume the following table structure for Table_4:
T4_PK | T4_AK1 | T4_AK2 | T4_COL3 | T3_PK1 | T3_PK2 |
---|---|---|---|---|---|
100 | AA | BA | CA | A | B |
101 | AA | BB | CB | A | D |
A query to find a row in Table_4 and join it to Table_3 to get another column would look like this:

    SELECT T4.T4_PK, T4.T4_AK1, T4.T4_AK2, T4.T4_COL3, T3.T3_COL3
    FROM Table_4 "T4"
    INNER JOIN Table_3 "T3"
        ON T4.T3_PK1 = T3.T3_PK1
       AND T4.T3_PK2 = T3.T3_PK2
    WHERE T4.T4_AK1 = 'AA'
      AND T4.T4_AK2 = 'BA'

The result from the query above would be:
    T4_PK  T4_AK1 T4_AK2 T4_COL3 T3_COL3
    ------ ------ ------ ------- -------
    100    AA     BA     CA      C

If this same query were modified to be run against Table_1, assuming T1_COL1 is the same as T3_PK1 and T1_COL2 is the same as T3_PK2, the query would look like this:
    SELECT T4.T4_PK, T4.T4_AK1, T4.T4_AK2, T4.T4_COL3, T1.T1_COL3
    FROM Table_4 "T4"
    INNER JOIN Table_1 "T1"
        ON T4.T3_PK1 = T1.T1_COL1
       AND T4.T3_PK2 = T1.T1_COL2
    WHERE T4.T4_AK1 = 'AA'
      AND T4.T4_AK2 = 'BA'

The result from the query above would be:
    T4_PK  T4_AK1 T4_AK2 T4_COL3 T1_COL3
    ------ ------ ------ ------- -------
    100    AA     BA     CA      C
    100    AA     BA     CA      C

Instead of a single row being returned, we now get two rows.
Remember we created a modified version of Table_1 giving it an AK as Table_2. If we run the same query above against Table_2, changing the table and column names in the JOIN condition, we see that the AK in Table_2 prevented the "duplicate" row and we get only one row returned.
    SELECT T4.T4_PK, T4.T4_AK1, T4.T4_AK2, T4.T4_COL3, T2.T2_COL3
    FROM Table_4 "T4"
    INNER JOIN Table_2 "T2"
        ON T4.T3_PK1 = T2.T2_AK1
       AND T4.T3_PK2 = T2.T2_AK2
    WHERE T4.T4_AK1 = 'AA'
      AND T4.T4_AK2 = 'BA'

Results:

    T4_PK  T4_AK1 T4_AK2 T4_COL3 T2_COL3
    ------ ------ ------ ------- -------
    100    AA     BA     CA      C

The table below, Table_5, has a surrogate primary key (T5_PK), a natural key (T5_AK1, T5_AK2), and foreign key columns needed to join this table to Table_1 and Table_4. Since T1_PK_ID is a surrogate (meaningless) key, it is not expected that one would know what it is. Therefore, searches for a row would be based on other columns thought to uniquely identify a row, in this case T1_COL1, T1_COL2, and T1_COL3.
T5_PK | T5_AK1 | T5_AK2 | T5_COL4 | T4_PK | T1_COL1 | T1_COL2 | T1_COL3 |
---|---|---|---|---|---|---|---|
100 | AA | BA | CA | 100 | A | B | C |
101 | AA | BB | CB | 101 | A | D | E |
In the query below, we will try to get the primary key value from Table_1; let's say we need to use it in another query. The expectation is a return of a single row, but, because the rows are not truly unique based on the other columns in Table_1, two rows are returned:

    SELECT Table_1.T1_PK_ID
    FROM Table_5
    INNER JOIN Table_4
        ON Table_5.T4_PK = Table_4.T4_PK
    INNER JOIN Table_1
        ON Table_5.T1_COL1 = Table_1.T1_COL1
       AND Table_5.T1_COL2 = Table_1.T1_COL2
    WHERE Table_5.T1_COL1 = 'A'
      AND Table_5.T1_COL2 = 'B'

Results:

    T1_PK_ID
    ----------------------
    1
    2

If we wanted to get the value of Table_1.T1_COL3 for the supposedly "unique" values in the "key" columns T1_COL1 and T1_COL2 using the SQL below, we will get an error.
    SELECT T1_COL3
    FROM Table_1
    WHERE t1_pk_id = (SELECT t1_pk_id
                      FROM Table_1
                      WHERE t1_col1 = 'A'
                        AND t1_col2 = 'B')

    Error starting at line 1 in command:
    SELECT T1_COL3 FROM Table_1 WHERE t1_pk_id = (select t1_pk_id from table_1 where t1_col1 = 'A' AND t1_col2 = 'B')
    Error report:
    SQL Error: ORA-01427: single-row subquery returns more than one row
    01427. 00000 - "single-row subquery returns more than one row"
    *Cause:
    *Action:
Conclusion
While this started out with a question asking about the difference between a "surrogate key" and a "primary key", it evolved into a dissertation on primary keys and how they can be designed. You should have learned:

- There really is no such thing as a "surrogate key"; there are primary keys using surrogate values.
- A "primary key" can be made up from
- a single column using a surrogate value
- one or more columns that have unique values and known business meanings
- A table with a PK using a single column with a surrogate value MUST have an alternate key (AK)
- There are impacts to SQL writing if the PK is designed wrong!
Wednesday, December 8, 2010
Kimball or Inmon?
The debate goes on as to whether Kimball's or Inmon's philosophy on data warehouses is best. But here is a fresh perspective, by Bill Inmon himself.
Labels:
Data Architecture,
Data Mart,
Data Warehouse
Monday, November 29, 2010
STAR schema and transactions
I recently replied to a question in ITtoolbox's Data Warehouse forum regarding star schema data marts.
The original post had several questions that others answered, but one question had not been addressed, so that was the basis for my reply.
"Is it sound industry practice to use a star schema to hold individual records rather than totals?"
In order to create a fact table that aggregates transaction data, there must be a dimension table that contains the transaction detail. The granularity of that dimension table needs to be at the level needed to perform the necessary drill-down or slice-and-dice analysis. For some industries, such as cell phone carriers, the granularity might need to be down to the millisecond level. For others, daily granularity might be OK.

These transactional dimension tables also need to be updated, so you must determine what type of slowly changing dimension (SCD) table you need. (The "slowly" part of the name was a bad choice of words, since transaction changes can occur by the second.) There are three types: SCD1, SCD2, and SCD3. SCD2 is the most common, SCD1 is next, and SCD3 is rarely used.
A time dimension table is critical for any data warehouse / data mart. The granularity of the time dimension needs to be at least as fine as that of the transactional dimension table.
For reporting purposes, I usually suggest using a fact table or view as the source for the reporting engine rather than having the reporting engine do all the aggregating. This way, the algorithms used to do the aggregations will be consistent. If these are left to the reporting engine(s) and true ad-hoc queries, the results might differ, and there will be lots of complaints and troubleshooting to determine why the results are not consistent.
Whether a table or view is used depends on how frequently the report needs to be updated and the length of time needed to return results.
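As a minimal sketch of the "aggregate once in the database" idea (all names are illustrative; SALES_TRANSACTION_FACT is an assumed transaction-grain table), a view like the one below lets every reporting tool use the same aggregation logic. If refresh frequency or query time becomes an issue, the same SELECT could populate a summary table instead of a view:

    CREATE OR REPLACE VIEW DAILY_SALES_FACT_V AS
    SELECT f.DATE_KEY,                          -- daily grain, matching the time dimension
           f.PRODUCT_KEY,
           SUM(f.SALE_AMOUNT) AS TOTAL_SALE_AMOUNT,
           COUNT(*)           AS TRANSACTION_COUNT
      FROM SALES_TRANSACTION_FACT f
     GROUP BY f.DATE_KEY, f.PRODUCT_KEY;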
Labels:
Data Architecture,
Data Mart,
data modeling,
Data Warehouse,
schema
Sunday, March 28, 2010
Data Modeling
Data Modeling is a way of describing the tables (entities) and relationships in a database schema (see other posts with the "schema" label). (By the way, both "modeling" and "modelling" are correct spellings; the first is American English and the second is British English, and the British claim they are right.) Because "a picture is worth a thousand words", data models help database users more easily understand the schemata they are interested in.
There are many kinds of models, but the three most common are: Conceptual, Logical, and Physical.
Data Model Hierarchy
CONCEPTUAL
A conceptual data model is the highest level data model. It is usually created early in a database project, as or shortly after data requirements have been gathered, in what is known as the Plan and Analyze phase of the SDLC.

It shows the major data entities the project needs and how they relate to each other. Entities include objects like Customers, Employees, Products, Sales, Orders, etc. Sometimes entities are made child entities of a broader entity; for example, Customers could be a child entity of People. The names used to describe an entity are the ones the business would use, such as Customer, not the names the tables in the physical schema will have, such as TB_CUST.
A good definition of Conceptual Data Modeling can be found in "Conceptual Modeling - ER 2002" by Stefano Spaccapietra and Salvatore March (Springer):
"The task of conceptual modelling is to build a high-quality representation of selected phenomena in some domain. The conceptual models that result facilitate the implementation, operation, and maintenance of information systems."
LOGICAL
The next level in data modeling is the Logical data model. It is created in the Design phase of the SDLC.
At this level, data items that define what makes up an entity are added. For example, a Customer entity will have data items such as Name, Address, City, State, ZIP, etc. These data items are called "attributes" of the entity.
The logical model will show how the entities relate to each other through common attributes. These relationships will be in business terms such as "An order has a customer" and "A customer may have one or more orders."
The relationships will be connected by lines that depict what is called the "cardinality" of the relationship, or how many elements there can be in the set. For example, every Order must have one and only one customer; however, a customer can have zero or many current orders. The Customer-to-Order relationship is an optional one-to-many cardinality, while the Order-to-Customer relationship is a mandatory one-to-one cardinality.
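When this logical relationship is later carried into a physical schema, the cardinality is typically enforced with a mandatory foreign key. A minimal sketch (illustrative names and data types, not part of any model discussed above) follows:

    CREATE TABLE CUSTOMER (
        CUSTOMER_ID  NUMBER        NOT NULL,
        CUST_NAME    VARCHAR2(100) NOT NULL,
        CONSTRAINT PK_CUSTOMER PRIMARY KEY (CUSTOMER_ID)
    );

    -- Every order must reference exactly one customer (NOT NULL foreign key),
    -- while a customer may have zero or many orders.
    CREATE TABLE CUSTOMER_ORDER (
        ORDER_ID     NUMBER NOT NULL,
        CUSTOMER_ID  NUMBER NOT NULL,
        ORDER_DATE   DATE   NOT NULL,
        CONSTRAINT PK_CUSTOMER_ORDER PRIMARY KEY (ORDER_ID),
        CONSTRAINT FK_ORDER_CUSTOMER FOREIGN KEY (CUSTOMER_ID) REFERENCES CUSTOMER (CUSTOMER_ID)
    );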
PHYSICAL
The physical data model is the graphic depiction of the actual database schema that contains the entities (tables) and attributes (columns) specified in the Logical data model. Sadly, many database schemata are built directly without the planning and design models being created. The physical data model should be the result of collaboration between the Data Architect (or Modeler) and the Database Administrator (DBA). The Data Architect knows what data entities and attributes are needed by the application and how they relate to each other. The DBA knows the database management system (DBMS or RDBMS) in which the data will be stored. Together, these two can design and build the physical database schema that will provide the fastest I/O performance, be maintainable by the DBA, and be scalable for future growth. (Scalability is a term used to indicate how well the system will be able to handle increased data volumes and more users of the database.)

Data Modeling Tools
In the beginning, these models were created by hand drawings using stencils for the limited number of shapes and lines used at the time. As models became more complicated, software to create data models was developed. At first, these software applications were no more than diagramming tools.

These simple diagramming tools have become sophisticated data modeling applications that include data dictionaries, metadata, scripting languages, requirements tracking, data definition language (DDL) generators, and much more. These mega-modeling tools can be very expensive. The two most common ones that I know of and have used are Sybase's PowerDesigner and CA's ERwin.
These are great tools for large companies, but are there data modeling software tools available cheaply or freely for those who want to learn modeling skills or do data modeling for small to medium sized businesses?
Some database management systems (DBMS) come with their own data modeling tools. A couple of DBMSs can be downloaded freely for personal education and training. Both Oracle and MS SQL Server allow this, and I believe they both have built-in data modeling tools, but it has been a long time since I used them myself, so this may no longer be true.
Some people use Visio for data modeling, but others claim it is not a "real" data modeling tool. It certainly has limitations, but you can draw database schemata diagrams with it. I have found it to be useful for simple data models, but it has none of the extended features mentioned above.
Below are links to various web sites that list free and/or inexpensive data modeling tools:
Labels:
Data Architecture,
data modeling,
schema
Wednesday, January 6, 2010
Data Architect versus Database Architect
Introduction
It seems that many people don't understand the difference between a Data Architect and a Database Architect, and the two titles are used interchangeably. I, however, view them as separate skill sets and positions.
These web sites are examples of how these terms are used.
- http://www.databaseguides.com/database-architecture - This definition of database architecture seems to start out by describing the physical infrastructure of a database system, but then switches to the architecture of the physical data model or data schema (the structure of how the data will be stored).
- http://www.wisegeek.com/what-is-a-data-architect.htm - This person's definition of a Data Architect is consistent with mine.
Confusion
Perhaps the confusion begins with the term database. Is a database the structure (schema) of how the data will be stored (Hierarchical, Object, Relational, Dimensional, CODASYL, etc.) or what I call the "engine"?

The database engine is the software application, or the Database Management System (DBMS), that allows us to store the data. The most common database engines today are used for Relational Database Management Systems (RDBMS). These include Teradata, Oracle, and MS SQL Server. Within each DBMS, there may be multiple schemata, or collections of data. (Each individual collection of data within a DBMS is called a schema; the plural is schemata, although the two forms are often used interchangeably.)
The term database seems to be freely used for both the schema and the DBMS (engine) and the distinction only becomes apparent through the context the word is used in. Therefore, I feel there should be a clear delineation between what a Data Architect does and what a Database Architect does.
Data Architect
A Data Architect (DA) is someone who focuses first on the data itself and how it will be used, and maybe later on how it will be stored (the schema) in the chosen DBMS. It is the DA who strives to ensure that the "Garbage" in "Garbage In, Garbage Out" (GIGO) won't exist. The DA should start working at the very beginning of a project, in the Plan and Analyze phases.

At this point, the DA will be working with the Business Analyst (BA) and the client stakeholders to gather the metadata for the data items (entities and attributes) that will be needed for the solution. It is very important to understand early in the project what the business means by Customer, Order Date, Sale Date, etc. The DA should be considering the big picture, i.e., the enterprise (global) use of the data, so the same data item won't have different definitions (metadata) and data types in other data stores.
Typical deliverables in the plan and analyze phases are the Conceptual and Logical Data Models. These documents are not concerned with what database engine (Teradata, Oracle, MS SQL Server, etc.) the data will be stored on. Once there is consensus on what data items are needed and how they are related, then the DA may pass the Logical Data Model off to a Data Modeler or Data Analyst to create the Physical Data Model or schema; the DA may do this model, too.
This is generally done in the Design phase of the project even though it is really in the DA's Build and Deploy phases. This is because there are others who can't do their work until the data store's structure has been completed and published. At this point, the DA enters a supporting role to those who will be using the data.
Once the Logical data model is completed and agreed to with the business stakeholders, the database engine (probably a RDBMS) should be chosen. Based on data volume metrics for initial and future storage capacity needed, the maximum expected number of concurrent users, batch processing needed, and other usage factors, a decision should be made as to which database engine is best. Quite often, however, this decision has already been made because the company has already invested money, time, and staff in one database engine and does not want to invest more in a second database engine, even if the existing database engine will have scalability issues.
Database Architect
When the data-centric project gets to the point of choosing what DBMS will be used to store the data, this is the person who guides the selection of a DBMS and architects the infrastructure for it. Things like how many servers will be needed, how many tiers there will be, how much disk space is needed initially and three years later, how many concurrent users must be supported, and so forth are what the Database Architect is concerned with and will be creating a solution for.

The Database Architect is generally someone who is expert in the chosen DBMS. At first, there may be several of these people, each representing a competing DBMS provider. Vendors usually provide these people to design, estimate, and (if chosen) be the company's DBMS solution provider. Once the DBMS is chosen by the program / project managers and team members and the CIO/CFO approve the selection, then a Database Architect who specializes in that DBMS is needed.
The Database Architect could be the same person who will be the Database Administrator (DBA); it would be better, in my opinion, to have a consultant act as the Database Architect, because once the database system (DBMS and hardware) is designed and installed, there won't be much need for a Database Architect until the system needs to be upgraded, modified, or expanded. Generally, the Database Architect should be provided by the vendor of the DBMS, since they will have the greatest and most current knowledge of their DBMS. A DBA should work alongside the Database Architect so there can be knowledge transfer of the infrastructure that the DBMS will reside on and that the DBA will be monitoring.
Database Administrator
I included this third role in this dissertation even though it was not my original intent. It seems that most people understand the operational role of the Database Administrator (DBA), but may not understand the DBA's role in modeling the data store.

The Data Architect (or Modeler or Analyst) will work with the DBA when the Logical Data Model is transformed into the Physical Data Model (schema). It is very rare that a Utopian Logical Data Model is transformed directly to a Physical Data Model. What is great in theory (or from a logical business perspective) is not always best in reality. The DBA should be the one guiding the person doing the Physical Data Model as to what will work best with the chosen DBMS. The DBA needs to understand the relationships between the entities, constraints and data validation needed, and what attributes will need indexing for SQL performance. Once the Data Architect has gone on to other projects, the DBA will generally be the first contact for developers and users who need to understand how to get data out of the schema contained within the DBMS. The DBA may need to consult with the Data Architect, but the users will generally contact the DBA first.
Sometimes, the DBA is also the Physical Data Modeler, but I don't think this is a good approach. The DBA focuses on the performance and ease of maintenance of the DBMS; the Data Architect / Modeler / Analyst focuses on the schema, data quality, data structure, data relationships, and data use (queries). These are definitely interrelated aspects, but it is hard for one person to advocate them equally. What the modeler sees as a best solution may not be what the DBA sees as providing the best performance and least disk space consumption. It is best when two or more people can provide input for both sides of the issue and come up with a solution that is agreeable to both.