Friday, October 29, 2010

Database Management Systems Best Uses

As a data architect, I am database agnostic until reaching the physical implementation. During the conceptual and logical modeling phases, I will have gathered metrics that help predict the scalability the physical implementation will need.
In the ideal scenario, these metrics are used to determine the best database management system (DBMS) for the job. Often, however, the target DBMS has already been chosen; in that case, the metrics provide the basis for scalability and risk assessment. These metrics include (a query sketch for gathering some of them follows the list):
  • Number of entities (tables)
  • Number of attributes (columns)
  • Volume metrics
    • Initial
      • Table size (row count and bytes)
      • Connection Numbers
        • Total
        • Average concurrent
        • Maximum concurrent
        • Application / system users
        • Power users
    • Growth rates for above: 1 year, 3 years
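
Once a candidate physical environment exists, some of the initial counts above can be pulled straight from the system catalog. Below is a minimal sketch, assuming a DBMS that exposes the standard INFORMATION_SCHEMA views; products such as Teradata use their own dictionary views (the DBC views) instead, and the byte sizes, row counts, and connection figures come from product-specific views and monitoring tools, so they are noted only as comments.

  -- Entity (table) and attribute (column) counts per schema,
  -- assuming standard INFORMATION_SCHEMA support.
  SELECT t.table_schema,
         COUNT(DISTINCT t.table_name) AS table_count,
         COUNT(c.column_name)         AS column_count
  FROM   information_schema.tables  t
  JOIN   information_schema.columns c
         ON  c.table_schema = t.table_schema
         AND c.table_name   = t.table_name
  WHERE  t.table_type = 'BASE TABLE'
  GROUP  BY t.table_schema
  ORDER  BY t.table_schema;

  -- Row counts and byte sizes come from product-specific views
  -- (for example, DBC.TableSizeV on Teradata); connection counts
  -- come from the DBMS monitoring or query-log facilities.

Growth rates are then a matter of re-running the same counts at regular intervals and comparing.
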
The data architect may or may not be involved in the physical implementation process, depending on the company's organizational hierarchy. Assuming that I will be involved, I don't need to be an expert in every available DBMS, but I should have some knowledge and understanding of them based on past experience. I won't be able to keep up with every DBMS out there, especially the niche systems, and open source systems are also available and widely used. A data architect needs to work closely with the DBMS subject matter experts (SMEs). The most important SMEs to me are those who have actual experience using and supporting a given DBMS and the type of data store being run on it.

So that we can all gain a better understanding of what is available and where each DBMS fits best, what is your experience? Consider the following points and share which DBMS was used and how it met, or failed to meet, expectations.
  • Purpose of database (EDW, DW, Data mart, ODS, reporting, OLAP, OLTP, etc.):
  • Database structure (Normalized, Dimensional, Snowflake, etc.):
  • Database size in TB:
  • Number of tables in database:
  • Number of columns in database:
  • Total number of users:
  • Total number of power users:
  • Maximum number of connections:
  • Average number of connections:
  • DBMS or RDBMS used:
  • Expectations met or not for:
    • Database growth in TB?
    • Query performance?
    • Performance tuning?
    • Connectivity?
    • Database maintenance (DBA)?
    • Other comments:

My experiences:

  • Purpose of database: Enterprise Data Warehouse (EDW), single source of truth; it was the source for other data marts, reporting data stores, and OLAP.
  • Database structure: Normalized, 3NF
  • Database size in TB: Greater than 100 TB
  • Number of tables in database: More than 1000 tables; more than 3000 views.
  • Number of columns in database: More than 50,000
  • Total number of users: More than 1000
  • Total number of power users: More than 100
  • Maximum number of connections: More than 500
  • Average number of connections: More than 100
  • DBMS or RDBMS used: Teradata
  • Expectations met or not for:
    • Database growth in TB? Yes. Additional disk space seemed fairly easy to add. Adding more servers (AMPs) increased disk space, provided more CPU processing power, and supported more user connections.
    • Query performance? Very good.
    • Performance tuning? Performance depended on primary indexes (data distribution), partitioning, secondary indexes, SQL tuning, and collected statistics; see the DDL sketch after this list.
    • Connectivity? Batch programs, direct user connections, and web-based user access.
    • Database maintenance (DBA)? It was faster to add or drop columns by creating a new table and copying the data from the old table into it than by using ALTER TABLE commands; obviously, not every step of that process is itemized here (a sketch of the pattern follows this list).
    • Other comments: By far the best RDBMS I have experienced for VLDB. Nearly everything ran as massively parallel processing. Data marts, reporting data stores, and the like were derived from the physical database via views, creating logical data stores. Response times were excellent; in the few cases where a multi-table join could not be improved, a physical table was created and loaded via batch processing.
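
To make those tuning levers concrete, here is a minimal Teradata-style DDL sketch. The table, columns, and date range are hypothetical, and exact options vary by release, so treat it as an illustration of a primary index (distribution), partitioning, a secondary index, and statistics collection rather than production DDL.

  -- Hypothetical fact table; the PRIMARY INDEX drives row distribution across AMPs.
  CREATE TABLE sales_fact
  ( sale_id      BIGINT        NOT NULL
  , customer_id  INTEGER       NOT NULL
  , sale_date    DATE          NOT NULL
  , sale_amount  DECIMAL(18,2)
  )
  PRIMARY INDEX (customer_id)
  PARTITION BY RANGE_N (sale_date BETWEEN DATE '2009-01-01'
                                      AND DATE '2010-12-31'
                                      EACH INTERVAL '1' MONTH);

  -- Secondary index for access paths that do not use the primary index.
  CREATE INDEX (sale_date) ON sales_fact;

  -- Collected statistics feed the optimizer's distribution and join estimates.
  COLLECT STATISTICS ON sales_fact COLUMN customer_id;
  COLLECT STATISTICS ON sales_fact COLUMN sale_date;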
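
And a similarly hedged sketch of the maintenance and view points above, using the same hypothetical names: rebuilding a table to add a column instead of running ALTER TABLE, and exposing a logical data mart as a view over the physical table. A real procedure would also recreate indexes and statistics, re-grant access rights, and fit inside a load window.

  -- Add a column by rebuilding: create the new structure, copy the data, swap names.
  CREATE TABLE sales_fact_new
  ( sale_id      BIGINT        NOT NULL
  , customer_id  INTEGER       NOT NULL
  , sale_date    DATE          NOT NULL
  , sale_amount  DECIMAL(18,2)
  , sale_channel CHAR(10)                -- the newly added column
  )
  PRIMARY INDEX (customer_id);

  INSERT INTO sales_fact_new (sale_id, customer_id, sale_date, sale_amount)
  SELECT sale_id, customer_id, sale_date, sale_amount
  FROM   sales_fact;

  RENAME TABLE sales_fact     TO sales_fact_old;
  RENAME TABLE sales_fact_new TO sales_fact;

  -- Logical data mart: a view over the physical table, no data copied.
  CREATE VIEW v_monthly_sales AS
  SELECT customer_id,
         EXTRACT(YEAR  FROM sale_date) AS sale_year,
         EXTRACT(MONTH FROM sale_date) AS sale_month,
         SUM(sale_amount)              AS total_sales
  FROM   sales_fact
  GROUP  BY 1, 2, 3;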