Sunday, March 28, 2010

Data Modeling


Data Modeling is a way of describing the tables (entities) and relationships in a database schemata (see other blogs with "schema" tags). (By the way, both "modeling" and "modelling" are correct spellings. The first is more American English and the second is British English; the British claim they are right.) Because a "picture speaks a thousand words", data models help database users to more easily understand the schemata they are interested in.
There are many kinds of models, but the three most common are: Conceptual, Logical, and Physical.

Data Model Hierarchy

CONCEPTUAL

A conceptual data model is the highest level data model. It is usually created early in a database project as, or shortly after, data requirements have been gathered in what is known as the Plan and Analyze phase of the SDLC. It shows the major data entities the project needs and how they relate to each other. Entities include objects like Customers, Employees, Products, Sales, Orders, etc. Sometimes, some entities are made to be a child entity of a broader entity like Customers could be a child entity of People. The names used to describe an entity are the ones a business would use, such as Customer, not the names the table in the physical data schemata would be, such as TB_CUST.
A good definition of Conceptual Data Modeling can be found in "Conceptual Modeling - ER 2002" by Stefano Spaccapietra and Salvatore March (Springer)
"The task of conceptual modelling is to build a high-quality representation of selected phenomena in some domain. The conceptual models that result facilitate the implementation, operation, and maintenance of information systems."
 LOGICAL

The next level in data modeling is the Logical data model. It is created in the Design phase of the SDLC.

At this level, data items that define what makes up an entity are added. For example, a Customer entity will have data items such as Name, Address, City, State, ZIP, etc. These data items are called "attributes" of the entity.

The logical model will show how the entities relate to each other through common attributes. These relationships will be in business terms such as "An order has a customer" and "A customer may have one or more orders."

The relationships will be connected by lines that depict what is called the "cardinality" of the relationship, or how many elements there can be in the set. For example, every Order must have one and only one customer; however, a customer can have zero or many current orders. The Customer-to-Order relationship is an optional one-to-many cardinality, while the Order-to-Customer relationship is a mandatory one-to-one cardinality.

PHYSICAL

The physical data model is the graphic depiction of the actual database schemata that contains the entities (tables) and attributes (columns) specified in the Logical data model. Sadly, many database schemata are built directly without the planning and design models being created. The physical data model should be the result of collaboration between the Data Architect (or Modeler) and the Database Administrator (DBA). The Data Architect knows what data entities and attributes are needed by the application and how they relate to each other. The DBA knows the database management system (DBMS or RDBMS) in which the data will be stored. Together, these two can design and build the physical database schemata that will provide the fastest I/O performance, be maintainable by the DBA, and will be scalable for future growth. (Scalability is a term used to indicate how well the system will be able to handle increased data volumes and more users of the database.)

Data Modeling Tools

In the beginning, these models were created by hand drawings using stencils for the limited number of shapes and lines used at the time. As models became more complicated, software to create data models was developed. At first, these software applications were no more than diagramming tools.

These simple diagramming tools have become sophisticated data modeling applications that included data dictionaries, metadata, scripting languages, requirements tracking, database definition language (DDL) generators and much more. These mega-modeling tools can be very expensive. The two most common ones that I know of and have used are Sybase's PowerDesigner and CA's Erwin.

These are great tools for large companies, but are there data modeling software tools available cheaply or freely for those who want to learn modeling skills or do data modeling for small to medium sized businesses?
Some database management systems (DBMS) come with there own data modeling tools. A couple DBMS can be downloaded freely for personal education and training. Both Oracle and MS SQL Server allow this and I believe they both have built-in data modeling tools, but it has been a long time since I used them myself so this may not be true any more.

Some people use Visio for data modeling, but others claim it is not a "real" data modeling tool. It certainly has limitations, but you can draw database schemata diagrams with it. I have found it to be useful for simple data models, but it has none of the extended features mentioned above.

Below are links to various web sites that list free and/or inexpensive data modeling tools:

Tuesday, March 16, 2010

Company Acquisition and Data Integration / Migration for EDW

Scenario

Company A purchases Company B. Company A needs to integrate Company B's operational systems into its Enterprise Data Warehouse and possibly many of A's data marts Company B's EDW will be retired. Company A and B have different application systems that do the same jobs: General Ledger, Order Management, Human Resources, etc.

Enterprise Data Warehouse

This blog will focus on the Enterprise Data Warehouse (EDW). There will be many more data stores that will need to be considered, but we'll start with this.

Planning

What needs to be planned for? What order do tasks need to be done in? What risk factors need to be considered?

  1. Understand Company A's source data architecture and metadata.
  2. Understand Company B's source data architecture and metadata.
  3. Company B may use a different database system (application software and database management system) than Company B does for the same business function. All businesses need to track their financial picture. Most will have some General Ledger (GL) application to do that. If A and B do not use the same GL application, we need to understand how B's system works and stores data and how that relates to A's system. We need to understand the metadata.

    To do this, we will need to work with Company B's subject matter experts (SME). We will need B's source system's entity relationship diagram(s) (ERD) and data dictionaries. We also need to know what data items are brought into the acquisition's EDW (assuming they have one). Just because they bring something into their EDW doesn't tell us why they bring it in. So we will need to understand their business requirements for bringing data into their EDW.

    So, the list of documents / information we have to get for the source system is:

    1. Database ERD (Physical Data Model (PDM))
    2. Data dictionary
    3. Source-to-target mappings from source database to their EDW
    4. Business rules for data brought to EDW. Why does the business need it?

  4. What data from A's source goes into A's EDW
    1. Why?
    2. How used?
    3. Transformations?
  5. What data from B's source goes into B's EDW
    1. Why?
    2. How used?
    3. Transformations?
  6. Gap analysis
    1. What data from A's source to EDW is missing from B's source to EDW?
    2. What data from B's source to EDW is missing from A's source to EDW
    3. Why does B need "extra" data?
    4. Why doesn't B use A's "extra" data?
    5. Why doesn't A use B's "extra" data?
  7. Data profiling
    1. Which tables and columns in B's source will be coming to A's EDW?
    2. Which tables and columns in A's source hold the same data as in B's source?
    3. Which column's in B's source have different data types, lengths, values, etc. than found in A's source.
  8. Data mart analysis (assume data marts are sourced from EDW)
    1. What data marts does B have that A doesn't have?
    2. What data marts does A have that B doesn't have?
    3. Will existing data marts in A support B's needs?
    4. What new data marts will A need to support B?
  9. Report analysis

Expand on these or add more.