
Friday, July 1, 2011

Testing data warehouse - Toolbox for IT Groups

My response to the posted question concerning QA testing of Data Warehouse ETL is:


More than just using independently developed SQL statements, I recommend writing a separate set of programs in another programming or scripting language, such as Perl, that mimic what the ETL programs are supposed to do.

Since this takes roughly the same effort as the ETL development itself, QA cannot be left to the end as the last-minute "necessary evil" it becomes on many projects. Both teams need to work independently from the same specifications and should, ideally, arrive at the same results.

For example, I led QA on a project where the ETL programs were written in Pro-C, moving data from a set of mainframe DB2 extract files to a UNIX-based Oracle enterprise database. Using the same design specifications the Pro-C developers used, I wrote Perl scripts that created UNIX-based "tables" (file stores) meant to mimic the Oracle target tables.

A complete set of data extract files was created from the DB2 source. Using the same DB2 extract files, the developers ran their programs to load the Oracle target tables and I ran mine to load the UNIX files. I then extracted the data from the loaded Oracle tables into another set of UNIX files in the same format as mine. Using the UNIX diff command, I compared the two sets of files.
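
To give a feel for the comparison step, here is a minimal sketch of the kind of script that could drive it, assuming the Perl-built file stores and the Oracle extracts land in two directories as pipe-delimited files, one file per table. The directory names and the .dat suffix are illustrative assumptions, not details from the actual project.

```perl
#!/usr/bin/perl
# Sketch of the comparison step: diff each QA-built "table" file against the
# file extracted from the matching Oracle target table.
# Directory names and the .dat suffix are illustrative assumptions.
use strict;
use warnings;

my $perl_dir   = 'qa_tables';       # file stores built by the QA Perl scripts
my $oracle_dir = 'oracle_extract';  # files unloaded from the Oracle tables

opendir my $dh, $perl_dir or die "Cannot open $perl_dir: $!";
my @tables = sort grep { /\.dat$/ } readdir $dh;
closedir $dh;

for my $table (@tables) {
    my $qa_file = "$perl_dir/$table";
    my $db_file = "$oracle_dir/$table";

    if (!-e $db_file) {
        print "MISSING: no Oracle extract for $table\n";
        next;
    }

    # Sort both files so row order cannot cause false differences,
    # then let diff report any rows that do not match.
    system("sort -o /tmp/qa_sorted.dat $qa_file") == 0 or die "sort failed: $?";
    system("sort -o /tmp/db_sorted.dat $db_file") == 0 or die "sort failed: $?";

    my $diff = `diff /tmp/qa_sorted.dat /tmp/db_sorted.dat`;
    print $diff ? "DIFFERS: $table\n$diff" : "OK: $table\n";
}
```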

By doing this, I revealed two glaring errors that the developers said would not have been caught using just a small sample set of data extracts:

First, the Pro-C programs were doing batch commits every 2,000 rows, but the program was not resetting the counter in the FOR loop. Thus, only the first 2,000 rows were written to the Oracle tables. If the source-to-target mapping predicted a million rows, ending up with only 2,000 was a big problem!
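
The logic error is easy to reproduce in any language. The sketch below is a Perl stand-in for the flawed loop (the real code was Pro-C with embedded SQL); the row count and the simulated commit are illustrative only.

```perl
#!/usr/bin/perl
# Perl stand-in for the batch-commit bug: the counter is compared to the
# batch size but, in the buggy version, never reset, so the equality test
# fires only once and only the first 2,000 rows are ever committed.
use strict;
use warnings;

my $batch_size = 2_000;
my $count      = 0;
my $committed  = 0;

for my $row_number (1 .. 10_000) {    # pretend source rows
    # ... insert the row into the target table here ...
    $count++;

    if ($count == $batch_size) {
        $committed += $count;         # stand-in for the COMMIT
        print "committed through row $row_number\n";
        $count = 0;                   # the missing reset: drop this line and
                                      # the test above never succeeds again
    }
}
$committed += $count;                 # commit any final partial batch
print "total rows committed: $committed\n";
```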

After this was fixed, the next glaring error my Perl scripts caught was that the variable array in Pro-C was not being cleared properly between rows. Fields not supplied by the current row kept the values from the previous row. Thus, customer records that were missing certain fields, say an email address, picked up the data in those fields from the previous customer's record.
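
Again, the pattern behind the bug fits in a few lines of Perl; the field names and sample rows below are made up for illustration, not taken from the project.

```perl
#!/usr/bin/perl
# Perl stand-in for the carry-over bug: the per-row buffer is reused, so a
# field missing from the current record silently keeps the previous
# record's value unless the buffer is cleared on every pass.
use strict;
use warnings;

my @source = (
    '1001|Alice|alice@example.com',
    '1002|Bob|',                      # Bob supplied no email address
);

my @names = qw(cust_id name email);
my %row;                              # reused buffer, like the Pro-C host array

for my $line (@source) {
    %row = ();                        # the fix: clear the buffer every pass
    my @fields = split /\|/, $line;   # trailing empty fields are dropped
    $row{ $names[$_] } = $fields[$_] for 0 .. $#fields;

    # Without the %row = () above, Bob would inherit Alice's email here.
    printf "%s %s email=%s\n", $row{cust_id}, $row{name}, $row{email} // '(none)';
}
```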

Since the two sets of programs were written independently from the same design specifications, one of the checks was to exchange code: I reviewed their Pro-C code and they reviewed my Perl scripts to pin down where each problem lay.

Simply sending a few hundred rows of mocked-up test data to the Oracle target and checking that it matched expectations was not adequate testing and would not have revealed these serious bugs.

Tuesday, March 16, 2010

Company Acquisition and Data Integration / Migration for EDW

Scenario

Company A purchases Company B. Company A needs to integrate Company B's operational systems into its Enterprise Data Warehouse, and possibly into many of A's data marts; Company B's EDW will be retired. Companies A and B have different application systems that do the same jobs: General Ledger, Order Management, Human Resources, etc.

Enterprise Data Warehouse

This post will focus on the Enterprise Data Warehouse (EDW). Many more data stores will need to be considered, but we'll start here.

Planning

What needs to be planned for? What order do tasks need to be done in? What risk factors need to be considered?

  1. Understand Company A's source data architecture and metadata.
  2. Understand Company B's source data architecture and metadata.
  3. Company B may use a different database system (application software and database management system) than Company A does for the same business function. All businesses need to track their financial picture. Most will have some General Ledger (GL) application to do that. If A and B do not use the same GL application, we need to understand how B's system works and stores data and how that relates to A's system. We need to understand the metadata.

    To do this, we will need to work with Company B's subject matter experts (SME). We will need B's source system's entity relationship diagram(s) (ERD) and data dictionaries. We also need to know what data items are brought into the acquisition's EDW (assuming they have one). Just because they bring something into their EDW doesn't tell us why they bring it in. So we will need to understand their business requirements for bringing data into their EDW.

    So, the list of documents / information we have to get for the source system is:

    1. Database ERD (Physical Data Model (PDM))
    2. Data dictionary
    3. Source-to-target mappings from source database to their EDW
    4. Business rules for data brought to EDW. Why does the business need it?

  4. What data from A's source goes into A's EDW?
    1. Why?
    2. How used?
    3. Transformations?
  5. What data from B's source goes into B's EDW?
    1. Why?
    2. How used?
    3. Transformations?
  6. Gap analysis
    1. What data from A's source to EDW is missing from B's source to EDW?
    2. What data from B's source to EDW is missing from A's source to EDW?
    3. Why does B need "extra" data?
    4. Why doesn't B use A's "extra" data?
    5. Why doesn't A use B's "extra" data?
  7. Data profiling (see the profiling sketch after this list)
    1. Which tables and columns in B's source will be coming to A's EDW?
    2. Which tables and columns in A's source hold the same data as in B's source?
    3. Which columns in B's source have different data types, lengths, values, etc. than those found in A's source?
  8. Data mart analysis (assume data marts are sourced from EDW)
    1. What data marts does B have that A doesn't have?
    2. What data marts does A have that B doesn't have?
    3. Will existing data marts in A support B's needs?
    4. What new data marts will A need to support B?
  9. Report analysis

Expand on these or add more.
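
For step 7 (data profiling), a quick first pass can be made with a small script over each extract file before touching the databases themselves. The sketch below assumes a pipe-delimited extract with a header row; the file layout is an assumption, and a real profiling effort would also query the source catalogs for declared types and lengths.

```perl
#!/usr/bin/perl
# Rough profiling pass over one pipe-delimited extract file: report row count,
# maximum length, null count, and distinct-value count per column so B's
# columns can be compared with the corresponding columns in A's source.
use strict;
use warnings;

my $file = shift @ARGV or die "usage: $0 extract_file.dat\n";
open my $fh, '<', $file or die "Cannot open $file: $!";

my $header = <$fh>;
chomp $header;
my @columns = split /\|/, $header;

my (@max_len, @null_count, @distinct);
my $rows = 0;

while (my $line = <$fh>) {
    chomp $line;
    my @fields = split /\|/, $line, -1;
    $rows++;
    for my $i (0 .. $#columns) {
        my $value = $fields[$i] // '';
        if ($value eq '') {
            $null_count[$i]++;
        } else {
            $max_len[$i] = length $value
                if !defined $max_len[$i] || length $value > $max_len[$i];
            $distinct[$i]{$value} = 1;
        }
    }
}
close $fh;

print "rows: $rows\n";
for my $i (0 .. $#columns) {
    printf "%-20s max_len=%-4d nulls=%-6d distinct=%d\n",
        $columns[$i],
        $max_len[$i]    // 0,
        $null_count[$i] // 0,
        scalar keys %{ $distinct[$i] // {} };
}
```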

Wednesday, February 24, 2010

Data Integration

“Data integration involves combining data residing in different sources and providing users with a unified view of these data.” [Maurizio Lenzerini (2002). "Data Integration: A Theoretical Perspective". PODS 2002. pp. 233-246.] {emphasis added}

Data integration can be accomplished at an enterprise level, departmental level, or lower level.
The most common method of “data integration” is the Data Warehouse (DW). A DW can exist at an Enterprise level (EDW) or at lower hierarchical levels. At these lower levels, DW is really a misnomer as these data stores provide data for specific purposes whereas a DW, especially at the enterprise level, should provide a non-application-specific “single source of truth.”

Data stores below the enterprise level are more appropriately called Data Marts (DMs). DMs are designed to answer specific business questions or serve a specific departmental need. These DMs may be populated directly from the operational databases or, more commonly today, from the EDW, reinforcing the “single source of truth” concept.

I have done “data integration” my entire data career, but it has only recently become a named activity; the paper cited above (2002) is the oldest publication I have found that uses the term.

The process of getting data items from the source systems into the DW is complicated and requires significant research, analysis, and understanding of the source system's data stores. This has recently become known as “Data Profiling” but has been done “as a matter of fact” since data stores were first created. The actual movement of data is done by ETL procedures.

ETL means “Extract, Transform, and Load,” although the Transform step can be done as part of the Extract step, as part of the Load step, or separately from both.
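
As a concrete (if toy) illustration, the sketch below performs all three steps in one small script: it extracts rows from a delimited source file, applies a simple transformation, and writes a load-ready file. The file names, record layout, and date rule are assumptions; in practice the load step would use a bulk loader or a database API rather than a flat file.

```perl
#!/usr/bin/perl
# Toy ETL sketch: extract from a delimited source file, transform a couple
# of fields, and write a load-ready file for the target.
use strict;
use warnings;

open my $src, '<', 'source_extract.dat' or die "extract failed: $!";
open my $out, '>', 'load_ready.dat'     or die "cannot write: $!";

while (my $line = <$src>) {
    chomp $line;
    my ($cust_id, $name, $order_date, $amount) = split /\|/, $line, -1;

    # Transform: trim the name and convert MM/DD/YYYY to YYYY-MM-DD.
    $name =~ s/^\s+|\s+$//g;
    if ($order_date =~ m{^(\d{2})/(\d{2})/(\d{4})$}) {
        $order_date = "$3-$1-$2";
    }

    # "Load": here simply written to a file a bulk loader would pick up.
    print {$out} join('|', $cust_id, $name, $order_date, $amount), "\n";
}

close $src;
close $out;
```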

The term “data integration” has recently been expanded to include presenting data directly from several operational source systems in a single presentation layer to provide real-time information. Customer information such as addresses and phone numbers may be kept in one data store, customer sales in a second, and customer credit card information in a third. Gathering pieces of data from these three data stores into a single presentation layer, such as a real-time Web-based GUI, can give customer service personnel the ability to provide accurate and timely help in resolving disputes while enabling instant updates and corrections to the data in all three data stores.

Getting real-time data from operational data stores may be detrimental to the source systems' performance. Therefore, “near real time” has become the norm. Data can be exported from the operational data stores into an Operational Data Store (ODS) at very short intervals. However, the usability of that data needs to be coordinated so that data from one source does not fall out of sync with data from another. If three data stores supply data to the ODS, must all three be refreshed before the updated data is usable? Many synchronization problems must be addressed so that data from the source systems remain usable and so that updates made in the ODS are sent back to the proper source system in a timely manner. That said, most ODS implementations do not allow direct updates to their data, requiring changes to be made in the source system and then picked up by the ODS the next time it is refreshed.