Tuesday, July 12, 2011

Is the relational database doomed?


Have you ever been asked about document-oriented or key-value (KV) data stores and databases?  The article at the link above gives a good background on them.

Start with a data warehouse or a data mart?

This discussion on LinkedIn presents many ideas about how to start a Business Intelligence (BI) project when there is no existing data warehouse or data mart.  Few companies today lack some sort of BI solution, so this question won't need to be answered very often.

But, in the case where a brand new BI solution is being developed, here are my preferences:
  1. Design the BI solution for the future.
    This means planning ahead so that whatever you start with accounts for future enterprise data standards, naming conventions, global uniqueness, data quality, and data governance.  Do not operate in a business silo.
  2. ROI is important.
    While it is difficult to prove a return on investment for an intangible product such as BI, most business sponsors of a BI solution have a niche perspective and want results for it quickly.  Building an Enterprise Data Warehouse (EDW) averages about 18 months, whereas a focused business data mart (DM) can be completed in six months or less.
  3. Build both the DW (to become the EDW) and the DM.
    When starting a new BI solution, don't treat it as an "either / or" decision.  A DM should get its data from the data warehouse (DW), which should be designed to grow into the future EDW.  So, bring the master and transactional data from the source-of-record (SOR) database systems into a small DW first.  This small DW may cover only a few subject areas yet still be the beginning of the future EDW.  From the small DW, which should be designed as a normalized model in at least 3NF, create the DM as a star (dimensional) model.  The advantage of this approach is that the SOR systems only need to populate a single DW (the future EDW), and every DM created is sourced from the same source of truth.  They all start with the same data, which reduces the chances of different DMs reporting different metrics.  A rough sketch of such a DM load follows this list.
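
To make that last point concrete, here is a minimal sketch, in Perl with DBI, of a DM load sourced from the normalized DW.  Every schema, table, and column name here is invented for illustration; a real load would run through the ETL tool and would also handle surrogate keys, slowly changing dimensions, and incremental extracts.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    # Hypothetical connection; the DW and DM could just as easily be
    # separate schemas in one database or two separate databases.
    my $dbh = DBI->connect( 'dbi:Oracle:DWDB', 'dm_loader', 'secret',
                            { RaiseError => 1, AutoCommit => 0 } );

    # Populate the DM customer dimension from the normalized (3NF) DW tables.
    $dbh->do(q{
        INSERT INTO dm.customer_dim (customer_key, customer_name, region_name)
        SELECT c.customer_id, c.customer_name, r.region_name
        FROM   dw.customer c
        JOIN   dw.region   r ON r.region_id = c.region_id
    });

    # Populate the DM sales fact; every DM pulls from the same DW tables,
    # so every DM starts from the same version of the data.
    $dbh->do(q{
        INSERT INTO dm.sales_fact (customer_key, date_key, sales_amount)
        SELECT s.customer_id,
               TO_NUMBER(TO_CHAR(s.sale_date, 'YYYYMMDD')),
               s.sale_amount
        FROM   dw.sales s
    });

    $dbh->commit;
    $dbh->disconnect;

The point is only that every DM reads from the DW, never directly from the SOR systems, so all of them start from the same version of the data.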

Friday, July 1, 2011

Testing data warehouse - Toolbox for IT Groups


My response to the posted question concerning QA testing of Data Warehouse ETL is:


More than just using independently developed SQL statements, I recommend writing a separate set of programs in another programming or scripting language, such as Perl, that mimic what the ETL programs are supposed to be doing.

Since the QA development team will need roughly the same effort as the ETL development team, QA cannot be left to the end as a last-minute "necessary evil," as it is on many projects. Both teams need to work independently and should, ideally, arrive at the same results.

For example, I led QA on a project where the ETL programs were written in Pro*C, moving data from a set of mainframe DB2 extract files to a UNIX-based Oracle enterprise database. Using the same design specifications that the Pro*C developers used, I wrote Perl scripts to create UNIX-based "tables" (flat-file stores) that mimicked the Oracle target tables.
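
The original scripts are long gone, but the general shape was something like the sketch below: read the delimited DB2 extract, apply the transformations called out in the source-to-target mapping, and write one flat "table" file per Oracle target. The field layout and the transformations shown here are made up for illustration.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical layout: a pipe-delimited DB2 extract of customer records.
    # A real script follows the source-to-target mapping field by field.
    my ( $extract_file, $table_file ) = ( 'customer_extract.dat', 'customer_table.dat' );

    open my $in,  '<', $extract_file or die "Cannot read $extract_file: $!";
    open my $out, '>', $table_file   or die "Cannot write $table_file: $!";

    while ( my $line = <$in> ) {
        chomp $line;
        my ( $cust_id, $last_name, $first_name, $email ) = split /\|/, $line, -1;

        # Apply the same transformations the mapping gives the ETL developers,
        # e.g. trim blanks and upper-case the name fields.
        for ( $last_name, $first_name ) { s/^\s+|\s+$//g; $_ = uc }

        # Write the row in the same column order as the Oracle target table
        # so it can later be compared against an export of that table.
        print {$out} join( '|', $cust_id, $last_name, $first_name, $email // '' ), "\n";
    }

    close $in;
    close $out;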

A complete set of data extract files was created from the DB2 source. Using the same DB2 extract files, the developers ran their programs to load the Oracle target tables, and I ran mine to load the UNIX files. I then exported the data loaded into the Oracle tables to another set of UNIX files, in the same format as my own, and compared the two sets of files with the UNIX diff command.
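
The comparison itself was just the diff step; something along the lines of the sketch below, with the table and file names invented here. Sorting both sides first keeps row-order differences within a load from showing up as false mismatches.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Compare the Perl-built "table" files against the files exported from Oracle.
    for my $table (qw( customer account address )) {
        system("sort ${table}_table.dat  > ${table}_perl.sorted")   == 0 or die "sort failed: $?";
        system("sort ${table}_oracle.dat > ${table}_oracle.sorted") == 0 or die "sort failed: $?";

        my $rc = system("diff ${table}_perl.sorted ${table}_oracle.sorted > ${table}.diff");
        print $rc == 0 ? "$table: match\n"
                       : "$table: differences written to ${table}.diff\n";
    }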

By doing this, I revealed two glaring errors that the developers said would not have been caught using just a small sample set of data extracts:

First, the Pro*C programs were doing batch commits every 2,000 rows, but the program was not resetting the counter in the FOR loop. Thus, only the first 2,000 rows were getting written to the Oracle tables. If the source-to-target mapping predicted a million rows, ending up with 2,000 rows was a big problem!
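
The Pro*C source isn't reproduced here, but the shape of the bug is easy to simulate in a few lines of Perl: with an equality test and no reset, the commit fires exactly once, and everything inserted after row 2,000 is lost when the program ends.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Simulate the buggy batch-commit loop with plain counters instead of a
    # database handle: "committed" rows are the only ones that survive.
    my ( $count, $inserted, $committed ) = ( 0, 0, 0 );

    for my $row ( 1 .. 1_000_000 ) {
        $inserted++;                 # stands in for the INSERT
        $count++;
        if ( $count == 2000 ) {
            $committed = $inserted;  # stands in for the COMMIT
            # Missing in the buggy program: $count = 0;
            # Without the reset, $count never equals 2,000 again, no further
            # commits happen, and the uncommitted rows are lost on exit.
        }
    }

    print "rows inserted: $inserted, rows committed: $committed\n";
    # Prints: rows inserted: 1000000, rows committed: 2000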

After this was fixed, the next glaring error my Perl scripts caught was that the variable array in the Pro*C program was not being cleared between rows, so fields that were empty in the current row kept the values from the previous row. Thus, customer records that did not supply certain fields, say an email address, picked up the data in those fields from the previous customer's record.
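
Again sketched in Perl rather than Pro*C, this is the classic reused-buffer problem: a record structure declared once, outside the loop, keeps whatever the previous row put into it, so optional fields that are empty in the current source row silently inherit the previous row's values. The sample data is made up.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Two source rows; the second customer supplies no email address.
    my @rows = (
        "1001|SMITH|JOHN|jsmith\@example.com",
        "1002|JONES|MARY|",
    );

    my %rec;    # declared once, outside the loop -- this is the bug

    for my $line (@rows) {
        my @f = split /\|/, $line, -1;

        # Only non-empty fields are assigned, and the buffer is never cleared.
        $rec{cust_id} = $f[0];
        $rec{last}    = $f[1];
        $rec{first}   = $f[2];
        $rec{email}   = $f[3] if length $f[3];

        print "$rec{cust_id} $rec{last} $rec{first} email=$rec{email}\n";
    }
    # Customer 1002 prints jsmith@example.com, inherited from customer 1001.
    # Declaring "my %rec" inside the loop (or clearing it each pass) is the fix.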

Since both sets of programs were written from the same design specifications, but independently, one of the checks when the outputs differed was to swap code: I checked their Pro*C code and they checked my Perl code to see where the problem was.

Simply sending a few hundred rows of mocked-up test data to the Oracle target and checking that it matched expectations was not adequate testing and would not have revealed these serious bugs.