
Friday, July 1, 2011

Testing data warehouse - Toolbox for IT Groups

My response to the posted question concerning QA testing of Data Warehouse ETL is:


More than just using independently developed SQL statements, I recommend writing separate programs in another programming or scripting language, such as Perl, to mimic what the ETL programs are supposed to be doing.

Since this approach demands the same effort from both the ETL development team and the QA development team, QA cannot be left to the end of the project as a last-minute "necessary evil," as it is in so many projects. Both teams need to work independently from the same specifications and, hopefully, arrive at the same results.

For example, I was the QA lead on a project where the ETL programs were written in Pro*C, moving data from a mainframe DB2 extract file set to a UNIX-based Oracle enterprise database. Using the same design specifications the Pro*C developers used, I wrote Perl scripts to create UNIX-based "tables" (flat-file stores) that mimicked the Oracle target tables.

A complete set of data extract files was created from the DB2 source. Using the same DB2 extract files, the developers ran their programs to load the Oracle target tables and I ran mine to load the UNIX flat files. I then extracted the data loaded into the Oracle tables and saved it to another set of UNIX files in the same format as my own. Finally, I compared the two sets of files with the UNIX diff command.
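
As a rough sketch of what the mimic scripts did, the following Perl reads a pipe-delimited DB2 extract, applies the mapped transformations, and writes a flat-file "table" that can later be diff'ed against a same-format dump of the Oracle target. The file names, field layout, and the trim-and-uppercase rule here are illustrative, not the actual project's specification.

    #!/usr/bin/perl
    # Sketch of the mimic approach: read a delimited DB2 extract, apply the
    # source-to-target rules, and write a flat-file stand-in for the Oracle
    # target table. Names and layout are illustrative only.
    use strict;
    use warnings;

    my $extract = 'customer_extract.dat';   # DB2 source extract (pipe-delimited)
    my $mimic   = 'customer_table.mimic';   # flat-file stand-in for the target

    open my $in,  '<', $extract or die "Cannot open $extract: $!";
    open my $out, '>', $mimic   or die "Cannot open $mimic: $!";

    while (my $line = <$in>) {
        chomp $line;
        my ($cust_id, $name, $email) = split /\|/, $line, 3;
        $email = '' unless defined $email;

        # Apply the same rules the ETL spec calls for; here, a hypothetical
        # trim-and-uppercase rule on the customer name.
        $name =~ s/^\s+|\s+$//g;
        $name = uc $name;

        print {$out} join('|', $cust_id, $name, $email), "\n";
    }
    close $in;
    close $out;

After the real ETL load, dump the Oracle table in the same pipe-delimited format, sort both files, and run diff against them.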

This comparison revealed two glaring errors that the developers admitted would not have been caught using just a small sample set of data extracts:

First, the Pro*C programs issued a batch commit every 2,000 rows, but the counter in the FOR loop was never reset, so only the first 2,000 rows were ever committed to the Oracle tables. When the source-to-target mapping predicted a million rows, ending up with 2,000 was a big problem!
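
To make the failure concrete, here is a hedged Perl rendering of that control flow; the real programs were Pro*C with embedded SQL, so the in-memory "commit" below is just a stand-in:

    use strict;
    use warnings;

    my @rows = (1 .. 5_000);        # stand-in for the incoming extract rows
    my $committed = 0;              # rows that actually reached the database
    my @pending;                    # rows inserted but not yet committed

    my $count = 0;
    for my $row (@rows) {
        push @pending, $row;        # "insert" the row (uncommitted)
        $count++;
        if ($count == 2_000) {
            $committed += @pending; # batch commit every 2,000 rows...
            @pending = ();
            # BUG: without "$count = 0;" here, $count climbs past 2,000,
            # the test never matches again, and all later rows are lost
            # when the uncommitted work is rolled back.
        }
    }
    print "committed $committed of ", scalar @rows, " rows\n";  # prints 2000

Resetting the counter after each commit (plus a final commit after the loop) is the one-line fix.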

After this was fixed, the next glaring error my Perl scripts caught was that the variable array in Pro*C was not being cleared properly between rows, so fields from the previous row carried over into subsequent rows. Thus a customer record that did not supply certain fields, say an email address, got the data in those fields from the previous customer record.
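
The same failure mode is easy to sketch in Perl, assuming a hypothetical pipe-delimited customer layout; the row buffer is reused without being cleared, so the second record silently inherits the first record's email address:

    use strict;
    use warnings;

    my %row;                               # reused row buffer, never reset (the bug)
    my @records = (
        "1001|Alice|alice\@example.com",   # sample rows; layout is illustrative
        "1002|Bob|",                       # Bob supplied no email address
    );

    for my $record (@records) {
        # BUG: %row should be emptied here (%row = ();) before each record.
        my @fields = split /\|/, $record;
        $row{cust_id} = $fields[0] if defined $fields[0] && length $fields[0];
        $row{name}    = $fields[1] if defined $fields[1] && length $fields[1];
        $row{email}   = $fields[2] if defined $fields[2] && length $fields[2];

        print "$row{cust_id} $row{name} $row{email}\n";
    }

The second line of output shows Bob with Alice's email address, exactly the kind of carry-over the diff comparison exposed.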

Since these programs were written from the same design specifications, but independently, one of our checks was to exchange code: I checked their Pro*C code and they checked my Perl code to see where each problem lay.

Simply sending a few hundred rows of mocked-up test data to the Oracle target and checking that it matched expectations was not adequate testing and would not have revealed these serious bugs.

Monday, December 13, 2010

Programming using Scripting Languages

It might seem like I'm digressing from my "data-centric" focus, but, really, I don't think so. I have had to write scripts to automate many "data-centric" tasks over the years; most of these scripts contained embedded SQL for data manipulation (DML), and many were scheduled with UNIX cron, Windows AT, or some other scheduler. For example:

  • Transfer data files from one location to another
  • Load data files directly into database staging tables
  • Extract data to create data files to be sent to clients or vendors using FTP
  • Create reports
  • Migrate data from a mainframe to regional client/server databases
Granted, this work was done before modern CASE tools took over many of these tasks, but you still might need to do tasks like these on an impromptu, urgent basis.  Or maybe the client doesn't have one of the CASE tools and doesn't want to spend the money to get one - just do it!
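
For a flavor of what those scripts looked like, here is a minimal Perl DBI sketch of the "load data files into staging tables" task; it assumes the DBD::Oracle driver is available, and the DSN, credentials, table name, and file layout are all placeholders rather than anything from a real client system.

    #!/usr/bin/perl
    # Minimal sketch: load a pipe-delimited extract file into a staging table.
    # Connection details and the stg_customer table are placeholders.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:Oracle:PRODDB', 'stage_user', 'secret',
                           { RaiseError => 1, AutoCommit => 0 });

    my $sth = $dbh->prepare(
        'INSERT INTO stg_customer (cust_id, name, email) VALUES (?, ?, ?)'
    );

    open my $fh, '<', 'customer_extract.dat' or die "Cannot open extract: $!";
    while (my $line = <$fh>) {
        chomp $line;
        $sth->execute(split /\|/, $line, 3);  # three pipe-delimited fields
    }
    close $fh;

    $dbh->commit;      # one commit at the end; batch it for very large files
    $dbh->disconnect;

A crontab entry along the lines of 0 2 * * * /usr/local/bin/load_stg_customer.pl would then run it nightly, unattended.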

On UNIX systems, I suggest learning Korn Shell (ksh).  You will be surprised at how robust a programming language it is.  Now I know some people will say it is not a "programming" language because it isn't compiled, but the skills needed to write good scripts are the same as those needed to write good compiled programs.  Compiling is mainly done to make a program run faster, and there are tools that will compile ksh scripts, such as shc.

Another favorite of mine is Perl.  Scripts can be written for both UNIX and Windows systems (as well as many others) with little change needed between systems, so it is quite portable.  Perl scripts can be compiled, too, using commercial programs such as Perl2EXE.
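
As a small example of that portability, the core File::Spec module builds file paths with the correct separator for whatever system the script finds itself on, so the same code runs unchanged on UNIX and Windows (the directory and file names here are just illustrative):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Spec;   # core module; ships with Perl on every platform

    # Builds data/extracts/customer_extract.dat on UNIX and
    # data\extracts\customer_extract.dat on Windows.
    my $dir  = File::Spec->catdir('data', 'extracts');
    my $file = File::Spec->catfile($dir, 'customer_extract.dat');

    print "Reading $file on $^O\n";   # $^O names the OS: linux, MSWin32, ...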