Wednesday, February 24, 2010

Data Integration

“Data integration involves combining data residing in different sources and providing users with a unified view of these data.” [Maurizio Lenzerini (2002). "Data Integration: A Theoretical Perspective". PODS 2002. pp. 233-246.] {emphasis added}

Data integration can be accomplished at an enterprise level, departmental level, or lower level.
The most common method of “data integration” is the Data Warehouse (DW). A DW can exist at an Enterprise level (EDW) or at lower hierarchical levels. At these lower levels, DW is really a misnomer as these data stores provide data for specific purposes whereas a DW, especially at the enterprise level, should provide a non-application-specific “single source of truth.”

Data stores at less than the enterprise level are more appropriately called Data Marts (DM). DMs are designed to answer specific business questions or serve a specific departmental need. These DMs may be populated directly from the operational databases or, more commonly today, from the EDW, reinforcing the “single source of truth” concept.

I have done “data integration” my entire data career, but it has only recently become a named activity; the paper cited above (2002) is the oldest publication I have found that uses the term.

The process of getting data items from the source systems into the DW is complicated and requires significant research, analysis, and understanding of the source systems' data stores. This has recently become known as “Data Profiling” but has been done “as a matter of fact” since data stores were first created. The actual movement of data is done by ETL procedures.

ETL means “Extract, Transform, and Load,” although the Transformation step can be performed as part of the Extract step, as part of the Load step, or separately from both.
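
To make the steps concrete, here is a minimal sketch of an ETL job in Python; the in-memory databases, table names, and cleanup rules are all made up for illustration and simply stand in for whatever the real source and warehouse look like.

    # Minimal ETL sketch: extract rows from an operational source, transform them
    # in flight, and load them into a warehouse staging table. The in-memory
    # databases and table names are stand-ins for the real systems.
    import sqlite3

    source = sqlite3.connect(":memory:")
    source.executescript("""
        CREATE TABLE customers (cust_id INTEGER, cust_name TEXT, phone TEXT);
        INSERT INTO customers VALUES (1, '  pat smith ', '(555) 123-4567');
    """)

    target = sqlite3.connect(":memory:")
    target.execute("CREATE TABLE stg_customers (cust_id INTEGER, cust_name TEXT, phone TEXT)")

    def extract(conn):
        # E: pull raw customer rows from the operational source
        return conn.execute("SELECT cust_id, cust_name, phone FROM customers").fetchall()

    def transform(rows):
        # T: standardize names and strip non-digits from phone numbers
        return [(cid,
                 (name or "").strip().title(),
                 "".join(ch for ch in (phone or "") if ch.isdigit()))
                for cid, name, phone in rows]

    def load(conn, rows):
        # L: write the cleaned rows into the warehouse staging area
        conn.executemany("INSERT INTO stg_customers VALUES (?, ?, ?)", rows)
        conn.commit()

    load(target, transform(extract(source)))
    print(target.execute("SELECT * FROM stg_customers").fetchall())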

The term “data integration” has recently been expanded to include sharing data directly from several operational source systems through a single presentation layer to provide real-time information. Customer information such as addresses and phone numbers may be kept in one data store, while customer sales may be kept in a different data store. Customer credit card information may be in a third data store. Gathering pieces of data from these three data stores into a single presentation layer, such as a real-time Web-based GUI, can give customer service personnel the ability to provide accurate and timely help in resolving disputes while enabling instant updates and corrections to the data in all three data stores.
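
A rough sketch of that idea in Python appears below; the three store classes are made-up, in-memory stand-ins for the separate operational systems, and the field names are purely illustrative.

    # Sketch of a single "presentation layer" view assembled from three separate
    # data stores. Each store class below is an in-memory stand-in; in practice
    # each would wrap a real operational system's query interface.
    from dataclasses import dataclass, field

    class ContactStore:
        def get_contact(self, cust_id):
            return {"address": "123 Main St", "phone": "5551234567"}   # stubbed data

    class SalesStore:
        def get_recent_orders(self, cust_id):
            return [{"order_id": 42, "total": 19.95}]                  # stubbed data

    class CreditStore:
        def get_cards(self, cust_id):
            return [{"card": "****1111", "expires": "12/11"}]          # stubbed data

    @dataclass
    class CustomerView:
        customer_id: str
        contact: dict = field(default_factory=dict)
        sales: list = field(default_factory=list)
        credit_cards: list = field(default_factory=list)

    def build_customer_view(cust_id, contacts, sales, credit):
        # Gather the pieces of one customer's data into a single view object.
        return CustomerView(
            customer_id=cust_id,
            contact=contacts.get_contact(cust_id),
            sales=sales.get_recent_orders(cust_id),
            credit_cards=credit.get_cards(cust_id),
        )

    view = build_customer_view("C-1001", ContactStore(), SalesStore(), CreditStore())
    print(view)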

Getting real-time data from operational data stores may be detrimental to the source systems' performance. Therefore, “near real time” has become the norm. Data can be exported from the operational data stores into an Operational Data Store (ODS) at very short intervals. However, the usability of the data needs to be coordinated so that data from one source does not become out of sync (not synchronized) with data from another source. If three data stores supply data to the ODS, must all three be updated before the updated data is usable? Many synchronization problems must be addressed so that data from the source systems is usable and updates in the ODS are sent back to the proper source system in a timely manner. However, most ODS implementations do not allow direct updates to their data, requiring changes to be made directly to the source system and then synchronized at the ODS the next time it is refreshed.
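
One simple way to reason about the synchronization question is a “watermark” that only advances as far as the slowest feed; the feed names and timestamps in this little Python sketch are made up.

    # Sketch of a synchronization gate for an ODS fed by several sources.
    # Each source records the timestamp of its latest successful load; data is
    # treated as usable only up to the point that *every* source has loaded.
    from datetime import datetime

    def usable_watermark(last_load_times):
        # Anything newer than the slowest feed's watermark is considered out of sync.
        return min(last_load_times.values())

    last_loads = {
        "contact_feed": datetime(2010, 2, 24, 8, 15),
        "sales_feed":   datetime(2010, 2, 24, 8, 10),
        "credit_feed":  datetime(2010, 2, 24, 8, 5),
    }
    print("ODS data usable through:", usable_watermark(last_loads))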

Friday, February 12, 2010

The Need For Data Profiling

A new discussion was started on LinkedIn, White Paper: The Data-Driven Enterprise. Just in case you can't open the LinkedIn discussion itself, the discussion references another URL for the white paper's source.

The referenced article begins with:
Data is one of your enterprise’s most valuable assets. So why are so many organisations still accepting business decisions based on inconsistent, unreliable data as being “good enough”?

The data being discussed relates to warehousing the operational systems' data.

I have found in my career that most IT managers do not understand the premise of "Garbage In, Garbage Out" well enough to want to fix the problem at the source. Most operational database systems have evolved over the years into, quite honestly, coding nightmares. As problems were discovered with data and new data needed to be added, the quickest way to deal with them was to change the "code." The code could be in the form of Extract/Transform/Load (ETL) batch processing scripts, stored procedures or functions, or application program interfaces (APIs).

APIs control the way database applications and users interface with the database. When problems arise with the data, the API can be modified to work around the bad data by "fixing" it on the fly. When new data needs to be added, an unused column in an existing table can be "re-purposed" by putting the new data in that location and having the API control access to that column, so no one really needs to know which table and column the data is actually stored in.
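
Here is a small, made-up Python example of the kind of API workaround I mean, where a generic column secretly holds new data and bad historical values are patched on the fly; the column and code values are purely hypothetical.

    # Sketch of an API-level workaround: a re-purposed generic column is hidden
    # behind an accessor, and bad or inconsistent values are "fixed" on the fly.
    # The table layout and encodings are hypothetical.

    def get_loyalty_tier(row):
        # The loyalty tier never got its own column; it was stuffed into the
        # unused generic column "A" of the CUSTOMER table.
        raw = row.get("A")             # nobody outside the API knows this mapping
        if raw in ("G", "GOLD", "1"):  # historical encodings patched over the years
            return "GOLD"
        if raw in ("S", "SILVER", "2"):
            return "SILVER"
        return "STANDARD"              # anything else is silently defaulted

    row = {"cust_id": 7, "cust_name": "Smith, Pat", "A": "G"}  # "A" secretly holds the tier
    print(get_loyalty_tier(row))       # -> GOLD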

I have seen databases that contained tables with columns such as "A", "B", etc. These columns were put in the table because it seemed likely that new data items would be needed in the future, and this would accommodate them without having to add new columns. Some databases have also been created with extra tables to accommodate new entities, such as value lists, without having to create a new table. This practice is especially common in proprietary databases that are sold to customers but must be modified to meet each customer's needs. By using this generic database schema, the vendor can update the database and associated APIs without having to worry about each of its customers' modifications.

When this operational data needs to be stored in a data warehouse, business users, managers, and C-level executives don't understand the importance of data profiling and data quality analysis. They feel that the data in the source operational system, and the information that data provides, just needs to be moved to the data warehouse, a very simple operation. What they don't understand is the years of data manipulation that have been done within the database applications to provide that "data quality."

Also, the original documentation, if any, has not been kept updated with all the programming and/or database changes made. Original requirements documentation may be missing or out-of-date; new requirements may be in disconnected pieces of documentation that can only be found by someone who worked on those specific projects years ago.

Many times, the only data models of these legacy source systems are the ones that will be reverse-engineered from the existing database; metadata will be missing. If the original requirements and design specifications can be found, metadata may be reverse-engineered from them, but, in most cases, inferences and "best guesses" will have to be made.

Significant data profiling will be required to ensure that only good data is brought into the data warehouse. Just because a table or column is populated in the operational source does not mean it is even used in the current version; it may be just a "leftover" that can't be removed for legacy reasons and to minimize upgrade time.
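
A very small Python example of the kind of profiling I mean is shown below, using a made-up table; real profiling goes much further, but even simple fill rates and distinct-value counts expose "leftover" columns quickly.

    # Minimal data-profiling sketch: for each column, report how often it is
    # populated and how many distinct values it holds, so unused or "leftover"
    # columns can be spotted before ETL design. The sample table is made up.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (cust_id INTEGER, cust_name TEXT, phone TEXT, A TEXT, B TEXT);
        INSERT INTO customers VALUES (1, 'Pat Smith', '5551234567', 'G', NULL);
        INSERT INTO customers VALUES (2, 'Lee Jones', NULL,         'S', NULL);
    """)

    def profile_table(conn, table, columns):
        total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        for col in columns:
            populated, distinct = conn.execute(
                f"SELECT COUNT({col}), COUNT(DISTINCT {col}) FROM {table}"
            ).fetchone()
            pct = 100.0 * populated / total if total else 0.0
            print(f"{table}.{col}: {pct:.0f}% populated, {distinct} distinct value(s)")

    profile_table(conn, "customers", ["cust_name", "phone", "A", "B"])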

Codes and values may not have lookup tables in the operational system; they may all be hard-coded in the API. Foreign keys (FK) may not be enforced (sometimes called "logical" FKs); I have even seen so-called child tables in operational database systems that get populated before the parent table is populated! These orphaned child records are not really orphans; they are placeholders for transactional history until the parent transaction (an order, for example) has met some requirement, such as at least one order item record having been created or the order having been tendered.
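
Here is a made-up illustration, in Python over an in-memory database, of checking a "logical" foreign key for orphaned child rows; the schema is invented to mirror the order / order-item situation described above.

    # Sketch of a check for "logical" foreign keys the database does not enforce:
    # find child rows whose parent row does not (yet) exist. The tiny schema
    # below is made up to illustrate the idea.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (order_id INTEGER PRIMARY KEY, tendered INTEGER);
        CREATE TABLE order_items (item_id INTEGER PRIMARY KEY, order_id INTEGER, sku TEXT);
        -- an "orphan" placeholder: the item exists before its parent order does
        INSERT INTO order_items (order_id, sku) VALUES (101, 'ABC-1');
    """)

    def find_orphans(conn, child, fk, parent, pk):
        sql = (f"SELECT c.* FROM {child} c "
               f"LEFT JOIN {parent} p ON c.{fk} = p.{pk} "
               f"WHERE p.{pk} IS NULL")
        return conn.execute(sql).fetchall()

    print(find_orphans(conn, "order_items", "order_id", "orders", "order_id"))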

Thus the Extract and Transform steps of ETL become much more complicated than just moving the data. Because data profiling and data quality analysis can be so time-consuming and resource-intensive, and business managers and C-level executives don't understand the need for them, project budgets don't provide time, money, or resources for these tasks. These tasks then become bundled within other tasks that then go way over budget in all aspects: time, money, and resources. The push then becomes to "do the best with what you've got and make the best educated guesses (I prefer SWAG - Sophisticated Wild Ass Guess) possible; we'll fix it later."

As Data Architects and Data Managers, assuming enterprise standards have been created, we need to start enforcing data standards and not allow them to be ignored. This requires a great deal of teaching and convincing of upper management, and perseverance by Data Architects and Data Managers. Unfortunately, careers are sometimes killed by doing this, but then that was probably inevitable anyway. If it is not done right from the beginning, the unplanned fixes will become a nightmare in the future, when even more heads will roll.

If you don't have time (money, resources) to do it right, when will you have time (money, resources) to do it over?

Wednesday, February 3, 2010

What should you ask a potential client about?

I was recently asked to provide some "initial scoping" questions to be asked when meeting with potential clients. Perhaps the word clients should be expanded to include both external customers (for consultants) and internal business units (for full-time employees of a company). Finding out what the client wants requires a consultative approach. You must bring your expertise to the table and be willing to share and advise. The client needs to know whether their expectations are realistic or not; as do you, of course.

First of all, I want to know whether a way of doing what they are asking to be done exists right now; this could be a manual process or some other computer application. If the client wants to automate (computerize) a task that they do not do manually now, my advice would be to do it manually first so you have something to check the automated results against. If it cannot be done manually, it can't be automated / computerized. If it is not being done in some way now, how will they know if the new system is doing it right?

Sometimes clients say that they have another application that needs to be upgraded / revamped / replaced.  The following questions apply to both existing manual and computerized processes:
  • Why do you need to upgrade / revamp / replace the existing method?
    • Too slow
    • Incorrect results
    • Missing needs
    • Cannot grow
    • Need to provide results to a larger audience
  • What have you already tried to improve the existing method?
    • What has worked?
    • What has not worked?
  • Are the business rules documented for the current method?
    • If not, reverse engineering of the current method can be done to determine what the current algorithms are, but this will not prove that the original business rules are being followed. I have built systems that proved the existing system was doing it wrong, but it takes much time, effort, and documentation to prove that to the business.
    • If yes, will the same rules apply to the new system?
  • What do you like about the existing method?
    • Interface
    • Reports
    • Processing
  • What do you dislike about the existing method?
    • How can these be improved?
  • What is missing from the existing method that you want to add?
    • Reports
    • Dashboards
    • Functionality
  • What is included with the existing method (software) that you do not use and that can be eliminated from the new solution?
    • Get rid of "bells and whistles."
  • What is the timeframe?
    • Can a phased approach be used?
  • What is the budget?
  • What are the constraints?
    • Must use existing infrastructure
    • No internal resources available
    • Must use existing in-house developers (and languages), databases, operating systems, etc.
  • What phase of this project are you in right now?
    • If not in the initial Plan and Analyze phase, it may be too late to recommend any changes.
  • Do you have dedicated resources who will be working with me?
  • How will results be verified / validated / proven?
  • Who will take ownership of the completed system?
  • Who will support the completed system?

Remember, these are initial questions to help you understand and manage the client's expectations. There will be many more questions to ask as the cycle progresses.

Tuesday, February 2, 2010

Data Architect Job Description

I found this definition of a Data Architect on CareerBuilder.
Defines, designs, and builds relational and/or multi-dimensional databases for warehousing of data. Confers with client, technical staff, and team members to plan, design, develop, implement, and enhance applications, scripts, procedures, and metadata for relational databases. Reviews current data structures and recommends optimizations and reconfigurations as warranted. Generates data acquisition and archive recovery strategies. Works with Database Administrators to migrate complex large scale applications into test and production environments. Integrates new data with existing warehouse structure and continuously improves system functionality. Evaluates and recommends new tools and methodologies. Consults with vendors to integrate selected packages. Deletes old data. Develops and maintains all standards in accordance with data structures and models. May work with organization departments to design and develop data marts. (Source: eDOT Job Description)
Do you agree with this definition? Does it include too much or not enough? I agree with most of it, but I question or disagree with the following points:
  • archive recovery strategies
    This is more of a DBA or Disaster Recovery role.
  • continuously improves system functionality
    Not quite sure what is meant by this; it seems to be a DBA or development team role.
  • Evaluates and recommends new tools and methodologies
    What new tools and methodologies? I might do this for data modeling tools like Erwin or PowerDesigner, if that is what they are looking for, and I can do this for data architecture-related practices such as data governance, data quality, naming conventions, modeling standards, and so forth.
  • Consults with vendors to integrate selected packages.
    This is a Solution, System, or Application Architect role.
  • Deletes old data.
    This is done by the user, application, or DBA.