Friday, February 12, 2010

The Need For Data Profiling

A new discussion was started on LinkedIn, White Paper: The Data-Driven Enterprise. Just in case you can't open the LinkedIn discussion itself, the discussion references another URL for the white paper's source.

The referenced article begins with
Data is one of your enterprise’s most valuable assets. So why are so many organisations still accepting business decisions based on inconsistent, unreliable data as being “good enough”?

The data being discussed is in relation to data warehousing the operational system's data.

I have found in my career that most IT managers do not understand the premise of "Garbage In, Garbage Out" well enough to want to fix the problem at the source. Most operational database systems have evolved over the years into, quite honestly, coding nightmares. As problems were discovered with data and new data needed to be added, the quickest way to deal with these was to change the "code". The code could be in the form of Extract / Transform/ Load (ETL) batch processing scripts, stored procedures or functions, or application program interfaces (API).

APIs control the way database applications and users interface with the database. When problems arise with the data, the API can be modified to work around the bad data by "fixing" it on the fly. When new data needs to be added, a column in an existing table that is not being used can be "re-purposed" by putting the new data in that location and have the API control access to that column so no one really needs to know what table and column the data is really stored in.

I have seen databases that contained tables with columns such as "A", "B", etc. These columns were put in the table because it seemed likely that new data items would be needed in the future and this would accommodate them without having to add new columns. Some databases have also been created with extra tables to accommodate new entities such as value lists without having to create a new table. This practice is especially practiced in proprietary databases that are sold to customers but must be modified to meet the customer's needs. By using this generic database schema, the vendor can update the database and associated APIs without having to worry about each of its customers' modifications.

When this operational data needs to be stored in a data warehouse, business users, managers, and C-level executives don't understand the importance of data profiling and data quality analysis. They feel that the data, and the information that data provides, in the source operational system just needs to be moved to the data warehouse; a very simple operation. What they don't understand is the years of data manipulation that has been done within the database applications to provide that "data quality."

Also, the original documentation, if any, has not been kept updated with all the programming and/or database changes made. Original requirements documentation may be missing or out-of-date; new requirements may be in disconnected pieces of documentation that can only be found by someone who worked on those specific projects years ago.

Many times, the only data models of these legacy source systems are the ones that will be reverse-engineered from the existing database; metadata will be missing. If the original requirements and design specifications can be found, metadata may be reverse-engineered from them, but, in most cases, inferences and "best guesses" will have to be made.
Significant data profiling will be required to ensure that only good data is brought into the data warehouse. Just because a table and column is populated in the operational source that does not mean it is even used in the current version; they may be just "left overs" that can't be removed for legacy reasons and to minimize upgrade time.

Codes and values may not have lookup tables in the operational system; they may all be hard coded in the API. Foreign keys (FK) may not be enforced (sometimes called "logical" FK); I have even seen so-called child tables in operational database systems that get populated before the parent table is populated! These orphaned child records are not really orphans, they are placeholders for transactional history until the parent transaction (order, for example) has met some requirement such as at least one order item record has been created or the order has been tendered.

Thus the Extract and Transform steps of ETL become much more complicated than just moving the data. Because data profiling and data quality analysis can be so time-consuming and resource intensive, and business managers and C-level executives don't understand the need for it, project budgets don't provide time, money, or resources for these tasks. These tasks then become bundled within other tasks that then go way over budget in all aspects: time, money, and resources. The push then becomes to "do the best with what you've got and make the best educated guesses (I prefer SWAG - Sophisticated Wild Ass Guess) possible; we'll fix it later."

As Data Architects and Data Managers, assuming enterprise standards have been created, we need to start enforcing data standards and not allow them to be ignored. This requires lots of teaching and convincing of upper management people and perseverance by Data Architects and Data Managers. Unfortunately, careers are sometimes killed by doing this, but then that was probably inevitable anyway. If not done right from the beginning, the unplanned fixes will become a nightmare in the future when even more heads will roll.

If you don't have time (money, resources) to do it right, when will you have time (money, resources) to do it over?


  1. Hi Ted,
    you are right about data profiling and about the need to enforce data standards, or more generally to have a Data/Knowledge Management policy/strategy in place. I don't think the costs are so high for performing the data profiling itself, but in correcting the bad data, and you can see that in ERP implementations, many such projects failing because of the bad data quality even if there are attempts to clean the data.
    Maybe that there are people who are applying ostrich’s tactics – if they hold the head under the sand, maybe the issues won’t hit them. I was highlighting in my post on the same topic that it's difficult to built a case for Data Quality/Profiling without evaluating the impact of bad data. People see that only when something goes wrong, when there are penalties of missed orders, when wrong components are delivered and so on.
    Data Quality in Data Warehouses is quite a sensible topic, first of all because if people don't trust the data in the warehouse then it's less likely that the reports based on it will be used, and secondly because usually it reflects the data quality from the legacy systems. On the other side, from my experience, most of the issues don't relate to data structure and synchronization, which are quite important, but to how much the data reflect the reality.
    As an outsider, I would say that the existence of a Data Management policy/strategy in place is a reflection of the CMMi level of an organization. On the other side, given the crisis, I think most of the organization are trying to survive rather then invest in their knowledge base.
    Best regards,

  2. Adrian,

    I appreciate your comments.

    The cost of doing data profiling (finding out about the data structure, metadata, et al) and data quality analysis (FK matches, valid values, et al) does take a lot of time and time is money. This is because it takes a lot of interviewing, research, reading, and just plain digging to understand the source system's data store. At least 30% of an ETL project's budgeted time should be dedicated to data profiling.

    As far as cleaning up bad quality data, there are two approaches to that. One, the preferred, is to get the data cleaned up in the source system. But this is seldom done because, as I said in my original post, the bad data has already been "fixed" with many work around code modifications. Therefore, the users don't see the bad data on reports or GUI anymore, so, as far as they are concerned, there is no bad data.

    The second approach, highly undesirable but routinely done, is to replicate the legacy operational system's code fixes in the Transform stage of ETL so the correct data is actually stored in the DW. This adds to the total costs of the ETL project because it takes a long time to find and understand all the data manipulations that have been done in the legacy source system. As I said, the documentation for these "data fixes" in the code may be lost if it ever existed.

    You are correct when you say people will not trust the data in the DW if it doesn't provide the same information they are used to seeing from the legacy source system. That is why the source data needs to be highly understood and cleansed somewhere before being stored in the DW. I have, however, found that even after data has been cleansed on its way to the DW, the results from the DW source in reports, dashboards, and other metric views may differ from what the business user sees from the operational system. For reports (either printed or viewed), I have found this is either because the logic in the operational system's code to "fix" the data was not understood well enough to replicate that logic in the Transform logic, or the code in the source system was wrong and the Transform code is correct.

    If people get used to looking at the wrong metrics for a long period of time, they accept those metrics as correct. When they are different when generated from the DW, it takes a lot of diplomacy and discussion to get the user to accept the DW results.

    But then, who is going to fix the legacy source system?

  3. I agree with your comments Ted.

    It is a sobering experience attempting to integrate new data streams into a data warehouse. You are correct, it is many times not a simple data move. There are data quality and metadata issues. There seems to be few circumstances where operational systems understand that they are a source for the rest of the Enterprise and understand the responsibility that goes with that. In fairness, teams are busy with their workload. The DW consumes data from many operational systems and must integrate that data into it's own environment where data must be join-able and have meaning to all users.

    So data profiling becomes a necessary and unplanned activity in an attempt to understand and use provided data.

    Working on DW's has raised my own awareness that that data is an Enterprise asset and needs to be envisioned and functional at the Enterprise level, not just "I understand my data and it works in my system". Even a data warehouse needs to consider how it's data is exposed and consumed.

    I enjoy your blog,
    Richard Buelow