A new discussion was started on LinkedIn, "White Paper: The Data-Driven Enterprise." In case you can't open the LinkedIn discussion itself, it references another URL for the white paper's source.
The referenced article begins with:
Data is one of your enterprise’s most valuable assets. So why are so many organisations still accepting business decisions based on inconsistent, unreliable data as being “good enough”?
The data being discussed here is the operational systems' data that is to be loaded into a data warehouse.
I have found in my career that most IT managers do not understand the premise of "Garbage In, Garbage Out" well enough to want to fix the problem at the source. Most operational database systems have evolved over the years into, quite honestly, coding nightmares. As problems were discovered with data and new data needed to be added, the quickest way to deal with these was to change the "code". The code could be in the form of Extract/Transform/Load (ETL) batch processing scripts, stored procedures or functions, or application programming interfaces (APIs).
APIs control the way database applications and users interface with the database. When problems arise with the data, the API can be modified to work around the bad data by "fixing" it on the fly. When new data needs to be added, a column in an existing table that is not being used can be "re-purposed" by putting the new data in that location and having the API control access to that column, so no one needs to know which table and column the data is actually stored in.
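To make the pattern concrete, here is a minimal sketch of that kind of API-level workaround. The table and column names (customer, SPARE1, loyalty_tier) are hypothetical; the point is that the "fix" and the re-purposing live entirely in code, not in the database.

```python
import sqlite3

# Hypothetical legacy table: SPARE1 was added years ago "just in case"
# and has since been re-purposed to hold the customer's loyalty tier.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT,
        phone       TEXT,
        SPARE1      TEXT   -- actually the loyalty tier, but nothing documents that
    )
""")
conn.execute("INSERT INTO customer VALUES (1, 'Acme Ltd', '555 0100', 'GOLD')")
conn.execute("INSERT INTO customer VALUES (2, 'Beta Inc', NULL, NULL)")

def get_customer(customer_id):
    """The API call every application is supposed to use instead of raw SQL."""
    row = conn.execute(
        "SELECT customer_id, name, phone, SPARE1 FROM customer WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    if row is None:
        return None
    cid, name, phone, spare1 = row
    return {
        "customer_id": cid,
        "name": name,
        # "Fix" bad data on the fly instead of correcting it at the source.
        "phone": phone if phone else "UNKNOWN",
        # Expose the re-purposed column under its real meaning; only the API
        # knows where loyalty_tier is physically stored.
        "loyalty_tier": spare1 or "STANDARD",
    }

print(get_customer(2))
```

Every caller of get_customer() sees clean, sensible data; only the API author knows that "loyalty_tier" lives in SPARE1 and that missing phone numbers are being papered over.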
I have seen databases that contained tables with columns such as "A", "B", etc. These columns were put in the table because it seemed likely that new data items would be needed in the future, and this would accommodate them without having to add new columns. Some databases have also been created with extra generic tables to accommodate new entities, such as value lists, without having to create purpose-built tables later. This practice is especially common in proprietary databases that are sold to customers but must be modified to meet each customer's needs. By using this generic database schema, the vendor can update the database and associated APIs without having to worry about each of its customers' modifications.
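The "generic schema" approach looks something like the sketch below. The names (GENERIC_CODES, CODE_TYPE, ATTR1) are invented for illustration; the idea is that one catch-all table holds every value list a customer might need, so the vendor never has to ship a schema change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One catch-all table instead of a dedicated lookup table per value list.
# CODE_TYPE says which value list a row belongs to; ATTR1..ATTR3 are spare
# columns whose meaning depends on CODE_TYPE (and is known only to the API).
conn.execute("""
    CREATE TABLE GENERIC_CODES (
        CODE_TYPE  TEXT,   -- e.g. 'ORDER_STATUS', 'SHIP_METHOD', 'REGION'
        CODE_VALUE TEXT,
        CODE_DESC  TEXT,
        ATTR1      TEXT,
        ATTR2      TEXT,
        ATTR3      TEXT
    )
""")
conn.executemany(
    "INSERT INTO GENERIC_CODES VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("ORDER_STATUS", "OP", "Open",   None, None, None),
        ("ORDER_STATUS", "CL", "Closed", None, None, None),
        ("SHIP_METHOD",  "GR", "Ground", "5",  None, None),  # ATTR1 = days in transit
    ],
)

def lookup(code_type, code_value):
    """The vendor's API hides the generic table behind a normal-looking call."""
    return conn.execute(
        "SELECT CODE_DESC FROM GENERIC_CODES WHERE CODE_TYPE = ? AND CODE_VALUE = ?",
        (code_type, code_value),
    ).fetchone()

print(lookup("SHIP_METHOD", "GR"))   # ('Ground',)
```

Convenient for the vendor, but the meaning of ATTR1 for a given CODE_TYPE exists nowhere except in the API code, which is exactly the metadata gap the warehouse team inherits.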
When this operational data needs to be stored in a data warehouse, business users, managers, and C-level executives don't understand the importance of data profiling and data quality analysis. They feel that the data in the source operational system, and the information that data provides, just needs to be moved to the data warehouse; a very simple operation. What they don't understand is the years of data manipulation that have been done within the database applications to provide that "data quality."
Also, the original documentation, if any, has not been kept updated with all the programming and/or database changes made. Original requirements documentation may be missing or out-of-date; new requirements may be in disconnected pieces of documentation that can only be found by someone who worked on those specific projects years ago.
Many times, the only data models of these legacy source systems are the ones that will be reverse-engineered from the existing database; metadata will be missing. If the original requirements and design specifications can be found, metadata may be reverse-engineered from them, but, in most cases, inferences and "best guesses" will have to be made.
Significant data profiling will be required to ensure that only good data is brought into the data warehouse. Just because a table or column is populated in the operational source does not mean it is even used in the current version; it may be just a "leftover" that can't be removed, for legacy reasons and to minimize upgrade time.
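A first pass at this kind of profiling can be as simple as counting rows, non-null values, and distinct values per column; columns that are entirely NULL or hold a single constant are candidates for "leftovers." This is only a sketch against a throwaway table, not a substitute for a real profiling tool.

```python
import sqlite3

def profile_table(conn, table):
    """Rough column-level profile: row count, non-null count, distinct count."""
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    report = []
    for col in columns:
        non_null, distinct = conn.execute(
            f"SELECT COUNT({col}), COUNT(DISTINCT {col}) FROM {table}"
        ).fetchone()
        report.append((col, total, non_null, distinct))
    return report

# Example: a hypothetical table with one spare column ("B") that is never populated.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (item_id INTEGER, descr TEXT, B TEXT)")
conn.executemany("INSERT INTO item VALUES (?, ?, ?)",
                 [(1, "Widget", None), (2, "Gadget", None)])
for col, total, non_null, distinct in profile_table(conn, "item"):
    print(f"{col}: {non_null}/{total} populated, {distinct} distinct value(s)")
```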
Codes and values may not have lookup tables in the operational system; they may all be hard-coded in the API. Foreign keys (FKs) may not be enforced (sometimes called "logical" FKs); I have even seen so-called child tables in operational database systems that get populated before the parent table is populated! These orphaned child records are not really orphans; they are placeholders for transactional history until the parent transaction (an order, for example) has met some requirement, such as at least one order item record having been created or the order having been tendered.
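Because the constraints are only "logical," orphan checks have to be run as profiling queries rather than relied on from the database. A minimal sketch, assuming hypothetical orders and order_item tables where child rows can legitimately exist before their parent order is written:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No FOREIGN KEY clause: the parent/child relationship is purely "logical".
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("CREATE TABLE order_item (item_id INTEGER PRIMARY KEY, order_id INTEGER, sku TEXT)")

conn.execute("INSERT INTO orders VALUES (100, 'TENDERED')")
conn.execute("INSERT INTO order_item VALUES (1, 100, 'SKU-A')")
# A "placeholder" child row written before its parent order exists.
conn.execute("INSERT INTO order_item VALUES (2, 200, 'SKU-B')")

# Profiling query: child rows whose parent key has no match yet.
orphans = conn.execute("""
    SELECT oi.item_id, oi.order_id
    FROM order_item oi
    LEFT JOIN orders o ON o.order_id = oi.order_id
    WHERE o.order_id IS NULL
""").fetchall()
print(orphans)   # [(2, 200)]
```

Whether such rows are genuine orphans or legitimate placeholders is a business rule that only the source system's owners can confirm, which is exactly why the Extract and Transform design needs their input.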
Thus the Extract and Transform steps of ETL become much more complicated than just moving the data. Because data profiling and data quality analysis can be so time-consuming and resource-intensive, and business managers and C-level executives don't understand the need for them, project budgets don't provide time, money, or resources for these tasks. These tasks become bundled within other tasks, which then go way over budget in all aspects: time, money, and resources. The push then becomes to "do the best with what you've got and make the best educated guesses (I prefer SWAG - Sophisticated Wild Ass Guess) possible; we'll fix it later."
As Data Architects and Data Managers, assuming enterprise standards have been created, we need to start enforcing data standards and not allow them to be ignored. This requires a great deal of teaching and convincing of upper management, and perseverance by Data Architects and Data Managers. Unfortunately, careers are sometimes killed by doing this, but then that was probably inevitable anyway. If it is not done right from the beginning, the unplanned fixes will become a nightmare in the future, when even more heads will roll.
If you don't have time (money, resources) to do it right, when will you have time (money, resources) to do it over?