Wednesday, February 24, 2010

Data Integration

Data integration involves combining data residing in different sources and providing users with a unified view of these data.” [Maurizio Lenzerini (2002). "Data Integration: A Theoretical Perspective". PODS 2002. pp. 233-246.] {emphasis added}

Data integration can be accomplished at an enterprise level, departmental level, or lower level.
The most common method of “data integration” is the Data Warehouse (DW). A DW can exist at an Enterprise level (EDW) or at lower hierarchical levels. At these lower levels, DW is really a misnomer as these data stores provide data for specific purposes whereas a DW, especially at the enterprise level, should provide a non-application-specific “single source of truth.”

Data stores at less than enterprise level are more appropriately called Data Marts (DM). DM are designed to answer specific business questions or serve a specific departmental need. These DM may be populated from the operational databases directly or, more common today, from the EDW, enforcing the “single source of truth” concept.

I have done “data integration” my entire data career but it has only recently become a named activity with the book mentioned above being the oldest publication I have found using the term (2002)

The process of getting data items from the source systems into the DW is complicated and requires significant research, analysis, and understanding of the source system's data stores. This has recently become known as “Data Profiling” but has been done “as a matter of fact” since data stores were first created. The actual movement of data is done by ETL procedures.

ETL means “Extract, Transform, and Load” although the Transformation step can be done as part of Extract step, part of the Load step, or separately from both.

The term “data integration” has recently been expanded to include sharing data directly from several operational source systems into one single presentation layer to provide real-time information. Customer information such as addresses and phone numbers may be kept in one data store, while customer sales may be kept in a different data store. Customer credit card information may be in a third data store. Gathering pieces of data from these three data stores into a single presentation layer such as a real-time Web-based GUI may provide customer service personnel the ability to provide accurate and timely help in resolving disputes while enabling instant updates and corrections to data in all three data stores.

Getting real-time data from operational data stores may be detrimental to the source system's performance. Therefore, “near real time” has become the norm. Data can be exported from the operational data stores into an Operational Data Store (ODS) at very short intervals. However, coordinating the usability of the data needs to be planned so that data from one source does not become out-of-sync (not synchronized) with data from another source. If three data stores supply data to the ODS, will all three need to be updated before the updated data is usable? Many synchronization problems must be addressed so that data from the source systems are usable and updates in the ODS are sent back to the proper source system in a timely manner. However, most ODS systems do not allow direct updates to its data, requiring changes to be made directly to the source system and then synchronized at the ODS the next time it is updated.

No comments:

Post a Comment