Monday, December 13, 2010

Programming using Scripting Languages

It might seem like I'm digressing from my "data-centric" focus, but, really, I don't think so. I have had to write scripts to automate many "data-centric" tasks over the years; most of these scripts contained embedded SQL for data manipulation (DML). Many of them were automated using UNIX cron, Windows AT, or some other scheduler. For example:

  • Transfer data files from one location to another
  • Load data files directly into database staging tables
  • Extract data to create data files to be sent to clients or vendors using FTP
  • Create reports
  • Migrate data from mainframes to regional client/server databases
Granted, these tasks were done before modern CASE tools took over much of this work, but you still might need to do something like this on an impromptu, urgent basis.  Or maybe the client doesn't have one of the CASE tools and doesn't want to spend the money to get one - just do it!

On UNIX systems, I suggest learning Korn Shell (ksh).  You will be surprised at how robust a programming language this is.  Now I know some people will say it is not a "programming" language because it isn't compiled, but the skills needed to write good scripts are the same as those needed to write good compiled programs.  Compiling a program is done to make it run faster and there are programs that will compile ksh scripts, such as shc.

Another favorite of mine is Perl.  It can be written for both UNIX and Windows systems (as well as many others) with little change needed between systems so it is quite portable.  Perl scripts can be compiled, too, using commercial programs such as Perl2EXE.

Wednesday, December 8, 2010

Kimball or Inmon?

The debate goes on as to whether Kimball's or Inmon's philosophy on data warehouses is best. But here is a fresh perspective, by Bill Inmon himself.

A TALE OF TWO ARCHITECTURES

Tuesday, November 30, 2010

LPAD and RPAD for Teradata

Teradata does not come with an LPAD() or RPAD() function. If you can create a UDF for them, that would be best, but if you can't, here are some workarounds.

RPAD a string with spaces

-- replace 10 in char(10) below with the desired column length
-- single quotes are concatenated around the result to show that spaces were actually added on the right

select
  'abc' as "SampleString",
  '''' || cast(SampleString as char(10)) || '''' AS "rpad_spaces";

RPAD a string with any character

-- make a string of pad characters (in this example '0000') as long as the desired total length
-- the leading space in the column is preserved
select 
  ' 1' AS "SampleString",
  SampleString || Substring('0000' From Chars(SampleString)+1) AS "rpad_zero";

-- same as above, but using ANSI SUBSTR
select 
  ' 1' AS "SampleString",
  SampleString || SUBSTR('0000', Chars(SampleString)+1) AS "rpad_zero";

LPAD a string with any character

-- replace '0' with the desired pad character
-- make a string of pad characters (in this example '0000') as long as the desired total length
-- the leading space in the column is preserved
select 
  ' 1' AS "SampleString",
  Substring('0000' From Chars(SampleString)+1) || SampleString AS "lpad_zero";

-- another approach using ANSI SUBSTR: prepend the full pad string, then keep the tail
SELECT 
  '19' as "SampleText",
  SUBSTR('00000'||SampleText,CHARACTERS(SampleText)+1) AS "LPAD(SampleText,5,'0')";
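If you can create UDFs (as suggested at the top of this post), the same prepend-and-keep-the-tail trick can be wrapped in an SQL UDF. A minimal sketch, assuming Teradata 13's SQL UDF syntax; the function name my_lpad10 and the fixed length of 10 are made up:

-- hypothetical SQL UDF: left-pad a string to 10 characters with zeros
CREATE FUNCTION my_lpad10 (src VARCHAR(10))
RETURNS CHAR(10)
LANGUAGE SQL
CONTAINS SQL
DETERMINISTIC
SQL SECURITY DEFINER
COLLATION INVOKER
INLINE TYPE 1
RETURN SUBSTR('0000000000' || TRIM(src), CHARACTERS(TRIM(src)) + 1);

-- usage (should return '0000000019')
SELECT my_lpad10('19') AS "padded";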

-- Showing how trailing spaces from a CHAR column can be dropped 
select
  '0000123  ' as "SampleText",
  CAST(CAST(SampleText AS integer FORMAT '-9(10)') AS CHAR(10)) "LPAD(SampleText,10,'0')";


Monday, November 29, 2010

STAR schema and transactions

I recently replied to a question in ITtoolbox's Data Warehouse forum regarding star schema data marts.
The original post had several questions that others answered, but one question had not been addressed, so that was the basis for my reply.

"Is it sound industry practice to use a star schema to hold individual records rather than totals?"

In order to create a fact table that aggregates transaction data, there must be a dimension table that contains the transaction detail. The granularity of that dimension table needs to be fine enough to support the drill-down or slice-and-dice analysis required. For some industries, such as cell phone carriers, the granularity might need to be down to the millisecond level. For others, daily granularity might be OK.
These transactional dimension tables also need to be updated, so you must determine what type of slowly changing dimension (SCD) table you need. (The "slowly" part of the name was a bad choice of words when transaction changes can occur by the second.) There are three types: SCD1, SCD2, and SCD3. SCD2 is the most common, SCD1 is next, and SCD3 is rarely used.
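As a minimal sketch of what an SCD2 dimension looks like (all names below are hypothetical), each change closes out the current row and inserts a new version, so history is preserved:

-- hypothetical SCD2 customer dimension
CREATE TABLE customer_dim (
  customer_sk    INTEGER NOT NULL,      -- surrogate key
  customer_id    INTEGER NOT NULL,      -- natural (business) key
  customer_name  VARCHAR(100),
  eff_start_dt   DATE NOT NULL,
  eff_end_dt     DATE NOT NULL          -- '9999-12-31' marks the current row
) UNIQUE PRIMARY INDEX (customer_sk);

-- a change arrives for customer 42: close the current row, then insert the new version
UPDATE customer_dim
SET eff_end_dt = DATE - 1
WHERE customer_id = 42 AND eff_end_dt = DATE '9999-12-31';

INSERT INTO customer_dim
VALUES (10043, 42, 'New Name, Inc.', DATE, DATE '9999-12-31');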
A time dimension table is critical for any data warehouse / data mart. The granularity of the time dimension needs to be at least as fine as that of the transactional dimension table.
For reporting purposes, I usually suggest using a fact table or view as the source for the reporting engine rather than having the reporting engine do all the aggregating. This way, the algorithms used for the aggregations will be consistent. If these are left to the reporting engine(s) and true ad-hoc queries, the results might differ, and there will be lots of complaints and troubleshooting to determine why they are not consistent.
Whether a table or a view is used depends on how frequently the report needs to be updated and the length of time needed to return results.
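As a rough sketch of that pre-aggregated view approach (the fact and dimension names below are hypothetical), every reporting tool that selects from the view gets the same aggregation logic:

-- hypothetical pre-aggregated view used as the single source for reporting
REPLACE VIEW rpt.daily_store_sales AS
SELECT
  d.calendar_date,
  s.store_id,
  SUM(f.sale_amt) AS total_sales,
  COUNT(*)        AS txn_count
FROM sales_fact f
INNER JOIN date_dim  d ON f.date_key  = d.date_key
INNER JOIN store_dim s ON f.store_key = s.store_key
GROUP BY d.calendar_date, s.store_id;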

Wednesday, November 17, 2010

How to work with bad date formats in Teradata

A question was posted on ITtoolbox's Teradata forum concerning loading data from a file that contained character strings for dates with single-digit day and month parts, like 4/1/2010. Teradata does not support dates in this M/D/YYYY format; it only supports two-digit month and day formats like MM/DD/YYYY. Therefore my reply was a sample SQL statement that would work for the date conversion, but the actual load from the data file would need to do this in the "wrapper" code.

The SQL in Teradata format is:

SELECT
    '4/1/2010' AS "BAD_DATE",
    CASE
        WHEN INDEX(BAD_DATE,'/') = 2
          THEN '0' || SUBSTRING(BAD_DATE FROM 1)
          ELSE BAD_DATE
    END AS "MONTH_OK",
    CASE
        WHEN INDEX(SUBSTRING(MONTH_OK FROM 4),'/') = 2
          THEN SUBSTRING(MONTH_OK FROM 1 FOR 3) || '0' || 
            SUBSTRING(MONTH_OK FROM 4)
          ELSE MONTH_OK
    END AS "GOOD_DATE"
-- don't need a FROM clause to do this
; 
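Once both parts are zero-padded, the string matches a format Teradata understands and can be cast to a true DATE; a quick sketch:

-- the cleaned-up string can now be converted to a DATE
SELECT CAST('04/01/2010' AS DATE FORMAT 'MM/DD/YYYY') AS "GOOD_DATE_AS_DATE";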

Friday, October 29, 2010

Database Management Systems Best Uses

As a data architect, I am database agnostic until reaching the physical implementation. During the conceptual and logical modeling phases, I will have gathered metrics that will help predict the scalability needed for the physical implementation.
In the perfect scenario, these metrics will be used to determine the best database management system (DBMS) to use. Many times, however, the target DBMS has already been chosen; in this case, these metrics will provide the basis for scalability and risk assessment. These metrics include:
  • Number of entities (tables)
  • Number of attributes (columns)
  • Volume metrics
    • Initial
      • Table size (row count and bytes)
      • Connection Numbers
        • Total
        • Average concurrent
        • Maximum concurrent
        • Application / system users
        • Power users
    • Growth rates for above: 1 year, 3 years
The data architect may or may not be involved in the physical implementation process, depending on the company's organizational hierarchy. Assuming that I will be involved, I don't need to be an expert in every available DBMS, but I should have some knowledge and understanding of them based on past experience. I won't be able to keep up with all the DBMS out there, especially the niche systems, and there are open source systems in use as well. A data architect needs to work closely with the DBMS subject matter experts (SME). The most important SMEs to me are those who have actual experience using and supporting a DBMS and the type of data store being used on that DBMS.

So that we all might gain a better understanding of what is available and what each DBMS is best used for, what is your experience? Consider the following points and share what DBMS was used and how it met, or failed to meet, expectations.
  • Purpose of database (EDW, DW, Data mart, ODS, reporting, OLAP, OLTP, etc.):
  • Database structure (Normalized, Dimensional, Snowflake, etc.):
  • Database size in TB:
  • Number of tables in database:
  • Number of columns in database:
  • Total number of users:
  • Total number of power users:
  • Maximum number of connections:
  • Average number of connections:
  • DBMS or RDBMS used:
  • Expectations met or not for:
    • Database growth in TB?
    • Query performance?
    • Performance tuning?
    • Connectivity?
    • Database maintenance (DBA)?
    • Other comments:

My experiences:

  • Purpose of database: Enterprise Data Warehouse (EDW), Single source of truth; was source for other data marts, reporting data stores, OLAP.
  • Database structure: Normalized, 3NF
  • Database size in TB: Greater than 100 TB
  • Number of tables in database: More than 1000 tables; more than 3000 views.
  • Number of columns in database: More than 50,000
  • Total number of users: More than 1000
  • Total number of power users: More than 100
  • Maximum number of connections: More than 500
  • Average number of connections: More than 100
  • DBMS or RDBMS used: Teradata
  • Expectations met or not for:
    • Database growth in TB? Yes. Additional disk space seemed fairly easy to add. Adding more servers (AMPs) increased disk space, provided more CPU processing power, and supported more user connections.
    • Query performance? Very good.
    • Performance tuning? Performance dependent on Primary Indexes (distribution), partitioning, secondary indexes, SQL tuning, statistics collected.
    • Connectivity? Batch programs, user direct, and user web-based.
    • Database maintenance (DBA)? It was faster to add or drop columns by creating a new table and copying the data from the old table to the new one than by using ALTER TABLE commands (a bare-bones sketch of that rebuild-and-copy pattern follows this list). Obviously, not all the steps involved are itemized here.
    • Other comments: By far, the best RDBMS I have experienced for VLDB. Almost everything was done with massively parallel processing. Data marts, reporting data stores, etc. were derived from the physical database via views to create logical data stores. Response times were excellent; in those few cases where multi-table joins could not be improved, a physical table was created and loaded via batch processing.
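The rebuild-and-copy pattern mentioned under database maintenance, as a bare-bones sketch; all of the object and column names are made up and the column list is abbreviated:

-- create the new version of the table with the changed column list
CREATE TABLE db1.customer_new (
  cust_id    INTEGER NOT NULL,
  cust_name  VARCHAR(100),
  new_col    DATE                       -- the column being added
) UNIQUE PRIMARY INDEX (cust_id);

-- copy the existing data, then swap the names
INSERT INTO db1.customer_new (cust_id, cust_name)
SELECT cust_id, cust_name FROM db1.customer;

RENAME TABLE db1.customer TO db1.customer_old;
RENAME TABLE db1.customer_new TO db1.customer;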

Tuesday, September 14, 2010

Should a data architect be more adept on the logical or the physical model? What are important traits of a data architect?

I recently replied to a LinkedIn group discussion asking the question in this post's title. 

For the complete discussion, use the following link.
http://www.linkedin.com/groupAnswers?viewQuestionAndAnswers=&discussionID=2551716&gid=108904&trk=EML_anet_qa_ttle-0Nt79xs2RVr6JBpnsJt7dBpSBA

My Answer:

To use a house analogy, would you want to hire an architect to design your house who had no idea what was needed to actually build it at the "physical" level or what the zoning and building requirements were for the location you chose?  You wouldn't?  Then why would you expect a data architect NOT to understand what is needed at the physical level?  That is not saying that a data architect has to have experience as a DBA or database engineer, but, since the physical data model and implementation is the ultimate goal, a data architect should understand what is needed for the physical design.

Depending on the organization a data architect is working at, the actual tasks given to the data architect may vary.  My list below shows what I think are the duties of a data architect in their order of importance.  I will not go into in-depth explanations for each step.  I know of organizations that limit a data architect's role to items 1 and 2, while others include 1 through 4, and some include all of them.
  1. Understand the data requirements.  Get agreement from client that data requirements are complete and understood.
  2. Create Conceptual Data Model that documents the data entities needed and how they relate to each other.  Confirm with the client that the entities are complete and fulfill the high level data requirements
  3. Create Logical Data Model that documents the attributes assigned to fully describe each entity.  Confirm with the client that sufficient detail is captured for each entity.  Also, plan for future growth and needs by showing how new data types can be added without requiring a new design (using super-type, sub-type, lookup tables, etc).
  4. Provide traceability matrix to show how each data requirement has been met in conceptual and logical data models.
  5. Determine and document volume and user constraints.  Determine initial load volumes and future growth rates.  Determine initial user load: total users, maximum (peak) users, how connected, administrative users, power users, limited access users, ....  Determine future user growth rates.  These are important for physical database selection.
  6. Determine suitable database engine needed.  This task should not be done until steps 1 through 5 have been done, but quite often, the client has already decided what the database engine will be.  Results from Step 5 will then be useful to forecast the database engine's future scalability.  Clients should at least know what the risks are based on their choice of database engine.
  7. Create Physical Data Model.  Working with the DBA and/or other experts for the chosen database engine, transform the Logical Data Model into a physical data model.  It is unusual that a Logical Data Model can be converted directly into a "best fit" physical data model.
Other duties of a data architect may include:
  • Enterprise standards: naming, modeling, architecture, data warehouse, more.
  • Data profiling.
  • Gap analysis.
  • Data dictionary.
  • Data governance.
  • ETL design
  • Source-to-target data mapping.
If you look at recent postings for Data Architect positions, you will see many of them requiring expertise with the already-chosen target database engine, reporting / BI tools, CASE modeling tools, etc.  I do not agree that these tools define a data architect's duties and requirements; what matters is knowing what needs to be done using them.  Tools can be learned and are an aid to doing what needs to be done.

I have elaborated on the duties of a data architect on my blog site.

Data Architect, Duties of a
http://itpro420.blogspot.com/2010/09/data-architect-duties-of.html
Data Architect Job Description
http://itpro420.blogspot.com/2010/02/data-architect-job-description.html
Data Architect versus Database Architect
http://itpro420.blogspot.com/2010/01/data-architect-versus-database.html

Thursday, September 2, 2010

Teradata 13 Express Install

For those of you wanting to install Teradata on your own PC, Teradata supplies a free "Express" version.  Go to this site to request to download it: http://www.teradata.com/downloadcenter/

You will have to register with Teradata and then request the Teradata Express by clicking the "Order Now" link in the "Teradata Express — Teradata Software for Evaluation and Development" section. You'll have to look carefully for it.

I installed version 13 on my home PC which has Windows Vista Home Premium for the OS.  I had several problems which I'll detail here in case you run into these, too.

I used the "silent" install method at first.  Be sure to read ALL the text with each installation step; don't assume the "usual" steps of just accepting defaults and going on.  The silent install method is the default and it will merrily try to install everything, but won't give you any error messages if anything fails.  So at the end, you will be told that the installation has completed, but it may not have been successful.

Thinking I had a successful installation, I followed the instructions to verify that from the user guide.  From the "Teradata Service Control" I clicked the "Start Teradata" menu item.  When it completed, I was told "Teradata is Running" in the lower left corner.  When I clicked the "File" menu to execute "Session info", it was grayed out.

HOSTS file edited
Reading the installation documentation, I found that the HOSTS file is modified by Teradata.  I looked at my HOSTS file (which is a custom one that blocks Internet advertisements and other unwanted sites) and found that the Teradata install program had botched the addition of the following line:
127.0.0.1  localtdcop1

Make sure that line is on a line all by itself.  Also make sure that "localtdcop1" has not been added to the end of every line in the HOSTS file that begins with 127.0.0.1!

ODBC Driver
I also discovered that the Teradata ODBC driver had not been installed.  I tried to reinstall it, this time not using the "silent" install method.  This time I got an error message: "Error 2738.Could not access VBScript runtime for custom action."

The resolution for this error is to make sure VBScript is registered.  Do this by following these steps:

  1. Run Command Prompt as administrator:
    1. Start Menu -> All Programs -> Accessories
    2. Right click on Command Prompt and select Run as administrator
  2. Run the following command:
    • If you are using the 32-bit version of Windows: type cd %windir%\system32 into the Command Prompt and hit Enter
    • If you are using the 64-bit version of Windows: type cd %windir%\syswow64 into the Command Prompt and hit Enter
  3. Type regsvr32 vbscript.dll into the Command Prompt and hit Enter.
Now you will be able to install the Teradata ODBC driver.

Firewall
But I was still having problems.  Even though I could now start Teradata and the "Session Info" menu option was available, clicking it opened a dialog box telling me to wait while sessions were retrieved.  After 12 minutes, I cancelled it.

I opened "Teradata Administrator" and tried to connect as user tdadmin (created by default).  I got the error message "ODBC HY000: WSA E HostUnreach: The Teradata server can't currently be reached over this network."

The fix for this was to turn off all firewalls.  I had both the Windows Firewall and Avast's firewall running.  Since my PC is behind a cable router that has its own firewall, turning both of these off poses no threat.

SQL Assistant
I found that the Teradata SQL Assistant did not get installed.  Using the "not silent" install mode, I discovered that this requires the .NET Framework version 2, SP 2.  I tried to install this again from the "Other Software" install option, choosing the "not silent" install mode, and discovered that this would not install on Vista.  I was unable to find a version of .NET Framework version 2 that would install on Vista.

There is a "work around" to this.  Teradata has a Java version of SQL Assistant that will work on Vista.  You need to go to the Teradata download site to get it.  In my case, for Vista on 32-bit platform, I went to http://downloads.teradata.com/download/tools/sql-assistant-java-edition-13-01-for-windows-32-bit

Find all the versions of the "SQL Assistant Java Edition 13.01" at http://downloads.teradata.com/download/tools

Data Architect, Duties of a

I have read many job postings seeking a Data Architect.  Most included a few data architect duties but also specified duties from many other data-related positions not normally considered part of the data architect role; it seems they were looking for one person to do the work of multiple specialists.  To me, that will get a company a "Jack/Jill of all trades, master of none."

While most Data Architects will have had experience in multiple data-centric roles doing multiple data-related tasks, few (none?) of us will have kept all of those skills up to date.  Understanding the roles of the other data specialists we work with is crucial, but our focus has moved to a much higher elevation.  We should be able to mentor and assist other data specialists in getting their assigned duties accomplished, but we do not do the actual work involved (except in a critical, unplanned resource shortage).

I have gleaned the following list of duties from dozens of different position postings for Data Architects from the Internet.  I have included only those I feel are germane to the position.
  • Data Architect to assist with data mappings, data modeling, data profiling, query design, data flow design, etc., between multiple databases across multiple platforms. The Data Architect will support and contribute to DI/ETL architecture and act as data SME (subject matter expert) with respect to data profiling assessments and participate in JAD sessions/interviews.
  • Perform data profiling and analysis to assess data quality/patterns; recommend data cleansing rules, data conforming/standardizing rules, and matching algorithm changes to [business] and data bus architecture (dimensions/facts).
  • Database query performance and tuning (P&T) expertise on SQL statements.
  • ETL Workflow and data mapping analysis, review, and tuning.
  • Strong design artifact experience in documentation of data flow and data mapping and data exchange specifications.
  • ETL development experience and technical adeptness at reviewing and initial prototyping tasks.
  • Has solid project lifecycle and database design experience, and complementary Business Analyst skills.
    (Although, I limit the BA skills to those needed to define business data items needed for database structure.  I work with the BA to get these.)
  • Responsible for the creation of the following work deliverables:

    • Estimate for resources needed, effort needed, and target milestone dates.
    • Data requirements documents, including traceability.
    • Weekly status reports from my team
    • Compiled weekly status report to project manager for team's work.
    • Conceptual data model, including metadata at the entity level
    • Logical data model, including metadata at the entity and attribute level
    • Data Profiling/Gap Analysis
    • Data Dictionary
    • Data Flow Diagram
    • Physical data model, including metadata at the table, column, and view level
    • Volume metrics (expected initial load and 3-year and 5-year forecasts)
    • DDL Scripts (generated from modeling tool such as PowerDesigner or Erwin)
    • ETL Data Mapping (ultimate source to ultimate target, not every hop along the way)

  • Primary Responsibilities:

    • Building conceptual, logical and physical database designs
    • Participating in data development and analysis, and data dictionary management
    • Ensuring data consistency and accuracy across all systems
    • Collaborating with business analysts, ETL developers and other key internal personnel on a wide scope of data architecture functions

Monday, July 12, 2010

Calculate beginning and ending day of month in Teradata

Teradata has many date and time functions, but sometimes they are not too intuitive. If you need to calculate the first and last date for the current month as part of a batch program, try this SQL:

SELECT 
  MIN(calendar_date) BOM,
  MAX(calendar_date) EOM 
FROM
  sys_calendar.calendar
WHERE (month_of_year , year_of_calendar) IN 
  (SELECT month_of_year, year_of_calendar
   FROM sys_calendar.calendar
   WHERE calendar_date = DATE)
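If you would rather not touch sys_calendar, the same two dates can be computed with plain date arithmetic; a small sketch:

-- first and last day of the current month, no calendar table needed
SELECT
  DATE - EXTRACT(DAY FROM DATE) + 1                     AS BOM,
  ADD_MONTHS(DATE - EXTRACT(DAY FROM DATE) + 1, 1) - 1  AS EOM;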

Friday, July 9, 2010

Teradata optimizer anomaly to beware of

I’ve been working with Teradata for more than four years, and I just “discovered” this anomaly when Teradata told me one of my SQL scripts would return 2 quadrillion rows and take 6,500 hours to complete.

If you include a column in the SELECT statement that does not come from a database.table/view in the FROM statement, Teradata merrily does a Cartesian join on the database.table/view.

Try this:

SELECT COUNT( <database>.<table or view>.<column>)

Teradata doesn’t seem to care that there is a missing FROM clause and reads every row from <database>.<table or view>.

If the <database>.<table or view>.<column> is part of a larger SQL that has joins to one or more tables and this particular table is not in the FROM clause, Teradata makes a Cartesian join to every other table in the SQL for every row in the "missing" <database>.<table or view>.
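Here is a hedged sketch of how the trap looks in a larger query; the table and column names are made up. Because tbl_b never appears in the FROM clause, the fully qualified reference pulls it in with no join condition, and you get the product join described above:

-- tbl_b is referenced in the SELECT list but missing from the FROM clause
SELECT a.order_id, mydb.tbl_b.cust_name
FROM   mydb.tbl_a a
WHERE  a.order_dt = DATE;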

My problem was the result of a copy and paste operation from a query that worked, but I didn't catch the change needed for the new query.

Sunday, March 28, 2010

Data Modeling


Data Modeling is a way of describing the tables (entities) and relationships in a database schema (see other posts with "schema" tags). (By the way, both "modeling" and "modelling" are correct spellings. The first is American English and the second is British English; the British claim they are right.) Because a picture is worth a thousand words, data models help database users more easily understand the schemas they are interested in.
There are many kinds of models, but the three most common are: Conceptual, Logical, and Physical.

Data Model Hierarchy

CONCEPTUAL

A conceptual data model is the highest-level data model. It is usually created early in a database project, as or shortly after data requirements are gathered, in what is known as the Plan and Analyze phase of the SDLC. It shows the major data entities the project needs and how they relate to each other. Entities include objects like Customers, Employees, Products, Sales, Orders, etc. Sometimes an entity is modeled as a child of a broader entity; for example, Customers could be a child entity of People. The names used to describe an entity are the ones the business would use, such as Customer, not the name the table would have in the physical schema, such as TB_CUST.
A good definition of Conceptual Data Modeling can be found in "Conceptual Modeling - ER 2002" by Stefano Spaccapietra and Salvatore March (Springer)
"The task of conceptual modelling is to build a high-quality representation of selected phenomena in some domain. The conceptual models that result facilitate the implementation, operation, and maintenance of information systems."
LOGICAL

The next level in data modeling is the Logical data model. It is created in the Design phase of the SDLC.

At this level, data items that define what makes up an entity are added. For example, a Customer entity will have data items such as Name, Address, City, State, ZIP, etc. These data items are called "attributes" of the entity.

The logical model will show how the entities relate to each other through common attributes. These relationships will be in business terms such as "An order has a customer" and "A customer may have one or more orders."

The relationships will be connected by lines that depict what is called the "cardinality" of the relationship, or how many elements there can be in the set. For example, every Order must have one and only one customer; however, a customer can have zero or many current orders. The Customer-to-Order relationship is an optional one-to-many cardinality, while the Order-to-Customer relationship is a mandatory one-to-one cardinality.

PHYSICAL

The physical data model is the graphic depiction of the actual database schema that contains the entities (tables) and attributes (columns) specified in the Logical data model. Sadly, many database schemas are built directly, without the planning and design models ever being created. The physical data model should be the result of collaboration between the Data Architect (or Modeler) and the Database Administrator (DBA). The Data Architect knows what data entities and attributes are needed by the application and how they relate to each other. The DBA knows the database management system (DBMS or RDBMS) in which the data will be stored. Together, these two can design and build a physical database schema that will provide the fastest I/O performance, be maintainable by the DBA, and be scalable for future growth. (Scalability is a term used to indicate how well the system will be able to handle increased data volumes and more users of the database.)
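As a tiny sketch (with made-up names) of what the Customer/Order relationship described earlier might look like once it reaches a physical Teradata schema; whether the foreign key is actually enforced or left as "soft" referential integrity is exactly the kind of decision the DBA helps make:

-- every order must reference exactly one existing customer
CREATE TABLE tb_cust (
  cust_id    INTEGER NOT NULL,
  cust_name  VARCHAR(100)
) UNIQUE PRIMARY INDEX (cust_id);

CREATE TABLE tb_order (
  order_id   INTEGER NOT NULL,
  cust_id    INTEGER NOT NULL,          -- mandatory one-to-one back to Customer
  order_dt   DATE,
  FOREIGN KEY (cust_id) REFERENCES tb_cust (cust_id)
) UNIQUE PRIMARY INDEX (order_id);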

Data Modeling Tools

In the beginning, these models were created by hand drawings using stencils for the limited number of shapes and lines used at the time. As models became more complicated, software to create data models was developed. At first, these software applications were no more than diagramming tools.

These simple diagramming tools have become sophisticated data modeling applications that include data dictionaries, metadata, scripting languages, requirements tracking, data definition language (DDL) generators and much more. These mega-modeling tools can be very expensive. The two most common ones that I know of and have used are Sybase's PowerDesigner and CA's Erwin.

These are great tools for large companies, but are there data modeling software tools available cheaply or freely for those who want to learn modeling skills or do data modeling for small to medium sized businesses?
Some database management systems (DBMS) come with their own data modeling tools. A couple of DBMSs can be downloaded free for personal education and training. Both Oracle and MS SQL Server allow this, and I believe they both have built-in data modeling tools, but it has been a long time since I used them myself, so this may no longer be true.

Some people use Visio for data modeling, but others claim it is not a "real" data modeling tool. It certainly has limitations, but you can draw database schema diagrams with it. I have found it useful for simple data models, but it has none of the extended features mentioned above.

Below are links to various web sites that list free and/or inexpensive data modeling tools:

Tuesday, March 16, 2010

Company Acquisition and Data Integration / Migration for EDW

Scenario

Company A purchases Company B. Company A needs to integrate Company B's operational systems into its Enterprise Data Warehouse and possibly into many of A's data marts; Company B's EDW will be retired. Company A and Company B have different application systems that do the same jobs: General Ledger, Order Management, Human Resources, etc.

Enterprise Data Warehouse

This blog will focus on the Enterprise Data Warehouse (EDW). There will be many more data stores that will need to be considered, but we'll start with this.

Planning

What needs to be planned for? What order do tasks need to be done in? What risk factors need to be considered?

  1. Understand Company A's source data architecture and metadata.
  2. Understand Company B's source data architecture and metadata.
  3. Company B may use a different database system (application software and database management system) than Company A does for the same business function. All businesses need to track their financial picture. Most will have some General Ledger (GL) application to do that. If A and B do not use the same GL application, we need to understand how B's system works and stores data and how that relates to A's system. We need to understand the metadata.

    To do this, we will need to work with Company B's subject matter experts (SME). We will need B's source system's entity relationship diagram(s) (ERD) and data dictionaries. We also need to know what data items are brought into the acquisition's EDW (assuming they have one). Just because they bring something into their EDW doesn't tell us why they bring it in. So we will need to understand their business requirements for bringing data into their EDW.

    So, the list of documents / information we have to get for the source system is:

    1. Database ERD (Physical Data Model (PDM))
    2. Data dictionary
    3. Source-to-target mappings from source database to their EDW
    4. Business rules for data brought to EDW. Why does the business need it?

  4. What data from A's source goes into A's EDW
    1. Why?
    2. How used?
    3. Transformations?
  5. What data from B's source goes into B's EDW
    1. Why?
    2. How used?
    3. Transformations?
  6. Gap analysis
    1. What data from A's source to EDW is missing from B's source to EDW?
    2. What data from B's source to EDW is missing from A's source to EDW?
    3. Why does B need "extra" data?
    4. Why doesn't B use A's "extra" data?
    5. Why doesn't A use B's "extra" data?
  7. Data profiling (a dictionary-based profiling sketch follows this list)
    1. Which tables and columns in B's source will be coming to A's EDW?
    2. Which tables and columns in A's source hold the same data as in B's source?
    3. Which columns in B's source have different data types, lengths, values, etc., than those in A's source?
  8. Data mart analysis (assume data marts are sourced from EDW)
    1. What data marts does B have that A doesn't have?
    2. What data marts does A have that B doesn't have?
    3. Will existing data marts in A support B's needs?
    4. What new data marts will A need to support B?
  9. Report analysis

Expand on these or add more.
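As one hedged example of the data profiling step (item 7 above), once both companies' source tables are landed in staging databases (the database names below are hypothetical), Teradata's DBC.Columns dictionary view can flag columns whose definitions differ between the two sources:

-- compare column definitions between A's and B's landed copies of a source
SELECT  a.TableName, a.ColumnName,
        a.ColumnType   AS a_type,  b.ColumnType   AS b_type,
        a.ColumnLength AS a_len,   b.ColumnLength AS b_len
FROM DBC.Columns a
INNER JOIN DBC.Columns b
        ON  a.TableName  = b.TableName
        AND a.ColumnName = b.ColumnName
WHERE a.DatabaseName = 'COMPANY_A_STG'
  AND b.DatabaseName = 'COMPANY_B_STG'
  AND (a.ColumnType <> b.ColumnType OR a.ColumnLength <> b.ColumnLength);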

Wednesday, February 24, 2010

Data Integration

“Data integration involves combining data residing in different sources and providing users with a unified view of these data.” [Maurizio Lenzerini (2002). "Data Integration: A Theoretical Perspective". PODS 2002. pp. 233-246.] {emphasis added}

Data integration can be accomplished at an enterprise level, departmental level, or lower level.
The most common method of “data integration” is the Data Warehouse (DW). A DW can exist at an Enterprise level (EDW) or at lower hierarchical levels. At these lower levels, DW is really a misnomer as these data stores provide data for specific purposes whereas a DW, especially at the enterprise level, should provide a non-application-specific “single source of truth.”

Data stores at less than the enterprise level are more appropriately called Data Marts (DM). DMs are designed to answer specific business questions or serve a specific departmental need. These DMs may be populated directly from the operational databases or, more commonly today, from the EDW, enforcing the “single source of truth” concept.

I have done “data integration” my entire data career, but it has only recently become a named activity, with the paper cited above being the oldest publication I have found that uses the term (2002).

The process of getting data items from the source systems into the DW is complicated and requires significant research, analysis, and understanding of the source system's data stores. This has recently become known as “Data Profiling” but has been done “as a matter of fact” since data stores were first created. The actual movement of data is done by ETL procedures.

ETL means “Extract, Transform, and Load,” although the Transform step can be done as part of the Extract step, as part of the Load step, or separately from both.

The term “data integration” has recently been expanded to include sharing data directly from several operational source systems into one single presentation layer to provide real-time information. Customer information such as addresses and phone numbers may be kept in one data store, while customer sales may be kept in a different data store. Customer credit card information may be in a third data store. Gathering pieces of data from these three data stores into a single presentation layer such as a real-time Web-based GUI may provide customer service personnel the ability to provide accurate and timely help in resolving disputes while enabling instant updates and corrections to data in all three data stores.

Getting real-time data from operational data stores may be detrimental to the source system's performance. Therefore, “near real time” has become the norm. Data can be exported from the operational data stores into an Operational Data Store (ODS) at very short intervals. However, coordinating the usability of the data needs to be planned so that data from one source does not become out-of-sync (not synchronized) with data from another source. If three data stores supply data to the ODS, will all three need to be updated before the updated data is usable? Many synchronization problems must be addressed so that data from the source systems are usable and updates in the ODS are sent back to the proper source system in a timely manner. However, most ODS systems do not allow direct updates to their data, requiring changes to be made directly to the source system and then synchronized at the ODS the next time it is updated.

Friday, February 12, 2010

The Need For Data Profiling

A new discussion was started on LinkedIn, White Paper: The Data-Driven Enterprise. Just in case you can't open the LinkedIn discussion itself, the discussion references another URL for the white paper's source.

The referenced article begins with
Data is one of your enterprise’s most valuable assets. So why are so many organisations still accepting business decisions based on inconsistent, unreliable data as being “good enough”?

The data being discussed is in relation to data warehousing the operational system's data.

I have found in my career that most IT managers do not understand the premise of "Garbage In, Garbage Out" well enough to want to fix the problem at the source. Most operational database systems have evolved over the years into, quite honestly, coding nightmares. As problems were discovered with data and new data needed to be added, the quickest way to deal with these was to change the "code". The code could be in the form of Extract/Transform/Load (ETL) batch processing scripts, stored procedures or functions, or application program interfaces (API).

APIs control the way database applications and users interface with the database. When problems arise with the data, the API can be modified to work around the bad data by "fixing" it on the fly. When new data needs to be added, a column in an existing table that is not being used can be "re-purposed" by putting the new data in that location and have the API control access to that column so no one really needs to know what table and column the data is really stored in.

I have seen databases that contained tables with columns such as "A", "B", etc. These columns were put in the table because it seemed likely that new data items would be needed in the future and this would accommodate them without having to add new columns. Some databases have also been created with extra tables to accommodate new entities, such as value lists, without having to create a new table. This is especially common in proprietary databases that are sold to customers but must be modified to meet each customer's needs. By using this generic database schema, the vendor can update the database and associated APIs without having to worry about each of its customers' modifications.

When this operational data needs to be stored in a data warehouse, business users, managers, and C-level executives don't understand the importance of data profiling and data quality analysis. They feel that the data, and the information that data provides, in the source operational system just needs to be moved to the data warehouse; a very simple operation. What they don't understand is the years of data manipulation that has been done within the database applications to provide that "data quality."

Also, the original documentation, if any, has not been kept updated with all the programming and/or database changes made. Original requirements documentation may be missing or out-of-date; new requirements may be in disconnected pieces of documentation that can only be found by someone who worked on those specific projects years ago.

Many times, the only data models of these legacy source systems are the ones that will be reverse-engineered from the existing database; metadata will be missing. If the original requirements and design specifications can be found, metadata may be reverse-engineered from them, but, in most cases, inferences and "best guesses" will have to be made.
Significant data profiling will be required to ensure that only good data is brought into the data warehouse. Just because a table or column is populated in the operational source does not mean it is even used in the current version; it may be just a "leftover" that can't be removed, for legacy reasons and to minimize upgrade time.
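A small sketch of what that verification looks like in practice; the table and column names are made up. Before trusting a column, check whether it is actually populated and what its values look like:

-- how full is the column, and how many distinct values does it carry?
SELECT  COUNT(*)              AS total_rows,
        COUNT(col_a)          AS populated_rows,
        COUNT(DISTINCT col_a) AS distinct_values
FROM stage_db.legacy_orders;

-- most frequent values, to spot defaults, dummies, and re-purposed codes
SELECT  col_a,
        COUNT(*) AS row_count
FROM stage_db.legacy_orders
GROUP BY col_a
ORDER BY row_count DESC;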

Codes and values may not have lookup tables in the operational system; they may all be hard-coded in the API. Foreign keys (FK) may not be enforced (sometimes called "logical" FKs); I have even seen so-called child tables in operational database systems that get populated before the parent table is populated! These orphaned child records are not really orphans; they are placeholders for transactional history until the parent transaction (an order, for example) has met some requirement, such as at least one order item record having been created or the order having been tendered.

Thus the Extract and Transform steps of ETL become much more complicated than just moving the data. Because data profiling and data quality analysis can be so time-consuming and resource intensive, and business managers and C-level executives don't understand the need for it, project budgets don't provide time, money, or resources for these tasks. These tasks then become bundled within other tasks that then go way over budget in all aspects: time, money, and resources. The push then becomes to "do the best with what you've got and make the best educated guesses (I prefer SWAG - Sophisticated Wild Ass Guess) possible; we'll fix it later."

As Data Architects and Data Managers, assuming enterprise standards have been created, we need to start enforcing data standards and not allow them to be ignored. This requires lots of teaching and convincing of upper management people and perseverance by Data Architects and Data Managers. Unfortunately, careers are sometimes killed by doing this, but then that was probably inevitable anyway. If not done right from the beginning, the unplanned fixes will become a nightmare in the future when even more heads will roll.

If you don't have time (money, resources) to do it right, when will you have time (money, resources) to do it over?

Wednesday, February 3, 2010

What should you ask a potential client about?

I was recently asked to provide some "initial scoping" questions to be asked when meeting with potential clients.  Perhaps the word clients should be expanded to include both external customers for consultants and internal business units for full-time employees of a company.  Finding out what the client wants requires a consultative approach.  You must bring your expertise to the table and be willing to share and advise.  The client needs to know if their expectations are realistic or not; as do you, of course.

First of all, I want to know if a way of doing what they are asking to be done exists right now; this can be manually or some other computer application.  If the client wants to automate (computerize) a task that they do not do manually now, my advice would be to do it manually first so you have something to check the automated results against.  If it cannot be done manually, it can't be automated / computerized.  If it is not being done in some way now, how will they know if the new system is doing it right?

Sometimes clients say that they have another application that needs to be upgraded / revamped / replaced.  The following questions apply to both existing manual and computerized processes:
  • Why do you need to upgrade / revamp / replace the existing method?
    • Too slow
    • Incorrect results
    • Missing needs
    • Cannot grow
    • Need to provide results to a larger audience
  • What have you already tried to improve the existing method?
    • What has worked?
    • What has not worked?
  • Are the business rules documented for the current method?
    • If not, reverse engineering of the current method can be done to determine what the current algorithms are, but this will not prove that the original business rules are being followed.  I have built systems that proved the existing system was doing it wrong, but it takes much time, effort, and documentation to prove that to the business.
    • If yes, will the same rules apply to the new system?
  • What do you like about the existing method?
    • Interface
    • Reports
    • Processing
  • What do you dislike about the existing method?
    • How can these be improved?
  • What is missing from the existing method that you want to add?
    • Reports
    • Dashboards
    • Functionality
  • What is included with the existing method (software) that you do not use and can be eliminated from the new solution?
    • Get rid of "bells and whistles."
  • What is the timeframe?
    • Can a phased approach be used?
  • What is the budget?
  • What are the constraints?
    • Must use existing infrastructure
    • No internal resources available
    • Must use existing in-house developers (and languages), databases, operating systems, etc.
  • What phase of this project are you in right now?
    • If not in the initial Plan and Analyze phase, it may be too late to recommend any changes.
  • Do you have dedicated resources who will be working with me?
  • How will results be verified / validated / proven?
  • Who will take ownership of the completed system?
  • Who will support the completed system?
Remember, these are initial questions to help get an understanding of and manage the client's expectations.  There will be many more questions to ask as the cycle progresses.

Tuesday, February 2, 2010

Data Architect Job Description

I found this definition of a Data Architect on CareerBuilder.
Defines, designs, and builds relational and/or multi-dimensional databases for warehousing of data. Confers with client, technical staff, and team members to plan, design, develop, implement, and enhance applications, scripts, procedures, and metadata for relational databases. Reviews current data structures and recommends optimizations and reconfigurations as warranted. Generates data acquisition and archive recovery strategies. Works with Database Administrators to migrate complex large scale applications into test and production environments. Integrates new data with existing warehouse structure and continuously improves system functionality. Evaluates and recommends new tools and methodologies. Consults with vendors to integrate selected packages. Deletes old data. Develops and maintains all standards in accordance with data structures and models. May work with organization departments to design and develop data marts. (Source: eDOT Job Description)
Do you agree with this definition? Does it include too much or not enough? I agree with most of it, but I question or disagree with the following points:
  • archive recovery strategies
    This is more of a DBA or Disaster Recovery role.
  • continuously improves system functionality
    Not quite sure what is meant by this; it seems to be a DBA or development team role.
  • Evaluates and recommends new tools and methodologies
    What new tools and methodologies? I might do this for data modeling tools like Erwin or PowerDesigner if that is what they are looking for, and I can do this for data architecture-related practices such as data governance, data quality, naming conventions, modeling standards, and so forth.
  • Consults with vendors to integrate selected packages.
    This is a Solution, System, or Application Architect role.
  • Deletes old data.
    This is done by the user, the application, or the DBA.

Thursday, January 28, 2010

Data Warehousing: Remembering the Basics

A discussion was started in one of the LinkedIn groups I belong to about "Data Warehousing: Remembering the Basics".

There is a link to an article that was to be used as the catalyst for the discussion, but it is broken. Therefore, I'm not sure what the focus of the article would have been. But the reference to Inmon's classic approach in the discussion originator's comments leads me to believe the article would favor the relational schema.

(Link to the article being referenced is now at http://tdwi.org/articles/2010/01/28/remembering-the-basics.aspx.)

Many original data warehouses followed Kimball's philosophy and were designed in the Star (Dimensional) schema, each satisfying a specific business need.  After multiple departments within a company each had their own data warehouse, it became evident that there was no single source of truth.  Different versions of truth evolved depending on what business unit created the rules for the data warehouse and its ETL.

Following Kimball's philosophy, when an effort was made to merge these diverse data warehouses into a single enterprise data warehouse using the Star schema, the complexity of this became readily apparent. Many relatively small dimensional schemas proved quite difficult, if not impossible, to merge into a single source of truth. Setting aside the political correctness needed (whose naming convention and data type was "right", not to mention the "right" definitions), deciding on what should be in this single source of truth became fraught with power struggles.  And the granularity of these many dimensional data warehouses was often not fine enough to "drill down" to the detail behind the "facts": quite often the fact tables were not at the atomic level, so drill-down to the lowest level of detail was not possible.

The Star (Dimensional) schema requires much denormalization.  The term denormalization implies there was a normalized database to start with.  Since dimensionally modeled data warehouses are, by definition, not normalized, building an Enterprise Data Warehouse from them does not allow denormalization.  The Star schema is not, in my opinion and that of many others, workable for an enterprise data warehouse (EDW).

In most cases, I feel that Inmon's philosophy that a data warehouse should use a relational data model in at least 3NF is most appropriate for an EDW.  Following Inmon's philosophy, the EDW is neither business nor application focused. The EDW is sourced from the original operational database systems; the data marts are sourced from the EDW - the single source of truth.

In my recent experiences, the original "data warehouses" that were application / business specific became "data marts".

This is a good topic to discuss.  Visit the LinkedIn discussion and post your comments there or, if you aren't a member of LinkedIn, post your comments here.

Wednesday, January 6, 2010

Data Architect versus Database Architect

Introduction

It seems that many people don't understand the difference between a Data Architect and a Database Architect and the two titles are used interchangeably.  I, however, view them as separate skill sets and positions.
These web sites are examples of how these terms are used.
http://www.databaseguides.com/database-architecture
This definition of database architecture seems to start out by describing the physical infrastructure of a database system, but then switches to the architecture of the physical data model or data schema (the structure of how the data will be stored).
http://www.wisegeek.com/what-is-a-data-architect.htm
This person's definition of a Data Architect is consistent with mine.

Confusion

Perhaps the confusion begins with the term database.  Is a database the structure (schema) of how the data will be stored (Hierarchical, Object, Relational, Dimensional, CODASYL, etc.) or what I call the "engine"? 

The database engine is the software application, or Database Management System (DBMS), that allows us to store the data.  The most common database engines today are Relational Database Management Systems (RDBMS).  These include Teradata, Oracle, and MS SQL Server.  Within each DBMS, there may be multiple schemata, or collections of data. (Each individual collection of data within a DBMS is called a schema; the plural is schemata, although the two forms are often used interchangeably.)

The term database seems to be freely used for both the schema and the DBMS (engine) and the distinction only becomes apparent through the context the word is used in.  Therefore, I feel there should be a clear delineation between what a Data Architect does and what a Database Architect does.

Data Architect

A Data Architect (DA) is someone who focuses first on the data itself and how it will be used and maybe later on how it will be stored (schema) on the chosen DBMS.  It is the DA who strives to ensure that the "Garbage" in "Garbage In, Garbage Out" (GIGO) won't exist.  The DA should start working at the very beginning of a project in the Plan and Analyze phases.

At this point, the DA will be working with the Business Analyst (BA) and the client stakeholders to get the metadata for the data items (entities and attributes) that will be needed for the solution.  It is very important to understand early in the project what the business means by Customer, Order Date, Sale Date, etc.  The DA should be considering the big picture, i.e., the enterprise (global) use of the data, so the same data item won't have different definitions (metadata) and data types in other data stores.

Typical deliverables in the plan and analyze phases are the Conceptual and Logical Data Models.  These documents are not concerned with what database engine (Teradata, Oracle, MS SQL Server, etc.) the data will be stored on.  Once there is consensus on what data items are needed and how they are related, then the DA may pass the Logical Data Model off to a Data Modeler or Data Analyst to create the Physical Data Model or schema; the DA may do this model, too.

This is generally done in the Design phase of the project even though it is really in the DA's Build and Deploy phases.  This is because there are others who can't do their work until the data store's structure has been completed and published.  At this point, the DA enters a supporting role to those who will be using the data.

Once the Logical data model is completed and agreed to with the business stakeholders, the database engine (probably an RDBMS) should be chosen.  Based on data volume metrics for initial and future storage capacity needed, maximum expected number of concurrent users, batch processing needed, and other usage factors, a decision should be made as to which database engine is the best fit.  Quite often, however, this decision has already been made because the company has invested money, time, and staff in one database engine and does not want to invest more in a second one, even if the existing engine will have scalability issues.

Database Architect

When the data-centric project gets to the point of choosing what DBMS will be used to store the data, this is the person who guides the selection of a DBMS and architects the infrastructure for it.  Things like how many servers will be needed, how many tiers there will be, how much disk space is needed initially and three years later, how many concurrent users must be supported, and so forth are what the Database Architect is concerned with and will be creating a solution for.

The Database Architect is generally someone who is expert in the chosen DBMS.  At first, there may be several of these people, each representing a competing DBMS provider.  Vendors usually provide these people to design, estimate, and (if chosen) be the company's DBMS solution provider.  Once the DBMS is chosen by the program / project managers and team members and the CIO/CFO approve the selection, then a Database Architect who specializes in that DBMS is needed.

The Database Architect could be the same person who will be the Database Administrator (DBA); it would be better, in my opinion, to have a consultant act as the Database Architect because once the database system (DBMS and hardware) is designed and installed, there won't be much need for a Database Architect until the system needs to be upgraded, modified, or expanded.  Generally, the Database Architect should be provided by the vendor of the DBMS since they will have the greatest and most current knowledge of their DBMS.  A DBA should work alongside the Database Architect so there can be knowledge transfer of the infrastructure that the DBMS will reside on and the DBA will be monitoring.

Database Administrator

I included this third role in this dissertation even though it was not my original intent.  It seems that most people understand the operational role of the Database Administrator (DBA), but may not understand the DBA's role in modeling the data store.

The Data Architect (or Modeler or Analyst) will work with the DBA when the Logical Data Model is transformed into the Physical Data Model (schema).  It is very rare that a Utopian Logical Data Model is transformed directly to a Physical Data Model.  What is great in theory (or from a logical business perspective) is not always best in reality.  The DBA should be the one guiding the person doing the Physical Data Model as to what will work best with the chosen DBMS.  The DBA needs to understand the relationships between the entities, constraints and data validation needed, and what attributes will need indexing for SQL performance.  Once the Data Architect has gone on to other projects, the DBA will generally be the first contact for developers and users who need to understand how to get data out of the schema contained within the DBMS.  The DBA may need to consult with the Data Architect, but the users will generally contact the DBA first.

Sometimes, the DBA is also the Physical Data Modeler, but I don't think this is a good approach.  The DBA focuses on the performance and ease of maintenance of the DBMS; the Data Architect / Modeler / Analyst focuses on the schema, data quality, data structure, data relationships, and data use (queries).  These are definitely interrelated aspects, but it is hard for one person to advocate them equally.  What the modeler sees as a best solution may not be what the DBA sees as providing the best performance and least disk space consumption.  It is best when two or more people can provide input for both sides of the issue and come up with a solution that is agreeable to both.