What is the difference between Surrogate Key and Primary Key? | LinkedIn
Introduction
There is no difference to the meanings below due to the data model; these are relational database management system (RDBMS) concepts. Whether you are using a normalized model or a dimensional model, the "primary key", "surrogate key", "natural key", and "alternate key" terms all have the same meanings.
A Google search turned up lots of places to read up on keys, including these:
Primary Key
A Primary Key (PK) in a database table uniquely identifies a row in the table. There can be only one PK per table. Depending on the database engine being used, a PK may require the explicit creation of a unique index or the database engine may automatically create such an index for the column or columns identified as the PK column(s).
A primary key can consist of a single column using a "surrogate" value or be made up of one or more columns in the table that will uniquely identify a row (also called a "natural" or "meaningful" key).
Surrogate Key, aka Meaningless Key
The term "Surrogate Key" describes a particular type of Primary Key (PK). There really is no such thing as a "Surrogate Key". There are Primary Keys that use surrogate values. A PK that uses a surrogate value always consists of only one column.
The surrogate value is generally a system-generated value that nobody would know what it is to search for from a query. This PK may then be used as a foreign key (FK) in other tables and used for join conditions.
A Primary Key that uses a surrogate value is also known as a "meaningless" key. A table with only a "surrogate key" (primary key using an arbitrary value) must have an alternate key (AK) (see below). Without an AK, it is very likely that there will be "duplicate" rows in the table. An explanation of this follows below.
Natural Key, aka Meaningful Key
The other type of PK is called a "natural key". A natural (or meaningful) primary key is made up of one or more columns in a row that uniquely identify that row.
The column or columns used for a natural key contain values that have meaning to the business.
If the natural key is the PK, then there can be no surrogate key. If the natural key is not the PK, it is an alternate key.
Alternate Key
While there can be only one primary key in a table, there may be more than one "alternate" keys in a table. These alternate keys are controlled by one or more unique indexes. A table with only a "surrogate key" (primary key using an arbitrary number) and no unique index for an alternate key is useless - you are doomed to have duplicates in rows for every column EXCEPT the surrogate key which does not help you find a unique row at all.
Generally, the columns used for the alternate key(s) are the columns that people will use in their queries to find a unique row, but the joins to other tables will be done using the single PK column comprising the "meaningless key" with the PK column being used as a FK column in those other tables.
The use of surrogate key values prevents the need to duplicate multiple columns in "child" tables just to be able to find the children of the "parent" record in a JOIN (or vice versa).
Impact of Primary Key Design
Let's examine the impact and implications of designing a table's primary key. Table_1, below, is a table that has a PK made up of a surrogate value, hence "Surrogate Key", and no AK.
Table_1: This table has a surrogate primary key (PK_ID) but no alternate key.
T1_PK_ID | T1_COL1 | T1_COL2 | T1_COL3 |
1 |
A |
B |
C |
2 |
A |
B |
C |
3 |
A |
D |
E |
While the three rows above are unique in the PK value (PK_ID column), rows 1 and 2 are really not unique rows, are they.
A table with only a "surrogate key" (PK using an arbitrary value) and no unique index for an alternate key (or the natural key) is useless - you are doomed to have duplicates in rows for every column EXCEPT the surrogate key which does not help you find a unique row at all.
If Table_1 were created with a unique index on the columns that make up an alternate key, it would look like Table_2 below.
Table_2: This table has a surrogate primary key (T2_PK_ID) plus an alternate key (T2_AK1, T2_AK2) enforced by a unique index.
T2_PK_ID |
T2_AK1 |
T2_AK2 |
T2_COL3 |
1 |
A |
B |
C |
2 |
A |
D |
E |
Generally, the alternate key columns are the columns that people will use in their queries to find a row, but the join to another table will be done using the single primary key column comprising the "meaningless key" which is a FK in the table being joined to. I'll show an example of this later.
A primary key can consist of a single "surrogate" value or be made up of one or more columns in the table that will uniquely identify a row (also called a meaningful key). If Table_1 were recreated using the natural key, it would look like Table_3 below.
Table_3: This table has a meaningful primary key (T3_PK1, T3_PK2).
T3_PK1 |
T3_PK2 |
T3_COL3 |
A |
B |
C |
A |
D |
E |
If Table_3 was joined to another table, that table would have to include all the columns making up the primary key in Table_3; these would be the foreign key (FK) columns. The use of surrogate key values as in Table_2 prevents the need to duplicate multiple columns in "child" tables just to be able to find the unique "parent" records in a JOIN, and vice versa.
As an example of what I stated above, let's assume the following table structure.
Table_4: This table has a surrogate primary key (T4_PK), a natural key (T4_AK1, T4_AK2), and foreign key columns (T3_PK1, T3_PK2) needed to join this table to Table_3 ."
T4_PK |
T4_AK1 |
T4_AK2 |
T4_COL3 |
T3_PK1 |
T3_PK2 |
100 |
AA |
BA |
CA |
A |
B |
101 |
AA |
BB |
CB |
A |
D |
A query to find a row in Table_4 and join it to Table_3 to get another column would look like this:
SELECT
T4.T4_PK,
T4.T4_AK1,
T4.T4_AK2,
T4.T4_COL3,
T3.T3_COL3
FROM
Table_4 "T4"
INNER JOIN Table_3 "T3" ON
T4.T3_PK1 = T3.T3_PK1 AND
T4.T3_PK2 = T3.T3_PK2
WHERE
T4.T4_AK1 = 'AA' AND
T4.T4_AK2 = 'BA'
The result from the query above would be:
T4_PK T4_AK1 T4_AK2 T4_COL3 T3_COL3
---------------------- ------ ------ ------- -------
100 AA BA CA C
If this same query were modified to be run against Table_1, assuming T1_COL1 is the same as T3_PK1 and T1_COL2 is the same as T3_PK2, the query would look like this:
SELECT
T4.T4_PK,
T4.T4_AK1,
T4.T4_AK2,
T4.T4_COL3,
T1.T1_COL3
FROM
Table_4 "T4"
INNER JOIN Table_1 "T1" ON
T4.T3_PK1 = T1.T1_COL1 AND
T4.T3_PK2 = T1.T1_COL2
WHERE
T4.T4_AK1 = 'AA' AND
T4.T4_AK2 = 'BA'
The result from the query above would be:
T4_PK T4_AK1 T4_AK2 T4_COL3 T1_COL3
---------------------- ------ ------ ------- -------
100 AA BA CA C
100 AA BA CA C
Instead of a single row being returned, we now get two rows.
Remember we created a modified version of Table_1 giving it an AK as Table_2. If we run the same query above against Table_2, changing the table and column names in the JOIN condition, we see that the AK in Table_2 prevented the "duplicate" row and we get only one row returned.
SELECT
T4.T4_PK,
T4.T4_AK1,
T4.T4_AK2,
T4.T4_COL3,
T2.T2_COL3
FROM
Table_4 "T4"
INNER JOIN Table_2 "T2" ON
T4.T3_PK1 = T2.T2_AK1 AND
T4.T3_PK2 = T2.T2_AK2
WHERE
T4.T4_AK1 = 'AA' AND
T4.T4_AK2 = 'BA'
Results:
T4_PK T4_AK1 T4_AK2 T4_COL3 T2_COL3
---------------------- ------ ------ ------- -------
100 AA BA CA C
The table below, Table_5, has a surrogate primary key (T5_PK), a natural key (T5_AK1, T5_AK2), and foreign key columns needed to join this table to Table_1 and Table_4. Since T1_PK_ID is a surrogate (meaningless) key, it is not expected that one would know what it is. Therefore, searches for a row would be based on other columns thought to uniquely identify a row, in this case T1_COL1, T1_COL2, and T1_COL3.
Table_5:
T5_PK |
T5_AK1 |
T5_AK2 |
T5_COL4 |
T4_PK |
T1_COL1 |
T1_COL2 |
T1_COL3 |
100 |
AA |
BA |
CA |
100 |
A |
B |
C |
101 |
AA |
BB |
CB |
101 |
A |
D |
E |
In the query below, we will try to get primary key value from Table_1; let's say we need to use it in another query. The expectation is a return of a single row, but, because the rows are not truly unique based on the other columns in Table_1, two rows are returned:
SELECT
Table_1.T1_PK_ID
FROM Table_5
INNER JOIN Table_4 ON
Table_5.T4_PK = Table_4.T4_PK
INNER JOIN table_1 ON
Table_5.t1_col1 = table_1.t1_col1
AND Table_5.t1_col2 = Table_1.t1_col2
WHERE
Table_5.t1_col1 = 'A'
AND Table_5.t1_col2 = 'B'
Results:
T1_PK_ID
----------------------
1
2
If we wanted to get the value of Table_1.T1_COL3 for the "unique" values in the supposed "key" values in T1_COL1 and T1_COL2 using the SQL below, we will get an error.
SELECT
T1_COL3
FROM
Table_1
WHERE
t1_pk_id = (select t1_pk_id
from table_1
where
t1_col1 = 'A'
AND t1_col2 = 'B')
Error starting at line 1 in command:
SELECT
T1_COL3
FROM
Table_1
WHERE
t1_pk_id = (select t1_pk_id
from table_1
where
t1_col1 = 'A'
AND t1_col2 = 'B')
Error report:
SQL Error: ORA-01427: single-row subquery returns more than one row
01427. 00000 - "single-row subquery returns more than one row"
*Cause:
*Action:
Conclusion
While this started out with a question asking about the difference between a "surrogate key" and a "primary key", it evolved into a dissertaion on primary keys and how they can be designed. You should have learned:
- There really is no such thing as a "surrogate key; there are primary keys using surrogate values."
- A "primary key" can be made up from
- a single column using a surrogate value
- one or more columns that have unique values and known business meanings
- A table with a PK using a single column with a surrogate value MUST have an alternate key (AK)
- There are impacts to SQL writing if the PK is designed wrong!
I could go on about the impact of primary keys in a data warehouse environment where multiple rows with the same primary key must be maintained for historical purposes (slowly changing dimension tables), but I won't! Aren't you happy?! This disssertation should give you a "basic" understanding of primary keys and I leave it up to you to apply this understanding in your data model designs.