In my blog about relationships, I explained that we often separate information across multiple tables. This is the beauty of a relational database! The added complexity can take a lot to get used to, but having multiple relations (tables) simplifies data! It removes redundant information and has the capacity to store large amounts of complex data. The use of multiple tables suggests that we need to use keys. A key is a column within each database table that keeps an individual row unique. Not only do keys allow each row to be unique, but they allow us to connect tables by using multiple keys. I’ve used this picture before, but it illustrates things perfectly:
Here you can see that the Buyer ID attribute is connected to the User ID attribute. With referential integrity, we can make it so that we can not have a buyer who is not also in the users table! This is known as a foreign key. A foreign key references the primary key of another table! In this case, the primary key is the user ID and the foreign key is the User ID.
Now that we are talking about keys, I can explain this in more detail!
We can see from the picture we have the User ID. In an actual database application this might be titled user_id. Anyways… The User ID keeps every single row unique. Obviously multiple people could have the same name, if we had the situation where more than one Caleb became a user, the user ID is the only thing that would tell them apart.
Now that you have the fundamentals of keys, we will be talking about every kind of key!
Types of Keys
The first type of key we have is called a superkey. A superkey is any group of columns (attributes) that uniquely defines a row. Each row will be an entity, the ID is used to keep that entity unique to prevent duplicate data. This is important because we could have data like this:
Here we can see that we have duplicate data because we failed to use a primary key.
From this you can see that either A) Caleb D Curry is incorrectly in the database twice, or B) There are two people with the name Caleb D Curry. We don’t know!
A superkey can be multiple columns. For example, the superkey can be a combination of a name, phone number, address, and birthday. This would always be unique unless two people with the same name lived at the same place with the same phone number. Although this combination of columns is possible, we will learn later that it is not practical. To explain a lot of these keys, we will use groups of columns to explain concepts.
A candidate key is any superkey that uses the least amount of columns to uniquely distinguish each row.
In a table for users, the combination of first name, last name, social security number (or SSN… a number given to every American), and phone number would give us a super key. this is because we will never have two users with the same name, SSN, and phone number. The actual phone number though is not required for each row to be unique. This is because the combination of first name, last name, and SSN is more than enough to keep each row unique:
Here you can see that the candidate key is a portion of the super key. Keep in mind though that if the super key is already the least amount of columns then both the superkey and the candidate key will be the same thing.
In fact, we could probably use only the last name and SSN.
A primary key is a candidate key that is the most appropriate to use as the “main key.” The primary key is used for reference, such as “user with the ID of 8593.” The words “most appropriate” are very vague and many people argue over what is considered the best primary key, but keep in mind that the best key will depend a lot on the purpose of the database and the function of the application.
Let’s think of an example. Let’s say that we have these candidate keys in a table for users:
- First name + middle initial + phone number
- first name + last name + address
- First name + last name + social security number (if applicable to country of course).
I would say that the third option (First name + last name + social security number) would be the most practical because we will never have two people with the same name and same social security number. This would be our primary key. I chose to use the persons name because your social security number may not be unique to you, even though it should be.
An alternate key is any candidate key that you have not selected as the primary key. The alternate key is a candidate key as well so it meets the requirements of being a primary key but is just not the “best” option. So basically, it works like a primary key but is not used like one. The alternate key is an alternative to the primary key. Another name for an alternate key is secondary key.
Both the primary key and alternate key are candidate keys. the term “candidate” means that it has the potential of being the primary key. By default, any candidate key that is not chosen to be the primary key becomes an alternate key:
Here we can visually see the differences between the terms.
A simple key is any key that consists of only one column. Earlier we had the User ID column that was used as the primary key for the Users table.
A compound Key is a multiple column key where all of the columns function as simple keys.
Here you can see how a compound key is created from multiple keys. Composite keys are common in many to many relationships.
A composite key is a key of multiple columns where not all of the columns function as simple keys.
You may hear the terms compound key and composite key used interchangeably, but this is technically incorrect.
A natural key is a key that comes naturally from the information stored within a database. A good example of this is a users social security number.
A surrogate key is a key that is made up for the sole purpose of being a primary key.
The main difference between natural and surrogate keys is that natural keys have real world meaning while surrogate keys do not. For example, a social security number is a real-world thing, but a made up user ID that is never shared with the user has no real-world value.
A foreign key is a reference to a primary key in another table. Go back to the top of the page and look at the picture with the User ID and the Buyer ID… Buyer ID is a foreign key that references the user ID primary key from the other table. This is how we connect all of our tables.
So I’ve poured out a lot of information in the last few paragraphs, and if you are anything newer than an intermediate with database design, a lot of these terms are probably new to you. So the rest of this blog will be dedicated to explaining what is important when it comes to worrying about keys.
The first thing is to realize the classifications of keys. We have the hierarchy of superkey -> candidate key –> primary or alternate key.
The second thing is that these keys can then be classified into categories of simple, compound, natural, etc…
and the third thing is foreign keys, which reference a primary key from another table.
This means means that every key will either be primary, alternate, or foreign. A key will be of one of these 3 types.
Natural vs Surrogate Primary Key
So many people argue over whether or not a key should be a natural key (such as a social security number), or a surrogate key (such as a made up ID). I’m not going to go on and on about which is best and why, but I am going to say that I do prefer surrogate primary keys. Here is my reasoning:
- There should be a preference for keys that do not have real world meaning. This is because real world values (or the meaning of values) are subject to change, and primary keys should never change!
- I like consistency… If we use natural keys, we will sometimes run into the situation where we do not have any candidate keys to choose as a natural primary key. This means that some tables will have natural keys while others will have surrogate keys. This requires extra work because we must know which tables have real-world values of high importance and which tables are using made up IDs.
- I do not think keys should be known to the general public. Once a surrogate key is given to the general public it now has real-world meaning. For example, places such as schools, give students their student ID. Students use this number for so many things! Although this key is surrogate because we created it, once given to the students it develops real-world meaning and is used as a natural key. I think this motivates me to use surrogate keys because natural keys are, once again, real-world values and are subject to change. Or, the school could decide to use a new ID system, causing changes to the primary key of the students database table, and primary keys should be non-changing.
- The natural key can change at some point in time. This means that if we use our natural key for a primary key then whenever it updates we must also update all programmed SQL joins that use the primary key. With a surrogate key, it will never change.
To apply this to my database, I create my own meaningless (in the real-world, anyways) ID column to distinguish rows. Not only does this prevent change from natural causes, but it is a single column number, which typically is best for performance.
It is best though to always have a natural secondary key. This means that before you create a surrogate key you should be able to uniquely define each row. Otherwise, we can be confused due to multiple columns having duplicate data aside from the unique surrogate key. A natural key is the most common key used for a where clause in SQL. If this where clause returns multiple rows when only one should be returned, we have a natural key problem.
What should my Primary Key Look Like?
- Never changing
- Never null (we can’t have an empty key)
- Preferably a number
- Preferably surrogate
- Preferably a simple key (except in the case of intermediary tables…we’ll talk about that in future blogs).
- The data type characteristics should not potentially limit the number of rows (we’ll talk about this when we talk about data types).
Keep in mind that foreign keys go by all of the same rules because they have all of the same column characteristics and data types as the primary key they reference.
- All primary keys and alternate keys are candidate keys. all candidate keys are superkeys. But all superkeys are not candidate keys and a candidate key is either primary or alternate.
- Keys can be classified as simple, compound, or composite.
- Keys can also be classified as natural or surrogate.
- Foreign keys are a copy/reference of the primary key connecting multiple tables.
- Surrogate keys are recommended.