Introduction to Database Normalization

Any time that you design a database you will have to use normal forms. Normal forms are steps used create the best database design.

Good database design prevents non-scalar data. Non-scalar data is data that does not store only one value. Having non-scalar data violates the rule that everything should be atomic.

Good database design prevents repeating data. For example, take a look at this incorrect table:

user_idfirst_namelast_nameemail
54CalebCurryepic@legit.com
54CalebCurryCooldude@yo.com

In order for Caleb Curry to store two emails he created another column. Now we have duplicate data (two primary keys even though they should be unique, two first names, and two last names).

Good database design prevents anomalies. an anomaly is anything that goes against the normal or expected results.

The normal forms help prevent all of these by giving us rules. Following these rules to give us the best database design is known as database normalization.

There are a lot of different normal forms. I will list them all here as well as a small definition from Wikipedia. Feel free to skip over this table:

1NFFirst Normal FormThe domain of each attribute contains only atomic values, and the value of each attribute contains only a single value from that domain
2NFSecond Normal FormNo non-prime attribute in the table is functionally dependent on a proper subset of any candidate key
3NFThird Normal FormEvery non-prime attribute is non-transitively dependent on every candidate key in the table. The attributes that do not contribute to the description of the primary key are removed from the table. In other words, no transitive dependency is allowed.
EKNFElementary Key Normal FormEvery non-trivial functional dependency in the table is either the dependency of an elementary key attribute or a dependency on a superkey
BCNFBoyce-Codd Normal FormEvery non-trivial functional dependency in the table is a dependency on a superkey
4NFFourth Normal FormEvery non-trivial multivalued dependency in the table is a dependency on a superkey
5NFFifth Normal FormEvery non-trivial join dependency in the table is implied by the superkeys of the table
DKNFDomain/Key Normal FormEvery constraint on the table is a logical consequence of the table’s domain constraints and key constraints
6NFSixth Normal FormTable features no non-trivial join dependencies at all (with reference to generalized join operator)

Do not worry if all of this doesn’t make sense yet. I will be explaining things. First thing is first: it is not always necessary to go through all normal forms. Typically 3NF is considered “normalized.”

Higher levels of normalization remove redundant data but requires more processing power. You have to find the perfect balance.

We will be explaining the first 3 normal forms first and then continuing on with database concepts. We will address other normal forms as needed.

Fortunately, many of the tips I have already explained in previous blogs will help us get through these normal forms easily. For example, atomic values is part of 1NF. Designing binary relationships taught us how to break incorrect tables into multiple tables. This will all be done within the normal forms, as well.

First, review the following blogs:

First Normal Form (1NF)

First normal form tells us to:

  • Get rid of non atomic values,
  • Get rid of repeating row values,
  • Get rid of repeating columns.

Incorrect:

user_idfirst_namelast_nameemail
54CalebCurryepic@legit.com
54CalebCurryCooldude@yo.com

What’s wrong with it?:

We have repeating primary keys, for one. Primary keys are completely unique and you should never have a repeating primary key within a single table. You may have repeating primary keys from different tables such as a user_id of 54 and a transaction_id of 54, but these are both in different tables and are referring to a completely different entity. Foreign keys, on the other hand, can be repeated. This is because they all point back to a single primary key. We also have repeating first_name and last_name values. This is bad because it requires extra storage space and puts us at risk for anomalies. For example, Caleb Curry could change his name and update only one column, now we could have something like this:

user_idfirst_namelast_nameemail
54CalebCurryepic@legit.com
54CalebCrazyCooldude@yo.com

Now, according to this table we have two different names for the ID of 54.

incorrect example number two:

A solution that somebody might think of is to allow multiple emails in the email column:

user_idfirst_namelast_nameemails
54CalebCurryepic@legit.com Cooldude@yo.com

This gets rid of the incorrect data problem, but now we have multiple values within column. This is non-atomic data and is also incorrect.

Incorrect example number three:

user_idfirst_namelast_nameemail_1email_2

This is an incorrect solution because now we have repeating columns. This limits everyone to two emails and whenever somebody has one email they will have to leave the second column null.

Correct solution:

In order to allow people to store multiple emails we must create a one to many relationship over two tables.

Here you can see that the email was moved to another table. Each email row will have a user_id that tells who owns the email.

Now, whenever a user has multiple emails, a new row will be created in the email table with a foreign key referencing the user. This eliminates repeating data.

email:

email_iduser_idemail
154epic@legit.com
254Cooldude@yo.com

user:

user_idfirst_namelast_name
54CalebCurry

In this example, Caleb Curry’s emails were moved to the email table with a foreign key pointing back to his ID. This allows us to figure out which emails are his. The email column should be unique because each email is owned by only one person. This means we should never have the same value twice. The foreign key user_id referencing user_id in the user table can be used in multiple different rows because a user can have multiple emails. This is not repeating data, we are organizing our data in a way where each email is owned by one person all connected with a foreign key. A foreign key constraint will protect our data integrity (no disconnected primary keys). Also, if for some reason Caleb decides to change his name, we do not have to worry about anomalies because the column changed is correct and there is only one column with his name.

Basically, for first normal form, we should atomicity and uniqueness due to the dependency upon the key.

Second Normal Form (2NF)

In order for a database to be in 2NF:

  • it must already in 1NF
  • all of the non-key columns rely on every column of the composite key.

So for review, a composite key is a key that consists of multiple attributes (or columns). If each table only has one key then your table is automatically in second normal form.

Incorrect example:

Here you can see a bad table that lists every enrollment of students.

The class_name, instructor_last_name, and student_last_name are the composite key. This is because all three of these must be unique as a whole. Class name may be repeated because multiple people are enrolling in that class or because there are more than one of that class. Instructor may be repeated because an instructor can teach multiple classes. Student can be repeated because a student can enroll in multiple classes. This means that all three of the columns together must be unique: We can not have a student signing up for a class by a specific instructor twice.

Obviously there are some things wrong with this table. We have partial dependencies. A partial dependency is when a non-key column only depends on part of the key:

Here you can see the partial dependencies of our table. We can also see that description_of_major does not depend on the primary keys but rather the major of the student.

room_of_class depends on class_name because if we had a different class it would be in a different room. student_major depends upon the student_last_name because if we had a different student then the major could change (different people have different majors).

description_of_major could change if the student changed because the new student could have a different major. The only problem is that description_of_major depends mainly on the major of the student, not the student itself. This is called a transitive dependency… We will talk about this later.

In order to get this table into 2NF we must move a column of the key to another table whenever it has a partial dependency on it. Because class_name and student_last_name both had columns that were only partly dependent upon the key, we will move those to new tables as well as any transitive dependencies that come along with it. We will also move the instructor to another table because it is left by its self and the table would more appropriately be called instructor rather than enrollment. While we are at it, we will add some columns such as surrogate primary keys and first names. We must also realize that in order to keep the connection that a class is taught by an instructor, we must create a foreign key in the class table telling us the ID of the instructor (we will explain why later in this blog). We will end up with these three tables:

Here you can see that the first table is separated into three other tables.

Now in order for us to say what classes a student is taking, we can create a new table student_classes. This requires an intermediary table because it is a many to many relationship:

Correct example:

This new intermediary table (many to many relationship) accomplishes the same thing as the original enrollment table.

This accomplishes the same thing as the original enrollment table because we have a unique combination of a class and a student taking that class.

Back to the instructor_id problem… from the original table the instructor (part of the key) is dependent upon the class (because the instructor teaches the class). This requires us to remove it from the enrollment table (or the student_classes table in the updated version). Storing the instructor in the student_classes table is bad design because we would have repeating data.

We can figure out the instructor because it is in the class table. The way we have it now only has one connection to the instructor: from the class table to the instructor table.

Third Normal Form (3NF)

In order for a database to be in 3NF, we must:

  • First have our database in 2NF (and 1NF),
  • Take any transitive dependencies and take the column that is being depended on and move it to another table. We then replace that column with a foreign key.

Continuing from the previous example, we have a transitive dependency in our student table. A transitive dependency is kind of like a double dependency… description_of_major depends upon major which depends upon student_id. This is complicated and puts us at risk for repeating data and anomalies. To fix this, we move the column that is being depended upon in the transitive dependency and move it to a new table. We must then replace it with a foreign key:

Here you can see that we created a new table for major. I forgot to add a major_title in our major table.

Our final (basic) database structure:

Conclusion:

  • The data depends on the key (1NF), the whole key (2NF) and nothing but the key (3NF).

Further Study

Leave a Reply

Your email address will not be published. Required fields are marked *