Normalization in DBMS: 1NF, 2NF, 3NF and BCNF with Examples

When designing the schema of a relational database, one of the most important considerations is minimizing duplication. This serves 2 purposes:

  • Reducing the amount of storage needed to store the data.
  • Avoiding unnecessary data conflicts that may creep in because of multiple copies of the same data getting stored.

Normalization in DBMS

Database Normalization is a technique that helps in designing the schema of the database in an optimal manner so as to ensure the above points. The core idea of database normalization is to divide the tables into smaller subtables and store pointers to data rather than replicating it.

To understand normalization with example tables, let's assume that we are supposed to store the details of courses and instructors in a university. Here is what a sample database could look like:

Course code   Course venue      Instructor Name   Instructor’s phone number
CS101         Lecture Hall 20   Prof. George      +1 6514821924
CS152         Lecture Hall 21   Prof. Atkins      +1 6519272918
CS154         CS Auditorium     Prof. George      +1 6514821924

Here, the table stores the course code, course venue, instructor name, and instructor’s phone number. At first, this design seems to be good. However, issues start to develop once we need to modify information. For instance, suppose Prof. George changes his mobile number. In such a situation, we will have to make edits in 2 places. What if someone just edited the mobile number against CS101, but forgot to edit it for CS154? This will lead to stale/wrong information in the database.
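The stale-data problem described above can be reproduced in a few lines. This is a minimal sketch using Python's built-in sqlite3 module; the table name course_flat, the column names, and the new phone number are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE course_flat ("
    " course_code TEXT PRIMARY KEY,"
    " course_venue TEXT,"
    " instructor_name TEXT,"
    " instructor_phone TEXT)"
)
conn.executemany(
    "INSERT INTO course_flat VALUES (?, ?, ?, ?)",
    [
        ("CS101", "Lecture Hall 20", "Prof. George", "+1 6514821924"),
        ("CS152", "Lecture Hall 21", "Prof. Atkins", "+1 6519272918"),
        ("CS154", "CS Auditorium", "Prof. George", "+1 6514821924"),
    ],
)

# Someone updates the number against CS101 only and forgets CS154.
# The new number is a hypothetical value.
conn.execute(
    "UPDATE course_flat SET instructor_phone = ? WHERE course_code = ?",
    ("+1 6510000000", "CS101"),
)

# Prof. George now has two different numbers on record.
george_phones = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT instructor_phone FROM course_flat"
        " WHERE instructor_name = ? ORDER BY instructor_phone",
        ("Prof. George",),
    )
]
print(george_phones)
```

Nothing in the flat table stops the two rows from disagreeing, which is exactly the inconsistency normalization is meant to prevent.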

This problem, however, can be easily tackled by dividing our table into 2 simpler tables:

Table 1 (Instructor):

  • Instructor ID
  • Instructor Name
  • Instructor mobile number

Table 2 (Course):

  • Course code
  • Course venue
  • Instructor ID

Now, our data will look like the following:

Table 1 (Instructor): 

Instructor's ID   Instructor's name   Instructor's number
1                 Prof. George        +1 6514821924
2                 Prof. Atkins        +1 6519272918

Table 2 (Course): 

Course code   Course venue      Instructor ID
CS101         Lecture Hall 20   1
CS152         Lecture Hall 21   2
CS154         CS Auditorium     1

Basically, we store the instructors separately, and in the course table we do not store the entire data of the instructor. Instead, we store the ID of the instructor. Now, if someone wants to know the mobile number of the instructor, they can simply look it up in the instructor table. Also, if we were to change the mobile number of Prof. George, it can be done in exactly one place. This avoids the stale/wrong data problem.

Further, if you observe, the mobile number now need not be stored 2 times. We have stored it in just 1 place. This also saves storage. This may not be obvious in the above simple example. However, think about the case when there are hundreds of courses and instructors and for each instructor, we have to store not just the mobile number, but also other details like office address, email address, specialization, availability, etc. In such a situation, replicating so much data will increase the storage requirement unnecessarily.
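The two-table design described above can be sketched with Python's built-in sqlite3 module. The table and column names (instructor, course, instructor_id, and so on) and the new phone number are assumptions chosen for illustration; the point is that one UPDATE on the instructor table is reflected for every course that references it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.execute(
    "CREATE TABLE instructor ("
    " instructor_id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " phone TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE course ("
    " course_code TEXT PRIMARY KEY,"
    " course_venue TEXT NOT NULL,"
    " instructor_id INTEGER NOT NULL REFERENCES instructor(instructor_id))"
)
conn.executemany(
    "INSERT INTO instructor VALUES (?, ?, ?)",
    [(1, "Prof. George", "+1 6514821924"), (2, "Prof. Atkins", "+1 6519272918")],
)
conn.executemany(
    "INSERT INTO course VALUES (?, ?, ?)",
    [
        ("CS101", "Lecture Hall 20", 1),
        ("CS152", "Lecture Hall 21", 2),
        ("CS154", "CS Auditorium", 1),
    ],
)

# The number changes in exactly one row (hypothetical new value).
conn.execute(
    "UPDATE instructor SET phone = ? WHERE instructor_id = ?",
    ("+1 6510000000", 1),
)

# Both of Prof. George's courses pick up the new number through the join.
rows = conn.execute(
    "SELECT c.course_code, i.phone FROM course AS c"
    " JOIN instructor AS i ON i.instructor_id = c.instructor_id"
    " ORDER BY c.course_code"
).fetchall()
print(rows)
```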

The above is a simplified example of how database normalization works. We will now study it more formally.

Types of DBMS Normalization

There are various database “Normal” forms. Each normal form adds a rule that removes a particular kind of redundancy, which saves storage and reduces the chance of inconsistent data.

First Normal Form (1NF)

The first normal form simply says that each cell of a table should contain exactly one value. Let us take an example. Suppose we are storing the courses that a particular instructor teaches. We could store it like this:

Instructor's name   Course code
Prof. George        (CS101, CS154)
Prof. Atkins        (CS152)

Here, the issue is that in the first row, we are storing 2 courses against Prof. George. This isn’t the optimal way since that’s not how SQL databases are designed to be used. A better method would be to store the courses separately. For instance:

Instructor's name   Course code
Prof. George        CS101
Prof. George        CS154
Prof. Atkins        CS152

This way, if we want to edit some information related to CS101, we do not have to touch the data corresponding to CS154. Also, observe that each row stores unique information. There is no repetition. This is the First Normal Form.
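As a small illustration of the conversion above, here is a Python sketch that expands multi-valued cells into one row per value. The function name to_first_normal_form and the assumption that courses arrive as a parenthesised, comma-separated string are both illustrative.

```python
def to_first_normal_form(rows):
    """Expand (instructor, "(CS101, CS154)") rows into one row per course."""
    flat = []
    for instructor, courses in rows:
        # Strip the surrounding parentheses, then split on commas.
        for code in courses.strip("()").split(","):
            flat.append((instructor, code.strip()))
    return flat


unnormalized = [
    ("Prof. George", "(CS101, CS154)"),
    ("Prof. Atkins", "(CS152)"),
]
normalized = to_first_normal_form(unnormalized)
for row in normalized:
    print(row)
```

Each output row holds exactly one course code, matching the second table above.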

Second Normal Form (2NF)

For a table to be in second normal form, the following 2 conditions are to be met:

  1. The table should be in the first normal form.
  2. No non-key column should depend on only a part of the primary key. A table whose primary key consists of exactly 1 column satisfies this automatically, so that is the form we aim for here.

The first point is straightforward since we just studied 1NF. Let us understand the second point, which concerns the primary key. A primary key is a set of columns that uniquely identifies a row; no 2 rows have the same primary key value. Let us take an example.

Course code   Course venue      Instructor Name   Instructor’s phone number
CS101         Lecture Hall 20   Prof. George      +1 6514821924
CS152         Lecture Hall 21   Prof. Atkins      +1 6519272918
CS154         CS Auditorium     Prof. George      +1 6514821924

Here, in this table, the course code is unique. So, that becomes our primary key. Let us take another example of storing student enrollment in various courses. Each student may enroll in multiple courses. Similarly, each course may have multiple enrollments. A sample table may look like this (student name and course code):

Student name   Course code
Rahul          CS152
Rajat          CS101
Rahul          CS154
Raman          CS101

Here, the first column is the student name and the second column is the course taken by the student. Clearly, the student name column isn’t unique, as there are 2 entries for the name ‘Rahul’ in row 1 and row 3. Similarly, the course code column is not unique, as there are 2 entries for course code CS101 in row 2 and row 4. However, the tuple (student name, course code) is unique since a student cannot enroll in the same course more than once. So, these 2 columns when combined form the primary key for the table.
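A composite primary key like the one just described can be declared directly in SQL. The sketch below uses Python's built-in sqlite3 module with an assumed table name enrollment; it shows that repeated student names and repeated course codes are accepted, while a repeated (student name, course code) pair is rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE enrollment ("
    " student_name TEXT NOT NULL,"
    " course_code TEXT NOT NULL,"
    " PRIMARY KEY (student_name, course_code))"
)
conn.executemany(
    "INSERT INTO enrollment VALUES (?, ?)",
    [("Rahul", "CS152"), ("Rajat", "CS101"), ("Rahul", "CS154"), ("Raman", "CS101")],
)

# Enrolling Rahul in CS152 a second time violates the composite key.
try:
    conn.execute("INSERT INTO enrollment VALUES (?, ?)", ("Rahul", "CS152"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM enrollment").fetchone()[0]
print(duplicate_rejected, row_count)
```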

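Under the single-column primary key rule of thumb above, our enrollment table does not qualify, because its key spans 2 columns. (Strictly speaking, 2NF is only violated when some non-key column depends on just a part of such a key, for example a student's date of birth stored alongside each enrollment.) To move from 1NF to 2NF with single-column keys, we can break it into 2 tables: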

Students:

Student name   Enrollment number
Rahul          1
Rajat          2
Raman          3

Here the second column is the enrollment number of the student, which is unique for each student. Now, we can attach each of these enrollment numbers to course codes.

Courses:

Course code   Enrollment number
CS101         2
CS101         3
CS152         1
CS154         1

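These 2 tables together provide us with the exact same information as our original table. Note that the Courses table still needs both of its columns to identify a row; what has changed is that each student's details now live in a table keyed by a single column.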
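To check that the split loses nothing, the original (student name, course code) pairs can be rebuilt with a join. This is a sketch with Python's built-in sqlite3 module; the table names student and course_enrollment are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student ("
    " enrollment_number INTEGER PRIMARY KEY,"
    " student_name TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE course_enrollment ("
    " course_code TEXT NOT NULL,"
    " enrollment_number INTEGER NOT NULL REFERENCES student(enrollment_number),"
    " PRIMARY KEY (course_code, enrollment_number))"
)
conn.executemany(
    "INSERT INTO student VALUES (?, ?)",
    [(1, "Rahul"), (2, "Rajat"), (3, "Raman")],
)
conn.executemany(
    "INSERT INTO course_enrollment VALUES (?, ?)",
    [("CS101", 2), ("CS101", 3), ("CS152", 1), ("CS154", 1)],
)

# Join the two tables back into (student name, course code) pairs.
rebuilt = conn.execute(
    "SELECT s.student_name, e.course_code"
    " FROM course_enrollment AS e"
    " JOIN student AS s ON s.enrollment_number = e.enrollment_number"
    " ORDER BY s.student_name, e.course_code"
).fetchall()
print(rebuilt)
```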

Third Normal Form (3NF)

Before we delve into details of third normal form, let us understand the concept of a functional dependency on a table.

Column A is said to be functionally dependent on column B if the value of B determines the value of A, so changing the value of B in a row may require a change in the value of A. As an example, consider the following table:

Course code   Course venue          Instructor's name   Department
MA214         Lecture Hall 18       Prof. George        CS Department
ME112         Auditorium building   Prof. John          Electronics Department

Here, the department column is dependent on the professor name column. This is because if in a particular row, we change the name of the professor, we will also have to change the department value. As an example, suppose MA214 is now taken by Prof. Ronald who happens to be from the Mathematics department, the table will look like this:

Course code   Course venue          Instructor's name   Department
MA214         Lecture Hall 18       Prof. Ronald        Mathematics Department
ME112         Auditorium building   Prof. John          Electronics Department

Here, when we changed the name of the professor, we also had to change the department column. This is not desirable since someone who is updating the database may remember to change the name of the professor, but may forget to update the department value. This can cause inconsistency in the database.
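One way to detect this kind of inconsistency is to test whether a dependency still holds in the stored rows. The Python sketch below is illustrative only: the function name violates_dependency, the dictionary keys, and the extra course MA101 are assumptions added to make the problem visible.

```python
def violates_dependency(rows, determinant, dependent):
    """Return True if two rows agree on `determinant` but differ on `dependent`."""
    seen = {}
    for row in rows:
        key = row[determinant]
        if key in seen and seen[key] != row[dependent]:
            return True
        seen.setdefault(key, row[dependent])
    return False


# MA101 is a hypothetical second course taught by Prof. Ronald.
consistent = [
    {"course": "MA214", "instructor": "Prof. Ronald", "department": "Mathematics Department"},
    {"course": "ME112", "instructor": "Prof. John", "department": "Electronics Department"},
    {"course": "MA101", "instructor": "Prof. Ronald", "department": "Mathematics Department"},
]

# Same data, but the department on MA214 was not updated with the name.
stale = [dict(row) for row in consistent]
stale[0]["department"] = "CS Department"

print(violates_dependency(consistent, "instructor", "department"))
print(violates_dependency(stale, "instructor", "department"))
```

A check like this can only find a violation that is already present in the data; the schema change described next removes the possibility altogether.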

Third normal form avoids this by breaking this into separate tables:

Course code   Course venue          Instructor's ID
MA214         Lecture Hall 18       1
ME112         Auditorium building   2

Here, the third column is the ID of the professor who’s taking the course.

Instructor's ID   Instructor's Name   Department
1                 Prof. Ronald        Mathematics Department
2                 Prof. John          Electronics Department

Here, in the above table, we store the details of the professor against his/her ID. This way, whenever we want to reference the professor somewhere, we don’t have to put the other details of the professor in that table again. We can simply use the ID.
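The effect of this split can be sketched with Python's built-in sqlite3 module. The table and column names are assumptions for illustration. Reassigning a course now means changing a single ID, and the department follows automatically through the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE instructor ("
    " instructor_id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " department TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE course ("
    " course_code TEXT PRIMARY KEY,"
    " course_venue TEXT NOT NULL,"
    " instructor_id INTEGER NOT NULL REFERENCES instructor(instructor_id))"
)
conn.executemany(
    "INSERT INTO instructor VALUES (?, ?, ?)",
    [
        (1, "Prof. Ronald", "Mathematics Department"),
        (2, "Prof. John", "Electronics Department"),
    ],
)
conn.executemany(
    "INSERT INTO course VALUES (?, ?, ?)",
    [("MA214", "Lecture Hall 18", 1), ("ME112", "Auditorium building", 2)],
)

# Reassign ME112 to instructor 1 by changing one value; no department edit needed.
conn.execute("UPDATE course SET instructor_id = 1 WHERE course_code = 'ME112'")

details = conn.execute(
    "SELECT c.course_code, i.name, i.department"
    " FROM course AS c JOIN instructor AS i ON i.instructor_id = c.instructor_id"
    " ORDER BY c.course_code"
).fetchall()
print(details)
```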

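Therefore, in the third normal form, the following conditions are required:

  • The table should be in the second normal form.
  • No non-key column should be functionally dependent on another non-key column (that is, there should be no transitive dependency on the primary key).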

Boyce-Codd Normal Form (BCNF)

Boyce-Codd Normal Form is a slightly stricter version of third normal form. A table is in Boyce-Codd Normal Form if and only if at least one of the following conditions is met for each functional dependency A → B:

  • A is a superkey
  • It is a trivial functional dependency.

Let us first understand what a superkey means. Consider the following example table:

Course code   Course venue      Instructor Name   Instructor’s phone number
CS101         Lecture Hall 20   Prof. George      +1 6514821924
CS152         Lecture Hall 21   Prof. Atkins      +1 6519272918
CS154         CS Auditorium     Prof. George      +1 6514821924

Here, the first column (course code) is unique across various rows. So, it is a superkey. Consider the combination of columns (course code, instructor name). It is also unique across various rows. So, it is also a superkey. A superkey is basically a set of columns such that the value of that set of columns is unique across various rows. That is, no 2 rows have the same set of values for those columns. Some of the superkeys for the table above are:

  • Course code
  • Course code, instructor name
  • Course code, instructor’s phone number

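A superkey from which no column can be removed without losing uniqueness is called a candidate key. For instance, the first superkey above has just 1 column, so nothing can be removed from it. The second one and the last one have 2 columns, and each still contains course code, which is unique on its own. So, of these three, only the first superkey (Course code) is a candidate key.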
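The definitions above can be tried out on the sample rows with a short Python sketch. The names COLUMNS, ROWS, is_superkey and candidate_keys are illustrative. One caveat: keys are a property of the schema, not of a particular set of rows, so a check like this can only show that a column set is not a key. In this 3-row sample the course venue also happens to be unique, so it shows up as a candidate key even though nothing guarantees that two courses will never share a venue.

```python
from itertools import combinations

COLUMNS = ("course_code", "course_venue", "instructor_name", "instructor_phone")
ROWS = [
    ("CS101", "Lecture Hall 20", "Prof. George", "+1 6514821924"),
    ("CS152", "Lecture Hall 21", "Prof. Atkins", "+1 6519272918"),
    ("CS154", "CS Auditorium", "Prof. George", "+1 6514821924"),
]


def is_superkey(columns):
    """True if no two sample rows share the same values for `columns`."""
    idx = [COLUMNS.index(c) for c in columns]
    projected = [tuple(row[i] for i in idx) for row in ROWS]
    return len(set(projected)) == len(ROWS)


def candidate_keys():
    """Superkeys that contain no smaller superkey (judged on the sample rows only)."""
    keys = []
    # Visit smaller column sets first, so any minimal key is found before its supersets.
    for size in range(1, len(COLUMNS) + 1):
        for combo in combinations(COLUMNS, size):
            if is_superkey(combo) and not any(set(k) <= set(combo) for k in keys):
                keys.append(combo)
    return keys


print(is_superkey(("course_code", "instructor_name")))
print(is_superkey(("instructor_name", "instructor_phone")))
print(candidate_keys())
```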

Boyce-Codd Normal Form says that if there is a functional dependency A → B, then either A is a superkey or it is a trivial functional dependency. A trivial functional dependency means that all columns of B are contained in the columns of A. For instance, (course code, instructor name) → (course code) is a trivial functional dependency because when we know the value of course code and instructor name, we already know the value of course code.
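The BCNF rule above can be expressed as a small check over a list of functional dependencies. This Python sketch is illustrative: the function names closure and bcnf_violations are made up, and the dependencies listed for the course table (course code determines venue and instructor name, instructor name determines phone number) are assumptions about the example rather than something the table itself can prove.

```python
def closure(attrs, fds):
    """All columns determined by `attrs` under the given functional dependencies."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def bcnf_violations(all_attrs, fds):
    """Dependencies A -> B that are non-trivial and whose A is not a superkey."""
    bad = []
    for lhs, rhs in fds:
        trivial = set(rhs) <= set(lhs)
        superkey = closure(lhs, fds) == set(all_attrs)
        if not trivial and not superkey:
            bad.append((lhs, rhs))
    return bad


attrs = ("course_code", "course_venue", "instructor_name", "instructor_phone")
# Assumed dependencies for the course table used in this article.
fds = [
    (("course_code",), ("course_venue", "instructor_name")),
    (("instructor_name",), ("instructor_phone",)),
    (("course_code", "instructor_name"), ("course_code",)),  # trivial
]
violations = bcnf_violations(attrs, fds)
print(violations)
```

Under these assumptions the only violation is instructor name → instructor phone, which is the dependency removed by moving instructors into their own table.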

Let us understand what’s going on:

A is a superkey: this means that other columns should depend only on a superkey. Basically, if a set of columns (B) can be determined knowing some other set of columns (A), then A should be a superkey. A superkey determines each row uniquely.

It is a trivial functional dependency: this means that the only non-trivial dependencies allowed are those whose left side is a superkey. For instance, we saw how the professor’s department was dependent on the professor’s name, which is not a superkey of that table. This may create integrity issues since someone may edit the professor’s name without changing the department. This may lead to an inconsistent database.

There are also 2 other normal forms:

Fourth normal form

A table is said to be in fourth normal form if it does not store two or more independent, multivalued facts about the same entity.

Fifth normal form

A table is in fifth Normal Form if:

  • It is in fourth normal form.
  • It cannot be subdivided into any smaller tables without losing some form of information.

Summary

The various forms of database normalization are useful while designing the schema of a database in such a way that there is no data replication which may lead to inconsistencies. While designing the schema for applications, we should always think about how we can make use of these forms.

Aman Goel

Entrepreneur, Coder, Speed-cuber, Blogger, fan of Air crash investigation! Aman Goel is a Computer Science Graduate from IIT Bombay. Fascinated by the world of technology, he went on to build his own start-up, AllinCall Research and Solutions, to build the next generation of Artificial Intelligence, Machine Learning and Natural Language Processing based solutions to power businesses.

Comments
Kwaku

In your BCNF example, why don't you use only instructor_id as an FK, rather than instructor_name and instructor_phone?
Is that not duplication?

Saraa

In your 2NF example, after creating the enrollment numbers, table 1 comes in 2NF, what about table 2? It still contains repeated course ids as well as repeated enrollment numbers.

Sagar Jaybhay

Very very nice explanation

Maryam bibi

Hello!
What is the primary key in the table?
It is so confusing.

Hackr Team

This video might be helpful to you: https://www.youtube.com/watch?v=B5r8CcTUs5Y

Doug Mather

Does database normalization reduce the database size?

Oliver Watson

Normalization removes the duplicate data and helps to keep the data error free. This helps to ensure that the size of the database doesn’t grow large with duplicate data. At the same time, the speed of some types of operations can be slower in a non-normalized form. Normalization increases the efficiency of the database.

Hackr User

Which normal form can remove all the anomalies in DBMS?

Kristi Jackson

Normalization makes a table or relation free from insert/update/delete anomalies and saves the space by releasing the duplicate data. Basically, the 3NF is enough to remove all the anomalies from your database. Higher NFs can reduce the level and will affect maintaining all those tables and reporting with several JOINS.

Olive Yu

Can database normalization reduce number of tables?

Ann Neal

Normalization removes redundant data so sometimes it increases the number of tables.

Jack Graw

What is the alternative to database normalization?

Judy Peterson

There is no direct alternative to normalization; whether you need it depends on your application. If you are working with or designing an OLTP application, having more independent tables gives you the benefit of storing data in a more optimal way. Normalization is not a requirement when the main workload is reading data that would otherwise be spread over many normalized tables.
There are other techniques available like star schema, denormalization etc., but it all depends on your need.

Peg Lee

What is the purpose or need of normalization in database?

Dwayne Hicks

Database normalization is the process of organizing data so as to minimize data redundancy. This is the main purpose of normalization. The basic need for normalization is to prevent anomalies from messing up the data. The reasons why we use data normalization are to minimize duplicate data, to minimize or avoid data modification issues, and to simplify queries.
2 primary advantages of normalization:
  • Easier object-to-data mapping
  • Increased consistency

Wanda Lee

Difference between BCNF and 3NF?

Sandra Bowen

The difference between 3NF and BCNF is subtle.

3NF

Definition: A relation is in 3NF if it is in 2NF and no non-prime attribute transitively depends on the primary key. In other words, a relation R is in 3NF if for each functional dependency X ⟶ A in R at least one of the following conditions is met:

  • X is a key or superkey in R
  • A is a prime attribute in R

Example: Given the following relation:

EMP_DEPT(firstName, employeeNumber, dateOfBirth, address, departmentNumber, departmentName)

An employee can only work in one department and each department has many employees.

The candidate key is employeeNumber.

Consider the following functional dependencies:

  • employeeNumber ⟶ firstName, dateOfBirth, address, departmentNumber
  • departmentNumber ⟶ departmentName

Given the definition above, it is possible to conclude that the relation EMP_DEPT is not in 3NF because the second functional dependency does not meet either of the 2 conditions of 3NF:

  • departmentNumber is not a key or superkey in EMP_DEPT
  • departmentName is not a prime attribute in EMP_DEPT

BCNF

Definition: A relation R is in BCNF if it is in 3NF and for each functional dependency X ⟶ A in R, X is a key or superkey in R. In other words, the only difference between 3NF and BCNF is that BCNF drops the second condition of 3NF. This makes BCNF stricter than 3NF: any relation that is in BCNF is also in 3NF, but not every relation that is in 3NF is in BCNF.

Example: Given the following relation:

STUDENT_COURSE(studentNumber, socialSecurityNumber, courseNumber)

A student can attend many courses and a course can have many students.

The candidate keys are:

  • socialSecurityNumber, courseNumber
  • studentNumber, courseNumber

Consider the following functional dependencies:

  • studentNumber ⟶ socialSecurityNumber
  • socialSecurityNumber ⟶ studentNumber

Given the definition above, it is possible to conclude that STUDENT_COURSE is not in BCNF, since studentNumber on its own is not a key or superkey in STUDENT_COURSE.

Source: https://stackoverflow.com/questions/19749913/what-is-the-difference-between-3nf-and-bcnf