Database Normalization Demystified: How to Improve Data Integrity and Reduce Redundancy
Yugandhara Saste
Data Engineer @ Bristlecone | Azure Data Factory | Databricks | Pyspark | AWS (S3, EMR, Athena) | Python | Snowflake | MYSQL | Airflow
What is Normalization?
Normalization is the process of organizing data in a database to reduce redundancy and improve data integrity. In a normalized database, data is structured across multiple tables, each focusing on a specific topic, and relationships between tables are established to maintain a logical data structure.
The main goals of normalization are to:
Example: E-commerce Order System
Consider an e-commerce platform where order information is stored. Here is a table structure before applying First Normal Form (1NF):
In this table:
After Applying 1NF
To comply with 1NF, we need to break down the Products and Quantities columns into separate rows where each row has a single product and quantity per order.
2. Second Normal Form (2NF)
builds on First Normal Form (1NF) by eliminating partial dependencies in a relational database. For a table to be in 2NF, it must:
Example: E-commerce Orders with 2NF
Consider a case where we store order and product information in one table
In this table:
After Applying 2NF
To achieve 2NF, we split the data into two tables: one for Orders (with customer information) and another for OrderDetails (with product information linked to each order).
Explanation
3. Third Normal Form (3NF) :
Removing transitive dependencies. For a table to be in 3NF, it must:
A transitive dependency occurs when a non-key attribute depends on another non-key attribute rather than directly on the primary key. By removing these dependencies, 3NF ensures that each non-key attribute is directly and only dependent on the primary key, making the data structure more efficient.
Example: E-commerce Platform in 3NF
let’s say we have an Orders table where we store some customer-related information that could cause a transitive dependency.
In this table:
This structure has a transitive dependency because CustomerName and Address depend on CustomerID, which itself depends on the primary key OrderID.
领英推荐
After Applying 3NF
To eliminate the transitive dependency, we can split this table into two separate tables: Orders and Customers.
Explanation
This setup ensures that:
4. Fourth Normal Form (4NF) :
Addresses multi-valued dependencies in a relational database. A table is in 4NF if it meets the following criteria:
Example: E-commerce Platform in 4NF
Suppose we have a table that tracks products and their associated features.
In this table:
After Applying 4NF
To eliminate the multi-valued dependencies, we can split this table into two separate tables: Products and ProductVariants.
Explanation
This restructuring resolves the multi-valued dependency because:
5. Fifth Normal Form (5NF) :
Also known as Project-Join Normal Form (PJNF), is a level of database normalization that deals primarily with the representation of data in such a way that it eliminates redundancy caused by the relationships between multiple entities. A table is in 5NF if:
Example: E-commerce Platform in 5NF
Sells products to different customers, where some products can belong to multiple categories.
In this table:
Here, the ProductID can belong to multiple Category entries, leading to redundancy.
After Applying 5NF
To achieve 5NF, we can break the table into three separate tables: Customers, Products, and Categories.
Explanation
By separating the data into these three tables, we eliminate redundancy and ensure that the information can be reconstructed through joins without any loss of data integrity.
Conclusion :
In conclusion, normalization is essential for creating efficient, reliable, and maintainable databases. While it may require more initial design effort and a deeper understanding of the data relationships, the long-term benefits in terms of data integrity, performance, and scalability make it a critical practice in database management. By implementing normalization, organizations can ensure their data is well-organized, accessible, and accurately reflects the relationships inherent in their operations.
Very informative Thanks for sharing