DATABASE NORMALIZATION

You have probably heard of this topic but may not really know what it entails. Let me take you from novice to pro with this in-depth article. This is part one of a two-part series on handling data normalization in a database: here we cover the overview, and in part two I will actually model our data into the tables we design, from creating dimension tables to fact tables using Common Table Expressions (CTEs), with ERD diagrams to help us understand the relationships between our tables. In part two we will implement all of this normalization using dbt, with Snowflake handling our data. It's a project you don't want to miss.


What is data/database normalization? Data normalization is the process of effectively designing a database so as to avoid data redundancy and anomalies.

LEVELS OF DATA NORMALIZATION.

We have the following levels of data normalization:

  1. First Normal Form (1NF)
  2. Second Normal Form(2NF)
  3. Third Normal Form (3NF)


Normal forms extend up to 6NF, but for most industry use cases normalization extends up to 3NF.


We are going to visualize the steps using a dataset downloaded from Kaggle. If you have similar data in an industry setting, this would be an ideal way of normalizing the database for optimal performance. We will highlight a few columns of our data, using Excel, to fully understand it.

We have the following columns: Row_id, Order_id, Order_date, Ship_date, Ship_mode, Customer_id, Customer_name, Segment, Country, City, State, Postal_code, Region, Product_id, Category, Sub_category, Product_name, Sales.


FIRST AND SECOND NORMAL FORMS (1NF AND 2NF)

These two normal forms give us two rules:

  1. 1NF: every column/attribute must hold only a single, atomic value.
  2. 2NF: all non-key attributes must be fully dependent on a candidate key. A candidate key is any minimal set of columns that uniquely identifies a record. If a non-key column has only a partial dependency on a candidate key, we split the table into separate tables.

To expand on the guideline above: in our table, Order_date is fully dependent on Order_id. On the other hand, City has no (or only a partial) dependency on Product_id, which means those columns must be split into different tables. In this dataset the candidate keys are Order_id, Customer_id and Product_id.
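To make "fully dependent" concrete, here is a minimal sketch (using a few hypothetical rows shaped like the Kaggle columns, not the real dataset) of how you could check whether a non-key column is functionally determined by a candidate key:

```python
def fully_depends(rows, key, attr):
    """Return True if every value of `key` maps to exactly one value of `attr`."""
    seen = {}
    for row in rows:
        k, v = row[key], row[attr]
        if seen.setdefault(k, v) != v:
            return False  # same key value, two different attr values -> not dependent
    return True

# Hypothetical sample rows for illustration only.
rows = [
    {"Order_id": "O1", "Order_date": "2024-01-05", "City": "Nairobi"},
    {"Order_id": "O1", "Order_date": "2024-01-05", "City": "Mombasa"},
    {"Order_id": "O2", "Order_date": "2024-02-10", "City": "Nairobi"},
]

print(fully_depends(rows, "Order_id", "Order_date"))  # True: Order_date depends on Order_id
print(fully_depends(rows, "Order_id", "City"))        # False: City does not
```

Columns that fail this check against every candidate key are the ones we move into their own tables.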

In regard to this, we will split the data in the following way:

Table 1 (orders) will have Order_id, Order_date, Ship_date, Ship_mode.

Table 2 (customers) will have Customer_id, Customer_name, Segment, Country, City, State, Postal_code, Region.

Table 3 (products) will have Product_id, Category, Sub_category, Product_name, Sales.


Every table should have a primary key, and the relationships between tables should be formed using foreign keys.
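Here is a sketch of that split in SQLite, assuming (hypothetically, since the article's exact relationships may differ) that the orders table also carries Customer_id and Product_id as foreign keys so the three tables can be joined back together:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE customers (
    customer_id   TEXT PRIMARY KEY,
    customer_name TEXT, segment TEXT, country TEXT,
    city TEXT, state TEXT, postal_code TEXT, region TEXT
);
CREATE TABLE products (
    product_id TEXT PRIMARY KEY,
    category TEXT, sub_category TEXT, product_name TEXT, sales REAL
);
CREATE TABLE orders (
    order_id    TEXT PRIMARY KEY,
    order_date  TEXT, ship_date TEXT, ship_mode TEXT,
    customer_id TEXT REFERENCES customers(customer_id),
    product_id  TEXT REFERENCES products(product_id)
);
""")

# Insert one row per table, then join everything back together.
conn.execute("INSERT INTO customers (customer_id, customer_name) VALUES ('C1', 'Ada')")
conn.execute("INSERT INTO products (product_id, product_name) VALUES ('P1', 'Chair')")
conn.execute("INSERT INTO orders VALUES ('O1', '2024-01-05', '2024-01-08', 'Standard', 'C1', 'P1')")
row = conn.execute("""
    SELECT c.customer_name, p.product_name
    FROM orders o
    JOIN customers c USING (customer_id)
    JOIN products  p USING (product_id)
""").fetchone()
print(row)  # ('Ada', 'Chair')
```

The foreign keys are what let us reassemble the original wide table on demand without storing anything twice.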

We will visualize all of these below:

orders table
customers table
products table
relationships


Now this is how our database/data warehouse should look. We will implement CTE SQL commands to create these tables using dbt, and we will create an ERD diagram that properly explains our data model and database, making it easier to understand for someone who is not conversant with this field. We will move on to 3NF.
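To preview the CTE pattern, here is a hedged sketch of the kind of query a dbt model runs: deriving a de-duplicated customers dimension from one flat staging table. The column names follow this article; the real dbt project at the linked repository may organize things differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE raw_orders (
    order_id TEXT, customer_id TEXT, customer_name TEXT, segment TEXT)""")
conn.executemany("INSERT INTO raw_orders VALUES (?,?,?,?)", [
    ("O1", "C1", "Ada", "Consumer"),
    ("O2", "C1", "Ada", "Consumer"),   # repeat customer: must be de-duplicated
    ("O3", "C2", "Bob", "Corporate"),
])

# A CTE names an intermediate result, exactly like a dbt staging model.
dim_customers = conn.execute("""
    WITH customers AS (
        SELECT DISTINCT customer_id, customer_name, segment
        FROM raw_orders
    )
    SELECT * FROM customers ORDER BY customer_id
""").fetchall()
print(dim_customers)  # [('C1', 'Ada', 'Consumer'), ('C2', 'Bob', 'Corporate')]
```

In dbt, each CTE like this typically becomes its own model file, which is how the dimension and fact tables get built in part two.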


THIRD NORMAL FORM (3NF)

It follows these rules:

  1. The table must be in 2NF.
  2. Avoid transitive dependencies. Say we have a table T with columns A, B and C, where A depends on B and B depends on C; then A is transitively dependent on C. We should avoid having this in the same table. Looking at the data above, we can see that we do not have any obvious transitive dependencies.
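A small illustration of the transitive-dependency split, using hypothetical columns (employee, department, department location) rather than the Kaggle data: the location depends on the department, which depends on the employee, so 3NF moves departments into their own table.

```python
# Flat table with a transitive dependency:
# emp -> dept -> dept_location, so dept_location is repeated per employee.
employees_flat = [
    {"emp": "e1", "dept": "sales", "dept_location": "Nairobi"},
    {"emp": "e2", "dept": "sales", "dept_location": "Nairobi"},
    {"emp": "e3", "dept": "ops",   "dept_location": "Mombasa"},
]

# 3NF split: each fact is stored exactly once.
employees = [{"emp": r["emp"], "dept": r["dept"]} for r in employees_flat]
departments = {r["dept"]: r["dept_location"] for r in employees_flat}

print(employees)    # dept_location no longer repeated on every employee row
print(departments)  # {'sales': 'Nairobi', 'ops': 'Mombasa'}
```

After the split, changing a department's location touches one entry instead of every employee row.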

DIFFERENT ANOMALIES THAT WE OVERCOME

  1. INSERTION ANOMALY — If a customer makes several orders, the unnormalized table repeats the customer information on every row even though the order columns change. With a separate customers table, that information is stored once.
  2. DELETION ANOMALY — If an order is made and then cancelled, we only delete the row from the orders table; the customer and product information is preserved.
  3. UPDATE ANOMALY — If we update the price of an item, we update a single row in the products table rather than updating all the rows that reference that item.
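A quick sketch of the update anomaly being avoided (with made-up rows): because the product name lives only in the products table, renaming it touches one row, yet every order immediately sees the change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_id TEXT PRIMARY KEY, product_name TEXT);
CREATE TABLE orders (order_id TEXT PRIMARY KEY, product_id TEXT);
INSERT INTO products VALUES ('P1', 'Desk Chair');
INSERT INTO orders VALUES ('O1', 'P1'), ('O2', 'P1'), ('O3', 'P1');
""")

# One UPDATE on the products table, even though three orders reference P1.
cur = conn.execute("UPDATE products SET product_name = 'Office Chair' WHERE product_id = 'P1'")
print(cur.rowcount)  # 1 row updated, not 3

names = conn.execute("""
    SELECT DISTINCT p.product_name
    FROM orders o JOIN products p USING (product_id)
""").fetchall()
print(names)  # [('Office Chair',)]
```

In the unnormalized wide table, the same rename would have to touch all three order rows, and missing one would leave the data inconsistent.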

ADVANTAGES OF NORMALIZING A DATABASE

  1. Reduces data redundancy and inconsistency.
  2. Makes writing to the database faster.
  3. Speeds up updates, inserts and deletes.
  4. Maintains data integrity.
  5. Enables flexible querying through joins.
  6. Optimized data storage saves space.


While some people may no longer use traditional databases as often, having shifted to data warehousing with cloud providers such as BigQuery, Amazon Redshift and Snowflake, in these data warehouses we then move on to data modelling. Data modelling is done using dbt, and in part two of this project I will guide you through dbt, show how we can create data models, and later create visualizations in Power BI. All in all, database and data normalization is a must-have skill for proficient data handling. The dbt project has already been completed at this link: https://github.com/stilinsk/dbt_zero_to_hero

