DATABASE NORMALIZATION
Simon Ngugi
DATA ENGINEER||ANALYTICS ENGINEER |||Transforming data to business value
This is a topic that you have probably heard of it and maybe you really dont know what it entails. well let me turn from a novice to a pro with this indepth article.This will be part one of how to handle data normalization in a database we will just be doing the overview of it and in part two i will actually be modelling our data to the tables that we are going to create in the following project ,from creating dimension tables to fact tables using Common Table Expressions CTE ‘s and with the help of ERD daigrams we will understand relationships between our tables ,and so part two of this project we will actully be implementing all this data normalization using DBT and we will be using snowflake to handle our data .Its a project you dont want to miss.
What is data /data base normalization? Data normalization is the process of effectively designing a database effectively to avoid data redundancy,and to avoid anomalies.
LEVELS OF DATA NORMALIZATION.
We have levels of data normalisation namely
We have upto 6 NF but for industries’ normal normalization it extends upto 3NF
We are going to visualize the steps that we will be undertaking using a dataset that we have downloaded from kaggle .Thus if we have an indentical data in an industry ,this would be an ideal way of normalizing the database for optimum perfomance.We will be highlighting some few columns of our data to fully understand our data.We will be using Excel.
we have the foloowing columns Row_id,Order_id,Order_date,Ship_date,Ship_mode,Customer_id,Customer_name,Segment,Country,City,State,Postal_code,Region,Product_id,Category,Sub_category,Product_name,Sales.
FIRST NORMAL FORM (1NF)
We have two rules that the 1NF follows
I can expound on the above guideline in the above table we have the order id and the order date means that the order_date column is fully dependent on the Order_id column.we also have the product id and the city column,Means there is no/partial dependency on the columns meaning we have to split in to different columns .Thus in the below table the candidate keys includes ; Orderid,Customer_id and the product_id.
In regards to this we will split the data in the following ways ;
Table 1 will have order_id,Order_date,Ship_date,Ship_mode
Table 2 will have Customer_id,Customer_name,Segment, Country,City,State,Postal_code,Region
Table 3 will have Product_id,Category,Sub_category,Product_name,Sales.
领英推荐
Every table should have a primary key and the relationship between tables should be formed using foreign keys.
We will be visualizing all these below
Now this is how our database /datawarehouse should look like.We will be implementing CTE sql commands to create these tables using DBT .We will create an ERD diagram that expalains properly our data model and database.This is done for easier understanding to someone who is not conversant with tis field. We will move on to 3NF
THIRD NORMAL FORM
It follows the following rules
DIFFERENT ANOMALIES THAT WE OVERCOME
ADVANTAGES OF NORMALIZATION OF A DATABASE.
While some people may be not be using databases as often ,they have shifted to data warehousing in various cloud providers such as big query, Amazon redshift and snowflake .In this datawarehouses we now move to data modelling.Data modelling is done using dbt and in part two of this project i will be guiding on dbt and how we can create data models and later create visulizations in power bi.all in all data base and data normalization is a must tool for proefficeint data handling. The dbt project has already been done at this link https://github.com/stilinsk/dbt_zero_to_hero