DATABASE NORMALIZATION

You have probably heard of this topic but may not really know what it entails. Let me take you from novice to pro with this in-depth article. This is part one of a two-part series on handling data normalization in a database: here we cover the overview, and in part two I will actually model our data into the tables we design, from creating dimension tables to fact tables using Common Table Expressions (CTEs), with ERD diagrams to help us understand the relationships between our tables. In part two we will implement all of this normalization using dbt, with Snowflake handling our data. It's a project you don't want to miss.


What is data/database normalization? Data normalization is the process of effectively designing a database so as to avoid data redundancy and anomalies.

LEVELS OF DATA NORMALIZATION.

We have the following levels of data normalization:

  1. First Normal Form (1NF)
  2. Second Normal Form(2NF)
  3. Third Normal Form (3NF)


Normal forms extend up to 6NF, but for most industry use cases normalization extends up to 3NF.


We are going to visualize the steps using a dataset downloaded from Kaggle. If you have similar data in an industry setting, this would be an ideal way of normalizing the database for optimal performance. We will highlight a few columns of our data, using Excel, to fully understand it.

We have the following columns: Row_id, Order_id, Order_date, Ship_date, Ship_mode, Customer_id, Customer_name, Segment, Country, City, State, Postal_code, Region, Product_id, Category, Sub_category, Product_name, Sales.


FIRST AND SECOND NORMAL FORMS (1NF AND 2NF)

These two normal forms give us two rules:

  1. 1NF: every column/attribute must hold only a single, atomic value.
  2. 2NF: all non-key attributes must be fully dependent on a candidate key. A candidate key is any minimal set of columns that uniquely identifies a record. If a non-key column has only a partial dependency on a candidate key, we split the table into separate tables.

To expand on the guideline above: in our table, Order_date is fully dependent on Order_id. On the other hand, City has no (or only a partial) dependency on Product_id, which means those columns must be split into different tables. In this dataset the candidate keys are Order_id, Customer_id and Product_id.
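To make "fully dependent" concrete, here is a minimal sketch (using a few hypothetical rows shaped like the Kaggle columns, not the real dataset) of how you could check whether a non-key column is functionally determined by a candidate key:

```python
def fully_depends(rows, key, attr):
    """Return True if every value of `key` maps to exactly one value of `attr`."""
    seen = {}
    for row in rows:
        k, v = row[key], row[attr]
        if seen.setdefault(k, v) != v:
            return False  # same key value, two different attr values -> not dependent
    return True

# Hypothetical sample rows for illustration only.
rows = [
    {"Order_id": "O1", "Order_date": "2024-01-05", "City": "Nairobi"},
    {"Order_id": "O1", "Order_date": "2024-01-05", "City": "Mombasa"},
    {"Order_id": "O2", "Order_date": "2024-02-10", "City": "Nairobi"},
]

print(fully_depends(rows, "Order_id", "Order_date"))  # True: Order_date depends on Order_id
print(fully_depends(rows, "Order_id", "City"))        # False: City does not
```

Columns that fail this check against every candidate key are the ones we move into their own tables.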

In regard to this, we will split the data in the following way:

Table 1 (orders) will have Order_id, Order_date, Ship_date, Ship_mode.

Table 2 (customers) will have Customer_id, Customer_name, Segment, Country, City, State, Postal_code, Region.

Table 3 (products) will have Product_id, Category, Sub_category, Product_name, Sales.


Every table should have a primary key, and the relationships between tables should be formed using foreign keys.
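Here is a sketch of that split in SQLite, assuming (hypothetically, since the article's exact relationships may differ) that the orders table also carries Customer_id and Product_id as foreign keys so the three tables can be joined back together:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE customers (
    customer_id   TEXT PRIMARY KEY,
    customer_name TEXT, segment TEXT, country TEXT,
    city TEXT, state TEXT, postal_code TEXT, region TEXT
);
CREATE TABLE products (
    product_id TEXT PRIMARY KEY,
    category TEXT, sub_category TEXT, product_name TEXT, sales REAL
);
CREATE TABLE orders (
    order_id    TEXT PRIMARY KEY,
    order_date  TEXT, ship_date TEXT, ship_mode TEXT,
    customer_id TEXT REFERENCES customers(customer_id),
    product_id  TEXT REFERENCES products(product_id)
);
""")

# Insert one row per table, then join everything back together.
conn.execute("INSERT INTO customers (customer_id, customer_name) VALUES ('C1', 'Ada')")
conn.execute("INSERT INTO products (product_id, product_name) VALUES ('P1', 'Chair')")
conn.execute("INSERT INTO orders VALUES ('O1', '2024-01-05', '2024-01-08', 'Standard', 'C1', 'P1')")
row = conn.execute("""
    SELECT c.customer_name, p.product_name
    FROM orders o
    JOIN customers c USING (customer_id)
    JOIN products  p USING (product_id)
""").fetchone()
print(row)  # ('Ada', 'Chair')
```

The foreign keys are what let us reassemble the original wide table on demand without storing anything twice.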

We will visualize all of these below:

orders table
customers table
products table
relationships


Now this is how our database/data warehouse should look. We will implement CTE SQL commands to create these tables using dbt, and we will create an ERD diagram that properly explains our data model and database, making it easier to understand for someone who is not conversant with this field. We will move on to 3NF.
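To preview the CTE pattern, here is a hedged sketch of the kind of query a dbt model runs: deriving a de-duplicated customers dimension from one flat staging table. The column names follow this article; the real dbt project at the linked repository may organize things differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE raw_orders (
    order_id TEXT, customer_id TEXT, customer_name TEXT, segment TEXT)""")
conn.executemany("INSERT INTO raw_orders VALUES (?,?,?,?)", [
    ("O1", "C1", "Ada", "Consumer"),
    ("O2", "C1", "Ada", "Consumer"),   # repeat customer: must be de-duplicated
    ("O3", "C2", "Bob", "Corporate"),
])

# A CTE names an intermediate result, exactly like a dbt staging model.
dim_customers = conn.execute("""
    WITH customers AS (
        SELECT DISTINCT customer_id, customer_name, segment
        FROM raw_orders
    )
    SELECT * FROM customers ORDER BY customer_id
""").fetchall()
print(dim_customers)  # [('C1', 'Ada', 'Consumer'), ('C2', 'Bob', 'Corporate')]
```

In dbt, each CTE like this typically becomes its own model file, which is how the dimension and fact tables get built in part two.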


THIRD NORMAL FORM (3NF)

It follows these rules:

  1. The table must be in 2NF.
  2. Avoid transitive dependencies. Say we have a table T with columns A, B and C, where A depends on B and B depends on C; then A is transitively dependent on C. We should avoid having this in the same table. Looking at the data above, we can see that we do not have any obvious transitive dependencies.
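A small illustration of the transitive-dependency split, using hypothetical columns (employee, department, department location) rather than the Kaggle data: the location depends on the department, which depends on the employee, so 3NF moves departments into their own table.

```python
# Flat table with a transitive dependency:
# emp -> dept -> dept_location, so dept_location is repeated per employee.
employees_flat = [
    {"emp": "e1", "dept": "sales", "dept_location": "Nairobi"},
    {"emp": "e2", "dept": "sales", "dept_location": "Nairobi"},
    {"emp": "e3", "dept": "ops",   "dept_location": "Mombasa"},
]

# 3NF split: each fact is stored exactly once.
employees = [{"emp": r["emp"], "dept": r["dept"]} for r in employees_flat]
departments = {r["dept"]: r["dept_location"] for r in employees_flat}

print(employees)    # dept_location no longer repeated on every employee row
print(departments)  # {'sales': 'Nairobi', 'ops': 'Mombasa'}
```

After the split, changing a department's location touches one entry instead of every employee row.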

DIFFERENT ANOMALIES THAT WE OVERCOME

  1. INSERTION ANOMALY — If a customer makes several orders, the unnormalized table repeats the customer information on every row even though the order columns change. With a separate customers table, that information is stored once.
  2. DELETION ANOMALY — If an order is made and then cancelled, we only delete the row from the orders table; the customer and product information is preserved.
  3. UPDATE ANOMALY — If we update the price of an item, we update a single row in the products table rather than updating all the rows that reference that item.
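A quick sketch of the update anomaly being avoided (with made-up rows): because the product name lives only in the products table, renaming it touches one row, yet every order immediately sees the change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_id TEXT PRIMARY KEY, product_name TEXT);
CREATE TABLE orders (order_id TEXT PRIMARY KEY, product_id TEXT);
INSERT INTO products VALUES ('P1', 'Desk Chair');
INSERT INTO orders VALUES ('O1', 'P1'), ('O2', 'P1'), ('O3', 'P1');
""")

# One UPDATE on the products table, even though three orders reference P1.
cur = conn.execute("UPDATE products SET product_name = 'Office Chair' WHERE product_id = 'P1'")
print(cur.rowcount)  # 1 row updated, not 3

names = conn.execute("""
    SELECT DISTINCT p.product_name
    FROM orders o JOIN products p USING (product_id)
""").fetchall()
print(names)  # [('Office Chair',)]
```

In the unnormalized wide table, the same rename would have to touch all three order rows, and missing one would leave the data inconsistent.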

ADVANTAGES OF NORMALIZING A DATABASE

  1. Reduces data redundancy and inconsistency.
  2. Makes writing to the database faster.
  3. Speeds up updates, inserts and deletes.
  4. Maintains data integrity.
  5. Enables flexible querying through joins.
  6. Optimized data storage saves space.


While some people may no longer use traditional databases as often, having shifted to data warehousing with cloud providers such as BigQuery, Amazon Redshift and Snowflake, in these data warehouses we then move on to data modelling. Data modelling is done using dbt, and in part two of this project I will guide you through dbt, show how we can create data models, and later create visualizations in Power BI. All in all, database and data normalization is a must-have skill for proficient data handling. The dbt project has already been completed at this link: https://github.com/stilinsk/dbt_zero_to_hero

