Two Million Rows + Two Data Analysis Tool (2021 Iowa Liquor Sales)
Large dataset transformation using Excel and Power BI (2,622,711 records)
|| PART 1 of a series ||
Part one of a series of posts will include this case study. This post’s main focus is utilizing MS Excel and Power BI to extract, transform, and clean a huge dataset from?2021 Iowa Liquor Sales.
While Power BI (free desktop version) can manage datasets up to 1GB in size, MS Excel can only handle 1 million rows of data.
In this case study, I'll attempt to transform a dataset of more than 2 million sales records using Excel and Power BI.
You may access the data source at?https://data.iowa.gov/Sales-Distribution/2021-Iowa-Liquor-Sales/cc6f-sgik/data
Using the iowa.gov website’s filter tool, a preliminary search produced a dataset with 2,622,711 records and 24 columns (from January 1, 2021 to December 31, 2021).
The full datasets will be cleaned using a batch-based methodology. Here are the actions:
- Extract data in CSV format from the source (total of 12 sets, equivalent to 12 months for the year 2021).
- Stage 1 data cleaning in Excel (retain only relevant columns, format the date column)
- Stage 2 data cleaning in Power BI (transform the table for one CSV file only)
- Bulk cleaning in Power BI that is automated for the remaining sets of CSV files (build an automated function from step #3).
Extract Data From Source
To extract the datasets, use the filter feature on the data.iowa.gov website. The filter settings are displayed here for the entire month of January 2021 (from 01/01/2021 to 01/31/2021).
Save the CSV in Excel format. Follow the same steps for the remaining months, a year’s worth of data needs to be analyzed (total of 12 datasets).
Rename the CSV files downloaded. The sets of data that were taken from the source and their matching filenames are listed below.
File "2021.01.cvs" refers to the month of January 2021. Rest of the file refers to succeeding months, in chronological order. The last file "2021.12.cvs" refers to the month of December 2021.
Stage 1 Cleaning in Excel
Open the 2021.01.csv file in excel. The file contains an initial 24 columns.
The 24 columns’ names are listed below. On the CSV file, the highlighted (yellow) column names will be kept, while the others will be removed.
Here is the result file after the columns were removed.
The file’s date column is not correctly recognized by Excel. Excel, for instance, interprets the cell containing “01/04/2021†as?April 4, 2021.
Select all of the dates in the Date Column to fix this. Select “Text to Columns†from the Data tab.
Following a number of steps in the “Text to Columnsâ€, Excel will display a Columns Wizard. Select “Date†and “MDY†as the data formats to ensure proper date formatting.
MDY will format the cell "01/04/2021" as "January 4, 2021".
When the preceding step is completed, Excel will recognize the cells as having the right date format.
Repeat these steps for the remaining file sets. Save the files after editing each of the 12 sets of CSV files before moving on to the next step.
Stage 2 Cleaning in Power BI
Upload the first file (2021.01.csv) in Power BI.
Note: The Power Query Editor plane within Power BI will be used to carry out the following activities.
First, promote the first row as header for the table.
Next, for some identified columns in the table, carry out the following actions: (a) Capitalize Each Word; (b) Trim; and (c) Clean.
The?City,?County,?Category Name, and?Item Description?columns in the table should be converted using the aforementioned procedures.
The results of the preceding Power BI tasks will be this.
Next, duplicate the Date Column.
Make two duplicates of the Date column, as seen below. These columns will be useful in the Data Visualization stage?(the Data Visualization will be discuss on another part of this series).
Next, rename the two new columns as “Month Name†and “Monthâ€, respectively.
Extract the month name and month number for the above columns in the Power Query plane.
To get the month name, go to Transform tab, click Date>Month>Name of Month.
领英推è
To get the corresponding month number, click Date>Month>Month
Once completed, below is the output.
Next, go to Home tab, click the Choose Columns and tick those columns as shown below:
Create an Automated Function in Power Query
Below is a list of the tasks from the previous section that were completed on the table. In Power BI’s Power Query plane, they are referred to as “APPLIED STEPSâ€.
To automate the previous section’s steps, duplicate the “Sales†table in Power Query.
Go to Advanced Editor of the “Sales (2)†table, edit the M language of this table.
M Code of “Sales (2)†table.
M or "mash-up" is the language use in Power Query Editor.
Modify the first two lines of the code.
Existing Code:
let
Source = Csv.Document(File.Contents(“C:\Users\PC\Desktop\datasets\Set03\2021 Sales\2021.01.csvâ€),
Modified Code:
(month as text) =>
let
Source = Csv.Document(File.Contents(“C:\Users\PC\Desktop\datasets\Set03\2021 Sales\â€& month &â€.csvâ€),
Power BI will generate an automated function (fx) in response to the modifications made to the code in the Advanced Editor. The function will perform the “APPLIED STEPS†once a parameter?“month�is type in the text box as show below. The?month?is the file name of source of data i.e?2021.01,?2021.02?till?2021.12
This custom function will pull the data inside an specified folder for a cluster of CSV files found inside that folder.
Folder path: C:\Users\PC\Desktop\datasets\Set03\2021 Sales\
To process the remaining batch of CSV files, create a new table in Power BI. In the Power Query Editor plane, go to Home tab, click “Enter Dataâ€
Once the Create Table appears, type the names of CSV files as shown below.
A new Table will appear in the Query section:
To invoke the custom function, go to Add Column tab and click “Invoke Custom Functionâ€
Power BI will start automating the tasks of data transformation without requiring the user to manually open and edit each of the 12 CVS files. To populate the Table, click the button on the newly added column “Sales (2)â€
The expanded table is shown as below:
Click both the “Date†and “Sale (Dollars)†column, go to Transform tab and click “Detect Data Typeâ€.
These two columns will be altered in Power BI with the appropriate Data Type.
Date -> as Date and Sale (Dollars) -> as Decimal Number
Rename the?Table?as?Sales 2021.
Parting Words
At this point, using Excel and Power BI, we had successfully transformed a sizable dataset (2,622,711 records).
Excel is used for batches of approximately “acceptable†file size during the first step of data cleaning. Referring to the 2,622,711 records and divided by 12 files equals to 218,600 records for each file. Simple tasks like date format adjustment and column deletion can be handled by Excel.
Power BI has stepped up the pace in the second stage of data cleaning by changing a sample file and then allowing us to design an automated procedure to finish the cleaning.
In the second stage of data cleaning, Power BI has pickup the pace by transforming a sample file and then allowing us to create an automate function to complete the chuck of the cleaning.
By doing a first stage modification in Excel, we reduce the “processing power†in Power Query instead of providing the raw CSV file directly to Power BI. The good news is that Power BI can still gather updated data even if changes are made in Excel afterwards. Power BI can retrieve the updates from the source CSV file by simply pressing the “Refresh Preview†button in the Power Query Editor plane.
We will examine the profile of the data we clean in the following series.