Two Million Rows + Two Data Analysis Tool (2021 Iowa Liquor Sales)
image created in Canva

Two Million Rows + Two Data Analysis Tool (2021 Iowa Liquor Sales)

Large dataset transformation using Excel and Power BI (2,622,711 records)

|| PART 1 of a series ||

Part one of a series of posts will include this case study. This post’s main focus is utilizing MS Excel and Power BI to extract, transform, and clean a huge dataset from?2021 Iowa Liquor Sales.

While Power BI (free desktop version) can manage datasets up to 1GB in size, MS Excel can only handle 1 million rows of data.

In this case study, I'll attempt to transform a dataset of more than 2 million sales records using Excel and Power BI.
        

You may access the data source at?https://data.iowa.gov/Sales-Distribution/2021-Iowa-Liquor-Sales/cc6f-sgik/data

Using the iowa.gov website’s filter tool, a preliminary search produced a dataset with 2,622,711 records and 24 columns (from January 1, 2021 to December 31, 2021).

No alt text provided for this image

The full datasets will be cleaned using a batch-based methodology. Here are the actions:

  1. Extract data in CSV format from the source (total of 12 sets, equivalent to 12 months for the year 2021).
  2. Stage 1 data cleaning in Excel (retain only relevant columns, format the date column)
  3. Stage 2 data cleaning in Power BI (transform the table for one CSV file only)
  4. Bulk cleaning in Power BI that is automated for the remaining sets of CSV files (build an automated function from step #3).

Extract Data From Source

To extract the datasets, use the filter feature on the data.iowa.gov website. The filter settings are displayed here for the entire month of January 2021 (from 01/01/2021 to 01/31/2021).

Save the CSV in Excel format. Follow the same steps for the remaining months, a year’s worth of data needs to be analyzed (total of 12 datasets).

No alt text provided for this image

Rename the CSV files downloaded. The sets of data that were taken from the source and their matching filenames are listed below.

File "2021.01.cvs" refers to the month of January 2021. Rest of the file refers to succeeding months, in chronological order. The last file "2021.12.cvs" refers to the month of December 2021.        
No alt text provided for this image

Stage 1 Cleaning in Excel

Open the 2021.01.csv file in excel. The file contains an initial 24 columns.

No alt text provided for this image

The 24 columns’ names are listed below. On the CSV file, the highlighted (yellow) column names will be kept, while the others will be removed.

No alt text provided for this image

Here is the result file after the columns were removed.

No alt text provided for this image

The file’s date column is not correctly recognized by Excel. Excel, for instance, interprets the cell containing “01/04/2021” as?April 4, 2021.

No alt text provided for this image

Select all of the dates in the Date Column to fix this. Select “Text to Columns” from the Data tab.

No alt text provided for this image

Following a number of steps in the “Text to Columns”, Excel will display a Columns Wizard. Select “Date” and “MDY” as the data formats to ensure proper date formatting.

MDY will format the cell "01/04/2021" as "January 4, 2021".        
No alt text provided for this image

When the preceding step is completed, Excel will recognize the cells as having the right date format.

No alt text provided for this image

Repeat these steps for the remaining file sets. Save the files after editing each of the 12 sets of CSV files before moving on to the next step.

Stage 2 Cleaning in Power BI

Upload the first file (2021.01.csv) in Power BI.

Note: The Power Query Editor plane within Power BI will be used to carry out the following activities.

No alt text provided for this image

First, promote the first row as header for the table.

No alt text provided for this image

Next, for some identified columns in the table, carry out the following actions: (a) Capitalize Each Word; (b) Trim; and (c) Clean.

No alt text provided for this image

The?City,?County,?Category Name, and?Item Description?columns in the table should be converted using the aforementioned procedures.

No alt text provided for this image

The results of the preceding Power BI tasks will be this.

No alt text provided for this image

Next, duplicate the Date Column.

No alt text provided for this image

Make two duplicates of the Date column, as seen below. These columns will be useful in the Data Visualization stage?(the Data Visualization will be discuss on another part of this series).

No alt text provided for this image

Next, rename the two new columns as “Month Name” and “Month”, respectively.

No alt text provided for this image

Extract the month name and month number for the above columns in the Power Query plane.

No alt text provided for this image

To get the month name, go to Transform tab, click Date>Month>Name of Month.

No alt text provided for this image

To get the corresponding month number, click Date>Month>Month

No alt text provided for this image

Once completed, below is the output.

No alt text provided for this image

Next, go to Home tab, click the Choose Columns and tick those columns as shown below:

No alt text provided for this image
No alt text provided for this image

Create an Automated Function in Power Query

Below is a list of the tasks from the previous section that were completed on the table. In Power BI’s Power Query plane, they are referred to as “APPLIED STEPS”.

No alt text provided for this image

To automate the previous section’s steps, duplicate the “Sales” table in Power Query.

No alt text provided for this image

Go to Advanced Editor of the “Sales (2)” table, edit the M language of this table.

No alt text provided for this image

M Code of “Sales (2)” table.

M or "mash-up" is the language use in Power Query Editor.        
No alt text provided for this image

Modify the first two lines of the code.

Existing Code:

let

Source = Csv.Document(File.Contents(“C:\Users\PC\Desktop\datasets\Set03\2021 Sales\2021.01.csv”),        

Modified Code:

(month as text) =>

let
Source = Csv.Document(File.Contents(“C:\Users\PC\Desktop\datasets\Set03\2021 Sales\”& month &”.csv”),        
No alt text provided for this image

Power BI will generate an automated function (fx) in response to the modifications made to the code in the Advanced Editor. The function will perform the “APPLIED STEPS” once a parameter?“month”?is type in the text box as show below. The?month?is the file name of source of data i.e?2021.01,?2021.02?till?2021.12

This custom function will pull the data inside an specified folder for a cluster of CSV files found inside that folder.

Folder path: C:\Users\PC\Desktop\datasets\Set03\2021 Sales\        
No alt text provided for this image

To process the remaining batch of CSV files, create a new table in Power BI. In the Power Query Editor plane, go to Home tab, click “Enter Data”

No alt text provided for this image

Once the Create Table appears, type the names of CSV files as shown below.

No alt text provided for this image

A new Table will appear in the Query section:

No alt text provided for this image

To invoke the custom function, go to Add Column tab and click “Invoke Custom Function”

No alt text provided for this image
No alt text provided for this image

Power BI will start automating the tasks of data transformation without requiring the user to manually open and edit each of the 12 CVS files. To populate the Table, click the button on the newly added column “Sales (2)”

No alt text provided for this image

The expanded table is shown as below:

No alt text provided for this image

Click both the “Date” and “Sale (Dollars)” column, go to Transform tab and click “Detect Data Type”.

No alt text provided for this image
No alt text provided for this image

These two columns will be altered in Power BI with the appropriate Data Type.

Date -> as Date and Sale (Dollars) -> as Decimal Number        
No alt text provided for this image

Rename the?Table?as?Sales 2021.

No alt text provided for this image


Parting Words

At this point, using Excel and Power BI, we had successfully transformed a sizable dataset (2,622,711 records).

Excel is used for batches of approximately “acceptable” file size during the first step of data cleaning. Referring to the 2,622,711 records and divided by 12 files equals to 218,600 records for each file. Simple tasks like date format adjustment and column deletion can be handled by Excel.

Power BI has stepped up the pace in the second stage of data cleaning by changing a sample file and then allowing us to design an automated procedure to finish the cleaning.

In the second stage of data cleaning, Power BI has pickup the pace by transforming a sample file and then allowing us to create an automate function to complete the chuck of the cleaning.

By doing a first stage modification in Excel, we reduce the “processing power” in Power Query instead of providing the raw CSV file directly to Power BI. The good news is that Power BI can still gather updated data even if changes are made in Excel afterwards. Power BI can retrieve the updates from the source CSV file by simply pressing the “Refresh Preview” button in the Power Query Editor plane.

We will examine the profile of the data we clean in the following series.

要查看或添加评论,请登录

Allan Miranda的更多文章

社区洞察

其他会员也浏览了