Cleaning, Preparing, And Analyzing IMDb Box Office Data in Power BI

Cleaning, Preparing, And Analyzing IMDb Box Office Data in Power BI

Before analyzing data and creating dashboards, the important steps of preparing and cleaning data must be performed. In Power BI, we can use Power Query to not only clean our data but also transform the data and combine tables or create aggregations.

In this project, we are given a goal to analyze some box office data, but the data is messy so it first needs to be cleaned and prepared.

The Scenario

A movie company wants to get quick insights into the difference between the total amount generated across different zones for 2019 and 2020.

The Business Goal

As a data analyst at IMDbPro, we have been asked to provide quick insights in the form of a table to compare the total amount made in the worldwide, domestic, and foreign zones. Use the Boxoffice 2019 and 2020 data that is already imported from the web. These data sets are messy so we will need to spend time tidying up the data. Then we can make recommendations based on the result of the generated table.

The Data

The two datasets we have come from the web. They are the 2019 and 2020 box office IMDB data that we can import into Power BI using the "Web" Source.

2019 data:

https://www.boxofficemojo.com/year/world/2019/

2020 data:

https://www.boxofficemojo.com/year/world/2020/

The Process

In order to achieve the desired results, we need to:

  • Clean the two data sets
  • Rename column names
  • Replace errors, detect and change data types
  • Drop unnecessary columns
  • Append the two data sets
  • Perform aggregation on the data using group by on the transform tab (Group by Year)

Step 1:

Start by importing the 2019 dataset into Power BI:

No alt text provided for this image

Click Connect and then we see the tables pop up like this:

No alt text provided for this image

Table 2 is the one we want so we will click on that and then OK. Rename the table to "BoxOffice-2019."


Now we will import the 2020 data the same way. Rename that table "BoxOffice-2020."

No alt text provided for this image
Note: This opens up the tables in Power Query, which is where we will do all of our data cleaning and prep.

We can see that both tables have 7 columns and 200 rows (this is important to know for future steps):

No alt text provided for this image

Step 2:

On the BoxOffice 2019 table, rename Release Group as "Movie". Rename any other columns that have a space (best practice is to have no spaces in column names, as it makes analysis and calculations easier).

Step 3:

We want to select the "Worldwide," "Domestic," and "Foreign" column and then "Detect Data Type" from the Transform tab:

No alt text provided for this image
No alt text provided for this image

?We can see the Foreign column has some errors:

No alt text provided for this image

Select just the Foreign column. Replace the errors with 0:

No alt text provided for this image

Now the errors have been fixed:

No alt text provided for this image

Step 4:

Let's delete the %_1 and % columns, as they are not relevant for our analyses:

No alt text provided for this image

Step 5:

We notice that there are $ signs in the Domestic column. We don't want them in our cells (we just want the amounts):

No alt text provided for this image

So we will create a new column without the dollar signs. We can do this by clicking on the Domestic column and then the "Column From Examples" tab -> "From Selection."

No alt text provided for this image

Then we can type in just the number in the first cell and click Enter. We can see that the rest of the cells are auto-filled.

No alt text provided for this image

When we click OK, we get this new column called "Text After Delimiter":

No alt text provided for this image

We can delete the column labeled "Domestic" since we no longer need it, and rename this new column above as "Domestic." We also want to change the datatype to "Fixed decimal number":

No alt text provided for this image

Now our 2019 dataset is clean and ready for analysis.

Step 6:

Repeat the above steps for the 2020 data, renaming any column names that have spaces, fixing datatypes, and creating new columns that only contain the amounts (no currency labels).

Note that in the 2020 data, both the Domestic and Foreign tabs have $ signs in front of the amounts, so we can repeat the new column process for both columns:

No alt text provided for this image

After we create our new columns, change the datatypes, and delete the original columns, we end up with these:

No alt text provided for this image

Step 7:

We want our two tables to have the same column order, so let's move "Worldwide" so that it is beside Domestic and Foreign in both tables:

No alt text provided for this image

Step 8:

Next, we want to amend the tables. Why Append and not Merge?

No alt text provided for this image

A few notes about the differences between merging and appending in Power BI:

  • Append works like Union All in SQL
  • Joins deal with columns while Append deals with rows.
  • That is, Append makes datasets with the same structure LONGER. Joins makes datasets WIDER (need not have the same structure but must have a related column to merge the join on).

So that is why it was important that we checked earlier that the tables both had the same number of rows, and that is also why we moved the columns in both tables so that they line up. In order to Append, tables need to have the same number of rows, and the same datatypes in each row.

We will click on the 2019 table and then Home -> Combine -> Append Queries as New:

No alt text provided for this image

Now we can append that to the 2020 table as follows:

No alt text provided for this image

Our resulting table is added to the Power Query editor, which we can rename to "BoxOfficeData". The resulting appended table now has 6 columns, and 400 rows.

No alt text provided for this image

Step 9:

Since our task is to compare the total amount made in the worldwide, domestic, and foreign zones, we need to perform some aggregation on our new table.

Go to Transform -> Group By and select Year as the grouping, and create three Sum columns:

No alt text provided for this image

Since we grouped by year, we end up with 2 rows in our table, which is what we want (so that we can compare 2019 and 2020 data):

No alt text provided for this image
No alt text provided for this image

Analysis and Insights

Looking at the numbers, we can see that the 2019 values are higher. This could be due to COVID-19 impacting the box office 2020 values, as businesses were in lockdown during that time.

We can also see that the Foreign total is larger than the domestic total. That is means that more people in foreign countries went to the movies during those times (again, could be due to different regulations regarding business closures in other countries during the pandemic).

One recommendation based on these insights is for movie theaters to invest more in domestic markets to get sales up, perhaps by enticing customers with discounted movie tickets on certain days of the week, or rewarding regular customers with deals so that they will return.

Final Thoughts

This data cleaning and preparation is complete, and we were able to gain some cursory insights from our data. This project was to show how we can get our data ready in Power BI by using Power Query transformations to create a dataset that we can use to perform some quick analysis. For future analysis, we can take a deeper dive into the data by creating some visualizations and seeing what the different trends are.


Thank you for reading!

I hope you enjoyed this project write-up.

If you would like to see more posts and articles, please follow or connect with me on LinkedIn.

Happy learnings!

Succinct and cleanly analysed. Love the report and the final recommendation. Sarah Rajani, CTP

Trevor Maxwell

Technical Business Analyst | Data Nerd | (SQL : Python : Tableau : PowerBI)

1 年

Nice job! I enjoy looking at movie data.

The instructions are very clear and easy to understand for anyone new to data analysis, especially tools such as power query and power bi. Also use of screenshot make the instructions much clearer. Good job Sarah Rajani, CTP ??????

Maheshkumar Badmera

Data Analyst | Data aficionado

1 年

It's amazing Sarah Rajani, CTP

Timothy Zajac

Data Analyst | Excel | SQL | Tableau | Top Secret Security Clearance | Army Veteran

1 年

Great article! I'm working on a project analyzing NASCAR race result data by season and I think the Append feature that you described here is the way to solve that problem.

要查看或添加评论,请登录

社区洞察

其他会员也浏览了