Cleaning, Preparing, And Analyzing IMDb Box Office Data in Power BI
Sarah Rajani, CTP
'Data with Sarah' -- Data Analyst || Proficient in SQL, Excel, Tableau, Power BI, Looker, and Looker Studio || Turning Complex Data into Actionable Insights
Before analyzing data and creating dashboards, the important steps of preparing and cleaning data must be performed. In Power BI, we can use Power Query to not only clean our data but also transform the data and combine tables or create aggregations.
In this project, we are given a goal to analyze some box office data, but the data is messy so it first needs to be cleaned and prepared.
The Scenario
A movie company wants to get quick insights into the difference between the total amount generated across different zones for 2019 and 2020.
The Business Goal
As a data analyst at IMDbPro, we have been asked to provide quick insights in the form of a table to compare the total amount made in the worldwide, domestic, and foreign zones. Use the Boxoffice 2019 and 2020 data that is already imported from the web. These data sets are messy so we will need to spend time tidying up the data. Then we can make recommendations based on the result of the generated table.
The Data
The two datasets we have come from the web. They are the 2019 and 2020 box office IMDB data that we can import into Power BI using the "Web" Source.
2019 data:
https://www.boxofficemojo.com/year/world/2019/
2020 data:
https://www.boxofficemojo.com/year/world/2020/
The Process
In order to achieve the desired results, we need to:
Step 1:
Start by importing the 2019 dataset into Power BI:
Click Connect and then we see the tables pop up like this:
Table 2 is the one we want so we will click on that and then OK. Rename the table to "BoxOffice-2019."
Now we will import the 2020 data the same way. Rename that table "BoxOffice-2020."
Note: This opens up the tables in Power Query, which is where we will do all of our data cleaning and prep.
We can see that both tables have 7 columns and 200 rows (this is important to know for future steps):
Step 2:
On the BoxOffice 2019 table, rename Release Group as "Movie". Rename any other columns that have a space (best practice is to have no spaces in column names, as it makes analysis and calculations easier).
Step 3:
We want to select the "Worldwide," "Domestic," and "Foreign" column and then "Detect Data Type" from the Transform tab:
?We can see the Foreign column has some errors:
Select just the Foreign column. Replace the errors with 0:
Now the errors have been fixed:
Step 4:
Let's delete the %_1 and % columns, as they are not relevant for our analyses:
Step 5:
We notice that there are $ signs in the Domestic column. We don't want them in our cells (we just want the amounts):
So we will create a new column without the dollar signs. We can do this by clicking on the Domestic column and then the "Column From Examples" tab -> "From Selection."
领英推荐
Then we can type in just the number in the first cell and click Enter. We can see that the rest of the cells are auto-filled.
When we click OK, we get this new column called "Text After Delimiter":
We can delete the column labeled "Domestic" since we no longer need it, and rename this new column above as "Domestic." We also want to change the datatype to "Fixed decimal number":
Now our 2019 dataset is clean and ready for analysis.
Step 6:
Repeat the above steps for the 2020 data, renaming any column names that have spaces, fixing datatypes, and creating new columns that only contain the amounts (no currency labels).
Note that in the 2020 data, both the Domestic and Foreign tabs have $ signs in front of the amounts, so we can repeat the new column process for both columns:
After we create our new columns, change the datatypes, and delete the original columns, we end up with these:
Step 7:
We want our two tables to have the same column order, so let's move "Worldwide" so that it is beside Domestic and Foreign in both tables:
Step 8:
Next, we want to amend the tables. Why Append and not Merge?
A few notes about the differences between merging and appending in Power BI:
So that is why it was important that we checked earlier that the tables both had the same number of rows, and that is also why we moved the columns in both tables so that they line up. In order to Append, tables need to have the same number of rows, and the same datatypes in each row.
We will click on the 2019 table and then Home -> Combine -> Append Queries as New:
Now we can append that to the 2020 table as follows:
Our resulting table is added to the Power Query editor, which we can rename to "BoxOfficeData". The resulting appended table now has 6 columns, and 400 rows.
Step 9:
Since our task is to compare the total amount made in the worldwide, domestic, and foreign zones, we need to perform some aggregation on our new table.
Go to Transform -> Group By and select Year as the grouping, and create three Sum columns:
Since we grouped by year, we end up with 2 rows in our table, which is what we want (so that we can compare 2019 and 2020 data):
Analysis and Insights
Looking at the numbers, we can see that the 2019 values are higher. This could be due to COVID-19 impacting the box office 2020 values, as businesses were in lockdown during that time.
We can also see that the Foreign total is larger than the domestic total. That is means that more people in foreign countries went to the movies during those times (again, could be due to different regulations regarding business closures in other countries during the pandemic).
One recommendation based on these insights is for movie theaters to invest more in domestic markets to get sales up, perhaps by enticing customers with discounted movie tickets on certain days of the week, or rewarding regular customers with deals so that they will return.
Final Thoughts
This data cleaning and preparation is complete, and we were able to gain some cursory insights from our data. This project was to show how we can get our data ready in Power BI by using Power Query transformations to create a dataset that we can use to perform some quick analysis. For future analysis, we can take a deeper dive into the data by creating some visualizations and seeing what the different trends are.
Thank you for reading!
I hope you enjoyed this project write-up.
If you would like to see more posts and articles, please follow or connect with me on LinkedIn.
Happy learnings!
Technical Writing
1 年Succinct and cleanly analysed. Love the report and the final recommendation. Sarah Rajani, CTP
Technical Business Analyst | Data Nerd | (SQL : Python : Tableau : PowerBI)
1 年Nice job! I enjoy looking at movie data.
The instructions are very clear and easy to understand for anyone new to data analysis, especially tools such as power query and power bi. Also use of screenshot make the instructions much clearer. Good job Sarah Rajani, CTP ??????
Data Analyst | Data aficionado
1 年It's amazing Sarah Rajani, CTP
Data Analyst | Excel | SQL | Tableau | Top Secret Security Clearance | Army Veteran
1 年Great article! I'm working on a project analyzing NASCAR race result data by season and I think the Append feature that you described here is the way to solve that problem.