Combine Files With Different Column Names in Power Query
I'm going to look at situations that appear to be fairly common: loading data into Power Query where the source data has columns with different names.
For example, in the following file received in August 2021, I have data that relates to various items over a 12 month period.
Each month I get a new file showing values for that month, the next 6 months and the preceding 5.
So for Sep 21 the file looks like this
It contains data for the months Apr 21 to Mar 22. It still contains 12 months of data, but it's moved along by 1 month compared to the previous month's file.
Clearly each file has the same number of columns and whilst some of those columns have the same names, the order of those columns changes every month.
If I load these files into PQ and combine them let's see what I get.
In Power BI Desktop, I load data from a folder: Get data -> More -> Folder
After browsing to the folder containing the CSV files, and selecting it, I choose to?Combine & Transform?the files.
In the data preview I accept the defaults and click OK.
The result of combining the files is this
It's not exactly what I want. For a start the columns contain the Month-Year names and these aren't consistent down the columns because for each new file, the months shift along by one.
What I want is a column for the item and a column for the months (dates) and a column for the numeric values. How to achieve this?
The first thing I'll do is duplicate this query so I can play around with it a little and investigate how to solve this problem.
With this done, I'm going to back up a bit and delete the last 2 steps
This leaves me at the point in the query where I have the 3 CSV files as Tables in one of?my?columns.
Clicking on the first table (Aug 21) shows the data in that file
Promoting the first row to column headers is the thing to do here
Which does this
To get all the months/dates into a column and all the values into another column, I can select the Item column and then?Unpivot Other Columns
This gives me the data from this file in a format that I can use so I need to apply these transformations to all the files.
Back in?my?main query where I left it with just the files as tables in a column
Add a Custom Column
Using the?Table.PromoteHeaders?function I can manipulate each of the tables inside the?Transform File?column using this code
Clicking in the cell beside the August-21 table shows me that the first row has been promoted to headers, and it is the same for all 3 tables.
Next I need to unpivot the dates/values. Add another Custom Column and enter this code
Note that I am referring to the?Custom?column as the location for the tables, and that I am naming the first column created by the unpivot?Date
I get another column containing tables with the dates and values unpivoted into their own columns
Almost done. Next delete the first 3 columns leaving me with just the?Custom.1?column.
I can expand the tables in this column now. Uncheck?Use original column name as prefix
Giving me this
There's one more thing to do. Remember that each month I get data for another month, so I'm going to end up with a lot of duplicate values for months that are contained in several files.
To see this, first change the Date column to the Date type
Next sort the?Item?column in ascending order, then sort the?Date?column in ascending order.
It's plain to see now that there are duplicate values
领英推荐
To fix this, click on the?Item?column then click on the?Date?column, then right click on either column and choose?Remove Duplicates.
Different Column Names - Same Data
?this scenario I'm receiving data about clients and the values represent the amount I'm charging for some work done, on the dates shown.
In the first CSV file, for January, you can see there are 3 columns, the first one named Client.
In the February file there are also 3 columns but the first is named Customer.
If I load these files with Power Query (using the same process as in the first example) I get this.
I get nulls in the first column because in the Feb CSV file that column is called?Customer, and in Jan it's called?Client. The Feb file has been loaded first so the first column takes its name from that file. As the Jan file does not have a?Customer?column, Power Query enters nulls here, and doesn't know what to do with the data in Jan's?Client?column so omits that entirely.
I have 4 solutions to this problem for you.
Solution 1
Let's back up a bit and get rid of these steps.
I want to be left with the source CSV files as tables in a column.
I want to demote the column headers to become the first row in the tables. To achieve this, add a Custom Column with this code using the?Table.DemoteHeaders?function.
The table(s) being acted upon are in the?Transform File (2)?column.
With the headers demoted in these tables I can skip the first row in each using the?Table.Skip?function. Add a Custom Column with this code
With the first rows removed each table is left with just the data. I can now delete the first 3 columns from the query and expand the?Custom.1?column
All that is left is to rename the columns, set column data types and (if you wish) sort by date.
Solution 2
Duplicate the query from Solution 1
Delete all the steps after?Added Custom?which demotes the table headers.
Delete the first 2 columns from the query and expand the tables in the?Custom?column.
Filter column 3 to remove the text "Date"
Now you can rename the columns, set data types and sort as desired.
Solution 3
Duplicate the query from Solution 1, delete all the steps after?Added Custom?and delete the first 2 columns in the query to leave a single column?Custom.
Right click on the?Column 3?header and remove the errors
Promote headers : Transform tab -> Use First Row as Headers
I can now filter out the header row from the table by filtering any of the columns to remove the appropriate text value. In this case I'll filter the?Date?column to remove the text "Date"
Hopefully this will help you deal with situations where you also face importing columns with names that don't match up.
Let me know if you have any other scenarios with different column names causing you issues.