Mimicking Python Dictionaries in Power Query M: how, why, and is it better than a join/merge?
Note: This is a technical post. A more friendly version for those not as familiar with Power Query will follow shortly.
ETL or data munging is what Power Query is all about, but limiting ourselves to the UI functions shortchanges M's capabilities and the efficiency of our ETL processes. The "Functions in Focus: Power Query/M Language" series is set to introduce functions you'd only see using the "Advanced Editor" and writing the script yourself.
Don't panic! I'll include plenty of use cases, coding, and "what-ifs" to help along the way.
In today's article, we are going to focus on using Record.FieldOrDefault() in place of Table.NestedJoin() for many-to-one or one-to-one or one-to-many replacements. By replacements, I mean converting values like "US" and "USA" to "United States" or converting cumbersome fact table descriptors (like City or State) to integer dimension table keys.
Why are records special? And you mentioned Python...
I know, I know. I promised some amazing conversion of Python dictionaries in Power Query...and I lied, because it already exists. Power Query's record data type is nearly identical to Python's dictionary type. For R lovers (count me in), this is similar to a named vector. Records hold key:value pairs with unique keys and non-unique values. Going back to our "United States" example, a record would look something like:
[ US = "United States", USA = "United States" ]
Lookups into record or dictionary types are exceptionally fast and provide some error-checking benefits we will cover later. It's also a simple function call to convert a mapping table (let's say from a csv) to a record. You'll need to rename the key column "Name" and the value column "Values", then pass it to Record.FromTable(). That's it!
Ok, so why use it?
There are a few answers to that question:
- Record lookups are significantly faster than joins and merges.
- They are easier to write and understand than long if-statements.
- Record.FieldOrDefault() allows you to specify a default value, reducing the code and calculations to if-not-in-then statements often done with a series of Table.AddColumn().
- They throw an error if your key is not unique.
Faster than Joins and Merges
More than anything, using records is significantly faster for lookups than using a join or merge. In fact, in a test run of 500 iterations using Record.FieldOrDefault() versus Table.NestedJoin() (and expanding out the column) in series, Record.FieldOrDefault() was 4.5x faster on average with a third of the variability over each run.
That's a significant improvement!
Simplify Code
While you have to convert your mapping table to records, the code for using the mapping in another query is significantly simpler than using a join/merge. To add a column, you'd use 1 line instead of 2 lines. To implement if-not-in-then logic you'd use 1 line instead of 5 lines. The code sets below will generate the exact same table
Below is the M code to load the base data. I've used (...) in cases to shorten the code blocks. Let me know in the comments if there is anything missing or confusing.
let Source = Csv.Document( File.Contents("data.csv"), [ Delimiter=",", Columns=20, Encoding=1252, QuoteStyle=QuoteStyle.None ] ), #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars=true]), #"Changed Type" = Table.TransformColumnTypes(#"Promoted Headers", {...}) in #"Changed Type"
Table.NestedJoin() Code
The mapping table query, called #"TblMapping", simply loads in a mapping file. This file has a column of many (e.g. "US" and "USA") and a column of ones (e.g. "United States") that will replace 7 different text values.
let Source = Csv.Document( File.Contents("mapping.csv"), [ Delimiter=",", Columns=2, Encoding=1252, QuoteStyle=QuoteStyle.None ] ), #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars=true]), #"Changed Type" = Table.TransformColumnTypes( #"Promoted Headers", {...} ) in #"Changed Type"
Now that we have our source data and a mapping file, we can go ahead and join them. Remember that Table.NestedJoin() will give us a nested column of tables and has to be expanded.
let Source = Table.NestedJoin( SourceData, //103 MB csv ~ 0.5 million rows {"RatecodeID"}, //merge column TblMapping, //a short table for mapping ~7 rows {"RatecodeID"}, //merge column "TblMapping", //name of column in the table JoinKind.LeftOuter //outer join ), #"Expanded TblMapping" = Table.ExpandTableColumn( //column has to be expanded out Source, "TblMapping", {"RatecodeDescription"}, {"RatecodeDescription"} ) in#"Expanded TblMapping"
Record.FieldOrDefault() Code
Converting our mapping query to a records-based map is only a rename and conversion. The query below is named #"RcdMapping".
let Source = #"TblMapping" #"Renamed Columns" = Table.RenameColumns( #"Changed Type", { {"RatecodeID", "Name"}, {"RatecodeDescription", "Value"} } ), RecordVals = Record.FromTable(#"Renamed Columns") in RecordVals
With #"RcdMapping", we can generate the same table that we did with the join/merge, but with a Table.AddColumn() instead.
let Source = Table.AddColumn( SourceData, "RatecodeDescription", each Record.FieldOrDefault( #"RcdMapping", Text.From([RatecodeID]), null ), Text.Type ) in Source
The Record.FieldOrDefault() code simplifies the join to a simple Table.AddColumn() at the expense of some additional code in the mapping table/records. This trade off, however, is still significantly faster and makes the overall code more legible.
But wait...there's more!
I've often had a series of values that need replaced BUT the list doesn't have ALL of the values in the main table...so we'll get some nulls after the join/merge. Invariably, we end up having to generate a bunch of columns with if-statements to get where we want to go.
Using some generic code and Record.FieldOrDefault() we can replace this:
let Source = SomeData, MergeColumns = Table.NestedJoin( SomeData, {"Original Value"}, //e.g. US, USA, etc MappingTable, {"Original Value"}, "Nested Column", JoinKind.Outer ), ExpandColumn = Table.ExpandTableColumn( ..., {"Matched Value"}, //e.g. United States {"Matched Value"} ), LogicColumn = //Uses the original value if there there is no match. Table.AddColumn( ExpandColumn, "Fixed Value", each if [Matched Value] = null then [Original Value] else [Matched Value] ), RemoveOriginal = Table.RemoveColumns( LogicColumn, {"Original Value"} ), RenameColumn = Table.RenameColumns( RemoveOriginal, {{"Fixed Value", "Original Value"}} in RenameColumn
With this:
let Source = SomeData, TransformedOriginal = Table.TransformColumns( Source, { { "Original Value", each Record.FieldOrDefault( MappingRecords, //the lookup table [Original Value], //for the lookup [Original Value] //if it comes back null ) } } ) in TransformedOriginal
The downside here is that you have to physically write the code in "Advanced Editor" rather than use the UI, but you get a much faster and more understandable code block.
Error Handling
For many of us, error handling isn't something (about which) we regularly think (about). Record.FieldOrDefault() helps us raise or handle errors in 2 ways:
- The last argument of Record.FieldOrDefault() allows us to enter a flag value (e.g. "PROBLEM HERE") or raise an error with Error.Record() if there isn't a match when there should be one.
- Since the "name" column (or key side of the record) must be unique, the mapping query will throw an error if we have duplicates. This is better than a Table.Distinct() which make the code work but leaves us searching through code and data files trying to find a problem if one arises.
Special Note: One-to-Many Substitutions
So far, I've only dealt with many-to-one replacements (and it's one-to-one extension). Record.FieldOrDefault() can also be used for one-to-many replacements with some additional logic in the record-based queries.
#"RcdMapping" needs to group the "many" into a list associated with a "one".
let Source = #"TblMapping" #"Renamed Columns" = Table.RenameColumns( #"Changed Type", { {"RatecodeID", "Name"}, {"RatecodeDescription", "Value"} } ), GroupedVals = Table.Group( #"Renamed Columns", {"Name"}, { {"Value", each [Value], List.Type} //this will return a list } ), RecordVals = Record.FromTable(GroupedVals) in RecordVals
And our final query needs to account for bringing in lists and expanding the list column:
let Source = Table.AddColumn( SourceData, "RatecodeDescription", each Record.FieldOrDefault( #"RcdMapping", Text.From([RatecodeID]), {} //change to empty list ), List.Type //changed to a List.Type ), ExpandList = Table.ExpandListColumn( in Source
Is it still faster? Absolutely! In 500 runs, Record.FieldOrDefault() ran 6 times faster with only 5% of the variation* compared to the join/merge method. *The asterisk is because there seem to be a significant number of outliers that are pushing out the variability. Eyeballing the chart, the variation is at least better than the 1/3 we saw on the many-to-one test.
Conclusion
For those familiar with Python dictionaries or R named vectors, this process is old-hat. Power Query being Power Query though, it's always a question if axioms from other languages will hold. In this case, it holds quite well (and way faster).
On a final note, records can hold many different data types and Record.FieldOrDefault() can operate like the SWITCH() function in DAX. With some ingenuity, we could even invoke different functions based on a column's values without having to write complicated if-statements.
This is my first article for LinkedIn and I hope you enjoyed it. Let me know in the comments what you'd like to see changes and what you'd like to see in future articles. I can also post up a series for DAX functions and we can explore use cases or "why does M/DAX do this?" situations together.
Farewell and happy coding! Jay S.
Important Links
Photo/Image Credits
Photo of Billy Mays: goon meet (Photo of Billy Mays) by Sharese Ann Frederick from Purchase, N.Y. retrieved 5/13/2019 from https://commons.wikimedia.org/wiki/File:Billy_Mays_Portrait_Cropped.jpg
About the Author
I'm a long-time programming hobbyist starting with BASIC and logic gates. Today, I work mostly in DAX, M, R, and Python building data visualizations at Discovery Communications. I'm happy to write on just about anything and love to collaborate, so if there is something you want to see or work on together, let me know.