登录查看更多内容

Mimicking Python Dictionaries in Power Query M: how, why, and is it better than a join/merge?

Jay Sumners

Manager, Data & Analytics

发布日期: 2019年5月13日

Note: This is a technical post. A more friendly version for those not as familiar with Power Query will follow shortly.

ETL or data munging is what Power Query is all about, but limiting ourselves to the UI functions shortchanges M's capabilities and the efficiency of our ETL processes. The "Functions in Focus: Power Query/M Language" series is set to introduce functions you'd only see using the "Advanced Editor" and writing the script yourself.

Don't panic! I'll include plenty of use cases, coding, and "what-ifs" to help along the way.

In today's article, we are going to focus on using Record.FieldOrDefault() in place of Table.NestedJoin() for many-to-one or one-to-one or one-to-many replacements. By replacements, I mean converting values like "US" and "USA" to "United States" or converting cumbersome fact table descriptors (like City or State) to integer dimension table keys.

Why are records special? And you mentioned Python...

I know, I know. I promised some amazing conversion of Python dictionaries in Power Query...and I lied, because it already exists. Power Query's record data type is nearly identical to Python's dictionary type. For R lovers (count me in), this is similar to a named vector. Records hold key:value pairs with unique keys and non-unique values. Going back to our "United States" example, a record would look something like:

[
    US = "United States",
    USA = "United States"  
]

Lookups into record or dictionary types are exceptionally fast and provide some error-checking benefits we will cover later. It's also a simple function call to convert a mapping table (let's say from a csv) to a record. You'll need to rename the key column "Name" and the value column "Values", then pass it to Record.FromTable(). That's it!

Ok, so why use it?

There are a few answers to that question:

Record lookups are significantly faster than joins and merges.
They are easier to write and understand than long if-statements.
Record.FieldOrDefault() allows you to specify a default value, reducing the code and calculations to if-not-in-then statements often done with a series of Table.AddColumn().
They throw an error if your key is not unique.

Faster than Joins and Merges

More than anything, using records is significantly faster for lookups than using a join or merge. In fact, in a test run of 500 iterations using Record.FieldOrDefault() versus Table.NestedJoin() (and expanding out the column) in series, Record.FieldOrDefault() was 4.5x faster on average with a third of the variability over each run.

That's a significant improvement!

Simplify Code

While you have to convert your mapping table to records, the code for using the mapping in another query is significantly simpler than using a join/merge. To add a column, you'd use 1 line instead of 2 lines. To implement if-not-in-then logic you'd use 1 line instead of 5 lines. The code sets below will generate the exact same table

Below is the M code to load the base data. I've used (...) in cases to shorten the code blocks. Let me know in the comments if there is anything missing or confusing.

let
    Source = 
        Csv.Document(
            File.Contents("data.csv"),
            [
                Delimiter=",", 
                Columns=20, 
                Encoding=1252, 
                QuoteStyle=QuoteStyle.None
            ]
         ),

    #"Promoted Headers" = 
        Table.PromoteHeaders(Source, [PromoteAllScalars=true]),

    #"Changed Type" = 
        Table.TransformColumnTypes(#"Promoted Headers", {...})

in
    #"Changed Type"

Table.NestedJoin() Code

The mapping table query, called #"TblMapping", simply loads in a mapping file. This file has a column of many (e.g. "US" and "USA") and a column of ones (e.g. "United States") that will replace 7 different text values.

let
    Source = 
        Csv.Document(
            File.Contents("mapping.csv"),
            [
                Delimiter=",", 
                Columns=2, 
                Encoding=1252, 
                QuoteStyle=QuoteStyle.None
            ]
        ),

    #"Promoted Headers" = 
        Table.PromoteHeaders(Source, [PromoteAllScalars=true]),

    #"Changed Type" = 
        Table.TransformColumnTypes(
            #"Promoted Headers",
            {...}
        )

in
    #"Changed Type"

Now that we have our source data and a mapping file, we can go ahead and join them. Remember that Table.NestedJoin() will give us a nested column of tables and has to be expanded.

let
    Source = 
        Table.NestedJoin(
            SourceData,        //103 MB csv ~ 0.5 million rows
            {"RatecodeID"},    //merge column
            TblMapping,        //a short table for mapping ~7 rows
            {"RatecodeID"},    //merge column
            "TblMapping",      //name of column in the table
            JoinKind.LeftOuter //outer join
        ),
 
    #"Expanded TblMapping" =      
        Table.ExpandTableColumn( //column has to be expanded out
            Source, 
            "TblMapping", 
            {"RatecodeDescription"}, 
            {"RatecodeDescription"}
        )

in#"Expanded TblMapping"

Record.FieldOrDefault() Code

Converting our mapping query to a records-based map is only a rename and conversion. The query below is named #"RcdMapping".

let
    Source = #"TblMapping"

    #"Renamed Columns" = 
      Table.RenameColumns(
          #"Changed Type",
          {
              {"RatecodeID", "Name"}, 
              {"RatecodeDescription", "Value"}
          }
       ),

    RecordVals = Record.FromTable(#"Renamed Columns")

in
    RecordVals

With #"RcdMapping", we can generate the same table that we did with the join/merge, but with a Table.AddColumn() instead.

let
    Source =
        Table.AddColumn(
            SourceData,           
            "RatecodeDescription", 
            each 
                Record.FieldOrDefault(
                    #"RcdMapping",       
                    Text.From([RatecodeID]), 
                    null
                ), 
            Text.Type
        )

in
    Source

The Record.FieldOrDefault() code simplifies the join to a simple Table.AddColumn() at the expense of some additional code in the mapping table/records. This trade off, however, is still significantly faster and makes the overall code more legible.

But wait...there's more!

I've often had a series of values that need replaced BUT the list doesn't have ALL of the values in the main table...so we'll get some nulls after the join/merge. Invariably, we end up having to generate a bunch of columns with if-statements to get where we want to go.

Using some generic code and Record.FieldOrDefault() we can replace this:

let
    Source = SomeData,

    MergeColumns =
        Table.NestedJoin(
            SomeData,
            {"Original Value"},  //e.g. US, USA, etc
            MappingTable,
            {"Original Value"},
            "Nested Column",
            JoinKind.Outer
        ),

    ExpandColumn =
        Table.ExpandTableColumn(
            ...,
            {"Matched Value"},   //e.g. United States
            {"Matched Value"}
        ),

    LogicColumn = 
        //Uses the original value if there there is no match.
        Table.AddColumn(
            ExpandColumn,
            "Fixed Value",
            each
                if [Matched Value] = null
                then [Original Value]
                else [Matched Value]
         ),

    RemoveOriginal =
        Table.RemoveColumns(
            LogicColumn,
            {"Original Value"}
        ),

    RenameColumn =
        Table.RenameColumns(
            RemoveOriginal,
            {{"Fixed Value", "Original Value"}}

in
    RenameColumn

With this:

let
    Source = SomeData,

    TransformedOriginal =
        Table.TransformColumns(
            Source,
            {
                {
                    "Original Value",
                    each
                        Record.FieldOrDefault(
                            MappingRecords,     //the lookup table
                            [Original Value],   //for the lookup
                            [Original Value]    //if it comes back null
                        )
                }
            }
        )

in
    TransformedOriginal

The downside here is that you have to physically write the code in "Advanced Editor" rather than use the UI, but you get a much faster and more understandable code block.

Error Handling

For many of us, error handling isn't something (about which) we regularly think (about). Record.FieldOrDefault() helps us raise or handle errors in 2 ways:

The last argument of Record.FieldOrDefault() allows us to enter a flag value (e.g. "PROBLEM HERE") or raise an error with Error.Record() if there isn't a match when there should be one.
Since the "name" column (or key side of the record) must be unique, the mapping query will throw an error if we have duplicates. This is better than a Table.Distinct() which make the code work but leaves us searching through code and data files trying to find a problem if one arises.

Special Note: One-to-Many Substitutions

So far, I've only dealt with many-to-one replacements (and it's one-to-one extension). Record.FieldOrDefault() can also be used for one-to-many replacements with some additional logic in the record-based queries.

#"RcdMapping" needs to group the "many" into a list associated with a "one".

let
    Source = #"TblMapping"

    #"Renamed Columns" = 
      Table.RenameColumns(
          #"Changed Type",
          {
              {"RatecodeID", "Name"}, 
              {"RatecodeDescription", "Value"}
          }
       ),

    GroupedVals =
        Table.Group(
            #"Renamed Columns",
            {"Name"},
            {
                {"Value", each [Value], List.Type} //this will return a list
            }
         ),

    RecordVals = Record.FromTable(GroupedVals)

in
    RecordVals

And our final query needs to account for bringing in lists and expanding the list column:

let
    Source =
        Table.AddColumn(
            SourceData,           
            "RatecodeDescription", 
            each 
                Record.FieldOrDefault(
                    #"RcdMapping",       
                    Text.From([RatecodeID]), 
                    {}                        //change to empty list
                ), 
            List.Type                         //changed to a List.Type
        ),

    ExpandList =
        Table.ExpandListColumn(
            

in
    Source

Is it still faster? Absolutely! In 500 runs, Record.FieldOrDefault() ran 6 times faster with only 5% of the variation* compared to the join/merge method. *The asterisk is because there seem to be a significant number of outliers that are pushing out the variability. Eyeballing the chart, the variation is at least better than the 1/3 we saw on the many-to-one test.

Conclusion

For those familiar with Python dictionaries or R named vectors, this process is old-hat. Power Query being Power Query though, it's always a question if axioms from other languages will hold. In this case, it holds quite well (and way faster).

On a final note, records can hold many different data types and Record.FieldOrDefault() can operate like the SWITCH() function in DAX. With some ingenuity, we could even invoke different functions based on a column's values without having to write complicated if-statements.

This is my first article for LinkedIn and I hope you enjoyed it. Let me know in the comments what you'd like to see changes and what you'd like to see in future articles. I can also post up a series for DAX functions and we can explore use cases or "why does M/DAX do this?" situations together.

Farewell and happy coding! Jay S.

Important Links

Photo/Image Credits

Photo of Billy Mays: goon meet (Photo of Billy Mays) by Sharese Ann Frederick from Purchase, N.Y. retrieved 5/13/2019 from https://commons.wikimedia.org/wiki/File:Billy_Mays_Portrait_Cropped.jpg

About the Author

I'm a long-time programming hobbyist starting with BASIC and logic gates. Today, I work mostly in DAX, M, R, and Python building data visualizations at Discovery Communications. I'm happy to write on just about anything and love to collaborate, so if there is something you want to see or work on together, let me know.

要查看或添加评论，请登录

Jay Sumners的更多文章

Calculating Contribution to Percent Change

2020年4月9日

Calculating Contribution to Percent Change

When measuring a KPI over time, it can be helpful to see how much the constituent parts of the KPI contribute to the…

21 条评论
Debugging DAX in Power BI

2019年10月19日

Debugging DAX in Power BI

DAX expressions rely heavily on context and model relationships which can make debugging somewhat difficult. Below are…

1 条评论
Replacing Expensive If-Statements in DAX

2019年10月13日

Replacing Expensive If-Statements in DAX

If-statements are easy-to-understand constructions that mimic natural language (moreso in some languages than others)…

5 条评论
Understanding RANKX in DAX Measures

2019年6月2日

Understanding RANKX in DAX Measures

Good June everyone! Last week was a bit rough and I didn't get out a post, but this one was recommended by one of my…
Mimicking Python Dictionaries in Power Query M: Multi-Key Joins

2019年5月19日

Mimicking Python Dictionaries in Power Query M: Multi-Key Joins

Note 1: This is the second post on this topic. The first post has some background information, so if you haven't read…

See all articles

Mimicking Python Dictionaries in Power Query M: how, why, and is it better than a join/merge?

Jay Sumners

Manager, Data & Analytics

Why are records special? And you mentioned Python...

Ok, so why use it?

Faster than Joins and Merges

Simplify Code

Error Handling

Special Note: One-to-Many Substitutions

Conclusion

Important Links

Photo/Image Credits

About the Author

Jay Sumners的更多文章

社区洞察

其他会员也浏览了

Clean up a .CSV file with Regular Expressions, Pandas, and Python

Web Scraping Data Using Python

Harnessing the Power of Data Analytics with Atoti: A Python-based API for CSV Input

Learn Aggregation and Data Wrangling with Python

F1 project - loading to the database

The Indispensable Data Toolkit: Master These 7 Python Modules for Actionable Insights

Data cleaning with help of python with pandas.

Handling JSON/XML Data in Cosmos DB with Python and PySpark

Best Way to Use Pydantic in FastAPI: A Detailed Guide

Doing Data Analysis and Data Science in Python with pandas

Why are records special? And you mentioned Python...

Ok, so why use it?

Faster than Joins and Merges

Simplify Code

Error Handling

Special Note: One-to-Many Substitutions

Conclusion

Important Links

Photo/Image Credits

About the Author

Jay Sumners的更多文章

Calculating Contribution to Percent Change

Debugging DAX in Power BI

Replacing Expensive If-Statements in DAX

Understanding RANKX in DAX Measures

Mimicking Python Dictionaries in Power Query M: Multi-Key Joins

社区洞察

其他会员也浏览了

Clean up a .CSV file with Regular Expressions, Pandas, and Python

Web Scraping Data Using Python

Harnessing the Power of Data Analytics with Atoti: A Python-based API for CSV Input

Learn Aggregation and Data Wrangling with Python

F1 project - loading to the database

The Indispensable Data Toolkit: Master These 7 Python Modules for Actionable Insights

Data cleaning with help of python with pandas.

Handling JSON/XML Data in Cosmos DB with Python and PySpark

Best Way to Use Pydantic in FastAPI: A Detailed Guide

Doing Data Analysis and Data Science in Python with pandas