How to Choose a Good Partition Key?
Swapnil Mule
Data Engineer | Azure Data Factory | Azure Synapse | Databricks | SQL | Spark | SnowFlake | Power BI | AxiomSL
Horizontal partitioning is one of the most effective and important ways of partitioning big data. As a consequence, the art of choosing a good partition key is vital to any data engineer.
Let's imagine we have this stock table from before, and we will try to choose several partition keys to see what things can happen. One possible partition key could be the Has Stock column. It seems reasonable because usually one only buys items in stock, and that is a normal filter in any e?commerce platform. Can you think about what could go wrong? In case you thought of it, the problem this partition key has is low cardinality. Let me go through what that means. As our platform grows, we will serve more and more items. Probably a lot of those will get archived. However, our partition key only has two possible values, yes and no. That means that eventually both tables will get gigantic and we would have lost the value partition. This phenomenon is referred to as low cardinality. Furthermore, even if we have a 10?node deployment we only can populate 2 of them with data, making the rest impossible to leverage. We want our sharding key to having high cardinality to ensure we can allocate as many nodes as possible to our deployment and ensure we scale limitlessly.
Another partition key we have is the Last Ordered Date. We could manually assign N partitions based on the ordering we naturally have in that column. Can you think about what could go wrong? In case you thought of it, the problem this partition key has is partition reshuffling. Let me go through what that means. Initially, this seems like a good key because it is not bounded above, you can group items naturally, and it seems great for reporting, right? The main issue is that this column is not the date of the order, but the date of the last order. That means that the following scenario will happen all the time. On one partition, we have, for example, the Blue Yeti microphone. That is on the partition assigned for April 2020. When a new Blue Yeti is bought, this table needs an update in the column Last Ordered Date. That means that now the Blue Yeti needs to be moved into the latest partition. This has two downsides. It generates contention because that adds a lock on two partitions on an insert operation and it generates the oldest partitions to have significantly less frequency because only the archived items will not get bought ever again.
For a final scenario of what could go wrong, let's think about what would happen if we chose the price as the partition key. Can you think of the issue here? Think about it for a final minute. In case you thought of it, the problem this partition key has is frequency. Let me go through what that means. Usually, the price of microphones in our table will distribute like a bell shape, what we mathematicians would call a Gaussian distribution. Therefore, once we set up a partition for price most of the items would be in the middle partitions, leaving the other ones underpopulated. Also, normally people search and buy items in the middle price range, making these partitions not only the most accessed but also the most populated. That partition is called to be a hot partition and is a cause of a frequency imbalance in our partition key.
Up to this point I think I transmitted to you how complex it is to choose a good partition key and avoid disasters. It is that important for your architecture. To recap, a good partition key must meet the following requirements.
领英推è
Ensure it has enough cardinality to populate all the nodes of the deployment as we scale up.
Ensure it is static, which means items will need to be reshuffled to another partition once inserted.
?Ensure it distributes frequency evenly to avoid unevenly large shards and to avoid hot partitions.
Each of the cases we have seen before did not cover one or more of these characteristics. A good partition key for the stocks table would be, for example, the key or the name.
Great work Swapnil Mule - thanks for sharing ??