Advanced Filtering Techniques With DynamoDB

Welcome to the 12th edition of Excelling With DynamoDB!

In this week’s issue I’ll guide you through some advanced filtering techniques in DynamoDB and explain how they can be used to perform powerful data filtering.

Here’s the breakdown of what I’ll discuss:

  • How filtering in DynamoDB differs from filtering in other NoSQL databases
  • The 2 methods of filtering in DynamoDB and how they compare
  • When to apply which filtering method

In SQL databases, filtering is straightforward. You have an expressive syntax and can add any number of filter conditions by specifying WHERE clauses.

In SQL, you can join tables and filter on the value of the join.

Simply declaring “WHERE age > 20” will instantly filter your table items.

DynamoDB is radically different with filters.

DynamoDB, being a key-value database, lets you query in 2 different ways. One is highly efficient; the other is a little less so, but more flexible.

Compared to SQL databases, DynamoDB’s recommended filtering is much faster and more scalable; let’s see why.

The 2 Methods Of Filtering

  1. Primary key filtering - this is done by cleverly designing our partition and sort keys
  2. Filter Expressions - filtering applied on the server side, but only after all items in a partition have been read

Let’s understand these 2 methods in detail.

Primary Key Filtering

Primary key filtering is the technique that differs most from other databases, and it has the steepest learning curve for developers new to DynamoDB.

This filtering method, while more complex, remains the most efficient and highly scalable.

It involves carefully modelling our data, using carefully designed partition and sort key combinations to enable efficient filtering.

This is best understood with an example.

Say we wish to design a movies database. We wouldn’t design it the same way we would for a relational database. Instead, based on our required data access patterns, we will do 2 things:

  1. Use our base table for our main access pattern, which is usually something like “get a movie by movieID”.
  2. Create a global secondary index for our other access patterns like “get movies by year, by genre and by title”.

To satisfy our data access patterns above we need to model our data accordingly.

A typical movie item in our base movies table will look like this:

[Image: an example movie item in the base table, with movieID as the partition key]

The data model here isn’t in any way special; it simply allows you to query for a single movie by providing a movieID.
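As a rough sketch of the base-table access pattern (attribute names like movieID, title and genre are illustrative assumptions, not a real schema), a plain-Python stand-in for the table and its key lookup might look like:

```python
# Hypothetical movie item in the base table; movieID is the partition key.
# Attribute names are assumptions for illustration only.
movies_table = {
    "tt15398776": {
        "movieID": "tt15398776",
        "title": "Oppenheimer",
        "genre": "history",
        "year": 2023,
    },
}

def get_movie(movie_id):
    """Simulates GetItem: a direct key lookup, with no scanning involved."""
    return movies_table.get(movie_id)

print(get_movie("tt15398776")["title"])  # Oppenheimer
```

The point is simply that a lookup by the full primary key touches exactly one item, which is why it is the cheapest and fastest access pattern DynamoDB offers.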

But take a look at the next data model for our global secondary index:

It’s radically different: the year is now the partition key, and the sort key is the composite string “genreTitle”.

We model the data in this way for 2 reasons:

First, we define the year as the partition key because it is a low-cardinality key (a limited number of distinct values). This also allows us to get all movies for a given year.

Second, we define “genreTitle” as the sort key which contains strings of a particular movie’s genre and title delimited by ‘#’ characters.
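To make the composite sort key concrete, here is a minimal sketch of how such a “genreTitle” string could be built (the lowercasing is an assumption, to keep prefix matches case-insensitive; the article does not specify it):

```python
def build_genre_title(genre, title):
    """Builds the '#'-delimited composite sort key described above,
    e.g. '#history#oppenheimer'.
    Lowercasing is an assumed convention, not mandated by DynamoDB."""
    return f"#{genre.lower()}#{title.lower()}"

print(build_genre_title("History", "Oppenheimer"))  # #history#oppenheimer
```

Writing the key this way means every query prefix - genre alone, or genre plus title - is itself a valid begins_with argument.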

This is where the filtering takes place.

Let’s demonstrate how we can perform powerful filtering with this data modelling and satisfy our data access patterns defined earlier.

To get all movies for a given year we can run the following pseudo query on our global secondary index:

year = 2023 AND begins_with(genreTitle, '#')

This query will find the “2023” partition and get all items whose ‘genreTitle’ (sort key) attribute begins with “#” - which is all movies released in 2023.
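The mechanics of that query can be simulated in plain Python (this is an in-memory sketch of what DynamoDB does internally, not real API code; the sample items are invented for illustration):

```python
def query_gsi(items, year, sort_key_prefix):
    """Simulates a Query on the GSI: select the 'year' partition, then keep
    only items whose genreTitle sort key begins with the given prefix.
    A real Query never reads items outside the chosen partition."""
    return [
        item for item in items
        if item["year"] == year and item["genreTitle"].startswith(sort_key_prefix)
    ]

sample = [
    {"year": 2023, "genreTitle": "#history#oppenheimer"},
    {"year": 2023, "genreTitle": "#comedy#barbie"},
    {"year": 1999, "genreTitle": "#scifi#the matrix"},
]

# year = 2023 AND begins_with(genreTitle, '#')  ->  every 2023 movie
print(len(query_gsi(sample, 2023, "#")))  # 2
```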

On a side note, even though this could return many items, keep in mind that DynamoDB will always return a maximum of 1MB of data per request. This is why Alex Debrie often says:

“DynamoDB won’t let you write a query that won’t scale”

This is also a design which makes DynamoDB powerful and highly scalable.

Now the above query returned a base filter of items for a given year only. But say we want to perform an even deeper filter.

Here’s what we can do.

We’ll filter for movies from 2023 that also belong to the history genre.

Here is our filtered pseudo query:

year = 2023 AND begins_with(genreTitle, '#history')

This would return items “Troy” and “Oppenheimer”.

We can further filter the results by providing the movie title:

year = 2023 AND begins_with(genreTitle, '#history#oppenheimer')

This would of course return the item “Oppenheimer”.
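The progressive narrowing above boils down to matching ever longer prefixes against the sort key. A small sketch, reusing the article’s “Troy” and “Oppenheimer” example items (the key strings are invented to match the queries shown):

```python
items_2023 = [
    "#history#troy",
    "#history#oppenheimer",
    "#comedy#barbie",
]

def filter_by_prefix(sort_keys, prefix):
    """Each longer prefix narrows the result set, exactly like supplying
    begins_with a more and more specific string."""
    return [key for key in sort_keys if key.startswith(prefix)]

print(filter_by_prefix(items_2023, "#history"))               # both history movies
print(filter_by_prefix(items_2023, "#history#oppenheimer"))   # just Oppenheimer
```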

We’ve now satisfied all of our access patterns. We can now easily get:

  • all movies from a given year
  • all movies from a given genre
  • a movie by a given title

Additionally, if we wanted to filter even deeper and had, for example, a TV shows table, we could apply filters down to the season and episode level.

All we need to modify for this TV shows table is to add the season and episode numbers to our sort key string, e.g. '#fantasy#game of thrones#s1#e1'.

Then we can further filter for tv shows by providing a string for a particular season:

yearStart = 2011 AND begins_with(genreTitleSeasonEpisode, '#fantasy#game of thrones#s1')

Now we are filtering on the year (“yearStart” for TV shows), the genre, the title and the season (we can also filter on the episode if we need to, by appending #e1 to the filter string).

The above query would return all episodes that are in “Game of Thrones” season 1.
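The same prefix mechanics extend naturally to the season and episode segments. A sketch, assuming the '#genre#title#sN#eN' key format inferred from the query above:

```python
def build_show_key(genre, title, season, episode):
    """Extends the composite sort key with season and episode segments,
    e.g. '#fantasy#game of thrones#s1#e1'. The exact format is an
    assumption based on the pseudo query in the article."""
    return f"#{genre}#{title}#s{season}#e{episode}"

# Two seasons of three episodes each -> six episode items.
episodes = [build_show_key("fantasy", "game of thrones", s, e)
            for s in (1, 2) for e in (1, 2, 3)]

# begins_with(..., '#fantasy#game of thrones#s1') -> only season 1 episodes
season_one = [k for k in episodes if k.startswith("#fantasy#game of thrones#s1")]
print(len(season_one))  # 3
```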


The examples we’ve seen are all examples of filtering using the primary keys of our table/secondary indexes.

Every filter query we have performed above is efficient and highly scalable.

This means even if we had millions or billions of items in our movies or TV shows table, the latency would be the same as if we had 10 items - consistently in the double-digit millisecond range.

The second type of filtering we’ll take a look at are Filter Expressions. These are less efficient but more flexible as we’ll see.

Filter Expressions

A Filter Expression is filtering that is applied to the results of a Query or Scan operation, after the operation has read the items.

This is the filtering method most of you coming from other NoSQL databases would be familiar with.

With Filter Expressions you can provide a query expression like year = 2023; DynamoDB reads all items in the given “2023” partition, then applies your filters and returns the matching items.

There are two problems with this “traditional” filtering in DynamoDB.

First, DynamoDB must read every item in the partition; only once it has read all those items does it apply your filters. This naturally adds latency to your query.

Second, since DynamoDB charges you for the data it reads, you are effectively billed for all items scanned, not just the filtered items that are returned.

When you perform any query against your DynamoDB table, the response contains your data along with two attributes: “ScannedCount” and “Count”.

ScannedCount is the number of items DynamoDB had to scan, regardless of your filter.

Count is the number of items DynamoDB returned after your filter.

In terms of costs, you are charged for the ScannedCount number not the Count number.

This is why primary key filtering is more efficient and cost-effective - here the ScannedCount will be the same as the Count.
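The ScannedCount/Count gap is easy to see in a simulation (plain Python standing in for the DynamoDB behavior; the sample items and the rating attribute are invented for illustration):

```python
def query_with_filter(partition_items, filter_fn):
    """Simulates a Query with a filter expression: every item in the
    partition is read (ScannedCount), the filter runs afterwards, and
    only the survivors are returned (Count). You pay for ScannedCount."""
    scanned_count = len(partition_items)
    results = [item for item in partition_items if filter_fn(item)]
    return {"Items": results, "Count": len(results), "ScannedCount": scanned_count}

# 100 hypothetical movies; only those rated 8 or higher pass the filter.
partition = [{"title": f"movie {i}", "rating": i % 10} for i in range(100)]
resp = query_with_filter(partition, lambda item: item["rating"] >= 8)

print(resp["ScannedCount"])  # 100 -> what you are billed for
print(resp["Count"])         # 20  -> what you actually get back
```

With primary key filtering, by contrast, only matching items are ever read, so the two numbers are equal.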

However, Filter Expressions are not all doom and gloom. They provide you with more flexibility than primary key filtering.

Recall that with primary key filtering, we are limited to filtering on attributes that we pre-define and cannot modify after creating our table or index - such as genre, title and season.

With Filter Expressions we can filter our query on any attribute we like. We can filter on genre or season in any order, unlike primary key filtering where the prefix order is fixed - allowing for more expressive filtering.

Filter Expression Caveats

There is one caveat, however. Because of DynamoDB’s 1MB limit, filter expressions become ineffective once the data that must be read to find your matches exceeds 1MB.

To understand this limitation, I’ve used an example from Alex Debrie’s DynamoDB Guide:

“However, the key point to understand is that the Query and Scan operations will return a maximum of 1MB of data, and this limit is applied in step 1, before the filter expression is applied.

Imagine your music table was 1GB in size, but the songs that were platinum were only 100KB in size. You might expect a single Scan request to return all the platinum songs, since it is under the 1MB limit. However, since the filter expression is not applied until after the items are read, your client will need to page through 1000 requests to properly scan your table. Many of these requests will return empty results as all non-matching items have been filtered out.” [1]

The bottom line is to be careful with filter expressions, as a single request often will not retrieve everything you expected.
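Alex Debrie’s 1GB example can be reproduced with a small paging simulation (the sizes and the match rate are invented to mirror the quote; real page boundaries depend on actual item sizes):

```python
def scan_with_filter(total_items, item_size_kb, match_fn, page_limit_kb=1024):
    """Simulates paging through a Scan: each request reads up to ~1MB of
    items first, then applies the filter, so sparse matches still cost a
    full pass over the table."""
    per_page = page_limit_kb // item_size_kb
    requests, matches = 0, 0
    for start in range(0, total_items, per_page):
        requests += 1
        page = range(start, min(start + per_page, total_items))
        matches += sum(1 for i in page if match_fn(i))
    return requests, matches

# 1,024,000 items of 1KB (~1GB); only every 10,000th item matches the filter.
requests, matches = scan_with_filter(1_024_000, 1, lambda i: i % 10_000 == 0)
print(requests)  # 1000 requests, despite the handful of matching items
```

Most of those 1000 requests return few or no items, yet every one of them is billed and adds latency.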

When To Apply Which Filtering Method

The question now remains: when should you apply each filtering method?

According to Alex Debrie, you should avoid Filter Expressions whenever you can.

I tend to stick to primary key filtering the strong majority of the time and recommend using Filter Expressions only in the following 2 cases:

  • When you have an important access pattern which cannot be satisfied with the primary keys of your table or index
  • When your table contains a limited and low number of items, and latency or cost is not a concern.

One thing to keep in mind is that DynamoDB will charge you for each global secondary index. If you have already created several GSIs it may be more efficient to use a filter expression, especially if the first point above applies.

Conclusion

Filtering in DynamoDB is performed differently than in most NoSQL databases. Understanding how DynamoDB works is crucial to designing effective and optimized filters on your tables.

To write performant filtering, you must first work on carefully modelling your table data and designing your primary keys intelligently.

In this article, we went through the 2 types of filtering in DynamoDB: primary key filtering, a powerful and efficient technique that optimizes for latency and cost; and Filter Expressions, which are mostly recommended for smaller tables where latency and cost are not a concern.


If you enjoyed this post, please consider subscribing and sharing the newsletter with your network: https://www.dhirubhai.net/newsletters/7181386750357893121/


My name is Uriel Bitton and I hope you learned something in this edition of Excelling With DynamoDB.

You can share the article with your network to help others learn as well.

If you enjoyed this you can subscribe to my newsletter on Medium to get my latest articles on DynamoDB and serverless by email.

My blog website is coming soon, so stay tuned for that.

I hope to see you in next week’s edition!

Uriel


References

  1. The DynamoDB Guide. Alex Debrie. https://www.dynamodbguide.com.
