
Bloom Filter in Snowflake

Minzhen Yang

Published: August 3, 2024

A Bloom Filter is a probabilistic data structure that is used to test whether an element is a member of a set. It is highly space-efficient and allows for fast membership checks with a trade-off: it may yield false positives, but it will never yield false negatives.

How a Bloom Filter Works

A Bloom Filter is implemented using:

A Bit Array: An array of bits, initially all set to 0.
Hash Functions: A set of independent hash functions that map an element to multiple positions in the bit array.
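A minimal sketch of these two ingredients, under stated assumptions: the sizes `NUM_BITS` and `NUM_HASHES`, the helper names, and the choice of salted SHA-256 as a stand-in for "independent hash functions" are all illustrative and not taken from any particular implementation.

```python
import hashlib

NUM_BITS = 64      # m: size of the bit array (assumed for illustration)
NUM_HASHES = 3     # k: number of hash functions (assumed for illustration)

# The bit array: 64 bits packed into 8 bytes, all initially 0.
bit_array = bytearray(NUM_BITS // 8)

def positions(item: str) -> list[int]:
    """Map an item to NUM_HASHES positions in the bit array."""
    result = []
    for seed in range(NUM_HASHES):
        # Prefixing the item with a different seed simulates a family of
        # independent hash functions using a single underlying hash.
        digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
        result.append(int.from_bytes(digest[:8], "big") % NUM_BITS)
    return result

def set_bit(index: int) -> None:
    """Set the bit at the given index to 1."""
    bit_array[index // 8] |= 1 << (index % 8)

def get_bit(index: int) -> int:
    """Return the bit (0 or 1) at the given index."""
    return (bit_array[index // 8] >> (index % 8)) & 1

print(positions("A"))  # three indices in the range 0..63
```

The same item always maps to the same positions, which is what makes the later membership check possible.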

Insertion Process

When an element is added to the Bloom Filter:

The element is hashed using each of the hash functions.
Each hash function produces an index in the bit array.
The bits at all these indices are set to 1.

Membership Check

To check if an element is in the set:

The element is hashed using the same set of hash functions.
Each hash function produces an index in the bit array.
If all the bits at these indices are 1, the element is considered to be in the set (though there might be a false positive).
If any of the bits at these indices is 0, the element is definitely not in the set.
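The insertion and membership-check steps above can be sketched as a small class. This is an illustrative sketch, not a production implementation: the class name `BloomFilter`, the method names, the sizes, and the salted SHA-256 hashing are all assumptions made for the example.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: one byte per bit for readability, not compactness."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # all positions start at 0

    def _positions(self, item: str):
        # One salted hash per "hash function"; each yields an index.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        # Insertion: set the bit at every index produced for the item.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # Membership check: any 0 bit means "definitely not present";
        # all 1 bits means "probably present".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(num_bits=1024, num_hashes=3)
for word in ("apple", "banana"):
    bf.add(word)

print(bf.might_contain("apple"))   # True: added items are always reported
print(bf.might_contain("cherry"))  # usually False; True would be a false positive
```

Note that `might_contain` can only answer "probably yes" or "definitely no", which is why the method is not named `contains`.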

What a Bloom Filter can do

Advantages

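Space Efficiency: Bloom Filters store only bits rather than the elements themselves, so they are more space-efficient than data structures like hash tables or arrays, especially for large datasets.
Speed: Membership checks and insertions are very fast: each takes time proportional to the number of hash functions, regardless of how many elements have been added.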

Disadvantages

False Positives: Bloom Filters can indicate that an element is in the set when it is not. The probability of false positives increases with the number of elements added.
No Deletion: Standard Bloom Filters do not support the removal of elements. Removing an element might unset bits that were set by other elements, leading to incorrect results.
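To make the false-positive trade-off concrete, the sketch below uses the standard textbook approximation p ≈ (1 - e^(-kn/m))^k for a filter with m bits, k hash functions, and n inserted elements, together with the usual sizing rules m = -n ln(p) / (ln 2)^2 and k = (m/n) ln 2. The function names are hypothetical, and the formulas assume ideal, independent hash functions.

```python
import math

def false_positive_rate(num_bits: int, num_hashes: int, num_items: int) -> float:
    """Approximate false-positive probability after num_items insertions."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes

def suggested_size(num_items: int, target_rate: float) -> tuple[int, int]:
    """Return (num_bits, num_hashes) for a target false-positive rate."""
    num_bits = math.ceil(-num_items * math.log(target_rate) / (math.log(2) ** 2))
    num_hashes = max(1, round(num_bits / num_items * math.log(2)))
    return num_bits, num_hashes

# The rate grows as more elements are added to a fixed-size filter.
for n in (100, 500, 1000):
    print(n, false_positive_rate(num_bits=10_000, num_hashes=7, num_items=n))

# Sizing for 1,000 elements at a 1% target rate.
print(suggested_size(num_items=1000, target_rate=0.01))
```

In practice the expected number of elements and the acceptable false-positive rate are chosen first, and the bit-array length and hash count follow from them.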

Applications

Network Security: For filtering malicious URLs or emails.
Database Systems: For quickly checking if a data element is present in a distributed database.
Web Caching: To check if an item is already cached.

Example

Here’s a simple example to illustrate the concept:

Initialization: A bit array of size 10, all set to 0.

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Hash Functions: Two hash functions (for simplicity).

hash1(element)
hash2(element)

Insert Element "A":

hash1("A") -> 2
hash2("A") -> 5

Set bits at indices 2 and 5 to 1.

[0, 0, 1, 0, 0, 1, 0, 0, 0, 0]

Insert Element "B":

hash1("B") -> 3
hash2("B") -> 7

Set bits at indices 3 and 7 to 1.

[0, 0, 1, 1, 0, 1, 0, 1, 0, 0]

Check Membership for "A":

hash1("A") -> 2
hash2("A") -> 5

Bits at indices 2 and 5 are both 1, so "A" is probably in the set.

Check Membership for "C":

hash1("C") -> 1
hash2("C") -> 6

Bit at index 1 is 0, so "C" is definitely not in the set.
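The walkthrough above can be replayed in a few lines of Python. To reproduce exactly the same indices, the two hash functions are replaced by lookup tables holding the values from the example; a real Bloom Filter would compute them. The names `HASH1`, `HASH2`, `insert`, and `might_contain` are illustrative.

```python
# Hash values copied from the worked example; real hash functions
# would compute these instead of looking them up.
HASH1 = {"A": 2, "B": 3, "C": 1}
HASH2 = {"A": 5, "B": 7, "C": 6}

bits = [0] * 10  # bit array of size 10, all set to 0

def insert(element: str) -> None:
    # Set the bit at the index given by each hash function.
    for table in (HASH1, HASH2):
        bits[table[element]] = 1

def might_contain(element: str) -> bool:
    # True only if every hashed index holds a 1.
    return all(bits[table[element]] == 1 for table in (HASH1, HASH2))

insert("A")
print(bits)                # [0, 0, 1, 0, 0, 1, 0, 0, 0, 0]
insert("B")
print(bits)                # [0, 0, 1, 1, 0, 1, 0, 1, 0, 0]
print(might_contain("A"))  # True: probably in the set
print(might_contain("C"))  # False: definitely not in the set
```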

This is a basic overview of Bloom Filters. They are widely used due to their efficiency and effectiveness in various scenarios.

Bloom Filter in Snowflake

A Bloom Filter (a.k.a. JOIN filter) is one of the query optimization techniques used in Snowflake. It is built at runtime from the build side of a JOIN and applied to the probe side. When only a fraction of the data in a table is needed for a query or to evaluate a join condition, the execution engine determines the appropriate conditions while the query is running and broadcasts that information to all the nodes that are reading the table, so that they avoid unnecessary network transmission by sending only the subset of rows that match the join keys across the network.

Here is a simplified example that shows how to recognize the use of a Bloom Filter in a query profile.

create or replace table tb_bfilter1 (c1 date, c2 int);
insert into tb_bfilter1 values 
('1999', 1999), ('2000', 2000), ('2001', 2001), ('2010', 2010);


create or replace table tb_bfilter2 (d1 date, d2 int);
insert into tb_bfilter2 values 
('1999', 1999), ('2000', 2000), ('2001', 2001);


select d1 
from tb_bfilter2 
where d2 in (
select c2 
from tb_bfilter1 
where c2 between 2000 and 2005
);

The query reads rows from table tb_bfilter2 through a local filter on column d2 that is defined by an in-list subquery, so the key values are only known at run time.

The optimizer rewrites the in-list subquery as a JOIN.

As Join#1 in the query profile shows, the join key is c2 of tb_bfilter1 and d2 of tb_bfilter2, that is, (TB_BFILTER1.C2 = TB_BFILTER2.D2).

The query profile also shows JoinFilter#5, which is the bloom filter coming from Join#1 (original join id 1). The build side is table tb_bfilter1 and the probe side is table tb_bfilter2. JoinFilter#5 is generated from the join key (TB_BFILTER1.C2 = TB_BFILTER2.D2), using the build-side filter (TB_BFILTER1.C2 >= 2000) AND (TB_BFILTER1.C2 <= 2005).
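As a rough mental model only, the build-side and probe-side roles can be simulated in Python. This is not how Snowflake implements JoinFilter internally; the filter size, the hashing scheme, and the `row-...` labels standing in for the d1 date values are all assumptions made for illustration.

```python
import hashlib

NUM_BITS, NUM_HASHES = 64, 2  # illustrative sizes, not Snowflake's

def positions(key: int):
    # Salted SHA-256 stands in for the engine's (unknown) hash functions.
    for seed in range(NUM_HASHES):
        digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
        yield int.from_bytes(digest[:8], "big") % NUM_BITS

# Data mirroring the example tables; d1 is replaced by a placeholder label.
tb_bfilter1_c2 = [1999, 2000, 2001, 2010]                      # build side
tb_bfilter2 = [("row-1999", 1999), ("row-2000", 2000),
               ("row-2001", 2001)]                             # probe side

# Build side: apply the local filter, then hash the surviving join keys.
build_keys = {c2 for c2 in tb_bfilter1_c2 if 2000 <= c2 <= 2005}
bloom = [0] * NUM_BITS
for key in build_keys:
    for pos in positions(key):
        bloom[pos] = 1

# Probe side: during the scan, drop rows whose key is definitely absent.
survivors = [row for row in tb_bfilter2
             if all(bloom[pos] for pos in positions(row[1]))]

# The join still checks exact equality, so a false positive that slips
# through the filter cannot change the final result.
result = [d1 for d1, d2 in survivors if d2 in build_keys]
print(result)  # ['row-2000', 'row-2001']
```

The filter never drops a matching row, so it only reduces the work done by the scan and the join, never the correctness of the result.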

The get_query_operator_stats table function returns similar information:

select step_id, operator_id, parent_operators, operator_type, operator_attributes
    from table(get_query_operator_stats('01b58e0d-080a-a067-0000-62b1006f3736'))
--where operator_type = 'JoinFilter'
    order by EXECUTION_TIME_BREAKDOWN:overall_percentage desc;

Creating a bloom filter allows pruning on the probe side of a JOIN during the TableScan, based on the BloomFilter (a.k.a. JoinFilter) values. It’s one of the key optimization techniques used for runtime pruning.

Disclaimer:

As this is my personal blog, any views, opinions, or advice represented in it are my own and belong solely to me.

Vishwas Patel

Python-Django developer

7 months ago

Thanks for the article Minzhen Yang , any advice on how we would go about deciding length of bit array to be used?
