Bloom Filter in Snowflake
A Bloom Filter is a probabilistic data structure that is used to test whether an element is a member of a set. It is highly space-efficient and allows for fast membership checks with a trade-off: it may yield false positives, but it will never yield false negatives.
How a Bloom Filter Works
A Bloom Filter is implemented using:
Insertion Process
When an element is added to the Bloom Filter:
Membership Check
To check if an element is in the set:
What a Bloom Filter can do
Advantages
Disadvantages
Applications
Example
Here’s a simple example to illustrate the concept:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
hash1(element)
hash2(element)
hash1("A") -> 2
hash2("A") -> 5
Set bits at indices 2 and 5 to 1.
[0, 0, 1, 0, 0, 1, 0, 0, 0, 0]
领英推荐
hash1("B") -> 3
hash2("B") -> 7
Set bits at indices 3 and 7 to 1.
[0, 0, 1, 1, 0, 1, 0, 1, 0, 0]
hash1("A") -> 2
hash2("A") -> 5
Bits at indices 2 and 5 are both 1, so "A" is probably in the set.
hash1("C") -> 1
hash2("C") -> 6
Bit at index 1 is 0, so "C" is definitely not in the set.
This is a basic overview of Bloom Filters. They are widely used due to their efficiency and effectiveness in various scenarios.
Bloom Filter in Snowflake
Bloom Filter (a.k.a JOIN filter) is used as one of the query optimization techniques in Snowflake. It is applied on the probe side of JOIN during runtime from transferring the build side filter.? When only a fraction of the data in a table is needed for a query against a table or to evaluate a join condition, the execution engine determines the appropriate conditions while the query is running, and broadcasts that information to all the nodes that are reading the table so that they can avoid unnecessary network transmission by sending only the subset of rows that match the join keys across the network.
Here is a simplified example to learn how to interpret Bloom Filter's exploitation using a query profile.
create or replace table tb_bfilter1 (c1 date, c2 int);
insert into tb_bfilter1 values
('1999', 1999), ('2000', 2000), ('2001', 2001), ('2010', 2010);
create or replace table tb_bfilter2 (d1 date, d2 int);
insert into tb_bfilter2 values
('1999', 1999), ('2000', 2000), ('2001', 2001);
select d1
from tb_bfilter2
where d2 in (
select c2
from tb_bfilter1
where c2 between 2000 and 2005
);
The query reads a number of rows from table tb_bfilter2 via a local filter on d2 column with in-list subquery, whose key values are evaluated at run time due to in-list subquery.??
This query has been optimized and transferred the in-list subquery to a JOIN.
As you can see from Join#1 of above profile, the joining key is on c2 of tb_bfilter1 and d2 of tb_bfilter2, that is (TB_BFILTER1.C2 = TB_BFILTER2.D2).
From the query profile, you can see JoinFilter#5 which is the bloom filter coming from Join#1 (original join id 1). The build side is on table tb_bfilter1, the probe side is on table tb_bfilter2. JoinFilter#5 is a bloom filter generated from the join key (TB_BFILTER1.C2 = TB_BFILTER2.D2) using filter on the build side (TB_BFILTER1.C2 >= 2000) AND (TB_BFILTER1.C2 <= 2005).
From table function, you can get similar information,
select step_id, operator_id, parent_operators, operator_type, operator_attributes
from table(get_query_operator_stats('01b58e0d-080a-a067-0000-62b1006f3736'))
--where operator_type = 'JoinFilter'
order by EXECUTION_TIME_BREAKDOWN:overall_percentage desc;
Creating a bloom filter allows for pruning to be done on the probe side of a JOIN based on BloomFilter (a.k.a JoinFilter) values during the TableScan.? It’s one of the key optimization techniques used for runtime pruning.
Disclaimer:
As this is my personal blog, any views, opinions, or advice represented in it are my own and belong solely to me.
Python-Django developer
7 个月Thanks for the article Minzhen Yang , any advice on how we would go about deciding length of bit array to be used?