Bloom Filter in Snowflake

Bloom Filter in Snowflake

A Bloom Filter is a probabilistic data structure that is used to test whether an element is a member of a set. It is highly space-efficient and allows for fast membership checks with a trade-off: it may yield false positives, but it will never yield false negatives.

How a Bloom Filter Works

A Bloom Filter is implemented using:

  1. A Bit Array: An array of bits, initially all set to 0.
  2. Hash Functions: A set of independent hash functions that map an element to multiple positions in the bit array.

Insertion Process

When an element is added to the Bloom Filter:

  1. The element is hashed using each of the hash functions.
  2. Each hash function produces an index in the bit array.
  3. The bits at all these indices are set to 1.

Membership Check

To check if an element is in the set:

  1. The element is hashed using the same set of hash functions.
  2. Each hash function produces an index in the bit array.
  3. If all the bits at these indices are 1, the element is considered to be in the set (though there might be a false positive).
  4. If any of the bits at these indices is 0, the element is definitely not in the set.

What a Bloom Filter can do

Advantages

  • Space Efficiency: Bloom Filters are more space-efficient than other data structures like hash tables or arrays, especially for large datasets.
  • Speed: Membership checks and insertions are very fast (constant time operations).

Disadvantages

  • False Positives: Bloom Filters can indicate that an element is in the set when it is not. The probability of false positives increases with the number of elements added.
  • No Deletion: Standard Bloom Filters do not support the removal of elements. Removing an element might unset bits that were set by other elements, leading to incorrect results.

Applications

  • Network Security: For filtering malicious URLs or emails.
  • Database Systems: For quickly checking if a data element is present in a distributed database.
  • Web Caching: To check if an item is already cached.

Example

Here’s a simple example to illustrate the concept:

  • Initialization: A bit array of size 10, all set to 0.

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]        

  • Hash Functions: Two hash functions (for simplicity).

hash1(element)
hash2(element)        

  • Insert Element "A":

hash1("A") -> 2
hash2("A") -> 5        

Set bits at indices 2 and 5 to 1.

[0, 0, 1, 0, 0, 1, 0, 0, 0, 0]        

  • Insert Element "B":

hash1("B") -> 3
hash2("B") -> 7        

Set bits at indices 3 and 7 to 1.

[0, 0, 1, 1, 0, 1, 0, 1, 0, 0]        

  • Check Membership for "A":

hash1("A") -> 2
hash2("A") -> 5        

Bits at indices 2 and 5 are both 1, so "A" is probably in the set.

  • Check Membership for "C":

hash1("C") -> 1
hash2("C") -> 6        

Bit at index 1 is 0, so "C" is definitely not in the set.

This is a basic overview of Bloom Filters. They are widely used due to their efficiency and effectiveness in various scenarios.

Bloom Filter in Snowflake

Bloom Filter (a.k.a JOIN filter) is used as one of the query optimization techniques in Snowflake. It is applied on the probe side of JOIN during runtime from transferring the build side filter.? When only a fraction of the data in a table is needed for a query against a table or to evaluate a join condition, the execution engine determines the appropriate conditions while the query is running, and broadcasts that information to all the nodes that are reading the table so that they can avoid unnecessary network transmission by sending only the subset of rows that match the join keys across the network.

Here is a simplified example to learn how to interpret Bloom Filter's exploitation using a query profile.

create or replace table tb_bfilter1 (c1 date, c2 int);
insert into tb_bfilter1 values 
('1999', 1999), ('2000', 2000), ('2001', 2001), ('2010', 2010);


create or replace table tb_bfilter2 (d1 date, d2 int);
insert into tb_bfilter2 values 
('1999', 1999), ('2000', 2000), ('2001', 2001);


select d1 
from tb_bfilter2 
where d2 in (
select c2 
from tb_bfilter1 
where c2 between 2000 and 2005
);        

The query reads a number of rows from table tb_bfilter2 via a local filter on d2 column with in-list subquery, whose key values are evaluated at run time due to in-list subquery.??

This query has been optimized and transferred the in-list subquery to a JOIN.


IN List

As you can see from Join#1 of above profile, the joining key is on c2 of tb_bfilter1 and d2 of tb_bfilter2, that is (TB_BFILTER1.C2 = TB_BFILTER2.D2).


JOIN Filter

From the query profile, you can see JoinFilter#5 which is the bloom filter coming from Join#1 (original join id 1). The build side is on table tb_bfilter1, the probe side is on table tb_bfilter2. JoinFilter#5 is a bloom filter generated from the join key (TB_BFILTER1.C2 = TB_BFILTER2.D2) using filter on the build side (TB_BFILTER1.C2 >= 2000) AND (TB_BFILTER1.C2 <= 2005).

From table function, you can get similar information,

select step_id, operator_id, parent_operators, operator_type, operator_attributes
    from table(get_query_operator_stats('01b58e0d-080a-a067-0000-62b1006f3736'))
--where operator_type = 'JoinFilter'
    order by EXECUTION_TIME_BREAKDOWN:overall_percentage desc;        


Creating a bloom filter allows for pruning to be done on the probe side of a JOIN based on BloomFilter (a.k.a JoinFilter) values during the TableScan.? It’s one of the key optimization techniques used for runtime pruning.


Disclaimer:

As this is my personal blog, any views, opinions, or advice represented in it are my own and belong solely to me.

Vishwas Patel

Python-Django developer

7 个月

Thanks for the article Minzhen Yang , any advice on how we would go about deciding length of bit array to be used?

回复

要查看或添加评论,请登录

Minzhen Yang的更多文章

社区洞察

其他会员也浏览了