Snowflake: Row Pattern Recognition

Pattern matching is extremely powerful in data analytics, whether it is analyzing user behaviour on an e-commerce website or detecting suspicious activities. We generally use RegEx (regular expressions) to identify patterns within a string, and my experience with pattern matching in SQL so far has been limited to checking patterns within a single column. I recently came across a Medium article about a Snowflake SQL function called "MATCH_RECOGNIZE". At first glance, I found it pretty interesting to see the power it adds to SQL, a key tool for data professionals.

Out of curiosity, I read further on this topic and found that the concept is called Row Pattern Recognition; it is part of the SQL:2016 standard and has been available in Oracle since 12c. As an active user of the modern cloud data warehouse Snowflake, I find it pretty exciting to have this function available. As far as I know, no data warehouse other than Oracle and Snowflake offers such a feature at this point.

To be very honest, this concept is new to me and I have not used it in any real data analytics use case so far. My experience with this function amounts to a few hours of playing around with a very basic toy example to demonstrate the feature. I wrote this article mainly to improve my writing ability and to find and fix the blind spots in my learning.

Now let's jump straight into the use case with a step-by-step approach to building the SQL query for it.

Problem Statement : 

Let's assume we are a bank looking for suspicious money transfers. Let's define our own rules to identify a suspicious sequence:

  • 3 or more small transfers (< $2K each)
  • A large transfer (≥ $1M) following the 3 or more small transfers
  • All of these events happening within a 30-day window
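Before turning to SQL, the rules above can be expressed procedurally. Here is a minimal Python sketch (not the article's SQL, just an illustration of the detection logic) that scans one customer's date-ordered transfers; the thresholds mirror the rules above, and for simplicity it only checks the window against the first small transfer of the current run.

```python
from datetime import date

# Thresholds from the rules above (hypothetical constants for this sketch)
SMALL, LARGE, WINDOW_DAYS, MIN_SMALL = 2_000, 1_000_000, 30, 3

def find_suspicious(transfers):
    """transfers: list of (date, amount) for one customer, sorted by date.
    Returns (first_date, large_date, large_amount) for each detected pattern."""
    matches = []
    run = []  # consecutive small transfers collected so far
    for d, amount in transfers:
        if amount < SMALL:
            run.append((d, amount))
        elif amount >= LARGE and len(run) >= MIN_SMALL:
            # large transfer right after 3+ small ones: check the 30-day window
            if (d - run[0][0]).days < WINDOW_DAYS:
                matches.append((run[0][0], d, amount))
            run = []
        else:
            run = []  # a mid-sized transfer breaks the pattern
    return matches
```

A run of three sub-$2K transfers followed by a $2M transfer within 30 days would be flagged, while a large transfer after only two small ones would not.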

Sample Data:  

In a real scenario, we would have transactions for all of the bank's customers. But for illustration purposes, I have reduced the data to one customer and a limited date range.

[Image: sample transaction data for one customer, with the suspicious rows highlighted]

As the highlighted rows show, this customer (DEEPAK) meets the criteria defined for a suspicious transfer: 3 small transactions followed by a large transaction within a 30-day window.

Finding all such patterns of transfers by customers seems like a daunting task, and in fact it is. Doing it in plain SQL would likely involve a lot of CTEs, window functions and self-joins. Even after so many years of working with SQL, I am not comfortable writing such queries.

Here comes MATCH_RECOGNIZE to the rescue.

Before writing the query, the prerequisite is to get the data ingested into Snowflake. In modern data engineering practice, we have dozens of options to extract, clean and load data into Snowflake. Since that is not the topic of discussion here, we can move forward with the assumption that the data already exists in a Snowflake table named "TRANSACTION_FACT".

Writing this query is a multi-step process :

STEP1: Grouping the Data and Sequencing the Transactions

In our case, we want to group the data by "USER_ID" and order it by "DATE". Once the transactions are in sequence, we can easily establish the 30-day window.

PARTITION BY user_id ORDER BY date
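Conceptually, this clause is just "sort each customer's transfers by date and treat each customer as an independent group". A rough Python analogy (hypothetical rows, not the article's data) using `sorted` plus `itertools.groupby`:

```python
from itertools import groupby
from operator import itemgetter

# Rough equivalent of "PARTITION BY user_id ORDER BY date":
# sort by user and date, then process each user's transfers as one ordered group.
rows = [
    {"user_id": "DEEPAK", "date": "2021-01-10", "amount": 900},
    {"user_id": "DEEPAK", "date": "2021-01-01", "amount": 500},
    {"user_id": "ALICE",  "date": "2021-01-03", "amount": 700},
]
rows.sort(key=itemgetter("user_id", "date"))  # ISO dates sort correctly as strings
partitions = {u: list(g) for u, g in groupby(rows, key=itemgetter("user_id"))}
# partitions["DEEPAK"] now holds that customer's transfers in date order
```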


STEP2: Establishing Pattern of Events

This step needs a little RegEx knowledge. Let's give names to the patterns.

SMALL_TRANSFER and LARGE_TRANSFER are pattern variable names.

PATTERN (SMALL_TRANSFER{3,} LARGE_TRANSFER)


SMALL_TRANSFER{3,} => 3 or more small transfers (the {3,} quantifier works just as it does in RegEx).

LARGE_TRANSFER => a large transfer immediately following the small transfers.
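The PATTERN clause really is regular-expression syntax applied to rows instead of characters. As an analogy (a sketch, not how Snowflake implements it), classify each transfer as a letter and match with an ordinary regex: PATTERN (SMALL_TRANSFER{3,} LARGE_TRANSFER) behaves like S{3,}L over the sequence of classifications.

```python
import re

def classify(amount):
    """Map a transfer amount to a one-letter row classification."""
    if amount < 2_000:
        return "S"  # small transfer
    if amount >= 1_000_000:
        return "L"  # large transfer
    return "O"      # anything in between breaks the pattern

amounts = [500, 1500, 900, 2_000_000]             # hypothetical transfer sequence
symbols = "".join(classify(a) for a in amounts)   # "SSSL"
match = re.search(r"S{3,}L", symbols)             # 3+ smalls followed by a large
```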

STEP3: Describe/Define the Pattern

This is an important step where we write expressions on the data. For our toy example,

DEFINE
      SMALL_TRANSFER as amount < 2000 AND
                        LAST_VALUE(date) - FIRST_VALUE(date) < 30,
      LARGE_TRANSFER as amount >= 1000000 AND
                        LAST_VALUE(date) - FIRST_VALUE(date) < 30


Here FIRST_VALUE and LAST_VALUE are window functions; inside MATCH_RECOGNIZE they operate on the rows of the match being evaluated (within the partition we defined at the beginning), so the difference between the last and first date gives the span of the candidate pattern in days.

The Snowflake documentation has extensive detail on this part, as the behaviour differs depending on which window functions are used, and there are a few limitations too.

STEP4: Define what to return from the Dataset

MEASURES
     MATCH_NUMBER() as match_number,
     MATCH_SEQUENCE_NUMBER() as match_sequence_number,
     CLASSIFIER() as classifier,
     FIRST(SMALL_TRANSFER.date) as first_transfer_date,
     LARGE_TRANSFER.date as last_transfer_date,
     LARGE_TRANSFER.amount as largest_amount_transferred

ONE ROW PER MATCH

The first 3 measures use Snowflake's built-in match metadata functions and are optional to display. We also return the first and last transfer dates and the large amount (≥ $1M) so we can identify the suspicious transfer.

Now assemble all the pieces together as per the syntax. 

select *
from (select * from TRANSACTION_FACT where event = 'TRANSFER')
MATCH_RECOGNIZE(
    PARTITION BY user_id ORDER BY date
    MEASURES
         MATCH_NUMBER() as match_number,
         MATCH_SEQUENCE_NUMBER() as match_sequence_number,
         CLASSIFIER() as classifier,
         FIRST(SMALL_TRANSFER.date) as first_transfer_date,
         LARGE_TRANSFER.date as last_transfer_date,
         LARGE_TRANSFER.amount as largest_amount_transferred
    ONE ROW PER MATCH
    PATTERN (SMALL_TRANSFER{3,} LARGE_TRANSFER)
    DEFINE
          SMALL_TRANSFER as amount < 2000 AND
                     LAST_VALUE(date) - FIRST_VALUE(date) < 30,
          LARGE_TRANSFER as amount >= 1000000 AND
                     LAST_VALUE(date) - FIRST_VALUE(date) < 30
);

Here is the output in the Snowflake UI, showing the row that is the culprit.

[Image: MATCH_RECOGNIZE query output in the Snowflake UI]

I can see countless use cases for this in various domains and will definitely explore it more in the near future.

Below are a few helpful links for understanding the usage of "match_recognize" in real-world use cases.

Snowflake Documentation (Stock Price Rise/Fall Pattern example)

Ecommerce Funnel Analytics

Real world use cases from Oracle

Russell Leighton

Chief Architect at Panther Labs | ex-Amazon

3y

Nice example! This kind of sequence detection is very useful in cyber security as well. A series of weak signals in the right order become a strong signal.

Felipe Hoffa

Ex-Google, Ex-Snowflake, Always Me

4y

This is a wonderful use case. I'm so happy to have inspired you to discover this and share, thanks!

Sravan Kumar Reddy Keesara

Enterprise Data Architect at Parallon

4y

Good info!
