Your 4 step guide to copying data from AWS S3 to Snowflake using Matillion

Thank you for reading my latest article Your 4 step guide to copying data from AWS S3 to Snowflake using Matillion.

Here at LinkedIn I regularly write about modern data platforms and technology trends. To read my future articles simply join my network here or click 'Follow'. Also feel free to connect with me via YouTube.

------------------------------------------------------------------------------------------------------------

In this article I am going to break down part of a proof of value solution I worked on a few months ago for a client. The focus of this article is not the purpose or outcomes behind the proof of value, but rather how to use Matillion and Snowflake to load files from AWS S3 efficiently and easily.

I have broken this into 4 steps:

  1. Create the workflow
  2. Obtain the list of files to load
  3. Iterate over the list and load the data into Snowflake
  4. Write out operational audit metadata

What you'll learn is how you can iterate over files in S3 using Matillion and then push the data into target tables in Snowflake.

This is a common task most data engineers need to tackle, but doing it with legacy, non-cloud-native tools can quickly become complex, introducing unnecessary overhead compared to the more modern tools available in the market today.

This is one of the reasons I wanted to share with you a neat and elegant solution that you can use as a pattern or a starting point for your own projects. The pattern should also prove transferable between different cloud storage locations (e.g. Google Cloud Storage, Azure Blob Storage) and target databases (e.g. Redshift, BigQuery).

Step 1. Create the workflow

Once I had logged into Matillion and navigated to the project, I created an orchestration job. If you're not familiar with Matillion, there are two kinds of jobs you can use: orchestration jobs and transformation jobs.

  • Orchestration jobs define the workflow: which tasks and/or transformation jobs need to run and in what order. You can set conditions between components which tell Matillion whether to run the next, subsequent job depending on whether the preceding one succeeds or fails.
  • Transformation jobs are the ones that actually apply the transformation logic.

The diagram below shows the workflow. The workflow first calls the '01 Extract Initial Load Fixed Iterator' job.

[Image: Orchestration job workflow]

Looking at the '01 Extract Initial Load Fixed Iterator' job in the diagram below, at first glance it looks very simple and basic. However, there's actually quite a lot going on behind the scenes. I've annotated within the job what's actually happening, so let's break this down next.

[Image: The '01 Extract Initial Load Fixed Iterator' job]

Step 2. Obtain the list of files to load

As this was a proof of value, the aim was to demonstrate how to leverage Matillion's technical capabilities to support the basic requirements of the workflow. Of course, this pattern could easily be enhanced from this starting point as needed, based on specific requirements.

In this scenario, we needed to provide a fixed set of file names to be loaded into Snowflake from S3. We knew the naming convention of the files to be loaded, which allowed us to target those specific files. And this wasn't a purely sequential operation: we also set it up to run in parallel, which we'll cover a little later in this article.

Within the fixed iterator component properties I am iterating over a file name variable that I set up previously. It'll be much easier to see this in the video (link at the bottom of this article) as a screenshot won't be very clear.

The file name variable holds a list of values to iterate over (see image below), which is the list of files I expect to see in Amazon S3. I could also use wildcards against these if I wanted.

[Image: List of the files to load from AWS]
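If you didn't want to maintain this list by hand, you could also generate it programmatically. Below is a minimal Python sketch using boto3 that lists the object keys under an S3 prefix; the bucket name and prefix are placeholder assumptions rather than values from the actual project.

  # Minimal sketch: build the candidate file list from S3 using boto3.
  # The bucket name and prefix are placeholders, not the project's real values.
  import boto3

  def list_s3_keys(bucket, prefix):
      """Return all object keys under the given prefix."""
      s3 = boto3.client("s3")
      keys = []
      for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
          keys.extend(obj["Key"] for obj in page.get("Contents", []))
      return keys

  if __name__ == "__main__":
      for key in list_s3_keys("my-landing-bucket", "initial-load/"):
          print(key)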

One thing I really like about Matillion is, on the face of it, a very simple feature, but in day-to-day use it makes the tool much more productive and efficient to work with.

It's called 'text mode'. Instead of clicking the green plus button to add a new row and then typing in the text manually for each and every filename, I can just click text mode.

This switches the dialog box to a notepad-style page - meaning I can copy and paste a list of values directly onto the page. Unchecking text mode switches the window back to the grid with all the individual values populated on rows.

[Image: Using text mode - a simple yet great feature!]

Step 3. Iterate over the list and load the data into Snowflake

The fixed iterator component iterates over the file names provided, and each time it picks one up it calls the orchestration job in the image below. This orchestration job takes the filename variable value and loads the file into Snowflake. The S3 Load component takes a folder prefix where the S3 files are sitting. You can then specify a pattern for the files.

[Image: The orchestration job leveraging variables to extract the files from S3 and load them into Snowflake]

Here I'm taking a directory name variable, along with the file name which has been passed in, and adding a wildcard which will pick up any file that matches the pattern.

I've also got my target table set to the same file name variable value which is getting passed into this job. This results in Matillion loading the file into a target table (in Snowflake) with the exact same name as the file.

[Image: Using a variable value for the target table]

If you wanted to name your target table differently, you could create another variable based on the file name and then add expressions to derive a table name. In this case I'm just using the file name as is, which provides a one-to-one mapping between the S3 file and the target table in Snowflake that I'm going to load the data into.
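To illustrate that idea, here's a tiny, hypothetical Python helper showing the kind of rule such an expression might implement; the naming convention (strip the extension, replace dashes, uppercase) is purely an assumption for the sketch.

  # Hypothetical rule for deriving a target table name from a file name.
  # Stripping the extension, replacing dashes and uppercasing are illustrative
  # assumptions, not the convention used on the actual project.
  import os

  def table_name_for(file_name):
      base, _ = os.path.splitext(file_name)
      return base.replace("-", "_").upper()

  print(table_name_for("customer-orders.csv"))  # prints CUSTOMER_ORDERS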

To briefly recap step 3: for every single file I pick up, I execute an orchestration task. This task loads the file directly from S3 into Snowflake, effectively ingesting that data into a landing area in Snowflake.

Please note, there is also a range of options associated with the S3 Load component. For example, you can manage how you want to handle errors, as well as options such as completely truncating the target table in Snowflake or forcibly loading the data (Snowflake knows what data has been loaded in the past 64 days and won't load the same file again unless you override this default behavior).
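For readers who prefer to see the moving parts spelled out, the sketch below shows a rough, hand-rolled equivalent of the iterator plus S3 Load step using the Snowflake Python connector and a COPY INTO per file. It is not what Matillion generates; the external stage, file format, file names and connection details are all assumptions for illustration.

  # Rough hand-rolled equivalent of the fixed iterator + S3 Load component.
  # Assumes an external stage (@landing_stage) already points at the S3 bucket;
  # the stage, file format, file names and connection details are illustrative only.
  import snowflake.connector

  FILE_NAMES = ["customer_orders", "products", "stores"]  # the fixed iterator's list

  conn = snowflake.connector.connect(
      account="my_account", user="my_user", password="***",
      warehouse="LOAD_WH", database="LANDING", schema="RAW",
  )
  try:
      cur = conn.cursor()
      for name in FILE_NAMES:
          # One-to-one mapping: the file name doubles as the target table name.
          cur.execute(f"""
              COPY INTO {name}
              FROM @landing_stage/initial-load/
              PATTERN = '.*{name}.*'
              FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
              FORCE = FALSE  -- rely on Snowflake's 64-day load history to skip reloads
          """)
  finally:
      conn.close()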

Step 4. Write out operational audit metadata

Finally, I call a transformation job which populates an audit table. It contains a fixed flow component, which allows you to generate rows of fixed input (think of a tabular structure) or input from variables. In this case I am using the audit parameter values provided out of the box by Matillion (see image below).

[Image: System variables making up the fixed flow component properties]

Next I write this fixed flow out to an ETL audit table by mapping the fixed flow fields to the corresponding columns in the target table. This process captures metrics such as when the component ran, the status (e.g. success, error or running), duration and row count.

[Image: Mapping to the target table]

This sets up the basic, generic and repeatable pattern to capture and write operational metadata to a target table which can be extended over time.
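If you were recreating this step outside Matillion, it boils down to an insert into an audit table at the end of each run. The Python sketch below is illustrative only; the audit table name and columns are assumptions standing in for Matillion's built-in audit variables.

  # Hypothetical audit write mirroring the fixed flow -> audit table mapping.
  # The table and column names are assumptions for the sketch.
  from datetime import datetime, timezone
  import snowflake.connector

  def write_audit(conn, job_name, status, started_at, row_count):
      conn.cursor().execute(
          """
          INSERT INTO etl_audit (job_name, status, started_at, finished_at, row_count)
          VALUES (%s, %s, %s, %s, %s)
          """,
          (job_name, status, started_at, datetime.now(timezone.utc), row_count),
      )

In Matillion itself none of this is hand-written, of course; the fixed flow and table output components handle the mapping for you.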

Summary (and bonus note on concurrency)

A quick recap of what we've put together so far: we first run the initial load fixed iterator job. This has a fixed iterator picking up a list of files I expect to see in S3. For each file name we trigger a job which moves the file from S3 into a target table in Snowflake, before triggering the populate audit table job.

Before I wrap up I wanted to touch on the concurrency point I mentioned earlier in this article. I can run this job sequentially - so each filename one by one - or I can run them concurrently based upon how many threads I have on the underlying EC2 instance Matillion is running on.

Here I have selected the concurrent option. This is because I know there are no dependencies between these individual files, and the aim is to get all of them from S3 into Snowflake as quickly as possible, leveraging the concurrency of Snowflake's virtual warehouses.
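Conceptually, the concurrent setting behaves like fanning the per-file loads out over a pool of workers. Here's a hedged Python sketch of that idea using concurrent.futures; load_one_file is a hypothetical stand-in for the per-file orchestration job.

  # Conceptual sketch of the concurrent iterator: independent files loaded in parallel.
  # load_one_file is a hypothetical stand-in for the per-file orchestration job.
  from concurrent.futures import ThreadPoolExecutor, as_completed

  def load_one_file(name):
      # ...issue the COPY INTO for this file (see the earlier sketch)...
      return name

  file_names = ["customer_orders", "products", "stores"]

  with ThreadPoolExecutor(max_workers=4) as pool:  # roughly analogous to the iterator's threads
      futures = {pool.submit(load_one_file, n): n for n in file_names}
      for future in as_completed(futures):
          print(f"Loaded {future.result()}")

In practice Matillion manages this for you; the sketch is only meant to show why independent files make concurrent loading safe.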

I hope this pattern gives you a good indication of where to start and breaks down the fundamentals to kick-start your journey working with Matillion to ingest data from any cloud storage bucket (AWS, Microsoft or Google) into a supported target data platform. Although I have focused on AWS S3 and Snowflake, the fundamental approach remains very similar regardless of your cloud storage location and target database.

Don't forget: if you're looking to validate your Matillion skills and get your Matillion Associate Certification fast, look no further than my top-rated practice exams.

To stay up to date with the latest business and tech trends in data and analytics, make sure to subscribe to my newsletter, follow me on LinkedIn and YouTube, and, if you're interested in taking a deeper dive into Snowflake, check out my books 'Mastering Snowflake Solutions' and 'SnowPro Core Certification Study Guide'.

------------------------------------------------------------------------------------------------------------

About Adam Morton

Adam Morton is an experienced data leader and author in the field of data and analytics with a passion for delivering tangible business value. Over the past two decades Adam has accumulated a wealth of valuable, real-world experience designing and implementing enterprise-wide data strategies, advanced data and analytics solutions as well as building high-performing data teams across the UK, Europe, and Australia.

Adam’s continued commitment to the data and analytics community has seen him formally recognised as an international leader in his field when he was awarded a Global Talent Visa by the Australian Government in 2019.

Today, Adam works in partnership with Intelligen Group, a Snowflake pureplay data and analytics consultancy based in Sydney, Australia. He is dedicated to helping his clients to overcome challenges with data while extracting the most value from their data and analytics implementations.

He has also developed a signature training program that includes an intensive online curriculum, weekly live consulting Q&A calls with Adam, and an exclusive mastermind of supportive data and analytics professionals helping you to become an expert in Snowflake. If you’re interested in finding out more, visit www.masteringsnowflake.com.
