All the S#*!@(stuff) you CAN DO with Snowflake?

This is the sequel to my previous article about all the S#*!@(stuff) you no longer have to do with Snowflake thanks to its automated & intelligent features that require near-zero management. Now it is time to talk about all the S#*!@ you can do with it to enhance your business in everything related to data.

Most people know Snowflake as the leading cloud-native data warehouse: highly scalable and the one everyone is talking about. That is an awesome use case in itself. You can take your traditional data warehouse, move it to any cloud provider, and scale it to do things with your data you couldn't even imagine before. Essentially, you move from a world where everything was confined, limited & had to be pre-sized and scaled ahead of time to a cloud world where you are no longer burdened by limits on the volume or velocity of your data or the compute resources needed to process it, because you get access to virtually unlimited compute power & storage that you can scale up or down within seconds. Yet it is super-easy, super-secure & doesn't require a rocket scientist to operate.

Usually, people expect Snowflake to deliver a better & faster version of what they already had; however, the difference Snowflake makes for many organizations is eye-opening, to say the least.

The initial expectation is that this will be like giving up your old Nokia brick phone & ending up with a slightly newer Motorola flip phone (sorry Millennials, you may have to google these bad boys if you don't know them).


However, once you start using Snowflake, it turns out to be more like moving from an ancient Nokia brick phone to the latest iPhone or Samsung smartphone with all the apps, features, automation, performance & connectivity.

Just imagine how much your life changed after the introduction of smartphones, with apps for every need, always-on connectivity to the entire world & sheer ease of use. That's the kind of difference Snowflake can make to your business in terms of data & analytics.

So at this point, the real question is: what else can you do with Snowflake to help your business, other than just using it as a data warehouse?

The answer is A LOT!!!! Let's walk through how our customers are utilizing Snowflake beyond the traditional data warehouse role.

Data or feature engineering means different things to different people. It ranges from traditional ETL or ELT data pipelines for BI & reporting to feature engineering that collects & binds hundreds of attributes from various sources for data science & ML use cases. Regardless of what it means to you, the result is the same: this process takes a lot of compute resources to turn data from raw into analytics-ready. A mix of commercial & open-source ETL, ELT & Spark-based tools is common in these pipelines, & throughput is usually limited by how much data the tools can handle, the types of data they can handle & how fast they execute.

If you look into any typical business with a lot of data, you will usually see a mix of different tools & solutions for data engineering using SQL, some proprietary tech, or Spark to handle different types of data, with a different team and a different skill set running each one. The way these pipelines run is also highly dependent on these tools and their limitations. You'll likely see pipelines executing serially, running only a few jobs at a time due to compute limitations and creating temp datasets that are then copied to the next set of tools further down the pipeline. Structured data is usually handled by traditional ETL/ELT tools, while semi-structured data such as JSON, XML & Parquet is handled by more complex Spark-based tools. What you end up with is the same set of data hopping between multiple tools and being replicated multiple times on its way from raw to analytics-ready. Running multiple technologies serially takes a lot of time & is very error-prone, since a problem in any step can easily halt the entire process.

The end result is that business users get only some of the data they need, not at the speed & frequency they want, and definitely not without many unexpected delays.

So how does Snowflake help? It can store, process & query both structured and semi-structured data natively, which means one platform handles both types. We do all of this using simple SQL, which everyone already knows, so any SQL-based ETL/ELT tool will do & there is no more need for a complex set of tools requiring rocket scientists to operate them. On top of this, we give you instant access to virtually unlimited compute resources that you can start within seconds and keep active only for as long as you need them and not a second more. This is huge: not everyone can afford to spend millions running a 96-node Hadoop cluster 24x7 to process massive amounts of data, but most Snowflake customers can easily afford a 128-node 3XL warehouse that runs for only 97 seconds, twice a day, just enough to get through that big ETL job, for less than $14 a day. Just imagine what you could do with this much power for your business if you didn't have to spend millions.
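
To make "compute for seconds, not hours" concrete, here is a minimal sketch of spinning up a dedicated warehouse that suspends itself the moment it goes idle; the warehouse name and size are purely illustrative, not a recommendation.

CREATE WAREHOUSE etl_wh                  -- hypothetical warehouse for the big ETL job
  WITH WAREHOUSE_SIZE = 'XXXLARGE'       -- pick whatever size fits the job
       AUTO_SUSPEND = 60                 -- suspend after 60 seconds of inactivity
       AUTO_RESUME = TRUE                -- wake up automatically when the next query arrives
       INITIALLY_SUSPENDED = TRUE;       -- no billing until it actually runs something

USE WAREHOUSE etl_wh;                    -- run the big job here; billing stops once the warehouse suspends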

So what does it all mean for your business? It means you can deliver as much data as the business wants & as fast as they need it. You can deliver data arriving in any format & you can deliver it far more frequently than ever before. That is because you can create as many independent compute clusters (virtual warehouses) as needed to run all your jobs in parallel, without replicating result sets between steps, since Snowflake lets any number of independent clusters read & write the same copy of the data simultaneously. And because pipelines are much quicker, much more reliable & no longer take resources away from BI & reporting, you can deliver data many times throughout the day instead of in one nightly job. On top of running everything in parallel on multiple clusters, you can size each cluster independently, from 1 node (XS) to 256 nodes (4XL), to increase processing speed per workload. That essentially lets you complete the same job in 12 minutes on an XSmall (1 node) cluster, in half the time, 6 minutes, on a Small (2 nodes), in 3 minutes on a Medium (4 nodes), or even faster on larger sizes. Yet the most awesome part is that they all cost the same!!! Billing is simply the cluster size multiplied by the number of seconds it runs:

  • 1 node x 12 mins = 12 node-minutes
  • 2 nodes x 6 mins = 12 node-minutes
  • 4 nodes x 3 mins = 12 node-minutes
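
As a rough illustration of the workload-isolation point above, here is a minimal sketch of giving ETL and BI their own independently sized clusters that work against the same data at the same time; the warehouse names and sizes are only examples.

CREATE WAREHOUSE transform_wh WITH WAREHOUSE_SIZE = 'LARGE' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;    -- heavy transformations
CREATE WAREHOUSE reporting_wh WITH WAREHOUSE_SIZE = 'SMALL' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;   -- dashboards & reports

-- Both warehouses can read & write the same tables simultaneously,
-- so a long ETL job on TRANSFORM_WH never slows down the reports running on REPORTING_WH.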

Essentially, Snowflake is a super-simple, much faster & more reliable way to run data engineering workloads than anything else, because it natively handles both structured & semi-structured data using nothing but simple SQL. It makes such a difference that we have many customers processing billions of rows of data whose ETL windows went from weekly jobs running for hours, sometimes days, down to hours or minutes, running numerous times a day on Snowflake. Yet their costs stayed roughly the same, or more often went down. In the cases where costs went up, it was mainly because customers realized what they could do & decided to process far more data than they previously did. So if you haven't tried Snowflake for data engineering yet, it is time to give it a try. If you don't believe me, go find the biggest job you've got & try it on Snowflake.

Data lakes are a common occurrence when the volume, velocity & variety of data is high enough that a staging area is needed as a buffer before doing analytics. After talking to many customers, these are the common reasons that keep coming up for using a data lake:

  • We have so many incoming data feeds that we don't have the bandwidth or resources to handle all the data & deliver it to the business in a timely manner.
  • We have a ton of semi-structured data (JSON, Parquet, XML, etc.) that we can't load directly into our warehouse without preprocessing it, or the data scientists want access to the raw data because they don't always know which attributes they need and the attributes in those files may change over time.
  • The volume of data is so big (often with IoT) that we just can't possibly put all of it into our data warehouse.

Regardless of the reason, data lakes often turn into giant data trash cans: huge amounts of data stored as thousands of files within thousands of folders, with many duplicates because users didn't know the data was already in the lake. Hardly anyone knows what is in the lake, where things are, or how to use it, especially on the business side. And that is assuming you have figured out data security & governance within the lake at all. Trying to apply folder- or file-level security on a giant hard drive across an enterprise is pretty much impossible. This is why data lakes are typically limited to a few highly skilled individuals & not rolled out to business users across the enterprise.

So how do we help? Snowflake can handle both structured and semi-structured data natively using nothing but simple SQL, which probably covers 99% of the data you usually find in data lakes. We not only can query semi-structured formats, we can also ingest & store them in their raw form using our VARIANT column type. This means you can ingest billions of JSON nodes, including inner arrays, into a Snowflake table and still keep the original JSON structure in that table. Once in Snowflake, you can access massive amounts of JSON data using plain SQL at speeds comparable to data warehouse workloads. After all, Snowflake is built to handle PB-scale datasets whether the data is structured or semi-structured. You also get a fully secured, encrypted & governed way to access this data, with a role-based access model and end-to-end encryption built into the platform. Add the instantly scalable, independent compute clusters that you can use to query & transform this data, or even write it back to external blob stores in various formats, and you have no reason to put up with a traditional file-based data lake and all of its problems. Snowflake essentially gives you a fully usable & secure data lake based on ANSI SQL that any business user, data engineer or data scientist can work with through their favorite tool of choice.
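
As a minimal sketch of what that looks like in practice, raw JSON can be landed in a VARIANT column and then queried with plain SQL, including drilling into nested arrays; the table, stage & field names below are made up for illustration.

CREATE TABLE raw_events (v VARIANT);              -- one column holds each JSON document as-is

COPY INTO raw_events
  FROM @json_stage                                -- hypothetical stage pointing at the raw JSON files
  FILE_FORMAT = (TYPE = 'JSON');

SELECT v:device.id::STRING          AS device_id, -- dot-path straight into the JSON
       r.value:temperature::FLOAT   AS temperature
FROM   raw_events,
       LATERAL FLATTEN(input => v:readings) r;    -- explode the inner array into rows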

  • But what if you already have a data lake? In that case, we can augment your lake to make it more usable & secure. You can leave the existing data in your lake and use Snowflake as a query layer to expose it to your users via SQL, which also brings fine-grained, role-based security into the picture. You can also use Snowflake as the ETL/ELT engine for your lake: read datasets from the lake, transform them with our massively scalable compute much faster than your current methods, and write the results back to the lake in a partitioned fashion in any structured or semi-structured format (see the sketch below).
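
Here is a rough sketch of that augmentation pattern, assuming a storage integration to the lake has already been set up; the stage, table & column names are illustrative only, and the exact options may differ in your environment.

CREATE STAGE lake_stage
  URL = 's3://my-data-lake/'                      -- hypothetical bucket holding the existing lake
  STORAGE_INTEGRATION = lake_integration;         -- assumed to exist already

CREATE EXTERNAL TABLE lake_events
  WITH LOCATION = @lake_stage/events/
  AUTO_REFRESH = FALSE                            -- refresh metadata manually in this sketch
  FILE_FORMAT = (TYPE = 'PARQUET');               -- query the files in place, no loading required

COPY INTO @lake_stage/curated/daily_summary/      -- write transformed results back to the lake
  FROM (SELECT value:event_date::DATE AS event_date, COUNT(*) AS events
        FROM lake_events GROUP BY 1)
  PARTITION BY ('dt=' || TO_VARCHAR(event_date))  -- partitioned output folders
  FILE_FORMAT = (TYPE = 'PARQUET') HEADER = TRUE;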

These are the three main ways Snowflake can be used in the context of a data lake, each with its own benefits:

  • Augment an existing lake as a SQL query engine.
  • Augment an existing lake as a SQL query engine + Data Processing engine
  • Be the data lake by replacing an existing lake.

Data science & machine learning is another big area where Snowflake can make a big impact. In my experience, most data scientists spend their time roughly as follows:

  • Data Exploration, which is roaming through various datasets from different sources to see what each one contains and to find valuable attributes (40%)
  • Data Engineering, which is compiling data found in various sources & formats into singular tables that combine all of the attributes for a given subject (30%)
  • Data Cleansing, which is normalizing & cleaning up bad data into a standardized set of values (20%)
  • Creating the actual Data Science & Machine Learning algorithms (10%)
  • Tweak, Rinse & Repeat until things work as expected.

As you can see, data science is more about finding, wrangling, cleaning & combining many different datasets in many formats than about the actual cool DS & ML work people picture when they hear these terms. A typical workflow is to store all of these datasets in a data lake, because the semi-structured datasets that are so common in ML make it difficult to use a data warehouse as the primary source. And because everything sits in files in semi-structured formats, simple SQL is of limited use and users are forced into much more complex Spark- or Python-based technologies for their data engineering work. Once the data is ready, they finally read it into Pandas data frames in their ML notebooks and start experimenting & building. Any complex work also runs either on their own machines or on Hadoop-based clusters with a limited amount of compute; if the datasets get too large, they hit the RUN button and then have to go find a new hobby to kill a couple of hours, if not more. It is also not uncommon to see DS users waiting in queues or being given weekly or daily time slots to run their heavy jobs so they don't affect other business users. Courtesy of sharing limited resources with others.

How do we help? Snowflake can act as a SQL query engine that provides an easy & fast way to query the data files (structured & semi-structured) already sitting in your existing lake. Snowflake can also store those exact files in its internal tables in their native formats and be the data lake, which gives you even faster query performance plus the zero-copy clone feature that DS teams can use to create test & dev copies of massive tables instantly, within seconds (meaning they can clone production-size tables for tests instead of smaller cuts). Snowflake also provides massive amounts of compute that can be created and dedicated to ML use cases, so data scientists no longer have to share with others, and these clusters can be scaled up or down in seconds based on the current need.
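
For instance, a minimal sketch of the zero-copy clone idea; the database, schema & table names are made up.

CREATE TABLE features_dev CLONE analytics.ml.features;   -- instant, full-size dev copy; no data is physically duplicated

-- or clone a whole schema as an experiment sandbox
CREATE SCHEMA ml_sandbox CLONE analytics.ml;              -- storage is only consumed as the clone diverges from the original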

What this means for a data scientist is that they can open up their ML notebook, connect to their Snowflake account using Python, ODBC, JDBC, etc., and start querying the raw data files stored in tables using nothing but SQL. They can use SQL not just to query the data but also to perform feature engineering & cleansing tasks. The beauty of using Snowflake for these tasks is that the compute happens on Snowflake clusters, which means the user has full control over the size of the cluster. What used to take hours on a shared Hadoop-based resource using complex code can be done in minutes by simply resizing the cluster to a bigger size (2, 4, 8, 16, 32, 64, 128 & 256 nodes) instantly, and once the task completes, they can instantly shrink or pause it to stop incurring charges. This lets the team run on much bigger clusters for short periods of time, finish their work much faster & rinse-and-repeat many more times to increase accuracy. Essentially it is as easy as:

ALTER WAREHOUSE ML_WAREHOUSE SET warehouse_size = XLARGE;   -- scale up for the heavy step

-- ... run the large & complex UPDATE, INSERT or SELECT here ...

ALTER WAREHOUSE ML_WAREHOUSE SET warehouse_size = XSMALL;   -- shrink back down to 1 node

Since our drivers can be used by just about any ML or DS tool that speaks SQL, it is very easy to use Snowflake to perform computationally complex tasks very fast, then query the results and load them into familiar things like Pandas data frames to do the actual ML & DS work.

To summarize, the main benefits are instant access to massive amounts of structured & semi-structured data using simple ANSI SQL, dedicated & instantly scalable compute resources to complete complex tasks quickly, and seamless integration with pretty much any ML & DS tool so you can use the best-of-breed tools you prefer. The end result is that you can do much more in much less time and with much less effort, without having to wait or share resources with everyone else.

There you go!!! If you thought Snowflake was just a new shiny data warehouse, I just corrected that for you.

Snowflake is in fact a complete data platform that can not only solve many of your data-related challenges but actually vastly improve your workloads ranging from data engineering & data lakes to data warehousing & data science.
