What makes the Snowflake platform so damn cool?

At this point, if you work in the data technology space and haven't heard of Snowflake, you are either living under a rock or all your technology news feeds have been going to your spam folder for the last 3 to 5 years. (In that case, you should really fix that!)

Snowflake is the talk of the entire data market for a reason, and not because it is just another piece of hype. So, what makes Snowflake the poster child of data technology? Why does every vendor in the BI & Analytics space want to partner with it, & why does everyone who deals with SQL & data want to learn & work with it?

To see what makes Snowflake really special, you first have to understand how everything else operates. When I say everything else, I mean any solution in the market that serves data using SQL, which is the relational stuff you have been working with for the past 30+ years.

Regardless of which big RDBMS vendor you have used or heard of before, they all pretty much operate the same way. Essentially, all of your workloads (ingestion, ETL, BI, reporting, data science, etc.) share a single compute cluster. Different vendors use different methods to operate & scale this cluster, such as shared-disk or shared-nothing, but in the end the resulting solution is the same: all the workloads still end up having to share the same single compute cluster. Let me show you what I mean...

This is what a traditional data warehouse looks like regardless of which vendor you choose & whether it is an on-prem or cloud-based solution: a cluster of compute resources with a fixed amount of computing power and some form of storage, designed to support all of your data workloads.

The real fun starts when all your different workloads try to connect to it & run queries (select, insert, update, etc.) simultaneously. To manage this scenario, you have to split the total computing power between these workloads depending on the importance of each one.

This can be done either with an automated workload management feature (which doesn't always prioritize things well) or by manually splitting the total compute power between the different workloads. However, this doesn't change the fact that you only have a fixed amount of computing power to work with.

Using workload management, you can always assign more compute resources to certain workloads, but this also means you have to take resources away from other workloads at the same time.
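To make this concrete, here is a rough sketch of what carving up a fixed cluster looks like in one traditional system, using SQL Server's Resource Governor as an example. The pool names & percentages are made up purely for illustration (a classifier function, not shown, would route each session to its group), and other vendors have their own equivalents:

    -- Hypothetical split of one fixed cluster between three workloads.
    -- Whatever you hand to ETL here is capacity that BI & data science can no longer use.
    CREATE RESOURCE POOL etl_pool WITH (MAX_CPU_PERCENT = 40, MAX_MEMORY_PERCENT = 40);
    CREATE RESOURCE POOL bi_pool  WITH (MAX_CPU_PERCENT = 40, MAX_MEMORY_PERCENT = 40);
    CREATE RESOURCE POOL ds_pool  WITH (MAX_CPU_PERCENT = 20, MAX_MEMORY_PERCENT = 20);

    CREATE WORKLOAD GROUP etl_group USING etl_pool;
    CREATE WORKLOAD GROUP bi_group  USING bi_pool;
    CREATE WORKLOAD GROUP ds_group  USING ds_pool;

    ALTER RESOURCE GOVERNOR RECONFIGURE;

No matter how you shuffle these percentages around, they always add up to the same fixed cluster.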

This is the reason why big, complex ETL jobs usually run once a night in batch fashion, so the business users are not kicking & screaming during the day.

Remember!!! You can manage the workloads all you want, but the total size of the cluster does not change unless you scale up the whole thing.

So what happens when one of the workloads grows just enough that the total requirements slightly exceed the total capacity?

This is where Snowflake & traditional solutions start to part ways. In traditional solutions, whether they are one of the big cloud data warehousing products or their on-prem counterparts, there is only one way to fix the problem, and that is to scale UP.

I use the term UP loosely because some cloud vendors will literally scale UP by adding more compute to the cluster and then re-distributing the data, while others will scale horizontally by adding additional clusters and then replicating the entire dataset to each one. The goal is to add more horsepower for faster parallel processing of the data. In the end, regardless of how they scale, you still end up with the same result.

A bigger cluster with more horsepower & sometimes more storage.

In most cases, each round of scaling up ends up doubling the horsepower & your total running costs $$$ at the same time. Once scaled up, it is usually a permanent change that stays that way forever.

As a result, you get stuck paying for a bigger & more expensive cluster just to handle brief periods of peak usage. Some or most of that newly acquired capacity is not needed the rest of the time and is a waste of money for the duration of your reserved instance (1 to 3 years).

Some cloud vendors try to remedy this problem by scaling up either automatically or by giving you the option to do it manually on-demand. However, most solutions are either too slow to react, won't scale back down on their own, or have a scaling process that is business-disruptive, meaning it stops all running queries so you can't scale while users are on the system.

What usually happens in the end is that most IT departments do capacity planning ahead of time, guesstimate the max peak usage, and reserve a big enough cluster to run 24x7 so that disruptive & slow scaling is not needed.

Another big problem with the traditional approach is that workload demands are never constant and at times require far more horsepower than they normally utilize. Data science & ETL are prime examples, where you may see major spikes in compute needs that are multiples of the regular usage pattern because a data scientist just decided to run a very complex query on the last 5 years of data to train a model, or you just received a request to ingest a monster-sized data source & clean it up by the end of the week for a major project.

And who can forget concurrency... Your organization acquires a new company with 200 more sales reps & 30 more analysts who all need access to your data warehouse, mostly on Monday mornings. Out of nowhere, you get three times the queries you planned for.

Now what? Do you cut down the data ingestion frequency & the resources for the data science team, or do you scale up & pay more for capacity that won't be utilized most of the time?

These are the exact challenges most organizations face whether they use on-prem or cloud-based data warehouses, and these are the exact challenges that made Snowflake the poster child of the data management space.

You need to scale up when you need it, scale down when you don't, and stop paying for things when no one is querying anything. But this alone is not enough, because no two workloads are the same and each has distinct compute requirements, so a one-size-fits-all approach is not good enough. You need to be able to adjust & scale compute power for each workload separately, so that one workload does not mess with another. No resource contention.

On top of that, each workload needs to scale up & down in an instant for faster performance while users are still on it, and it also needs to scale out horizontally on its own to handle thousands of users logging in within seconds, then scale back in right away when they all log off. If nothing is running, why should you have to pay for it? It should automatically shut itself off when idle, and when a user runs a query, it should start back up so fast that the user doesn't even notice the servers were not running when the query was triggered.
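For what it is worth, that wish list maps almost one-to-one onto the settings of a Snowflake virtual warehouse. Here is a minimal sketch in Snowflake SQL, with a made-up warehouse name:

    -- Hypothetical BI warehouse that costs nothing while idle.
    CREATE WAREHOUSE IF NOT EXISTS bi_wh
      WAREHOUSE_SIZE = 'MEDIUM'
      AUTO_SUSPEND = 60              -- suspend after 60 seconds of no activity
      AUTO_RESUME = TRUE             -- wake up automatically when the next query arrives
      INITIALLY_SUSPENDED = TRUE;    -- don't start billing until someone actually runs a query

    -- Resize on the fly: queries already running finish on the old size,
    -- new queries immediately pick up the extra horsepower.
    ALTER WAREHOUSE bi_wh SET WAREHOUSE_SIZE = 'XLARGE';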

Looking at the way Snowflake handles workloads gives us a completely different picture. We see separate compute resources dedicated to each workload, yet they all access the same single copy of the data simultaneously for both read & write operations. Can you say goodbye to data silos & data governance headaches? No more making copies of the data & trying to keep everything in sync like you would with data marts.

Being able to assign different pools of independent compute clusters simultaneously to the exact same single copy of the data, for both read & write operations, is at the heart of how Snowflake does things differently from anything else in the market.
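As a rough sketch (the warehouse, database & table names are hypothetical), each team simply points its sessions at its own warehouse while querying the very same tables:

    -- One independent compute cluster per workload; none of them keep their own copy of the data.
    CREATE WAREHOUSE IF NOT EXISTS etl_wh WAREHOUSE_SIZE = 'XLARGE' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;
    CREATE WAREHOUSE IF NOT EXISTS ds_wh  WAREHOUSE_SIZE = 'LARGE'  AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;

    -- An ETL session writes to the orders table...
    USE WAREHOUSE etl_wh;
    INSERT INTO sales.public.orders SELECT * FROM sales.staging.orders_raw;

    -- ...while a data science session reads the same table at the same time,
    -- on completely separate compute, with zero contention between the two.
    USE WAREHOUSE ds_wh;
    SELECT customer_id, SUM(amount) FROM sales.public.orders GROUP BY customer_id;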

Remember this well, as it is the main reason you can do many awesome things with data, such as data sharing, which I will cover in my next article.

During the night, your BI & reporting needs may be very basic, but you need to run large, complex ETL jobs while an offshore data science group pecks away at the data.

During the daytime on weekdays, your BI users are running complex queries from their Tableau & Power BI dashboards, ETL activity is much lighter, and the data science group starts running queries against bigger datasets.

And what about the Monday morning rush? All 2,000 users rush to their dashboards at the same time, and Snowflake automatically spins up multiple sets of BI compute clusters to handle the concurrency. At the same time, since you have not processed any data on Sunday, no ETL jobs are running, so the ETL virtual warehouse pauses within minutes of being idle and stops incurring any charges. And there is only minimal usage on the data science warehouse, because that group has its weekly meeting in the morning.
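That Monday-morning behavior is what Snowflake calls a multi-cluster warehouse. A hedged sketch, again with a made-up warehouse name and cluster counts (this feature requires Enterprise edition or above):

    -- Let the BI warehouse fan out to as many as 10 identical clusters when the
    -- Monday rush hits, then shrink back to 1 (or suspend entirely) when it passes.
    ALTER WAREHOUSE bi_wh SET
      MIN_CLUSTER_COUNT = 1
      MAX_CLUSTER_COUNT = 10
      SCALING_POLICY = 'STANDARD';   -- favor starting extra clusters quickly over conserving credits

No data gets copied when the extra clusters start; they all read the same single copy of the data.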

This is basically what makes Snowflake the leading data platform and replacement for on-prem and cloud-based traditional solutions across thousands of large & small companies.

Doing all of these things in a mostly automated manner with near-zero administration is the name of the game when it comes to Snowflake.

But wait, there is more... Did I mention Snowflake also doesn't care which cloud vendor you use?

Azure, AWS, Google: we just don't care. Do you have a multi-cloud strategy where you don't want to be locked into a single cloud provider, or are you running a global operation where you need your data accessible via different cloud providers around the globe?

Snowflake is the only Cloud Data Platform where your data can be automatically synchronized around the globe & across different cloud vendors & regions, giving you the ultimate flexibility.
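Under the hood, this is Snowflake's database replication feature. A minimal sketch, assuming hypothetical account names inside an organization called myorg and that replication has already been enabled for those accounts:

    -- On the source account (say, AWS in the US):
    -- allow the sales database to be replicated to an Azure account in Europe.
    ALTER DATABASE sales ENABLE REPLICATION TO ACCOUNTS myorg.azure_eu_account;

    -- On the target Azure account: create a local replica & pull the latest changes.
    CREATE DATABASE sales AS REPLICA OF myorg.aws_us_account.sales;
    ALTER DATABASE sales REFRESH;    -- usually scheduled (e.g. via a task) to keep the replica in sync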

After all, data is global & should never be locked into a single cloud provider. With Snowflake, data can reside across all the clouds, where it can be consumed & shared with your internal users, customers & partners regardless of which cloud they might be using.

Well, this is Snowflake 101 for you. If you didn't know, now you know.

Spread the word & don't forget to give this article a LIKE if it was helpful so all your LinkedIn buddies can see it as well.
