Snowflake: The cloud data warehouse solution with no modeling

In this article, I am going to talk about the cloud based data warehouse solution Snowflake. I will deep dive into some of the good features, cons and pros. I had opportunity to work with Snowflake. Therefore, I will try to give some useful insights and share my experiences.

Snowflake comes from the family of massive parallel processing (MPP) databases. It runs completely on a cloud infrastructure. You can choose Snowflake account hosted on cloud platform of your choice like AWS or GCP or Azure. Snowflake uses virtual compute instances for its compute needs and a storage service for persistent storage of data. It cannot be run on private cloud infrastructures (on-premises or hosted). It is not a packaged software offering that can be installed by a user. Also, Snowflake manages all aspects of software installation and updates.

That means Snowflake is fully managed and server less service. You don't need to worry about software, hardware and servers. Everything is taken care by Snowflake for you.

We talked high level about Snowflake. Now lets deep dive into some of the features/cons:

  1. Performance: Snowflake performance is good. You can run query on millions of records or huge table size without worrying much about performance. Based on your need, you can choose a virtual warehouse ((This is cluster of compute resources such as CPU, Memory, Temporary storage). Virtual warehouse comes in various size like x-small, small, medium, large, x-large, xx-large. Point here is that if you choose appropriate warehouse configuration, you can get good performance. Configuring the warehouse is not a difficult task. It is just about few parameters (cluster, size, monitor, resume ,suspend, scaling policy ) and you are done. You do have to consider some of the factors for deciding warehouse configuration like what is the data volume you are processing, what is the turn around time you are looking for query completion and what are the various joins you are using in the query etc.

2. Modeling: Snowflake has very less or no modeling. You don't need to define the primary keys or partition keys or indexes. This is one of the good feature of Snowflake. When you work with the MPP databases such as Redshift, Greenplum or Teradata, You need to spend some time on data profiling and thinking which columns to choose for index, data distribution, and partition. Even in case of BigQuery, you need to choose partition columns (for large tables) to enhance the performance of queries. But with Snowflake, you don't need to worry for these at all. Snowflake does micro partitioning for huge tables automatically and adjust those parameters over the time based on the query/usage pattern. This feature makes Snowflake different than other MPP databases. Over the time, same query performance may get better. Also, Snowflake has a clustering (for co-locating data) feature and if you want, use it for very huge tables (again based on the usage pattern).

3. Staging area: Snowflake has internal and external stages. The external stages are meant to access data outside to Snowflake. If you want to access S3 data, just create an external stage pointing to your S3 data lake folders, and access it via SQL in Snowflake. No need to move data to Snowflake. This feature is very useful if you are loading data from the S3 data lake. This brings your data lake virtually into the Snowflake data warehouse. Internal stages are very useful to load flat files directly into Snowflake. You can stage/upload those files into internal stage and copy data to your tables. Also, once data is loaded into the internal stage, you can query it without loading to a table. These stages are very useful when you load data from external systems or perform historical load or want to access the data lake built on a different cloud platform.

4. Disaster Recovery: Snowflake keeps backup of your data for certain period of time (default is 1 day and you can extend it up to 90 days). If you made any mistake while working with the insert, update or delete, no worries. You can restore a specific point in time data by writing a simple select query. This is called time travel and fail safe. This is very useful from disaster and recovery perspective. You can run your DML without any worry as in case of mistakes, you will be able to recover the data quickly.

5. Metadata and Audit Log: Snowflake provides very nice metadata and Audit log features. Metadata information schema is similar to Postgres. You will be able to get all info about the tables, columns, keys by querying Information schema. Snowflake maintains historical and current queries audit log. You can find out which query has consumed how much snowflake credits and act upon them. Snowflake is use and pay model, hence this can be very useful if you want to fine tune some of your high credit consuming quires and reduce the cost.

Overall, Snowflake is one of the good cloud data warehouse which has good performance and takes away tables modeling worries.

There are few things which Snowflake can improve upon, or I see as cons:

  1. Eliminate need of creating virtual warehouses because sometimes you start thinking from cost perspective and that confuses about the warehouse configuration.

2. If your S3 data lake files are in parquet format which is one of the most preferred file format and you are accessing it via external stage, query output result is in JSON format. You need to apply parse json function and extract the required fields. This may be a challenging work for many end business users (if they want to access data lake from Snowflake) . Also, If data lake files are huge, performance is an issue. In that case you prefer to copy those files to Snowflake which leads to data duplication/redundancy.

3. Snowflake should give developers/architects ability to partition tables because sometimes they can choose better partition as they know data very well.

4. Snowflake is only a database solution and you need to buy other services ( ETL, Workflow orchestration/scheduling , Reporting, Notification, Monitoring ) separately. Many organizations preference are : stay on single platform, tightly coupled services, seamless integration and packaged deal. This seems one of the biggest challenge for Snowflake.

I will end my article here. What is your experience with Snowflake. Do share in the comment box. Happy learning and coding!!!

Tamanna Gupta

Data Leader Driving Business Transformation through Advanced Analytics and AI | Database Architecture | Data Mesh |Data Lake |Snowflake -Data Cloud Certified Architect

4 年

Hi Gopal, Partitioning is supported through external tables. I appreciate all the write up by you.

回复
Gopal Kumar Roy

Solution Architect

4 年

Thanks guys for reading and sharing!

回复
Sathish Ksheersagar

Data & Analytics | Leadership | Strategy & Road-map | Healthcare | Manufacturing | Architecture & Governance | Cloud Technology | Engineering | Data Science | Data Products | App & Platform SRE | Dev-ops | Automation

4 年

Good one Gopal ??

回复
Sushovon Sengupta

Vice President-Data Analytics at Citi, Expert in Data Engineering and Architecture

4 年

Good Article Gopal Kumar Roy

回复
Ghanshyam Varindani

Co-Owner & VP - Data Management at Servify

4 年

good read

回复

要查看或添加评论,请登录

Gopal Kumar Roy的更多文章

  • NoSQL versus SQL Database

    NoSQL versus SQL Database

    I have been working with SQL and MPP databases since very long time. After working so long, I learnt depth and breadth…

    2 条评论
  • AWS Data Analytics - Specialty exam preparation tips

    AWS Data Analytics - Specialty exam preparation tips

    Last week I passed the AWS Data Analytics - Specialty exam and thought of sharing some of the tips that can be very…

    3 条评论
  • Airflow: ETL Workflow Management Platform

    Airflow: ETL Workflow Management Platform

    Airflow is getting very popular for the ETL workflow management (It can be used for other kind of workflow management…

    2 条评论
  • Spark: The most popular big data processing framework

    Spark: The most popular big data processing framework

    Here is my another article related to big data and cloud technologies. In this article, I am going to talk about the…

    5 条评论
  • Why Python is top choice for Data Engineering

    Why Python is top choice for Data Engineering

    Python is one of the most popular programming language. Cloud, Big data and Machine Learning have made it very popular…

    1 条评论
  • Google's BigQuery: Strengths

    Google's BigQuery: Strengths

    Google's cloud offering GCP is increasing its footprint very rapidly. Specifically, GCP's data warehouse service…

    1 条评论
  • AWS Glue- Based on a data Engineer real life experience

    AWS Glue- Based on a data Engineer real life experience

    There is lot of buzz going around cloud technologies.Many organizations are moving to Cloud.

    6 条评论

社区洞察

其他会员也浏览了