Snowflake Questions asked to me in interviews

If you have been reading my recent stories, you probably know that I have been writing, reading, and learning about technology and Azure cloud computing.

I had the opportunity to work with Snowflake for a few years, and I want to list some of the interview questions I have been asked about it.

Please feel free to share the questions you have been asked in interviews in the comments section so they can help fellow readers in our LinkedIn community.

  1. What is a stored procedure and how does it work?

A stored procedure is a prepared SQL code that you can save, so the code can be reused over and over again.

So if you have an SQL query that you write over and over again, save it as a stored procedure, and then just call it to execute it.

You can also pass parameters to a stored procedure, so that the stored procedure can act based on the parameter value(s) that is passed.

The JavaScript API for stored procedures is similar to, but not identical to, the APIs in Snowflake connectors and drivers (Node.js, JDBC, Python, etc.).

The API enables you to perform operations such as:

  • Execute a SQL statement.
  • Retrieve the results of a query (i.e. a result set).
  • Retrieve metadata about the result set (number of columns, data types of the columns, etc.)
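As a minimal sketch of this API, the JavaScript procedure below executes a query, walks its result set, and reads result-set metadata. The table name MY_TABLE is a hypothetical placeholder.

```sql
-- Sketch of a JavaScript stored procedure using the stored procedure API.
-- MY_TABLE is a hypothetical table name.
CREATE OR REPLACE PROCEDURE row_count_sp()
RETURNS STRING
LANGUAGE JAVASCRIPT
AS
$$
  // Execute a SQL statement and retrieve its result set.
  var stmt = snowflake.createStatement({sqlText: "SELECT COUNT(*) FROM MY_TABLE"});
  var rs = stmt.execute();
  rs.next();
  // Read the first column of the first row, plus result-set metadata.
  return "Rows: " + rs.getColumnValue(1) + ", columns: " + stmt.getColumnCount();
$$;

CALL row_count_sp();
```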

To create a stored procedure we use the below syntax:

CREATE OR REPLACE PROCEDURE procedure_name()
RETURNS STRING
LANGUAGE SQL
AS
$$
BEGIN
  sql_statements;
  RETURN 'Success';
END;
$$;

To run/execute the stored procedure we use the CALL command (EXEC is SQL Server syntax; Snowflake uses CALL):

CALL procedure_name();

2. What do you mean by a task in Snowflake?

A task in Snowflake is a piece of SQL code that runs either when we execute it manually or on a schedule at specific intervals of time.

Snowflake tasks are schedulers that can run a single SQL query or stored procedure. When paired with streams to create an end-to-end data pipeline, a task can be quite beneficial.

Tasks can execute a single SQL statement or a stored procedure.

A task can also be scheduled to run at any instance of time. Currently, a task is not able to run multiple SQL statements.

Since a snowflake task can execute a stored procedure, we can add that stored procedure within that task and schedule that task to run at specific intervals of time.
CREATE OR REPLACE TASK task_name
WAREHOUSE = COMPUTE_WH
SCHEDULE = '1 MINUTE'
AS
CALL procedure_name();
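As a sketch of the stream-plus-task pipeline pattern mentioned above (all table, stream, and task names here are hypothetical):

```sql
-- Capture changes on a (hypothetical) landing table.
CREATE OR REPLACE STREAM orders_stream ON TABLE raw_orders;

-- Run every 5 minutes, but only when the stream actually has new rows.
CREATE OR REPLACE TASK load_orders_task
  WAREHOUSE = COMPUTE_WH
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('ORDERS_STREAM')
AS
  INSERT INTO orders_final SELECT * FROM orders_stream;

-- Tasks are created in a suspended state; resume to start the schedule.
ALTER TASK load_orders_task RESUME;
```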

3. What are analytical functions? Can you explain with an example?

Analytical (window) functions calculate an aggregated value over a specific set of rows (a window) rather than the entire dataset. Unlike aggregate functions such as SUM, COUNT, and AVG, which collapse rows into a single result per group, these functions return a value for every row. The most common uses are moving averages, running totals, and similar calculations. Snowflake supports the following analytic functions, among others.

a. CUME_DIST — Find the cumulative distribution of a numerical column

b. FIRST_VALUE — Finds the first value of a column from the group and prints the same for each row

c. LAST_VALUE — Finds the last value of a column from the group and prints the same for each row

d. LAG — Reads the value of a column from a row a specified number of rows before the current row

e. LEAD — Reads the value of a column from a row a specified number of rows after the current row

f. RANK() and DENSE_RANK() — I have explained what the rank() and dense_rank() functions are in this article: https://www.dhirubhai.net/pulse/what-difference-between-rankdenserank-rownumber-sql-arjun-rajeshirke-pwgzf
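The functions above can be combined in a single query. Here is a sketch over a hypothetical SALES table (region, sale_date, amount):

```sql
SELECT
    region,
    sale_date,
    amount,
    -- Running total of sales within each region, ordered by date.
    SUM(amount) OVER (PARTITION BY region ORDER BY sale_date) AS running_total,
    -- Previous row's amount for the same region (NULL for the first row).
    LAG(amount) OVER (PARTITION BY region ORDER BY sale_date) AS prev_amount,
    -- First sale amount in the region, repeated on every row.
    FIRST_VALUE(amount) OVER (PARTITION BY region ORDER BY sale_date) AS first_amount
FROM sales;
```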

4. What are the different types of warehouses in Snowflake?

Warehouses are required for queries, as well as all DML operations, including loading data into tables. A warehouse is defined by its size, as well as the other properties that can be set to help control and automate warehouse activity.

Warehouses can be started and stopped at any time. They can also be resized at any time, even while running, to accommodate the need for more or less compute resources, based on the type of operations being performed by the warehouse.

Below are the different sizes of warehouses available in Snowflake: X-Small, Small, Medium, Large, X-Large, 2X-Large, 3X-Large and 4X-Large. Each size up roughly doubles the compute resources (and the per-second credit cost) of the previous one.
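A sketch of creating and resizing a warehouse (the warehouse name is a placeholder):

```sql
-- Create a warehouse with auto-suspend/auto-resume.
CREATE OR REPLACE WAREHOUSE reporting_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  AUTO_SUSPEND = 60            -- suspend after 60 seconds of inactivity
  AUTO_RESUME = TRUE           -- wake up automatically when a query arrives
  INITIALLY_SUSPENDED = TRUE;

-- Resize at any time, even while the warehouse is running.
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'LARGE';
```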

5. What do you mean by a micro-partition?

Traditional data warehouses rely on static partitioning of large tables to achieve acceptable performance and enable better scaling. In these systems, a partition is a unit of management that is manipulated independently using specialized DDL and syntax; however, static partitioning has a number of well-known limitations, such as maintenance overhead and data skew, which can result in disproportionately-sized partitions.

In contrast, the Snowflake Data Platform implements a powerful and unique form of partitioning, called micro-partitioning, that delivers all the advantages of static partitioning without the known limitations, while also providing additional significant benefits.

Benefits of Micro-partitioning-

a. Snowflake micro-partitions are derived automatically; they don’t need to be explicitly defined up-front or maintained by users.

b. As the name suggests, these are small partitions, each holding between 50 MB and 500 MB of uncompressed data, which enables extremely efficient DML and fine-grained pruning for faster queries.

c. Helps prevent skew.

d. As the data is stored independently and in columnar storage, only the columns referenced in the query are scanned.

e. Columns are also compressed individually within micro-partitions. Snowflake automatically determines the most efficient compression algorithm for the columns in each micro-partition.
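Micro-partition pruning can be influenced and inspected. As a sketch, the built-in function below reports how well a table is clustered on the given columns (the table and column names are hypothetical):

```sql
-- How well is the ORDERS table clustered on ORDER_DATE?
SELECT SYSTEM$CLUSTERING_INFORMATION('orders', '(order_date)');

-- Optionally define a clustering key so Snowflake co-locates similar rows
-- in the same micro-partitions, improving pruning on ORDER_DATE filters.
ALTER TABLE orders CLUSTER BY (order_date);
```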

6. What is the QUALIFY clause in Snowflake?

In a SELECT statement, the QUALIFY clause filters the results of window functions.

QUALIFY does with window functions what HAVING does with aggregate functions and GROUP BY clauses.

In the execution order of a query, QUALIFY is therefore evaluated after window functions are computed. Typically, a SELECT statement’s clauses are evaluated in the order shown below:

  • From
  • Where
  • Group by
  • Having
  • Window
  • QUALIFY
  • Distinct
  • Order by
  • Limit
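A common use of QUALIFY is deduplication: keep only the latest row per group without a subquery. A sketch over a hypothetical ORDERS table:

```sql
-- Keep only the most recent order for each customer.
SELECT customer_id, order_id, order_date
FROM orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) = 1;
```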

7. Can you tell me something about query optimization practices in Snowflake?

Query optimization means tuning queries so that the database does not slow down and results are returned faster. The following are common query optimization techniques in Snowflake:

a. Maximizing Caching

b. Using materialized views wherever possible.

c. Scaling up, i.e. increasing the size of the warehouse for complex queries.

d. Scaling out, i.e. adding more clusters of the same size (a multi-cluster warehouse) to handle higher concurrency.
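A sketch of scaling up versus scaling out (warehouse name is a placeholder; multi-cluster warehouses require the Enterprise edition or higher):

```sql
-- Scaling up: a bigger warehouse for a complex query.
ALTER WAREHOUSE compute_wh SET WAREHOUSE_SIZE = 'X-LARGE';

-- Scaling out: a multi-cluster warehouse that adds clusters under concurrency.
ALTER WAREHOUSE compute_wh SET
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4
  SCALING_POLICY = 'STANDARD';
```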

8. Tell me something about storage integration object.

The CREATE STORAGE INTEGRATION command creates a new storage integration in the account or replaces an existing one.

A storage integration is a Snowflake object that stores a generated identity and access management (IAM) entity for your external cloud storage, along with an optional set of allowed or blocked storage locations (Amazon S3, Google Cloud Storage, or Microsoft Azure).

What this means is that if we define a storage integration object for Amazon S3 (see the code below), we can access files from S3 in Snowflake. The storage integration object defines the configuration, including what is allowed and what is not.

//Creating an integration object.
CREATE OR REPLACE STORAGE INTEGRATION s3_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = S3
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = ''
  STORAGE_ALLOWED_LOCATIONS = ('s3://<your-bucket-name>/<your-path>/', 's3://<your-bucket-name>/<your-path>/')
  COMMENT = 'This is an optional comment';

//Describing the integration object.
DESC INTEGRATION s3_int;
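Once the integration exists, it is typically referenced from an external stage, which can then be used to load data. A sketch, where the stage and table names are placeholders:

```sql
-- Use the integration in an external stage.
CREATE OR REPLACE STAGE my_s3_stage
  STORAGE_INTEGRATION = s3_int
  URL = 's3://<your-bucket-name>/<your-path>/'
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

-- Load files from the stage into a table.
COPY INTO my_table FROM @my_s3_stage;
```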

9. How do you create a user ?

A user in Snowflake is a person (or service) that has access to Snowflake and has a role associated with it. To create a user and assign a role, we need to create the role first. (If the role already exists, we can simply create the user and assign the role to it.)

To create a user type in the following command:

CREATE USER JOHN PASSWORD = 'Notasimplepassword123!@';
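A sketch of the role step that goes with this (the role and warehouse names are hypothetical):

```sql
-- Create a role, grant it a privilege, and assign it to the user.
CREATE ROLE IF NOT EXISTS analyst_role;
GRANT USAGE ON WAREHOUSE compute_wh TO ROLE analyst_role;
GRANT ROLE analyst_role TO USER JOHN;

-- The user can then activate the role in a session:
-- USE ROLE analyst_role;
```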

10. How do you share objects to different users in Snowflake?

Since each newly created user in Snowflake has the default PUBLIC role assigned to them, we can start granting privileges directly if we want to.

//Granting usage permission to warehouse.
GRANT USAGE ON WAREHOUSE READ_WH TO ROLE PUBLIC;

//Granting usage permissions to database.
GRANT USAGE ON DATABASE READ_DB TO ROLE PUBLIC;

//Granting usage permissions to schema.
GRANT USAGE ON SCHEMA INFORMATION_SCHEMA TO ROLE PUBLIC;

//Granting select permissions on tables.
GRANT SELECT ON ALL TABLES IN SCHEMA INFORMATION_SCHEMA TO ROLE PUBLIC;

11. What is the difference between permanent, transient and temporary table?

There are 3 types of tables in Snowflake.

a. Permanent Table — A permanent table is a normal table stored in a database; it is not dropped until explicitly requested. Permanent tables have Time Travel and Fail-safe functionality, with retention depending on which Snowflake edition you are using.

b. Transient Table — Snowflake supports creating transient tables that persist until explicitly dropped and are available to all users with the appropriate privileges. Transient tables are similar to permanent tables with the key difference that they do not have a Fail-safe period.

As a result, transient tables are specifically designed for transitory data that needs to be maintained beyond each session (in contrast to temporary tables), but does not need the same level of data protection and recovery provided by permanent tables.

NOTE: Snowflake also supports creating transient databases and schemas. All tables created in a transient schema, as well as all schemas created in a transient database, are transient by definition.

c. Temporary Table — Snowflake supports creating temporary tables for storing non-permanent, transitory data (e.g. ETL data, session-specific data). Temporary tables only exist within the session in which they were created and persist only for the remainder of the session. As such, they are not visible to other users or sessions. Once the session ends, data stored in the table is purged completely from the system and, therefore, is not recoverable, either by the user who created the table or Snowflake.

12. Can you perform DML operations on views?

DML operations can be performed through a simple view, but not always through a complex view. INSERT, DELETE and UPDATE are directly possible on a simple view; they cannot be applied directly to a complex view (for example, one involving joins, aggregates or DISTINCT).

13. What is the difference between views and materialized views?

Similar to views, materialized views are database objects defined by a SQL query; unlike views, however, a materialized view stores its query results, which are then kept up to date.

The contents of a view always reflect the current data of the underlying tables, since a view is just a stored query. A materialized view, in contrast, holds precomputed results; in Snowflake these are maintained automatically in the background, while in many other databases they must be refreshed periodically or manually.

Creating materialized views can be a very good approach for performance tuning especially when dealing with remote tables.

create materialized view mymv
    comment='Test view'
    as
    select col1, col2 from mytable;        

14. What is a merge statement?

MERGE is one of the DML commands in SQL; it performs either an INSERT or an UPDATE based on whether matching data already exists in the target table.

If the desired data is present then merge will update the records. If desired data is not present then merge will insert the records.

A sample merge statement is shown below. If the managers and directors tables have matching records based on the ID field, the UPDATE command runs; for records with no match, the INSERT statement is executed.

MERGE INTO managers m
USING directors d
  ON (m.id = d.id)
WHEN MATCHED THEN
    UPDATE SET name = 'TEST'
WHEN NOT MATCHED THEN
    INSERT VALUES (d.id, d.name, 0);

15. What is a trigger?

A trigger is a database object, similar to a stored procedure, that is automatically invoked or executed when a specified event occurs in the database.

The most common types of triggers are DML triggers, DDL triggers and database triggers (also referred to as logon triggers).

DML triggers are invoked when a DML operation (INSERT, UPDATE, DELETE) occurs on the table on which the trigger was created. A trigger can be configured to fire either before or after the DML operation.

DDL triggers are invoked when a DDL operation (CREATE, ALTER, DROP) occurs on the object on which the trigger was defined.

Database trigger is invoked when the database session is established or shut down.

16. What is difference between WHERE and HAVING clause?

WHERE clause is used to filter records from the table. We can also specify join conditions between two tables in the WHERE clause. If a SQL query has both WHERE and GROUP BY clause then the records will first get filtered based on the conditions mentioned in WHERE clause before the data gets grouped as per the GROUP BY clause.

The HAVING clause, on the other hand, filters the groups produced by the GROUP BY clause. So if a SQL query has WHERE, GROUP BY and HAVING clauses, the data is first filtered by the WHERE condition, then grouped, and finally the grouped data is filtered again based on the conditions in the HAVING clause.
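This ordering can be seen in a single query. A sketch over a hypothetical ORDERS table:

```sql
SELECT customer_id, SUM(amount) AS total_spent
FROM orders
WHERE order_date >= '2024-01-01'   -- row-level filter, applied before grouping
GROUP BY customer_id
HAVING SUM(amount) > 1000;         -- group-level filter, applied after grouping
```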

17. What is an external table in Snowflake?

In a typical table, the data is stored in the database; however, in an external table, the data is stored in files in an external stage. External tables store file-level metadata about the data files, such as the filename, a version identifier and related properties. This enables querying data stored in files in an external stage as if it were inside a database. External tables can access data stored in any format supported by COPY INTO <table_name> statements.

External tables are read-only, therefore no DML operations can be performed on them.

However, external tables can be used for query and join operations. Views can be created against external tables.

Querying data stored external to the database is likely to be slower than querying native database tables; however, materialized views based on external tables can improve query performance.
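A sketch of defining and querying an external table over CSV files in an existing stage (the stage, table, and column names are hypothetical):

```sql
-- Columns are projected out of the VALUE variant exposed for each file row.
CREATE OR REPLACE EXTERNAL TABLE ext_orders (
    order_id INT    AS (VALUE:c1::INT),
    amount   NUMBER AS (VALUE:c2::NUMBER)
)
LOCATION = @my_s3_stage
FILE_FORMAT = (TYPE = CSV)
AUTO_REFRESH = FALSE;

-- Query it like a normal (read-only) table.
SELECT order_id, amount FROM ext_orders;
```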

18. What is the difference between a secure view and a normal view?

A normal view is created from a table, and everyone who has access to the view can see the data in it along with the SQL query that was used to create the view. If users can see the SQL query, they will know that something has been filtered in the WHERE clause and that the view is not as complete as the underlying table.

In this scenario, we can make use of secure view.

A secure view only allows a role to see whatever the SELECT statement returns, without exposing the SQL query that was used to create the view. The user can query the view but cannot tell what has been filtered out, because they cannot see its definition.

Also, you can see a boolean value in the "is_secure" column when you list all views with the SHOW VIEWS command.

True means that the view is secure.

False means that the view is not secure. In short the view is normal.
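A sketch of creating a secure view over a hypothetical EMPLOYEES table:

```sql
CREATE OR REPLACE SECURE VIEW us_employees AS
    SELECT id, name
    FROM employees
    WHERE country = 'US';   -- consumers of the view cannot see this filter

-- is_secure shows TRUE for secure views.
SHOW VIEWS LIKE 'US_EMPLOYEES';
```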

NOTE: Don’t confuse this with Dynamic data masking policy. In a dynamic data masking policy we are showing all the columns but using some sort of special characters like ***-** in the column data to hide the data from the user.

Happy Learning!!
