DIFFERENCES IN SQL

DIFFERENCES IN SQL

WHERE vs HAVING

WHERE and HAVING clauses are both used in SQL to filter data.

WHERE

  • WHERE clause should be used before grouping the results.
  • WHERE can be used to filter on table columns.
  • It operates on individual rows of the table and filters the rows based on specified conditions.

eg) Consider a table employees with following data.

eg) Suppose consider we want to retrieve employees who belong to IT department and have a salary greater than 55000.

HAVING

  • HAVING clause should be used to filter the rows after grouping.
  • HAVING can be used to filter on aggregate functions like sum, avg, count, min and max.
  • It operates on group of rows and filters groups based on specified conditions.

eg) Suppose we want to retrieve the deparments with an average salary greater than 50000.

UNION vs UNION ALL

Both are used to combine the results of two or more SELECT queries into a single result set.

Rules

  1. Both the SELECT statement must have the same number of columns and the same data type.
  2. Columns in both the SELECT statement must be in the same order.
  3. The names of the columns can be different, but the final result set will show the column name of the first select query.

UNION

  • UNION combines the results of multiple SELECT queries and removes duplicate rows from the final result.
  • It ensures only distinct rows are included by performing an implicit DISTINCT operation on the result set.
  • It is useful when we want to combine and eliminate duplicate data from multiple tables or queries.
  • Slower compared to UNION ALL.

eg) Consider a table employee and manager with following data.

If we want to combine two tables using UNION then,

The output looks something like this.

UNION ALL

  • UNION ALL also combines the results of multiple SELECT queries, but includes all the rows, even if there are duplicates.
  • It is fatser as compared to UNION since it does not perform any duplicate elimination.
  • It is useful when we want to combine and preserve all rows, including duplicates from multiple tables or queries.

If we want to combine two tables using UNION ALL then,

The output looks something like this.

TRUNCATE, DELETE and DROP

TRUNCATE, DELETE and DROP are used to remove data or database objects.

TRUNCATE

  • TRUNCATE is used to delete all the rows from a table while keeping the table structure intact.
  • It is a Data Manipulation Language command that is typically faster than the DELETE command because it does not log individual row deletions.
  • The TRUNCATE command cannot be used on tables with foreign key constraints, and also cannot be rolled back.
  • When we use TRUNCATE, the table's identity counter (if any) is reset and the table becomes empty.

TRUNCATE TABLE employees ;        

DELETE

  • DELETE is used to remove specific rows from a table based on specified conditions.
  • It is a DML command that is slower than TRUNCATE because it logs individual row deletions, generates undo/redo data, and performs additional checks for constraints and triggers.
  • DELETE can be used with a WHERE clause to specify the conditions for row deletion.
  • The DELETE command can be rolled back using the ROLLBACK statement.

DELETE FROM employees WHERE department = 'IT' ;        

DROP

  • DROP is used to remove database objects such as tables, views, indexes or stored procedures.
  • It is a Data Definition Language command that completely removes the object from the database.
  • DROP cannot be rolled back and once an object is dropped, it is permanently deleted from the database.

DROP TABLE employees ;        

COUNT(*), COUNT(1), and COUNT(column_name)

COUNT(*) and COUNT(1) includes null values while counting but, COUNT(column_name) doesn't take null values into consideration during counting.

COUNT(*)

  • COUNT(*) counts the total number of rows in the result set, including rows with NULL values in all columns.
  • It does not consider individual column values but simply counts the rows that satisfy the conditions in the query.
  • It is usually the fastest option for counting rows because it does not involve evaluating any column values.

SELECT COUNT(*) AS total_rows FROM employees ;        

COUNT(1)

  • COUNT(1) is similar to COUNT(*), as it also counts the total number of rows in the result set, including rows with NULL values in all columns.
  • Instead of considering individual column values, it uses the constant value "1" for each row in the result set, effectively counting the rows without evaluating any column values.
  • Like COUNT(*), it is a fast option for counting rows.

SELECT COUNT(1) AS total_rows FROM employees ;        

COUNT(column_name)

  • COUNT(column_name) counts the number of non-null values in the specified column. It excludes rows with NULL values in the specified column from the count.
  • It is used when we want to count the occurrences of non-null values in a particular column.

SELECT COUNT(employee_id) AS non_null_employee_ids
FROM employees;        

In summary, COUNT(*) and COUNT(1) are used to count all rows in the result set, while COUNT(column_name) is used to count the non-null values.

IN, EXISTS, NOT IN and NOT EXISTS

IN and EXISTS are SQL operators that are used to filter and compare values in different ways. Similarly, NOT IN and NOT EXISTS are used for negating those comparisons.

  • IN and NOT IN compare values directly, while EXISTS and NOT EXISTS are used for existence checks based on subqueries.
  • IN and NOT IN are typically used for comparing values from a list, while EXISTS and NOT EXISTS are often used for correlated subqueries.
  • EXISTS and NOT EXISTS are usually more efficient for checking existence compared to IN and NOT IN because they stop processing once a match is found.

Correlated and Non-Correlated Sub Queries

Non-Correlated Subquery

  • A non-correlated subquery is an independent query that can be executed on its own without reference to the outer query.
  • The subquery is evaluated first and its result is then used in the outer query to filter or perform other operations.
  • Non-correlated subqueries are executed only once, regardless of the number of rows in the outer query, making them generally more efficient.

eg ) Consider we have employees table and we want to find all employees whose salary is greater than the average salary of all employees. We can use a non-correlated subquery for this:

SELECT EMP_ID, EMP_NAME, DEPARTMENT, SALARY 
FROM employees
WHERE SALARY > (
     SELECT AVG(SALARY)
     FROM employees
);        

The subquery (SELECT AVG(salary) FROM employees) is evaluated only once and provides the average salary value. The outer query then uses this value to filter the employees whose salary is greater than the average.

Correlated Subquery

  • A correlated subquery is a subquery that depends on the values from the outer query to execute.
  • For each row processed by the outer query, the subquery is re-evaluated with the specific values from the current row of the outer query.
  • Correlated subqueries can lead to decreased performance, especially when dealing with large datasets, as they may be executed multiple times.

eg ) Consider want to find all employees whose salary is greater than the average salary of their respective departments. We can use a correlated subquery for this:

SELECT EMP_ID, EMP_NAME, DEPARTMENT, SALARY 
FROM employees e1
WHERE SALARY > (
     SELECT AVG(SALARY)
     FROM employees e2
     WHERE e1.DEPARTMENT = e2.DEPARTMENT
);        

The subquery (SELECT AVG(SALARY) FROM employees e2 WHERE e1.DEPARTMENT = e2.DEPARTMENT) is correlated to the outer query by the department column. For each row processed by the outer query (e1), the subquery is re-evaluated with the specific department value from the current row of the outer query.

PRIMARY KEY and UNIQIUE KEY

Both the PRIMARY key and the UNIQUE key constraint are used to enforce uniqueness in a table's columns.

PRIMARY KEY constraint

  • The PRIMARY key constraint is used to uniquely identify each row in a table. It ensures that the values in the specified column(s) are unique and not NULL.
  • Each table can have only one PRIMARY key constraint and it must be defined when the table is created.
  • A PRIMARY key column cannot have duplicate or NULL values.
  • The PRIMARY key also creates a clustered index on the specified column(s), which determines the physical order of data in the table for efficient data retrieval.

CREATE TABLE employees(
    EMP_ID INT PRIMARY KEY,
    EMP_NAME VARCHAR(30),
    DEPARTMENT VARCHAR(50)
);        

In this example, the EMP_ID column is designated as the PRIMARY key, ensuring that each employee's ID is unique and not NULL.

UNIQUE KEY constraint

  • The UNIQUE key constraint is used to enforce uniqueness in a column(s) but does not necessarily identify each row uniquely.
  • A table can have multiple UNIQUE key constraints, and they can be defined when the table is created or added later.
  • A UNIQUE key column can have unique values, but it can allow NULL values (except in the case of a composite UNIQUE key where all columns must be unique and not NULL).
  • A UNIQUE key creates a non-clustered index on the specified column(s) to optimize data retrieval.

CREATE TABLE employees(
    EMP_ID INT,
    EMP_NAME VARCHAR(30),
    EMAIL VARCHAR(100) UNIQUE
);        

In this example, the EMAIL column has a UNIQUE key constraint, ensuring that each email address is unique in the table, but it can allow NULL values.

Clustered and Non Clustered Index

Both clustered and non-clustered indexes are used to improve the performance of database queries by providing quick access to data.?

Clustered Index

  • A clustered index determines the physical order of data rows in a table. Each table can have only one clustered index.
  • When a table has a clustered index, the rows are stored in the order of the clustered index key. This means that the data is physically organized on disk based on the values in the indexed column.
  • Due to the physical ordering, the retrieval of rows using a clustered index is faster when querying on the indexed column.
  • Creating or rebuilding a clustered index can be more time-consuming, as it affects the physical order of data in the table.

CREATE CLUSTERED INDEX idx_emp_id ON employees(EMP_ID) ;        

The data in the employees table will be physically sorted based on the values in the EMP_ID column. Any query that uses the EMP_ID column as a search criteria will benefit from faster data retrieval.

Non Clustered Index

  • A non-clustered index is a separate data structure that contains a copy of the indexed columns along with a pointer to the actual data rows in the table.
  • A table can have multiple non-clustered indexes, allowing for different indexing strategies to optimize various queries.
  • Non-clustered indexes do not affect the physical order of data in the table; instead, they provide a quick lookup path to the actual data rows.
  • Non-clustered indexes are generally faster to create or rebuild compared to clustered indexes.

CREATE NONCLUSTERED INDEX idx_salary ON employees(salary) ;        

The index will contain a copy of the salary column along with pointers to the actual rows in the employees table. Any query that uses the salary column in a search or join operation will benefit from faster data retrieval.

VIEW and MATERIALIZED VIEW

Both views and materialized views are database objects used to provide a logical representation of data from one or more underlying tables.

VIEW

  • A view is a virtual table that does not store data on its own. It is defined by a query that retrieves data from one or more base tables.
  • Views are used to simplify complex queries, encapsulate data access logic, and provide a security mechanism by restricting access to specific columns or rows of a table.
  • Whenever you query a view, the underlying query is executed, and the data is generated on-the-fly based on the latest data in the base tables. The view's data is always up-to-date with the underlying tables.
  • Views are suitable for scenarios where real-time data is needed, and you want to hide the complexity of underlying tables.

Example of creating a VIEW:

CREATE VIEW EMPLOYEE_DETAILS AS
SELECT EMP_ID, EMP_NAME, DEPARTMENT
FROM employees;        

In this example, the EMPLOYEE_DETAILS view provides a simplified representation of the data in the employees table, containing only the EMP_ID, EMP_NAME and DEPARTMENT columns.

MATERIALIZED VIEW

  • A materialized view is a physical copy of the result set of a query. It stores the data in a separate table and keeps the data updated periodically or on-demand based on a defined refresh schedule.
  • Materialized views are used to improve query performance by precomputing and storing the results of complex and resource-intensive queries. They are beneficial for scenarios where you need to access the same query result multiple times or when the underlying data changes infrequently.
  • The data in a materialized view is not always up-to-date with the underlying tables; it depends on the last refresh or update time. Therefore, materialized views are suitable for scenarios where near-real-time data is sufficient.

Example of creating a MATERIALIZED VIEW:

CREATE MATERIALIZED VIEW MV_EMPLOYEE_DETAILS AS
SELECT EMP_ID, EMP_NAME, DEPARTMENT
FROM employees;        

In this example, the MV_EMPLOYEE_DETAILS materialized view stores a physical copy of the EMP_ID, EMP_NAME and DEPARTMENT columns from the employees table. The data in the materialized view must be refreshed periodically to stay up-to-date.

Credits : GROKKING THE SQL INTERVIEW

要查看或添加评论,请登录

NIKHIL G R的更多文章

  • Introduction to DBT (Data Build Tool)

    Introduction to DBT (Data Build Tool)

    dbt is an open-source command-line tool that enables data engineers and analysts to transform data in their warehouse…

  • Introduction to Azure Databricks (Part 2)

    Introduction to Azure Databricks (Part 2)

    DBFS (Databricks File System) It is a Distributed File System. It is mounted into a databricks workspace.

  • Introduction to Azure Databricks (Part 1)

    Introduction to Azure Databricks (Part 1)

    Databricks is a company created by the creators of Apache Spark. It is an Apache Spark based unified analytics platform…

  • Aggregate and Window Functions in Pyspark

    Aggregate and Window Functions in Pyspark

    Aggregate Functions These are the functions where the number of output rows will always be less than the number of…

  • Different ways of creating a Dataframe in Pyspark

    Different ways of creating a Dataframe in Pyspark

    Using spark.read Using spark.

  • Dataframes and Spark SQL Table

    Dataframes and Spark SQL Table

    Dataframes These are in the form of RDDs with some structure/schema which is not persistent as it is available only in…

  • Dataframe Reader API

    Dataframe Reader API

    We can read the different format of files using the Dataframe Reader API. Standard way to create a Dataframe Instead of…

  • repartition vs coalesce in pyspark

    repartition vs coalesce in pyspark

    repartition There can be a case if we need to increase or decrease partitions to get more parallesism. repartition can…

    2 条评论
  • Apache Spark on YARN Architecture

    Apache Spark on YARN Architecture

    Before going through the Spark architecture, let us understand the Hadoop ecosystem. The core components of Hadoop are…

  • Introduction to Apache spark

    Introduction to Apache spark

    Apache Spark is a Distributed Computing Framework. Before going into Apache Spark let us understand what are the…

社区洞察

其他会员也浏览了