Intermediate SQL takeaways
I just finished the Intermediate SQL course on DataCamp, here are my takeaways.
Click here if you want the full version of my notes.
Case statements
- Used for returning a value on a specified condition
- goes through conditions and returns value as the given result if it is met. If not, the else result is returned.
- If there is no ELSE, it gives NULL.
- To filter NULL, place the entire CASE WHEN into a WHERE clause and end with IS NOT NULL
- can be used with aggregate functions - COUNT, SUM, AVG, Rounded AVG
- You can sum logical values by converting them to 1 and 0 using CASE WHEN so you can calculate percentages.
Case statement code:
Subqueries
- in SELECT, FROM, WHERE, GROUP BY
- Returns -> scalar, list, table
Why Subqueries?
- Compare groups to summarized values
- Reshape data
- Combining data that cannot be joined
Subqueries in FROM
1. restructuring and transform data
- transform data from long → wide
- prefilter data
2. calculate aggregates of aggregates
To remember
- give each subquery an alias, and join them
- Join subquery to a table in FROM
- include a joining column in both tables (ON or USING)
Subqueries in SELECT
1. return single value
- include aggregate values to compare individual values (you can't include aggregate values in ungrouped SQL query)
2. Mathematical calculation
- deviation from average
- ex: compare single average vs overall average.
To remember
- need to return single value
- make sure you have filters in the right place (include where clause in subquery and in main query)
Tips for subqueries
- format your query (line up SELECT, FROM, WHERE and GROUP BY)
- Annotate your queries with comments
- indent your queries (subqueries)
- make sure subqueries are necessary as they can slow down computing
- Properly filter each subquery (include in all subquery and outer query)
Example of subquery in SELECT, FROM and WHERE from the course
Correlated Queries
- use values from outer query to generate results
- re-run for every row generated in final dataset
- used for advanced joining, filtering and evaluating.
Simple query vs correlated query
simple
- extracting, structuring, and filtering
- run independent of main query
- Evaluated once
correlated
- dependent of main query
- Evaluated in loops (slows down query time)
Nested Queries
- subquery inside another subquery
- perform multiple layers of transformation
- can be either simple or correlated.
Common Table Expression (CTE)
What
- a temporary named table for easy use and easier to read code
- An alternative to subqueries
Code
- s1 and s2 are the table names
- you put subqueries into CTE
Benefits
1. executed once
- CTE are stored in memory
- improves query performance
2. improves organization of queries
3. Allows referencing of other CTEs
4. Allows referencing itself (SELF JOIN)
Deciding Techniques
- JOINS
- Correlated Subqueries
- Multiple/nested Subqueries
- CTEs
JOINS
Pros
- Combines 2+ tables
- Simple operations / aggregations
Cons
- can't join two columns to a table in a single column in another at a time
Correlated subquery
Pros
- circumvents multiple complex joins
- reduce the length of your query
- Solves limit of joins
Cons
- High processing time
- Who does each employee report to in a company
Multiple/nested subquery
Pros
- useful when it requires multi-step transformation
- improves accuracy and reproducibility
- processed first before main query
Cons
- Sensitive to changes to data
CTE
Pros
- organize subquery sequentially
- declared first, improving query run time
- Reference other CTEs
Cons
- Takes up space
- Cannot be nested
Which to use?
- depends on database/question
- Use technique that best allows you to use and reuse queries and generate clear and accurate results
Window Function
What is it
- a class of function that performs calculations on a result set that has been already generated (window)
Use case
- use window to perform aggregate calculations without having to group data using the GROUP clause
- calculate information → running total, rankings, and moving averages
Why
- to use aggregate functions, you need to group your results (every singly non-aggregate value)
- This means you can't compare aggregate values to non-aggregate data
- using window functions, we can perform calculations on results that has been generated
- simpler syntax and faster processing time
IMPORTANT: it's processed after every part of the query except ORDER BY
- uses info in result set rather than database
OVER()
- used to create windows
- Placed after the aggregate function and before the AS alias
- Example:
PARTITION BY
what
- calculate separate windows based on cols you want to divide your results
- create a single column that calculates average of goals scored for each seasons
- can partition 1 or more columns
- can be used with aggregate functions, ranks, etc.
Example
{aggregate func} OVER(PARTITION BY column_you_want_to_divide_by ) AS name
Extract
- extract specific month, day or year from date columns
- ex: grab august from date
EXTRACT(month FROM date) = 08)
RANK()
- Creates ranking of your columns
- include RANK() in the front
- remember to include ORDER BY inside the OVER clause to specify the column you want to rank
- default order is smallest → largest
- add a DESC inside your agg function to reverse
Example:
RANK() OVER(ORDER BY {aggregate function} desc) AS col_name
Sliding windows
- create running calculations between any two points in a window using functions
- PRECEDING - before
- FOLLOWING - after
- CURRENT ROW
- UNBOUNDED PRECEDING/FOLLOWING - no limit
- You can calculate running counts, sums, averages, and other aggregate functions between any two points you specify in the data set.
Next course: PostgreSQL Summary Stats and Window Functions
Resources
- Course notes with code
- SQLZOO - an awesome series of practical exercises and quizzes for students of SQL
- Stack Exchange Data Explorer - Write SQL code to query Stack Overflow
Thank you.