SQL: Mastering Data Engineering Essentials
Photo by Thao LEE on Unsplash

SQL: Mastering Data Engineering Essentials


Here’s an interesting fact: do you know when the SQL language was created? When it first appeared? I do! It was in 1974. I wasn’t even born yet, and people were already writing in SQL. Today, nearly 50 years later, it’s younger than ever.

SQL is one of the main tools used in data science across the board. At some point, you will likely work with SQL: whether it’s to access a database, prepare a pipeline, or load data.

Whether you’re handling data science tasks, data preparation, or data engineering, it doesn’t matter. SQL will probably be part of your job at some point. Anyone who works with data will have to use SQL in some project.

My goal here is to give you a complete introduction to SQL with a focus on data engineering. I’ll present the main commands and then move on to more complex topics, demonstrating various hands-on activities.

And, of course, in your day-to-day work, you’ll encounter SQL at some point. There’s no escaping it, ok? You’ve got to add it to your toolbox. The more tools you know, the more problems you’ll be able to solve.

And please, give as many claps ???? as your fingers can handle, I do all this with lots of care. Let’s go! ??


Unlocking the Power of SQL for Data-Driven Decision?Making

Let’s start by preparing our database, creating the schema, and diving straight into the work using pgAdmin, an open-source tool for PostgreSQL database administration and management.?

But before anything else, the first step is to check if the container is running in Docker Desktop. In my case, you can see the status shows as “Running”?—?excellent!

If the status isn’t “Running”, you won’t be able to access it, right?

What’s inside this container? I have the PostgreSQL database management system. Now, I want to access the database.

For our example, I have pgAdmin installed locally. I created the connection on port 5437, which is linked to port 5432 on Docker.?

I named my database “pandata”, set the host name as localhost on my local machine, saved it, the connection was successful. Ready to go!

A database is logically divided into schemas. When you create a database from scratch, the DBMS, like PostgreSQL, automatically creates the “public” schema for you.?

Anything you want to share between schemas is usually placed in the “public” schema.

But I’m not going to use the “public” schema. I’ll create my own schema. To do this, you can right-click on Schemas, select Create > Schema, and a window will pop up.

You can then, for example, name it “sc01”.

Notice that the owner is “pandata”, which is the user we used when creating the database via the Docker container.

When you type in the Name box, PostgreSQL automatically generates the SQL statement for you. This is also a great way to learn even more.

Whenever you see the SQL tab at the top, take a look to see how the statement was generated. It’s an additional learning opportunity.

Also, you can right-click on Schemas and choose Query Tool to run your SQL queries. Here, you can directly enter your first SQL statement.

Isn’t that great?

# 1. Create Schema
CREATE SCHEMA sc01 AUTHORIZATION pandata;        

Everything in uppercase is the SQL syntax. In this case, we’re using a DDL statement to create an object. I’m creating a schema.

A schema is a logical division of objects. So, for example, I can have one schema for a web application. I create the schema, the tables, procedures, views, etc.

I can create another schema for an internal company application, for instance. In this other schema, I’ll also have tables and other objects.

The schema divides the database logically. You can:

  • Grant privileges;
  • Isolate each schema;
  • Place schemas in different files within the database itself.

There are many possibilities.

And that’s all you need to start building your database. So, you need to have a container running in Docker, open pgAdmin, connect to your database, and then create your schema.

Within this schema, we will do all of our work throughout this chapter.


DDL Statements?—?CREATE, ALTER,?DROP?

Now, I’m going to introduce a series of SQL instructions, gradually increasing the complexity and explaining everything step by step.

First, on the left side, navigate to the sc01 schema we just created. Right-click and choose Query Tool. When you click, the code window and the output panel will open.

Let’s start with DDL statements. Well, actually, the acronym DDL stands for Data Definition Language. DDL statements are used to create, modify, or delete objects in SQL, ok?

  • CREATE: Creates an object.
  • ALTER: Modifies an already created object.
  • DROP: Deletes a created object.

These are the main DDL statements. I used one earlier, right? It was CREATE SCHEMA. I used a DDL statement. Now, I’m going to use another one: CREATE TABLE.

Note: Everything in uppercase refers to SQL syntax. The words in lowercase are the names we define.

-- DDL Instructions (Data Definition Language)
-- CREATE, ALTER, DROP
-- 2. Table creation

CREATE TABLE sc01.employees (
    employee_id SERIAL PRIMARY KEY,
    first_name VARCHAR(50),
    department VARCHAR(50),
    salary DECIMAL(10, 2)
);        

So, CREATE TABLE is a DDL statement used to create a table. I’m going to place this table in the sc01 schema, and the name of the table will be employees.

We’ll have the following columns:

  • employee_id: Type SERIAL, an integer generated sequentially by the DBMS. Defined as the PRIMARY KEY, meaning it doesn't allow duplicate values in this column. It doesn't make sense to have two employees with the same ID, so this column will be the primary key.
  • first_name: Type VARCHAR(50), a variable character type with a maximum of 50 characters. Names exceeding this length won't fit in the column, requiring an ALTER statement to change the data type later.
  • department: Type VARCHAR(50), used to store the department name.
  • salary: Type DECIMAL(10, 2), which allows up to 10 digits before the decimal point and 2 after. This is ideal for representing monetary values with decimal precision.

pgAdmin is your friend and offers a simple alternative. On the left panel, navigate to Tables, right-click, and select Create > Table….

This will open an interface where you can define, for example, the table name as employees.?

Tip: When I can’t remember a specific instruction or syntax, I go straight to pgAdmin. I start creating an object, it provides me with the syntax, and I just complete it and move on.

Now, you can simply execute the statement or confirm the creation of the table via the interface. This is a great tip to speed up your work even more.

I’ve already shown you how to create a table using a DDL statement, specifically CREATE TABLE. So, do you need to create an object in the database??

You’ll use the DDL CREATE statement. For example: CREATE FUNCTION, CREATE VIEW, CREATE MATERIALIZED VIEW, and so on. To configure and use the object in the database, you first need to create it.

But how do you know the complete syntax to create each object? You have two main options:

  1. Visit the official documentation of the DBMS.
  2. Use a DBMS access tool, like pgAdmin.

Now, let’s take a look at the Employees table. Right-click on it and go to Properties. This will show you a series of tabs at the top. Among them is the SQL tab, but it’s empty.

Why?

No updates have been made yet, which is why it only shows the CREATE command, since you're creating the object. If you edit now, you can modify the object.

For example, you can go to the columns and make the necessary changes.

Imagine you want to modify the department column. Now, you no longer want it to be of type VARCHAR (variable character). Instead, you want it to be a fixed CHAR (fixed-length character).

The ALTER TABLE command is another DDL statement. When you create an object, you now have a command to modify that object.

So, CREATE TABLE creates the table, and ALTER TABLE modifies the table. CREATE PROCEDURE creates a procedure. And guess how you modify it? By using ALTER PROCEDURE. In other words, the first word is always the same. What changes is the object: table, function, procedure, view, or materialized view. This applies to any database object.

Now that I’ve created the table, I can modify it using exactly ALTER TABLE.

How do I do that?

-- Altering the table

ALTER TABLE sc01.employees ADD COLUMN hire_date DATE;        

The ALTER TABLE command. Always include the schema before the table name, meaning always precede the table name with the schema. Now, I’m going to add a column.

Notice that I don’t need to delete the table and create it again. I can simply alter the table that was previously created.

I’ll add a column called hire_date and define its data type, which in this case is DATE.

Execute. Table successfully altered.

Again, on the left panel, select the employees table. Right-click and go to Properties, then to Columns. Look who’s there: the hire_date column.

Both CREATE and ALTER are DDL statements. They allow you to create or alter an object. I’m showing the example with a table, but it’s exactly the same process for any other object. You can create a view and alter it. You can create a function and alter it.

To wrap up DDL statements, I can also drop an object. But this must be done with extreme caution, especially in a company’s production environment. Be careful. The DROP command means delete.

For example, DROP TABLE specifies the name of the table that will be deleted. Since I’m the owner of the database, when I execute this command, there’s no discussion. The table will be deleted.

-- Dropping the table

DROP TABLE sc01.employees;        

You have to hope there’s a backup available, right? If there’s no backup, it’s game over. You’ve lost everything.

So, this command must be used with great caution. In our test environment here, of course, there’s no issue. But in a production environment, in the daily operations of a company, you need to be extremely careful.

These three commands, or rather, these three instructions?—?CREATE, ALTER, and DROP—can be used with any object. You just replace TABLE with the object you want to create, modify, or drop. And, of course, you need to include the specific details of that object. It could be a PROCEDURE, FUNCTION, VIEW, or MATERIALIZED VIEW.

This is how you manage objects: by creating, altering, or deleting them.


DML Statements?—?SELECT, INSERT, UPDATE,?DELETE?

I’ve already shown you the DDL commands, which are used to create and manipulate objects in the database. With DDL, you can create, alter, or delete objects, working at the object level. But now, I want to access and manipulate the content of an object, specifically a table. In this case, we can’t use DDL?—?we need to use DML.

So, let’s get back to pgAdmin. First, I need to create a table, right? I’m going to create the table, because now we’ll manipulate its content. I’ve taken this opportunity to show you the CREATE TABLE command again. Notice that this time, I’m using the INT type.

-- Creating the table

CREATE TABLE sc01.employees (
    employee_id INT PRIMARY KEY,
    first_name VARCHAR(50),
    department VARCHAR(50),
    hire_date DATE,
    salary DECIMAL(10, 2)
);        

Each time I introduce an instruction, I’ll try to bring as much content as possible. So, always follow along carefully.

The INT type indicates that I’ll be inserting an integer value. Previously, I had used SERIAL PRIMARY KEY, which means that the DBMS itself generates a sequential number. That’s an option if you don’t want to manually input the ID.

But now, I want to manually insert the ID, which is why I used INT. I’ve also added the columns first_name, department, hire_date as DATE, and salary as DECIMAL.

So, execute the instruction, the table will be created successfully, refresh, and you’re good to go!

DDL is for manipulating the object. I’ve created the object, the employees table. Now, I use DML to manipulate the content of this table. I can INSERT records, UPDATE, DELETE, and SELECT or search the records.

Since the table was just created, it’s empty, right? Right-click on employees, choose View/Edit Data, and then select All Rows to view all the rows.

Do you know what it’s going to give you? A DML statement.

The SELECT statement is probably the one you’ll use the most in your day-to-day work. It queries the table to see if there’s anything in it. In this case, it brought a SELECT on the table, which is completely empty.

Let’s start with INSERT. I need to insert records, right? Notice the syntax: INSERT INTO. In this case, I’m going to insert into the employees table. This table has the following columns: first_name, department, hire_date, and salary. Got it?

-- DML Instructions (Data Manipulation Language)
-- SELECT, INSERT, UPDATE, DELETE
-- Data insertion

INSERT INTO sc01.employees (first_name, department, hire_date, salary)
VALUES (100, 'John Albright', 'Data Engineering', '2024-01-10', 25000.00);        

So, I’m inserting the values: employee_id, first_name, department, hire_date, and salary. I execute it. Oops! Error message. This is where many people freeze.

Pay attention! The message says the INSERT has more expressions than target columns. What does this mean? It simply means that the INSERT has more values than corresponding columns.

Let’s count together. I’m trying to insert:

  • 100 (employee_id)
  • John Albright (first_name)
  • Data Engineering (department)
  • 2024-01-10 (hire_date)
  • 25000.00 (salary)

Of course, it’s going to throw an error, right? What’s missing here? I didn’t include employee_id. I used this example specifically to draw your attention to the error.

You learn a lot from mistakes. Don’t be afraid of them; they’re a great source of learning.

To have a match here, I should have included employee_id, as I’m doing now:

-- DML Instructions (Data Manipulation Language)
-- SELECT, INSERT, UPDATE, DELETE
-- Data insertion

INSERT INTO sc01.employees (
employee_id, first_name, department, hire_date, salary)
VALUES (100, 'John Albright', 'Data Engineering', '2024-01-10', 25000.00);        

Now, I have 5 columns and I’m inserting 5 values. Guess what’s going to happen now when I execute it? It’s going to work.

Excellent, right? A small detail, but that detail makes all the difference.

When writing your INSERT statement, you need to have the correct number of columns at the top (the column names) to match the values below. There are alternatives to this, like using SERIAL or not, but that’s a story for another time. We’ll talk about that later.

For now, the important thing is this: How many columns does your table have? Five. So, you need to list the five columns at the top and the five values below.

Always pay attention to the data types. For example:

  • employee_id is of type INT. So, I input an integer.
  • first_name and department are of type VARCHAR (character). So, I enclose them in single quotes.
  • hire_date is of type DATE. It also goes in single quotes and must follow the correct date format.
  • salary is of type DECIMAL. So, it accepts the decimal point when entering the value.

These details are part of the syntax. If you have any doubts, pgAdmin can help you validate them.

You can go to the employees table, right-click, choose the Scripts option, and then select Insert. It will open a window and generate the exact INSERT statement for you.

Look how cool this is. When in doubt, pgAdmin can help you. You use INSERT INTO on the employees table.

The question mark you see is exactly where you’ll place the value for each column. Depending on the data type?—?whether it’s an integer, decimal, or string?—?you adjust accordingly, adding or omitting quotes where necessary.

This is all just an introduction to the introduction. I’ll bring up some advanced topics soon, but before that, you need to learn to walk before you can run, right?

So, for those who have never seen this before, great?—?you’re seeing your first SQL commands now. For those who have already seen it, take this as a review and follow along with the tutorial.

Now, I want to insert more records. I’ll take this opportunity to introduce a few more INSERT statements.

Can this be automated? Yes, I could create a stored procedure that follows a certain logic. Then, you can develop the code with SQL and use the INSERT command once to insert multiple records. This is possible with stored procedures. I’ll bring an example for you soon.

For now, we’ll continue repeating the INSERT statement, but with care to not repeat the ID.

Why can’t I repeat the ID? Because the ID is the primary key. We cannot have two records with the same information in the primary key.

INSERT INTO sc01.employees (employee_id, first_name, department, hire_date, salary)
VALUES (101, 'Michael Andrews', 'Data Science', '2024-02-10', 19000.00);

INSERT INTO sc01.employees (employee_id, first_name, department, hire_date, salary)
VALUES (102, 'Catherine Morgan', 'Data Engineering', '2024-02-11', 22000.00);

INSERT INTO sc01.employees (employee_id, first_name, department, hire_date, salary)
VALUES (103, 'Charles Davidson', 'DataOps', '2024-02-12', 23400.00);

INSERT INTO sc01.employees (employee_id, first_name, department, hire_date, salary)
VALUES (104, 'Claire Livingston', 'DataOps', '2024-02-13', 21800.00);        

Let’s select all of this, starting from employee_id 100, which has already been inserted. Then, we’ll go from 101 to 104, using the same syntax I’ve already explained. Execute.

Now, before we check the data, let’s do an experiment. I’ll try to insert the employee with employee_id 104, Claire Livingston, again.

Do you know what’s going to happen?

Error message. This is an opportunity for learning. What is the message saying?

employee_id 104 already exists. This is an attempt to duplicate a key, which results in a constraint violation. You can’t do that.

In fact, it’s not exactly an “error.” It’s just not allowed. I’m trying to violate a rule, because the column is a primary key.

Now, I want to check if the data was actually inserted, right?

SELECT * FROM sc01.employees
ORDER BY employee_id ASC        

Now, I’m going to run the SELECT, which is another DML statement. Some people like to treat SELECT as a separate type of statement, but it’s DML because we are manipulating and returning data.

You use SELECT to return data, and you can build a query as simple as this, or massive, complex queries full of details, all using SELECT. It can get much more complicated than the example we’re seeing now.

So, here we have the fields: employee_id, first_name, department, hire_date, and salary.?

If you look at the salary, you’ll see that Michael Andrews is the only one earning less than 20,000. This can’t happen. Let’s give Michael Andrews a raise, right?

I need to modify the table and increase his salary. I’ll modify just this field, only the salary.

For that, I’ll use another DML statement: UPDATE.

-- Data update

UPDATE sc01.employees
SET salary = 26000.00
WHERE first_name = 'Michael Andrews';        

Everything that appears in uppercase is, remember, the exact SQL instruction. We are updating the employees table in the sc01 schema. We set the salary to 26,000. A nice raise for Michael Andrews.

But I need to do this by establishing a rule. What’s the rule?

The rule is to use the WHERE clause, where the first_name is equal to 'Michael Andrews'.

If I remove the WHERE, what do you think will happen? It will update everyone’s salary because I didn’t specify a condition, I didn’t apply a filter. This is dangerous!

Only do this if you know what you’re doing. If you want to update just one record, you must apply a filter. Otherwise, you’ll update all the records.

Normally, when you want to update a record, you need to apply the filter in the WHERE clause very carefully. Otherwise, you could end up in trouble.

Now, I want to show you the DELETE command. We can delete a specific record. Let me show you an example of DELETE:

-- Data deletion

DELETE FROM sc01.employees
WHERE first_name = 'Michael Andrews';        

The WHERE clause is the filter we’re using here. We’re specifying the first_name, which in this case is 'Michael Andrews'.

The rule is the same as with UPDATE: when performing an UPDATE, use a filter. When performing a DELETE, also use a filter.

Notice that now ‘Michael Andrews’ is no longer there. And what happens if I remove the WHERE clause? The entire table will be emptied.

-- Data deletion

DELETE FROM sc01.employees        

What I’m showing you is to avoid making this mistake in your day-to-day work. When using UPDATE or DELETE, unless you know exactly what you’re doing and have a good reason, always use the filter with WHERE.

The filter will give you the safety to change exactly what you need. You can add multiple criteria to the filter, etc., but always use the filter with WHERE. Otherwise, what will happen? You’ll either update the entire table or delete the entire table.

Speaking of deleting, we can also do this in pgAdmin.

Just right-click on the employees table in the menu, and you’ll see the options Delete and Delete Cascade. Delete Cascade is used when you have a table with multiple relationships. In our case, we don’t have that, we only have one table, so we choose the Delete option.

In our test environment, this isn’t a problem. But in a production environment, you need to be extremely careful.


Aggregation Functions and?Grouping?

I deleted the employees table, and now I’m going to recreate the table, insert the records again, and continue with you. I’ll create the same table I was using earlier.

In a moment, I’ll create a second table, also called employees, with the columns employee_id, first_name, department, hire_date, and salary.

-- Table creation

CREATE TABLE sc01.employees (
    employee_id INT PRIMARY KEY,
    first_name VARCHAR(50),
    department VARCHAR(50),
    hire_date DATE,
    salary DECIMAL(10, 2)
);        

Table successfully recreated. I will now insert records into this table. Here are the DML INSERT statements:

INSERT INTO sc01.employees (employee_id, first_name, department, hire_date, salary)
VALUES (100, 'John Albright', 'Data Engineering', '2024-01-10', 25000.00);

INSERT INTO sc01.employees (employee_id, first_name, department, hire_date, salary)
VALUES (101, 'Michael Andrews', 'Data Science', '2024-02-10', 19000.00);

INSERT INTO sc01.employees (employee_id, first_name, department, hire_date, salary)
VALUES (102, 'Catherine Morgan', 'Data Engineering', '2024-02-11', 22000.00);

INSERT INTO sc01.employees (employee_id, first_name, department, hire_date, salary)
VALUES (103, 'Charles Davidson', 'DataOps', '2024-02-12', 23400.00);

INSERT INTO sc01.employees (employee_id, first_name, department, hire_date, salary)
VALUES (104, 'Claire Livingston', 'DataOps', '2024-02-13', 21800.00);        

Data Warehousing is a consolidated database. In other words, you take data from transactional databases, which are used in the day-to-day operations of a company, and you aggregate, summarize, and group that data.

Afterward, you create your Data Warehouse model, load the data into that model, and make it available for reporting.

So, very likely, during the data preparation process for loading into the Data Warehouse, you’ll need to perform some aggregation, summarization, and grouping.

Let’s study what SQL offers us in this regard. Essentially, we have five aggregation functions: MIN, MAX, AVG, SUM, and COUNT.

-- Minimum, Maximum, Average, Sum, and Count

SELECT MIN(salary), MAX(salary), AVG(salary), SUM(salary), COUNT(salary)
FROM sc01.employees;        

I can return the minimum value (MIN), the maximum value (MAX), the AVG (average), the sum (SUM), and then the count(COUNT).

I apply the function, specify the column where I want to execute the function, and indicate from which table I’ll extract the data.

You’ll notice that, when calculating the AVG (average), there are several decimal places. This is normal, especially when working with PostgreSQL. The system outputs the maximum number of decimal places configured by default in the DBMS. But those are a lot of decimal places, right?

So, let’s use another SQL function to simplify the result: the ROUND function.

-- Minimum, Maximum, Average, Sum, and Count

SELECT MIN(salary), MAX(salary), ROUND(AVG(salary),2), SUM(salary), COUNT(salary)
FROM sc01.employees;        

Now, the result looks a bit more user-friendly. Notice that I’m using AVG(salary) as an argument inside the ROUNDfunction. So, one function is used as an argument of another. The ROUND function, as you might expect, rounds the result. In this case, I’m rounding to two decimal places.

You’ll notice that for the other columns, rounding isn’t necessary.

Why? The AVG function is the only one calculated using division. To calculate the average, you sum the elements and divide by the number of elements. During this division process, many decimal places appear. For the other columns, since there’s no division involved, we naturally get fewer decimal places.

Next, let’s study how we perform grouping. Pay attention, as there are important rules. Everything I’ll show now will use AVG, but the same rule applies to MIN, MAX, SUM, and COUNT.

I’ll focus on AVG, as it’s quite common, especially when preparing Data Warehouses.

-- General salary average

SELECT ROUND(AVG(salary), 2) AS average_salary
FROM sc01.employees;        

At this moment, when you execute this query, it analyzes all the rows in the employees table and calculates the average of the salary column across all rows. This is the overall average. I’m not segmenting the data in any way, just calculating an aggregation function.

However, in practice, you’ll likely want to segment the result. Instead of calculating the overall salary, you might want to calculate the salary by department.

Since we have multiple departments with employees, I could now include the department column.

-- Departmental salary average

SELECT department, ROUND(AVG(salary), 2) AS average_salary
FROM sc01.employees;        

I want the average salary by department. For each department, SQL will gather the employees of that department, fetch their salaries, sum them up, and divide by the number of employees in that department. This is still aggregation, but now it’s segmented.

Error message. Now what do we do?

The department column in the employees table needs to appear in the GROUP BY clause or, alternatively, be used in an aggregation function. This is a common rule you’ll encounter in SQL across any DBMS.

To simplify and note this in your journal: any column that is not part of an aggregation function must be included in the GROUP BY clause. I’ll repeat: any column not in an aggregation function must be in the GROUP BY.

What are the aggregation functions? They are: MIN, MAX, AVG, SUM, and COUNT.

-- Departmental salary average
SELECT department, ROUND(AVG(salary), 2) AS average_salary
FROM sc01.employees
GROUP BY department;        

If you’re only using the AVG function or any other aggregation function, you don't need a GROUP BY, because in that case, you're calculating the global average.

However, when performing any type of segmentation, any column not part of the aggregation function needs to be included in the GROUP BY.

You could add 15 columns, and all of them would have to be in the GROUP BY. Otherwise, you won’t be segmenting correctly, and you’ll get the error I mentioned earlier.

Is it clear now why we use the GROUP BY clause? We use GROUP BY to group data in a segmented way. In this case, the grouping I want is to calculate the average, but segmented by department.

I could include other columns in the SELECT, and in that case, anything not part of an aggregation function must be in the GROUP BY.

Additionally, you can also sort the data. In fact, sorting can be done with any SQL clause.

So, first I group the data, as I’ve already done, but now I want to order the result. For that, I use the ORDER BY clause.

-- Departmental salary average ordered by department

SELECT department, ROUND(AVG(salary), 2) AS average_salary
FROM sc01.employees
GROUP BY department
ORDER BY department;        

Notice that now we have sorting, in this case by department. With this, I’m combining more and more SQL instructions.

Now, let’s do the following. Pay close attention. I want the average salary by department, but only if the average is greater than 20,000. Let’s use the previous query:

-- Departmental salary average ordered by department

SELECT department, ROUND(AVG(salary), 2) AS average_salary
FROM sc01.employees
GROUP BY department
ORDER BY department;        

When you execute this query, you get the average salary for each department. SQL will count the employees in each department, sum the salaries, divide by the number of employees, and thus calculate the average for each department.

But now, I’d like to apply a filter. In other words, I want to calculate the average only if it’s greater than 20,000.

How do we do this?

I could handle it as follows:

-- Departmental salary average only if the average is greater than 20000, 
-- ordered by department

SELECT department, ROUND(AVG(salary), 2) AS average_salary
FROM sc01.employees
WHERE ROUND(AVG(salary), 2) > 20000
GROUP BY department
ORDER BY department;        

The WHERE clause allows you to apply filters in your SQL statement. Just like we used it with UPDATE and DELETE earlier, we can also apply it with SELECT.

In this case, I’ll use a filter: I can apply filters based on the available columns and use both relational and logicaloperators, right?

For this example, I’m requesting a salary average greater than 20,000. I’ll calculate the average and place it in the WHERE clause. Then, I’ll use the correct relational operator, which in this case is greater than, to ensure I’m filtering only for salary averages above 20,000.

Seems logical, doesn’t it? I’m applying the filter and asking for the grouping. When the result is processed, I want it to show me only the departments where the average salary is greater than 20,000.

Execute. Oops! Error message.

Let’s stop and read the error message, okay? I’m doing this exercise because many people don’t give enough attention to error messages. However, they often tell you exactly what’s going on. And, by the way, error messages are becoming more detailed and informative.

In this case, the error message is clear: aggregation functions are not allowed in the WHERE clause. That’s it.

Aggregation functions like AVG, as well as MAX, MIN, COUNT, and others, cannot be used in the WHERE clause. And there’s a logical explanation for this!

Now, let’s understand how the order of execution in an SQL query works.

The SELECT statement will execute first, right? It’s going to fetch the data from the employees table. When it reaches the WHERE clause, I ask you: has the average been calculated yet?

The answer is no. The average will only be calculated when we get to the GROUP BY clause. But GROUP BY is the next stepin the execution process.

So, when I try to use the average calculation in the WHERE clause, it won’t work because the average hasn’t been calculated yet. It will only be calculated later, during the GROUP BY stage.

That’s why we can’t use aggregation functions in the WHERE clause. But there’s always someone who tries to move things around, right?

-- Departmental salary average only if the average is greater than 20000, 
-- ordered by department

SELECT department, ROUND(AVG(salary), 2) AS average_salary
FROM sc01.employees
GROUP BY department
WHERE ROUND(AVG(salary), 2) > 20000
ORDER BY department;        

You move things around, and now the error changes too. Now we have a syntax error: the GROUP BY cannot come before the WHERE clause.

So, what now? How do I fix this? My boss is waiting for the report, and I need to deliver the average salary by department, but only if the average is greater than 20,000.

Well, we have a way to do this by using another SQL clause: the HAVING clause. HAVING was created specifically to solve this issue with the WHERE clause.

The difference is that HAVING comes after the GROUP BY. It’s like WHERE, but it’s applied after the grouping.

It’s the same rule we’ve already discussed, but with a different clause. Since HAVING comes after GROUP BY, it first performs the grouping. After the grouping is done, we apply the filter with HAVING.

-- Departmental salary average only if the average is greater than 20000, 
-- ordered by department

SELECT department, ROUND(AVG(salary), 2) AS average_salary
FROM sc01.employees
GROUP BY department
HAVING ROUND(AVG(salary), 2) > 20000
ORDER BY department;        

Execute and watch the magic happen. A lot of people?—?and I repeat, a lot of people?—?don’t fully understand this. They don’t know exactly why they are using HAVING. They just use it without really knowing what’s happening.

HAVING is a filter after the GROUP BY. WHERE is a filter before the GROUP BY. So, you can’t use an aggregation function in the WHERE clause, because the GROUP BY hasn’t been executed yet.

You can use WHERE with other filters, but not with aggregation functions. Is that clear?

When you need to calculate an aggregation and then apply a filter based on that calculation, you’ll use the HAVING clause, which is the filter applied after the aggregation.

Now, I want to calculate the average salary by department, but only if the average is greater than 20,000, the filter we just discussed. In addition, only if the department name contains the word “Engineering”. Oh, and I want the result to be ordered by department.

In this case, I have a filter after the GROUP BY (using HAVING) and a filter before the GROUP BY (using WHERE). I’m going to use both clauses.

Here’s an example that I think is interesting, to make everything clear, especially for those areas that often cause confusion:

-- Departmental salary average only if the average is greater than 20000 
-- and only if the department name contains the word 'Engineering'
-- Ordered by department

SELECT department, ROUND(AVG(salary), 2) AS average_salary
FROM sc01.employees
WHERE department LIKE '%Engineering%'
GROUP BY department
HAVING ROUND(AVG(salary), 2) > 20000
ORDER BY department;        

In this example, I’m going to select the department and average_salary from the employees table. Then, I’ll add the WHERE clause to apply a filter.

The LIKE clause is another SQL operator that allows you to filter using strings and characters. Let’s take a look at the syntax: you place the string in single quotes and use the wildcard operator %.

This means it will check each row in the department column, and if it finds something that contains the word "Engineering", it will filter those results.

After that, the data will be grouped by department, and the additional filter will be applied after the GROUP BY with the HAVING clause.

I could even remove the ordering if I wanted, no problem at all.

Is the difference between WHERE and HAVING clear to you? If you need to apply a filter on the aggregation, it has to be done after the GROUP BY. In that case, you’ll use the HAVING clause.

If you want to apply any other filter that is not based on aggregation, you place it in the WHERE clause.


Joins Functions?—?Inner, Left, Right, and?Full

Let’s move on to another topic that often causes confusion: joins.

To demonstrate how joins work, I need at least one more table. A join happens when you combine two or more tables to return the data you need.

So, I’ll create a second table. I’m going to create the projects table:

CREATE TABLE sc01.projects (
    project_id INT PRIMARY KEY,
    project_name VARCHAR(100),
    employee_id INT REFERENCES sc01.employees(employee_id)
);        

The projects table will have:

  • The project ID (project_id), which will be the primary key;
  • The project name (project_name);
  • The employee ID (employee_id), which will reference the employee_id from the employees table.

For this, I’m using another SQL instruction: REFERENCES. This instruction creates the reference, linking the employee_idfrom the projects table to the employee_id in the employees table.

Now, let’s perform the “joining” of the tables, or the join itself.

With the table created, let’s insert some records. Here’s the DML INSERT statement.

INSERT INTO sc01.projects (project_id, project_name, employee_id)
VALUES (6001, 'Real-Time Data Analysis', 101);

INSERT INTO sc01.projects (project_id, project_name, employee_id)
VALUES (6002, 'CI/CD Pipelines', 103);

INSERT INTO sc01.projects (project_id, project_name, employee_id)
VALUES (6003, 'Transactional Data Extraction', 104);

INSERT INTO sc01.projects (project_id, project_name, employee_id)
VALUES (6004, 'Data Backup', 102);

INSERT INTO sc01.projects (project_id, project_name, employee_id)
VALUES (6005, 'Requirements Gathering', null);        

I’ll insert the project with ID 6001, including the project name, and 101 is the employee assigned to this project. I’ll repeat the process for the other projects, and at the end, I’ll insert NULL.

In other words, I’ll create a record where there won’t be an associated employee ID, meaning no employee is assigned to that project.

Now, let’s run a SELECT to verify if everything was successfully inserted.

And here are the registered projects, with one project not having an associated employee. Let’s take this opportunity to discuss an important concept: is the NULL here correct? Yes or no? And why?

The answer is: it depends.

Does the company allow registering a project without an associated employee? If yes, then it’s correct. However, if the company doesn’t allow creating a project without at least one responsible employee, then it would be incorrect.

In other words, the business rule defines how the model is implemented. You can’t say whether it’s right or wrong just by looking at the data. Some companies only allow registering a project if there’s an employee assigned. Other companies allow creating a project without an employee, with the intention of assigning one later.

In my example, I’m allowing projects to be created without an associated employee. The company might want to create a project in the system and assign an employee later. This is valid.

Table created successfully!

Now, I want to get the following: the name and salary of employees who are assigned to projects.

  • Where is the employee name? In the employees table.
  • Where is the salary? In the employees table.
  • But where is the project assignment? In the projects table.

So, I need to perform a join. In this case, there’s no filter. I just want the name and salary of employees who are assigned to projects.

I can solve this using an inner join, which is the most common type of join and one you’ll likely use a lot in your day-to-day work. The inner join is performed using the INNER JOIN clause.

-- INNER JOIN - Name and salary of employees allocated to projects

SELECT e.first_name, e.salary, p.project_name
FROM sc01.employees e
INNER JOIN sc01.projects p ON e.employee_id = p.employee_id;        

Notice what I did here. I’m selecting first_name and salary, and there’s an "e" before these variables. What is this "e"? It’s an alias—a shorthand that I’m giving to the employees table.

If I don’t use the alias, I would have to use the full notation, like sc01.employees.first_name or sc01.employees.salary. It’s not required if the column name is unique.

If the column name appears in both tables, then using an alias would be mandatory.

If the column name is in only one table, it wouldn’t be necessary. However, if tomorrow I modify the schema or add a column with the same name, it could cause problems.

One strategy to simplify your query, make it always readable and safer, is to use aliases, as I did here.

I’m saying: I want the first_name and salary from the employees table, which has the alias "e", and I’m joining it with the projects table, which has the alias "p".

Now, to perform the join, I use columns that are common to both tables. In this case, the employee_id in the "e" table has to match the employee_id in the "p" table. Execute, and you’ll get the exact result you're looking for.

Some questions might arise here:

  1. Is it mandatory to use the word INNER?
  2. No, it’s not. If you remove the word INNER and execute the query, the result will be the same. However, I recommend leaving INNER explicit. This makes your query clearer and helps distinguish when you use other types of JOIN (like LEFT, RIGHT, or FULL JOIN).
  3. If you use just JOIN, it will be interpreted as an INNER JOIN, but using it explicitly is always more readable.
  4. What if there’s no relationship between the tables?
  5. If there’s no common field between the tables, you can’t perform an INNER JOIN. In that case, you would need to use a CROSS JOIN, which performs the Cartesian product of the tables.
  6. However, the goal here is to establish a relationship between the tables, which requires a common field like employee_id. This relationship is defined in the database model.

The ON clause specifies the field that establishes the relationship. This works for two or more tables. The procedure is always the same: adding joins and specifying the common columns using the ON clause.

A common mistake that prevents many people from learning SQL is trying to memorize the syntax. In reality, there’s no need to memorize, right? If you have a tool like pgAdmin, it generates an initial draft for you. Other tools do the same, creating the initial SQL syntax for you.

It doesn’t make sense to memorize; it’s a waste of time. What matters most is understanding what you want to return. That’s the key to learning SQL.

In this case, what did I want? The name and salary of everyone assigned to a project. Notice that 4 employees were returned. But how many employees are there in the employees table? Do you remember? Let’s check. I’ll run a SELECT to confirm.

I have 5 employees, right? All of them are registered. When I wanted to return only those who are assigned to a project, I looked for the exact match on the employee_id. One of the employees isn’t assigned to a project, correct? That’s fine, no problem. At some point, it’s possible that an employee might not be assigned to a project.

The inner join returns records if there’s a match. In other words, the employee_id has to exist in one table and also in the other. If they match, it establishes the relationship and returns the data. That’s what the inner join does: it always returns the record if there’s a match.

However, the requirement may change. Now, for example, I want the name and salary of all employees, regardless of whether they are assigned to projects or not.

Do I have employees who are not assigned to projects? Yes, I do. Now, I want to show everyone: both those who are assigned to projects and those who are not.

Can I use the same query that I was using to answer this new question?

-- Name and salary of all employees regardless of being allocated to projects

SELECT e.first_name, e.salary, p.project_name
FROM sc01.employees e
INNER JOIN sc01.projects p ON e.employee_id = p.employee_id;        

No, this query won’t work because it will only return employees who are allocated to a project. But I also want the employees who are not allocated. I need to show this information to management. So, we can’t use an INNER JOIN.

See the difference? Instead of memorizing syntax, it’s about understanding what is being returned.

Basically, what I want is the following: I want everyone from the employees table. In the employees table, we have people who are allocated to projects and people who are not allocated, right?

Now, I want everyone. But how do I do that? Well, it’s actually quite simple. Just replace the INNER JOIN with a LEFT JOIN.

-- Name and salary of all employees, whether allocated to projects or not

SELECT e.first_name, e.salary, p.project_name
FROM sc01.employees e
LEFT JOIN sc01.projects p ON e.employee_id = p.employee_id;        

What did we do exactly? With the LEFT JOIN, I’m essentially saying the following: Return everyone from the left table. If there’s a match in the right table, bring it along. If there isn’t a match, insert NULL. So, the order of the tables is important here, right?

The left table is the employees table. The right table will return the matches. If there’s no match, it will return NULL.

Isn’t that what happened here? It returned everyone from the left table, which is the employees table. And in the right table (projects), it brought the corresponding data. However, there’s one employee not allocated to a project. So, there’s no way to return a project for them. It returned NULL. Simple, right?

But what happens if I reverse the order of the tables? Then, you can’t use the LEFT JOIN anymore. The logic changes.

SELECT e.first_name, e.salary, p.project_name
FROM sc01.projects p
LEFT JOIN sc01.employees e ON e.employee_id = p.employee_id;        

Oh, the result changed completely! In this case, the left join changed the outcome. What do we want now? I want it to return everything from the left table, which is the projects table. And I want it to return whatever matches exist in the employees table.

Alright, the result changed completely. Notice that, in terms of syntax, it’s the same query, right? The only thing that changed is the order of the tables. And that changed the result. So, what do you want to return? Ask yourself this question a thousand times if necessary: What do you want to return?

Do you want to return only what matches? Then use INNER JOIN. Do you want to return everything from the left table, and if there’s a match in the right table, include that? Then you’ll use LEFT JOIN.

That’s it! But we have a problem. Null values are showing up. You’re not going to deliver a report with null values, are you? Don’t do that. It’s terrible. Never deliver a report with null values, under any circumstances.

So, we’ve solved one problem, but now we’ve created another. I still need to deliver the report, but I can’t show null. Well, we have an elegant way to solve this. I’m going to introduce another SQL function, COALESCE.

What does COALESCE do? This function will return the project name when it exists. If it’s null, I’ll replace it with a string like "Not allocated to a project." Isn’t that a much more elegant solution?

-- A more elegant alternative for the previous result
SELECT e.first_name, e.salary, COALESCE(p.project_name, 'Not Allocated to a Project')
FROM sc01.employees e
LEFT JOIN sc01.projects p ON e.employee_id = p.employee_id;        

The difference is that, when you use LEFT JOIN or RIGHT JOIN, you’ll likely return null values because there might not be a match. With an INNER JOIN, if there’s a match, it will return the value, so the chances of getting null are minimal, right?

When you use LEFT JOIN or RIGHT JOIN, there’s a high probability that you’ll get a null value if no match exists. The query will return the results but will insert null where there’s no correspondence.

But you can’t leave null in the report, because it looks unprofessional. That’s when you use COALESCE, for example, and solve the problem.

The difference is that, when you use LEFT JOIN or RIGHT JOIN, you'll likely return NULL values because there might not be a match.

With an INNER JOIN, if there's a match, it will return the value, so the chances of getting NULL are minimal, right?

When you use LEFT JOIN or RIGHT JOIN, there's a high probability that you'll get a NULL value if no match exists. The query will return the results but will insert NULL where there's no correspondence.

But you can’t leave NULL in the report, because it looks unprofessional. That's when you use COALESCE, for example, and solve the problem.

-- RIGHT JOIN - Name of all employees assigned to projects and projects without assigned employees
SELECT COALESCE(e.first_name, 'No Employee Assigned') AS employee_name, p.project_name
FROM sc01.employees e
RIGHT JOIN sc01.projects p ON e.employee_id = p.employee_id;        

I will keep the same table order, but now I will use the RIGHT JOIN. The logic is the same as the LEFT JOIN.

What am I going to do? I want everything from the right table. If there is a match in the left table, it returns; if there’s no match, it will insert NULL.

You already know you don’t want to leave NULL, so I’ve already added COALESCE here for you to replace the NULL with a string.

See that it returned all the projects. If there was an employee assigned, it returned the name. This last project didn’t have anyone, remember? So instead of NULL, I replaced it with "no employee assigned."

Notice that I didn’t change the order of the tables. I kept the same order. The only thing I did was tell the SQL engine that now I want to look at everything from the right table. With the LEFT JOIN, I want to look at everything from the left table.

If I want a complete match, I use the INNER JOIN. But what if I want everything from both the right and left tables, even if there is no match?

Now I want everything, regardless of whether there’s a match or not. I can’t use INNER JOIN because it only returns data when there’s a match.

Now, I’m going to explain the FULL JOIN. The requirement is: Names of all employees allocated or not to projects, and all projects with or without assigned employees. In this case, it's a combination of a LEFT JOIN and a RIGHT JOIN.

There’s a specific function for this, which is the FULL JOIN. Since I know that null values will appear, I'm taking a proactive decision: using COALESCE for both columns because I know there will be null values in this situation.

The only difference here is the use of the FULL JOIN, which behaves very differently from an INNER JOIN.

-- FULL JOIN - Name of all employees allocated or not to projects and all projects with or without assigned employees
SELECT COALESCE(e.first_name, 'No Employee Assigned') AS employee_name,
       COALESCE(p.project_name, 'No Project Assigned') AS project_name
FROM sc01.employees e
FULL JOIN sc01.projects p ON e.employee_id = p.employee_id;        

Look at this beautiful report. It returned the names of the employees and the names of the projects. When there is a match, it shows the data. Michael Andrews is assigned to the project Real-Time Data Analysis.

Below, I see a project without any assigned employees, and it displays the text I set with COALESCE. Similarly, John Albright is not assigned to any project, and the text appears as I configured using COALESCE.

Do you agree with me that the FULL JOIN is very different from the INNER JOIN? Many people confuse this, trust me. A lot of people.

The INNER JOIN only returns results if there is a match. That’s it. If there is no match, then you have to make a choice. Do you want everything from the left table? Everything from the right table? Or do you want everything from both the right and the left tables? In that case, you use the FULL JOIN.

If there’s no INNER keyword and just JOIN, then it’s an INNER JOIN by default in SQL.

I hope these simple commands have clearly, directly, and objectively explained something that many people have doubts about: the use of join functions.

There’s also the CROSS JOIN, which results in a Cartesian product. This only makes sense in very specific situations and isn’t necessary for our context.

And then we have the SELF JOIN, which doesn’t have a specific clause. It’s simply an INNER JOIN of a table with itself, useful when you want to apply some type of recursion.

Now, I want the average salary of departments with employees allocated to projects. In this case, I will use the INNER JOIN.

Why use the INNER JOIN? Because it doesn’t mention whether or not there’s a match. The INNER JOIN only returns records when there’s a match between the tables. If there’s no match, the record won’t be returned.

This keeps the SQL commands in backticks and provides a more readable structure with shorter, clearer paragraphs.

-- INNER JOIN - Average salary of departments with employees allocated to projects
SELECT department, ROUND(AVG(salary), 2) AS avg_salary
FROM sc01.employees e
INNER JOIN sc01.projects p ON e.employee_id = p.employee_id
GROUP BY department;        

It’s not always clear in the request, the solicitation, or the demand. You have to interpret it and then decide.

In this case, I want the average salary of departments with employees allocated to projects. There’s nothing mentioned about employees not allocated to projects, right? Only those allocated to projects. In this case, there must be a match. So, we have to use the INNER JOIN.

Let’s proceed with the average salary. I already know that I need to use AVG, because it's the aggregation function that calculates the average. I'll use ROUND to limit it to two decimal places. I want this average grouped by department. So, I know that I need to return the department.

I already know that I need a GROUP BY. As you know, any column not included in an aggregation function must be included in the grouping.

I’m using aliases with AS. It's not mandatory, but it's a good practice to indicate that there is an alias. Ok?

Now, I want to return this only for employees who are allocated to projects. Employee data, like the department, is in one table, while the project allocation is in another. So, I need to join these tables.

In this case, since I only want employees who are allocated, I use the INNER JOIN, and then I specify the columns that allow the relationship in the ON clause.

Run the query, and here you have the result.

Always understand the result you need. What do you want to return? This is what will define how you create your statement.

Let’s move on to another query. Now, I want the average salary by department. But, just like the previous example, we need to focus only on employees allocated to projects. The same question applies here. However, now we add a filter. I want this data only for employees whose hire date was on the 10th of any month or year. If the person was hired on the 10th, I want to intuitively apply that as a filter.

Is this a filter that involves aggregation? No. So, if it doesn’t involve aggregation, where does it go? It goes into the WHERE clause.

If it did involve aggregation, what would I use? I’d use HAVING. Isn't that right? So, let's take a look at what we’re going to do here.

I want the average salary. I’ll use ROUND for rounding. I want this average by department. So, the department goes here in the SELECT. It's not part of the aggregation function, so where does it go? It goes into the GROUP BY.

-- INNER JOIN - Average salary of departments with employees allocated to projects
-- whose hire date is on the 10th of any month or year
SELECT department, ROUND(AVG(salary), 2) AS avg_salary
FROM sc01.employees e
INNER JOIN sc01.projects p ON e.employee_id = p.employee_id
WHERE EXTRACT(DAY FROM hire_date) = 10
GROUP BY department;        

To check who is allocated to a project, I need to join the tables. The requirement doesn’t specify whether or not the employee is allocated, so I’m assuming they need to be allocated. Therefore, I will use an INNER JOIN.

The only thing missing is the filter, which specifies that the hire date must be on the 10th day of any month.

I have the hire_date column, and I need to extract the day from the date. To do that, I'll use the EXTRACT function, another SQL function. As the name suggests, it extracts.

What am I extracting? The day. I will extract the day from the hire_date column. I extract the day and compare it: is the day equal to 10? If yes, then please return the record.

In other words, I returned the average salary of the departments for employees allocated to projects.?

However, I did this only if the employee’s hire date was on the 10th of any month or year. I used the INNER JOIN, returned the department's average, grouped the data, and applied a filter by extracting the day from the hire date.

Subqueries and CTEs (Common Table Expressions)

Another aspect of SQL is that, in many cases, you’ll have more than one possible solution to the same problem. There isn’t a single solution all the time; you can solve things in different ways.

For example, I used a LEFT JOIN and a RIGHT JOIN, right? I could have solved everything with the LEFT JOIN, just by changing the order of the tables. In other words, either I use the RIGHT JOIN and don't change the order of the tables, or I use the LEFT JOIN and change the order. These are two possibilities to reach the same solution.

This is very common when working with SQL. Keep in mind that there’s always more than one way to do the same thing.

We also need to consider the performance of the query. But first, focus on understanding how to build the query. Then, worry about how to optimize the query for the performance you want. It’s a two-step process.

A lot of people try to do everything at once. Of course, that slows down the learning process for SQL. First, focus on learning how to build a query. Have you learned? Can you now structure the logic? Great, excellent. Then, we can focus on performance.

I’ll discuss this with you soon, because when you work with ETL or load data into a Data Warehouse or any database, performance becomes a critical factor. So, first learn how to structure your query and build the logic. Then, worry about performance.

Now, try solving this for me?—?I want something simple. I want employees with salaries higher than R$ 21,900. I’ll show you the solution using a CTE.

WITH high_salary_employees AS (
    SELECT first_name, salary
    FROM sc01.employees
    WHERE salary > 21900
)
SELECT *
FROM high_salary_employees;        

But what do you mean, instructor? There’s a SELECT down there? All of this is a query, and it’s called a CTE.

In some scenarios, it might be useful to create a temporary table. You don’t want to create another table in your database, right? Why? Because, suddenly, the table might become too large. You’d have to write that to disk. It doesn’t make sense. You can create a temporary table that will only exist in the computer’s memory while the query is being executed. And this can be faster.

An alternative is always creating a table on disk, right? You can create a copy of a table. You can create another table from a SELECT, which I'll show you an example of too. But I can create temporary tables, and that's what I'm doing with the CTE.

Notice that I first have the WITH clause. After that, I have a name. This name is for the temporary table. This table, high_salary_employees, will only exist in the computer's memory during execution.

If you observe the SELECT below, you’ll see that I’m selecting exactly from this table. It’s a temporary table. It doesn’t exist on disk—that’s why a SELECT by itself at the end of the query doesn’t work.

In fact, what I’m telling the execution engine is the following: Create this temporary table high_salary_employees as the result of this query.

SELECT first_name, salary
    FROM sc01.employees
    WHERE salary > 21900        

Here, I’m selecting the name and salary from the employees table, where the salary is greater than 21,900, which is what I want. So, what’s going to happen? It will execute this internal SELECT, returning a few rows, right? I’m going to call these rows a table: high_salary_employees.

Then, I come down here and run the SELECT. Select all from this, execute, and that’s it. Here’s your result.

Many times, this approach can save you in terms of performance. Sure, you can solve it in other ways, but they might give you inferior performance. There’s no magic rule, ok? Will using a CTE always be better? No. You always have to analyze the execution plan. The execution plan will tell you if your query is good or not in terms of performance?—?for this database, for these tables, on this computer.

You can take the same data, move it to another computer, in a different DBMS, and that query could be worse or better. A lot of people ignore the fact that all of this is executed where? On the computer, right? Every processor has a different computational capacity. So, a query might perform great on one computer and terribly on another.

That’s why you can’t say that a CTE will always be better or worse. I always have to analyze the execution plan. That’s what will tell me if the query is performing well or not.

But beyond the performance aspect, using a CTE allows me to apply filters to the data, place it in a temporary table in memory, which is much faster than reading from disk, and maybe execute the query in a quicker way.

This way, I place the filter I want inside the WITH clause—notice the open and close parentheses—it creates the temporary table, and then I SELECT from that table.

Now, pay attention to this. Take a moment to think about this for a while.

-- Employees with a salary higher than 21900 
-- why does this query show an error?)

WITH employees_highest_salaries AS (
    SELECT first_name, salary
    FROM sc01.employees
    WHERE salary > 21900
)
SELECT *
FROM employees_highest_salaries
WHERE EXTRACT(DAY FROM hire_date) = 10;        

I’ve modified the query. Now, I want employees with a salary greater than 21,900 who were hired on the 10th. So, I used EXTRACT to get the day of hire. Now, when I run this query, it’s going to throw an error.

Why did this error message occur? If you understand what a CTE is, then you already understand why the error is happening. Let’s interpret this together, ok?

Error: column hire_date does not exist. The error points to the part of the query where the column hire_date is being used. But wait a second. This column exists in the employees table, right? So, how can it be saying that the column doesn't exist?

Yes. The hire_date column doesn't exist where? It doesn’t exist in the temporary table. When I try to apply the WHEREclause using hire_date, the engine says: "Hold on, that column doesn't exist in the temporary table." And it's absolutely correct.

How do we solve this problem? It’s super easy. Look here. Now, I’m going to modify my CTE to include hire_date in the internal query. Now, this result will have the column.

-- CTE
-- Employees with a salary higher than 21900 hired on the 10th day
WITH employees_highest_salaries AS (
    SELECT first_name, salary, hire_date
    FROM sc01.employees
    WHERE salary > 21900
)
SELECT *
FROM employees_highest_salaries
WHERE EXTRACT(DAY FROM hire_date) = 10;        

Now, when I run the SELECT using the temporary table, it's going to work perfectly. Do you understand the concept now? If you didn't get it before, sometimes the error message helps you understand the concept better than when everything works right from the start.

If I remove hire_date from the SELECT, then my temporary table simply won't have that column anymore. So, naturally, it's going to throw an error in the SELECT because I'm trying to use that column to filter with the WHERE clause. The way to fix this is to add hire_date to the temporary table by including it in the query inside your CTE.

This is extremely valuable, right? When we use ETL and apply the CTL process with SQL logic, this could make all the difference. It could mean the difference between an ETL process that takes five hours and one that takes five minutes. Why? Because a lot of people use SQL by reading data from disk, which is much slower. If I can aggregate in memory, it’s much faster.

So, during the load process, a CTE can drastically reduce the time needed to load data, especially in a data warehouse, which typically handles very large data loads.

A CTE is a type of subquery. In practice, we have a subquery that generates a temporary table, and then I run a query on that temporary table. That’s a CTE.

Inside this subquery in parentheses, you can put whatever SQL you want. You can use JOIN, GROUP BY, or even HAVING to filter after grouping. You can join multiple tables. You use the subquery to create a temporary table and then query that temporary table. It's a type of subquery, got it?

But I have another alternative. As I said earlier, in SQL, there are generally many ways to solve a problem. This can be confusing at first, but over time, you’ll see that one of the best things about SQL is its flexibility. You can always find different ways to solve a problem.

Now, take a look at this example I have for you. Considering employees hired in February, I need to return the name and department of those with the highest salary. Interesting, isn’t it?

First, do I need to perform any joins? Yes or no, and why?

No, I don’t need a join. The data I need is all in one table.

Do I need to do aggregation? Yes, I do. Why? Because I need to return the highest salary. To return the highest salary, I need an aggregation function. And for that, there’s the MAX function. So, I already know that I'll need to perform aggregation.

Look at what we’re doing here?—?this is an incredibly valuable exercise. We’re interpreting the requirements, and from there, building our query.

Alright, let’s bring in a first proposal for a solution. Check this out.

-- SUBQUERY
-- Considering the employees hired in the month of February, 
-- return the name and department of those with the highest salary.

SELECT first_name, department, MAX(salary)
FROM sc01.employees
WHERE EXTRACT(MONTH FROM hire_date) = 2
GROUP BY first_name, department;        

Someone who isn’t familiar with subqueries will probably try to solve it this way:

Select first_name, department and the highest salary from the employees table. I'm applying the filter because I want the month to be equal to 2, which is February. And I'm grouping by the two columns that aren't in the aggregation function.

So, syntactically, it looks like everything is correct, right? You run it, and then I ask you: Was this the result I wanted?

Considering the employees hired in February, ok, I applied the rule. Return name and department, ok, got that. But what about the highest salary? Wait a second, who has the highest salary now? I can see it from here, look. Charles Davidson has the highest salary, 23,400.

Yes, but if I had 8 million records in this table? Am I going to look through all 8 million records to find the highest salary? It seems to me that this query isn’t ideal. It’s correct in terms of SQL syntax, and it did return the highest salary, but does it seem reasonable if I have a table with millions of records? Probably not.

So, maybe I need to rewrite my query. Many people try to solve a problem with the only tool they know. It doesn’t matter if it’s a nail or a screw, they’ll always use a hammer. But we have the appropriate tool for a nail and the appropriate tool for a screw.

So, this query, while it returned a result, doesn’t seem to be the best option. Let’s change the query, let’s change the tool. And this is where something like a subquery comes in. Look how cool this is!

-- Solution with subquery
SELECT first_name, department
FROM sc01.employees
WHERE salary = (SELECT MAX(salary) FROM sc01.employees WHERE EXTRACT(MONTH FROM hire_date) = 2);        

Considering all employees hired in February, return the name and department of the employee with the highest salary. If there are two people, that’s fine. In the previous query, we saw who had the highest salary?—?it was Charles Davidson.

Now, my query below returns exactly and only this employee. But how did we get here? We used a subquery, a SELECTinside another SELECT.

First, I create this internal SELECT.

SELECT MAX(salary) 
FROM sc01.employees 
WHERE EXTRACT(MONTH FROM hire_date) = 2);        

This SELECT will return the highest salary from the employees table, considering those hired in February. For this query, do I need a GROUP BY? No. Why? Because I only have the aggregation function in the SELECT. So, I removed the GROUP BY.

In general, GROUP BY can cause performance issues. The subquery here allowed me to eliminate the GROUP BY while still using the MAX function, which is the aggregation function. Once I get this result, what will I have? The highest salary.

So, what do I do next? I go to the employees table and return the name and department where the salary equals the highest value. This is exactly the query we just created, which returns the result we need.

In terms of syntax, both queries are correct. But notice that the first query, with the GROUP BY, is returning several "highest" values. Why? In practice, the result will be more granular, more detailed. Why? Because I have to group bythe other two columns.

I could add more filters to the first query. But what will happen? The performance of this query will suffer. On the other hand, in the second query below, I managed to simplify things and removed, for example, the GROUP BY operation.

See how memorizing doesn’t help? If you try to memorize, that’s it?—?you won’t learn SQL. Don’t memorize. Always learn SQL by focusing on what you want to return.

Next Chapter

Phew! See you in the next one.

Thank you very much. ???? All images, content, and text are created and authored by Leonardo A.


要查看或添加评论,请登录

Leonardo A.的更多文章

社区洞察

其他会员也浏览了