SQL Queries Quickstart: Part II
Welcome back. This is part II of my SQL Queries Quickstart Guide. If this is your first time here, please read the first part which includes the Index as well as the conventions used during the series.
By this time, you should be familiar in writing a statement to query a single table. Now we will embark in multi-table queries, but in order to do so, we first need to know about a special type of constraint.
Foreign Keys
One of the objectives of a Relational database is to easily connect data in different tables. This was a difficult task to do with the early file-based databases, but SQL makes it easy.
Tables are connected between each other using Foreign Keys (aka FKs). This is a special type of index that references rows in a different table (the referenced table). The basic idea is that you create a column to contain the identifier -primary key value- of a referenced-table's row.
If that was confusing, let's view it with an example in Educa.
Let's add a third table, enrollments, that will connect a user with a course. This table will have two FKs pointing to the other two tables:
Remember this article will not focus on DDL statements (like this CREATE TABLE here), but it's important to show it so we can understand how a foreign key is defined. Watch for the last two lines how we set two local columns to reference the primary keys of the other two tables.
Foreign keys are considered constraints too. Why? Because if we now try to add a row with an invalid user or course ID number, the database will reject it. This is one of the most compelling features of a relational databases: it works as the end line of defense to keep your data consistent, without having to rely in custom application code to do so.
Multi table queries
Ok, now that we have our three table setup, lets run a query to view all students and the courses they have enrolled in:
???? Play on Fiddle
The query starts on enrollments and then joins its rows with the rows on the other two tables. Let's review what's new here.
Aliases
They are added with the AS keyword, although in some cases it's optional to do so. Aliases have different purposes:
Connecting tables with JOINs
The JOIN keyword allows to connect a row from one table (left table) to the rows of another (join or right table).
In our example, both joins represent a 1:1 relationship. This means that for each row in the enrollments table, we match only one row in each of the right tables.
Even though there are a lot of different JOINs specified by the ANSI SQL standard, in practice I would argue there are only two important ones: the INNER JOIN and the LEFT OUTER JOIN. I recommend sticking with this two, as the others typically can be rewritten using these two anyways (or are almost never needed at all).
INNER JOIN
This is the most common join type, and the one we used in our examples (the INNER keyword is optional). It selects rows that have matching values in both tables.
Here's a visual representation (source: w3schools):
In our example every enrollment needs both a student and a course.
领英推荐
Notice that Foreign Keys may have NULL values. For example, we may want to start a draft enrollment process without assigning a student, introducing the concept of a vacant seat for the class in our application. In this situation, how can we get all enrollments, even those without a student assigned?
LEFT OUTER JOIN
The LEFT OUTER JOIN (the OUTER keyword is optional) allows us to get all records from the left table, and the matching on the right table. Here's another visual representation:
???? Play in Fiddle
In this Fiddle playground I've removed the NOT NULL constraint from the userId column. Notice how all rows without a matching student record will just return a NULL value:
Anti Join
There's a final join pattern that is really useful, called the Anti Join. Imagine a situation where you want all rows from the left table that don't have a matching row in the right table. Here it is visually:
This can be obtained by doing a LEFT JOIN and then requiring the right column to be NULL. Let's see an example in Educa. We want to send information about courses to new users. In other words, we want to find to all users without an enrollment:
???? Play on Fiddle
Look at that, all our users have an enrollment, except for the admin account. Educa business is thriving!
Notice the tables used in this query: the Foreign Key is defined in the right table. That's why we need the Anti Join pattern. There's no way to get this information starting from the enrollments table. Actually, that's not entirely true, you can use one of the less common RIGHT JOIN types. In my experience, I've never had to use a RIGHT join, I could always get away rewriting a query using LEFT joins.
Tip: what happens if we change the condition to "WHERE enrollment.userId IS NOT NULL"? Technically, we would be transforming a LEFT JOIN into an INNER JOIN. This happens any time we add a filter (other than IS NULL) in the WHERE section over a column of a LEFT JOINed table. Why? Because any NULL value in a comparison will force the filter to return NULL. And NULL = NULL is actually a FALSE statement! The filter returns NULL, which in turn evaluates to FALSE. This is a very common -bad- assumption new developers make.
Too many rows and Cartesian Product
Joining can be tricky. If you forget to add a join predicate, then you'll get all the combinations between each row of the left table with the rows in the right table (a "cartesian" product).
Let's explain this with an example. We want to find all Educa students with an enrollment. If we just do a JOIN, here's what we get:
There are duplicate rows! What's happening?
The reason is that there are more than one enrollment for some students. It's tricky to see it, after all, we haven't added enrollment columns to the query, right? But that's the way a JOIN works.
Even worse, if you add a third table, also duplicating rows, you'll get the combination of all the rows, producing even more duplicate rows in the result set.
DISTINCT
One way to solve this, is to use the DISTINCT keyword before specifying the columns to query. This will remove the duplicates, but it may not be too good for performance: the engine will first retrieve all the rows, and then perform the deduplication. This could be time consuming on big datasets, and it can mess with your pagination (but more on these on the Performance article, Part IV, coming later).
How can we do this filtering? We'll have to use subqueries. But this will be explained in Part III. Look for it in the replies, or stay tuned to my Articles section on ??.