SQL Queries Quickstart: Part II
SQL Queries Quickstart: Part II

SQL Queries Quickstart: Part II

Welcome back. This is part II of my SQL Queries Quickstart Guide. If this is your first time here, please read the first part which includes the Index as well as the conventions used during the series.

?? Read Part I here

By this time, you should be familiar in writing a statement to query a single table. Now we will embark in multi-table queries, but in order to do so, we first need to know about a special type of constraint.

Foreign Keys

One of the objectives of a Relational database is to easily connect data in different tables. This was a difficult task to do with the early file-based databases, but SQL makes it easy.

Tables are connected between each other using Foreign Keys (aka FKs). This is a special type of index that references rows in a different table (the referenced table). The basic idea is that you create a column to contain the identifier -primary key value- of a referenced-table's row.

If that was confusing, let's view it with an example in Educa.

Let's add a third table, enrollments, that will connect a user with a course. This table will have two FKs pointing to the other two tables:

The enrollments table, with it's two foreign keys to the courses and users tables.

Remember this article will not focus on DDL statements (like this CREATE TABLE here), but it's important to show it so we can understand how a foreign key is defined. Watch for the last two lines how we set two local columns to reference the primary keys of the other two tables.

Foreign keys are considered constraints too. Why? Because if we now try to add a row with an invalid user or course ID number, the database will reject it. This is one of the most compelling features of a relational databases: it works as the end line of defense to keep your data consistent, without having to rely in custom application code to do so.

Multi table queries

Ok, now that we have our three table setup, lets run a query to view all students and the courses they have enrolled in:

A multi table query.

???? Play on Fiddle

The query starts on enrollments and then joins its rows with the rows on the other two tables. Let's review what's new here.

Aliases

They are added with the AS keyword, although in some cases it's optional to do so. Aliases have different purposes:

  • Changing a column label, specially useful when using functions like the CONCAT in the example. This will result in a more beautiful table output.
  • Easily work with duplicated column names, or even joining the same table more than once.
  • Better naming for tables. Look at how the users table becomes a student.

Connecting tables with JOINs

The JOIN keyword allows to connect a row from one table (left table) to the rows of another (join or right table).

In our example, both joins represent a 1:1 relationship. This means that for each row in the enrollments table, we match only one row in each of the right tables.

Even though there are a lot of different JOINs specified by the ANSI SQL standard, in practice I would argue there are only two important ones: the INNER JOIN and the LEFT OUTER JOIN. I recommend sticking with this two, as the others typically can be rewritten using these two anyways (or are almost never needed at all).

INNER JOIN

This is the most common join type, and the one we used in our examples (the INNER keyword is optional). It selects rows that have matching values in both tables.

Here's a visual representation (source: w3schools):

An inner join will select rows in which both tables have a match.

In our example every enrollment needs both a student and a course.

Notice that Foreign Keys may have NULL values. For example, we may want to start a draft enrollment process without assigning a student, introducing the concept of a vacant seat for the class in our application. In this situation, how can we get all enrollments, even those without a student assigned?

LEFT OUTER JOIN

The LEFT OUTER JOIN (the OUTER keyword is optional) allows us to get all records from the left table, and the matching on the right table. Here's another visual representation:

A left join will includes rows from the left table, even without a matching row in the right one.

???? Play in Fiddle

In this Fiddle playground I've removed the NOT NULL constraint from the userId column. Notice how all rows without a matching student record will just return a NULL value:

LEFT JOIN, notice the NULL values on the fullName column.

Anti Join

There's a final join pattern that is really useful, called the Anti Join. Imagine a situation where you want all rows from the left table that don't have a matching row in the right table. Here it is visually:

The anti-join, rows on the left table that don't have a match in the right one.

This can be obtained by doing a LEFT JOIN and then requiring the right column to be NULL. Let's see an example in Educa. We want to send information about courses to new users. In other words, we want to find to all users without an enrollment:

All users without an enrollment.

???? Play on Fiddle

Look at that, all our users have an enrollment, except for the admin account. Educa business is thriving!

Notice the tables used in this query: the Foreign Key is defined in the right table. That's why we need the Anti Join pattern. There's no way to get this information starting from the enrollments table. Actually, that's not entirely true, you can use one of the less common RIGHT JOIN types. In my experience, I've never had to use a RIGHT join, I could always get away rewriting a query using LEFT joins.

Tip: what happens if we change the condition to "WHERE enrollment.userId IS NOT NULL"? Technically, we would be transforming a LEFT JOIN into an INNER JOIN. This happens any time we add a filter (other than IS NULL) in the WHERE section over a column of a LEFT JOINed table. Why? Because any NULL value in a comparison will force the filter to return NULL. And NULL = NULL is actually a FALSE statement! The filter returns NULL, which in turn evaluates to FALSE. This is a very common -bad- assumption new developers make.

Too many rows and Cartesian Product

Joining can be tricky. If you forget to add a join predicate, then you'll get all the combinations between each row of the left table with the rows in the right table (a "cartesian" product).

Let's explain this with an example. We want to find all Educa students with an enrollment. If we just do a JOIN, here's what we get:

Students with an enrollment, why are they duplicated?

There are duplicate rows! What's happening?

The reason is that there are more than one enrollment for some students. It's tricky to see it, after all, we haven't added enrollment columns to the query, right? But that's the way a JOIN works.

Even worse, if you add a third table, also duplicating rows, you'll get the combination of all the rows, producing even more duplicate rows in the result set.

DISTINCT

One way to solve this, is to use the DISTINCT keyword before specifying the columns to query. This will remove the duplicates, but it may not be too good for performance: the engine will first retrieve all the rows, and then perform the deduplication. This could be time consuming on big datasets, and it can mess with your pagination (but more on these on the Performance article, Part IV, coming later).

How can we do this filtering? We'll have to use subqueries. But this will be explained in Part III. Look for it in the replies, or stay tuned to my Articles section on ??.

#sql #select #queries #database #dba #developer #DataAnalysis #tutorial #DatabaseDevelopment #SQLServer #MySQL #PostgreSQL #Oracle #SQLite

要查看或添加评论,请登录

Jose Canciani的更多文章

社区洞察

其他会员也浏览了