Mastering SQL Fundamentals (Detailed)

This article is designed for beginners in the SQL language, offering a step-by-step guide starting from the most fundamental concepts. Even if you already have some experience with SQL, this reading will be valuable, serving as an excellent opportunity to review and strengthen your understanding of the language’s basics.

Our focus will be predominantly practical, covering fundamental concepts before delving into the effective use of SQL. Get ready for a journey that will transform you from a complete novice to a confident user of the SQL language, through practical examples and clear explanations that will facilitate your learning.

Database Fundamentals: Terminology and Initial Concepts

In this segment, we will establish a solid foundation of understanding, starting with a concise introduction to databases and relational models. Considering the vast array of terminologies and concepts, it’s common for beginners to feel a bit confused. Therefore, our first step is to clarify this terminology.

When we mention ‘DBMS,’ we are referring to ‘Database Management Systems’ like MySQL, Oracle, Microsoft SQL Server, PostgreSQL, among others. These are software designed to manage one or more databases. The term ‘Database’ refers to the files that store data. Each DBMS can manage one or more databases, and our focus will be on relational databases, where data is organized into interconnected tables.

Another important term is the ‘Logical Relational Model,’ which defines how data relates to each other. Professionals like data engineers or architects use this model to plan the database structure, identifying entities, relationships, cardinality, and, from there, creating a logical model with specific definitions for each situation. For example, when developing a database for a hospital clinic, separate tables would be created for patients, doctors, and medical procedures, all interconnected.

Finally, the ‘Physical Relational Model’ is the practical implementation of these structures in the database, using a set of SQL statements to create tables, columns, indexes, relationships, and so on.
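To make this concrete, below is a minimal sketch of what a physical model for the clinic example could look like in MySQL. The table and column names (patients, doctors, medical_procedures) are purely illustrative and are not part of the Anvisa dataset we will use later.

CREATE TABLE patients (
    -- Each patient is identified by a unique id.
    patient_id INT PRIMARY KEY,
    patient_name VARCHAR(100) NOT NULL
);

CREATE TABLE doctors (
    doctor_id INT PRIMARY KEY,
    doctor_name VARCHAR(100) NOT NULL
);

-- Each procedure links one patient to one doctor,
-- materializing the relationships defined in the logical model.
CREATE TABLE medical_procedures (
    procedure_id INT PRIMARY KEY,
    patient_id INT NOT NULL,
    doctor_id INT NOT NULL,
    procedure_date DATE,
    FOREIGN KEY (patient_id) REFERENCES patients (patient_id),
    FOREIGN KEY (doctor_id) REFERENCES doctors (doctor_id)
);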

Applying SQL to Real Data: Case Study with Anvisa Data

In this section, we will dive into the practical application of the SQL language using real data publicly provided by Anvisa (National Health Surveillance Agency). We have chosen a practical and relevant scenario: the scheduling of sanitary inspections on cruise ships. This context-driven approach not only facilitates learning but also makes the process more engaging and meaningful.

Instead of presenting disconnected SQL instructions, we will integrate each command within a real-world scenario, tackling and solving specific problems. This will allow us to not only learn SQL language effectively but also understand how it can be used to solve real-world problems. We will contextualize our situation, explore the issues involved, and apply SQL to find solutions, deepening our understanding of the language in the process.

Data Dictionary for Sanitary Inspections on Cruise Ships

1. nome_navio (ship_name): Identifies the cruise ship’s name.

2. mes_ano_abertura (month_year): Records the month and year in which the sanitary inspection was conducted.

3. classificacao_risco (risk_classification): Indicates the sanitary risk level of the ship, categorized into four classes:

— standard_a: Risk score up to 150, indicating excellent sanitary conditions.

— standard_b: Score between 151 and 300, satisfactory sanitary conditions.

— standard_c: Score between 301 and 450, acceptable sanitary conditions.

— standard_d: Score above 450, unsatisfactory conditions requiring immediate action.

Ships with critical controls not implemented are downgraded to the next category starting from the 2018/2019 season.

4. pontuacao_risco (risk_score): Sum of the risks for each inspected item. The more control failures, the higher the risk and score.

5. indice_conformidade (compliance_index): Percentage of items in the inspection checklist met by the ship.

6. temporada (season): Period of the National Health Surveillance Program for Cruise Ships, which runs from October to April of the following year.

Preparing the Database for Analysis

With the business challenge clearly defined, our focus now shifts to analyzing the cruise ship sanitary inspection program. Equipped with the necessary data and a comprehensible data dictionary, the next step is to prepare our database for analysis. We will then load the data files to start our investigation.

When opening MySQL Workbench, you will notice various elements related to MySQL on the left side of the interface. However, our main interest lies in the ‘Schemas.’ Currently, there is a growing trend among Database Management Systems (DBMS) to replace the traditional concept of a ‘database’ with that of a ‘schema.’ While, depending on the DBMS, these two concepts may be quite similar, it is important to note that a single Database can contain multiple Schemas.

In this context, a Schema refers to an organized collection of tables, views, and other database elements that are logically grouped to facilitate data management and organization. Understanding this distinction is crucial for effectively analyzing sanitary inspection data in an efficient and structured manner.
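If you want to check which schemas already exist on your server before creating anything, a quick statement from any query tab does the job (the exact list will vary from installation to installation):

SHOW DATABASES;
-- Lists every schema visible to the current user,
-- including system schemas such as sys, mysql and information_schema.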

Creating a New Schema in MySQL

For our project, it is essential to understand and select the appropriate Schema. In MySQL, you will find the sys Schema, which is intrinsic to the MySQL Database Management System (DBMS) itself. It is important to note that this Schema should not be used for our purposes. Using or modifying the sys Schema can lead to significant issues, including the loss of the DBMS installation. Therefore, it is highly recommended to avoid any intervention in this specific Schema.

Therefore, the best practice is to create a new Schema. This avoids risks associated with altering the sys Schema and provides a dedicated space to work with our cruise ship sanitary inspection data, ensuring safety and efficiency in data organization and analysis.

CREATE SCHEMA `cap02`;

After naming your new schema (here we call it cap02), MySQL will automatically display a Data Definition Language (DDL) SQL statement, specifically a CREATE SCHEMA command. To create the schema in the database, simply click ‘Apply’.

At the bottom of the MySQL interface, you will find a section that functions as a log, displaying all the operations being executed. It’s important to regularly monitor this area for any error messages.

Next, when selecting ‘cap02’ in the Schemas area, we can see that there are currently no tables available — the space is empty. To start creating our tables, simply right-click in this area and choose the ‘Create Table’ option.

Structuring the TB_SHIPS Table for Data Analysis

We named our table TB_SHIPS, reflecting its purpose of analyzing data from sanitary inspections on cruise ships.

The next step is to define the columns of the table, corresponding to each column present in the data file and detailed in the data dictionary.

CREATE TABLE `cap02`.`TB_SHIPS` (
    -- Identifies the name of the cruise ship.
    `ship_name` VARCHAR(50) NULL,

    -- Records the month and year in which the sanitary inspection was conducted.
    `month_year` VARCHAR(10) NULL,

    -- Indicates the sanitary risk level of the ship.
    `risk_classification` VARCHAR(15) NULL,

    -- Sum of the risks of each inspected item.
    `risk_score` INT NULL,

    -- Percentage of inspection items fulfilled by the ship.
    `compliance_index` VARCHAR(15) NULL,

    -- Period of the National Health Surveillance Program for Cruise Ships.
    `season` VARCHAR(200) NULL
);        

With the TB_SHIPS table successfully created, the next step is to load the data. To do this, right-click on the TB_SHIPS table in MySQL and select the ‘Table Data Import Wizard’ option. This wizard will guide you through the process of importing data into the table.

The ‘Table Data Import Wizard’ is the tool we will use to load data into the TB_SHIPS table. This wizard simplifies the import process, ensuring that the data is correctly inserted into the table by simply specifying the directory or folder where the data source file is located.

In the next step of the ‘Table Data Import Wizard’, you will be presented with a choice: import the data into an existing table or create a new table for this purpose. Since we already have the TB_SHIPS table ready, we will select the option to import into the existing table.

Additionally, the wizard offers the option of ‘truncate table’. This is a useful SQL statement for clearing all data from a table. This functionality is particularly valuable if you need to make adjustments to the table and want to re-import the data. By selecting this option, the wizard automatically clears the existing records in the table before loading the new data.
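If you prefer to clear the table yourself instead of ticking that option, the equivalent statement is a single line; be aware that it removes every row from TB_SHIPS:

TRUNCATE TABLE cap02.TB_SHIPS;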

The wizard automatically identifies the columns in the file to be imported, using the semicolon (;) as the default delimiter. It displays important details such as the CSV file format in UTF-8 encoding, as well as listing the column names in the file under the ‘Source’ column.

Next to it, in the ‘Dest’ column, it shows the corresponding column names in our TB_SHIPS table. At the bottom of the screen, the wizard provides a preview, displaying examples of the data in each column, making it easy to check and validate the column mapping.

With everything set up, we are ready to start importing the data into the TB_SHIPS table. After the successful completion of the process, the wizard confirms the number of imported records, which in this case is 463. This count must precisely match the number of records in the data file. If the number is different, it indicates that something did not go as expected during the import.
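A simple way to double-check the load is to count the rows yourself and compare the figure with the one reported by the wizard:

SELECT COUNT(*) AS total_records
FROM cap02.TB_SHIPS;
-- Expected result: 463, the same number of rows as in the source file.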

Fundamentals of SQL Query: Understanding the SELECT Query

Let’s explore the basic structure of an SQL query, starting with the simplest of all: SELECT * FROM. This query exemplifies the simplicity and effectiveness of SQL. The words SELECT and FROM are reserved SQL keywords used to build the foundation of almost all queries.

By convention, we use these keywords in uppercase to distinguish them in the code. Other elements, such as column names, are typically written in lowercase, but an important exception is the table names, which I recommend keeping in uppercase for easy identification of the data source.

SELECT * FROM cap02.TB_SHIPS;         

The asterisk (*) in the SELECT * FROM query means that we are requesting all data from all columns of the specified table. In this case, the TB_SHIPS table is located in the cap02 schema, which we created earlier. It’s important to note that the same table can exist in different schemas, so specifying the correct schema when performing the query is essential.

Getting Started

Our journey in SQL will involve using the language in various ways. Every SQL query follows a basic pattern: it starts with a reserved word, moves on to selecting columns, applies filters to rows, and extracts this information from one or more tables. This process is the essence of how SQL operates, offering a flexible structure that can be adapted to meet various needs.

We will begin with simple queries and gradually introduce more complex elements. This will allow us to extract increasingly sophisticated results aligned with our specific interests. By increasing the complexity of the queries, we also enhance our ability to manipulate and analyze data in various ways, exploring the full potential of SQL in different scenarios and demands.

Selecting Data with the SELECT Clause

The SELECT statement is undoubtedly the most used and fundamental in SQL. As the name suggests, SELECT allows the selection of data from one or more tables. With it, we can retrieve all rows and columns from a table, simply by applying the syntax we’ve already explored previously, without the need for filters.

However, the true versatility of SELECT is revealed when we want to select specific columns. Let’s say we are only interested in the first column of our table, ship_name. For this, the syntax is as follows:

SELECT ship_name
FROM cap02.TB_SHIPS;

Note that the SELECT statement returns the columns we choose and, since no filter is applied, it also retrieves all the rows from the table.

SQL is essentially a query language designed to retrieve data from a database. The SELECT statement is the primary tool for this purpose. When we want to perform insert, update, or delete operations, we turn to DML statements, which I will explore later.

SELECT is truly the cornerstone of SQL, allowing us to select and query data without making any changes to the table. The data remains unchanged while we query information and apply filters, aggregations, and operations as needed. Next, we will take our first step in applying filters.

Filtering Data with the WHERE Clause

To refine our queries, we use the WHERE clause in the SELECT statement. This clause allows us to establish filter criteria to select only the rows that meet certain conditions.

To provide context, let’s say we want to identify the ships classified as “D” in the Risk Classification column. This means we are interested in vessels that have unsatisfactory sanitary conditions, with immediate requirements to fulfill. The query with the filter would look like this:

SELECT ship_name, season
FROM cap02.TB_SHIPS
WHERE risk_classification = 'D';        

In this query, we select two columns, ship_name and season, from the TB_SHIPS table in the ‘cap02’ schema.

The condition specified in the WHERE clause checks whether risk_classification is equal to ‘D’, and only the rows that meet this criterion are returned.

For additional validation, we can include the risk_classification column in the selection to confirm the results.

SELECT ship_name, risk_classification, season
FROM cap02.TB_SHIPS
WHERE risk_classification = 'D';        

This query will provide us with the ships that, according to the Anvisa data, received the lowest risk classification over the past seasons.

Ordering Results with the ORDER BY Clause in SQL

In addition to filtering columns and rows, it’s common to want to sort the results of our SQL queries based on specific criteria. For this purpose, we use the ORDER BY clause.

Let’s consider the previous example where we filtered the ships with a risk classification of ‘D’. Now, we want the results to be displayed in alphabetical order based on the ship_name column. To perform this sorting, we add the ORDER BY clause as follows:

SELECT ship_name, season, risk_classification
FROM cap02.TB_SHIPS
WHERE risk_classification = 'D'
ORDER BY ship_name;        

In this query, we are selecting the three specified columns in the SELECT statement, filtering only the rows with a risk classification of ‘D’, and ordering the results in ascending alphabetical order based on the ‘ship_name’ column.

Sorting results is a common practice to enhance readability and facilitate data analysis, making SQL an even more flexible and effective tool.
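By default, ORDER BY sorts in ascending order. Adding the DESC keyword after the column name reverses the direction; the small variation of the same query below would list the ships in reverse alphabetical order:

SELECT ship_name, season, risk_classification
FROM cap02.TB_SHIPS
WHERE risk_classification = 'D'
ORDER BY ship_name DESC;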

Applying Logical Operators in SQL Queries

As we refine our SQL queries, it’s common to need to apply multiple filtering criteria using logical operators. Let’s consider a scenario where we want to select ships classified as ‘D’ in the ‘risk_classification’ column and that also have a ‘risk_score’ greater than 1000.

The query below illustrates how we can combine these criteria using the ‘AND’ logical operator:

SELECT ship_name, season, risk_classification, risk_score
FROM cap02.TB_SHIPS
WHERE risk_classification = 'D' AND risk_score > 1000
ORDER BY ship_name;        

In this query, we are selecting the four columns specified in the SELECT statement and filtering the rows based on the criteria:

  • ‘risk_classification’ must be equal to ‘D’.
  • ‘risk_score’ must be greater than 1000.

The results will be ordered alphabetically based on the ‘ship_name’ column.

The use of logical operators, such as ‘AND’, allows us to create complex SQL queries with multiple filtering criteria to extract specific information from the data.

Understanding the Behavior of the ‘AND’ Logical Operator

To understand the behavior of the ‘AND’ logical operator in SQL queries, it’s important to highlight that it requires both specified criteria to be true for a row to be included in the result. Consider the following variation of the previous query, with a higher risk-score threshold:

SELECT ship_name, season, risk_classification, risk_score
FROM cap02.TB_SHIPS
WHERE risk_classification = 'D' AND risk_score > 3000
ORDER BY ship_name;        

In this query, we are selecting the four specified columns in the SELECT statement and applying the following filtering criteria:

  • The ‘risk_classification’ column must be equal to ‘D’.
  • The ‘risk_score’ column must be greater than 3000.

If either of these criteria is not met for a particular row, that row will not be included in the result. In other words, both conditions need to be true for the row to be returned.

In the case of our example, if a ship has a risk classification other than ‘D’, even if its risk score is greater than 3000, it will not be included in the result. Similarly, if a ship has a risk classification of ‘D’ but its risk score is less than or equal to 3000, it will also be excluded.

Therefore, the logical operator ‘AND’ requires that all criteria be true for a row to be included in the query result.

Understanding the Behavior of the ‘OR’ Logical Operator

Now, let’s explore the behavior of the ‘OR’ logical operator in SQL queries. This logical operator allows a row to be included in the result if at least one of the specified conditions is true. Let’s analyze the following query as an example:

SELECT ship_name, season, risk_classification, risk_score
FROM cap02.TB_SHIPS
WHERE risk_classification = 'D' OR risk_score > 3000
ORDER BY ship_name;        

In this query, we are selecting the four specified columns in the SELECT statement and applying the following filter criteria:

  • The ‘risk_classification’ column must be equal to ‘D’.
  • The ‘risk_score’ column must be greater than 3000.

The fundamental difference between the ‘AND’ operator and the ‘OR’ operator is that with ‘OR’, only one of the conditions needs to be true for the row to be included in the result. This means that if a row has a risk classification equal to ‘D’ or a risk score greater than 3000 (or both), it will be returned.

For example, if a ship has a risk classification of ‘D’, even if its risk score is less than or equal to 3000, it will still be included in the result. Similarly, if a ship has a risk score greater than 3000, even if its risk classification is different from ‘D’, it will also be included.

The ‘OR’ operator provides flexibility by allowing one or more conditions to be met for a row to be returned in the query.

And If Both Conditions Are False?

If both conditions specified in the WHERE clause are false, no records will be returned.

SELECT ship_name, season, risk_classification, risk_score
FROM cap02.TB_SHIPS
WHERE risk_classification = 'E' OR risk_score > 3000
ORDER BY ship_name;        

For example, if we search for ships with an ‘E’ classification (a classification that does not exist in the data) or with a risk score higher than 3000, and no row meets these criteria, the result will be empty, with no records returned. This happens because no row satisfied at least one of the conditions specified in the WHERE clause.

Logical Operators

Logical operators combine one or more boolean (true or false) values and return a single true or false value as a result.

  • AND: The “AND” logical operator compares two boolean expressions and returns true only when both expressions are true.
  • OR: The “OR” logical operator compares two boolean expressions and returns true when at least one of the expressions is true.
  • NOT: The “NOT” logical operator takes a single boolean argument and inverts its value, turning true into false and false into true (see the short example after this list).
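As a quick illustration of NOT, the sketch below negates the risk-classification filter used earlier and returns every ship that is not in class ‘D’:

SELECT ship_name, risk_classification
FROM cap02.TB_SHIPS
WHERE NOT risk_classification = 'D'
ORDER BY ship_name;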

Special Operators

  • IN: The “IN” operator checks if a value is within a set of values separated by commas and retrieves the rows from the table that match.
  • BETWEEN: The “BETWEEN” operator tests an expression against a range, where the range consists of a start expression, followed by the “AND” keyword, and an end expression.
  • ANY: The “ANY” operator compares a value with every value in a list or the results of a subquery and evaluates as true if the comparison is true for at least one of them.
  • ALL: The “ALL” operator compares a value with every value in a list or the results of a subquery and evaluates as true only if the comparison is true for all of them (it is also true when the subquery returns no rows).
  • SOME: The “SOME” operator is a synonym for “ANY”: it evaluates as true if the comparison is true for at least one value in the list or subquery result.
  • EXISTS: The “EXISTS” operator checks for the existence of a result from a subquery. It tests whether the subquery retrieves at least one row and returns “FALSE” when no data is returned. A short sketch of BETWEEN and EXISTS follows this list.
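To make two of these operators concrete, here is a short sketch against the same TB_SHIPS table: a BETWEEN filter on the risk score and an EXISTS check built on a correlated subquery. The thresholds are illustrative, not taken from the Anvisa data.

-- Ships whose risk score falls within an illustrative range.
SELECT ship_name, risk_score
FROM cap02.TB_SHIPS
WHERE risk_score BETWEEN 151 AND 300;

-- Seasons that had at least one ship classified as 'D'.
SELECT DISTINCT season
FROM cap02.TB_SHIPS t
WHERE EXISTS (SELECT 1
              FROM cap02.TB_SHIPS d
              WHERE d.season = t.season
                AND d.risk_classification = 'D');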

Understanding the Behavior of the IN Logical Operator

Let’s explore the IN logical operator. Now, as we modify our query, we will include the compliance index, which represents the percentage of items in the inspection checklist that each vessel complies with.

The higher the index, the higher the compliance of the vessel. Therefore, we will now include compliance_index in our query.

SELECT ship_name, risk_classification, compliance_index, season
FROM cap02.TB_SHIPS
WHERE risk_classification = 'D'
ORDER BY ship_name;        

It’s important to note that the compliance index is a percentage value, and ships classified as “D” tend to have a lower compliance index.

SELECT ship_name, risk_classification, compliance_index, season
FROM cap02.TB_SHIPS
WHERE risk_classification = 'A'
ORDER BY ship_name;        

When we change the risk classification to “A,” we can expect a significant increase in the compliance index of the ships. This is because the “A” classification indicates that the ships have excellent sanitary conditions, which means that most items on the inspection checklist have been met.

Applying the IN Logical Operator

Now, let’s create a query to identify the ships that fall into risk classifications “A” or “B” and additionally have a compliance index greater than 98%. Let’s construct this query.

SELECT ship_name, risk_classification, compliance_index, season
FROM cap02.TB_SHIPS
WHERE risk_classification IN ('A', 'B') AND compliance_index > 98
ORDER BY ship_name;        

In this query, we are selecting the four specified columns from the cap02.TB_SHIPS table. We are filtering rows where the risk_classification is either 'A' or 'B' and the compliance_index is greater than 98%. The results will be ordered alphabetically by the ship_name column.

We can observe that all the returned vessels have an ‘A’ classification since our compliance index criterion is above 98%. Now, let’s adjust this criterion to 90% and check the result:

SELECT ship_name, risk_classification, compliance_index, season
FROM cap02.TB_SHIPS
WHERE risk_classification IN ('A', 'B') AND compliance_index > 90
ORDER BY ship_name;        

Notice that now we have vessels classified as “A” and “B,” all of them with a compliance index higher than 90%.

Limiting the Query to the First 10 Records

In our previous query, we retrieved several records that met the specified criteria. However, in some situations, we may prefer to limit the number of results returned, especially for an initial data visualization or to avoid overloading the database. Let’s take this opportunity to order our query by compliance index and limit the result to the first 10 records.

SELECT ship_name, risk_classification, compliance_index, season
FROM cap02.TB_SHIPS
WHERE risk_classification IN ('A', 'B') AND compliance_index > 90
ORDER BY compliance_index
LIMIT 10;        

In this way, we will be obtaining a view of the first 10 ships that meet the risk classification criteria of “A” or “B” and have a compliance index greater than 90%. This approach allows us to explore a sample of the data in a more controlled manner.

Exploring More Questions and Filters

In the previous topics, we learned how to apply conditional filters and multiple conditions to obtain the desired results in our SQL queries. As we progress, we will continue to explore different types of filters and query scenarios. Now, let’s address a specific question:

In April 2018, did any vessel have a compliance index of 100% and a risk score equal to 0?

SELECT ship_name,
       risk_classification,
       compliance_index,
       risk_score,
       season
FROM cap02.TB_SHIPS
WHERE month_year = '04/2018'
AND compliance_index > 90
AND risk_score = 0;        

We obtained only a single row as a result, which shows how the filters became progressively more restrictive throughout the query.

Now, we managed to find a single record that corresponds to a vessel that, in the month of April 2018, achieved a compliance index of 100% and a risk score of 0.

Observing the process we followed to find a single record in our previous query, it becomes evident that we can have multiple conditions depending on what we want to retrieve.

However, as we added specific conditions, we realized the need to balance the number of applied filters. While we successfully solved our specific problem, it’s important to remember that adding too many conditions can negatively impact the database performance.

Therefore, an important challenge is to use filters judiciously, solving the problem at hand without compromising the system’s performance.

Subquery Approach in SQL for Better Performance

SQL has a challenging feature: the ability to achieve the same result in several different ways. However, among these approaches, one may stand out in terms of performance, as discussed earlier.

Therefore, it’s not just about solving a specific problem but also about not compromising the overall database performance.

Now, we will explore the concept of a subquery to achieve the same result as the previous query but in a revamped way. We will use the following query:

SELECT ship_name, 
       risk_classification, 
       compliance_index, 
       risk_score, 
       season
FROM cap02.TB_SHIPS
WHERE compliance_index IN (SELECT compliance_index
                          FROM cap02.TB_SHIPS
                          WHERE compliance_index > 90)
     AND risk_score = 0 
     AND month_year = '04/2018'
ORDER BY compliance_index;        

In this query, we select the compliance index and keep it in the WHERE clause. However, now we are using the IN operator with a subquery enclosed in parentheses, which means a query within another query.

This technique is useful in various situations, solving a variety of problems and scenarios. Note that we have kept the other conditions after the parentheses.

Two Approaches, Different Performances

While both approaches have produced the same result, the way they impact the database performance differs significantly.

In the first approach, we used multiple conditions in the WHERE clause to filter the data. This can be slow on large datasets because each row is evaluated separately.

In the second approach, we used a subquery with the IN operator: first we obtain the desired compliance indices, and then we use these indices in the main WHERE clause. Depending on the data and on how the optimizer rewrites the query, this approach can change the number of condition evaluations performed in the outer query, so its benefit should be confirmed on your own database.

Therefore, although both approaches provide the same result, efficiency in query execution is crucial, especially in large databases. Choosing the right approach can significantly improve query performance.
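If you want to verify this on your own installation, MySQL’s EXPLAIN statement shows the execution plan without running the query. Prefixing each of the two versions with EXPLAIN and comparing the plans is the most reliable way to judge which one performs better on your data:

EXPLAIN
SELECT ship_name, risk_classification, compliance_index, risk_score, season
FROM cap02.TB_SHIPS
WHERE compliance_index > 90
  AND risk_score = 0
  AND month_year = '04/2018';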

Conclusion: Efficiency of the SQL Language

In this exploration of the SQL language, we have learned how to use filters, logical operators, and subqueries to access and analyze data in databases. We have seen how the careful use of these tools can lead to efficient queries and accurate results.

It is crucial to remember that SQL offers various ways to achieve the same result. However, choosing the right approach can make a significant difference in query performance, especially in extensive databases. Therefore, when creating SQL queries, it is essential to consider not only what you want to achieve but also how to do it efficiently.

Mastering the SQL language is a valuable skill for anyone working with data, as it allows for accessing specific information and effectively answering complex questions. So, while exploring SQL, remember to balance simplicity with efficiency to achieve the best possible results in your queries.

Thank you!