GeekOut time: Exploring Complex SQL Queries with Natural Language
(Also on Constellar tech blog: https://medium.com/the-constellar-digital-technology-blog/geekout-time-exploring-complex-sql-queries-with-natural-language-50bfc37aba23)
The world of databases is becoming more accessible thanks to tools that convert natural language into SQL. One such tool is Text2SQL.ai, which simplifies querying databases by allowing users to input queries in plain English and automatically generating the corresponding SQL commands. This can be particularly helpful for users who may not be familiar with SQL syntax but need to interact with databases.
In this blog, we will explore how Text2SQL.ai performs when querying the Chinook Database, a sample dataset that models a digital music store. We’ll start with a simple query and then move on to more complex queries, demonstrating how well Text2SQL.ai handles these different levels of complexity.
The Chinook Database Overview
The Chinook Database has a typical structure you’d expect from a digital music store. It includes the following tables:
With this schema, we’ll test how well Text2SQL.ai can generate SQL for different types of queries. We load the schema and data first.
-- Create the Artist table
CREATE TABLE Artist (
ArtistId INTEGER PRIMARY KEY AUTOINCREMENT,
Name NVARCHAR(120)
);
-- Insert sample data into Artist table
INSERT INTO Artist (Name) VALUES ('AC/DC');
INSERT INTO Artist (Name) VALUES ('Aerosmith');
INSERT INTO Artist (Name) VALUES ('Queen');
-- Create the Album table
CREATE TABLE Album (
AlbumId INTEGER PRIMARY KEY AUTOINCREMENT,
Title NVARCHAR(160),
ArtistId INTEGER,
FOREIGN KEY (ArtistId) REFERENCES Artist(ArtistId)
);
-- Insert sample data into Album table
INSERT INTO Album (Title, ArtistId) VALUES ('High Voltage', 1);
INSERT INTO Album (Title, ArtistId) VALUES ('Rocks', 2);
INSERT INTO Album (Title, ArtistId) VALUES ('A Night at the Opera', 3);
-- Create the Genre table
CREATE TABLE Genre (
GenreId INTEGER PRIMARY KEY AUTOINCREMENT,
Name NVARCHAR(120)
);
-- Insert sample data into Genre table
INSERT INTO Genre (Name) VALUES ('Rock');
INSERT INTO Genre (Name) VALUES ('Jazz');
INSERT INTO Genre (Name) VALUES ('Metal');
-- Create the MediaType table
CREATE TABLE MediaType (
MediaTypeId INTEGER PRIMARY KEY AUTOINCREMENT,
Name NVARCHAR(120)
);
-- Insert sample data into MediaType table
INSERT INTO MediaType (Name) VALUES ('MPEG audio file');
INSERT INTO MediaType (Name) VALUES ('Protected AAC audio file');
INSERT INTO MediaType (Name) VALUES ('Protected MPEG-4 video file');
-- Create the Track table
CREATE TABLE Track (
TrackId INTEGER PRIMARY KEY AUTOINCREMENT,
Name NVARCHAR(200),
AlbumId INTEGER,
MediaTypeId INTEGER,
GenreId INTEGER,
Composer NVARCHAR(220),
Milliseconds INTEGER,
Bytes INTEGER,
UnitPrice DECIMAL(10, 2),
FOREIGN KEY (AlbumId) REFERENCES Album(AlbumId),
FOREIGN KEY (MediaTypeId) REFERENCES MediaType(MediaTypeId),
FOREIGN KEY (GenreId) REFERENCES Genre(GenreId)
);
-- Insert sample data into Track table
INSERT INTO Track (Name, AlbumId, MediaTypeId, GenreId, Composer, Milliseconds, Bytes, UnitPrice)
VALUES ('It\'s A Long Way To The Top', 1, 1, 1, 'Angus Young, Malcolm Young, Bon Scott', 320000, 10485760, 0.99);
INSERT INTO Track (Name, AlbumId, MediaTypeId, GenreId, Composer, Milliseconds, Bytes, UnitPrice)
VALUES ('Back In Black', 1, 1, 1, 'Angus Young, Malcolm Young, Brian Johnson', 300000, 10485760, 0.99);
INSERT INTO Track (Name, AlbumId, MediaTypeId, GenreId, Composer, Milliseconds, Bytes, UnitPrice)
VALUES ('Sweet Emotion', 2, 1, 1, 'Steven Tyler, Tom Hamilton', 250000, 8192000, 0.99);
INSERT INTO Track (Name, AlbumId, MediaTypeId, GenreId, Composer, Milliseconds, Bytes, UnitPrice)
VALUES ('Bohemian Rhapsody', 3, 2, 1, 'Freddie Mercury', 354000, 12800000, 1.29);
Scenario 1: Simple Query
Query: List all artists.
We’ll start with a very basic query that retrieves the names of all artists in the database. This is a straightforward query where we expect Text2SQL.ai to generate a simple SELECT statement.
Natural Language: List all artists.
SQL Expected:
SELECT *
FROM Artist
SQL Generated:
Not bad, I thought it should be “Select * From Artist” though. It selects all columns from the Artist table and will return all the artist names. Text2SQL.ai performs on this simple task.
Scenario 2: Mid-Complexity Query
Query: Show the albums by AC/DC.
For the next query, we want to find all albums by the artist AC/DC. This requires joining the Album and Artist tables to match the artist’s name with their albums.
Natural Language: Show the albums by AC/DC.
SQL Expected:
SELECT Album.Title
FROM Album
JOIN Artist ON Album.ArtistId = Artist.ArtistId
WHERE Artist.Name = 'AC/DC';
SQL Generated:
Text2SQL.ai correctly generates a SQL query that joins the Album and Artist tables and filters for albums by AC/DC. It demonstrates its ability to handle more complex queries involving JOIN operations.
领英推荐
Scenario 3: Complex Query
Query: Find the total duration of all tracks by Queen.
In this query, we’re asking to calculate the total duration of all tracks by the artist Queen. This requires joining the Track, Album, and Artist tables and summing up the total duration of the tracks.
Natural Language: Find the total duration of all tracks by Queen.
SQL Expected:
SELECT SUM(Track.Milliseconds)
FROM Track
JOIN Album ON Track.AlbumId = Album.AlbumId
JOIN Artist ON Album.ArtistId = Artist.ArtistId
WHERE Artist.Name = 'Queen';
SQL Generated:
The generated SQL is spot on. Text2SQL.ai correctly uses the SUM function to calculate the total duration of all tracks by Queen by joining the necessary tables. This shows that Text2SQL.ai is capable of handling aggregation functions and more complex joins.
Scenario 4: More Complex Query
Query: Find the top 3 most expensive albums and their total track count, then return the artist with the most albums from that list.
This query is quite complex. It involves multiple steps:
This requires multiple joins, aggregation functions, and the use of Common Table Expressions (CTEs) to break down the query into manageable parts.
Natural Language: Find the top 3 most expensive albums and the total number of tracks on each. Then find the artist with the most albums among these.
SQL Expected:
WITH
AlbumPrices AS (
SELECT
Album.AlbumId,
Album.Title,
Album.ArtistId,
SUM(Track.UnitPrice) AS TotalPrice,
COUNT(Track.TrackId) AS TrackCount
FROM
Album
JOIN Track ON Album.AlbumId = Track.AlbumId
GROUP BY
Album.AlbumId
),
TopAlbums AS (
SELECT
AlbumId,
Title,
ArtistId,
TotalPrice,
TrackCount
FROM
AlbumPrices
ORDER BY
TotalPrice DESC
LIMIT
3
),
ArtistAlbumCount AS (
SELECT
ArtistId,
COUNT(AlbumId) AS AlbumCount
FROM
TopAlbums
GROUP BY
ArtistId
)
SELECT
a.Name AS ArtistName,
aac.AlbumCount
FROM
ArtistAlbumCount aac
JOIN Artist a ON aac.ArtistId = a.ArtistId
ORDER BY
aac.AlbumCount DESC
LIMIT
1;
SQL Generated:
The query is correct and efficiently broken down into several steps:
This shows that Text2SQL.ai is capable of handling complex SQL scenarios involving multiple CTEs, joins, and aggregation functions. It can handle advanced use cases with ease.
Conclusion
Text2SQL.ai performed exceptionally well in our tests using the Chinook Database, from simple queries to highly complex ones. The tool is capable of generating accurate SQL for a range of use cases, whether you’re asking for a simple list of artists or performing multi-step aggregations and joins. There might be other tools outside as well. After testing the Chinook Database, a great next step would be to explore whether Text2SQL.ai can handle more complicated scenarios. The ability to convert natural language into SQL opens up exciting possibilities, especially when dealing with complicated financial schemas like ERP systems. If large language models (LLMs) can effectively transform complex business logic into accurate SQL queries, they could greatly streamline processes for non-technical users who need insights from their databases. For example, financial analysts or procurement managers could simply ask questions like, “What are the top overdue invoices over 60 days?” or “What vendors have the highest outstanding balances?” without needing any SQL knowledge. This could significantly reduce the reliance on technical teams for database interactions and accelerate decision-making in organizations.
Enjoy and have fun!