GeekOut time: Exploring Complex SQL Queries with Natural Language

GeekOut time: Exploring Complex SQL Queries with Natural Language

(Also on Constellar tech blog: https://medium.com/the-constellar-digital-technology-blog/geekout-time-exploring-complex-sql-queries-with-natural-language-50bfc37aba23)

The world of databases is becoming more accessible thanks to tools that convert natural language into SQL. One such tool is Text2SQL.ai, which simplifies querying databases by allowing users to input queries in plain English and automatically generating the corresponding SQL commands. This can be particularly helpful for users who may not be familiar with SQL syntax but need to interact with databases.

In this blog, we will explore how Text2SQL.ai performs when querying the Chinook Database, a sample dataset that models a digital music store. We’ll start with a simple query and then move on to more complex queries, demonstrating how well Text2SQL.ai handles these different levels of complexity.

The Chinook Database Overview

The Chinook Database has a typical structure you’d expect from a digital music store. It includes the following tables:

  • Artist: Stores information about music artists.
  • Album: Contains albums by artists.
  • Track: Lists individual tracks with details like duration and price.
  • Genre: Describes the genre of each track.
  • MediaType: Represents the format of each track.

With this schema, we’ll test how well Text2SQL.ai can generate SQL for different types of queries. We load the schema and data first.

-- Create the Artist table
CREATE TABLE Artist (
ArtistId INTEGER PRIMARY KEY AUTOINCREMENT,
Name NVARCHAR(120)
);
-- Insert sample data into Artist table
INSERT INTO Artist (Name) VALUES ('AC/DC');
INSERT INTO Artist (Name) VALUES ('Aerosmith');
INSERT INTO Artist (Name) VALUES ('Queen');
-- Create the Album table
CREATE TABLE Album (
AlbumId INTEGER PRIMARY KEY AUTOINCREMENT,
Title NVARCHAR(160),
ArtistId INTEGER,
FOREIGN KEY (ArtistId) REFERENCES Artist(ArtistId)
);
-- Insert sample data into Album table
INSERT INTO Album (Title, ArtistId) VALUES ('High Voltage', 1);
INSERT INTO Album (Title, ArtistId) VALUES ('Rocks', 2);
INSERT INTO Album (Title, ArtistId) VALUES ('A Night at the Opera', 3);
-- Create the Genre table
CREATE TABLE Genre (
GenreId INTEGER PRIMARY KEY AUTOINCREMENT,
Name NVARCHAR(120)
);
-- Insert sample data into Genre table
INSERT INTO Genre (Name) VALUES ('Rock');
INSERT INTO Genre (Name) VALUES ('Jazz');
INSERT INTO Genre (Name) VALUES ('Metal');
-- Create the MediaType table
CREATE TABLE MediaType (
MediaTypeId INTEGER PRIMARY KEY AUTOINCREMENT,
Name NVARCHAR(120)
);
-- Insert sample data into MediaType table
INSERT INTO MediaType (Name) VALUES ('MPEG audio file');
INSERT INTO MediaType (Name) VALUES ('Protected AAC audio file');
INSERT INTO MediaType (Name) VALUES ('Protected MPEG-4 video file');
-- Create the Track table
CREATE TABLE Track (
TrackId INTEGER PRIMARY KEY AUTOINCREMENT,
Name NVARCHAR(200),
AlbumId INTEGER,
MediaTypeId INTEGER,
GenreId INTEGER,
Composer NVARCHAR(220),
Milliseconds INTEGER,
Bytes INTEGER,
UnitPrice DECIMAL(10, 2),
FOREIGN KEY (AlbumId) REFERENCES Album(AlbumId),
FOREIGN KEY (MediaTypeId) REFERENCES MediaType(MediaTypeId),
FOREIGN KEY (GenreId) REFERENCES Genre(GenreId)
);
-- Insert sample data into Track table
INSERT INTO Track (Name, AlbumId, MediaTypeId, GenreId, Composer, Milliseconds, Bytes, UnitPrice)
VALUES ('It\'s A Long Way To The Top', 1, 1, 1, 'Angus Young, Malcolm Young, Bon Scott', 320000, 10485760, 0.99);
INSERT INTO Track (Name, AlbumId, MediaTypeId, GenreId, Composer, Milliseconds, Bytes, UnitPrice)
VALUES ('Back In Black', 1, 1, 1, 'Angus Young, Malcolm Young, Brian Johnson', 300000, 10485760, 0.99);
INSERT INTO Track (Name, AlbumId, MediaTypeId, GenreId, Composer, Milliseconds, Bytes, UnitPrice)
VALUES ('Sweet Emotion', 2, 1, 1, 'Steven Tyler, Tom Hamilton', 250000, 8192000, 0.99);
INSERT INTO Track (Name, AlbumId, MediaTypeId, GenreId, Composer, Milliseconds, Bytes, UnitPrice)
VALUES ('Bohemian Rhapsody', 3, 2, 1, 'Freddie Mercury', 354000, 12800000, 1.29);        

Scenario 1: Simple Query

Query: List all artists.

We’ll start with a very basic query that retrieves the names of all artists in the database. This is a straightforward query where we expect Text2SQL.ai to generate a simple SELECT statement.

Natural Language: List all artists.

SQL Expected:

SELECT *
FROM Artist         

SQL Generated:

Not bad, I thought it should be “Select * From Artist” though. It selects all columns from the Artist table and will return all the artist names. Text2SQL.ai performs on this simple task.

Scenario 2: Mid-Complexity Query

Query: Show the albums by AC/DC.

For the next query, we want to find all albums by the artist AC/DC. This requires joining the Album and Artist tables to match the artist’s name with their albums.

Natural Language: Show the albums by AC/DC.

SQL Expected:

SELECT Album.Title 
FROM Album 
JOIN Artist ON Album.ArtistId = Artist.ArtistId 
WHERE Artist.Name = 'AC/DC';        

SQL Generated:

Text2SQL.ai correctly generates a SQL query that joins the Album and Artist tables and filters for albums by AC/DC. It demonstrates its ability to handle more complex queries involving JOIN operations.

Scenario 3: Complex Query

Query: Find the total duration of all tracks by Queen.

In this query, we’re asking to calculate the total duration of all tracks by the artist Queen. This requires joining the Track, Album, and Artist tables and summing up the total duration of the tracks.

Natural Language: Find the total duration of all tracks by Queen.

SQL Expected:

SELECT SUM(Track.Milliseconds) 
FROM Track 
JOIN Album ON Track.AlbumId = Album.AlbumId 
JOIN Artist ON Album.ArtistId = Artist.ArtistId 
WHERE Artist.Name = 'Queen';        

SQL Generated:

The generated SQL is spot on. Text2SQL.ai correctly uses the SUM function to calculate the total duration of all tracks by Queen by joining the necessary tables. This shows that Text2SQL.ai is capable of handling aggregation functions and more complex joins.

Scenario 4: More Complex Query

Query: Find the top 3 most expensive albums and their total track count, then return the artist with the most albums from that list.

This query is quite complex. It involves multiple steps:

  1. Find the top 3 most expensive albums by summing the prices of their tracks.
  2. Count the number of tracks for each of those albums.
  3. Identify the artist with the most albums among these top 3 albums.

This requires multiple joins, aggregation functions, and the use of Common Table Expressions (CTEs) to break down the query into manageable parts.

Natural Language: Find the top 3 most expensive albums and the total number of tracks on each. Then find the artist with the most albums among these.

SQL Expected:

WITH
  AlbumPrices AS (
    SELECT
      Album.AlbumId,
      Album.Title,
      Album.ArtistId,
      SUM(Track.UnitPrice) AS TotalPrice,
      COUNT(Track.TrackId) AS TrackCount
    FROM
      Album
      JOIN Track ON Album.AlbumId = Track.AlbumId
    GROUP BY
      Album.AlbumId
  ),
  TopAlbums AS (
    SELECT
      AlbumId,
      Title,
      ArtistId,
      TotalPrice,
      TrackCount
    FROM
      AlbumPrices
    ORDER BY
      TotalPrice DESC
    LIMIT
      3
  ),
  ArtistAlbumCount AS (
    SELECT
      ArtistId,
      COUNT(AlbumId) AS AlbumCount
    FROM
      TopAlbums
    GROUP BY
      ArtistId
  )
SELECT
  a.Name AS ArtistName,
  aac.AlbumCount
FROM
  ArtistAlbumCount aac
  JOIN Artist a ON aac.ArtistId = a.ArtistId
ORDER BY
  aac.AlbumCount DESC
LIMIT
  1;        

SQL Generated:

The query is correct and efficiently broken down into several steps:

  • AlbumPrices: This CTE calculates the total price of each album and counts the number of tracks.
  • TopAlbums: This CTE selects the top 3 albums based on their total price.
  • ArtistAlbumCount: This CTE counts how many albums each artist has among the top 3 albums.
  • Finally, the query returns the artist with the most albums from the top 3 most expensive ones.

This shows that Text2SQL.ai is capable of handling complex SQL scenarios involving multiple CTEs, joins, and aggregation functions. It can handle advanced use cases with ease.

Conclusion

Text2SQL.ai performed exceptionally well in our tests using the Chinook Database, from simple queries to highly complex ones. The tool is capable of generating accurate SQL for a range of use cases, whether you’re asking for a simple list of artists or performing multi-step aggregations and joins. There might be other tools outside as well. After testing the Chinook Database, a great next step would be to explore whether Text2SQL.ai can handle more complicated scenarios. The ability to convert natural language into SQL opens up exciting possibilities, especially when dealing with complicated financial schemas like ERP systems. If large language models (LLMs) can effectively transform complex business logic into accurate SQL queries, they could greatly streamline processes for non-technical users who need insights from their databases. For example, financial analysts or procurement managers could simply ask questions like, “What are the top overdue invoices over 60 days?” or “What vendors have the highest outstanding balances?” without needing any SQL knowledge. This could significantly reduce the reliance on technical teams for database interactions and accelerate decision-making in organizations.

Enjoy and have fun!

要查看或添加评论,请登录

Nedved Yang的更多文章

社区洞察

其他会员也浏览了