Vector Embeddings and Fuzzy Matching with SQL
Richard Conway
CEO @ Elastacloud | Group CTO @ Acora | Transforming businesses one byte at a time
In the previous post we saw some interesting things relating to the use of embeddings, with some stats, and showed how to build out a simple workflow that gets embeddings and then calculates similarities between two lists. In this post we'll look at how we can do the same thing with the new features in SQL Server 2025 (and Azure SQL).
We'll start with the SQL setup and talk through how we can build out the same vector embeddings solution, and then see what both the timings and performance look like.
Let's remind ourselves of the previous post. In that post we ran a Python script which fetched all of our embeddings and put them into a CSV file. We then built a series of folds and used Fabric Spark pools to write the results into 10 buckets of cosine similarity so that we could work through the closest matches.
As such this is a two-phase workflow: get the embeddings for the two lists of names, then calculate a cosine similarity using the dot product formulation we discussed, multiplying every vector in one list with every vector in the other (a cross join).
Azure SQL has a new Vector type and new functions which allow us to do this quite simply, so we'll pick that up and work out how best to commoditise it in SQL.
It's worth saying that I'm running this on a serverless SQL instance in Azure with a minimum of 0.5 vCores, a maximum of 2 and 32GB of memory. At the time of writing there's an offer that gives you the first 100K vCore seconds free within a month period!
Firstly we're going to create some scoped credentials in an Azure SQL instance that you create. At the time of writing the Vector type and functions are in preview, so everything you do in this article should be available to you. For the purposes of the test I have two tables in SQL that I've declared like this:
CREATE TABLE List1(
    [Name] nvarchar(1056),
    VectorValue Vector(1536)
);

CREATE TABLE List2(
    [Name] nvarchar(1056),
    VectorValue Vector(1536)
);
In my first table I have between 5,000 and 6,000 records, and in my second table just under 12,000 records.
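Before wiring anything up to the API, it's worth a quick sanity check that the preview Vector type and vector_distance function are available on your instance. This is just a toy example with 3-dimensional vectors, not part of the workflow:

-- Cast JSON array strings to the Vector type and measure a distance
DECLARE @v1 VECTOR(3) = CAST('[1, 0, 0]' AS VECTOR(3));
DECLARE @v2 VECTOR(3) = CAST('[0, 1, 0]' AS VECTOR(3));
SELECT vector_distance('cosine', @v1, @v2) AS cosine_distance; -- orthogonal vectors, so expect 1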
Let's create the scoped credentials now:
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '@cora123#!';
SELECT *
FROM sys.symmetric_keys
WHERE name = '##MS_DatabaseMasterKey##';
if exists(select * from sys.[database_scoped_credentials] where name = 'https://[uri].cognitiveservices.azure.com')
begin
    drop database scoped credential [https://[uri].cognitiveservices.azure.com];
end

create database scoped credential [https://[uri].cognitiveservices.azure.com]
with identity = 'HTTPEndpointHeaders', secret = '{"api-key": "[key]"}';
go
You should already have your Azure OpenAI instance set up. Mine is registered as a cognitiveservices endpoint for the text-embedding-3-small model that we discussed in the last post. Wherever the service lives, get your API key and copy it, along with the endpoint address, into the SQL above; the key will be used every time the credential is. The second statement should return a row for the master key, which you'll need before you add your scoped credential.
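If you want to double-check that the credential landed, the catalogue view should show it:

SELECT name, credential_identity
FROM sys.database_scoped_credentials;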
We're going to create our first stored procedure. I took this from the Microsoft docs and tweaked it for my use case:
CREATE OR ALTER PROCEDURE dbo.get_embedding
    @inputText NVARCHAR(MAX),
    @embedding VECTOR(1536) OUTPUT
AS
BEGIN
    -- Declare variables
    DECLARE @retval INT, @response NVARCHAR(MAX);
    DECLARE @payload NVARCHAR(MAX) = JSON_OBJECT('input': @inputText);
    DECLARE @url NVARCHAR(1000) = 'https://[uri].cognitiveservices.azure.com/openai/deployments/text-embedding-3-small/embeddings?api-version=2023-05-15';

    -- Call the external REST endpoint
    EXEC @retval = sp_invoke_external_rest_endpoint
        @url = @url,
        @method = 'POST',
        @credential = [https://[uri].cognitiveservices.azure.com],
        @payload = @payload,
        @response = @response OUTPUT;

    -- Declare a variable to hold the extracted embedding
    DECLARE @re NVARCHAR(MAX) = NULL;
    IF (@retval = 0)
    BEGIN
        -- Successful call: extract the embedding array from the response JSON
        SET @re = JSON_QUERY(@response, '$.result.data[0].embedding');
    END
    ELSE
    BEGIN
        -- Log the error message
        PRINT 'Error processing Name: ' + @inputText + ' | API Response: ' + @response;
    END

    -- Set the output embedding if successful
    SET @embedding = CAST(@re AS VECTOR(1536));
    RETURN @retval;
END;
GO
Again, replace the [uri] with your actual URI endpoint for Azure OpenAI, and also ensure that the embeddings model name after deployments in the path matches your deployment. I've called mine text-embedding-3-small, which should be apparent from the URL above.
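A quick way to smoke-test the procedure before running it over a whole table (the input string here is just an example):

DECLARE @v VECTOR(1536), @rc INT;
EXEC @rc = dbo.get_embedding
    @inputText = N'Billy Bob Thornton',
    @embedding = @v OUTPUT;
SELECT @rc AS return_code, @v AS embedding; -- return_code of 0 means the call succeeded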
Now we're going to consume the above in another stored proc:
CREATE OR ALTER PROCEDURE dbo.UpdateEmbeddings
    --@TableName NVARCHAR(128) -- Name of the table to update
AS
BEGIN
    SET NOCOUNT ON;

    -- Declare variables
    DECLARE @sql NVARCHAR(MAX);
    DECLARE @name NVARCHAR(MAX);
    DECLARE @retval INT;
    DECLARE @embedding VECTOR(1536);
    DECLARE @successCount INT = 0; -- Initialize success count for the session

    -- Cursor over every row that doesn't yet have an embedding
    DECLARE dynamic_cursor CURSOR FOR
        SELECT [Name]
        FROM List1
        WHERE [VectorValue] IS NULL;

    -- Open the cursor
    OPEN dynamic_cursor;

    -- Fetch the first record
    FETCH NEXT FROM dynamic_cursor INTO @name;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- Call the stored procedure to get the embedding
        EXEC @retval = dbo.get_embedding
            @inputText = @name,
            @embedding = @embedding OUTPUT;

        -- If successful, update the table
        IF @retval = 0
        BEGIN
            UPDATE List1
            SET [VectorValue] = @embedding
            WHERE [Name] = @name;
            SET @successCount = @successCount + 1;
            -- Print success message
            PRINT 'Successfully processed Name: ' + @name + ' | Success Count: ' + CAST(@successCount AS NVARCHAR(10));
        END
        ELSE
        BEGIN
            -- Print error message
            PRINT 'Error retrieving embedding for: ' + @name;
        END

        -- Fetch the next record
        FETCH NEXT FROM dynamic_cursor INTO @name;
    END

    -- Close and deallocate the cursor
    CLOSE dynamic_cursor;
    DEALLOCATE dynamic_cursor;
END
GO
So this stored procedure uses a cursor to fetch each [Name] value, calls the prior get_embedding stored proc to get the vector embedding, and then updates the table. I'd hoped to use dynamic SQL by passing in the table name, but it went wonky somewhere along the line with a cast exception, so for now I've hardcoded the table name, which means you'll need a copy of the proc for each table. If you figure out the dynamic SQL please add something in the comments and I'll update the article.
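For what it's worth, one possible shape for the dynamic version is below. This is an untested sketch: it round-trips the embedding through NVARCHAR(MAX) to avoid passing the preview VECTOR type as an sp_executesql parameter (which may well be where my cast exception came from), and it relies on the cursor declared inside sp_executesql being visible afterwards, which holds because cursors are GLOBAL by default (unless your database sets CURSOR_DEFAULT to LOCAL). @TableName is assumed to be one of your own trusted table names.

CREATE OR ALTER PROCEDURE dbo.UpdateEmbeddingsDynamic
    @TableName SYSNAME
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @sql NVARCHAR(MAX);
    DECLARE @name NVARCHAR(MAX);
    DECLARE @vec NVARCHAR(MAX);
    DECLARE @retval INT;
    DECLARE @embedding VECTOR(1536);

    -- Declare the cursor inside dynamic SQL; QUOTENAME guards the table name
    SET @sql = N'DECLARE dyn_cur CURSOR FOR SELECT [Name] FROM '
             + QUOTENAME(@TableName) + N' WHERE [VectorValue] IS NULL;';
    EXEC sp_executesql @sql;

    OPEN dyn_cur;
    FETCH NEXT FROM dyn_cur INTO @name;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        EXEC @retval = dbo.get_embedding @inputText = @name, @embedding = @embedding OUTPUT;
        IF @retval = 0
        BEGIN
            -- Round-trip the vector through NVARCHAR(MAX) so we don't have to
            -- pass the preview VECTOR type as an sp_executesql parameter
            SET @vec = CAST(@embedding AS NVARCHAR(MAX));
            SET @sql = N'UPDATE ' + QUOTENAME(@TableName)
                     + N' SET [VectorValue] = CAST(@v AS VECTOR(1536)) WHERE [Name] = @n;';
            EXEC sp_executesql @sql, N'@v NVARCHAR(MAX), @n NVARCHAR(MAX)', @v = @vec, @n = @name;
        END
        FETCH NEXT FROM dyn_cur INTO @name;
    END
    CLOSE dyn_cur;
    DEALLOCATE dyn_cur;
END
GO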
After a while you should have all of the embeddings in the VectorValue field of each of the tables.
We had a lot of stats in the last post, where the approach was application-centric rather than data-centric. There we were able to batch about 3K embeddings per minute. Even with the relatively beefy SQL instance I have it's far less than that: it takes on average 15 minutes to generate 10K embeddings, and SQL seems to flush the printed progress roughly every 34 iterations. We can probably tweak this to make it much faster, but for now it will do. Still, that's 667 per minute, which ain't great compared to our 3K per minute supercharged version from the last post.
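One obvious lever is that the embeddings endpoint accepts an array of inputs per request, so the per-row round trip could be amortised. A hedged sketch of what the payload change would look like (the names here are just examples; the response should then carry one embedding per input, in input order):

-- Batch several names into one request instead of one call per row
DECLARE @payload NVARCHAR(MAX) = JSON_OBJECT(
    'input': JSON_ARRAY(N'Billy Bob Thornton', N'Billy Bob Thorn')
);
-- Each embedding then comes back at $.result.data[0].embedding, $.result.data[1].embedding, ...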
Anyway, Phase 1 is over; hopefully SQL will come into its own now as we compare Vector values across the tables List1 and List2.
SELECT
    a.Name AS List1Name,
    b.Name AS List2Name,
    vector_distance('cosine', a.VectorValue, b.VectorValue) AS cosine_distance
FROM
    List1 a
INNER JOIN
    List2 b
ON
    vector_distance('cosine', a.VectorValue, b.VectorValue) < 0.2;
As a matter of course, the way I did it before took a whole heap of compute power and days of runtime, and you have to precompute rather than compute on the fly. With this SQL version the calculation is incredibly quick thanks to optimisations in the vector_distance function as well as in the Vector type itself. It needs to cross-join and match roughly 62M vector pairs, but on my SQL instance it takes just under 3 minutes! This is orders of magnitude better than the prior version.
In the prior article we looked at 600K rows against 350K rows, leading to around 200 billion comparisons. For now I'm not testing at that scale, but I will shortly, to see whether it scales linearly. If it does, the best permutation is most likely a combination of the cosine similarity in SQL and the application process for getting the embeddings, which is roughly 5x as quick as SQL.
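As an aside, if you only want the single closest match per name rather than every pair under a threshold, a CROSS APPLY variant along these lines should work (a sketch against the same two tables, untested at scale):

SELECT
    a.Name AS List1Name,
    m.Name AS List2Name,
    m.cosine_distance
FROM List1 a
CROSS APPLY (
    -- Pick the List2 row with the smallest cosine distance for this List1 row
    SELECT TOP (1)
        b.Name,
        vector_distance('cosine', a.VectorValue, b.VectorValue) AS cosine_distance
    FROM List2 b
    ORDER BY vector_distance('cosine', a.VectorValue, b.VectorValue)
) m;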
An example output from my lists, which are actor catalogues, is:
Billy Bob Thornton | Billy Bob Thorn | 0.0914451116996523
In the prior post we worked with cosine similarity, which ran from 0 to 1 with 1 being the closest match. Here, according to the docs, vector_distance returns a cosine distance, where 0 represents identical vectors and 2 represents opposite vectors; in other words, distance = 1 - similarity. Hence the query above filters on < 0.2, which corresponds to a similarity of > 0.8, covering roughly the top couple of buckets from the previous approach.
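If you'd rather report in similarity terms to stay consistent with the previous post, the conversion is just 1 minus the distance:

SELECT
    a.Name AS List1Name,
    b.Name AS List2Name,
    1 - vector_distance('cosine', a.VectorValue, b.VectorValue) AS cosine_similarity
FROM List1 a
INNER JOIN List2 b
    ON vector_distance('cosine', a.VectorValue, b.VectorValue) < 0.2;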
I was actually quite impressed. Thanks, Microsoft. Not perfect, but not bad at all.
In the next post I'll review the same workflow using EventHouse. It's not as optimal for this, but EventHouse is lovely and better for a lot of other things!
Happy trails.