Vector Embeddings and Fuzzy Matching with SQL
Richard Conway
CEO @ Elastacloud | Group CTO @ Acora | Transforming businesses one byte at a time
In the previous post we saw some interesting things relating to the use of embeddings, with some stats, and showed how to build out a simple workflow that gets embeddings and then calculates similarities between two lists. In this post we'll look at how we can do the same thing with the new features in SQL Server 2025 (and Azure SQL).
We'll start with the SQL setup and talk through how we can build out the same vector embeddings solution, and then see what both the timings and performance look like.
Let's remind ourselves of the previous post. In that post we ran a Python script which fetched all of our embeddings and put them into a CSV file. We then built a series of folds and used Fabric Spark pools to write the results into 10 buckets of cosine similarity so that we could work through the closest matches.
As such this is a two-phase workflow: get the embeddings for the two lists of names, then calculate a cosine similarity using the dot product formulation we discussed, multiplying every vector in one list with every vector in the other (a cross join).
Azure SQL has a new Vector type and new functions which allow us to do this quite simply, so we'll pick that up and work out how best to commoditise it in SQL.
It's worth saying that I'm running this on a serverless SQL instance in Azure with a minimum of 0.5 vCores, a maximum of 2 and 32GB of memory. At the time of writing there's an offer that gives you the first 100K vCore seconds free within a month period!
Firstly we're going to create some scoped credentials in an Azure SQL instance that you create. At the time of writing the Vector type and functions are in preview, so everything you do in this article should be available to you. For the purposes of the test I have two tables in SQL that I've declared like this:
CREATE TABLE List1(
    [Name] nvarchar(1056),
    VectorValue Vector(1536)
);

CREATE TABLE List2(
    [Name] nvarchar(1056),
    VectorValue Vector(1536)
);
In my first table I have between 5,000 and 6,000 records, and in my second table just under 12,000 records.
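Before wiring anything up to the API, it's worth a quick sanity check that the preview Vector type and vector_distance function are available on your instance. This is just a toy example with 3-dimensional vectors, not part of the workflow:

-- Cast JSON array strings to the Vector type and measure a distance
DECLARE @v1 VECTOR(3) = CAST('[1, 0, 0]' AS VECTOR(3));
DECLARE @v2 VECTOR(3) = CAST('[0, 1, 0]' AS VECTOR(3));
SELECT vector_distance('cosine', @v1, @v2) AS cosine_distance; -- orthogonal vectors, so expect 1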
Let's create the scoped credentials now:
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '@cora123#!';
SELECT *
FROM sys.symmetric_keys
WHERE name = '##MS_DatabaseMasterKey##';
if exists(select * from sys.[database_scoped_credentials] where name = 'https://[uri].cognitiveservices.azure.com')
begin
    drop database scoped credential [https://[uri].cognitiveservices.azure.com];
end

create database scoped credential [https://[uri].cognitiveservices.azure.com]
with identity = 'HTTPEndpointHeaders', secret = '{"api-key": "[key]"}';
go
You should already have your Azure OpenAI instance set up. Mine is registered as a cognitiveservices endpoint for the text-embedding-3-small model that we discussed in the last post. Wherever the service lives, get your API key and copy it, along with the endpoint address, into the SQL above; the key will be used every time the credential is. The second statement should return a row for the master key, which you'll need before you add your scoped credential.
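If you want to double-check that the credential landed, the catalogue view should show it:

SELECT name, credential_identity
FROM sys.database_scoped_credentials;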
We're going to create our first stored procedure. I took this from the Microsoft docs and tweaked it for my use case:
CREATE OR ALTER PROCEDURE dbo.get_embedding
    @inputText NVARCHAR(MAX),
    @embedding VECTOR(1536) OUTPUT
AS
BEGIN
    -- Declare variables
    DECLARE @retval INT, @response NVARCHAR(MAX);
    DECLARE @payload NVARCHAR(MAX) = JSON_OBJECT('input': @inputText);
    DECLARE @url NVARCHAR(1000) = 'https://[uri].cognitiveservices.azure.com/openai/deployments/text-embedding-3-small/embeddings?api-version=2023-05-15';

    -- Call the external REST endpoint
    EXEC @retval = sp_invoke_external_rest_endpoint
        @url = @url,
        @method = 'POST',
        @credential = [https://[uri].cognitiveservices.azure.com],
        @payload = @payload,
        @response = @response OUTPUT;

    -- Declare a variable to hold the extracted embedding
    DECLARE @re NVARCHAR(MAX) = NULL;
    IF (@retval = 0)
    BEGIN
        -- Successful call: extract the embedding array from the response JSON
        SET @re = JSON_QUERY(@response, '$.result.data[0].embedding');
    END
    ELSE
    BEGIN
        -- Log the error message
        PRINT 'Error processing Name: ' + @inputText + ' | API Response: ' + @response;
    END

    -- Set the output embedding if successful
    SET @embedding = CAST(@re AS VECTOR(1536));
    RETURN @retval;
END;
GO
Again, replace the [uri] with your actual URI endpoint for Azure OpenAI, and also ensure that the embeddings model name after deployments in the path matches your deployment. I've called mine text-embedding-3-small, which should be apparent from the URL above.
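A quick way to smoke-test the procedure before running it over a whole table (the input string here is just an example):

DECLARE @v VECTOR(1536), @rc INT;
EXEC @rc = dbo.get_embedding
    @inputText = N'Billy Bob Thornton',
    @embedding = @v OUTPUT;
SELECT @rc AS return_code, @v AS embedding; -- return_code of 0 means the call succeeded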
Now we're going to consume the above in another stored proc:
CREATE OR ALTER PROCEDURE dbo.UpdateEmbeddings
    --@TableName NVARCHAR(128) -- Name of the table to update
AS
BEGIN
    SET NOCOUNT ON;

    -- Declare variables
    DECLARE @sql NVARCHAR(MAX);
    DECLARE @name NVARCHAR(MAX);
    DECLARE @retval INT;
    DECLARE @embedding VECTOR(1536);
    DECLARE @successCount INT = 0; -- Initialize success count for the session

    -- Cursor over every row that doesn't yet have an embedding
    DECLARE dynamic_cursor CURSOR FOR
        SELECT [Name]
        FROM List1
        WHERE [VectorValue] IS NULL;

    -- Open the cursor
    OPEN dynamic_cursor;

    -- Fetch the first record
    FETCH NEXT FROM dynamic_cursor INTO @name;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- Call the stored procedure to get the embedding
        EXEC @retval = dbo.get_embedding
            @inputText = @name,
            @embedding = @embedding OUTPUT;

        -- If successful, update the table
        IF @retval = 0
        BEGIN
            UPDATE List1
            SET [VectorValue] = @embedding
            WHERE [Name] = @name;
            SET @successCount = @successCount + 1;
            -- Print success message
            PRINT 'Successfully processed Name: ' + @name + ' | Success Count: ' + CAST(@successCount AS NVARCHAR(10));
        END
        ELSE
        BEGIN
            -- Print error message
            PRINT 'Error retrieving embedding for: ' + @name;
        END

        -- Fetch the next record
        FETCH NEXT FROM dynamic_cursor INTO @name;
    END

    -- Close and deallocate the cursor
    CLOSE dynamic_cursor;
    DEALLOCATE dynamic_cursor;
END
GO
So this stored procedure uses a cursor to fetch each [Name] value, calls the prior get_embedding stored proc to get the vector embedding, and then updates the table. I'd hoped to use dynamic SQL by passing in the table name, but it went wonky somewhere along the line with a cast exception, so for now I've hardcoded the table name, which means you'll need a copy of the proc for each table. If you figure out the dynamic SQL please add something in the comments and I'll update the article.
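For what it's worth, one possible shape for the dynamic version is below. This is an untested sketch: it round-trips the embedding through NVARCHAR(MAX) to avoid passing the preview VECTOR type as an sp_executesql parameter (which may well be where my cast exception came from), and it relies on the cursor declared inside sp_executesql being visible afterwards, which holds because cursors are GLOBAL by default (unless your database sets CURSOR_DEFAULT to LOCAL). @TableName is assumed to be one of your own trusted table names.

CREATE OR ALTER PROCEDURE dbo.UpdateEmbeddingsDynamic
    @TableName SYSNAME
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @sql NVARCHAR(MAX);
    DECLARE @name NVARCHAR(MAX);
    DECLARE @vec NVARCHAR(MAX);
    DECLARE @retval INT;
    DECLARE @embedding VECTOR(1536);

    -- Declare the cursor inside dynamic SQL; QUOTENAME guards the table name
    SET @sql = N'DECLARE dyn_cur CURSOR FOR SELECT [Name] FROM '
             + QUOTENAME(@TableName) + N' WHERE [VectorValue] IS NULL;';
    EXEC sp_executesql @sql;

    OPEN dyn_cur;
    FETCH NEXT FROM dyn_cur INTO @name;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        EXEC @retval = dbo.get_embedding @inputText = @name, @embedding = @embedding OUTPUT;
        IF @retval = 0
        BEGIN
            -- Round-trip the vector through NVARCHAR(MAX) so we don't have to
            -- pass the preview VECTOR type as an sp_executesql parameter
            SET @vec = CAST(@embedding AS NVARCHAR(MAX));
            SET @sql = N'UPDATE ' + QUOTENAME(@TableName)
                     + N' SET [VectorValue] = CAST(@v AS VECTOR(1536)) WHERE [Name] = @n;';
            EXEC sp_executesql @sql, N'@v NVARCHAR(MAX), @n NVARCHAR(MAX)', @v = @vec, @n = @name;
        END
        FETCH NEXT FROM dyn_cur INTO @name;
    END
    CLOSE dyn_cur;
    DEALLOCATE dyn_cur;
END
GO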
After a while you should have all of the embeddings in the VectorValue field of each of the tables.
We had a lot of stats in the last post, where the approach was application-centric rather than data-centric. There we were able to batch about 3K embeddings per minute. Even with the relatively beefy SQL instance I have it's far less than that: it takes on average 15 minutes to generate 10K embeddings, and SQL seems to flush the printed progress roughly every 34 iterations. We can probably tweak this to make it much faster, but for now it will do. Still, that's 667 per minute, which ain't great compared to our 3K per minute supercharged version from the last post.
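One obvious lever is that the embeddings endpoint accepts an array of inputs per request, so the per-row round trip could be amortised. A hedged sketch of what the payload change would look like (the names here are just examples; the response should then carry one embedding per input, in input order):

-- Batch several names into one request instead of one call per row
DECLARE @payload NVARCHAR(MAX) = JSON_OBJECT(
    'input': JSON_ARRAY(N'Billy Bob Thornton', N'Billy Bob Thorn')
);
-- Each embedding then comes back at $.result.data[0].embedding, $.result.data[1].embedding, ...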
Anyway, Phase 1 is over; hopefully SQL will come into its own now as we compare Vector values across the tables List1 and List2.
SELECT
    a.Name AS List1Name,
    b.Name AS List2Name,
    vector_distance('cosine', a.VectorValue, b.VectorValue) AS cosine_distance
FROM
    List1 a
INNER JOIN
    List2 b
ON
    vector_distance('cosine', a.VectorValue, b.VectorValue) < 0.2;
As a matter of course, the way I did it before took a whole heap of compute power and days of runtime, and you have to precompute rather than compute on the fly. With this SQL version the calculation is incredibly quick thanks to optimisations in the vector_distance function as well as in the Vector type itself. It needs to cross-join and match roughly 62M vector pairs, but on my SQL instance it takes just under 3 minutes! This is orders of magnitude better than the prior version.
In the prior article we looked at 600K rows against 350K rows, leading to around 200 billion comparisons. For now I'm not testing at that scale, but I will shortly, to see whether it scales linearly. If it does, the best permutation is most likely a combination of the cosine similarity in SQL and the application process for getting the embeddings, which is roughly 5x as quick as SQL.
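As an aside, if you only want the single closest match per name rather than every pair under a threshold, a CROSS APPLY variant along these lines should work (a sketch against the same two tables, untested at scale):

SELECT
    a.Name AS List1Name,
    m.Name AS List2Name,
    m.cosine_distance
FROM List1 a
CROSS APPLY (
    -- Pick the List2 row with the smallest cosine distance for this List1 row
    SELECT TOP (1)
        b.Name,
        vector_distance('cosine', a.VectorValue, b.VectorValue) AS cosine_distance
    FROM List2 b
    ORDER BY vector_distance('cosine', a.VectorValue, b.VectorValue)
) m;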
An example output from my lists, which are actor catalogues, is:
Billy Bob Thornton | Billy Bob Thorn | 0.0914451116996523
In the prior post we worked with cosine similarity, which ran from 0 to 1 with 1 being the closest match. Here, according to the docs, vector_distance returns a cosine distance, where 0 represents identical vectors and 2 represents opposite vectors; in other words, distance = 1 - similarity. Hence the query above filters on < 0.2, which corresponds to a similarity of > 0.8, covering roughly the top couple of buckets from the previous approach.
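If you'd rather report in similarity terms to stay consistent with the previous post, the conversion is just 1 minus the distance:

SELECT
    a.Name AS List1Name,
    b.Name AS List2Name,
    1 - vector_distance('cosine', a.VectorValue, b.VectorValue) AS cosine_similarity
FROM List1 a
INNER JOIN List2 b
    ON vector_distance('cosine', a.VectorValue, b.VectorValue) < 0.2;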
I was actually quite impressed. Thanks, Microsoft. Not perfect, but not bad at all.
In the next post I'll review the same workflow using EventHouse. It's not as optimal for this, but EventHouse is lovely and better for a lot of other things!
Happy trails.