postgres.new: a vector search conversation

postgres.new: a vector search conversation

I’ve just had an informative conversation with postgres.new, exploring PostgreSQL’s support for word embeddings and vector search. I’ve been meaning to look at this for a while; we use PostgreSQL extensively at 67 Bricks, and we are increasingly building products that leverage embeddings as a search mechanism.? In some scenarios it would be good to be able to use a tool we already understand well to do this, as an alternative to adopting an entirely new technology.

Postgres.new is PostgreSQL in your browser, running through the magic of PGlite and WASM, with visualisations and an LLM powered chat interface supplied by Supabase. I began the conversation with my tried and tested opening gambit:

I am building a system to ingest and tag documents.? I would like to store the following information: title, id, content, tags, embeddings

It helpfully created a documents table with the following schema:

CREATE TABLE documents (
    id bigint primary key generated always as identity,
    title text not null,
    content text not null,
    tags text[],
    embedding vector(384)
);        

That's great. Can you create some sample data; please make some of the articles about machine learning, and the others about the tradeoffs in using auto incrementing ids vs uuids for database identifiers.

Title: Introduction to Machine Learning
Content: Machine Learning is a field of artificial intelligence that uses statistical techniques to give computer systems the ability to learn from data, without being explicitly programmed.
Tags: ['machine learning', 'AI']
...
Title: Auto Incrementing IDs in Databases
Content: Auto incrementing IDs are simple to use and understand, but they can lead to contention in distributed systems.
Tags: ['database', 'auto increment']

Title: UUIDs vs Auto Incrementing IDs
Content: UUIDs (Universally Unique Identifiers) provide a unique identifier across systems, but they are larger and can be less performant.
Tags: ['database', 'UUID']
        

Neat! It generated some plausible tags for me as well.

Please can you show me the top 2 items which are similar to "artificial intelligence". Include a similarity score in the output.

Title: Introduction to Machine Learning
Content: Machine Learning is a field of artificial intelligence that uses statistical techniques to give computer systems the ability to learn from data, without being explicitly programmed.
Similarity Score: 0.1036

Title: Supervised Learning Explained
Content: Supervised learning is a type of machine learning where the model is trained on labeled data.
Similarity Score: 0.1498
        

The generated SQL was as follows:

select id, title, content, tags, embedding <= > (
    select embedding
    from meta.embeddings
    where id = 6
  ) as similarity_score
from
  documents
order by
  similarity_score
limit
  2;        

As an aside, Postgres.new creates embeddings for queries and stores them in the meta.embeddings table, because they can be quite large and expensive to send back and forth to the database (we went off on a side quest to find that out). Back to the main story:

What does the <=> operator do?

The <=> operator is used to calculate the cosine distance between two vectors in the pgvector extension for PostgreSQL…        

It then goes on to define cosine distance, and explains why a lower score means more similar content, which was going to be my next question.

Can you create a chart showing the similarity score vs title for the query “artificial intelligence” ?

So I now have an idea of how to create and query a schema using sentence embeddings in PostgreSQL.? This was all possible without installing any software, reading any documentation or indeed writing any SQL.? Well I didn’t write any SQL.? But before we go:

Please can you change the id column type to uuid.

要查看或添加评论,请登录

Simon Epstein的更多文章

社区洞察

其他会员也浏览了