Building a Sci-Fi Book Semantic Search Engine with Qdrant in 5 Minutes

In this tutorial, you will build a simple semantic search engine for science fiction books using Qdrant, a vector database. In under five minutes, you will be able to query your collection to find books relevant to a potential alien invasion.

Before You Begin

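  • Use Google Colab (or another Jupyter-style notebook) to run this code. The install commands below use the notebook "!" prefix to run shell commands.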

Installation

Process your data for the search engine. The Sentence Transformers library provides access to pretrained models that convert raw text into numerical representations, known as embeddings.

! pip install -U sentence-transformers        

Store the encoded data. Qdrant stores data as embeddings and lets you search over them.

! pip install -U qdrant-client        

Import the Models

from qdrant_client import models, QdrantClient
from sentence_transformers import SentenceTransformer        

The Sentence Transformers library offers a variety of models. This tutorial uses all-MiniLM-L6-v2, a small and fast encoder.

encoder = SentenceTransformer("all-MiniLM-L6-v2")        

Add Your Dataset

all-MiniLM-L6-v2 will encode the data you provide. Here, we will list a few science fiction books in our library, including their titles, descriptions, authors, and publication years.

documents = [
    {
        "name": "The Time Machine",
        "description": "A man travels through time and witnesses the evolution of humanity.",
        "author": "H.G. Wells",
        "year": 1895,
    },
    {
        "name": "Ender's Game",
        "description": "A young boy is trained to become a military leader in a war against an alien race shailesh.",
        "author": "Orson Scott Card",
        "year": 1985,
    },
    {
        "name": "Brave New World",
        "description": "A dystopian society where people are genetically engineered and conditioned to conform to a strict social hierarchy.",
        "author": "Aldous Huxley",
        "year": 1932,
    },
    {
        "name": "The Hitchhiker's Guide to the Galaxy",
        "description": "A comedic science fiction series following the misadventures of an unwitting human and his shailesh friend.",
        "author": "Douglas Adams",
        "year": 1979,
    },
    {
        "name": "Dune",
        "description": "A desert planet is the site of political intrigue and power struggles.",
        "author": "Frank Herbert",
        "year": 1965,
    },
    {
        "name": "Foundation",
        "description": "A mathematician develops a science to predict the future of humanity and works to save civilization from collapse.",
        "author": "Isaac Asimov",
        "year": 1951,
    },
    {
        "name": "Snow Crash",
        "description": "A futuristic world where the internet has evolved into a virtual reality metaverse.",
        "author": "Neal Stephenson",
        "year": 1992,
    },
    {
        "name": "Neuromancer",
        "description": "A hacker is hired to pull off a near-impossible hack and gets shailesh into a web of intrigue.",
        "author": "William Gibson",
        "year": 1984,
    },
    {
        "name": "The War of the Worlds",
        "description": "A Martian invasion of Earth throws humanity into chaos.",
        "author": "H.G. Wells",
        "year": 1898,
    },
    {
        "name": "The Hunger Games",
        "description": "A dystopian society where teenagers are forced to fight to the death in a televised spectacle.",
        "author": "Suzanne Collins",
        "year": 2008,
    },
    {
        "name": "The Andromeda Strain",
        "description": "A deadly virus from outer space threatens to wipe out humanity.",
        "author": "Michael Crichton",
        "year": 1969,
    },
    {
        "name": "The Left Hand of Darkness",
        "description": "A human ambassador is sent to a planet where the inhabitants are genderless and can change gender at will.",
        "author": "Ursula K. Le Guin",
        "year": 1969,
    },
    {
        "name": "The Three-Body Problem",
        "description": "Humans encounter an alien civilization that lives in a dying system.",
        "author": "Liu Cixin",
        "year": 2008,
    },
]        

Define Storage Location

Specify where Qdrant should store the encoded data (embeddings).

client = QdrantClient(":memory:")        

In this example, we are using temporary in-memory storage for demonstration purposes. The data lives only inside the running Python process and is lost when the session ends.

Create a Collection

Organize your data in Qdrant collections. Here, we will create a collection named "my_books" to store our book information.

client.recreate_collection(
    collection_name="my_books",
    vectors_config=models.VectorParams(
        size=encoder.get_sentence_embedding_dimension(),  # Vector size is defined by used model
        distance=models.Distance.COSINE,
    ),
)        

The recreate_collection function is useful when experimenting and running the script multiple times. It will remove any existing collection with the same name before creating a new one.

  • size: This parameter defines the size of the vectors for the collection. All vectors must have the same size so that distances between them can be calculated. For the chosen model, the encoder output has 384 dimensions, which is the value get_sentence_embedding_dimension() returns.
  • distance: This parameter specifies the method used to calculate the distance between two points. Here it is cosine distance.
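To make the distance parameter more concrete, here is a minimal sketch of cosine similarity in plain Python. It is only an illustration of the metric, not Qdrant's implementation, and the helper name cosine_similarity and the toy 3-dimensional vectors are assumptions made up for this example.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        # Mirrors why a collection needs one fixed vector size.
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors for illustration; real all-MiniLM-L6-v2 embeddings have 384 dimensions.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

Vectors pointing the same way score close to 1 regardless of their length, and unrelated (orthogonal) vectors score 0.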


Upload Data to the Collection

client.upload_points(
    collection_name="my_books",
    points=[
        models.PointStruct(
            id=idx, vector=encoder.encode(doc["description"]).tolist(), payload=doc
        )
        for idx, doc in enumerate(documents)
    ],
)        

This code uploads the documents list to the "my_books" collection. Each record is assigned a unique ID (its position in the list), a vector encoded from its description, and the full document as metadata in the payload field.
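The shape of each uploaded record can be sketched without any libraries. In the snippet below, toy_encode is a hypothetical stand-in for encoder.encode and does not produce real embeddings, and plain dictionaries stand in for models.PointStruct. The only purpose is to show which part of a document becomes the vector and which part becomes the payload.

```python
def toy_encode(text):
    # Hypothetical stand-in for encoder.encode(...); NOT a real embedding.
    return [float(len(text)), float(text.count(" "))]


sample_docs = [
    {"name": "Dune", "description": "A desert planet is the site of political intrigue and power struggles.", "year": 1965},
    {"name": "Foundation", "description": "A mathematician develops a science to predict the future of humanity and works to save civilization from collapse.", "year": 1951},
]

# Same comprehension shape as the upload call: id from enumerate,
# vector from the description only, payload is the whole document.
points = [
    {"id": idx, "vector": toy_encode(doc["description"]), "payload": doc}
    for idx, doc in enumerate(sample_docs)
]

for point in points:
    print(point["id"], point["payload"]["name"], len(point["vector"]))
```

Because only the description is encoded, searches match against descriptions, while fields such as author and year are returned with each hit and can be used for filtering.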

Ask the Engine a Question

Now that your data is stored in Qdrant, you can query it for relevant information.

hits = client.search(
    collection_name="my_books",
    query_vector=encoder.encode("shailesh").tolist(),
    limit=3,
)
for hit in hits:
    print(hit.payload, "score:", hit.score)        

The search results display the top three matches for the query string passed to encoder.encode (here, "shailesh"), along with their corresponding scores. With cosine distance, a higher score indicates a closer match to the query.

Output

{'name': "The Hitchhiker's Guide to the Galaxy", 'description': 'A comedic science fiction series following the misadventures of an unwitting human and his shailesh friend.', 'author': 'Douglas Adams', 'year': 1979} score: 0.511777191080082
{'name': 'Neuromancer', 'description': 'A hacker is hired to pull off a near-impossible hack and gets shailesh into a web of intrigue.', 'author': 'William Gibson', 'year': 1984} score: 0.40655241669185543
{'name': "Ender's Game", 'description': 'A young boy is trained to become a military leader in a war against an alien race shailesh.', 'author': 'Orson Scott Card', 'year': 1985} score: 0.2903679136510454        
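Conceptually, the search scores every stored vector against the query vector and returns the best few. The sketch below shows that idea as a brute-force scan over toy 2-dimensional vectors. It is an assumption-laden illustration only: Qdrant typically relies on an index rather than a full scan, and the names cosine, top_k, and toy_points are invented for this example.

```python
import heapq
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vector, points, limit):
    """Score every point against the query and keep the `limit` best, highest first."""
    scored = [(cosine(query_vector, p["vector"]), p["payload"]["name"]) for p in points]
    return heapq.nlargest(limit, scored, key=lambda pair: pair[0])


# Toy points; the vectors are made up and are not real embeddings.
toy_points = [
    {"vector": [0.0, 1.0], "payload": {"name": "C"}},
    {"vector": [1.0, 0.0], "payload": {"name": "A"}},
    {"vector": [-1.0, 0.0], "payload": {"name": "D"}},
    {"vector": [1.0, 1.0], "payload": {"name": "B"}},
]

results = top_k([1.0, 0.0], toy_points, limit=3)
for score, name in results:
    print(name, "score:", score)
```

The limit argument plays the same role as limit=3 in the search call: only the highest-scoring matches are returned, in descending order of score.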

Narrow Down the Query

Want to restrict the results to more recent books, for example those published in 1980 or later? Add a range filter on the year field.

hits = client.search(
    collection_name="my_books",
    query_vector=encoder.encode("shailesh").tolist(),
    query_filter=models.Filter(
        must=[models.FieldCondition(key="year", range=models.Range(gte=1980))]
    ),
    limit=2,
)
for hit in hits:
    print(hit.payload, "score:", hit.score)        

Output

{'name': 'Neuromancer', 'description': 'A hacker is hired to pull off a near-impossible hack and gets shailesh into a web of intrigue.', 'author': 'William Gibson', 'year': 1984} score: 0.40655241669185543
{'name': "Ender's Game", 'description': 'A young boy is trained to become a military leader in a war against an alien race shailesh.', 'author': 'Orson Scott Card', 'year': 1985} score: 0.2903679136510454        
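The effect of the range condition can be sketched in plain Python. The helper matches_range below is hypothetical and only mimics the inclusive gte / lte bounds on a payload field; it is not how Qdrant evaluates filters internally.

```python
def matches_range(payload, key, gte=None, lte=None):
    """Return True if payload[key] exists and lies within the inclusive bounds."""
    value = payload.get(key)
    if value is None:
        return False
    if gte is not None and value < gte:
        return False
    if lte is not None and value > lte:
        return False
    return True


books = [
    {"name": "The Time Machine", "year": 1895},
    {"name": "Neuromancer", "year": 1984},
    {"name": "Ender's Game", "year": 1985},
    {"name": "Dune", "year": 1965},
]

# Equivalent in spirit to Range(gte=1980) on the "year" key.
recent = [b["name"] for b in books if matches_range(b, "year", gte=1980)]
print(recent)
```

Only records whose payload passes the condition are eligible as search hits, which is why the filtered output contains just the books from 1984 and 1985.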
