Zifo Semantic Search Service - technical details
Author: Ross Burton
Two weeks ago, we shared the exciting development of Zifo Semantic Search Service, a game-changing solution that offers AI-powered search of documents both within and outside your company's knowledge base. Today, we want to share how we developed this solution, with an overview of the technologies employed.
Our semantic search service is built around retrieval: documents are converted into “embeddings” using large language models (LLMs), stored in a specialised vector database, and retrieved by ranking them against a query using a cosine similarity score.
A general overview is given in the diagram above. The first step is to build a vector database for the domain of interest. In our demonstration, titles and abstracts were used to create this vector database: 200,000 obtained from the PubMed Open Access dataset and a further 75,000 from the EBI BioStudies database. An LLM converts these documents into vector embeddings, numeric representations of the text in a shared embedding space, so that documents can be compared based on their semantic similarity. The embeddings are stored in a specialised vector database, and vectors are compared using cosine similarity. When presented with a new document, query text, or a question (step 2), the LLM computes an embedding that can be compared to those stored within the vector database. Relevant documents are found by ranking the stored vectors by their cosine distance to the query.
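As an illustrative sketch of this ranking step, the idea looks something like the snippet below. It uses an open-source sentence-embedding model from HuggingFace as a stand-in; the model choice and example texts are placeholders rather than our demonstration's configuration.

```python
# Sketch: embed a handful of abstracts and rank them against a query by cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # placeholder model choice

abstracts = [
    "CRISPR-Cas9 enables targeted genome editing in human cells.",
    "Deep learning models can predict protein structure from amino acid sequence.",
    "A randomised trial of statins for primary cardiovascular prevention.",
]
query = "machine learning for protein folding"

# Encode the documents and the query into the same embedding space.
doc_vectors = model.encode(abstracts, normalize_embeddings=True)
query_vector = model.encode(query, normalize_embeddings=True)

# With normalised vectors, cosine similarity is simply the dot product.
scores = doc_vectors @ query_vector

# Rank documents from most to least similar to the query.
for rank, idx in enumerate(np.argsort(-scores), start=1):
    print(f"{rank}. score={scores[idx]:.3f}  {abstracts[idx]}")
```

Because the vectors are normalised, cosine similarity reduces to a dot product, which is what makes this ranking cheap to compute at scale.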
To achieve what is described above, we utilised several key technologies that, in combination, delivered our technical demonstration. The diagram below provides an overview of our infrastructure:
We use Haystack by deepset to orchestrate document embedding and retrieval with LLMs. Open-source models are obtained from HuggingFace and stored within our infrastructure during deployment. The raw, unprocessed text of our documents is kept in a MongoDB database, and the vector embeddings in a Weaviate vector database. When performing document retrieval, Weaviate is first queried for semantically similar vector embeddings, and the corresponding raw text is then retrieved from MongoDB.
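To make this two-step lookup concrete, here is a simplified sketch written against Haystack 1.x-style APIs; the host, index, model, and field names are placeholders rather than our production configuration.

```python
# Sketch: retrieve semantically similar documents from Weaviate via Haystack,
# then fetch the corresponding raw text from MongoDB.
from haystack.document_stores import WeaviateDocumentStore
from haystack.nodes import EmbeddingRetriever
from haystack.pipelines import DocumentSearchPipeline
from pymongo import MongoClient

# Weaviate holds the vector embeddings (index name is a placeholder).
document_store = WeaviateDocumentStore(host="http://localhost", port=8080, index="Abstract")

# The retriever embeds the incoming query with an open-source model from HuggingFace.
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",  # placeholder model choice
)
pipeline = DocumentSearchPipeline(retriever)

# MongoDB holds the raw, unprocessed document text (database and collection names are placeholders).
mongo = MongoClient("mongodb://localhost:27017")
raw_texts = mongo["semantic_search"]["abstracts"]

result = pipeline.run(query="protein structure prediction", params={"Retriever": {"top_k": 5}})
for doc in result["documents"]:
    # Assume each embedded document carries the ID of its raw-text record in its metadata.
    record = raw_texts.find_one({"_id": doc.meta.get("source_id")})
    print(doc.score, record["title"] if record else "(raw text not found)")
```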
To serve queries in production, we use a FastAPI server that provides a REST API for interacting with Haystack and our databases. A Streamlit application then uses this REST API to serve the interface seen in our demonstration. The entire solution runs on Amazon Web Services (AWS), with each component deployed as a Docker container. The Streamlit application and the databases are hosted on Amazon Elastic Compute Cloud (EC2). The LLM layer, built on Haystack and exposed through FastAPI, is computationally expensive and is therefore deployed on Amazon Elastic Container Service (ECS) behind an auto-scaling load balancer.
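As a rough sketch of what such a REST layer can look like (the endpoint, request, and response shapes here are illustrative, not the actual API behind our demonstration):

```python
# Sketch: a FastAPI endpoint that exposes the retrieval pipeline as a REST API,
# which a Streamlit front end can call over HTTP.
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Semantic search API (sketch)")

class SearchRequest(BaseModel):
    query: str
    top_k: int = 5

class SearchHit(BaseModel):
    title: str
    score: float

@app.post("/search", response_model=List[SearchHit])
def search(request: SearchRequest) -> List[SearchHit]:
    # In the real service this handler would call the Haystack retrieval pipeline
    # and the MongoDB lookup described above; a canned result keeps the sketch self-contained.
    return [SearchHit(title=f"Placeholder result for: {request.query}", score=1.0)]
```

The Streamlit front end then only needs to issue HTTP requests against this endpoint and render the ranked results it gets back.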
We hope you have enjoyed this technical overview of our demonstration app. If you haven’t already, visit https://semanticboost.zifo-tech.com/ and try out our app. Stay tuned, because in the coming weeks we will be adding more models to our demo, including our own fine-tuned model for document retrieval. If you are interested in how Zifo’s Data Science team can support your use case or search challenges, please contact our team directly at [email protected]. We are here to help you solve your data integration and information search challenges.