?? "Navigating the Nexus of Knowledge Graphs and AI: Illuminating Insights from PDFs" ??
Lalit Moharana
AWS Community Builder || AI Enthusiast || Data Engineer || Product Engineer
Harnessing the Power of Knowledge Graphs and AI: A PDF-based Information Retrieval System
In this guide, we'll walk through an application that brings together knowledge graphs, Neo4j's graph database, and OpenAI's GPT-3.5. The goal is an interactive system that extracts text from PDF documents, stores it in a knowledge graph, and generates intelligent answers to user questions. Let's work through the implementation step by step.
1. Understanding the Components
Our stack combines a handful of complementary components:
- Streamlit for the interactive web interface
- PyPDF2 for extracting text from uploaded PDFs
- LangChain's CharacterTextSplitter for breaking the text into chunks
- HuggingFace sentence-transformers embeddings (all-mpnet-base-v2) for semantic vectors
- Neo4j as the graph database storing documents, chunks, and their embeddings
- OpenAI's GPT-3.5 (gpt-3.5-turbo) for answer generation
2. Breaking Down the Code
Let's immerse ourselves in the codebase, piece by piece, to comprehend its intricate workings.
Step 1: Initial Setup and Configuration
from dotenv import load_dotenv
import streamlit as st
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from neo4j import GraphDatabase
from langchain.embeddings import HuggingFaceEmbeddings
import openai
import os
# Load environment variables
load_dotenv()
openai.api_key = os.getenv('openai_secret_key')
In this section, we're importing the necessary libraries and tools. Notably, we're loading environment variables (such as OpenAI's API key) using dotenv.
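For reference, the code reads the API key from an environment variable named openai_secret_key, so a minimal .env file next to the script could look like the sketch below (the value is a placeholder, not a real key):

# .env (placeholder value; keep this file out of version control)
openai_secret_key=your-openai-api-key-here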
Step 2: Main Function and Model Setup
def main():
    st.set_page_config(page_title="Ask your PDF")
    st.header("Ask your PDF")
    # Embeddings setup
    model_name = "sentence-transformers/all-mpnet-base-v2"
    model_kwargs = {'device': 'cpu'}
    encode_kwargs = {'normalize_embeddings': False}
    embeddings = HuggingFaceEmbeddings(
        model_name=model_name,
        model_kwargs=model_kwargs,
        encode_kwargs=encode_kwargs
    )
    # Neo4j connection setup
    neo4j_uri = "bolt://localhost:7687"  # Replace with your Neo4j URI
    neo4j_user = ""  # Replace with your Neo4j username
    neo4j_password = ""  # Replace with your Neo4j password
    driver = GraphDatabase.driver(neo4j_uri, auth=(neo4j_user, neo4j_password))
The main() function serves as the entry point for our Streamlit application. Here we configure the page, load the sentence-transformers embedding model, and create the Neo4j driver that the rest of the app uses.
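Before moving on, it can be worth sanity-checking both pieces of setup. The snippet below is a minimal, illustrative sketch (not part of the app itself) that confirms the embedding dimensionality and the Neo4j connection; all-mpnet-base-v2 produces 768-dimensional vectors.

# Optional sanity checks (illustrative only)
vector = embeddings.embed_query("hello world")
print(len(vector))  # all-mpnet-base-v2 returns 768-dimensional embeddings
with driver.session() as session:
    ok = session.run("RETURN 1 AS ok").single()["ok"]
    print("Neo4j connection OK:", ok == 1)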
Step 3: Uploading and Extracting PDF Content
# Upload PDF and submit button
pdf = st.file_uploader("Upload your PDF", type="pdf")
submit = st.button("ADD PDF")
# Extract text from PDF and split into chunks
if submit:
    pdf_reader = PdfReader(pdf)
    text = ""
    for page in pdf_reader.pages:
        text += page.extract_text()
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )
    chunks = text_splitter.split_text(text)
    # Store in Neo4j
    with driver.session() as session:
        document_node = \
            session.run("CREATE (d:Document {text: $text, id: $id}) RETURN d", text=text, id=pdf.name).single()[0]
        for idx, chunk in enumerate(chunks):
            embedding = embeddings.embed_query(chunk)
            session.run("""MATCH (d:Document {id: $id})
            CREATE (c:Chunk {embedding: $embedding, chunk: $ch})
            CREATE (d)-[:HAS_EMBEDDING]->(c)""", embedding=embedding, id=pdf.name, ch=chunk)
This section covers the PDF upload, text extraction, splitting into chunks, and the subsequent storage of the chunks in the Neo4j graph database.
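Once a PDF has been added, you can confirm the resulting graph structure directly. The following is an illustrative sketch using the same driver and the Document and Chunk labels created above; it simply counts how many chunks are linked to each document.

# Illustrative check: how many chunks are linked to each document?
with driver.session() as session:
    records = session.run(
        "MATCH (d:Document)-[:HAS_EMBEDDING]->(c:Chunk) "
        "RETURN d.id AS document, count(c) AS chunks"
    )
    for record in records:
        print(record["document"], record["chunks"])

Each chunk's embedding is stored as a plain list of floats on its Chunk node, which is exactly what the GDS cosine similarity call in the next step operates on.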
Step 4: User Interaction and Answer Generation
# User input for question
user_question = st.text_input("Ask a question about your PDF:")
# Answer generation
if user_question:
    # Query Neo4j for the most similar chunk
    with driver.session() as session:
        result = session.run(
            "MATCH (d:Document)-[:HAS_EMBEDDING]->(c:Chunk) "
            "WITH c, gds.similarity.cosine(c.embedding, $user_embedding) AS similarity "
            "RETURN c.chunk AS Text ORDER BY similarity DESC LIMIT 1",
            user_embedding=embeddings.embed_query(user_question)
        )
        most_similar_chunk = result.single()[0]
    # Generate answer using GPT-3.5
    doc_text = most_similar_chunk
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant"},
            {"role": "user", "content": f"question: {user_question}\ntext = {doc_text}"}
        ],
    )
    qa_result = response["choices"][0]["message"]["content"]
    st.write(f"Question: {user_question}")
    st.write(f"Answer: {qa_result}")
This final section lets users type a question, uses Neo4j's Graph Data Science (GDS) library to find the most similar text chunk via cosine similarity, and passes that chunk to OpenAI's GPT-3.5 to generate a coherent, grounded answer.
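Note that gds.similarity.cosine comes from the Neo4j Graph Data Science (GDS) plugin, so the plugin must be installed on your Neo4j instance for the similarity query to run. As a quick, illustrative check (not part of the original app), you can ask the server for its GDS version:

# Illustrative check that the GDS plugin is available
with driver.session() as session:
    gds_version = session.run("RETURN gds.version() AS version").single()["version"]
    print("GDS version:", gds_version)

If the plugin is missing, the similarity query will fail because the function is unknown to the server.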
3. Embarking on Your Exploration
As you dive into this fusion of knowledge graphs, Neo4j, and AI, a range of possibilities opens up: the same pattern can power anything from tailored content recommendations to enriched semantic search.
Join us in the next post as we venture further into the realms of data-driven innovation. Feel free to share your thoughts, inquiries, and aspirations.
#KnowledgeGraphs #Neo4j #GDS #AI #OpenAI #GPT3 #PDFExtraction #SemanticEmbeddings #InformationRetrieval #DataInsights #Innovation #Streamlit #DataScience #NaturalLanguageProcessing #EnrichedData