?? "Navigating the Nexus of Knowledge Graphs and AI: Illuminating Insights from PDFs" ??
Lalit Moharana
AWS Community Builder || AI Enthusiast || Data Engineer || Product Engineer
Harnessing the Power of Knowledge Graphs and AI: A PDF-based Information Retrieval System
In this guide, we'll walk through an application that brings together knowledge graphs, Neo4j's graph database, and OpenAI's GPT-3.5. The goal is an interactive system that extracts text from PDF documents, stores it in a knowledge graph, and generates intelligent answers to user questions. Let's work through the implementation step by step.
1. Understanding the Components
Our stack combines a handful of complementary components:
- Streamlit for the interactive web interface
- PyPDF2 for extracting text from uploaded PDFs
- LangChain's CharacterTextSplitter for breaking the text into chunks
- HuggingFace sentence-transformers embeddings (all-mpnet-base-v2) for semantic vectors
- Neo4j as the graph database storing documents, chunks, and their embeddings
- OpenAI's GPT-3.5 (gpt-3.5-turbo) for answer generation
2. Breaking Down the Code
Let's immerse ourselves in the codebase, piece by piece, to comprehend its intricate workings.
Step 1: Initial Setup and Configuration
from dotenv import load_dotenv
import streamlit as st
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from neo4j import GraphDatabase
from langchain.embeddings import HuggingFaceEmbeddings
import openai
import os
# Load environment variables
load_dotenv()
openai.api_key = os.getenv('openai_secret_key')
In this section, we're importing the necessary libraries and tools. Notably, we're loading environment variables (such as OpenAI's API key) using dotenv.
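For reference, the code reads the API key from an environment variable named openai_secret_key, so a minimal .env file next to the script could look like the sketch below (the value is a placeholder, not a real key):

# .env (placeholder value; keep this file out of version control)
openai_secret_key=your-openai-api-key-here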
Step 2: Main Function and Model Setup
def main():
    st.set_page_config(page_title="Ask your PDF")
    st.header("Ask your PDF")
    # Embeddings setup
    model_name = "sentence-transformers/all-mpnet-base-v2"
    model_kwargs = {'device': 'cpu'}
    encode_kwargs = {'normalize_embeddings': False}
    embeddings = HuggingFaceEmbeddings(
        model_name=model_name,
        model_kwargs=model_kwargs,
        encode_kwargs=encode_kwargs
    )
    # Neo4j connection setup
    neo4j_uri = "bolt://localhost:7687"  # Replace with your Neo4j URI
    neo4j_user = ""  # Replace with your Neo4j username
    neo4j_password = ""  # Replace with your Neo4j password
    driver = GraphDatabase.driver(neo4j_uri, auth=(neo4j_user, neo4j_password))
The main() function serves as the entry point for our Streamlit application. Here we configure the page, load the sentence-transformers embedding model, and create the Neo4j driver that the rest of the app uses.
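Before moving on, it can be worth sanity-checking both pieces of setup. The snippet below is a minimal, illustrative sketch (not part of the app itself) that confirms the embedding dimensionality and the Neo4j connection; all-mpnet-base-v2 produces 768-dimensional vectors.

# Optional sanity checks (illustrative only)
vector = embeddings.embed_query("hello world")
print(len(vector))  # all-mpnet-base-v2 returns 768-dimensional embeddings
with driver.session() as session:
    ok = session.run("RETURN 1 AS ok").single()["ok"]
    print("Neo4j connection OK:", ok == 1)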
Step 3: Uploading and Extracting PDF Content
# Upload PDF and submit button
pdf = st.file_uploader("Upload your PDF", type="pdf")
submit = st.button("ADD PDF")
# Extract text from PDF and split into chunks
if submit:
    pdf_reader = PdfReader(pdf)
    text = ""
    for page in pdf_reader.pages:
        text += page.extract_text()
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )
    chunks = text_splitter.split_text(text)
    # Store in Neo4j
    with driver.session() as session:
        document_node = \
            session.run("CREATE (d:Document {text: $text, id: $id}) RETURN d", text=text, id=pdf.name).single()[0]
        for idx, chunk in enumerate(chunks):
            embedding = embeddings.embed_query(chunk)
            session.run("""MATCH (d:Document {id: $id})
            CREATE (c:Chunk {embedding: $embedding, chunk: $ch})
            CREATE (d)-[:HAS_EMBEDDING]->(c)""", embedding=embedding, id=pdf.name, ch=chunk)
This section covers the PDF upload, text extraction, splitting into chunks, and the subsequent storage of the chunks in the Neo4j graph database.
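Once a PDF has been added, you can confirm the resulting graph structure directly. The following is an illustrative sketch using the same driver and the Document and Chunk labels created above; it simply counts how many chunks are linked to each document.

# Illustrative check: how many chunks are linked to each document?
with driver.session() as session:
    records = session.run(
        "MATCH (d:Document)-[:HAS_EMBEDDING]->(c:Chunk) "
        "RETURN d.id AS document, count(c) AS chunks"
    )
    for record in records:
        print(record["document"], record["chunks"])

Each chunk's embedding is stored as a plain list of floats on its Chunk node, which is exactly what the GDS cosine similarity call in the next step operates on.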
Step 4: User Interaction and Answer Generation
# User input for question
user_question = st.text_input("Ask a question about your PDF:")
# Answer generation
if user_question:
    # Query Neo4j for the most similar chunk
    with driver.session() as session:
        result = session.run(
            "MATCH (d:Document)-[:HAS_EMBEDDING]->(c:Chunk) "
            "WITH c, gds.similarity.cosine(c.embedding, $user_embedding) AS similarity "
            "RETURN c.chunk AS Text ORDER BY similarity DESC LIMIT 1",
            user_embedding=embeddings.embed_query(user_question)
        )
        most_similar_chunk = result.single()[0]
    # Generate answer using GPT-3.5
    doc_text = most_similar_chunk
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant"},
            {"role": "user", "content": f"question: {user_question}\ntext = {doc_text}"}
        ],
    )
    qa_result = response["choices"][0]["message"]["content"]
    st.write(f"Question: {user_question}")
    st.write(f"Answer: {qa_result}")
This final section lets users type a question, uses Neo4j's Graph Data Science (GDS) library to find the most similar text chunk via cosine similarity, and passes that chunk to OpenAI's GPT-3.5 to generate a coherent, grounded answer.
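Note that gds.similarity.cosine comes from the Neo4j Graph Data Science (GDS) plugin, so the plugin must be installed on your Neo4j instance for the similarity query to run. As a quick, illustrative check (not part of the original app), you can ask the server for its GDS version:

# Illustrative check that the GDS plugin is available
with driver.session() as session:
    gds_version = session.run("RETURN gds.version() AS version").single()["version"]
    print("GDS version:", gds_version)

If the plugin is missing, the similarity query will fail because the function is unknown to the server.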
3. Embarking on Your Exploration
As you dive into this fusion of knowledge graphs, Neo4j, and AI, a range of possibilities opens up: the same pattern can power anything from tailored content recommendations to enriched semantic search.
Join us in the next post as we venture further into the realms of data-driven innovation. Feel free to share your thoughts, inquiries, and aspirations.
#KnowledgeGraphs #Neo4j #GDS #AI #OpenAI #GPT3 #PDFExtraction #SemanticEmbeddings #InformationRetrieval #DataInsights #Innovation #Streamlit #DataScience #NaturalLanguageProcessing #EnrichedData