My Experiment with Neo4j and the Power of Graph-Based RAG

Semantic search and knowledge graphs are the two predominant paradigms for working with knowledge sources.

Semantic search, as we know, leverages embeddings and semantic similarity to retrieve relevant results, while knowledge graphs structure data into nodes and edges to represent explicit relationships.

Semantic search is easier to implement but is limited when it comes to reasoning over the data. Knowledge graphs are way ahead here but are traditionally hard to build.

In this post, I’ll cover the challenges in building KGs and how tools like the Neo4j LLM Knowledge Graph Builder help reduce these pains.


This morning I ran an experiment comparing semantic-search-based RAG and KG-based RAG using the Apple Inc. 10-K report. (Yes, this is one of my favourite documents to use in experiments since I know it well.)

The goal was to analyse how both approaches perform in extracting insights from this multi-section document.

This document contains structured and unstructured data: supply chain details, financial metrics, risks, and corporate governance information.


Where the Knowledge Graph Took the Lead!

Semantic search leverages embeddings (OpenAI embeddings in my case) to retrieve relevant chunks of a doc based on cosine similarity with the user query.
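For reference, here’s a minimal sketch of that retrieval step, assuming the 10-K has already been split into text chunks. The model name, placeholder chunks, and helper functions are illustrative, not the exact code from my experiment:

```python
# Minimal embedding-based retrieval sketch (illustrative, not the exact
# experiment code). Assumes OPENAI_API_KEY is set in the environment.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Placeholder chunks; in practice these come from splitting the 10-K.
chunks = [
    "Apple depends on component suppliers concentrated in a few locations.",
    "Net sales by reportable segment: Americas, Europe, Greater China...",
]
chunk_vecs = embed(chunks)

def top_k(query: str, k: int = 2) -> list[str]:
    q = embed([query])[0]
    # Cosine similarity: normalised dot product between query and each chunk.
    sims = (chunk_vecs @ q) / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q)
    )
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

print(top_k("What supply chain risks does Apple face?"))
```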

A knowledge graph, on the other hand, structures the document into nodes (entities, e.g. Apple, suppliers, risks) and edges (relationships like "depends_on," "faces_risk_from"). This graph is then queried to provide precise, relationship-driven insights.
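For contrast, here’s a hedged sketch of how such nodes and edges could be written to Neo4j via the Python driver. The labels, relationship types, connection details, and the placeholder supplier are my assumptions for illustration, not the exact schema from the experiment:

```python
# Illustrative graph-construction sketch; labels, relationship types, and
# credentials are placeholder assumptions, not the experiment's schema.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # MERGE creates each node/edge only if it doesn't already exist.
    session.run(
        """
        MERGE (c:Company {name: $company})
        MERGE (s:Supplier {name: $supplier})
        MERGE (r:Risk {name: $risk})
        MERGE (c)-[:DEPENDS_ON]->(s)
        MERGE (c)-[:FACES_RISK_FROM]->(r)
        """,
        company="Apple",
        supplier="Example Supplier Co.",  # placeholder entity
        risk="Supply chain disruption",
    )
driver.close()
```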


Advantages of Knowledge Graphs in the Experiment:

  1. Multi-Hop Reasoning: The KG enabled tracing relationships across multiple nodes and connections. For example, a query about the supply chain risks Apple faces highlighted how dependencies on specific regions, such as manufacturing facilities in China, could impact production during regional COVID lockdowns. This layered reasoning surfaced insights that are difficult to uncover with semantic search.
  2. Explicit Relationships: Semantic search retrieved text snippets based on similarity, but the KG explicitly encoded relationships such as "currency fluctuations impact revenue streams." These explicit relationships made insights more explainable.
  3. Cross-Section Integration: The 10-K spreads related information across multiple sections, and KGs excel at integrating these data points into a unified structure. For example, financial risks mentioned in "Risk Factors" were directly linked to regional revenue breakdowns in the "Consolidated Financial Statements" section, ensuring no critical detail was overlooked and providing a holistic view of the document.
  4. Context Preservation: Semantic search struggled to maintain the context of a query, whereas the KG preserved the context of entities and their relationships, ensuring higher accuracy. For example, a query about Apple’s capital expenditures linked investment figures from the financial sections to the corresponding strategic initiatives.
  5. Scalability for Complex Queries: Complex queries, e.g. "Which regions contribute most significantly to Apple’s revenue, and how are they impacted by geopolitical risks?", are resolved efficiently with KGs. The graph’s structured nature enabled multi-hop traversal while retaining query efficiency (see the query sketch after this list).
  6. Data Normalization: The KG normalized and integrated data from structured tables, unstructured text, and hierarchical sections into a single coherent representation. For example, supplier information from the "Management’s Discussion and Analysis" section was aligned with corresponding risks and financial metrics.
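To make point 5 concrete, here’s what such a multi-hop query could look like through the Python driver. The labels and relationship types (Region, GENERATES_REVENUE_IN, EXPOSED_TO) are assumptions for illustration rather than the exact schema my graph ended up with:

```python
# Sketch of a multi-hop traversal: Company -> Region -> Risk.
# Schema names below are assumed for illustration.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (c:Company {name: 'Apple'})-[:GENERATES_REVENUE_IN]->(reg:Region)
MATCH (reg)-[:EXPOSED_TO]->(r:Risk)
RETURN reg.name AS region, collect(r.name) AS geopolitical_risks
ORDER BY region
"""

with driver.session() as session:
    for record in session.run(query):
        print(record["region"], "->", record["geopolitical_risks"])
driver.close()
```

A single Cypher pattern expresses both hops at once, where semantic search would have to stitch the answer together from separately retrieved chunks.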


The Challenge!

Creating a knowledge graph from unstructured documents is far more challenging than creating a vector database from embeddings.

  1. Entity and Relationship Extraction: Extracting entities like suppliers and risks would require fine-tuned Named Entity Recognition (NER) models and manual validation. For example, the term "supplier" appeared generically in multiple sections, requiring additional context to map it to specific entities (a minimal NER sketch follows this list).
  2. Ontology Design: Defining a schema to represent entities (e.g. Company/Risk/Region) and relationships (e.g. depends_on/operates_in) would involve iterative design. Balancing granularity (e.g. should "supply chain risk" be split into subtypes?) against usability could be very time-consuming.
  3. Data Integration Across Sections: The document contains diverse data formats: tabular (financials), textual (risk descriptions), and hierarchical (organizational structures). Aligning financial metrics (e.g. revenue by region) with textual mentions of risks would require custom preprocessing pipelines.
  4. Scalability: The initial graph would contain thousands of nodes and edges. Querying, visualizing, and validating subgraphs efficiently is tough.
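To illustrate point 1, here’s a minimal NER sketch using an off-the-shelf spaCy model; the sample sentence is paraphrased rather than quoted from the 10-K. A generic model tags organizations and places, but it can’t decide on its own which concrete entity a bare "supplier" mention refers to, which is exactly where fine-tuning and manual validation come in:

```python
# Minimal NER sketch with a generic spaCy model (illustrative only).
# Install the model first: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

text = (
    "The Company depends on component suppliers concentrated in a few "
    "locations, and manufacturing facilities in China were affected by "
    "regional COVID lockdowns."
)

# Print each detected entity with its type, e.g. China -> GPE (geopolitical
# entity). Note that "suppliers" is not resolved to any specific company.
for ent in nlp(text).ents:
    print(ent.text, ent.label_)
```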


How the Neo4j LLM Knowledge Graph Builder Comes to the Rescue!


To reduce the manual workload described above, I explored Neo4j’s LLM Knowledge Graph Builder. And it was just amazing:

  1. Entity Extraction: Automatically identified entities like companies, risks, and regions from the document.
  2. Ontology Suggestions: Provided a starting schema, reducing the need for manual ontology design.
  3. Direct Ingestion: Populated the graph in Neo4j, enabling immediate querying using Cypher.

While some manual adjustments were still necessary (e.g. fine-tuning entity disambiguation), the tool significantly accelerated the process.
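As a quick sanity check after ingestion, a sketch like the following can list whatever node labels and relationship types the builder actually produced; the connection details are placeholders:

```python
# Inspect the freshly populated graph; credentials are placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Built-in procedures that enumerate the graph's labels and edge types.
    labels = session.run("CALL db.labels() YIELD label RETURN label").value()
    rel_types = session.run(
        "CALL db.relationshipTypes() YIELD relationshipType "
        "RETURN relationshipType"
    ).value()

print("Node labels:", labels)
print("Relationship types:", rel_types)
driver.close()
```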

Within minutes I could explore my document’s entities and relationships:


The level of relationships in a document is insane, and a bit "scary," when you explore it.

Conclusion: The Future of Graph-Based RAG

My experiment highlighted the clear advantages of graph-based RAG over semantic search in handling complex, interconnected data like the Apple 10-K. While the upfront effort of building a knowledge graph is higher, the value it delivers in terms of reasoning, explainability, and cross-domain integration is unparalleled.

As tools like Neo4j’s LLM Knowledge Graph Builder continue to mature, the barrier to entry for graph-based RAG will lower, making it an essential strategy for enterprises handling rich, relational data.


#GenerativeAI #KnowledgeGraphs #RAG #Neo4j #AI #GraphDatabases


Dilum Bandara

Principal Research Scientist

1 month

I agree the Neo4j KG Builder is a good starting point on the topic. The only caveat for a beginner is that it focuses heavily on parallelisation, which makes the code long and hard to follow. Anyway, my experience trying to extend it to a custom application has been painful, with its `main` branch not being stable, its use of many deprecated LangChain libraries, and the need to write your own code to automate entity disambiguation.

Reply
Siddhant Agarwal

DevRel Guy | Graph Enthusiast | Google Developer Expert AI/ML| Ex-Google, IBM

1 month

Looks exciting. Would you be interested in talking about this at our next meetup in Delhi? https://www.meetup.com/graph-database-delhi-ncr/
