Vector Indexing plus Knowledge Graphs with Neo4j
TL;DR: Vector indexing is normally thought of as a common implementation technique for Generative AI. While this is true at a very rudimentary level, this blog discusses how to use vector indexes in Neo4j to complement normal (non-AI) graph analytics when poor data quality results in loose text-search-based retrievals. Along the way, it discusses the background to vector indexes and uses Neo4j’s Graph Data Science Client API in a Jupyter Notebook to run the queries against real data downloaded from www.clinicaltrials.gov in 2022.
Lies, lies and statistics
As a student, I hated statistics classes in high school and university. It seemed useless to me as it was so imprecise. I also tried to get out of typing in high school (yes, I’m that old) as I didn’t think I’d ever need that skill… and opted for a role in a software research group instead of a database group for my shore tour with the USN as I thought databases would be boring.
So far, I’m 0 for 3.
Getting back to statistics, as any decent Operations Analysis course will tell you, there are lies, lies and statistics… and given the same statistics they make you argue both sides of the discussion. From a concrete math perspective, it just… really, really bugged me. As I spent more time digging into the science of databases and query optimization at my last workplace, I saw statistics being used – and how those same statistics sometimes led to poorly performing queries – so it sorta reinforced my position – at least in my mind. My older sister, on the other hand, thought statistics as a mathematical discipline was great. But, hey, she also lives in New Jersey… and she’s an electrical engineer… two strikes – so she’s not perfect.
Sometime after joining Neo4j, I called her up and said – “Hey, they finally found a use for statistics – it’s called Data Science”. It’s the job of younger brothers to ping on their older sisters, and I love my job.
Stepping back into text searches
I mention that because one of the problems we in the database world have always been plagued with is “how do you query those huge string globs?” that we have in every database. Initially, most databases offered rudimentary text indexes and text searches – Neo4j among them. Other DBMS’s took this on as a specific challenge and focused exclusively on text search – e.g. Elasticsearch (today) and many predecessors. Such databases often excelled not only at finding words in text globs quickly but also at using stemming and (if you provide one) taxonomies and ontologies to assist with searches. I remember from my first foray into the world of genetics and gene sequencing the notion that searches for “fruit fly” and “Drosophila melanogaster” should be equivalent.
One of the other issues that using a specialized DBMS for text searches brings up is the problem of how you keep the text DBMS and the structured DBMS in sync. This can be achieved in a variety of ways – the most common is to use a message bus to send the data to both DBMS platforms – and then deal with the possible recovery scenarios of 2PC/distributed transactions, with the lack of ACID capabilities in many of the NoSQL DBMS’s complicating that… and now you get to maintain the integration of these technologies going forward. Always fun. Sometimes necessary. Never as simple as desired.
The problem was – and still is today – that text searches are extremely imprecise. If you search for “fruit fly” on Google, you get lots of hits – gazillions – but the relevance is a question. Are you studying entomology, or do you want to know how to get the pesky things out of your kitchen? The more words you provide, the more relevant the search results. This is where “scoring” enters the picture. When you run such a search, you get back a list of results with a relevance score – ideally sorted by the most relevant and terminating at a reasonable threshold.
Hold that thought – we will get back to it.
But then the perfect world of Google was shattered by “sponsored” results. I remember a while back when the company I worked for had an interesting challenge: trying to get searches for our product and command syntax to point to our documentation higher in the stack of results than completely unrelated topics – or competitor claims.
We’ll get back to that thought too.
To solve this, we often resorted to Natural Language Processing (NLP) – which involved using taxonomies, etc. in an attempt to parse gross scads of unstructured data into nice, neat, structured entities.
My point is… we might have (finally!) found a way to do better text searches without all the hassle of taxonomies, sentiment, ontologies, etc. And it involves statistics. Stuff’s getting more useful all the time.
Enter the world of LLM’s and vectors
My first introduction to the wonderful world of vectors was in high school physics when the teacher informed me that 1+1 does not always equal 2. Great – 10 years of educational experience down the tubes. So how do vectors and Large Language Models (LLM’s) come into the picture? Good question. And when you work in a company with a lot of classically trained data scientists (like I do) and you aren’t one (me), you often are playing catch-up (me again). What I learned is that text vectors come down to two key aspects: the angle between two vectors, and the length (magnitude) of each vector.
I think the best introduction I had to this topic was Kevin Henner’s excellent blog post “An intuitive introduction to text embeddings” as well as the discussion at Word Embedding Demo. Henner’s blog is a really good simplification of the problem, so I liked it as a starting point. In his blog, Henner used the example of dogs and cats – and articulated that if you look at the space of language as an unlimited set of dimensions and words as vectors, then when you compare two words there is the angle between the words and the distance between the vectors – two different metrics. While I like dogs and find cats only semi-useful in the world of domesticated pets, I opted to use apples and oranges for this discussion – because – well – who in technology hasn’t heard of the comparison of apples to oranges? Soooo… consider the following diagram:
Using the principles discussed in Henner’s blog, if comparing a green apple to a red apple, the angles in the “food dimension” would be very similar and the angular distance quite small. Comparing an apple to an orange, the angle increases, as does the distance – one is tropical, with a mostly inedible thicker skin, etc. And then we get the banana. Tropical fruit like an orange, inedible thicker skin – but definitely not round. At that point, we note that the angle and the angular distance have both increased significantly. Now, let’s consider a bushel of apples – there are dozens of apples (some of which are green and some are red in the pic above – not very realistic, coming from an area where apples were a staple, by the way). When considering text searches, one of the classic scoring relevance factors is the frequency with which a term occurs. So, when comparing 1 apple to a bushel, the angle is the same (or very close) but the size of the vector is considerably different due to the dozens of apples in a bushel. The same applies to a fruit basket that likely contains multiple apples – the length of the vector is different. In that case, the angle has also increased due to the inclusion of other fruit.
This is important because when comparing vectors representing words, you have to decide whether it is the distance (frequency) or the angle (similarity) that is more important. Since we are often comparing multiple words/vectors, this becomes the classic question of whether Cosine Similarity or Euclidean Distance is more desirable. The Euclidean Distance is nothing more than the good old Pythagorean theorem applied repeatedly across the dimensions (much like Cosine Similarity generalizes the familiar 2-D cosine to n dimensions): d(p,q) = sqrt((p1-q1)^2 + (p2-q2)^2 + … + (pn-qn)^2)
In his blog, Henner points out that as a result, {10 dogs, 1 cat} is much closer to {10 cats, 1 dog} than to {200 dogs, 1 cat} by Euclidean Distance – which is probably an undesirable result if you are searching for dog kennels. As a result, the general default is to use Cosine Similarity rather than Euclidean Distance when comparing text embeddings.
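A quick sketch in Python (using numpy) verifies Henner’s dog/cat point – Euclidean Distance treats {10 dogs, 1 cat} as far closer to {10 cats, 1 dog} than to a 200-dog kennel, while Cosine Similarity puts the kennel nearly on top of it:

```python
import numpy as np

def cosine_similarity(a, b):
    # angle-based: ignores vector length, so raw term frequency washes out
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    # length-sensitive: the Pythagorean theorem applied across dimensions
    return float(np.linalg.norm(a - b))

# dimensions: [dogs, cats]
ten_dogs = np.array([10.0, 1.0])
ten_cats = np.array([1.0, 10.0])
kennel   = np.array([200.0, 1.0])

# Euclidean: {10 dogs, 1 cat} looks far closer to {10 cats, 1 dog}...
print(euclidean_distance(ten_dogs, ten_cats))   # ~12.7
print(euclidean_distance(ten_dogs, kennel))     # 190.0
# ...while Cosine sees the kennel as nearly identical in "direction"
print(cosine_similarity(ten_dogs, ten_cats))    # ~0.198
print(cosine_similarity(ten_dogs, kennel))      # ~0.995
```

Which is exactly why cosine is the usual default for text embeddings.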
Neo4j, Text Embeddings & Vector Indexes
One of the outputs of LLM’s is the notion of text embeddings – essentially an array of vector components output by the LLM. Theoretically, this vector array should be smaller (in bytes stored) than the text itself – but… given that LLM embeddings range from 768 to 4096 dimensions, this may not always be the case… and I would argue likely not. OpenAI’s vectors are 1536 floating point values, or 1536 * 8 bytes (64 bits) = ~12KB – which is about 6 typed pages of text on that old-fashioned typewriter. Far larger than most of the large strings we store in a structured system. However, such a vector is much faster to search than doing string comparisons, as simple math operations can tell you the distance/angle between two different vectors – and the vector itself includes some semantic context, which allows for greater flexibility vs. manually searching for all the words you can think of in a taxonomy. Add in the CPU’s hardware acceleration for SIMD instructions on arrays and… way faster than string comparisons.
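The back-of-the-envelope storage math above (assuming 64-bit floats, as in the 1536 * 8 bytes example) works out like this – the 400000-node figure anticipates the clinical trial dataset used later in this post:

```python
# storage estimate for embedding vectors, assuming 64-bit (8-byte) floats
DIMS_OPENAI = 1536   # OpenAI embedding dimensions
DIMS_GECKO = 768     # VertexAI Gecko embedding dimensions
BYTES_PER_FLOAT = 8
NODES = 400_000

per_vector_openai = DIMS_OPENAI * BYTES_PER_FLOAT   # 12,288 bytes ~ 12KB
per_vector_gecko = DIMS_GECKO * BYTES_PER_FLOAT     # 6,144 bytes ~ 6KB
total_gecko_gb = per_vector_gecko * NODES / 1e9     # ~2.46 GB across 400K nodes
print(per_vector_openai, per_vector_gecko, round(total_gecko_gb, 2))
```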
There are multiple vector databases in the market today that specialize in rapid searches of vectors – including Pinecone and others.? There are two problems with this approach:
The reality is that unless you are storing the contents of “War & Peace” in a text LOB in your database, you probably need to store the unstructured large strings… in addition to the vector. I mean – if someone is going to do analysis, we as humans have a much better chance of understanding from the glob of characters whether the data is relevant than from just the vector array.
A common phrase heard that highlights the last point is the “hallucinations” that current LLM generative AI systems generate when they rely exclusively on vector databases. A blog post titled “Generative AI Benchmark: Increasing the Accuracy of LLMs in the Enterprise with a Knowledge Graph” cited a study using LLM’s, SQL DBMS’s and Graph DBMS’s for accuracy in answering questions and found that LLM’s were best grounded when supported by a knowledge graph. A quick chart from that post illustrates my point:
This isn't a one vs. the other bashing aspect - this is a "better together" approach.
This brings about the entire concept of Retrieval Augmented Generation (RAG) in which Generative AI answers are grounded by supplementing LLM responses with context from the knowledge graph.
The obvious question is how this is accomplished in the context of Neo4j. If we ignore the graph semantics and just focus on the unstructured text data in some properties, this is where vector indexes come in. When doing any search, we are often aided by an index. We tend to think of indexes as simplistic single-dimensional structures similar to the B-Tree – one of the most common indexes in DBMS technology. However, more complex indexes are available – I remember in Sybase IQ, we implemented a special datetime index which would not only support the simplistic range searches that a B-Tree supports – but also understood and indexed the date parts such as month, year, day of the week, etc. – and facilitated searches based on date parts.
With any index, the search boils down to the index entry comparison itself. For a standard B-Tree, the index entry comparison is a simple exact match. For vectors, this is not really a solution. Remember, we need to consider either the Cosine Similarity or the Euclidean Distance. A few months ago, in Neo4j 5.11, we introduced vector indexes, which support either Cosine Similarity or Euclidean Distance based searches.
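As a sketch, in Neo4j 5.11 such an index could be created with the `db.index.vector.createNodeIndex` procedure – the index name and property here match the ones used later in this post (newer releases also offer a `CREATE VECTOR INDEX` Cypher syntax):

```cypher
// create a vector index over the Study.criteriaEmbedding property
CALL db.index.vector.createNodeIndex(
  'Study_criteria_vector',   // index name
  'Study',                   // node label
  'criteriaEmbedding',       // property holding the embedding
  768,                       // dimensions (VertexAI Gecko)
  'cosine'                   // similarity function
)
```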
Much like text searches, vector searches return a relevance score. You can increase the relevance by further training the LLM – the equivalent of “sponsoring” results in Google.
An example with Clinical Trial recruitment
Even without entering the realm of Generative AI, vector indexes can be very useful as a technique to supplement typical text searches within a larger graph query. One of the common challenges faced by pharmaceutical companies is recruiting patients for clinical trials of a drug they are developing. While “direct to consumer” recruitment is the ideal method, the reality is that a clinical trial patient needs to have a supporting infrastructure of medical professionals who can determine if they fit the trial criteria and can monitor the impact of the drug as well as any adverse events – and report both with clinical accuracy to the manufacturer. As a consequence, one of the techniques for solving this challenge is to find facilities that have been involved in similar studies in the past or are currently involved in one, as such a facility has:
One of the common sources of clinical trial data can be found at www.clinicaltrials.gov – a site sponsored by the National Library of Medicine as part of the National Center for Biotechnology Information. It doesn’t contain all clinical trials – but it is a good source of publicly available clinical trial data. I loaded it into the following very simple model in Neo4j:
The original data is from what appears to be a very simple star schema implemented in PostgreSQL. The one thing I learned when loading this data into Neo4j is that the only thing worse than the handwriting of medical doctors is their data entry. For common conditions being researched, there are often dozens of condition entries with different spellings – or sometimes outright duplicates. For “Interventions”, for example, there were something like 24000 distinct entries for “placebo”. A good example is the following list of “Condition” nodes for 2 common cardiac conditions:
"Myocardial Infarction (MI) or Acute Myocardial Infarction (AMI)"
"Myocardial Infarction (ST-Elevation Myocardial Infarction and Non-ST-Elevation Myocardial Infarction)"
"Myocardial Infarction Complicated With Cardiogenic Shock"
"Myocardial Infarction First"
"Myocardial Infarction Not Otherwise Specified"
"Myocardial Infarction Not Otherwise Specified (MI NOS)"
"Myocardial Infarction Old"
"Myocardial Infarction Postoperative"
"Myocardial Infarction Type 2"
"Myocardial Infarction With Non-obstructive Coronary Arteries"
"Myocardial Infarction With Nonobstructive Coronary Arteries"
"Myocardial Infarction [C14.907.585.500]"
"Myocardial Infarction or Chest Pain"
"Myocardial Infarction, Acute"
"Myocardial Infarction, Anterior Wall"
"Myocardial Infarction, Unstable Angina Pectoris, Sudden Cardiac Death, Stroke, Peripheral Artery Disease"
"Myocardial Infarctions"
"Myocardial Ischaemia"
"Myocardial Ischemia"
"Myocardial Ischemia(Implanted Drug-eluting Stents Because of Ischemic Heart Disease(Stable Angina, Acute Coronary Syndrome))"
"Myocardial Ischemia, Angina Pectoris"
"Myocardial Ischemic Reperfusion Injury"
"Myocardial Ischemic-reperfusion Injury"
24 conditions instead of 2. Data quality problems such as this make any cross-study analysis much more difficult. Those familiar with entity resolution problems will understand the issue. For shorter strings such as names, text comparison algorithms such as Jaro-Winkler, Sorensen-Dice and Hamming Distance all allow some assessment of whether two text values are likely the same – but even then they don’t take context into consideration… and they don’t scale well to larger entries with free-form text.
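A quick illustration using Python’s standard-library difflib (a Ratcliff/Obershelp matcher rather than Jaro-Winkler, but it makes the same point): pure spelling variants from the list above score high, while a simple word reordering tanks the score even though the meaning is identical – the algorithm has no notion of context.

```python
from difflib import SequenceMatcher

def ratio(a: str, b: str) -> float:
    # 0.0..1.0 similarity based on longest matching character blocks
    return SequenceMatcher(None, a, b).ratio()

# spelling variants score high...
print(ratio("Myocardial Ischaemia", "Myocardial Ischemia"))                  # ~0.97
# ...but reordered words score much lower despite identical meaning
print(ratio("Myocardial Infarction, Acute", "Acute Myocardial Infarction"))  # ~0.76
```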
I tried to clean up “Interventions”, which added the area of the model with dotted red lines. More germane to this discussion is the area outlined in dotted purple – where another customer was trying to use the CHIA graph implementation of study conditions to try to identify studies based on similar criteria. That just proves that this problem of trying to find similar studies has been an ongoing, common one. However, for this use case I decided to leave the data as is… making it more realistic – and use the raw Study.criteria property.
To solve the basic problem, what we needed was a method that identified facilities (highlighted green node at top) involved in studies (highlighted brown node in center) for desired conditions (highlighted orange node center bottom) and criteria (property on the Study node).
Adding vector indexes to Neo4j data
Note: In this example, I use Google GCP’s VertexAI LLM. However, Neo4j supports Amazon’s AWS Bedrock, Microsoft Azure’s OpenAI, and can interface with any LLM, including the open-source Llama implementation.
The “Condition” data entries were all fairly short – mostly a few words. The study “criteria” was a much more free-form text field, with paragraphs of inclusion and exclusion criteria containing specific condition-related verbiage. As such, I decided to add a text embedding to the Study based on criteria. I achieved this using the following code, which exploited our integration with Google GCP’s VertexAI platform. The first step was to use the following via the GCP desktop SDK to get an access token:
gcloud auth print-access-token
Then I used the output to set some parameters, as the runtime for this could be long and may require restarting (more on this in a second):
//then set parameters for access token and project id
:param gcp_auth_key => "<long access token>";
:param gcp_project => "<project id>";
:param rate_limit => 2500;
Then I used the following cypher code in Neo4j browser to add the embeddings:
// now loop through all the Study nodes and add a vector based on the text in criteria
// we will use the apoc.ml.vertexai.embedding() procedure to interface to VertexAI directly vs.
// having to create a python program to loop through all the nodes. Because text
// embeddings are rate limited, we need to add a sleep after each vector is created to
// avoid overrunning the rate limit
:auto MATCH (s:Study)
WHERE s.criteria IS NOT NULL
AND s.criteriaEmbedding IS NULL
WITH s, [s.criteria] as criteriaList
CALL apoc.ml.vertexai.embedding(criteriaList, $gcp_auth_key, $gcp_project, {region: "us-central1"}) yield index, text, embedding
WITH s, embedding
CALL {
WITH s, embedding
// SET s.criteriaEmbedding = embedding
CALL db.create.setNodeVectorProperty(s,"criteriaEmbedding", embedding)
CALL apoc.util.sleep(60000/$rate_limit)
} IN TRANSACTIONS OF 10 ROWS
ON ERROR FAIL
I could have done this programmatically, but this was my first attempt at creating a text embedding and vector index, so I wanted to keep things simple… and easy to debug. This turned out to be a good thing, as I found out our field engineering rate limit initially was ~600/hr… which, when you are trying to create vectors for ~400000 nodes, translates to ~667 hours – way longer than I anticipated. I also (initially) hadn’t put in the sleep, so I would frequently exceed the rate limit and it would error out… and I would have to go through the whole process of refetching the access token, as tokens only had a 1 hour life span. Painful. 2 days of pain. On day 3, I convinced my boss to jack up the rate limit, as this little demo I was developing for a customer was quickly blocking anyone else from doing anything, since the rate limit was per project. Ooops… how inconsiderate of me!
Setting things up in a Jupyter Notebook
As I was doing this for a customer, I decided to put the code for the query in a Jupyter Notebook. To set things up, I issued the following commands from the command line:
pip install google-cloud-aiplatform
gcloud config set project <my project id>
gcloud auth application-default login
These commands install the VertexAI Python client, set the default GCP project, and cache application-default credentials. That way I wouldn’t have to enter authentication information, etc. in the notebook itself. The first cell of my notebook was simply importing some libraries:
import os
import numpy as np
import pandas as pd
from graphdatascience import GraphDataScience
The second cell was reading in some authentication information for Neo4j from a json file to avoid hardcoding it in the program:
import json
f = open('logininfo.json')
info = json.load(f)
#print(info)
HOST = info['neo4jURL']
USERNAME = info['neo4jUserid']
PASSWORD = info['neo4jPassword']
DATABASE = 'clinicaltrials'
And the third cell was initializing a connection to Neo4j using the GDS client API, which greatly simplifies Python programs interacting with Neo4j. While other functions exist to verify the connection, I like to output the GDS debug info just so I am aware of which GDS version is in use, whether enterprise features are enabled, etc.
# Use Neo4j URI and credentials according to your setup
gds = GraphDataScience(HOST, auth=(USERNAME, PASSWORD), aura_ds=False)
gds.set_database(DATABASE)
gds.debug.sysInfo()
Using data science as a starting point
I decided I would look for something fairly specific – fetal heart conditions – as this is what a typical researcher would do vs. looking across all studies. To start my query, I first decided to use a simple similarity based on Study -> Condition. This appears to be a classic bipartite graph, easily implemented according to the model above. However:
The reason I wanted a similarity is that when analyzing graph output, you often want to see “connected” entities. So, for example, if I found an interesting study but the facilities were not usable for some reason, I could follow the similarity relationship to find the closest similar study and consider those facilities. Since this involves dealing with our data quality issue as is, I chose to use a Cypher projection to project the graph into GDS memory using the cell:
G, result = gds.graph.cypher.project(
"""
MATCH (source:Study)-[rel:STUDY_CONDITION]->(target:Condition)
WHERE (target.downcaseName CONTAINS 'fetal' OR target.downcaseName CONTAINS 'neonatal')
AND (target.downcaseName CONTAINS 'heart'
OR target.downcaseName CONTAINS 'arrhythmia'
OR target.downcaseName CONTAINS 'bradycardia'
OR target.downcaseName CONTAINS 'cardiac'
OR target.downcaseName CONTAINS 'cardiovascular'
OR target.downcaseName CONTAINS 'long qt')
WITH source, target, (CASE WHEN source.completionDate IS NULL then 1.0
WHEN date(source.completionDate).year>=date().year then 1.0
ELSE 1.0-(0.03*(date().year - coalesce(date(source.completionDate).year,date().year-1)))
END) as weight
RETURN gds.graph.project(
$graph_name,
source,
target,
{
sourceNodeLabels: labels(source),
targetNodeLabels: labels(target),
relationshipType: "FETAL_HEART_STUDY_CONDITION",
relationshipProperties: { weight:weight}
}
)
""",
database = 'clinicaltrials',
graph_name = 'jeff_studyFacilities'
)
This runs in sub-seconds (0.1 seconds according to the notebook). I then used this cell to compute the similarities and create the similarity relationship between the Study nodes the above projection isolated:
gds.nodeSimilarity.write(G, relationshipWeightProperty='weight',
                         similarityMetric='JACCARD', similarityCutoff=0.33,
                         writeRelationshipType='JT_FETAL_HEART_SIMILARITY',
                         writeProperty='jaccardScore')
…which ran in 0.6 seconds per my notebook. Then I could simply ask for the list of facilities used by those studies by running the following cell:
similarityQuery="""
MATCH p=(facility:Facility)-[:STUDY_FACILITY]->(study:Study)
WHERE EXISTS ((study)-[:JT_FETAL_HEART_SIMILARITY]->(:Study))
RETURN facility.facility as facilityName, collect(study.NCT_ID) as studyList
ORDER BY facilityName"""
gds.run_cypher(similarityQuery)
In 2.2 seconds, it comes back with a list of 33 facilities.
facilityName studyList
0 Assiut university [NCT05374135]
1 Augusta University Medical Center [NCT05399979]
2 Binza-Delvaux Maternity Hospital [NCT03799861]
3 Brigham and Women's Hospital [NCT01881685]
4 Centre Hospital Kingasani [NCT03799861]
5 Congenital Heart Collaborative, Nationwide Chi... [NCT05386173]
6 Department of Paediatric Cardiology, Helsinki ... [NCT05386173]
7 Department of Pediatric Cardiology, Skane Univ... [NCT05386173]
8 Department of Pediatric and Congenital Cardiol... [NCT05386173]
9 Department of Pediatrics, Umeå University Hos... [NCT05386173]
10 Department of Perinatal Cardiology and Congeni... [NCT05386173]
11 Department of pediatric cardiology, Karolinska... [NCT05386173]
12 Fetal Cardiovascular Program, University of Ca... [NCT05386173]
13 Fetal Medicine Unit, Dept. Obstetrics & Gyneco... [NCT05386173]
14 Hospital Universitario Dr. José Eleuterio Gon... [NCT03703037]
15 Kinderherzzentrum Linz [NCT05386173]
16 Medical College of Wisconsin [NCT03775954]
17 Mother and Child Hospital Bumbu [NCT03799861]
18 National Research center [NCT05172336]
19 Oslo University Hospital [NCT02347241]
20 Pediatric Cardiology - University Hospital Bonn [NCT05386173]
21 Rabin Medical Center [NCT02331888]
22 Royal Alexandra Hospital [NCT03902652]
23 S. Orsola-Malpighi University Hospital [NCT04184245]
24 Sant'Orsola-Malpighi University Hospital [NCT04123691]
25 Shirley Andrade Santos [NCT02666794]
26 Summa Center for Women's Health Research [NCT01881685]
27 The Hospital for Sick Children Toronto [NCT05386173]
28 University Hospital Bonn, Clinic for Diagnosti... [NCT05066399]
29 University hospital Technical university, moth... [NCT05386173]
30 University of Alberta [NCT05369247]
31 University of Wisconsin - Madison [NCT03775954]
32 University of Wisconsin-Madison [NCT01903564, NCT03047161]
Cool!
Supplementing with a vector search
Because we are using primitive word comparisons in the above Cypher projection, I then decided to use a vector search to complement it. First, I have to import the VertexAI libraries and initialize the LLM model I will be using via this cell (takes ~4 seconds the first time):
from vertexai.preview.language_models import TextEmbeddingModel
model = TextEmbeddingModel.from_pretrained("textembedding-gecko")
Then I can run a vector index query using the cell:
embeddings = model.get_embeddings(["fetal, neonatal, infant, heart, arrhythmia, bradycardia, cardiac, cardiovascular, long qt"])
#for embedding in embeddings:
# vector = embedding.values
# print(vector)
# since we only sent a single list element, we only care about the first vector
myVector=embeddings[0].values
print(myVector)
vectorQuery="""
CALL db.index.vector.queryNodes("Study_criteria_vector", 10, $vector)
YIELD node as similarStudy, score as vectorSimilarity
MATCH (f1:Facility)-[sf:STUDY_FACILITY]->(similarStudy)
RETURN f1.facility as facilityName, collect(similarStudy.NCT_ID) as studyList
ORDER BY facilityName
"""
vectorParams={'vector': myVector}
gds.run_cypher(vectorQuery, vectorParams)
This took 9.3 seconds on an initial cold start and 0.2 seconds on subsequent executions – returning 8 new studies, with only 1 facility that was in the previous list – and even that was associated with a different study than before (** in the list below).
facilityName studyList
0 Centre Hospitalier de Nantes [NCT00238056]
1 Guangdong General Hospital [NCT01045252]
2 Mayo Clinic in Rochester [NCT04441892]
3 Neonatal Intensive Care Unit [NCT01376544]
4 Oslo University Hospital [NCT04315610]**
5 Rady Children's Hospital [NCT02564796]
6 Shands Hospital at the University of Florida [NCT01765205]
7 UC Davis Children's Hospital [NCT01426542]
Being a good citizen, I then cleaned up my mess by first removing the similarity relationship:
# clean up the GDS relationship
gds.run_cypher("MATCH ()-[rel:JT_FETAL_HEART_SIMILARITY]->() CALL {WITH rel DELETE rel} IN TRANSACTIONS")
Finally, clearing the graph projection from GDS memory and dropping my connection to the database:
# clean up the GDS projection
gds.graph.drop(G)
gds.close()
Comments on the results
Now a couple of comments – remember, this was a “better together” complementary approach vs. contrasting the two… so I have a net of 40 (33+7) facilities. In addition, I have 30+ studies to look at, and if I find one that I like but can’t use its facilities (out of area), I can traverse my similarity relationship to find a very similar one… which shortens my investigative time.
Some might question why the vector search didn’t return more facilities. It turns out to be perhaps a bit of bias in the choice of the criteria property. I took a look at the Study with NCT ID NCT05386173, and the criteria never contained the exact words I provided in my list for the vector search – but it did have related terms such as aortic, atrial, gestational, echocardiographic, etc. – which does suggest a context of fetal heart problems. So the implicit bias was that not all criteria mentioned the condition terms I was using. Perhaps this is a situation in which a medically trained LLM would have been a better fit than Google’s generic Gecko model.
It also simply could be that a language model with larger vectors (OpenAI @ 1536 dimensions) would have yielded better results than Gecko @ 768… but… meh – I like the next thoughts as a more easily implemented approach.
The other option relates to the data model: while each “Study” had 1 or more “Conditions” associated with it, it also typically had quite a few more “MeshTerms” used for Browse Conditions, etc. A simple Cypher statement could have compiled a list of those “MeshTerms” from the Studies in the first result set to complement the terms I was using for the vector approach. This also points out an advantage of sometimes running the KG query first, prior to running the text search – which is almost the opposite of what we have traditionally done (using the text search to filter for possible nodes and then running the KG query on those nodes). Advantage – no need to re-vectorize all the data.
Extending that a bit, when creating the vectors, I could have created a vector that spanned the criteria, the list of conditions and the list of mesh terms – the concatenation could have been done in the WITH clause – no need to actually store the concatenated text. This is similar to TEXT indexes in Neo4j, which can span multiple labels and properties. Interestingly, the vector API’s all use a list interface (with a max of 5 elements in some cases) – so I am not sure which would have been better: a single concatenated string, or a list of [criteria, mesh terms, conditions]… and that, I think, is where data analysts kill hours – finding the best approaches.
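As a sketch of the mesh-term idea (the relationship type and property names here are hypothetical – adjust to the actual model), a Cypher statement like this could harvest the MeshTerms attached to the Studies found by the similarity pass, to enrich the term list fed to the vector search:

```cypher
// hypothetical relationship type and property names - adjust to the actual model
MATCH (study:Study)-[:JT_FETAL_HEART_SIMILARITY]->(:Study),
      (study)-[:STUDY_MESH_TERM]->(m:MeshTerm)
RETURN collect(DISTINCT toLower(m.name)) AS extraSearchTerms
```

The resulting list could then be appended to the string passed to `model.get_embeddings()` before re-running the vector index query – no re-vectorizing of the stored data required.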
Conclusion
If you have large unstructured text properties, using vector indexing to facilitate text searches can complement existing graph queries and generally runs quite fast if well focused. In the above example, the entire notebook runs from beginning to end in <20 seconds on a cold system and well under 10 seconds on a system in which the data is cached… without having to go through the heavyweight process of doing NLP. Admittedly, vector creation is an expensive process, but given that the criteria field didn’t have any consistent formatting, NLP would likely have been an error-prone process at best. It does consume considerable space – the vectors consumed ~1.8GB of disk space by themselves, plus another ~1.2GB for the vector index, for a total increase of ~3GB – which nearly doubled the on-disk storage for this database. However, considering each vector had a dimension of 768 float (double precision) values (~6KB per vector * 400000 nodes = ~2.46GB)… well… one can’t complain too much. The point, though, is that vector indexes should probably be reserved for those long (>>128 byte) unstructured string properties.