House-holding / Entity Resolution with Neo4j & Graph Data Science Library
After 3 years at Neo4j, and with my sparse blogs being very low level, some are probably questioning whether I have learned anything…..
…..wwwellllllll….maybe this post will prove an old dog learned some new tricks.
One of the problems IoT and web-front businesses face is determining, from the gazillions of interactions, which ones are from the same user/household or business entity. Yeah – it is simple if you force them to log in…..but before that, a customer may be browsing your site for days just thinking about buying that brand spankin’ new battery powered dual bevel sliding miter saw. In the process, they might have also looked at the battery powered framing nailer (power tools – it’s why guys really go to Home Depot and Lowes!) a few days ago…and before that, some time ago, a new drill…. You get the picture – a bunch of customer touchpoints scattered over days, weeks, months, …. It would be great if you could somehow link all those sessions together….even if the sessions happened on different devices (work laptop, cell phone, tablet, spouse’s tablet, ….).
There is a name for this problem…actually….several names – house-holding and entity resolution among them.
It is also often a critical step in fraud detection – especially with synthetic identities. With low volumes of data….meh….you might be able to figure it out with some sampling queries and optics (good old eyeballs). But at higher volumes…whoa!!
This is where data science comes in. Now my Pennsyltucky definition of Data Science is that it is what Statistics wanted to be when it grew up. Or maybe it was finally figuring out a way to make statistics useful. I hated statistics…. Sigh. I tried to get out of typing class in high school too. Then when I transitioned from flying to a ground job in the Navy, I turned down a job working with databases thinking it would be boring. You can see how all those worked out. Karma.
We’re going to show you one way to do this using that magical thingy called “Data Science” – and more specifically, Neo4j’s Graph Data Science Library. A much more intelligent colleague of mine has already written an entire four-part series you can read (intro at https://neo4j.com/developer-blog/exploring-fraud-detection-neo4j-graph-data-science-summary/ ) if my drivel is too simplistic. I think he also can be blamed (at least partially) for the whitepaper on fraud and GDS found at https://neo4j.com/whitepapers/graph-data-science-use-cases-entity-resolution/.
We have a number of customers doing this in real life – one of which is Meredith – cited in the above paper as using Graph Data Science on 4+ TB of data comprising 14 billion nodes and 20+ billion relationships….
….yes, those were “b’s”
….so you can tune out the fears, questions, doubts and deceitful marketing by competitors about Neo4j’s scalability.
Our test case
Okay – so let’s assume we are a fictional bank – and we want to track customer session behavior for multiple reasons – help detect fraud via behavior (vs. just matching reused identifying attributes), identify customer behaviors for better recommendations, prevent account takeovers, etc. Accordingly, we have a really simple graph model that looks like this:
For those unfamiliar with Neo4j, the labels in the bubbles are secondary labels – think of them as replacements for very low cardinality data attributes – but they can be accessed a lot faster and used in searches far more efficiently than your RDBMS allows. Some of these will be used in the next section on preprocessing.
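As a quick illustration (a sketch only – the isCommunity property is hypothetical, shown purely as the property-based alternative), filtering on a secondary label is a straight label scan, whereas the equivalent property filter has to inspect every candidate node unless you also maintain an index on that property:
// Filter using a secondary label (fast label scan):
MATCH (ip:IPAddress:CommunityIP)
RETURN count(ip);
// The property-based equivalent (hypothetical isCommunity property) must
// touch every IPAddress node unless an index exists on that property:
MATCH (ip:IPAddress)
WHERE ip.isCommunity = true
RETURN count(ip);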
Back on track, this represents all the data you have from the web application….plus some internal sources of truth (customer info such as address, etc.) and some hastily acquired, publicly available datasets that resolve IP addresses to the ISP and location. First, let’s consider how good some of the data elements are at identifying users.
By contrast, consider some of the other data elements:
There are exceptions to the rules, of course – libraries and other organizations allow multiple users to share the same computer over time. Customers may be at a location such as an airport, Starbucks or McDonald’s using public WiFi, which resolves to a common IP address. An accountant may access multiple accounts on behalf of their customers. Given ISP DHCP lease renewals, you might connect from different IP addresses on the same device…..but at least within the same ISP band of IP addresses, and possibly resolved to the same nearby ISP service location.
We are going to do this with a completely fake set of data that has ~100K customers, 4.1 million sessions, 1.5 million cookies, 3.6 million IP addresses, 115K devices, ~500K accounts (100K deposit accounts, 400K credit accounts)….
….while not real-life size….it isn’t toy sized either….an appreciable amount of data that should show how the techniques work at scale.
Preprocessing
Before we run the GDS algorithms, we need to do a bit of preprocessing:
// Add a label to shared/community IP addresses (e.g. Starbucks, etc.)
// to allow filtering later
MATCH p=(d1:Device)<-[:SESSION_DEVICE]-(s1:AccountSession)
-[:SESSION_IP]->(ipaddr:IPAddress)<-[:SESSION_IP]-
(s2:AccountSession)-[:SESSION_DEVICE]->(d2:Device)
WHERE id(s1)<>id(s2)
  AND id(d1)<>id(d2) // we can exclude sessions from same device
  AND abs(duration.between(s1.sessionDateTime,s2.sessionDateTime).days) <= 7
// aggregate number of distinct devices plus our original 1
WITH ipaddr, count(distinct d2)+1 as numDevices
WHERE numDevices > 5
SET ipaddr:CommunityIP
;
// Add a label to likely residential IP addresses (e.g. not Starbucks, etc.)
// to allow filtering later
MATCH (ipaddr:IPAddress)
WHERE NOT ipaddr:CommunityIP
SET ipaddr:ResidentialIP
;
// Add a label to filter devices for high incident devices
MATCH (dev:Device)<-[:SESSION_DEVICE]-(sess:AccountSession)
<-[:ACCOUNT_SESSIONS]-(acct:DepositAccount)
WITH dev, count(acct) as numAccts
WHERE numAccts > 20
SET dev:CommunityDevice
//Added 90987 labels, completed after 11123 ms
;
// Add a label to filter devices for low incident devices
MATCH (dev:Device)
WHERE NOT dev:CommunityDevice
SET dev:PersonalDevice
//Added 24313 labels, completed after 59 ms.
That last one takes a bit of explanation. Think about your less-than-favorite metropolitan area…..probably a gazillion sessions a day. We can filter the noise a bit and only create relationships between the sessions and the locations if a device had more than one session from that location. Think about it – yes, it does remove the possibility that I can learn that a customer makes a business trip to NYC and accesses his account there. But our goal is not to learn our customers’ behaviors….the data is there, we can still do that if we want…. Our real goal is to use session behaviors to identify customers – and one-off device sessions just aren’t going to help us. The cypher to do this is:
// Same ISP subnet
call apoc.periodic.iterate("
MATCH p=(d:Device)<-[:SESSION_DEVICE]-(s1:AccountSession)
-[:SESSION_LOCATION]->(loc:IPLocation)
<-[:SESSION_LOCATION]-(s2:AccountSession)
-[:SESSION_DEVICE]->(d)
WHERE id(s1)<>id(s2)
  AND NOT exists((s1)-[:SESSION_DEVICE_LOCATION]->(loc))
RETURN id(s1) as s1_id, id(s2) as s2_id, id(loc) as loc_id
","
WITH s1_id, s2_id, loc_id
MATCH (s1:AccountSession) WHERE id(s1)=s1_id
MATCH (s2:AccountSession) WHERE id(s2)=s2_id
MATCH (loc:IPLocation) WHERE id(loc)=loc_id
MERGE (s1)-[:SESSION_DEVICE_LOCATION]->(loc)
MERGE (s2)-[:SESSION_DEVICE_LOCATION]->(loc)
", {})
// Created 4184597 relationships, completed after 934529 ms.
A bit of explanation – we are going to be modifying millions of nodes….in a single statement. In most ACID-compliant databases, any single statement is a single transaction – it can be rolled back in its entirety if it fails. For many databases, single large transactions lead to fear of “blowing out the log” – meaning the transaction log fills and ALL processing halts. The DBA then extends the log and whines about the additional disk space that is hard to recover. Neo4j is just a smidge smarter. In log-based databases, large modifications first log all the changes to be made and then commit…. If the transaction fails, in some situations the database has to log Compensating Log Records to “undo” all the changes so that on recovery the changes are not applied anyhow. Neo4j instead records uncommitted changes in heap memory – if the transaction exceeds memory, it is canceled and rolled back. Processing continues. Man, oh, man – it is so much more fun with a database developed in the 21st century, when memory was plentiful, vs. being forced to live with implementations and architectures from back when 640KB was all you had. Yep – those RDBMS’s we all learned on are still dragging the legacy of antique hardware restrictions into the new millennium. APOC is a library of Neo4j utility stored procedures – and this one, apoc.periodic.iterate, takes two queries (a selector and a writer) and batches the work into iterative smaller transactions – so in this case, it breaks our really large single transaction into batches of 10,000 rows (I think that is the default).
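For reference, the batch size (and parallelism) can be set explicitly via the config map rather than relying on the defaults – a minimal sketch, where the query and the "touched" property are trivial placeholders and not part of the actual workload:
CALL apoc.periodic.iterate(
  "MATCH (s:AccountSession) RETURN s",   // selector: what to iterate over
  "SET s.touched = true",                // writer: applied per batch (placeholder property)
  {batchSize: 10000, parallel: false}    // commit every 10,000 rows, single-threaded
)
YIELD batches, total, errorMessages
RETURN batches, total, errorMessages;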
Running WCC Community Detection
Community detection, for the non-data scientists, is a simple way of saying “segmentation” – organizing a plethora of things into related groups. There are many different algorithms for doing this, the most popular being Weakly Connected Components (WCC) – which just looks for any connection between entities – and Louvain – which does the same but builds hierarchies of subgroups within the larger communities.
Our plan of attack is simple – create communities of sessions and then see how many of these communities have only a single possible customer associated. For those, we can then directly associate the devices, etc. with the customer for future analysis. Other communities of sessions may have more than one customer – we will call those “collisions” and we will deal with them next.
Community detection algorithms strictly work on what is termed a “monopartite” graph – meaning there is only a single entity of focus. In our case – “AccountSessions”. If we were trying to perform community detection among customers, our single entity would be “AccountHolders”. Now, as a second point, community detection only works for entities that are related to one another – so we need to create a relationship between sessions wherever there is a commonly shared secondary data element such as device, IP address, account, customer, etc. To do this, we will use the cypher:
// Find shared info for top tier quality indicators
CALL apoc.periodic.iterate(
"MATCH (sess1:AccountSession)-[:SESSION_DEVICE|SESSION_COOKIE|ACCOUNT_SESSIONS]
->(sharedInfo)<-[:SESSION_DEVICE|SESSION_COOKIE|ACCOUNT_SESSIONS]
-(sess2:AccountSession)
WHERE NOT sharedInfo:IPAddress
  AND NOT sharedInfo:CommunityDevice
  AND id(sess1)<id(sess2)
RETURN id(sess1) as sess1_id, id(sess2) as sess2_id
, labels(sharedInfo) as myLabels
","
WITH sess1_id, sess2_id, myLabels
MATCH (sess1:AccountSession) WHERE id(sess1)=sess1_id
MATCH (sess2:AccountSession) WHERE id(sess2)=sess2_id
MERGE (sess1)<-[r:SHARED_SESSION_INFO]-(sess2)
SET r.sharedInfo=apoc.coll.unionAll(r.sharedInfo,myLabels)
",{}
)
;
// add in medium quality indicator for shared IP within 7 days
CALL apoc.periodic.iterate(
"MATCH (sess1:AccountSession)-[:SESSION_IP]->(ipaddr:ResidentialIP)
<-[:SESSION_IP]-(sess2:AccountSession)
WHERE abs(duration.inDays(sess1.sessionDateTime,sess2.sessionDateTime).days)<=7
  AND id(sess1)<id(sess2)
RETURN id(sess1) as sess1_id, id(sess2) as sess2_id, labels(ipaddr) as myLabels
, abs(duration.between(sess1.sessionDateTime,sess2.sessionDateTime).days)
as numDays
","
WITH sess1_id, sess2_id, myLabels, numDays
MATCH (sess1:AccountSession) WHERE id(sess1)=sess1_id
MATCH (sess2:AccountSession) WHERE id(sess2)=sess2_id
MERGE (sess1)<-[r:SHARED_SESSION_INFO]-(sess2)
SET r.sharedInfo=apoc.coll.unionAll(r.sharedInfo,myLabels)
  , r.sharedIPdays=numDays
",{}
)
;
// append shared info for lower quality indicators but only
// if a top-tier quality indicator existed
// use the new CALL {} IN TRANSACTIONS as an alternative
// to apoc.periodic.iterate
MATCH (sess1:AccountSession)<-[ssi:SHARED_SESSION_INFO]-(sess2:AccountSession)
CALL {
WITH sess1, sess2, ssi
MATCH (sess1)-[:SESSION_LOCATION]->(:IPLocation)<-[:SESSION_LOCATION]-(sess2)
SET ssi.sharedInfo=ssi.sharedInfo+"IPLocation"
} IN TRANSACTIONS
//Set 18757633 properties, completed after 657619 ms.
;
MATCH (sess1:AccountSession)<-[ssi:SHARED_SESSION_INFO]-(sess2:AccountSession)
CALL {
WITH sess1, sess2, ssi
MATCH (sess1)-[:SESSION_DEVICE_CITY]->(:City)
<-[:SESSION_DEVICE_CITY]-(sess2)
SET ssi.sharedInfo=ssi.sharedInfo+"City"
} IN TRANSACTIONS
//Set 18757633 properties, completed after 623525 ms.
;
….and now we have our monopartite graph – which was depicted in the center of the earlier model diagram:
The problem is that not all associations are equal – for example, sessions that share cookies are quite distinctive. We could add a score and then run WCC in weighted mode. However, we also more or less attributed weights in building our monopartite graph – e.g. we removed shared IP addresses that were not within 7 days, only looked at devices in the same ISP subnet, etc. So, we can run WCC in unweighted mode (a sketch of the weighted alternative appears right after the unweighted run below). To do this, first we need to project our graph into GDS memory:
CALL gds.graph.create(
    "shared-sessions",
    ["AccountSession"],
    {
        SHARED_SESSION_INFO: {
            type: "SHARED_SESSION_INFO",
            orientation: "UNDIRECTED"
        }
    },
    {readConcurrency: 8}
)
// creates in 11806 milliseconds (11.8 seconds)
The second step is to run the WCC algorithm against that graph and record the community each session belongs to. The problem is – there are likely a large number of isolated nodes that would create a ton of communities of size 1. We really don’t care about them. A single session is not going to help us identify probable customers across sessions. So rather than having the GDS function for WCC write directly back to the database, we will instead stream the results through additional cypher statements to “filter” the results and remove communities of size 1.
CALL gds.wcc.stream(
? ? "shared-sessions",
? ? {
? ? ? ? concurrency: 8
? ? }
) YIELD nodeId, componentId
WITH gds.util.asNode(nodeId) as session, componentId as WCC_id
WITH WCC_id, collect(session) as wcc_list
WHERE size(wcc_list) > 1
WITH WCC_id, wcc_list
UNWIND wcc_list as curSession
SET curSession.WCC_communityID=WCC_id
//Set 4175007 properties, completed after 16192 ms.
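For reference, had we scored the SHARED_SESSION_INFO relationships and projected that score, the weighted variant would look roughly like this – a sketch only, since our projection above does not actually include a weight property:
CALL gds.wcc.stream(
    "shared-sessions",
    {
        concurrency: 8,
        relationshipWeightProperty: "weight",  // would need to be projected first
        threshold: 0.5                         // ignore relationships below this weight
    }
) YIELD nodeId, componentId
RETURN componentId, count(*) as communitySize
ORDER BY communitySize DESC
LIMIT 10;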
Now let’s see how well things worked out. First, we can get an idea of how many communities we created – we are expecting quite a few:
MATCH (sess:AccountSession)
RETURN count(distinct sess.WCC_communityID)
//444707
Not too surprising. The bigger question is how effective this was – how many communities exist with multiple customers associated?
MATCH p=(ah1:AccountHolder)-[:ONLINE_SESSIONS]->(sess1:AccountSession)
<-[ssi:SHARED_SESSION_INFO]-(sess2:AccountSession)
<-[:ONLINE_SESSIONS]-(ah2:AccountHolder)
WHERE sess1.WCC_communityID=sess2.WCC_communityID
  AND id(ah1)<>id(ah2)
RETURN count(p) as numCollisions
//8517 total
Hmmmm….. Now, we know that one of the aspects we considered was shared IP addresses. We restricted it to 7 days – but the reality is that, due to an ISP refresh of DHCP leases, an IP address technically could be reassigned to a different location within the same day. To see how this impacts the quality of our results, we can run a query like:
MATCH p=(ah1:AccountHolder)-[:ONLINE_SESSIONS]->(sess1:AccountSession)
<-[ssi:SHARED_SESSION_INFO]-(sess2:AccountSession)
<-[:ONLINE_SESSIONS]-(ah2:AccountHolder)
WHERE sess1.WCC_communityID=sess2.WCC_communityID
  AND id(ah1)<>id(ah2)
RETURN ssi.sharedIPdays, count(p) as numCollisions
ORDER BY ssi.sharedIPdays DESC
;
From which we get results similar to:
ssi.sharedIPdays  numCollisions
null              533
7                 1009
6                 1008
5                 980
4                 981
3                 997
2                 1024
1                 964
0                 1021
Hmmm…..a bit more than we expected at the lower end. Now if we take a look at some of these collisions:
// shows false positives (or fraud) due to overlapping IP addresses
MATCH p=(ah1:AccountHolder)-[:ONLINE_SESSIONS]->
(sess1:AccountSession)<-[ssi:SHARED_SESSION_INFO]-
(sess2:AccountSession)<-[:ONLINE_SESSIONS]-(ah2:AccountHolder)
WHERE sess1.WCC_communityID=sess2.WCC_communityID
  AND id(ah1)<>id(ah2)
  AND NOT ssi.sharedIPdays IS NULL
OPTIONAL MATCH p2=(sess1)<-[:ACCOUNT_SESSIONS]-(:DepositAccount)
OPTIONAL MATCH p3=(sess2)<-[:ACCOUNT_SESSIONS]-(:DepositAccount)
RETURN p, p2, p3
LIMIT 50
;
We get a picture similar to:
So, we have a lot of communities of 2 and some quite a bit larger. If we look at a single community, we see in some cases a result similar to:
Which shows a very weak link actually connecting two subcommunities. While it might be tempting to run Louvain instead and look at the resulting subcommunities in isolation, we really aren’t sure how many links exist between them – e.g. 2? 4? – and then what happens with Louvain?
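For the curious, running Louvain against the same projection would look roughly like this – a sketch only, since we did not take this path:
CALL gds.louvain.stream(
    "shared-sessions",
    {
        concurrency: 8,
        includeIntermediateCommunities: true  // expose the subcommunity hierarchy
    }
) YIELD nodeId, communityId, intermediateCommunityIds
RETURN communityId, count(*) as communitySize
ORDER BY communitySize DESC
LIMIT 10;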
Jaccard Similarity to the Rescue
For those who are new to this, data science has two very common similarity algorithms that try to score how similar two items are. The first is Jaccard - which uses either no weight or a single weight to evaluate similarity. For example, if two sessions are from the same IP address, use the same device, etc. - no real concern about scoring. The second is Cosine Similarity - which uses multiple weights - e.g. it might consider the frequency of same device, frequency of same IP address....etc. As you can imagine - it can be a bit more precise - but.....it takes a LOT longer to run, so for most purposes, Jaccard Similarity - which we call Node Similarity - is good enough. If you get bored, you can look it up, but the basic measurement is the intersection over the union of the sets.
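As a concrete illustration of “intersection over union” – a sketch only, not part of the actual pipeline, with $session1Id/$session2Id as placeholder parameters – here is Jaccard computed by hand for two sessions, using their shared-info neighbors as the sets:
MATCH (s1:AccountSession)-[:SESSION_DEVICE|SESSION_COOKIE|SESSION_IP]->(n1)
WHERE id(s1) = $session1Id
WITH collect(distinct id(n1)) as set1
MATCH (s2:AccountSession)-[:SESSION_DEVICE|SESSION_COOKIE|SESSION_IP]->(n2)
WHERE id(s2) = $session2Id
WITH set1, collect(distinct id(n2)) as set2
// assumes the union is non-empty
RETURN toFloat(size(apoc.coll.intersection(set1, set2)))
       / size(apoc.coll.union(set1, set2)) as jaccardScore;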
Back to our problem. Once the initial communities are established, a second step is often to run the Node (aka Jaccard) Similarity algorithm on the communities to “refine” those that have collisions. Remember, for use cases such as fraud, we need to have a human look at the results and evaluate whether fraud is really being committed. With ~8500 communities with collisions, that is a lot of evaluation. If we had 10 investigators and each could do 10 cases per day, it would take 85 days – or ~2.5 months. Ouch. Remember, most fraud windows are within 30 days.
Now Neo4j’s GDS has 4 classes of graph projections divided among 2 principal categories – named/unnamed and native/cypher. Named/unnamed simply refers to whether the graph projection has a name. Above, we created a named graph “shared-sessions” for running WCC. The name is important – without a name, you cannot perform subsequent GDS algorithms as….well….without a name, we simply can’t find it….and so Neo4j GDS automatically drops any unnamed graphs. Once and done – so to speak. Native vs. cypher refers to whether the graph projection is simply constructed by loading the nodes/relationships into memory as they exist in the database, or whether we execute cypher statements to identify which nodes/relationships to load. Generally speaking, we prefer named native projections as they are much faster for the GDS library to load. However, in this case, we want to iterate through each community and run node similarity just within that community to reduce the false positives. The good news is we only need to run node similarity on the communities with more than one customer associated – so rather than 445K executions, we only have 8.5K. Still, rather than creating 8500 named native graphs, we will exploit unnamed graphs and use a cypher projection to focus the algorithm on just the limited subset.
Now, a bit of detail about a projected graph. It is similar to an adjacency matrix:
As a consequence, the cypher projection has two main parts – a query to identify all the nodes; and a query to identify the relationships between those nodes.?Our cypher query will take the form of:
// node query
MATCH (s:AccountSession {WCC_communityID: $curCommunityID})
RETURN id(s) as id
UNION
MATCH (s:AccountSession {WCC_communityID: $curCommunityID})
-[r:SESSION_DEVICE|SESSION_COOKIE|ACCOUNT_SESSIONS|SESSION_IP|
SESSION_LOCATION|SESSION_CITY]-(sharedInfo)
RETURN id(sharedInfo) as id
The UNION might seem a bit weird. However, those familiar with similarities know that we are comparing using a “bipartite” graph. That means there are two types of entities involved. The first is the entity we are focused on – AccountSession in this case. The second is all the other entities (plural) that we want to compare the sessions against to see how similar they are – e.g. shared devices, cookies, accounts, IP addresses, etc. As a result, we need to load both the nodes we are focusing on as well as all the nodes related to them.
Then we need to define the relationship between those nodes – and ideally provide a score for how important we think that relationship is (or how to weight it in comparison). Note that a direct relationship between the nodes doesn’t have to exist – we are simply returning a pair of nodes implying a relationship – and a weight. So, our relationship query resembles:
// relationship query
MATCH (s1:AccountSession {WCC_communityID: $curCommunityID})
-[r1:SESSION_DEVICE|SESSION_COOKIE|ACCOUNT_SESSIONS|SESSION_IP|
SESSION_LOCATION|SESSION_CITY]-(sharedInfo)
-[r2:SESSION_DEVICE|SESSION_COOKIE|ACCOUNT_SESSIONS|SESSION_IP|
SESSION_LOCATION|SESSION_CITY]-
(s2:AccountSession {WCC_communityID: $curCommunityID})
RETURN id(s1) as source, id(s2) as target,
(CASE
WHEN type(r1) IN ['SESSION_COOKIE','ACCOUNT_SESSIONS'] THEN 0.9
WHEN type(r1)='SESSION_DEVICE' THEN 0.8
WHEN type(r1)='SESSION_IP'
AND abs(duration.between(s1.sessionDateTime,
s2.sessionDateTime).days)<=1 THEN 0.7
WHEN type(r1)='SESSION_IP'
AND abs(duration.between(s1.sessionDateTime,
s2.sessionDateTime).days)<=3 THEN 0.5
WHEN type(r1)='SESSION_IP'
AND abs(duration.between(s1.sessionDateTime,
s2.sessionDateTime).days)<=7 THEN 0.4
WHEN type(r1)='SESSION_LOCATION' THEN 0.025
WHEN type(r1)='SESSION_CITY' THEN 0.0125
ELSE 0.0
END) as weight
Put in English – for each community (identified by a parameter), we want to compare sessions that share information and assign a weight based on how good we think that relationship is. Note that, rather than generically looking at a 7-day window for IP addresses, we adjust the score for the overlap from 1 or fewer days (0.7) down to the maximum of 7 days (0.4). To assist our query, we will add an index on the community ID property we created so that the execution can find sessions by community quickly:
CREATE INDEX AccountSession_WCC_communityID_idx IF NOT EXISTS
FOR (s:AccountSession) ON (s.WCC_communityID)
;
The final cypher query to perform the node similarity iterations is:
//first part of query is to identify session communities with more than
// 1 customer associated with the community
MATCH p=(ah1:AccountHolder)-[:ONLINE_SESSIONS]->(sess1:AccountSession)
<-[ssi:SHARED_SESSION_INFO]-(sess2:AccountSession)
<-[:ONLINE_SESSIONS]-(ah2:AccountHolder)
WHERE sess1.WCC_communityID=sess2.WCC_communityID
  AND id(ah1)<>id(ah2)
// now we collect each community with a list of customers in that community
// including the customer associated with the session we started with
WITH sess1.WCC_communityID as communityID, id(ah1) as curCustomer
, collect(id(ah2)) as curCustomerList
WITH DISTINCT communityID, curCustomerList+curCustomer as customerList
// because each session may have multiple related sessions, we want to
// remove duplicate customer IDs from the list.  We aren't going to
// use this list in the actual graph, but it is used in some earlier
// queries where we look at sample data to get an idea of how big the
// similarities could be
WITH communityID
, apoc.coll.dropDuplicateNeighbors(
apoc.coll.sort(customerList)) as customerIDList
WITH collect({communityID: communityID
, numCustomers: size(customerIDList)
, customerIDList: customerIDList}) as communityListMap
// now we construct a map that is a list of all the communities
// along with their customer lists
WITH communityListMap, size(communityListMap) as numMaps
, range(1,size(communityListMap)) as mapIdxList
// and then iterate through that list executing the node similarity
// and creating new relationships between sessions that have a
// similarity score higher than our threshold.
UNWIND mapIdxList as curMapIdx
WITH communityListMap[curMapIdx-1].communityID as curCommunityID
, communityListMap[curMapIdx-1].customerIDList as curCustomerList
, curMapIdx, numMaps
CALL gds.nodeSimilarity.write({
nodeQuery: "MATCH (s:AccountSession {WCC_communityID: $curCommunityID})
RETURN id(s) as id
UNION
MATCH (s:AccountSession {WCC_communityID: $curCommunityID})
-[r:SESSION_DEVICE|SESSION_COOKIE|ACCOUNT_SESSIONS|
SESSION_IP|SESSION_LOCATION|SESSION_CITY]-(sharedInfo)
RETURN id(sharedInfo) as id",
relationshipQuery: "MATCH (s1:AccountSession {WCC_communityID:?
$curCommunityID})-[r1:SESSION_DEVICE|SESSION_COOKIE|
ACCOUNT_SESSIONS|SESSION_IP|SESSION_LOCATION|
SESSION_CITY]-(sharedInfo)-[r2:SESSION_DEVICE|
SESSION_COOKIE|ACCOUNT_SESSIONS|SESSION_IP|
SESSION_LOCATION|SESSION_CITY]-
(s2:AccountSession {WCC_communityID: $curCommunityID})
RETURN id(s1) as source, id(s2) as target,
(CASE
WHEN type(r1) IN ['SESSION_COOKIE',
'ACCOUNT_SESSIONS'] THEN 0.9
WHEN type(r1)='SESSION_DEVICE' THEN 0.8
WHEN type(r1)='SESSION_IP'
   AND abs(duration.between(s1.sessionDateTime,
    s2.sessionDateTime).days)<=1 THEN 0.7
WHEN type(r1)='SESSION_IP'
   AND abs(duration.between(s1.sessionDateTime,
    s2.sessionDateTime).days)<=3 THEN 0.5
WHEN type(r1)='SESSION_IP'
   AND abs(duration.between(s1.sessionDateTime,
    s2.sessionDateTime).days)<=7 THEN 0.4
WHEN type(r1)='SESSION_LOCATION' THEN 0.025
WHEN type(r1)='SESSION_CITY' THEN 0.0125
ELSE 0.0
END) as weight",
parameters: {curCommunityID: curCommunityID},
concurrency: 16,
relationshipWeightProperty: "weight",
writeRelationshipType: "COMMUNITY_SIMILARITY",
writeProperty: "weight",
similarityCutoff: 0.5
})
YIELD relationshipsWritten
WITH curMapIdx, numMaps, curCommunityID, relationshipsWritten
RETURN curMapIdx, numMaps, curCommunityID, relationshipsWritten
;
//Started streaming 6771 records after 137 ms and completed after 91372 ms,
//displaying first 2500 rows.
Now, we want to see how well we did:
MATCH p=(ah1:AccountHolder)-[:ONLINE_SESSIONS]->(:AccountSession)
-[r:COMMUNITY_SIMILARITY]->(:AccountSession)
<-[:ONLINE_SESSIONS]-(ah2:AccountHolder)
WHERE id(ah1)<>id(ah2)?
RETURN count(p)
;
//╒══════════╕
//│"count(p)"│
//╞══════════╡
//│499       │
//└──────────┘
// Started streaming 1 records after 8 ms and completed after 14286 ms.
Cool! Rather than 8500 cases, our investigators only need to look at ~500. With 10 investigators performing 10 cases/day as noted earlier, we are looking at a week’s work vs. 2.5 months. From the perspective of householding/entity resolution, starting from over 4 million sessions, we can now associate 99.99% (or something like that) of the sessions with individual customers and establish formal links for devices and IP addresses to aid in quickly identifying them on future visits to our site.
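What might that last linking step look like? A sketch only – the :USES_DEVICE relationship type is hypothetical and not part of the model above – for communities that resolved to a single customer, tie that customer directly to the devices seen in those sessions:
// link the sole customer of each single-customer community to its devices
MATCH (ah:AccountHolder)-[:ONLINE_SESSIONS]->(sess:AccountSession)
WITH sess.WCC_communityID as communityID, collect(distinct ah) as holders
WHERE size(holders) = 1
WITH communityID, holders[0] as holder
MATCH (sess:AccountSession {WCC_communityID: communityID})-[:SESSION_DEVICE]->(dev:Device)
MERGE (holder)-[:USES_DEVICE]->(dev);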
As noted, the sample database used included fraudulent accounts, where synthetic identities and other shared aspects (e.g. devices) were created for demonstration purposes. The ~500 collisions remaining are most likely due to those fraudulent accounts – which also shows a way of using behaviors over time (such as web sessions) as a possible method for finding fraud as well as for entity resolution.
Cool! Now we just need to apply the same methods to payment chains and we can chase down money laundering. But that is a discussion for another day.
Evaluating Effort for Production
What about the level of effort and time taken to do this? Let’s see…..if we discount the preprocessing time (a few hours – it can be implemented once and maintained), the overall execution was:
Some parts of this can be pre-maintained in production – e.g. using Label Propagation to maintain the SHARED_SESSION_INFO relationships, which would eliminate the first part, and pre-creating the index…but even with all that, we are talking about a ~15 minute process….on 4 million sessions!!!! …and if we maintain it via Label Propagation….~2 minutes!
Less time than it took you to read this.
Now for the schema changes.
Now, if you had been using a competitive product, you likely would have had to reload all the data (aaarrrghhhh), recompile all the queries…..and figure out how you were going to distribute those relationships across the distributed nodes…or whether it was distributed relationships, which are a nightmare to maintain……
….why would you willingly subject yourself to that?????