House-holding / Entity Resolution with Neo4j & Graph Data Science Library
After 3 years at Neo4j, and with my sparse blogs being very low level, some are probably questioning whether I have learned anything…..
…..wwwellllllll….maybe this post will prove an old dog learned some new tricks.
One of the problems IoT and web-front businesses face is determining, from the gazillions of interactions, which ones are from the same user/household or business entity. Yeah – it is simple if you force them to log in…..but before that, a customer may be browsing your site for days just thinking about buying that brand spankin’ new battery powered dual bevel sliding miter saw. In the process, they might have also looked at the battery powered framing nailer (power tools – it’s why guys really go to Home Depot and Lowes!) a few days ago…and before that, some time ago, a new drill…. You get the picture – a bunch of customer touchpoints scattered over days, weeks, months, …. It would be great if you could somehow link all those sessions together….even if the sessions happened on different devices (work laptop, cell phone, tablet, spouse’s tablet, ….).
There is a name for this problem…actually….several names – house-holding and entity resolution among them.
It is also often a critical step in fraud detection – especially with synthetic identities. With low volumes of data….meh….you might be able to figure it out with some sampling queries and optics (good old eyeballs). But at higher volumes…whoa!!
This is where data science comes in. Now my Pennsyltucky definition of Data Science is that it is what Statistics wanted to be when it grew up. Or maybe it was finally figuring out a way to make statistics useful. I hated statistics…. Sigh. I tried to get out of typing class in high school too. Then when I transitioned from flying to a ground job in the Navy, I turned down a job working with databases thinking it would be boring. You can see how all those worked out. Karma.
We’re going to show you one way to do this using that magical thingy called “Data Science” – and more specifically, Neo4j’s Graph Data Science Library. A much more intelligent colleague of mine has already written an entire four-part series you can read (intro at https://neo4j.com/developer-blog/exploring-fraud-detection-neo4j-graph-data-science-summary/ ) if my drivel is too simplistic. I think he also can be blamed (at least partially) for the whitepaper on fraud and GDS found at https://neo4j.com/whitepapers/graph-data-science-use-cases-entity-resolution/.
We have a number of customers doing this in real life – one of which is Meredith – cited in the above paper as using Graph Data Science on 4+ TB of data comprising 14 billion nodes and 20+ billion relationships….
….yes, those were “b’s”
….so you can tune out the fears, questions, doubts and deceitful marketing by competitors about Neo4j’s scalability.
Our test case
Okay – so let’s assume we are a fictional bank – and we want to track customer session behavior for multiple reasons – help detect fraud via behavior (vs. just matching reused identifying attributes), identify customer behaviors for better recommendations, prevent account takeovers, etc. Accordingly, we have a really simple graph model that looks like this:
For those unfamiliar with Neo4j, the labels in the bubbles are secondary labels – think of them as replacements for very low cardinality data attributes – but they can be accessed a lot faster and used in searches far more efficiently than your RDBMS allows. Some of these will be used in the next section on preprocessing.
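As a quick illustration (a sketch only – the isCommunity property is hypothetical, shown purely as the property-based alternative), filtering on a secondary label is a straight label scan, whereas the equivalent property filter has to inspect every candidate node unless you also maintain an index on that property:
// Filter using a secondary label (fast label scan):
MATCH (ip:IPAddress:CommunityIP)
RETURN count(ip);
// The property-based equivalent (hypothetical isCommunity property) must
// touch every IPAddress node unless an index exists on that property:
MATCH (ip:IPAddress)
WHERE ip.isCommunity = true
RETURN count(ip);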
Back on track, this represents all the data you have from the web application….plus some internal sources of truth (customer info such as address, etc.) and some hastily acquired, publicly available datasets that resolve IP addresses to the ISP and location. First, let’s consider how good some of the data elements are at identifying users.
By contrast, consider some of the other data elements:
There are exceptions to the rules, of course – libraries and other organizations allow multiple users to share the same computer over time. Customers may be at a location such as an airport, Starbucks or McDonald’s using public WiFi, which resolves to a common IP address. An accountant may access multiple accounts on behalf of their customers. Given ISP DHCP lease renewals, you might connect from different IP addresses on the same device…..but at least within the same ISP band of IP addresses, and possibly resolved to the same nearby ISP service location.
We are going to do this with a completely fake set of data that has ~100K customers, 4.1 million sessions, 1.5 million cookies, 3.6 million IP addresses, 115K devices, ~500K accounts (100K deposit accounts, 400K credit accounts)….
….while not real-life size….it isn’t toy sized either….an appreciable amount of data that should show how the techniques work at scale.
Preprocessing
Before we run the GDS algorithms, we need to do a bit of preprocessing:
// Add a label to shared/community IP addresses (e.g. Starbucks, etc.)
// to allow filtering later
MATCH p=(d1:Device)<-[:SESSION_DEVICE]-(s1:AccountSession)
-[:SESSION_IP]->(ipaddr:IPAddress)<-[:SESSION_IP]-
(s2:AccountSession)-[:SESSION_DEVICE]->(d2:Device)
WHERE id(s1)<>id(s2)
  AND id(d1)<>id(d2) // we can exclude sessions from same device
  AND abs(duration.between(s1.sessionDateTime,s2.sessionDateTime).days) <= 7
// aggregate number of distinct devices plus our original 1
WITH ipaddr, count(distinct d2)+1 as numDevices
WHERE numDevices > 5
SET ipaddr:CommunityIP
;
// Add a label to likely residential IP addresses (e.g. not Starbucks, etc.)
// to allow filtering later
MATCH (ipaddr:IPAddress)
WHERE NOT ipaddr:CommunityIP
SET ipaddr:ResidentialIP
;
// Add a label to filter devices for high incident devices
MATCH (dev:Device)<-[:SESSION_DEVICE]-(sess:AccountSession)
<-[:ACCOUNT_SESSIONS]-(acct:DepositAccount)
WITH dev, count(acct) as numAccts
WHERE numAccts > 20
SET dev:CommunityDevice
//Added 90987 labels, completed after 11123 ms
;
// Add a label to filter devices for low incident devices
MATCH (dev:Device)
WHERE NOT dev:CommunityDevice
SET dev:PersonalDevice
//Added 24313 labels, completed after 59 ms.
That last one takes a bit of explanation. Think about your less-than-favorite metropolitan area…..probably a gazillion sessions a day. We can filter the noise a bit and only create relationships between the sessions and the locations if a device had more than one session from that location. Think about it – yes, it does remove the possibility that I can learn that a customer makes a business trip to NYC and accesses his account there. But our goal is not to learn our customers’ behaviors….the data is there, we can still do that if we want…. Our real goal is to use session behaviors to identify customers – and one-off device sessions just aren’t going to help us. The cypher to do this is:
// Same ISP subnet
call apoc.periodic.iterate("
MATCH p=(d:Device)<-[:SESSION_DEVICE]-(s1:AccountSession)
-[:SESSION_LOCATION]->(loc:IPLocation)
<-[:SESSION_LOCATION]-(s2:AccountSession)
-[:SESSION_DEVICE]->(d)
WHERE id(s1)<>id(s2)
  AND NOT exists((s1)-[:SESSION_DEVICE_LOCATION]->(loc))
RETURN id(s1) as s1_id, id(s2) as s2_id, id(loc) as loc_id
","
WITH s1_id, s2_id, loc_id
MATCH (s1:AccountSession) WHERE id(s1)=s1_id
MATCH (s2:AccountSession) WHERE id(s2)=s2_id
MATCH (loc:IPLocation) WHERE id(loc)=loc_id
MERGE (s1)-[:SESSION_DEVICE_LOCATION]->(loc)
MERGE (s2)-[:SESSION_DEVICE_LOCATION]->(loc)
", {})
// Created 4184597 relationships, completed after 934529 ms.
A bit of explanation – we are going to be modifying millions of nodes….in a single statement. In most ACID-compliant databases, any single statement is a single transaction – it can be rolled back in its entirety if it fails. For many databases, single large transactions lead to fear of “blowing out the log” – meaning the transaction log fills and ALL processing halts. The DBA then extends the log and whines about the additional disk space that is hard to recover. Neo4j is just a smidge smarter. In log-based databases, large modifications first log all the changes to be made and then commit…. If the transaction fails, in some situations the database has to log Compensating Log Records to “undo” all the changes so that on recovery the changes are not applied anyhow. Neo4j instead records uncommitted changes in heap memory – if the transaction exceeds memory, it is canceled and rolled back. Processing continues. Man, oh, man – it is so much more fun with a database developed in the 21st century, when memory was plentiful, vs. being forced to live with implementations and architectures from back when 640KB was all you had. Yep – those RDBMS’s we all learned on are still dragging the legacy of antique hardware restrictions into the new millennium. APOC is a library of Neo4j utility stored procedures – and this one, apoc.periodic.iterate, takes two queries (a selector and a writer) and batches the work into iterative smaller transactions – so in this case, it breaks our really large single transaction into batches of 10,000 rows (I think that is the default).
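For reference, the batch size (and parallelism) can be set explicitly via the config map rather than relying on the defaults – a minimal sketch, where the query and the "touched" property are trivial placeholders and not part of the actual workload:
CALL apoc.periodic.iterate(
  "MATCH (s:AccountSession) RETURN s",   // selector: what to iterate over
  "SET s.touched = true",                // writer: applied per batch (placeholder property)
  {batchSize: 10000, parallel: false}    // commit every 10,000 rows, single-threaded
)
YIELD batches, total, errorMessages
RETURN batches, total, errorMessages;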
Running WCC Community Detection
Community detection, for the non-data scientists, is a simple way of saying “segmentation” – organizing a plethora of things into related groups. There are many different algorithms for doing this, the most popular being Weakly Connected Components (WCC) – which just looks for any connection between entities – and Louvain – which does the same but builds hierarchies of subgroups within the larger communities.
Our plan of attack is simple – create communities of sessions and then see how many of these communities have only a single possible customer associated. For those, we can then directly associate the devices, etc. with the customer for future analysis. Other communities of sessions may have more than one customer – we will call those “collisions” and we will deal with them next.
Community detection algorithms strictly work on what is termed a “monopartite” graph – meaning there is only a single entity of focus. In our case – “AccountSessions”. If we were trying to perform community detection among customers, our single entity would be “AccountHolders”. Now, as a second point, community detection only works for entities that are related to one another – so we need to create a relationship between sessions wherever there is a commonly shared secondary data element such as device, IP address, account, customer, etc. To do this, we will use the cypher:
// Find shared info for top tier quality indicators
CALL apoc.periodic.iterate(
"MATCH (sess1:AccountSession)-[:SESSION_DEVICE|SESSION_COOKIE|ACCOUNT_SESSIONS]
->(sharedInfo)<-[:SESSION_DEVICE|SESSION_COOKIE|ACCOUNT_SESSIONS]
-(sess2:AccountSession)
WHERE NOT sharedInfo:IPAddress
  AND NOT sharedInfo:CommunityDevice
  AND id(sess1)<id(sess2)
RETURN id(sess1) as sess1_id, id(sess2) as sess2_id
, labels(sharedInfo) as myLabels
","
WITH sess1_id, sess2_id, myLabels
MATCH (sess1:AccountSession) WHERE id(sess1)=sess1_id
MATCH (sess2:AccountSession) WHERE id(sess2)=sess2_id
MERGE (sess1)<-[r:SHARED_SESSION_INFO]-(sess2)
SET r.sharedInfo=apoc.coll.unionAll(r.sharedInfo,myLabels)
",{}
)
;
// add in medium quality indicator for shared IP within 7 days
CALL apoc.periodic.iterate(
"MATCH (sess1:AccountSession)-[:SESSION_IP]->(ipaddr:ResidentialIP)
<-[:SESSION_IP]-(sess2:AccountSession)
WHERE abs(duration.inDays(sess1.sessionDateTime,sess2.sessionDateTime).days)<=7
  AND id(sess1)<id(sess2)
RETURN id(sess1) as sess1_id, id(sess2) as sess2_id, labels(ipaddr) as myLabels
, abs(duration.between(sess1.sessionDateTime,sess2.sessionDateTime).days)
as numDays
","
WITH sess1_id, sess2_id, myLabels, numDays
MATCH (sess1:AccountSession) WHERE id(sess1)=sess1_id
MATCH (sess2:AccountSession) WHERE id(sess2)=sess2_id
MERGE (sess1)<-[r:SHARED_SESSION_INFO]-(sess2)
SET r.sharedInfo=apoc.coll.unionAll(r.sharedInfo,myLabels)
  , r.sharedIPdays=numDays
",{}
)
;
// append shared info for lower quality indicators but only
// if a top-tier quality indicator existed
// use the new CALL {} IN TRANSACTIONS as an alternative
// to apoc.periodic.iterate
MATCH (sess1:AccountSession)<-[ssi:SHARED_SESSION_INFO]-(sess2:AccountSession)
CALL {
WITH sess1, sess2, ssi
MATCH (sess1)-[:SESSION_LOCATION]->(:IPLocation)<-[:SESSION_LOCATION]-(sess2)
SET ssi.sharedInfo=ssi.sharedInfo+"IPLocation"
} IN TRANSACTIONS
//Set 18757633 properties, completed after 657619 ms.
;
MATCH (sess1:AccountSession)<-[ssi:SHARED_SESSION_INFO]-(sess2:AccountSession)
CALL {
WITH sess1, sess2, ssi
MATCH (sess1)-[:SESSION_DEVICE_CITY]->(:City)
<-[:SESSION_DEVICE_CITY]-(sess2)
SET ssi.sharedInfo=ssi.sharedInfo+"City"
} IN TRANSACTIONS
//Set 18757633 properties, completed after 623525 ms.
;
….and now we have our monopartite graph – which was depicted in the center of the earlier model diagram:
The problem is that not all associations are equal – for example, sessions that share cookies are quite distinctive. We could add a score and then run WCC in weighted mode. However, we also more or less attributed weights in building our monopartite graph – e.g. we removed shared IP addresses that were not within 7 days, only looked at devices in the same ISP subnet, etc. So, we can run WCC in unweighted mode (a sketch of the weighted alternative appears right after the unweighted run below). To do this, first we need to project our graph into GDS memory:
CALL gds.graph.create(
    "shared-sessions",
    ["AccountSession"],
    {
        SHARED_SESSION_INFO: {
            type: "SHARED_SESSION_INFO",
            orientation: "UNDIRECTED"
        }
    },
    {readConcurrency: 8}
)
// creates in 11806 milliseconds (11.8 seconds)
The second step is to run the WCC algorithm against that graph and record the community each session belongs to. The problem is – there are likely a large number of isolated nodes that would create a ton of communities of size 1. We really don’t care about them. A single session is not going to help us identify probable customers across sessions. So rather than having the GDS function for WCC write directly back to the database, we will instead stream the results through additional cypher statements to “filter” the results and remove communities of size 1.
CALL gds.wcc.stream(
? ? "shared-sessions",
? ? {
? ? ? ? concurrency: 8
? ? }
) YIELD nodeId, componentId
WITH gds.util.asNode(nodeId) as session, componentId as WCC_id
WITH WCC_id, collect(session) as wcc_list
WHERE size(wcc_list) > 1
WITH WCC_id, wcc_list
UNWIND wcc_list as curSession
SET curSession.WCC_communityID=WCC_id
//Set 4175007 properties, completed after 16192 ms.
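For reference, had we scored the SHARED_SESSION_INFO relationships and projected that score, the weighted variant would look roughly like this – a sketch only, since our projection above does not actually include a weight property:
CALL gds.wcc.stream(
    "shared-sessions",
    {
        concurrency: 8,
        relationshipWeightProperty: "weight",  // would need to be projected first
        threshold: 0.5                         // ignore relationships below this weight
    }
) YIELD nodeId, componentId
RETURN componentId, count(*) as communitySize
ORDER BY communitySize DESC
LIMIT 10;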
Now let’s see how well things worked out. First, we can get an idea of how many communities we created – we are expecting quite a few:
MATCH (sess:AccountSession)
RETURN count(distinct sess.WCC_communityID)
//444707
Not too surprising. The bigger question is how effective this was – how many communities exist with multiple customers associated?
MATCH p=(ah1:AccountHolder)-[:ONLINE_SESSIONS]->(sess1:AccountSession)
<-[ssi:SHARED_SESSION_INFO]-(sess2:AccountSession)
<-[:ONLINE_SESSIONS]-(ah2:AccountHolder)
WHERE sess1.WCC_communityID=sess2.WCC_communityID
  AND id(ah1)<>id(ah2)
RETURN count(p) as numCollisions
//8517 total
Hmmmm….. Now, we know that one of the aspects we considered was shared IP addresses. We restricted it to 7 days – but the reality is that, due to an ISP refresh of DHCP leases, an IP address technically could be reassigned to a different location within the same day. To see how this impacts the quality of our results, we can run a query like:
MATCH p=(ah1:AccountHolder)-[:ONLINE_SESSIONS]->(sess1:AccountSession)
<-[ssi:SHARED_SESSION_INFO]-(sess2:AccountSession)
<-[:ONLINE_SESSIONS]-(ah2:AccountHolder)
WHERE sess1.WCC_communityID=sess2.WCC_communityID
  AND id(ah1)<>id(ah2)
RETURN ssi.sharedIPdays, count(p) as numCollisions
ORDER BY ssi.sharedIPdays DESC
;
From which we get results similar to:
ssi.sharedIPdays  numCollisions
null              533
7                 1009
6                 1008
5                 980
4                 981
3                 997
2                 1024
1                 964
0                 1021
Hmmm…..a bit more than we expected at the lower end. Now if we take a look at some of these collisions:
// shows false positives (or fraud) due to overlapping IP addresses
MATCH p=(ah1:AccountHolder)-[:ONLINE_SESSIONS]->
(sess1:AccountSession)<-[ssi:SHARED_SESSION_INFO]-
(sess2:AccountSession)<-[:ONLINE_SESSIONS]-(ah2:AccountHolder)
WHERE sess1.WCC_communityID=sess2.WCC_communityID
  AND id(ah1)<>id(ah2)
  AND NOT ssi.sharedIPdays IS NULL
OPTIONAL MATCH p2=(sess1)<-[:ACCOUNT_SESSIONS]-(:DepositAccount)
OPTIONAL MATCH p3=(sess2)<-[:ACCOUNT_SESSIONS]-(:DepositAccount)
RETURN p, p2, p3
LIMIT 50
;
We get a picture similar to:
So, we have a lot of communities of 2 and some quite a bit larger. If we look at a single community, we see in some cases a result similar to:
Which shows a very weak link actually connecting two subcommunities. While it might be tempting to run Louvain instead and look at the resulting subcommunities in isolation, we really aren’t sure how many links exist between them – e.g. 2? 4? – and then what happens with Louvain?
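For the curious, running Louvain against the same projection would look roughly like this – a sketch only, since we did not take this path:
CALL gds.louvain.stream(
    "shared-sessions",
    {
        concurrency: 8,
        includeIntermediateCommunities: true  // expose the subcommunity hierarchy
    }
) YIELD nodeId, communityId, intermediateCommunityIds
RETURN communityId, count(*) as communitySize
ORDER BY communitySize DESC
LIMIT 10;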
Jaccard Similarity to the Rescue
For those who are new to this, data science has two very common similarity algorithms that try to score how similar two items are. The first is Jaccard - which uses either no weight or a single weight to evaluate similarity. For example, if two sessions are from the same IP address, use the same device, etc. - no real concern about scoring. The second is Cosine Similarity - which uses multiple weights - e.g. it might consider the frequency of same device, frequency of same IP address....etc. As you can imagine - it can be a bit more precise - but.....it takes a LOT longer to run, so for most purposes, Jaccard Similarity - which we call Node Similarity - is good enough. If you get bored, you can look it up, but the basic measurement is the intersection over the union of the sets.
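As a concrete illustration of “intersection over union” – a sketch only, not part of the actual pipeline, with $session1Id/$session2Id as placeholder parameters – here is Jaccard computed by hand for two sessions, using their shared-info neighbors as the sets:
MATCH (s1:AccountSession)-[:SESSION_DEVICE|SESSION_COOKIE|SESSION_IP]->(n1)
WHERE id(s1) = $session1Id
WITH collect(distinct id(n1)) as set1
MATCH (s2:AccountSession)-[:SESSION_DEVICE|SESSION_COOKIE|SESSION_IP]->(n2)
WHERE id(s2) = $session2Id
WITH set1, collect(distinct id(n2)) as set2
// assumes the union is non-empty
RETURN toFloat(size(apoc.coll.intersection(set1, set2)))
       / size(apoc.coll.union(set1, set2)) as jaccardScore;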
Back to our problem. Once the initial communities are established, a second step is often to run the Node (aka Jaccard) Similarity algorithm on the communities to “refine” those that have collisions. Remember, for use cases such as fraud, we need to have a human look at the results and evaluate whether fraud is really being committed. With ~8500 communities with collisions, that is a lot of evaluation. If we had 10 investigators and each could do 10 cases per day, it would take 85 days – or ~2.5 months. Ouch. Remember, most fraud windows are within 30 days.
Now Neo4j’s GDS has 4 classes of graph projections divided among 2 principal categories – named/unnamed and native/cypher. Named/unnamed simply refers to whether the graph projection has a name. Above, we created a named graph “shared-sessions” for running WCC. The name is important – without a name, you cannot perform subsequent GDS algorithms as….well….without a name, we simply can’t find it….and so Neo4j GDS automatically drops any unnamed graphs. Once and done – so to speak. Native vs. cypher refers to whether the graph projection is simply constructed by loading the nodes/relationships into memory as they exist in the database, or whether we execute cypher statements to identify which nodes/relationships to load. Generally speaking, we prefer named native projections as they are much faster for the GDS library to load. However, in this case, we want to iterate through each community and run node similarity just within that community to reduce the false positives. The good news is we only need to run node similarity on the communities with more than one customer associated – so rather than 445K executions, we only have 8.5K. Still, rather than creating 8500 named native graphs, we will exploit unnamed graphs and use a cypher projection to focus the algorithm on just the limited subset.
Now, a bit of detail about a projected graph. It is similar to an adjacency matrix:
As a consequence, the cypher projection has two main parts – a query to identify all the nodes; and a query to identify the relationships between those nodes.?Our cypher query will take the form of:
// node query
MATCH (s:AccountSession {WCC_communityID: $curCommunityID})
RETURN id(s) as id
UNION
MATCH (s:AccountSession {WCC_communityID: $curCommunityID})
-[r:SESSION_DEVICE|SESSION_COOKIE|ACCOUNT_SESSIONS|SESSION_IP|
SESSION_LOCATION|SESSION_CITY]-(sharedInfo)
RETURN id(sharedInfo) as id
The UNION might seem a bit weird. However, those familiar with similarities know that we are comparing using a “bipartite” graph. That means there are two types of entities involved. The first is the entity we are focused on – AccountSession in this case. The second is all the other entities (plural) that we want to compare the sessions against to see how similar they are – e.g. shared devices, cookies, accounts, IP addresses, etc. As a result, we need to load both the nodes we are focusing on as well as all the nodes related to them.
Then we need to define the relationship between those nodes – and ideally provide a score for how important we think that relationship is (or how to weight it in comparison). Note that a direct relationship between the nodes doesn’t have to exist – we are simply returning a pair of nodes implying a relationship – and a weight. So, our relationship query resembles:
// relationship query
MATCH (s1:AccountSession {WCC_communityID: $curCommunityID})
-[r1:SESSION_DEVICE|SESSION_COOKIE|ACCOUNT_SESSIONS|SESSION_IP|
SESSION_LOCATION|SESSION_CITY]-(sharedInfo)
-[r2:SESSION_DEVICE|SESSION_COOKIE|ACCOUNT_SESSIONS|SESSION_IP|
SESSION_LOCATION|SESSION_CITY]-
(s2:AccountSession {WCC_communityID: $curCommunityID})
RETURN id(s1) as source, id(s2) as target,
(CASE
WHEN type(r1) IN ['SESSION_COOKIE','ACCOUNT_SESSIONS'] THEN 0.9
WHEN type(r1)='SESSION_DEVICE' THEN 0.8
WHEN type(r1)='SESSION_IP'
AND abs(duration.between(s1.sessionDateTime,
s2.sessionDateTime).days)<=1 THEN 0.7
WHEN type(r1)='SESSION_IP'
AND abs(duration.between(s1.sessionDateTime,
s2.sessionDateTime).days)<=3 THEN 0.5
WHEN type(r1)='SESSION_IP'
AND abs(duration.between(s1.sessionDateTime,
s2.sessionDateTime).days)<=7 THEN 0.4
WHEN type(r1)='SESSION_LOCATION' THEN 0.025
WHEN type(r1)='SESSION_CITY' THEN 0.0125
ELSE 0.0
END) as weight
Put in English – for each community (identified by a parameter), we want to compare sessions that share information and assign a weight based on how good we think that relationship is. Note that, rather than generically looking at a 7-day window for IP addresses, we adjust the score for the overlap from 1 or fewer days (0.7) down to the maximum of 7 days (0.4). To assist our query, we will add an index on the community ID property we created so that the execution can find sessions by community quickly:
CREATE INDEX AccountSession_WCC_communityID_idx IF NOT EXISTS
FOR (s:AccountSession) ON (s.WCC_communityID)
;
The final cypher query to perform the node similarity iterations is:
//first part of query is to identify session communities with more than
// 1 customer associated with the community
MATCH p=(ah1:AccountHolder)-[:ONLINE_SESSIONS]->(sess1:AccountSession)
<-[ssi:SHARED_SESSION_INFO]-(sess2:AccountSession)
<-[:ONLINE_SESSIONS]-(ah2:AccountHolder)
WHERE sess1.WCC_communityID=sess2.WCC_communityID
  AND id(ah1)<>id(ah2)
// now we collect each community with a list of customers in that community
// including the customer associated with the session we started with
WITH sess1.WCC_communityID as communityID, id(ah1) as curCustomer
, collect(id(ah2)) as curCustomerList
WITH DISTINCT communityID, curCustomerList+curCustomer as customerList
// because each session may have multiple related sessions, we want to
// remove duplicate customer IDs from the list.  We aren't going to
// use this list in the actual graph, but it is used in some earlier
// queries where we look at sample data to get an idea of how big the
// similarities could be
WITH communityID
, apoc.coll.dropDuplicateNeighbors(
apoc.coll.sort(customerList)) as customerIDList
WITH collect({communityID: communityID
, numCustomers: size(customerIDList)
, customerIDList: customerIDList}) as communityListMap
// now we construct a map that is a list of all the communities
// along with their customer lists
WITH communityListMap, size(communityListMap) as numMaps
, range(1,size(communityListMap)) as mapIdxList
// and then iterate through that list executing the node similarity
// and creating new relationships between sessions that have a
// similarity score higher than our threshold.
UNWIND mapIdxList as curMapIdx
WITH communityListMap[curMapIdx-1].communityID as curCommunityID
, communityListMap[curMapIdx-1].customerIDList as curCustomerList
, curMapIdx, numMaps
CALL gds.nodeSimilarity.write({
nodeQuery: "MATCH (s:AccountSession {WCC_communityID: $curCommunityID})
RETURN id(s) as id
UNION
MATCH (s:AccountSession {WCC_communityID: $curCommunityID})
-[r:SESSION_DEVICE|SESSION_COOKIE|ACCOUNT_SESSIONS|
SESSION_IP|SESSION_LOCATION|SESSION_CITY]-(sharedInfo)
RETURN id(sharedInfo) as id",
relationshipQuery: "MATCH (s1:AccountSession {WCC_communityID:?
$curCommunityID})-[r1:SESSION_DEVICE|SESSION_COOKIE|
ACCOUNT_SESSIONS|SESSION_IP|SESSION_LOCATION|
SESSION_CITY]-(sharedInfo)-[r2:SESSION_DEVICE|
SESSION_COOKIE|ACCOUNT_SESSIONS|SESSION_IP|
SESSION_LOCATION|SESSION_CITY]-
(s2:AccountSession {WCC_communityID: $curCommunityID})
RETURN id(s1) as source, id(s2) as target,
(CASE
WHEN type(r1) IN ['SESSION_COOKIE',
'ACCOUNT_SESSIONS'] THEN 0.9
WHEN type(r1)='SESSION_DEVICE' THEN 0.8
WHEN type(r1)='SESSION_IP'
   AND abs(duration.between(s1.sessionDateTime,
    s2.sessionDateTime).days)<=1 THEN 0.7
WHEN type(r1)='SESSION_IP'
   AND abs(duration.between(s1.sessionDateTime,
    s2.sessionDateTime).days)<=3 THEN 0.5
WHEN type(r1)='SESSION_IP'
   AND abs(duration.between(s1.sessionDateTime,
    s2.sessionDateTime).days)<=7 THEN 0.4
WHEN type(r1)='SESSION_LOCATION' THEN 0.025
WHEN type(r1)='SESSION_CITY' THEN 0.0125
ELSE 0.0
END) as weight",
parameters: {curCommunityID: curCommunityID},
concurrency: 16,
relationshipWeightProperty: "weight",
writeRelationshipType: "COMMUNITY_SIMILARITY",
writeProperty: "weight",
similarityCutoff: 0.5
})
YIELD relationshipsWritten
WITH curMapIdx, numMaps, curCommunityID, relationshipsWritten
RETURN curMapIdx, numMaps, curCommunityID, relationshipsWritten
;
//Started streaming 6771 records after 137 ms and completed after 91372 ms,
//displaying first 2500 rows.
Now, we want to see how well we did:
MATCH p=(ah1:AccountHolder)-[:ONLINE_SESSIONS]->(:AccountSession)
-[r:COMMUNITY_SIMILARITY]->(:AccountSession)
<-[:ONLINE_SESSIONS]-(ah2:AccountHolder)
WHERE id(ah1)<>id(ah2)?
RETURN count(p)
;
//╒══════════╕
//│"count(p)"│
//╞══════════╡
//│499       │
//└──────────┘
// Started streaming 1 records after 8 ms and completed after 14286 ms.
Cool! Rather than 8500 cases, our investigators only need to look at ~500. With 10 investigators performing 10 cases/day as noted earlier, we are looking at a week’s work vs. 2.5 months. From the perspective of householding/entity resolution, starting from over 4 million sessions, we can now associate 99.99% (or something like that) of the sessions with individual customers and establish formal links for devices and IP addresses to aid in quickly identifying them on future visits to our site.
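What might that last linking step look like? A sketch only – the :USES_DEVICE relationship type is hypothetical and not part of the model above – for communities that resolved to a single customer, tie that customer directly to the devices seen in those sessions:
// link the sole customer of each single-customer community to its devices
MATCH (ah:AccountHolder)-[:ONLINE_SESSIONS]->(sess:AccountSession)
WITH sess.WCC_communityID as communityID, collect(distinct ah) as holders
WHERE size(holders) = 1
WITH communityID, holders[0] as holder
MATCH (sess:AccountSession {WCC_communityID: communityID})-[:SESSION_DEVICE]->(dev:Device)
MERGE (holder)-[:USES_DEVICE]->(dev);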
As noted, the sample database used included fraudulent accounts, where synthetic identities and other shared aspects (e.g. devices) were created for demonstration purposes. The ~500 collisions remaining are most likely due to those fraudulent accounts – which also shows a way of using behaviors over time (such as web sessions) as a possible method for finding fraud as well as for entity resolution.
Cool! Now we just need to apply the same methods to payment chains and we can chase down money laundering. But that is a discussion for another day.
Evaluating Effort for Production
What about the level of effort and time taken to do this? Let’s see…..if we discount the preprocessing time (a few hours – it can be implemented once and maintained), the overall execution was:
Some parts of this can be pre-maintained in production – e.g. using Label Propagation to maintain the SHARED_SESSION_INFO relationships, which would eliminate the first part, and pre-creating the index…but even with all that, we are talking about a ~15 minute process….on 4 million sessions!!!! …and if we maintain it via Label Propagation….~2 minutes!
Less time than it took you to read this.
Now for the schema changes.
Now, if you had been using a competitive product, you likely would have had to reload all the data (aaarrrghhhh), recompile all the queries…..and figure out how you were going to distribute those relationships across the distributed nodes…or whether it was distributed relationships, which are a nightmare to maintain……
….why would you willingly subject yourself to that?????