Sydney Trains - Most Critical Station Connection (Part 1)
Question
If we could connect any two #sydneytrains stations together, which two would be best?
Context
Yesterday, I calculated the importance of each station within the network using a metric called the Betweenness Centrality (BC) score. It is a measure of how important each train station is to the network: the score tells us how badly the system would suffer if that station went down for some reason, whether a maintenance issue, an electrical fault or a medical emergency. The higher the BC score, the more disruption an outage would cause. My biggest takeaway was identifying three train stations that were highly central to the network: Central, Chatswood and Sydenham.
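As a toy illustration of the metric (separate from the Neo4j pipeline), here is a from-scratch betweenness computation using Brandes' algorithm on a five-stop line. The middle stop sits on the most shortest paths, so it scores highest:

```python
from collections import deque

def betweenness(adj):
    """Unnormalised betweenness centrality for an unweighted,
    undirected graph (Brandes' algorithm).
    adj: dict mapping node -> list of neighbours."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and predecessors.
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        preds = {v: [] for v in adj}
        sigma[s], dist[s] = 1, 0
        order, q = [], deque([s])
        while q:
            v = q.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies from the far end back towards s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Undirected graph: every pair was counted from both ends.
    return {v: score / 2 for v, score in bc.items()}

# A five-stop line: a - b - c - d - e
line = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"],
        "d": ["c", "e"], "e": ["d"]}
scores = betweenness(line)   # the middle stop "c" scores highest
```

Knocking out "c" severs the line, and its score of 4 reflects that: it sits on the only shortest path for four of the ten station pairs.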
The Plan
The question I'm asking comes down to this: what is the best link I could build between two stations to reduce the centrality of the entire system?
This is my plan of attack:
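In code terms, the search I have in mind looks roughly like this. It's only a sketch: `peak_score` is a hypothetical callback that, given a set of links, recomputes the network's highest BC score (in my setup it would re-project the graph in GDS and stream the scores):

```python
import itertools

def best_new_link(stations, existing_links, peak_score):
    """Try every station pair not already linked and keep the one
    whose addition lowers the network's peak BC score the most.

    peak_score(links) -> float recomputes the highest centrality
    score for a given set of links."""
    best_pair, best_peak = None, peak_score(existing_links)
    for pair in itertools.combinations(stations, 2):
        if pair in existing_links or pair[::-1] in existing_links:
            continue
        peak = peak_score(existing_links | {pair})
        if peak < best_peak:
            best_pair, best_peak = pair, peak
    return best_pair, best_peak

# Toy check with made-up scores: linking A - C lowers the peak most.
stations = ["A", "B", "C"]
existing = {("A", "B")}
def toy_peak(links):
    if ("A", "C") in links:
        return 5.0
    if ("B", "C") in links:
        return 7.0
    return 8.0   # baseline with only A - B linked
best = best_new_link(stations, existing, toy_peak)
```

An exhaustive search recomputes centrality once per candidate pair, which grows quadratically with the number of stations, so on the full network the candidate pairs would likely need pruning first.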
Achievement
Today's exercise was to get more practice using the Python Driver to interact with the knowledge graph.
Here's what I achieved:
The Baseline
Yesterday, I imported all the train stations as nodes in a Neo4j knowledge graph and calculated the betweenness centrality score for each of them. Now that I have this data, I can use it as a baseline.
Scripting
Now that I have an idea of where to start, I open up my trusty PyCharm.
I'll need to interact with the Neo4j database using the Python driver.
The package that I used was the neo4j library.
Install that and import the graph database driver. I also import pandas up front, since I'll use it later to hold the scores.
import pandas as pd
from neo4j import GraphDatabase
Using this driver, I connect to the database with the credentials that I've set up for my Neo4j instance. One catch: the database name is a session-level setting, not a driver-level one, so I open a session against it rather than passing it to GraphDatabase.driver.
NEO4J_URI = "bolt://localhost:7687"
NEO4J_USER = "neo4j"
NEO4J_PASSWORD = "password"
NEO4J_DATABASE = "trains"
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))
session = driver.session(database=NEO4J_DATABASE)
Now I use the session to execute a query. In this particular query, I create a projection for the baseline. The query is an f-string, so the projection's name needs to be set first (the name itself is my choice).
projection_name = "trains-baseline"
query = f"""
MATCH (s1:Stop)-[r:NEXT_STOP]->(s2:Stop)
WITH gds.graph.project('{projection_name}', s1, s2) AS g
RETURN g.graphName AS graph, g.nodeCount AS nodes, g.relationshipCount AS relationships
"""
session.run(query)
I then use this projection to calculate the baseline BC scores.
# Calculate the betweenness centrality on the baseline projection
query = f"""
CALL gds.betweenness.stream('{projection_name}')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS station, score
"""
result = session.run(query)
baseline = pd.Series({record['station']: record['score'] for record in result})
A little bit of forward thinking: since I'll be using the pandas library to make statistical calculations on these scores, I need to create a DataFrame or Series and return it as part of my function.
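Since the scores come back as a Series keyed by station name, comparisons against a later what-if run become one-liners. The numbers below are made up purely to show the shape of the comparison:

```python
import pandas as pd

# Hypothetical baseline BC scores (illustrative values only,
# not the real numbers from the graph).
baseline = pd.Series({"Central": 0.42, "Chatswood": 0.31,
                      "Sydenham": 0.28, "Hornsby": 0.11})

# Hypothetical scores after adding a new link somewhere.
after = pd.Series({"Central": 0.35, "Chatswood": 0.30,
                   "Sydenham": 0.27, "Hornsby": 0.12})

top3 = baseline.nlargest(3)                    # most central stations
change = (after - baseline) / baseline * 100   # % change per station
```

Because pandas aligns the two Series by index, the subtraction pairs up each station's before and after scores automatically.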
Gotchas
Here are some of the gotchas that I ran into whilst developing the script.
It turns out that the old procedure for creating Cypher projections is deprecated.
The replacement to call is: gds.graph.project
The documentation of it is here:
The queries raise an error whenever you try to create a projection with the same name as one that already exists.
An error also occurs when you try to drop a projection that does not exist.
This means that when I create or drop a projection, I need to wrap it in an existence check.
CALL gds.graph.exists('{projection_name}') YIELD exists
WHERE exists
CALL gds.graph.drop('{projection_name}') YIELD graphName
RETURN graphName;
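On the Python side I wrap that guarded drop in a small helper so the script can be re-run cleanly (the function names here are my own). Note that gds.graph.drop also accepts a second argument, failIfMissing; passing false should make the drop a silent no-op, which avoids the existence check entirely.

```python
def drop_if_exists_query(projection_name: str) -> str:
    """Build the Cypher that drops a GDS projection only if it exists."""
    return (
        f"CALL gds.graph.exists('{projection_name}') YIELD exists\n"
        f"WITH exists WHERE exists\n"
        f"CALL gds.graph.drop('{projection_name}') YIELD graphName\n"
        f"RETURN graphName"
    )

def safe_drop(session, projection_name: str) -> None:
    """Run the guarded drop so re-creating the projection never errors."""
    session.run(drop_if_exists_query(projection_name))
```

Calling safe_drop(session, projection_name) before each create keeps the script idempotent.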
The Code
You can see all my code in the GitHub Gist here: