Sydney Trains - Most critical station connection (Part 2)

Sydney Trains - Most critical station connection (Part 2)


The Question:

If we could connect any two #sydneytrains stations together, which two would be best?


This is my plan of attack:

  • Create a baseline of the BC scores of the existing network
  • Generate candidate train lines linking stations that don’t yet have a connection.
  • Recalculate the BC scores with the new connection
  • Compare against the baseline and identify which one is best.


Recap:

  • Set up my python script to interact with the Neo4J Python driver.
  • Learnt how to use the GDS functions to create projections.
  • Determined a format to store the BC scores as a Pandas Series.


Candidates

I used Neo4J's query language #cypher in identifying unconnected stations.

So how many candidate pairs are there?

Let's do some math to have a sanity check on the numbers.

The train network has 177 stations. To calculate the number of possible pairs in the network, we can use the “choose” function we learned in high school maths when dealing with combinations (nCr).

177 C 2 = 177 * 176 / 2 = 15,576        

If we remove the existing connections (191), then we have:

15,576 - 191 = 15,385        

possible candidate pairs to explore.

That's a lot!


Scripting

I continue to build on my script.

I created a function that looks through the knowledge graph for any stations that are not yet connected.

I built up this query by searching for pairs of stops.

MATCH (s1:Stop), (s2:Stop)
RETURN s1.name AS station1, s2.name AS station2        

This gives me the Cartesian product of all the stops.

The problem with this statement is that it matches all stops that still have a connection.

We would then have to filter out all the pairs that already have a connection.

We do this by adding a WHERE clause.

MATCH (s1:Stop), (s2:Stop)
WHERE NOT (s1)-[:NEXT_STOP]-(s2)
RETURN s1.name AS station1, s2.name AS station2        

Because relationships in the knowledge graphs are directed, we need to filter out all the pairs that have this NEXT_STOP relationship in both direction. So we leave the arrow character out of the relationship syntax to check for a match in either direction.

The last thing we need to do is to filter out all the duplicate pairs because we don't need to consider the reverse direction. We can do this by forcing the pairs to be in alphabetical order. In the same stroke, eliminate the pairs with two of the same stop since the same name cannot be before itself.

MATCH (s1:Stop), (s2:Stop)
WHERE s1.name < s2.name AND NOT (s1)-[:NEXT_STOP]-(s2)
RETURN s1.name AS station1, s2.name AS station2        

Now that I have the query finished I can add that to the script.

I then transform the results of the query into a list of tuples.

def get_candidate_pairs(session):

    query = """
        MATCH (s1:Stop), (s2:Stop)
        WHERE s1.name < s2.name AND NOT (s1)-[:NEXT_STOP]-(s2)
        RETURN s1.name AS station1, s2.name AS station2
    """
    result = session.run(query)
    candidates = [(record["station1"], record["station2"]) for record in result]
    return candidates        


要查看或添加评论,请登录

Dougy Lee的更多文章