Heterogeneous Graphs and Relational Graph Convolutional Neural Networks (RGCNs): Part 11 of my Graph series of blogs
Ajay Taneja
Senior Data Engineer | Generative AI Engineer at Jaguar Land Rover | Ex - Rolls-Royce | Data Engineering, Data Science, Finite Element Methods Development, Stress Analysis, Fatigue and Fracture Mechanics
1. Introduction:
This article is the continuation of my series of blogs on "Graphs" and is the eleventh article in the series. From this article onwards, the focus of the series shifts to Heterogeneous Graphs and Knowledge Graph Embeddings. We will discuss the concepts involved in "Knowledge Graph Completion" in sufficient detail – the details will become clear as we move further through this blog and the series.
So far in my Graph series of blogs, from Part 1 through Part 10, I have discussed graphs with only one node type and one relation type. The concepts involving simple Graph Convolutional Neural Networks discussed in Part 8 and Part 9 of this series involved only one node type and one relation/edge type. We are now going to extend the concepts behind simple Graph Convolutional Networks to Relational Graph Convolutional Networks, which have several node or edge types, that is: multiple types of connections. Such graphs are termed "Heterogeneous Graphs".
Heterogeneous Graphs are termed so because they contain heterogeneous types of entities – node and edge types. Through this article, we will be discussing heterogeneous graphs, the extension of graph convolutions to multiple relation types, and the use of RGCNs for node classification and link prediction.

This article is organized as follows:

- Heterogeneous Graphs and Relational Graph Convolutional Neural Networks (RGCNs)
- Revisiting simple Graph Convolutional Neural Networks (GCNs)
- Graph convolutional operations in Relational GCNs and their mathematical representation
- Problems in Relational GCNs: the Block Diagonal and Basis Learning approaches to reduce parameters
- Entity/node classification and link prediction using RGCNs: training and evaluation
2. Heterogeneous Graphs and Relational Graph Convolutional Neural Networks (RGCNs):
What are Heterogeneous Graphs?
A Heterogeneous Graph is defined by a set of nodes V, where the nodes may be of different types, and a set of edges connecting the nodes, where each edge may have a different relation type, as illustrated in Figure 1 and Figure 2.

Thus, this situation may be symbolically represented as:
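One common way to formalize this (my notation, following the standard definition of a heterogeneous graph) is:

G = (V, E, R, T)

where each node v_i \in V has a node type T(v_i), each edge (v_i, r, v_j) \in E connects a pair of nodes, and each edge carries a relation type r \in R.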
Examples of Heterogeneous Graphs:
Example 1: One common example of a Heterogeneous Graph is in the field of biomedicine, wherein you can take different biomedical entities and the relationships between them and represent them as a heterogeneous graph. For example, one can have different entities in the biomedical area – Drug, Disease, Adverse Event, Protein, Pathway – and different types of relations between the different types of entities.

This is illustrated in Figure 1 below. We have a disease node type with nodes such as Migraine or Breast Neoplasm, and a drug node type with nodes such as Fulvestrant, which is connected to Breast Neoplasm through the relationship "Treats", as illustrated below:
In the above example of the Biomedical Knowledge Graph, we have different node types (such as Drug, Disease, Adverse Event, Protein and Pathway) and different edge/relation types (such as "Treats"). We can also have different types of relationships between a drug and the side effects it may cause, as illustrated in the figure above.

This is one example of a heterogeneous graph with different types of entities and different types of relationships between them.
Example 2: Another example of a Heterogeneous Graph is an event graph, as shown in Figure 2 below:

Here again, we can have different types of nodes/entities – a node type could be a flight or an airport – and we could have the destination of a flight as an edge type, or a relation type such as "cancelled by", as illustrated in the figure above.
3. Revisiting Simple Graph Convolutional Neural Networks (GCNs):
Now, given a heterogeneous graph with different types of relationships, we would like to extend simple Graph Convolutional Networks to handle this heterogeneity. To extend simple GCNs to RGCNs, we must be able to handle different node types and edges with different relation types. So far, we have defined Graph Convolutional Networks on simple graphs, that is, graphs with one type of node and one type of relationship; this has to be extended to handle more complex heterogeneous graphs.

Let us first recall the working of simple GCNs – that is, graphs with one node type and one relationship type – as discussed in the preceding articles of this series:
Working of Simple Graph Convolutional Neural Networks (GCNs):
I discussed the concepts involving the computation graph and message passing in Part 9 of this series, referenced above. Given a graph as shown in the figure below:
Considering node A as the target node, its corresponding computation graph will be as shown below:
The above architecture may be justified as follows:
We then added a non-linearity for feature transformation. The non-linearity could be ReLU or sigmoid. Thus, mathematically, a single layer of a GCN could be represented as follows:
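A standard way of writing this single-layer update (my notation, consistent with the description that follows) is:

h_v^{(l+1)} = \sigma \left( \sum_{u \in N(v)} \frac{1}{|N(v)|} \, W^{(l)} h_u^{(l)} \right)

where N(v) is the set of neighbours of node v, h_u^{(l)} is the layer-l embedding of neighbour u, W^{(l)} is the layer-l weight matrix and \sigma is the non-linearity.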
The above equation depicts that a node v goes over its neighbours u, takes their previous-layer representations, normalises and sums them, transforms the result, and sends it through a non-linearity.
The main point to understand about the difference between Graph Neural Networks and vanilla Neural Networks is that in Graph Neural Networks each node has its own neural network architecture. Every node in the graph gets to define its own computation graph based on the network around its neighbourhood, as shown in the figure below:
4. Graph Convolutional Operations in Relational GCNs:
Having understood what Heterogeneous Graphs are (sections 1 and 2) and having recalled the graph convolutional operations in a simple GCN (section 3), let us now try to understand what the convolution operations will be when we have multiple node types and multiple relation types.
Consider the graph below, which has multiple types of relations:
The edges representing each relation are shown in a unique colour. The way we can generalise GCNs to heterogeneous graphs is by using different neural network weights, or different neural network parameters, for different relationship types, as shown in the figure below:
This means that when we are doing message transformation, we are not going to apply the same matrix "W" for every incoming edge; we are going to apply a different matrix "W" depending on the relationship type. Therefore, for three relationship types, we will be using three different weight matrices, as shown in the figure above.
Now, if we look at the computation graph for node A, it will be as shown in the figure below:
As we may notice from the figure above, the message coming from B and D towards node A is transformed using the same operator "W" because the message is travelling along the relation type "r1", while the message from node C towards node A will be transformed using a different operator as it travels along the relational edge "r2".
5. Mathematical Representation of Relational GCNs:
The transformation matrix "W" will be indexed by the relationship type. Therefore, in a Relational GCN, the message computation at layer (l + 1) is written as:
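Written out (my notation, following the standard RGCN formulation), this is:

h_v^{(l+1)} = \sigma \left( \sum_{r \in R} \sum_{u \in N_v^r} \frac{1}{c_{v,r}} \, W_r^{(l)} h_u^{(l)} \; + \; W_0^{(l)} h_v^{(l)} \right)

where N_v^r is the set of neighbours of v connected through relation r, W_r^{(l)} is the weight matrix for relation r at layer l, W_0^{(l)} is the self-loop transformation and c_{v,r} = |N_v^r| is the normalisation constant.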
As per the above equation, we are going to sum over the different relationship types. For every relationship type, we look at the neighbours of node "v" connected through relation "r", take the transformation specific to that relation, take the embedding of each such neighbour from the previous layer and transform it.
Therefore, we have a weight matrix "W" for every layer and every relation, as seen in the above equation. c_{v,r} is the normalisation parameter – the node degree. I discussed node degree in section 5 of Part 1 of my series on Graphs. In undirected graphs, the node degree denotes the number of edges adjacent to a given node. For example, the node A in the figure below has degree 4, as it has 4 edges connected to it.
How do we write this as message plus aggregation?
Message:
For each neighbour of a given relation:
Self-loop (the node's own embedding transformation):
Aggregation:
This involves summing over the messages from the neighbours and the self-loop and then applying the activation:
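Concretely, the individual pieces are (again in my notation):

Message from each neighbour u under relation r:   m_{u,r}^{(l)} = \frac{1}{c_{v,r}} \, W_r^{(l)} h_u^{(l)}

Self-loop message:   m_v^{(l)} = W_0^{(l)} h_v^{(l)}

Aggregation:   h_v^{(l+1)} = \sigma \left( \sum_{r} \sum_{u \in N_v^r} m_{u,r}^{(l)} \; + \; m_v^{(l)} \right)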
This is called a Relational Graph Convolutional Neural Network, where we have different message transformation operators depending on the type of relation between a pair of nodes, applied to the embeddings of the previous layer.
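To make this concrete, here is a minimal PyTorch sketch of a single RGCN layer with one weight matrix per relation plus a self-loop weight. This is my own simplified illustration, not the author's code; it assumes the edges are given as (source, relation, target) triples and normalises by the per-relation neighbour count c_{v,r}:

import torch
import torch.nn as nn
from collections import defaultdict

class SimpleRGCNLayer(nn.Module):
    """One RGCN layer: a separate weight matrix W_r per relation plus a self-loop weight W_0."""

    def __init__(self, in_dim, out_dim, num_relations):
        super().__init__()
        self.rel_weights = nn.Parameter(torch.randn(num_relations, in_dim, out_dim) * 0.1)
        self.self_weight = nn.Parameter(torch.randn(in_dim, out_dim) * 0.1)

    def forward(self, h, edges):
        # h: (num_nodes, in_dim) previous-layer embeddings
        # edges: iterable of (src, rel, dst) triples; messages flow src -> dst
        num_nodes, out_dim = h.size(0), self.self_weight.size(1)
        agg = torch.zeros(num_nodes, out_dim)

        # Group incoming edges by (destination, relation) so we can normalise by c_{v,r}.
        grouped = defaultdict(list)
        for src, rel, dst in edges:
            grouped[(dst, rel)].append(src)

        for (dst, rel), srcs in grouped.items():
            msgs = h[srcs] @ self.rel_weights[rel]   # W_r h_u for each neighbour u
            agg[dst] += msgs.mean(dim=0)             # (1 / c_{v,r}) * sum of messages

        self_msg = h @ self.self_weight              # W_0 h_v (self-loop)
        return torch.relu(agg + self_msg)            # aggregate and apply non-linearity

# Toy usage (hypothetical graph with 5 nodes A..E mapped to 0..4 and 2 relation types):
layer = SimpleRGCNLayer(in_dim=8, out_dim=4, num_relations=2)
h0 = torch.randn(5, 8)
edges = [(1, 0, 0), (3, 0, 0), (2, 1, 0)]            # B->A and D->A via r1, C->A via r2
h1 = layer(h0, edges)                                # (5, 4) next-layer embeddings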
6. Problems in Relational GCNs:
With Relational GCNs, for every relation "r" we need "L" matrices – that is, for every layer the transformation matrix is different, and it is different for every relation.
The size of the W_r matrix is going to be the embedding dimension of the upper layer times the embedding dimension of the lower layer: d^(l+1) x d^(l). Therefore, we have one such matrix for every layer and every relation type. Because heterogeneous graphs, and in particular knowledge graphs, can have several relation types, we can end up with thousands of such matrices. The parameters of the model grow with the number of relation types; hence the model becomes too big to train, and because of the large number of parameters, overfitting becomes a common problem.
There are two approaches to reduce the number of parameters during RGCN training:

1. Block Diagonal matrices
2. Basis (Dictionary) Learning
Block Diagonal Approach to reduce the parameters of RGCN models:
Let us first talk about the use of Block Diagonal matrices to reduce the parameters during training – the key idea here is to make the transformation matrices W_r sparse, and one way to make these matrices sparse is to enforce a block diagonal structure, so that the non-zero elements of the matrix lie only in specific blocks of the transformation/weight matrices W_r, as shown in the figure below:
This reduces the number of non-zero elements, which means that during training you only have to estimate the parameters along the non-zero elements. Thus, if we assume that W_r is composed of "B" lower-dimensional block matrices, then the number of parameters reduces by a factor of "B": the parameter count goes from d^(l+1) x d^(l) to B x (d^(l+1)/B) x (d^(l)/B).
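As a small illustration (a sketch of my own, not from the original article), a block diagonal W_r can be assembled from B smaller blocks, and the parameter saving is easy to see by counting entries:

import torch

# Hypothetical dimensions: d_in = d_out = 8, with B = 4 blocks of size 2 x 2.
B, d_in, d_out = 4, 8, 8
blocks = [torch.randn(d_in // B, d_out // B) for _ in range(B)]

# Assemble the sparse block diagonal transformation W_r from the B small blocks.
W_r = torch.block_diag(*blocks)
print(W_r.shape)                                # torch.Size([8, 8])

dense_params = d_in * d_out                     # 64 parameters for a full matrix
block_params = B * (d_in // B) * (d_out // B)   # 16 parameters, i.e. reduced by a factor B
print(dense_params, block_params)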
What do you lose by using Block Diagonal matrices?
We do lose some information by using block diagonal matrices – embedding dimensions that are far from one another will not be able to interact with each other; only the embedding dimensions along the non-zero entries will interact. That is: only the neurons along the non-zero elements – neurons within the same block – will be able to exchange information with each other.
Basis Learning to reduce the parameters of RGCN models:
The second idea to make RGCNs more scalable is to share weights amongst the relations. That is – we do not want the relations to be completely independent of each other; we want the relations to share some information. This can be achieved elegantly by representing the transformation matrix W_r as a linear combination of "basis" matrices. These basis matrices are also termed "dictionaries". That is – the weight matrix W_r will be represented as a linear combination – a weighted sum – of dictionary matrices V_b, where V_b is shared across all relations. This is commonly referred to as "Basis Learning" or "Dictionary Learning".
This may be represented as below:
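In symbols (a standard basis decomposition):

W_r = \sum_{b=1}^{B} a_{rb} \, V_b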
Here, a_rb is the importance weight of relation r for the dictionary (basis) matrix V_b.
Thus, every relation only has to learn the importance weight a_rb for each basis matrix, and a_rb is a scalar. This significantly reduces the number of parameters: "B" may be small – say 10 – and even if you have thousands of relations, each relation only has to learn B scalars a_rb rather than a full matrix. This is highlighted through the expression below:
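For a rough count of parameters (my notation; |R| is the number of relation types):

\text{without sharing: } |R| \cdot d^{(l+1)} \cdot d^{(l)} \quad \longrightarrow \quad \text{with basis learning: } B \cdot d^{(l+1)} \cdot d^{(l)} \; + \; |R| \cdot B

so with, say, B = 10 and a thousand relations, each relation contributes only 10 scalars on top of the shared basis matrices.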
Having talked about making RGCNs scalable using Block Diagonal matrices and Basis Learning, let us now talk about how one would define various prediction tasks on heterogeneous graphs.
7. Entity / Node Classification using RGCNs
In the case of a prediction task involving node/entity classification, things are not very different. The RGCN will compute a representation, or embedding, for every node, and based on that embedding a prediction head is created.
For example, if we want to classify the nodes into "k" different classes, then the final prediction head will involve a linear transformation of the embedding vector – that is, taking the embedding vector of the final layer, multiplying it by a weight matrix, passing it through a non-linearity/activation function and then through a Softmax in order to predict the probability of each class. The output involves a "k"-way head for the "k" different classes.
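A minimal sketch of such a prediction head (my own illustration, assuming the final-layer node embeddings have already been computed by the RGCN; here simply a linear layer followed by a softmax, with any extra non-linearity added as described above):

import torch
import torch.nn as nn

k = 3                                    # hypothetical number of classes
embed_dim = 4                            # hypothetical final embedding dimension
classifier = nn.Linear(embed_dim, k)     # prediction head: linear transformation of the embedding

h_final = torch.randn(10, embed_dim)     # placeholder for the final-layer RGCN embeddings of 10 nodes
logits = classifier(h_final)
probs = torch.softmax(logits, dim=-1)    # probability of each of the k classes for every node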
8. Entity / Link Prediction
Stratification – splitting edges of different types into training, validation and test sets
For link prediction, we approach things a bit differently. Links/edges may have different types – that is, different relationship types – and out of the different relationship types some may be common and some may be rare. For every relationship type, we would split its edges into:

1. Training message edges
2. Training supervision edges
3. Validation edges
4. Test edges
We would do this for all the relationship types and then merge all the edges – that is, we split the edges of each type independently and then merge them. This means that even for rare/infrequent relation types, some of the instances will be in the training set, some in the validation set and some in the test set. This independent splitting of the edges of each relationship type is called stratification.
In the figure below, we have the original graph shown on the left-hand side and the edges split into training message edges, training supervision edges, validation edges and test edges shown on the right-hand side in unique colours. The splitting has to be done for every relation type, because edges of a specific relation type may be more or less frequent across the whole graph. Once the splitting is done, the training message edges, training supervision edges, validation edges and test edges are merged, as shown in Figure 14.
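A small sketch of this per-relation (stratified) splitting, using made-up edge lists and split fractions of my own choosing:

import random

def stratified_edge_split(edges_by_relation, fractions=(0.6, 0.2, 0.1, 0.1), seed=0):
    """Split the edges of each relation type independently, then merge the splits.

    edges_by_relation: dict mapping relation type -> list of (src, dst) edges.
    fractions: (message, supervision, validation, test) proportions.
    """
    rng = random.Random(seed)
    merged = {"message": [], "supervision": [], "validation": [], "test": []}
    for rel, edges in edges_by_relation.items():
        edges = edges[:]
        rng.shuffle(edges)
        n = len(edges)
        a = int(fractions[0] * n)
        b = a + int(fractions[1] * n)
        c = b + int(fractions[2] * n)
        merged["message"] += [(u, rel, v) for u, v in edges[:a]]
        merged["supervision"] += [(u, rel, v) for u, v in edges[a:b]]
        merged["validation"] += [(u, rel, v) for u, v in edges[b:c]]
        merged["test"] += [(u, rel, v) for u, v in edges[c:]]
    return merged

# Example with a frequent relation r1 and a rare relation r2 (toy data):
splits = stratified_edge_split({"r1": [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)],
                                "r2": [(0, 2), (1, 3)]})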
Formalizing link prediction in Heterogeneous Graphs:
Let us now see how we can formalize link prediction in Heterogeneous Graphs. Consider the graph below:
Imagine that we want to predict the probability of there being an edge between nodes E and A, and that this edge is of relation type r3, as shown in the figure above. Let us assume that this edge is a training supervision edge and that all other edges are training message edges.
The way we would like the RGCN to score this edge is as follows: we take the final-layer embedding of node E and the final-layer embedding of node A, and we have a relation-specific scoring function "f" that takes these embeddings and maps them to a real value.
That is – we take the embedding of node E, apply a linear transformation by multiplying it with a relation-specific weight matrix, take the dot product with the embedding of A, and pass the result through an activation function such as a sigmoid to obtain a probability.
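In symbols, one common choice of relation-specific scoring function matching the description above (my notation) is:

f_{r_3}(h_E, h_A) = \sigma \left( h_E^{\top} W_{r_3} h_A \right)

where h_E and h_A are the final-layer embeddings, W_{r_3} is the weight matrix specific to relation r3 and \sigma is the sigmoid.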
The probability obtained indicates whether there is a link of type r3 between nodes E and A. This is one way of formalizing the problem. Let us now look at the training process.
9. RGCN for Entity / Link Prediction: Training Process Summary
Consider the Graph as shown in the figure below:
a) Let us assume that the edge from E to A is the training supervision edge.

b) Let us assume that all other edges, shown by solid lines in Figure 16 above, are training message edges.

c) We are going to use the training message edges to predict the likelihood of existence of the training supervision edge: that is, the edge from E to A in this example.

d) Another thing that we have to do as part of link prediction is to create negative instances. That is, we have to create negative edges.

e) One way of creating a negative edge is by perturbing the supervision edge. For example, in the graph of Figure 16, if we have the supervision edge from E to A, one way to create a negative edge is to corrupt the tail of this edge: we keep the head at E but corrupt the tail so that the edge now points from E to B rather than from E to A (see the expression after this list).

f) It should be noted that a negative edge must not belong to the training message edges or the training supervision edges – for example, the edge from node E to C is a training message edge, so it is not a valid negative edge. Therefore, we have to be careful when sampling negative edges. The RGCN is used to score both the positive and the negative edges.
g) The loss function that we want to optimize is the standard cross-entropy loss, where we want to maximise the score of the training supervision edge (the positive edge) and minimise the score of the negative edge. We would write down the penalty as shown below (see the expression after this list), and then use an optimizer such as stochastic gradient descent to optimize the parameters of the RGCN, so that it assigns a high probability score to the training supervision edge and a low score to the negative edges.
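A standard way to write this cross-entropy penalty (my notation, reusing the sigmoid-based score f_{r_3} from section 8) is:

\ell = -\log f_{r_3}(h_E, h_A) \; - \; \log \big( 1 - f_{r_3}(h_E, h_B) \big)

where (E, r3, A) is the positive (training supervision) edge and (E, r3, B) is a sampled negative edge; with several negative edges, the second term is summed (or averaged) over them.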
10. RGCN for Entity / Link Prediction: Evaluation Process
Having trained the model, we move on to validation – to validate the performance of the model.
Let us assume that the validation edge we are interested in is between node E and node D, as shown below. We are interested in knowing whether this edge is of type r3.
Now, the training message edges and training supervision edges are used for message propagation, and we have to score the edge from E to D as shown in Figure 17 above. The score of this edge from E to D should be higher than the score of all the negative edges.
The evaluation process will be as follows:

1. Use the training message edges and training supervision edges to propagate messages, compute the final embeddings and score the validation edge (E to D, of type r3).
2. Score the negative edges obtained by perturbing the validation edge – keeping the head at E and corrupting the tail – taking care that the corrupted edges do not already exist in the graph.
3. Obtain the rank of the validation edge amongst all of these scored edges: its score should be higher than the scores of all the negative edges.
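A small sketch of how such a ranking-based evaluation could be computed (my own illustration; the choice of Hits@k and reciprocal rank as metrics is an assumption, as ranking metrics of this kind are the usual choice for knowledge-graph link prediction):

def evaluate_edge(positive_score, negative_scores, k=3):
    """Rank the positive (validation) edge against its negative edges.

    positive_score: RGCN score of the validation edge, e.g. f_r3(h_E, h_D).
    negative_scores: scores of the corrupted (negative) edges.
    Returns the rank of the positive edge, Hits@k and the reciprocal rank.
    """
    rank = 1 + sum(1 for s in negative_scores if s > positive_score)
    hits_at_k = 1.0 if rank <= k else 0.0
    reciprocal_rank = 1.0 / rank
    return rank, hits_at_k, reciprocal_rank

# Toy example: the validation edge scores 0.8, all negatives score lower except one.
print(evaluate_edge(0.8, [0.9, 0.4, 0.2, 0.1]))   # (2, 1.0, 0.5)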
11. Summary of this article:
This article covered the following aspects of RGCNs:

- What Heterogeneous Graphs are, with examples from biomedicine and event graphs
- How the message passing of simple GCNs extends to Relational GCNs, where each relation type has its own transformation matrix
- The scalability problems of RGCNs and the two approaches to reduce the number of parameters: Block Diagonal matrices and Basis (Dictionary) Learning
- How RGCNs are used for entity/node classification
- How RGCNs are used for link prediction: stratified edge splitting, negative edge sampling, the training process and the ranking-based evaluation process