Heterogeneous Graphs and Relational Graph Convolutional Neural Networks (RGCNs): Part 11 of my Graph series of blogs



1. Introduction:

This article is the continuation of my series of blogs on “Graphs” and is the eleventh article in the series. From this article onwards, the focus of the series shifts to Heterogeneous Graphs and Knowledge Graph Embeddings. We will discuss in sufficient detail the concepts involved in “Knowledge Graph Completion” – the details of which will become clear as we move further through this blog and the series.

So far in my Graph series of blogs, from Part 1 through Part 10, I have discussed graphs with only one node type and one relation type. The concepts involving simple Graph Convolutional Neural Networks discussed in Part 8 and Part 9 of this series involved only one node type and one relation/edge type. We are now going to extend the concepts of simple Graph Convolutional Networks to Relational Graph Convolutional Networks, which handle several node and edge types – that is, multiple types of connections. Such graphs are termed “Heterogeneous Graphs”.

Heterogeneous Graphs are termed so because they contain heterogeneous types of entities – multiple node and edge types. In this article we will discuss the following:

  1. Firstly, we will talk about Relational Graph Convolutional Networks (RGCNs), which take care of different node types and relation types.
  2. We will then talk about Knowledge Graphs, which are a class of Heterogeneous Graphs. We will discuss Knowledge Graph Embeddings, how we can construct them, and Knowledge Graph Completion, which is a very common task in the life of a practicing Graph Analyst / AI Engineer. This discussion will form part of the twelfth and thirteenth articles of this series.


This article is organized as follows:


  • Firstly, in section 2 of this article we understand the definition of Heterogeneous Graphs with some examples


  • In section 3, we review/recapitulate the underlying concepts of simple Graph Convolutional Neural Networks (GCNs): the computation graph of a node and the formulation of a single GCN layer – the operations involving Message Passing and Aggregation.


  • Having reviewed the concepts involving GCNs, we then understand, in section 4, the fundamental difference between GCNs and RGCNs – the Relational Graph Convolutional Neural Networks.


  • The math behind the definition of message passing and aggregation in RGCNs is elaborated in section 5.


  • Section 6 gets deeper into RGCNs – understanding the problems in RGCNs and means of solving these problems.


  • Sections 7 through 10 are mainly about the training and evaluation process in RGCNs for node classification and link prediction problems. More emphasis is laid on link prediction because edges have to be treated differently in graphs than in a conventional ML problem.


2. Heterogeneous Graphs and Relational Graph Convolutional Neural Networks (RGCNs):

What are Heterogeneous Graphs?

A Heterogeneous Graph is defined by a set of nodes “V”, where the nodes can be of different types, and a set of edges connecting them, where each edge may have a different relation type, as illustrated in Figure 1 and Figure 2.

Thus, this situation may be symbolically represented as:


  • A heterogeneous graph is defined as:

$$G = (V, E, R, T)$$

Equation: Symbolic representation of a Graph


  • Nodes with node type i, represented as:

$$v_i \in V$$

Equation: Symbolic representation of a node of type i


  • A set of edges connecting different nodes, where each edge may have a different relation type r, represented as:

$$(v_i, r, v_j) \in E, \qquad r \in R$$

Equation: Symbolic representation of a set of edges of different relation types between nodes v_i and v_j


  • Each node has a node type, where "T" denotes the node-type mapping:

$$T(v_i)$$

Equation: Symbolic representation of a node type

A tiny sketch of how such a typed graph can be held in code is shown below.

Examples of Heterogeneous Graphs:

Example 1: One common example of a Heterogeneous Graph is in the field of biomedicine, where you can take different biomedical entities and the relationships between them and represent them as a heterogeneous graph. For example, one can have different entities in the biomedical area – Drug, Disease, Adverse event, Protein, Pathway – and different types of relations between the different types of entities.

This is illustrated in Figure 1 below. We have one node type for a disease such as Migraine, and another node type for a drug such as Fulvestrant, which is connected to the disease Breast Neoplasm through the relationship “Treats”, as illustrated below:


Figure 1: Biomedical Knowledge Graph


In the above example of the Biomedical Knowledge Graph, we have:


  • Example node: Migraine


  • Example relation: (Fulvestrant, Treats, Breast Neoplasm)


  • Example node type: Protein


  • Example edge type: Causes



We can have different types of relationships between a drug and the side effects it may cause, as illustrated in the figure above.

This is one example of a heterogeneous graph with different types of entities and different types of relationships between them.

Example 2: Another example of a Heterogeneous Graph is an event graph, as shown in Figure 2 below:


Figure 2: Event Graphs


Here again, we can have different node types/entities: a node type could be a flight or an airport, and we could have edge/relation types such as the destination of a flight or “cancelled by”, as illustrated in the figure above.


3. Revisiting Simple Graph Convolutional Neural Networks (GCNs):

Now, given a heterogeneous graph with different types of relationships, we would like to extend simple Graph Convolutional Networks to handle this type of heterogeneity. To extend simple GCNs to RGCNs, we should be able to handle different node types and edges with different relationship types. So far, we have defined Graph Convolutional Networks on simple graphs – that is, one type of node and one type of relationship – and this has to be extended to handle more complex heterogeneous graphs.

Let us first recall the working of simple GCNs – that is, graphs of one node type and one relationship type – as discussed in the preceding articles of this series:


Working of Simple Graph Convolutional Neural Networks (GCNs):


I discussed the concepts involving the computational graph and message passing in Part 9 of this series, referenced above. Given a graph as shown in the figure below:


Figure 3: A Graph with a set of nodes V



Considering node A as the target node, its corresponding computation graph will be as shown below:


Figure 4: Computation Graph of node A


Figure 5: Computation Graph of node A – the boxes indicate the neural networks where linear transformation and non-linearities are applied



The above architecture may be justified as follows:

  • Node A, as shown in Figure 4/5 above, is going to take information from its neighbours in the network – the neighbours of A being B, C and D. This is one hop of information, and it can be unfolded to multiple hops.


  • If we want to consider another hop, then node D takes information from its neighbour A, node C takes information from A, B, E and F (that is, the neighbours of C), and similarly node B takes information from its neighbours A and C.


  1. Next, given the structure of the network above (Figure 4/5): the messages (embeddings) from nodes A and C to node B, from nodes A, B, E and F to node C, and from node A to node D will have to be aggregated and transformed.
  2. Similarly, the messages from B, C and D will have to be aggregated and transformed before being passed to A. The transformation is parametrised, and the parameters have to be learnt.


We then add a non-linearity for feature transformation – the non-linearity could be ReLU or sigmoid. Thus, mathematically, a single layer of a GCN can be represented as follows:


$$h_v^{(l+1)} = \sigma\left(\sum_{u \in N(v)} \frac{1}{|N(v)|}\, W^{(l)} h_u^{(l)}\right)$$

Equation: Mathematics of a single layer of a simple GCN


The above equation says that a node v goes over its neighbours u, takes their previous-layer representations, normalises and sums them, transforms the result, and sends it through a non-linearity. A minimal sketch of one such layer is shown below.
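The following is a minimal sketch (PyTorch, with assumed tensor shapes and class names – not code from the article) of one simple GCN layer: mean-normalised message passing over the adjacency matrix followed by a shared linear transformation and a non-linearity.

```python
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)  # shared weight W^(l)

    def forward(self, h: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        # h: [num_nodes, d_in] previous-layer embeddings h_u^(l)
        # A: [num_nodes, num_nodes] adjacency matrix (1 where an edge exists)
        deg = A.sum(dim=1, keepdim=True).clamp(min=1)  # |N(v)| per node
        msg = (A @ h) / deg                            # mean over neighbours
        return torch.relu(self.W(msg))                 # transform + non-linearity

# Usage: 6 nodes (A..F), 8-dimensional features, random adjacency for illustration.
h = torch.randn(6, 8)
A = (torch.rand(6, 6) > 0.5).float()
layer = SimpleGCNLayer(8, 16)
print(layer(h, A).shape)  # torch.Size([6, 16])
```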

The main point to understand about the difference between Graph Neural Networks and vanilla neural networks is that in a Graph Neural Network each node has its own neural network architecture. Every node in the graph gets to define its own computation graph based on the network around its neighbourhood, as shown in the figure below:



Figure 6: Each node of the Graph has its own neural network architecture – its own computation graph.



4. Graph Convolutional Operations in Relational GCNs:

Having understood what Heterogeneous Graphs are (sections 1 and 2) and having recalled the graph convolutional operations in a simple GCN (section 3), let us now try to understand what the convolutional operations will be when we have multiple node types and multiple relation types.

Consider the graph below, which has multiple types of relations:


Figure 7: Heterogeneous Graph – Graph with multiple relation types



The edges representing each relation are shown in a unique colour. The way we can generalise GCNs to heterogeneous graphs is by using different neural network weights (different neural network parameters) for different relationship types, as shown in the figure below:


Figure 8: Generalising GCNs to Heterogeneous Graphs – using unique weights for each relation type

This means that when we are doing message transformation, we are not going to apply the same matrix “W” for every incoming edge; we are going to apply a different matrix “W” depending on the relationship type. Therefore, for three relationship types, we will be using three different weight matrices, as shown in the figure above.


Now, if we look at the computation graph for the node A, it will be as shown in the figure below:



Figure 9: Input Graph and Computation Graph for Node A


As we may notice from the figure above, the messages coming from B and D towards node A are transformed using the same operator because they travel along relation type “r1”, while the message from node C towards node A is transformed using a different operator as it travels along the relational edge “r2”.


5. Mathematical Representation of Relational GCNs:

The transformation matrix “W” will be indexed by the relationship type. Therefore, in a Relational GCN, the message computation at layer (l + 1) is written as:

$$h_v^{(l+1)} = \sigma\left(\sum_{r \in R} \sum_{u \in N_v^{r}} \frac{1}{c_{v,r}}\, W_r^{(l)} h_u^{(l)} \;+\; W_0^{(l)} h_v^{(l)}\right)$$

Equation: Mathematics of a single layer of an RGCN


As per the above equation, we sum over the different relationship types. For every relationship type “r”, we look at the neighbours of node “v” connected via relation “r”, take the transformation specific to that relation, take each neighbour’s embedding from the previous layer and transform it.


Therefore, we have a weight matrix “W” for every layer and every relation, as seen in the above equation. C(v,r) is a normalization parameter – the node degree (under relation “r”). I discussed node degree in section 5 of Part 1 of my series on Graphs: in undirected graphs, the node degree denotes the number of edges adjacent to a given node. For example, node A in the figure below has degree 4 as it has 4 edges connected to it.


Figure 10: Node Degree in Undirected Graphs


How to write this as message plus aggregation?

Message:

For each neighbour of a given relation:


$$m_{u,r}^{(l)} = \frac{1}{c_{v,r}}\, W_r^{(l)} h_u^{(l)}$$

Equation: Message – transformed embedding of a node’s neighbour, normalised by the node degree



Self-loop (node’s own embedding transformation):


$$m_v^{(l)} = W_0^{(l)} h_v^{(l)}$$

Equation: Message (self-loop) – transformation of the node’s own embedding



Aggregation:

This involves summing over the messages from the neighbours and the self-loop, and then applying a non-linearity:


$$h_v^{(l+1)} = \sigma\left(\operatorname{Sum}\left(\left\{ m_{u,r}^{(l)} : u \in N_v^{r},\ r \in R \right\} \cup \left\{ m_v^{(l)} \right\}\right)\right)$$

Equation: Aggregation of the transformed messages of a node’s neighbours and its self-loop


This is called a Relational Graph Convolutional Neural Network: we have different message transformation operators, chosen according to the type of relation between a pair of nodes, applied to the embeddings of the previous layer. A minimal sketch of a single RGCN layer is shown below.
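The following is a minimal sketch (PyTorch, with assumed shapes and names – not the original RGCN implementation) of one RGCN layer: a separate weight matrix W_r per relation plus a self-loop transformation W_0, with degree normalisation per relation.

```python
import torch
import torch.nn as nn

class RGCNLayer(nn.Module):
    def __init__(self, d_in: int, d_out: int, num_relations: int):
        super().__init__()
        self.W_r = nn.ModuleList(
            [nn.Linear(d_in, d_out, bias=False) for _ in range(num_relations)]
        )
        self.W_0 = nn.Linear(d_in, d_out, bias=False)  # self-loop transformation

    def forward(self, h: torch.Tensor, adj_per_rel: list) -> torch.Tensor:
        # h: [num_nodes, d_in]; adj_per_rel[r]: [num_nodes, num_nodes] adjacency of relation r
        out = self.W_0(h)                                    # self-loop message
        for r, A_r in enumerate(adj_per_rel):
            deg = A_r.sum(dim=1, keepdim=True).clamp(min=1)  # c_{v,r}: degree under relation r
            out = out + self.W_r[r]((A_r @ h) / deg)         # normalised neighbour messages
        return torch.relu(out)                               # aggregation + non-linearity

# Usage: 6 nodes, 3 relation types, random per-relation adjacencies for illustration.
h = torch.randn(6, 8)
adjs = [(torch.rand(6, 6) > 0.7).float() for _ in range(3)]
print(RGCNLayer(8, 16, num_relations=3)(h, adjs).shape)  # torch.Size([6, 16])
```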


6. Problems in Relational GCNs:

With Relational GCNs, for every relation “r” we need “L” matrices – that is, the transformation matrices are different for every layer, and they are different for every relation:


$$W_r^{(1)},\, W_r^{(2)},\, \ldots,\, W_r^{(L)}$$

Expression: L matrices for every relation – “L” denotes the number of layers


The size of each matrix Wr is the embedding dimension of the upper layer times the embedding dimension of the lower layer: d^(l+1) × d^(l). We have one such matrix for every layer and every relation type. Because heterogeneous graphs – and in particular knowledge graphs – can have several relation types, we can end up with thousands of such matrices. The number of parameters grows rapidly with the number of relation types; the model becomes too big to train and, because of the large number of parameters, overfitting becomes a common problem.


There are two approaches to reduce the number of parameters during RGCN training:


  • Using Block Diagonal matrix
  • Basis / Dictionary Learning


Block Diagonal Approach to reduce the parameters of RGCN models:

Let us first talk about the use of block diagonal matrices to reduce the parameters during training. The key idea here is to make the transformation matrices Wr sparse, and one way to do this is to enforce a block diagonal structure so that the non-zero elements lie only in specific blocks of the transformation/weight matrices “Wr”, as shown in the figure below:


Figure 11: Block diagonal matrix indicates non-zero elements only on specific blocks of the structure



This reduces the number of non-zero elements, which means that during computation you only have to estimate the parameters corresponding to the non-zero elements. Thus, if we assume that Wr is composed of “B” lower-dimensional block matrices, the number of parameters reduces roughly by a factor of “B”: from d^(l+1) × d^(l) to B × (d^(l+1)/B) × (d^(l)/B).


$$d^{(l+1)} \times d^{(l)} \;\;\longrightarrow\;\; B \times \frac{d^{(l+1)}}{B} \times \frac{d^{(l)}}{B}$$

Expression: Reduction in parameters as a result of the weight matrices being block diagonal


What do you lose by using Block Diagonal matrices?

We do lose some information by using block diagonal matrices – embedding dimensions that are far from one another will not be able to interact with each other; only the embedding dimensions along the non-zero entries can interact. That is, only the neurons within the same block will be able to exchange information with each other. A minimal sketch of building such a block diagonal weight is shown below.
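The following is a minimal sketch (PyTorch, with assumed sizes and names – an illustration, not the library’s own API for RGCN regularisation) of assembling a block diagonal relation weight W_r from B small blocks, so that only dimensions within the same block interact.

```python
import torch

def block_diagonal_weight(B: int, d_in: int, d_out: int) -> torch.Tensor:
    # Each of the B blocks maps d_in/B input dimensions to d_out/B output dimensions.
    blocks = [torch.randn(d_out // B, d_in // B) for _ in range(B)]
    return torch.block_diag(*blocks)  # zeros everywhere except along the B blocks

W_r = block_diagonal_weight(B=4, d_in=16, d_out=16)
print(W_r.shape)              # torch.Size([16, 16])
print(int((W_r != 0).sum()))  # 4 blocks of 4x4 = 64 parameters instead of 256
```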


Basis Learning to reduce the parameters of RGCN models:

The second idea to make RGCNs more scalable is to share weights amongst the relations. That is, we do not want the relations to be completely independent of each other; we want them to share some information. This can be achieved elegantly by representing the transformation matrix Wr as a linear combination of “basis” matrices. These basis matrices are also termed “dictionaries”. That is, the weight matrix Wr is represented as a weighted sum of dictionary matrices Vb, where the Vb are shared across all relations. This is commonly referred to as “Basis Learning” or “Dictionary Learning”.


This may be represented as below:


$$W_r = \sum_{b=1}^{B} a_{rb} \cdot V_b$$

Equation: Weight matrix for relation “r” represented as a linear combination of B basis matrices


Here, a_rb is the importance weight of the dictionary (basis) matrix V_b for relation “r”.

Thus, every relation only has to learn the importance weights a_rb – one scalar per basis matrix. This significantly reduces the number of parameters: “B” may be small, say 10, and even if you have a thousand relations, each relation only needs its B importance scalars (in addition to the shared basis matrices) rather than a full weight matrix per layer. This is highlighted through the expression below, followed by a short sketch:


$$\{a_{r1}, a_{r2}, \ldots, a_{rB}\} \quad \text{for each relation } r$$

Expression: Each relation only has to learn “B” importance weights, where “B” is the number of basis matrices
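The following is a minimal sketch (PyTorch, with assumed names and shapes – an illustration rather than the original implementation) of basis/dictionary learning: every relation’s weight W_r is a learned mixture of B shared basis matrices V_b.

```python
import torch
import torch.nn as nn

class BasisRelationWeights(nn.Module):
    def __init__(self, num_relations: int, B: int, d_in: int, d_out: int):
        super().__init__()
        self.V = nn.Parameter(torch.randn(B, d_out, d_in))    # shared basis matrices V_b
        self.a = nn.Parameter(torch.randn(num_relations, B))  # importance weights a_rb

    def weight_for_relation(self, r: int) -> torch.Tensor:
        # W_r = sum_b a_rb * V_b
        return torch.einsum("b,bij->ij", self.a[r], self.V)

# Usage: 1000 relations share only 10 basis matrices; each relation adds 10 scalars.
weights = BasisRelationWeights(num_relations=1000, B=10, d_in=16, d_out=16)
print(weights.weight_for_relation(42).shape)  # torch.Size([16, 16])
```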


Having talked about making RGCNs scalable using block diagonal matrices and basis learning, let us now talk about how one would define the various prediction tasks on heterogeneous graphs.


7. Entity / Node Classification using RGCNs

In the case of prediction tasks involving node/entity classification, things are not very different from the homogeneous case. The RGCN computes a representation (an embedding) for every node, and on top of that embedding a prediction head is created.


For example, if we want to classify the nodes into “k” different classes, the final prediction head involves a linear transformation of the embedding vector – that is, taking the embedding vector of the final layer, multiplying it by a weight matrix, passing it through a non-linearity/activation function and then through a Softmax in order to predict the probability of each class. The output is a “k”-way output for the “k” different classes; a minimal sketch of such a head follows the figure below.


Figure 12: Node Classification task – each prediction head involves transformation of the embedding using the weight matrix and application of non-linearity
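The following is a minimal sketch (PyTorch, with assumed names and shapes – not code from the article) of a k-class node-classification head applied on top of the final-layer RGCN embeddings.

```python
import torch
import torch.nn as nn

class NodeClassificationHead(nn.Module):
    def __init__(self, d_embed: int, num_classes: int):
        super().__init__()
        self.linear = nn.Linear(d_embed, num_classes)  # weight matrix of the head

    def forward(self, h_final: torch.Tensor) -> torch.Tensor:
        # h_final: [num_nodes, d_embed] final-layer node embeddings from the RGCN
        logits = self.linear(h_final)
        return torch.softmax(logits, dim=-1)           # class probabilities per node

# Usage: 6 nodes, 16-dimensional embeddings, 3 classes.
head = NodeClassificationHead(d_embed=16, num_classes=3)
print(head(torch.randn(6, 16)).shape)  # torch.Size([6, 3])
```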


8. Entity / Link Prediction


Stratification – Splitting edges into different types for training/validation and test

For link prediction, we approach things a bit differently. Links/edges may have different relationship types, and of these relationship types some may be common and some may be rare. For every relationship type, we split its edges into:


  • Training Message Edges


  • Training Supervision Edges


  • Validation Edges


  • Test Edges

We do this for all the relationship types and then merge all the edges – that is, we independently split the edges of each relationship type and then merge the splits. This ensures that even for rare/infrequent relation types, some instances end up in training, some in validation and some in the test set. This independent splitting of the edges of each relationship type is called stratification.

In the figure below, the original graph is shown on the left-hand side, and the edges are split into training message edges, training supervision edges, validation edges and test edges, shown on the right-hand side in unique colours. The splitting has to be done for every relation type because the frequency of edges of a specific relation type across the whole graph may be higher or lower. Once the splitting is done, the training message edges, training supervision edges, validation edges and test edges are merged, as shown in Figure 14; a small sketch of such a per-relation split follows Figure 14.


Figure 13: The original graph and edges split into training message, training supervision, validation and test edges



Figure 14: Link Prediction Split
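The following is a minimal sketch (plain Python, with assumed split fractions and function names – not code from the article) of a stratified edge split: edges of each relation type are split independently, and the per-relation splits are then merged.

```python
import random
from collections import defaultdict

def stratified_edge_split(edges, fractions=(0.6, 0.2, 0.1, 0.1), seed=0):
    """edges: list of (head, relation, tail) triples. Returns dict of split -> edges."""
    rng = random.Random(seed)
    by_relation = defaultdict(list)
    for e in edges:
        by_relation[e[1]].append(e)

    splits = {"train_message": [], "train_supervision": [], "valid": [], "test": []}
    for rel_edges in by_relation.values():   # split each relation type independently
        rng.shuffle(rel_edges)
        n = len(rel_edges)
        c1 = int(fractions[0] * n)
        c2 = c1 + int(fractions[1] * n)
        c3 = c2 + int(fractions[2] * n)
        splits["train_message"] += rel_edges[:c1]
        splits["train_supervision"] += rel_edges[c1:c2]
        splits["valid"] += rel_edges[c2:c3]
        splits["test"] += rel_edges[c3:]
    return splits
```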



Formalizing link prediction in Heterogeneous Graphs:

Let us now see how we can formalize link prediction in Heterogeneous Graphs. Consider the graph below:


Figure 15: Example Input Graph



Imagine that we want to predict the probability of the edge between nodes E and A, where that edge is of relation type r3, as shown in the figure above. Let us assume that this edge is a training supervision edge and that all other edges are training message edges.


The way we would like the RGCN to score this edge is as follows: we take the final-layer embedding of node E and the final-layer embedding of node A, and apply a relation-specific scoring function “f” that takes these embeddings and transforms them into a real value.


$$f_{r_3}(h_E, h_A) = h_E^{\top}\, W_{r_3}\, h_A$$

Equation: Relation-specific scoring function for the link from E to A


That is, we take the embedding of node E, apply a linear transformation by multiplying it with the relation-specific weight matrix, take the dot product with the embedding of A, and pass the resulting score through an activation function (typically a sigmoid) to obtain a probability.


The probability obtained indicates whether there is a link between nodes E and A. This is one way of formalizing the problem; a minimal sketch of such a scoring function is shown below. Let us now look at the training process.
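The following is a minimal sketch (PyTorch, with assumed names and shapes – an illustration of one possible bilinear scorer, not necessarily the exact form used in the article’s figures) of a relation-specific scoring function f_r(h_head, h_tail) = h_headᵀ W_r h_tail for a candidate edge such as (E, r3, A).

```python
import torch
import torch.nn as nn

class RelationScorer(nn.Module):
    def __init__(self, num_relations: int, d_embed: int):
        super().__init__()
        # One scoring matrix W_r per relation type.
        self.W = nn.Parameter(torch.randn(num_relations, d_embed, d_embed))

    def forward(self, h_head: torch.Tensor, r: int, h_tail: torch.Tensor) -> torch.Tensor:
        return h_head @ self.W[r] @ h_tail   # raw real-valued score f_r

# Usage: final-layer embeddings of E and A, relation index 2 standing for r3.
scorer = RelationScorer(num_relations=3, d_embed=16)
h_E, h_A = torch.randn(16), torch.randn(16)
prob = torch.sigmoid(scorer(h_E, 2, h_A))  # probability that (E, r3, A) exists
print(float(prob))
```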


9. RGCN for Entity / Link Prediction: Training Process Summary


Consider the Graph as shown in the figure below:


Figure 16: Example Input Graph


a) Let us assume that the edge from E to A is the supervision edge.

b) Let us assume that all other edges, shown by solid lines in Figure 16 above, are training message edges.

c) We are going to use the training message edges to predict the likelihood of the existence of the training supervision edge – that is, the edge from E to A in this example.

d) Another thing that we have to do as part of link prediction is to create negative instances – that is, we have to create negative edges.

e) One way of creating a negative edge is by perturbing the supervision edge.


For example, in the graph of Figure 16, if we have the supervision edge from E to A, one way to create a negative edge is to corrupt the tail of this edge: we keep the head at E but corrupt the tail, so that the negative edge points from E to B rather than from E to A.

f) It should be noted that a negative edge must not belong to the training message edges or the training supervision edges – for example, the edge from node E to C already exists in the graph (it is a training message edge in this example), so it cannot be used as a negative edge. Therefore, we have to be careful when sampling negative edges. The RGCN is used to score both the positive and the negative edges.

g) The loss function that we want to optimize is the standard (binary) cross-entropy loss, where we want to maximise the score of the training supervision edge (the positive edge) and minimize the score of the negative edge. Therefore, we write down the loss as below:


$$\ell = -\log \sigma\!\left(f_{r}(h_E, h_A)\right) \;-\; \log\!\left(1 - \sigma\!\left(f_{r}(h_E, h_{v_{\text{neg}}})\right)\right)$$

Equation: Cross-entropy loss function – maximizing the score of the training supervision edge and minimizing the score of the negative edge


We then use an optimizer such as stochastic gradient descent to optimize the parameters of the RGCN so that it assigns a high probability score to the training supervision edges and a low score to the negative edges. A minimal sketch of one such training step is shown below.
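The following is a minimal sketch (PyTorch, with assumed names and illustrative node indices – not code from the article) of one link-prediction training step: score the positive supervision edge and a tail-corrupted negative edge with a relation-specific scoring function, then apply the binary cross-entropy objective.

```python
import torch
import torch.nn.functional as F

# Node embeddings from the RGCN (6 nodes A..F, 16 dims) and one scoring matrix per relation.
h = torch.randn(6, 16)
W = torch.randn(3, 16, 16, requires_grad=True)   # W_r for relations r1, r2, r3

def score(head: int, r: int, tail: int) -> torch.Tensor:
    return h[head] @ W[r] @ h[tail]              # f_r(h_head, h_tail)

# Supervision edge (E, r3, A) and a corrupted negative edge (E, r3, B); indices are illustrative.
E, A, B = 4, 0, 1
pos_score = score(E, 2, A)                       # should be pushed towards 1
neg_score = score(E, 2, B)                       # should be pushed towards 0
loss = (F.binary_cross_entropy_with_logits(pos_score, torch.tensor(1.0))
        + F.binary_cross_entropy_with_logits(neg_score, torch.tensor(0.0)))
loss.backward()   # gradients flow into W (and into the RGCN in a full pipeline)
print(float(loss))
```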


10. RGCN for Entity / Link Prediction: Evaluation Process

Having trained the model, we move on to validation – to validate the performance of the model.

Let us assume that the validation edge we are interested in is between node E and node D, as shown below. We are interested in knowing whether this edge is of type r3.


Figure 17: Evaluation Process in RGCNs – validation edge of relation type “r3” from E to D



We are interested in knowing whether this edge is of type r3. Now, the training message edges and training supervision edges are used for message propagation, and we have to score the edge from E to D as shown in Figure 17 above. The score of this edge from E to D should be higher than the scores of all the negative edges.

The evaluation process will be as follows:


  • Use the RGCN to calculate the score of the edge (E, r3, D).


  • Calculate the scores of all the negative edges (e.g. E to B and E to F).


  • Obtain the scores of all the edges and rank them. We expect the edge from E to D to rank above the edges from E to F and E to B.

  • Calculate metrics such as Hits@k: the fraction of cases in which the rank of the positive edge is ≤ k. Higher is better, denoting how often the positive edge was ranked above the negative edges. A minimal sketch of this ranking-based evaluation follows this list.
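The following is a minimal sketch (plain Python, with assumed scores and names – not code from the article) of ranking-based evaluation for one validation edge: rank the positive edge’s score against its negative edges and compute Hits@k.

```python
def hits_at_k(pos_score: float, neg_scores: list, k: int) -> float:
    # Rank of the positive edge among {positive} plus its negatives (1 = best).
    rank = 1 + sum(ns >= pos_score for ns in neg_scores)
    return 1.0 if rank <= k else 0.0

# Scores from the RGCN scorer: positive edge (E, r3, D) vs negatives (E, r3, B), (E, r3, F).
pos = 2.3
negs = [0.7, 1.1]
print(hits_at_k(pos, negs, k=1))  # 1.0 – the positive edge was ranked first
```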



11. Summary of this article:


This article covered the following aspects of RGCNs:


  • What are Heterogeneous Graphs


  • Knowledge Graphs and Heterogeneous Graphs


  • Simple Graph Convolutional Networks (GCNs) vs Relational Graph Convolutional Networks (RGCNs)


  • The overfitting problem in RGCNs


  • The Block Diagonal and Basis Matrices approach to address overfitting in RGCNs


  • Solving a Node Classification problem in Heterogeneous Graphs with RGCNs


  • Solving Link Prediction problems in Heterogeneous Graphs with RGCNs
