Deep Learning with Graphs: Part 9 of my Graph Series of blogs
Ajay Taneja
Senior Data Engineer | Generative AI Engineer at Jaguar Land Rover | Ex - Rolls-Royce | Data Engineering, Data Science, Finite Element Methods Development, Stress Analysis, Fatigue and Fracture Mechanics
1. Introduction:
This is the continuation of my series of blogs on Graphs. Having discussed a detailed overview of Graph Neural Networks in my preceding article, An Introduction to Graph Neural Networks: Part 8 of my Graph Series, as well as the foundational principles of Deep Learning, it is now time to go deeper into the concepts involving Graph Neural Networks, and this is the focus of this article. The overall goal is to generalise the architecture of neural networks so that it is applicable to Graphs.
The main focus of this article is on understanding the model parameters and the training process of a Graph Neural Network, and the possibilities of training a Graph Neural Network in a supervised or an unsupervised fashion.
With the above in mind, this article is organized as follows:
Section 2: Overall setup of a Graph Network
Section 3: A naive approach to applying Deep Neural Networks to Graphs
Section 4: Graph Neural Networks as a generalization of Convolutional Neural Networks
Section 5: How Graph Convolutional Neural Networks actually work
Section 6: Example of calculations in the computation graph of a node
Section 7: Layers in a Graph Neural Network
Section 8: Neighbourhood aggregation and transformations
Section 9: Mathematics in a Graph Neural Network computation
Section 10: Training a Graph Convolutional Network model
Section 11: Conclusion
2. Overall Setup of a Graph Network:
Let us assume we have a Graph “G” that comprises:
a) A set of vertices / nodes
b) An adjacency matrix representing the Graph
I discussed the adjacency matrix in section 7 of Part 1 of this ongoing Graph Series: An Introduction to Graph Networks. For simplicity, let us assume that the adjacency matrix is binary, that is, all the edges are “unweighted” and we are talking about undirected Graphs.
Let us also assume that every node has a node feature vector X associated with it; node features may correspond to, for instance, user profile attributes in a social network or gene expression profiles in a biological network.
Let us assume that N(v) denotes the set of neighbours of a given node v.
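To make this setup concrete, here is a minimal sketch in Python/NumPy; the graph, the feature dimensions and the variable names are illustrative assumptions rather than the graph shown in the figures.

```python
import numpy as np

# Undirected, unweighted example graph with 4 nodes (0..3), given by its binary adjacency matrix A
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

# Node feature matrix X: one feature vector per node (3 features per node, chosen arbitrarily)
X = np.array([
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.5],
    [1.0, 1.0, 0.0],
    [0.5, 0.5, 1.0],
])

def neighbours(A, v):
    """Return N(v): the indices of the neighbours of node v."""
    return np.nonzero(A[v])[0]

print(neighbours(A, 2))   # -> [0 1 3]
```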
3. A naive approach to applying Deep Neural Networks to Graphs:
Having discussed the adjacency matrix in Part 1 of my Graph series of blogs, a naive, simple approach to applying Deep Neural Networks might be to represent the Graph through its adjacency matrix and to append the corresponding node feature vector to every node (that is, to append each row of the adjacency matrix with that node's features), as shown in Figure 3 below.
Each row of the adjacency matrix may then be considered one training example, and the total number of columns (adjacency entries plus appended node features) corresponds to the number of input features of the model. One could then think of feeding this information to a neural network as shown in the figure below:
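A minimal sketch of this naive idea, reusing A and X from the sketch in section 2 (the layer sizes and random weights are arbitrary assumptions, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# One training example per node: the node's adjacency row concatenated with its feature vector
inputs = np.concatenate([A, X], axis=1)      # shape: (num_nodes, num_nodes + num_features)

# A tiny two-layer MLP applied to each row independently
W1 = rng.normal(size=(inputs.shape[1], 8))
W2 = rng.normal(size=(8, 2))                 # e.g. 2 output classes

hidden = np.maximum(inputs @ W1, 0.0)        # ReLU
logits = hidden @ W2                         # one prediction per node
```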
However, although on the face of it the approach might look logical, there are several problems associated with it:
a) The number of model parameters grows with the number of nodes, since every adjacency row of length |V| becomes part of the input.
b) The model is tied to a graph of a fixed size, so it cannot be applied to graphs with a different number of nodes.
c) The output is sensitive to the (arbitrary) ordering of the nodes in the adjacency matrix, i.e. the model is not permutation invariant.
4. Graph Neural Networks: A Generalization of Convolutional Neural Networks
Graph Neural Networks essentially borrow the idea from Convolutional Neural Networks (CNNs) and generalize it to Graphs. Let us understand this important aspect of Deep Learning with Graphs!
An overview of convolutional neural networks and the convolution operation:
Let us say we have an image, which is a grid of pixels. In convolutional neural networks, we define a convolution operation that slides a filter over the image in a series of steps so as to cover the whole image, as shown in the figure below. In each step, we carry out some aggregation over the pixel values under the filter; at every step of the sliding window we get a new pixel value, and after a series of steps we get a new image of a different size. The process is repeated through a sequence of convolutional layers, which together form a convolutional neural network, as shown in the figure below:
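As an illustration of this sliding-window aggregation, here is a minimal single-channel 2D convolution sketch (Python/NumPy; the image and the averaging kernel are arbitrary assumptions):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image; at each step, aggregate the window into one new pixel."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = image[i:i + kH, j:j + kW]      # the pixels under the sliding window
            out[i, j] = np.sum(window * kernel)     # aggregation -> one new pixel value
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0                      # simple 3 x 3 averaging filter
print(conv2d(image, kernel).shape)                  # (3, 3): a new, smaller image
```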
In Graph Networks, the goal is to generalize the notion of the convolution operation beyond simple lattices or grids of pixels to arbitrarily shaped topologies, as in graphs, and also to leverage the node features / attributes described in section 2 above.
4.1 Images vs Graphs:
Contrary to images, which are represented as a simple grid of pixels, a graph has a highly complex topology, as shown in the figure below. Defining a notion of a sliding window, as in Convolutional Neural Networks, that wraps a section of pixels is not feasible, because there is no fixed/regular topology in Graphs, as represented in the figure above.
In the case of images, a single convolutional layer takes a 3 x 3 filter, applies a transformation and creates a new pixel, as shown in the figure below. We then slide the filter in steps across the whole image, performing the same aggregation at each position. The process continues until the whole image is covered, and it is repeated across the different convolutional layers.
Convolutional Operation in Graphs:
We do something similar in Graphs to the convolution operation in images. In a Graph, if we want to apply the convolution operator in terms of a sliding window, the centre of the sliding window is a node, as shown in the figure below. It is this centre node that aggregates information from its neighbours.
In the case of an image, we can think of the centre as the pixel at the centre of the 3 x 3 filter and the neighbours as the rest of the pixels covered by the filter.
Therefore, the basic concept to be understood is what the convolution operators are really doing in images: they are transforming the information from the neighbouring pixels, combining (aggregating) it and creating a new image. The same operation happens in Graphs: the convolution operators collect the information from the neighbouring nodes, aggregate it and create a new embedding / message.
5. How do Graph Convolutional Neural Networks actually work?
In Graph Convolutional Neural Networks, the idea is that a Node’s neighbourhood defines its neural network architecture. That is: the structure of the Graph around the node of interest defines the structure of the neural network.
Thus, if we want to make a prediction for the node “i”, marked in the figure below, then node “i” will take information from its neighbouring nodes, and those neighbours in turn take information from their own neighbours; in this way, the information is propagated and transformed along the edges of the network.
Thus, one can think of computation in a Graph Neural Network as a 2-step process:
a) Determine the computation graph of the node of interest based on its local neighbourhood.
b) Propagate and transform the information along this computation graph to produce the node's embedding.
The computation graph defines the architecture or the structure of the underlying neural network. In this fashion, the network learns as the information about the Graph structure is propagated, finally producing the node embedding. It should be underscored that a computation graph, as highlighted in Figure 9/10 above, has to be formulated for every node in the Graph in order to learn its embedding.
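As a small illustration of how a node's computation graph is determined by its neighbourhood, the sketch below expands the neighbourhood of a target node hop by hop (Python, reusing A and neighbours() from the section 2 sketch; the function name and the depth are assumptions for illustration):

```python
def computation_graph_layers(A, target, num_hops):
    """For hops k = 0..num_hops, return the set of nodes that feed into the
    target node's embedding at that depth of its computation graph."""
    layers = [{target}]                                        # hop 0: the target node itself
    for _ in range(num_hops):
        frontier = set()
        for v in layers[-1]:
            frontier.update(int(u) for u in neighbours(A, v))  # expand one hop further
        layers.append(frontier)
    return layers

# For the 4-node example graph above, a 2-layer GNN for node 0 draws on:
print(computation_graph_layers(A, 0, num_hops=2))
# -> [{0}, {1, 2}, {0, 1, 2, 3}]
```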
In the next section, let us look at the details of the calculations in the computation graph of a node to understand the process better.
6. Example of calculations in a Computation Graph of a Node
Let us go deeper into the explanation and discussion of section 4 and section 5 through an example. Consider an example graph with 6 nodes as shown in the Figure below:
The goal of the problem is to generate the embedding of a node based on the local neighbourhood of the target node under consideration. Therefore, consider node A as the target node. The structure of the neural network on which the computations are to be made in order to make the prediction for node A is shown in Figure 12/13 below.
The above architecture may be justified as follows: each level of the computation graph aggregates information from exactly those nodes that are neighbours, in the example graph, of the node at the level above it; the target node A sits at the root, its neighbours form the level below it, and their neighbours in turn form the next level.
Next, if the above (Figure 12/13) is the structure of the network, then each box in it represents a trainable transformation that aggregates the incoming messages, and the information flows from the leaves of the computation graph up towards the target node A.
The main point to understand about the difference between Graph Neural Networks and vanilla neural networks is that in Graph Neural Networks each node has its own neural network architecture. Every node in the Graph gets to define its own computation graph based on the network around its neighbourhood, as shown in the Figure below:
The computation graph of each of the nodes A, B, C, D, E and F, based on their local neighbourhoods, is shown in Figure 14 above. Depending upon its local neighbourhood, the neural network architecture of a node may be long and narrow, as for node D, or wider, as for node C or node A, as marked in the same Figure. The most important fact to be understood is that every node in the graph has its own computation graph and hence its own neural network architecture.
Therefore, the training of a Graph Neural Network involves the training of multiple neural networks, one per node, carried out simultaneously, with the structure of each neural network determined by the corresponding node's local neighbourhood.
7. Layers in a Graph Neural Network
The neural network corresponding to a node in the graph can be of arbitrary depth, as shown in Figure 15 below:
One can create an arbitrary number of layers. Nodes have an embedding at each layer, and the embedding at Layer 0 is initialized as the node's input feature vector X, as shown in Figure 15 above. Therefore, at Layer 0 the embeddings of nodes A and C are simply the feature vectors of nodes A and C, and the Layer 1 embedding is the aggregation of the feature vectors of nodes A and C plus the node's own Layer 0 embedding.
Thus, for a local neighbourhood that is K hops wide, we need a K-layer neural network: the Layer-K embedding gets information from nodes that are K hops away.
8. Neighbourhood Aggregation and Transformations
Having defined, in sections 5 through 7, the notion of how to create a computation graph per node based on its local neighbourhood, let us next talk about the transformations that happen in the neural network.
Having drawn the computation graph corresponding to node A in the graph, as shown in Figure 15, the question now arises: what is happening inside the boxes shown in the Figure below? Thus, we have to see how we define the transformation, how we parametrise it and how we learn its parameters.
An important point to note is that, because the ordering of the nodes in a graph is arbitrary, the aggregation operator in a graph has to be permutation invariant. That is, in whichever order you list the nodes, the aggregation should give the same result. Thus, the neighbourhood aggregation function has to be permutation (order) invariant.
The basic approach is to average the information from the neighbours and apply a neural network. Since the average/sum is permutation invariant, it can be used as the form of aggregation in the input blocks shown in Figure 17 below. Once the messages are aggregated, we apply the neural network corresponding to the block, as shown in Figure 16/17 above. Applying the neural network means that we apply a linear transformation followed by a non-linearity and create the next-level message. This is what is happening in the boxes of Figure 17 referenced above. The math of the transformation and the application of non-linearities/activations is discussed in the subsequent section (Section 9).
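A tiny sketch illustrating why the mean is a permutation-invariant aggregator (Python/NumPy; the message vectors are arbitrary assumptions):

```python
import numpy as np

# Three neighbour messages; their order is arbitrary
messages = [np.array([1.0, 0.0]), np.array([0.0, 2.0]), np.array([3.0, 1.0])]

agg_forward  = np.mean(messages, axis=0)         # aggregate in the given order
agg_reversed = np.mean(messages[::-1], axis=0)   # aggregate in the reversed order

assert np.allclose(agg_forward, agg_reversed)    # same result: the mean ignores ordering
```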
The key distinction between different graph neural network architectures lies in how the aggregation is done, that is, how the information from the children nodes is aggregated and combined with the information from the parent.
9. Mathematics in a Graph Neural Network Computation
As mentioned in section 8, the basic approach is to average the messages coming from the children, that is, from the neighbours of the target node, and apply a neural network to them. Let us understand the mathematics involved here, which is fundamentally no different from the mathematics of the operations in conventional/vanilla neural networks.
Consider the notation $h_v^{(l)}$, where $h$ is the embedding, the subscript denotes the node, and the superscript denotes the layer of the neural network. As mentioned in the preceding sections, at the 0th layer the embedding of node $v$ is simply its feature representation, denoted $x_v$.
Thus, $h_v^{(0)} = x_v$: the 0th-layer node embedding is given by the node's feature representation.
Next, we are going to create the higher order embeddings at the subsequent layers.
So, at Layer-1, we will have:
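A standard way to write this update is the mean-aggregation form used in the GCN/GraphSAGE family of models; the sketch below is consistent with the matrices W and B discussed in section 10, though the original figure may present it slightly differently. For layers $l = 0, 1, \dots, K-1$:

$$h_v^{(l+1)} = \sigma\!\left( W_l \,\frac{1}{|N(v)|} \sum_{u \in N(v)} h_u^{(l)} \;+\; B_l\, h_v^{(l)} \right), \qquad z_v = h_v^{(K)}$$

so the Layer-1 embedding is the case $l = 0$, with $h_u^{(0)} = x_u$. Here $N(v)$ is the set of neighbours of $v$, $W_l$ and $B_l$ are the trainable weight matrices of layer $l$, $\sigma$ is a non-linearity such as ReLU, and $z_v$ is the final embedding of node $v$ after $K$ layers.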
This is what we essentially do in the above math:
a) Average the previous-layer embeddings $h_u^{(l)}$ of the neighbours $u \in N(v)$.
b) Transform the averaged message with the weight matrix $W_l$, and transform the node's own previous embedding $h_v^{(l)}$ with $B_l$.
c) Add the two together and apply a non-linearity to obtain the next-layer embedding $h_v^{(l+1)}$.
This forms one layer of the neural network. It can be repeated over several layers, and whatever the final hidden representation is becomes the final embedding of the node. These are the same kinds of steps that take place in the hidden layers of a conventional neural network.
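A minimal sketch of one such layer applied to all nodes at once, reusing A and X from the earlier sketches (the sizes and the random initialisation are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
in_dim, out_dim = X.shape[1], 4

W = rng.normal(scale=0.1, size=(in_dim, out_dim))    # transforms the averaged neighbour message
B = rng.normal(scale=0.1, size=(in_dim, out_dim))    # transforms the node's own previous embedding

def gnn_layer(A, H, W, B):
    """One mean-aggregation GNN layer; H holds the previous-layer embeddings, one row per node."""
    deg = A.sum(axis=1, keepdims=True)                # |N(v)| for each node
    neighbour_mean = (A @ H) / np.maximum(deg, 1.0)   # average of the neighbours' embeddings
    return np.maximum(neighbour_mean @ W + H @ B, 0)  # linear transform + ReLU non-linearity

H0 = X                         # layer-0 embeddings are the raw node features
H1 = gnn_layer(A, H0, W, B)    # layer-1 embeddings, shape (num_nodes, out_dim)
```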
10. Training a Graph Convolutional Network Model
Next, we will have to train the Graph Convolutional Network model where every node has its own computational architecture as shown below:
We feed the parameters (as discussed in section 9, the matrices W and B) into stochastic gradient descent, and the parameters are trained in accordance with the loss function below:
Considering the equation above, the parameters to be trained are:
a) The weight matrices $W_l$, which transform the aggregated messages coming from the neighbours at each layer $l$.
b) The weight matrices $B_l$, which transform the node's own embedding from the previous layer.
An important point to note is that all the nodes in the network use the same transformation matrices $W_l$ and $B_l$, which is what enables embeddings to be generated for unseen nodes in the Graph, as discussed in section 11.
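A minimal end-to-end training sketch (Python/PyTorch; the two-layer architecture, the learning rate and the binary node labels are hypothetical assumptions, made purely to show shared $W_l$, $B_l$ matrices being updated by stochastic gradient descent):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

A_t = torch.tensor(A, dtype=torch.float32)                 # adjacency matrix from section 2
X_t = torch.tensor(X, dtype=torch.float32)                 # node features from section 2
y   = torch.tensor([0.0, 1.0, 1.0, 0.0])                   # hypothetical binary node labels

deg = A_t.sum(dim=1, keepdim=True).clamp(min=1.0)

# Shared, trainable transformation matrices: the same W and B are used for every node
W0 = torch.nn.Parameter(0.1 * torch.randn(X_t.shape[1], 8))
B0 = torch.nn.Parameter(0.1 * torch.randn(X_t.shape[1], 8))
W1 = torch.nn.Parameter(0.1 * torch.randn(8, 1))
B1 = torch.nn.Parameter(0.1 * torch.randn(8, 1))

opt = torch.optim.SGD([W0, B0, W1, B1], lr=0.1)

for epoch in range(100):
    H1 = torch.relu((A_t @ X_t) / deg @ W0 + X_t @ B0)        # layer 1
    logits = ((A_t @ H1) / deg @ W1 + H1 @ B1).squeeze(-1)    # layer 2: one logit per node
    loss = F.binary_cross_entropy_with_logits(logits, y)      # supervised loss (section 10.1)
    opt.zero_grad()
    loss.backward()
    opt.step()
```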
10.1 Supervised or Unsupervised setting during training:
The node embeddings $z_v$ are a function of the input graph, and the training can happen in a supervised or an unsupervised setting. In the supervised setting, where the node labels are available, we formulate a loss function and make predictions, minimizing the discrepancy between the prediction and the ground truth.
In an unsupervised setting, where node labels aren't available, we determine the similarity of nodes in the network (normally using Jaccard similarity, as described in section XX of Part Y of this Graph series). The loss function is formulated from the discrepancy between the similarity in the network and the similarity in the embedding space, the latter expressed as the dot product of the node embeddings.
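One common way of writing such an unsupervised objective (a sketch only; the article's original figure may use a different but equivalent formulation) is:

$$\mathcal{L} = \sum_{u,\,v} \mathrm{CE}\big(s_{u,v},\; \sigma(z_u^{\top} z_v)\big)$$

where $s_{u,v}$ is the similarity of nodes $u$ and $v$ in the graph (for example, their Jaccard similarity), $z_u^{\top} z_v$ is their similarity in the embedding space, $\sigma$ is the sigmoid function and $\mathrm{CE}$ denotes the cross-entropy between the two.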
As in other problems, for supervised training we define a cross-entropy loss for classification, as shown in the equation below:
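A sketch of the standard binary formulation, under the assumption of binary node labels (the original figure may present it slightly differently):

$$\mathcal{L} = -\sum_{v \in V} \Big( y_v \,\log \sigma(z_v^{\top}\theta) \;+\; (1 - y_v)\,\log\big(1 - \sigma(z_v^{\top}\theta)\big) \Big)$$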
In the above equation, $y_v$ is the (binary) label of node $v$, $z_v$ is the embedding of node $v$ produced by the Graph Neural Network, $\theta$ is the classification weight vector, and $\sigma$ is the sigmoid function.
11. Conclusion: Synopsis of the discussions in sections 1 through 10
From the discussions in sections 1 through 10, we can now think of the model design as follows:
Given a graph:
We would like to compute the embedding of the target node. We construct the computation graphs for the nodes in the graph as shown in the figure below:
We define the neighbourhood aggregation function and a loss function, such as the cross-entropy, and train the model. Normally, for training, we select a batch of nodes, as shown in the figure below.
Application of the trained model to “unseen” nodes:
Once the model is trained, we can apply it to “unseen” nodes, that is, nodes it was never trained on. This means that the model has inductive capability, unlike node embeddings trained with shallow methods such as a skip-gram model. This is possible because the same parameters, i.e. W and B (as mentioned in section 10), are shared across all nodes, thus allowing graph neural networks to generalize to “unseen” nodes. For example, in biomedical networks we can train the model on one organism (one graph network) and use it to predict on another organism (another network), as shown in the figure below:
Prediction on unseen nodes is very useful for networks that evolve, because you can take a snapshot of the network, create a computation graph, run a forward pass and predict the embedding of the unseen nodes.
Thus, sections 1 through 10 provide a detailed description of Graph Neural Networks: the difference between vanilla neural networks and Graph Neural Networks, the comparison between convolutional neural networks and graph neural networks, and the loss function and training process in GNNs.
My subsequent blogs will focus on different architectures of Graph Networks, Knowledge Graphs, etc!