Deep Learning with Graphs: Part 9 of my Graph Series of blogs


1. Introduction:

This is the continuation of my series of blogs on Graphs. Having given a detailed overview of Graph Neural Networks in the preceding article, An Introduction to Graph Neural Networks: Part 8 of my Graph Series, and the foundational principles of Deep Learning in a separate article, it is now time to go deeper into the concepts behind Graph Neural Networks, and that is the focus of this article. The overall goal is to generalise the architecture of neural networks so that they can be applied to Graphs.

The main content of this article is focussed on understanding:

  • The local neighbourhood of a node in a Graph.
  • The computation graph of each node in the network.
  • What constitutes a single layer of a Graph Neural Network.
  • How multiple layers can be stacked to form the full neural network.

The article also describes the model parameters and the training process of a Graph Neural Network, and how a Graph Neural Network can be trained in both a supervised and an unsupervised fashion.


Figure 1: Computation Graph for a node in the network


With the above in mind, this article is organized as follows:

  • Section 2 provides an overview of the composition of a graph, its numerical representation and its architecture in terms of being directed or undirected.
  • Section 3 explains why the classical neural network approach, in terms of the architecture/structure of the network, is not suitable for graphs.
  • Section 4 describes how a graph neural network borrows its conceptual idea from convolutional neural networks, and explains the similarity between convolutional operations in a convolutional neural network and convolutional operations in a graph neural network. The section also emphasizes why it is more challenging to process a graph than to process images or text. This section is the highlight of this blog!
  • Sections 5 and 6 explain the working of a graph convolutional neural network along with the concept of a computation graph and its architecture.
  • Sections 7, 8 and 9 are about layers in a graph convolutional neural network, the aggregation and transformation operations in these layers and the mathematics involved in the calculations.
  • Finally, we have a concluding section that summarizes the whole article!



2. Overall Setup of a Graph Network:

Let us assume we have a Graph “G” that comprises:

a) A set of Vertices / Nodes

b) An Adjacency matrix representing the Graph.


I discussed the Adjacency Matrix in section 7 of Part 1 of this ongoing Graph Series: An Introduction to Graph Networks. For simplicity, let us assume that the Adjacency matrix is binary, that is, all the edges are “unweighted”, and that we are dealing with Undirected Graphs.

Let us also assume that every node has a node feature vector X associated with it. Node features may correspond to:

  • a user profile or user image in a social network graph, or
  • gene functional information or a gene expression profile in a biomedical network.
  • In cases where there are no node features in a network and the emphasis is only on the structure of the Graph, Data Scientists often use node degrees (I have discussed the concept of node degree in section 5 of Part 1 of this series) or a one-hot encoding of the node as node features.


Let us assume that N(v) denotes the set of neighbours of a given node v.
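
To make this setup concrete, here is a minimal NumPy sketch of a binary adjacency matrix, a node feature matrix X and a helper returning N(v). The toy graph, its size and the feature dimension are invented purely for illustration.

```python
import numpy as np

# A small undirected, unweighted graph on nodes 0..4: A is binary and symmetric.
A = np.array([[0, 1, 1, 1, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 1],
              [1, 0, 0, 0, 0],
              [0, 0, 1, 0, 0]])

# One feature vector per node (here 3 features per node, values chosen arbitrarily).
X = np.random.rand(A.shape[0], 3)

def neighbours(A, v):
    """Return N(v): the indices of the nodes adjacent to node v."""
    return np.nonzero(A[v])[0]

print(neighbours(A, 0))   # [1 2 3]
```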


Figure 2: A Graph with set of nodes “v”


3. Naive approach to applying Deep Neural Networks to Graphs:

Having discussed the Adjacency matrix in Part 1 of my Graph series of blogs, a naive, simple approach to applying Deep Neural Networks might be to represent the Graph through its Adjacency matrix and append the node features to every node (that is, append the node features to each row of the Adjacency matrix) as shown in Figure 3 below.

Each row of this augmented Adjacency matrix may be considered as one training example, and its total number of columns corresponds to the number of input features of the model. One could then think of feeding this information to a neural network as shown in the figure below:


Figure 3: Each row of the Adjacency matrix has corresponding node features appended and considered as a single training example.



Figure 4: Feeding each training example into a neural network.


Although on the face of it the approach might look logical, there are several problems associated with it:

  1. Firstly, the number of parameters in the model will be of the order of the number of nodes plus the number of node features, whereas the number of training examples equals the number of nodes. The number of parameters to be learnt therefore exceeds the number of training examples, making training unstable and prone to overfitting.
  2. The approach cannot be applied to Graphs of different sizes. For example, if we have a Graph of seven nodes rather than the Graph of five nodes shown in the example above, it becomes unclear how to apply the same model.
  3. The approach is sensitive to node ordering. In the illustrated example, the input is based on the node ordering A, B, C, D and E, but graphs have no fixed node ordering, so learning should be invariant to node ordering; that is clearly not the case with this approach, as can be seen from Figure 4 above and as the short sketch below demonstrates.
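
In the sketch below, the same toy graph is presented to the naive approach under two different node orderings and the resulting input matrices differ; the graph and feature values are assumptions made for illustration only.

```python
import numpy as np

A = np.array([[0, 1, 1, 1, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 1],
              [1, 0, 0, 0, 0],
              [0, 0, 1, 0, 0]])          # same toy graph as in the section 2 sketch
X = np.random.rand(5, 3)                 # node features

def naive_inputs(A, X):
    """One training example per node: its adjacency row concatenated with its features."""
    return np.concatenate([A, X], axis=1)

perm = np.array([2, 0, 4, 1, 3])         # an arbitrary relabelling of the nodes
A_perm = A[perm][:, perm]                # the same graph under a different node ordering
X_perm = X[perm]

original = naive_inputs(A, X)
relabelled = naive_inputs(A_perm, X_perm)

# Row i of `relabelled` describes node perm[i]; even after re-sorting the rows back,
# the adjacency columns are still permuted, so the inputs differ element-wise.
print(np.array_equal(original, relabelled[np.argsort(perm)]))   # False
```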



4. Graph Neural Networks: A Generalization of Convolutional Neural Networks


Graph Neural Networks essentially borrow the idea from Convolutional Neural Networks (CNNs) and generalize it to Graphs. Let us understand this important aspect of Deep Learning with Graphs!

Overview of convolutional neural networks and convolutional operations:

Let us say we have an image which is a grid of pixels. In convolutional neural networks, we define the convolutional operation which slides over the image in a series of steps to cover the whole image as shown in the figure below:


Figure 5: Convolutional operations on an image in a CNN


In each step, we carry out some aggregation over the pixel values. At every position of the sliding window we obtain a new pixel value, and after a series of steps we obtain a new image, generally of a different size. The process is repeated through a sequence of convolutional layers which together form a convolutional neural network, as shown in the figure below:


Figure 6: Convolutional Neural Network on an Image


In Graph Networks, the goal is to generalize the notion of convolutional operations beyond simple lattices or grids of pixels to arbitrarily shaped topologies such as graphs, and also to leverage node features/attributes as described in section 2 above.

4.1 Images vs Graphs:

Unlike images, which are represented as a simple grid of pixels, a graph network has a highly complex topology, as shown in the figure below:


Figure 7: Complex topology of a Graph Network


Defining a notion of a sliding window, as in Convolutional Neural Networks, that wraps a section of pixels is not feasible here, because Graphs have no fixed/regular topology, as represented in the figure above.

In the case of images, a single convolutional neural network layer takes a 3 x 3 filter, applies a transformation and creates a new pixel, as shown in the figure below:


Figure 8: Single convolutional operation on an image


We then slide the filter in steps across the whole image, performing the same aggregation at each position. The process continues until the whole image is covered, and the overall procedure is repeated across the different convolutional layers.

Convolutional Operation in Graphs:

We do something similar in Graphs to the convolutional operation on images. In a Graph, if we want to apply the convolutional operator in terms of a sliding window, the centre of the sliding window is a node, as shown in the figure below. It is this centre node that aggregates information from its neighbours:


Figure 9: Convolutional operation applied to images and convolutional operation applied to Graphs.

In the case of an image, we can think of the centre as the pixel in the middle of the 3 x 3 filter, and the neighbours as the rest of the pixels in the filter.

Therefore, the basic concept to understand is what the convolutional operators are really doing in images: they transform the information from the neighbouring pixels, combine (aggregate) it and create a new image. The same operation happens in Graphs: the convolutional operators collect the information from the neighbouring nodes, aggregate it and create a new embedding/message.
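
This analogy can be made concrete with a small, hedged sketch: a 3 x 3 mean filter aggregating a pixel's grid neighbourhood versus a mean over a node's neighbour features. Both are plain averages here; real layers additionally apply learned weights. The image and the tiny graph are made up for illustration.

```python
import numpy as np

def image_patch_aggregate(image, i, j):
    """Mean of the 3x3 patch centred on pixel (i, j) -- the CNN-style aggregation."""
    return image[i - 1:i + 2, j - 1:j + 2].mean()

def graph_neighbour_aggregate(A, X, v):
    """Mean of the feature vectors of node v's neighbours -- the GNN-style aggregation."""
    nbrs = np.nonzero(A[v])[0]
    return X[nbrs].mean(axis=0)

image = np.random.rand(5, 5)
print(image_patch_aggregate(image, 2, 2))          # one new "pixel" value

A = np.array([[0, 1, 1],                           # toy 3-node graph, invented for illustration
              [1, 0, 0],
              [1, 0, 0]])
X = np.random.rand(3, 4)
print(graph_neighbour_aggregate(A, X, 0))          # one new "message" for node 0
```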


5. How do Graph Convolutional Neural Networks actually work?

In Graph Convolutional Neural Networks, the idea is that a Node’s neighbourhood defines its neural network architecture. That is: the structure of the Graph around the node of interest defines the structure of the neural network.


Figure 10: Node’s neighbourhood defines the structure of the neural network.


Thus, if we want to make a prediction for the node “i”, marked in the figure below, then node “i” will take information from its neighbouring nodes, and those neighbours in turn take information from their own neighbours; in this way the information is propagated and transformed along the edges of the network.


Figure 10 A: Propagation of information to the target node “i” from its neighbouring hops


Thus, one can think of computation in Graph Neural Network as a 2-step process:


  1. In the first step, we determine the node’s computation graph.
  2. In the second step, we propagate and transform the information along the computation graph.

The computation graph defines the architecture or structure of the underlying neural network. In this fashion, the network learns as the information about the Graph structure is propagated, finally producing the node embedding. It should be underscored that a computation graph, as highlighted in Figures 9/10 above, has to be formulated for every node in the Graph in order to learn its embedding.

In the next section, let us look at the calculation details of the computation graph of a node to understand the process better.

6. Example of calculations in a Computation Graph of a Node

Let us go deeper into the explanation and discussion of section 4 and section 5 through an example. Consider an example graph with 6 nodes as shown in the Figure below:


Figure 11: Example input graph of six nodes.


The goal is to generate the embedding of a node based on the local neighbourhood of the target node under consideration. Let us consider node A as the target node. The structure of the neural network on which the computations are carried out in order to make a prediction for node A is shown in Figures 12/13 below.

Figure 12: The Input Graph and the neural network architecture for the node A


Figure 13: The boxes indicate the neural networks where the linear transformations and non-linearities are applied as explained in subsequent paragraphs.


The above architecture may be justified as follows:

  • Node A, as shown in Figures 11/12 above, is going to take information from its neighbours in the network, the neighbours of A being B, C and D. This is one hop of information, and it can be unfolded into multiple hops.

  • If we want to consider another hop, then node D takes information from its neighbour A, node C takes information from A, B, E and F (that is, the neighbours of C), and similarly node B takes information from its neighbours A and C.

Next, if the above (Figure 12/13) is the structure of the network:

  1. The messages (embeddings) from nodes A and C to node B, from nodes A, B, E and F to node C, and from node A to node D will have to be transformed and aggregated at those nodes.
  2. Similarly, the messages from B, C and D will have to be transformed and aggregated before being passed on to A. The transformations are parametrised, and the parameters have to be learnt.

The main point to understand about the difference between Graph Neural Networks and vanilla Neural Networks is that in Graph Neural Networks each node has its own neural network architecture. Every node in the Graph gets to define its own computation graph based on the network around its neighbourhood, as shown in the Figure below:


Figure 14: Each node of the Graph has its own neural network architecture – its own computation graph.


The computation graph of each of the nodes A, B, C, D, E and F is based on its local neighbourhood, as shown in Figure 14 above. Depending on its local neighbourhood, the neural network architecture of a node may be long and narrow, as for node D, or wider, as for node C or node A, as marked in the same Figure. The most important fact to understand is that every node in the graph has its own computation graph and hence its own neural network architecture.
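
As a minimal sketch of this idea, the snippet below unrolls a node's computation graph from an adjacency list, using the example graph of Figures 11/12; the nested (node, children) representation is simply an illustrative choice.

```python
def computation_graph(adj, node, depth):
    """Return the node's computation graph of `depth` hops as a nested (node, children) tree."""
    if depth == 0:
        return (node, [])
    children = [computation_graph(adj, nbr, depth - 1) for nbr in adj[node]]
    return (node, children)

# Adjacency list of the example graph: A-B, A-C, A-D, B-C, C-E, C-F.
adj = {
    "A": ["B", "C", "D"],
    "B": ["A", "C"],
    "C": ["A", "B", "E", "F"],
    "D": ["A"],
    "E": ["C"],
    "F": ["C"],
}

# Two-hop computation graph of node A: A aggregates from B, C and D, which in turn
# aggregate from their own neighbours (including A itself), as in Figure 12.
print(computation_graph(adj, "A", 2))
```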


Therefore, training a Graph Neural Network involves training multiple neural networks, one per node, simultaneously, with the structure of each neural network determined by the corresponding local neighbourhood.


7. Layers in a Graph Neural Network


The neural network corresponding to a node in the graph can be of arbitrary depth as shown in the Figure 15 below:


Figure 15: The computation graph of a node can incorporate arbitrary number of hops from its neighbourhood (arbitrary depth)


One can create an arbitrary number of layers. Each node has an embedding at each layer, and the embedding at Layer 0 is initialised to the node’s input feature vector X, as shown in Figure 15 above. Therefore, at Layer 0 the embeddings of nodes A and C are simply the feature vectors of A and C, and a node’s Layer 1 embedding is the aggregation of its neighbours’ Layer 0 feature vectors together with its own.


Thus, for a local neighbourhood that extends K hops, we need a K-layer neural network: the Layer-K embedding of a node gets information from the nodes that are up to K hops away.
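
A small sketch of this hop/layer correspondence, re-using the example graph of section 6: the nodes whose information can reach node A after K layers are exactly those within K hops of A.

```python
from collections import deque

def nodes_within_k_hops(adj, source, k):
    """Breadth-first search to depth k: every node at distance <= k from `source`."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if distance[node] == k:
            continue
        for nbr in adj[node]:
            if nbr not in distance:
                distance[nbr] = distance[node] + 1
                queue.append(nbr)
    return set(distance)

adj = {"A": ["B", "C", "D"], "B": ["A", "C"], "C": ["A", "B", "E", "F"],
       "D": ["A"], "E": ["C"], "F": ["C"]}

print(nodes_within_k_hops(adj, "A", 1))   # {'A', 'B', 'C', 'D'}: the 1-layer receptive field
print(nodes_within_k_hops(adj, "A", 2))   # adds 'E' and 'F': the 2-layer receptive field
```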


8. Neighbourhood Aggregation and Transformations


Having defined in sections 5 through 7 the notion of how to create a computation graph per node based on the local neighbourhood, let us next talk about the transformations that happen inside the neural network.


Having drawn the computation graph corresponding to node A in the graph as shown in Figure 15, the question now arises: what is happening inside the boxes shown in the Figure below?


Figure 16: What are the aggregations/transformations in the box (neural network)?


Thus, we have to see how we define the transformation, how we parametrise it and how we learn the parameters.

An important point to note is that, because the ordering of the nodes in a graph is arbitrary, the aggregation operator in a graph has to be permutation invariant. That is, however the nodes are ordered, the result of the aggregation should remain the same. Thus, the neighbourhood aggregation function has to be permutation or order invariant.

The basic approach is to average the information from the neighbours and apply a neural network. Since averaging/summation is permutation invariant, it can be used as the form of aggregation in the input blocks, as shown in Figure 17 below.


Figure 17: Averaged information from the neighbours passed into the neural network for linear transformation and application of non-linearities.


Once the messages are aggregated, we apply the neural network corresponding to the block, as shown in Figures 16/17 above. Applying the neural network means applying a linear transformation followed by a non-linearity, creating the next-level message. This is what is happening inside the box of Figure 17. The math of the transformation and of the non-linearities/activations is discussed in the next section (Section 9).
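
As a hedged sketch of what one such box does, the snippet below averages the incoming neighbour messages, applies a learned linear map W and then a non-linearity; the dimensions and the choice of ReLU are illustrative assumptions. It also checks that re-ordering the messages does not change the result, i.e. that the aggregation is permutation invariant.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def aggregate_and_transform(neighbour_messages, W):
    """neighbour_messages: list of d-dimensional vectors; W: (d_out, d) weight matrix."""
    aggregated = np.mean(neighbour_messages, axis=0)   # order of the messages does not matter
    return relu(W @ aggregated)                        # linear transformation + non-linearity

d_in, d_out = 4, 2
W = np.random.randn(d_out, d_in)
messages = [np.random.rand(d_in) for _ in range(3)]    # messages from three neighbours

out1 = aggregate_and_transform(messages, W)
out2 = aggregate_and_transform(messages[::-1], W)      # same messages, different order
print(np.allclose(out1, out2))                         # True: the aggregation is permutation invariant
```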

The key distinction between different graph neural network architectures is how the aggregation is done, that is, how the information from the children nodes is aggregated and combined with the information from the parent.

9. Mathematics in a Graph Neural Network Computation

As mentioned in section 8, the basic approach is to average the messages coming from the children, that is, from the neighbours of the target node, and apply a neural network to them. Let us understand the mathematics involved, which is fundamentally no different from the mathematics of the operations in conventional/vanilla neural networks.


Consider:

h_v^{(0)} = x_v

where h is the embedding, the subscript denotes the node and the superscript denotes the layer of the neural network. As mentioned in the preceding sections, at the 0th layer the embedding of node v is simply its feature representation, denoted by x_v above.

Thus, h_v^{(0)} denotes the 0th-layer node embedding, given by the node’s feature representation.

Next, we are going to create the higher order embeddings at the subsequent layers.


So, at Layer 1, and more generally at Layer k+1, we will have:

h_v^{(k+1)} = \sigma\left( W_k \sum_{u \in N(v)} \frac{h_u^{(k)}}{|N(v)|} + B_k \, h_v^{(k)} \right)

This is what we essentially do in the above math:

  • We take the embedding of the node at the previous layer and transform it by a matrix B. This is the expression: B_k h_v^{(k)}
  • We also take the embeddings of the neighbours, sum and average them. This is the expression: \sum_{u \in N(v)} h_u^{(k)} / |N(v)|
  • We transform the averaged embedding by multiplying it with the matrix W_k.
  • We then apply the non-linearity/activation \sigma(\cdot).
This forms one layer of the neural network. It can be repeated over several iterations/layers, and the final hidden representation is the final embedding of the node. These are the same kinds of steps that take place in the hidden layers of a conventional neural network.
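
To make the update concrete, here is a minimal NumPy sketch of this forward pass applied to every node for K layers; the random graph, the dimensions and the use of ReLU as the non-linearity σ are assumptions made purely for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def gnn_forward(A, X, Ws, Bs):
    """A: (n, n) binary adjacency, X: (n, d) features, Ws/Bs: per-layer weight matrices."""
    H = X                                             # layer-0 embeddings are the raw features
    for W, B in zip(Ws, Bs):
        deg = A.sum(axis=1, keepdims=True)            # |N(v)| for every node
        neigh_mean = (A @ H) / np.maximum(deg, 1)     # mean of the neighbours' previous embeddings
        H = relu(neigh_mean @ W.T + H @ B.T)          # transform, add the self term, non-linearity
    return H                                          # final node embeddings z_v

n, d, hidden, K = 6, 4, 8, 2
A = np.random.randint(0, 2, (n, n))
A = np.triu(A, 1)
A = A + A.T                                           # random undirected toy graph
X = np.random.rand(n, d)
dims = [d] + [hidden] * K
Ws = [np.random.randn(dims[k + 1], dims[k]) for k in range(K)]
Bs = [np.random.randn(dims[k + 1], dims[k]) for k in range(K)]
print(gnn_forward(A, X, Ws, Bs).shape)                # (6, 8)
```

In practice one would typically use a library such as PyTorch Geometric or DGL, but the sketch is enough to show that only the two matrices per layer, W and B, are learned, and that they are shared across all nodes.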


10. Training a Graph Convolutional Network Model


Next, we will have to train the Graph Convolutional Network model where every node has its own computational architecture as shown below:


Figure 18: Computational architecture of a single node of the Graph


We feed the parameters (as discussed in section 9, the matrices W and B) into stochastic gradient descent and train them with a suitable loss function. The trainable parameters appear in the layer-wise update:

h_v^{(k+1)} = \sigma\left( W_k \sum_{u \in N(v)} \frac{h_u^{(k)}}{|N(v)|} + B_k \, h_v^{(k)} \right)

Considering the equation above, the parameters to be trained are:


  • Wk – the weight matrix for the neighbourhood aggregation
  • Bk – the weight matrix for transforming the hidden vector of the node itself.


An important point to note is that all the nodes in the network share the same transformation matrices Wk and Bk, which enables embeddings to be generated for unseen nodes in the Graph, as discussed in section 11.


10.1 Supervised or Unsupervised setting during training:

The node embeddings z_v are a function of the input graph, and training can happen in a supervised or an unsupervised setting. In the supervised setting, where node labels are available, we formulate a loss function and make a prediction that minimises the discrepancy between the prediction and the ground truth.



In an unsupervised setting, where node labels aren’t available, we typically determine the similarity of nodes in the network using Jaccard similarity, as described in section XX of Part Y of this Graph series. The loss function is formulated as the difference between the similarity in the network and the similarity in the embedding space, the latter denoted by the dot product of the node embeddings.
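
A hedged sketch of such an unsupervised objective is given below; the squared-error form and the sigmoid on the dot product are illustrative choices of mine, since the text above only specifies a discrepancy between the similarity in the graph and the similarity in the embedding space.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unsupervised_loss(Z, pairs, graph_similarity):
    """Z: (n, d) node embeddings; pairs: list of (u, v); graph_similarity[(u, v)] in [0, 1]."""
    loss = 0.0
    for u, v in pairs:
        embedding_similarity = sigmoid(Z[u] @ Z[v])        # similarity in the embedding space
        loss += (graph_similarity[(u, v)] - embedding_similarity) ** 2
    return loss / len(pairs)

Z = np.random.rand(4, 8)
pairs = [(0, 1), (2, 3)]
graph_similarity = {(0, 1): 0.8, (2, 3): 0.1}              # e.g. Jaccard similarity of neighbourhoods
print(unsupervised_loss(Z, pairs, graph_similarity))
```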


As in other problems, for supervised training we would define a cross-entropy loss for classification, as shown in the equation below:

\mathcal{L} = -\sum_{v \in V} \left[ y_v \log\left(\sigma(z_v^{\top}\theta)\right) + (1 - y_v)\log\left(1 - \sigma(z_v^{\top}\theta)\right) \right]

In the above equation:

  • z_v: the encoded input coming from the node embeddings.
  • θ: the classification weights.
  • y_v: the node labels.

A short code sketch of this computation follows.
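
This is a minimal sketch assuming a sigmoid link and binary labels; all values below are made up for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def node_classification_loss(Z, theta, y):
    """Z: (n, d) node embeddings, theta: (d,) classification weights, y: (n,) binary labels."""
    p = sigmoid(Z @ theta)                                      # predicted probability per node
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))    # cross entropy over all nodes

Z = np.random.rand(6, 8)
theta = np.random.randn(8)
y = np.array([0, 1, 1, 0, 1, 0])
print(node_classification_loss(Z, theta, y))
```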


11. Conclusion: Synopsis of the discussions in sections 1 through 10

From the discussions in sections 1 through 10, we can now think of the model design as follows:

Given a graph:


Figure 20: Example Graph


We would like to compute the embedding of a target node. We construct the computation graphs for the nodes in the graph as shown in the figure below:


Figure 21: Computation Graph of a target node


We define the neighbourhood aggregation function and a loss function, such as the cross entropy, and train the model. Normally, for training, we select a batch of nodes as shown in the figure below.


Figure 22: Training the model for a batch of computation graphs.


Application of the trained model on “unseen” nodes:

Once the model is trained, we can apply it to “unseen” nodes, that is, nodes it was never trained on. This means the model has inductive capability, unlike node embeddings trained with shallow methods such as a skip-gram model. This is possible because the same parameters, i.e. W and B (as mentioned in section 10), are shared across all nodes.


Figure 23: Shared Parameters across all computation graphs in the network enables prediction on un-seen nodes.


This sharing of parameters allows graph neural networks to generalize to “unseen” nodes. For example, in biomedical networks we can train the model on one organism (one graph network) and use the model to predict on another organism (another network), as shown in the figure below:


Figure 24: Prediction on unseen networks of organisms


Prediction on unseen nodes is very useful for networks that evolve, because you can take a snapshot of the network, create the computation graphs, run a forward pass and predict the embeddings of the unseen nodes.


Thus, sections 1 through 10 have provided a detailed description of Graph Neural Networks: the difference between vanilla neural networks and Graph Neural Networks, the comparison between convolutional neural networks and graph neural networks, and the loss function and training process in GNNs.

My subsequent blogs will focus on different architectures of Graph Networks, Knowledge Graphs, etc!
