Formulation of Node Embeddings in Graphs: Node2Vec Algorithm - Part 6 of X of my notes



1. Introduction:


This article is the sixth in my continuing series on Graphs. It discusses the formulation of node embeddings in graphs and is divided into five main parts:


  • Firstly, we discuss the motivation – why we care to map graph structures into embeddings. Some of this section may look like a repeat of the content of my previous article on Understanding Node Embeddings in Graphs, but I felt it was necessary in order to maintain the continuity of the series and establish the overall context of the formulation of node embeddings. This forms section 2 of this article.

  • In the second part, we define the problem: what the “context” of a node is and what the overall objective is. This forms section 3.

  • In the third part, I talk about the Word2Vec approach for the generation of word embeddings. I feel this is extremely important, as the Node2Vec algorithm for formulating node embeddings builds upon the successful foundation of the Word2Vec algorithm in Natural Language Processing (section 4).

  • The fourth part talks about the Node2Vec algorithm and, importantly, the concept of Random Walks, which uses second-order Markov chains to establish the context of every node. The similarity of Node2Vec with Word2Vec becomes crystal clear through this part (section 5).

Reading through sections 4 and 5, it will be clear that Word2Vec and Node2Vec are completely consistent in their approach.

  • And last but not least, in the fifth part (section 6) we go into the mathematics so that the novelty of the Node2Vec approach to mapping nodes into an embedding space becomes clear.


2. Motivation behind mapping graph structures to node embeddings:

Let us take a very simple graph as shown in the figure below:

Figure 1: Example Graph


The graph above may be interpreted as below:

  • The smaller the edge, the higher the weight – that is, the stronger the relationship between the nodes (for example, edges P–Q and 10–11).

Further, looking at the graph, we can say that the edge P–Q represents the same kind of relationship as the edge 10–11 – we say this by observation. Suppose we are going to construct a machine learning model for node classification, link prediction or graph classification – we will have to pass this information to the model. That is, we will have to pass to the model the information of how the graph is connected. If we do not, we will be putting an additional burden on the model to learn these relationships.


The Node2Vec algorithm tries to encapsulate this information in an embedding space in the form of vectors. Nodes which are strongly connected and close to one another in the graph will also be close to one another in the embedding space when represented as vectors. Therefore, the vectors corresponding to nodes P and Q will be close together, as will the vectors for nodes 10 and 11, as visualised in the graph and mentioned above.


It should be underscored that in finding the vector representation of the graph structure we are not concerned with the feature representation of each node; rather, we would like to map the relationships between the data points into an embedding space, so that we can feed this representation into a machine learning model for predictions relating to node classification, link prediction, graph classification and the like.


Figure 2: Encoding the Graph Structure as Vectors into an Embedding Space


3. Context of a Node and the overall objective:

Let us go ahead and formulate the general objective of the problem: we want to map nodes with a similar “context” close to one another in an embedding space where they are represented as vectors. The approach is similar to the Word2Vec algorithm, wherein words with a similar context are mapped close to one another as vectors in the embedding space, as in Figure 3 below.


Figure 3: Word Embeddings in Natural Language Processing (NLP)



Figure 4: Input graph – the Zachary’s Karate Club network – and the corresponding node embeddings.


Now, let us understand what “context” means for a node. There are two orthogonal approaches to exploring the “context” of a node in a graph, directly linked with Breadth First Search (BFS) and Depth First Search (DFS) respectively. The first notion of similarity is homophily, derived from the sociological concept that similar people tend to interact closely with one another – that is, similar people have direct relationships. This is closely related to Breadth First Search, which explores the direct context (neighbours) of a node.


The notion related to Depth First Search is called “structural equivalence” and attempts to detect the role of a node from a wider perspective. Here we do not only look at 1-hop (direct) relationships, but go multiple hops out to see which other nodes a particular node is connected to. In a human network, this might mean the extent to which two nodes are connected to the same others, i.e., have the same social environments. It is often observed that structurally equivalent nodes are similar in other ways as well, such as in attitudes, behaviours or performance. Let us now understand the Breadth First Search and Depth First Search approaches in a graph.


Let us start with a graph as shown in the figure below:

Figure 5: Breadth First Search and Depth First Search in a Graph


Let’s look at the Depth First Search algorithm in terms of a graph:


In Depth First Search we start from the source node 1 and explore its relationships as deep into the graph as possible before backtracking towards the source node. Thus, in a depth first search, we proceed through the graph as follows: starting at node 1, we explore along the path 2, 9, 8 until we cannot find any node deeper in the network. We then explore the remaining relationships – the path through 7 and 5 – and then the relationship of 5, which is node 6. Having explored the entire path through the depth, we jump back to node 1 and explore the remaining relationships to nodes 3 and 4. Thus, the Depth First path traversed considering 1 as the source node is given as:


Depth First Path – source node 1 – Context of node 1 – [2, 9, 8, 7, 5, 6, 3, 4]


The next approach is the “Breadth First Search”:


Here we start at node 1 and explore its direct relationships first: we jump to 2, 5 and 3 in the first three steps. Once the direct relationships are exhausted, we explore the nodes at a distance of two hops: we jump to node 2 and find its direct relationships, nodes 9 and 8; we then jump to node 5 and find its direct relationships, nodes 7 and 6; and finally we jump to node 3 and find its direct relationship, node 4. Thus, the Breadth First path traversed considering 1 as the source node is given as:


Breadth First Path – source node 1 – Context of node 1 – [2, 5, 3, 9, 8, 7, 6, 4]
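To make the two traversals concrete, below is a minimal Python sketch of both searches. Figure 5 is not reproduced here, so the adjacency list is an assumption of mine, chosen only to be consistent with the traversal orders described above.

from collections import deque

# Assumed adjacency (directed for simplicity of illustration); the exact
# edges of Figure 5 may differ.
adjacency = {
    1: [2, 5, 3],
    2: [9, 8],
    9: [8],
    8: [7],
    7: [5],
    5: [7, 6],
    3: [4],
    4: [],
    6: [],
}

def dfs_context(source):
    """Depth-first traversal; returns the visited nodes excluding the source."""
    visited, order, stack = {source}, [], [source]
    while stack:
        node = stack.pop()
        if node != source:
            order.append(node)
        # push neighbours in reverse so the first-listed neighbour is explored first
        for nbr in reversed(adjacency.get(node, [])):
            if nbr not in visited:
                visited.add(nbr)
                stack.append(nbr)
    return order

def bfs_context(source):
    """Breadth-first traversal; returns the visited nodes excluding the source."""
    visited, order, queue = {source}, [], deque([source])
    while queue:
        node = queue.popleft()
        if node != source:
            order.append(node)
        for nbr in adjacency.get(node, []):
            if nbr not in visited:
                visited.add(nbr)
                queue.append(nbr)
    return order

print(dfs_context(1))   # [2, 9, 8, 7, 5, 6, 3, 4]
print(bfs_context(1))   # [2, 5, 3, 9, 8, 7, 6, 4]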


In Node2Vec there are hyperparameters which reflect a trade-off between breadth-first and depth-first search. In the graph of Figure 5, the orange colour represents the depth-first traversal and the blue colour the breadth-first traversal.


How does the context help us mathematically?

Having the context of each node – the neighbourhood of the node obtained as a trade-off between Breadth First Search and Depth First Search – we can measure how much the contexts of two nodes overlap in order to quantify the similarity between the two nodes. The similarity metric we use here is the Jaccard similarity – it may be recalled from Part 3 of my series of blogs, where I discussed the Jaccard similarity in section 4.2: here. The Jaccard coefficient for a pair of nodes is the size of the intersection of their contexts divided by the size of the union, as shown in the equation below:

Equation: Expression for the Jaccard coefficient – J(u, v) = |N_S(u) ∩ N_S(v)| / |N_S(u) ∪ N_S(v)|


Suppose we have the contexts of two nodes u and v defined as below; we get the intersection as illustrated in the figure below:


Ns_u = [2, 3, 5, 6, 7, 9]

Ns_v = [1, 2, 5, 7, 9]


Figure 6: Overlapping nodes u and v.


Jaccard similarity = Intersection / Union = 4 / 7


Similarity between u and v = 0.571
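As a quick check, here is a small Python sketch that computes this Jaccard similarity with sets, using the two example contexts above:

# Contexts of the two nodes from the example above
ns_u = {2, 3, 5, 6, 7, 9}
ns_v = {1, 2, 5, 7, 9}

def jaccard(a, b):
    """Jaccard coefficient: |intersection| / |union|."""
    return len(a & b) / len(a | b)

print(jaccard(ns_u, ns_v))   # 4 / 7 ≈ 0.571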


It may be recalled that I had pointed out in the previous part of this series [https://www.dhirubhai.net/pulse/understanding-node-embeddings-graphs-part-5-x-my-notes-ajay-taneja-1f0jf/?trackingId=YznVESQaR5S%2Foub2cYMDbQ%3D%3D], in section 3, that the goal of node embeddings is to encode the nodes so that similarity in the embedding space (e.g. the dot product) approximates similarity in the graph, as shown in the figure below and represented in the equation:


Equation: Similarity in the network should approximate similarity in the embedding space


The similarity between the nodes in the original network is, mathematically, the Jaccard similarity discussed above.


Further, in Node2Vec we usually do not take the entire context of a node, as that would amount to capturing the entire graph. Therefore, we have to introduce a sampling strategy – a strategy to approximate the context of a node. In Node2Vec, the sampling strategy is a fixed number of random walks starting at each node and exploring its context. The context exploration is itself parametrised as a trade-off between Depth First and Breadth First Search.


4. Word2Vec algorithm for generation of word embeddings:


Node2Vec builds upon the successful foundation of Word2Vec. Word2Vec was released in 2013 [https://arxiv.org/abs/1301.3781] and Node2Vec followed in 2016 [https://arxiv.org/abs/1607.00653].

Let us first highlight how the Word2Vec algorithm works and then see how Node2Vec essentially extends the skip-gram architecture of Word2Vec.


Overview of the Word2Vec algorithm:

There are two variants of Word2Vec — skip-gram and CBOW. The skip-gram variant takes a target word and tries to predict the surrounding context words. The window size determines the span of the words on either side of the target word.

Suppose we have the following sentence:

I Love Deep Learning


Now, consider a window size of 2. For window sizes of 2 and 3, the skip-grams are shown in the table below. The underlined word is referred to as the ‘target word’ and the words on either side of the target word are the ‘contextual’ words – there may be one or more of them depending on the window size chosen.

See the examples in the table below – the target word together with a contextual word denotes a skip-gram.

Table: Skip-Grams based on window size for the sentence: I Love Deep Learning
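As a minimal sketch of how such skip-gram pairs can be generated for the example sentence (the function name and output format are my own illustration, not taken from the table above):

sentence = "I Love Deep Learning".split()

def skip_grams(tokens, window):
    """Return (target, context) pairs for the given window size."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

print(skip_grams(sentence, window=2))
# includes e.g. ('Love', 'I'), ('Love', 'Deep'), ('Love', 'Learning'), ...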


Now, let us understand the training of the Word2Vec algorithm to generate word embeddings.


Training the word2vec to generate word embeddings:

In Word2Vec we train a single-hidden-layer neural network on a certain task, but we are not going to use the neural network for the task it was trained on. That is, we are not directly interested in the prediction that the output layer produces. Instead, we are just going to use the weights of the hidden layer of the neural network – these weights are essentially the word embeddings, as explained further below.

The single hidden layer neural network is trained as described below:

  • Given a single word in the middle of a sentence (the input/target word), we look at the words within a certain window size; these words on either side of the input/target word are termed the “contextual words”.
  • The neural network is going to tell us, for every word in the vocabulary, the probability of it being a contextual word.
  • The output probabilities relate to how likely it is to find each vocabulary word “near” to / in the “context” of the input (target) word. For example, if we give the trained network the word “cricket”, the output probabilities are going to be higher for vocabulary words like “ball” or “match” than for words like “apple” or “banana”.


Training examples in word2vec:

In Word2Vec, we generate a very large number of training examples (target–context word pairs) from the source text, as follows:


Figure 7: Training examples for the Word2Vec algorithm


Figure 8: Training examples for Word2Vec algorithm


As mentioned above, we feed the one-hot encoding of the input word to the input layer of the neural network and process it through the single hidden layer; the output of the network is a probability for each vocabulary word, and the word with the highest probability is the contextual word we are looking for.


This is illustrated in the architecture below:


Figure 9: The architecture of the skip-gram model – a single-hidden-layer neural network


As mentioned above, we are more interested in the weights of the hidden layer than in the predicted contextual words themselves. The weights of the hidden layer form the embedding vectors of the words, as explained above.
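To make this concrete, here is a compact NumPy sketch of the forward pass; the vocabulary, embedding dimension and random initialisation are illustrative assumptions of mine, not the original training set-up. The rows of the input-to-hidden weight matrix W_in are exactly the embeddings we keep.

import numpy as np

vocab = ["i", "love", "deep", "learning"]
V, d = len(vocab), 3                  # vocabulary size and embedding dimension

rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, d))        # hidden-layer weights: one d-dimensional embedding per word
W_out = rng.normal(size=(d, V))       # output-layer weights

def forward(target_index):
    """One-hot input -> hidden layer -> softmax over the vocabulary."""
    h = W_in[target_index]            # selecting a row == multiplying by a one-hot vector
    scores = h @ W_out
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()            # probability of each vocabulary word being a context word

probs = forward(vocab.index("love"))
embedding_of_love = W_in[vocab.index("love")]   # this row is the word embedding we keep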


5. Node2Vec for generation of node embeddings: Random Walks, Markov Chain framework, Random walks in Graphs

As mentioned in section 2, the Node2Vec algorithm tries to encapsulate the structure of the graph in an embedding space in the form of vectors. Nodes which are strongly connected and close to one another in the graph will also be close to one another in the embedding space when represented as vectors.

The interesting part is that the Node2Vec algorithm makes use of the skip-gram model – the same single-hidden-layer neural network model described in section 4 for generating word embeddings – to obtain the node embeddings. The question we need to address first is:


How do we convert a graph into a one-dimensional array of "sentences" so that we can pass the encoding to the same single-hidden-layer neural network?

We do this by taking every node in the graph as the “target” (analogous to the “target” word in Word2Vec) together with the “context” of that target node. For nodes in a graph, as discussed in section 3, the context corresponds to the nodes obtained through a random walk simulation that trades off Breadth First Search (BFS) against Depth First Search (DFS). Let us now spend some time understanding random walks. We will first look at simple first-order random walks and then describe the framework of Markov chains, which is a modelling tool used to predict a system’s future state.


Simple one-dimensional Random Walks:

A random walk simulation has applications in several subjects, including physics, chemistry and graphs, as well as in many other areas. For example, let us say we have particles in a box as shown below:


Figure 10: Random walk of particles in a box


The particles will move around in different ways – in a random pattern. Let us, for simplicity, consider a one-dimensional model as shown in the figure below:


Figure 11: One dimensional model to explain Random Walk


Let us say we start at 0th position and at each time step we are constrained to move by one position – we can move either to the left or to the right of the number line shown in the Figure 11 above.

Let us now build up the probabilities for the one-dimensional random walk:

  • At t = 0, we can only be at position 0, so the probability of the random walker being at position 0 is 1.
  • At the next step, t = 1, since the random walker can move left or right by one position at a time, the probability of the random walker being at position +1 is 0.5, the probability of being at position -1 is 0.5, and the probability of being at any other position is 0.
  • At the subsequent time step, t = 2, the random walker will land either at position +2 or at position 0 if it was at position +1 at the end of the first time step (t = 1).
  • Similarly, if the random walker was at position -1 at the end of the first time step, it may land at position 0 or at -2 at the end of the second time step.


Therefore, the probabilities at the end of time step 2 are:


  1. Random walker is at position 0 = 0.25 + 0.25 = 0.5
  2. Random walker is at position +2 = 0.5 x 0.5 = 0.25
  3. Random walker is at position -2 = 0.25


The probabilities of moving from one position to a neighbouring position at each step (0.5 to the left and 0.5 to the right here) are referred to as transition probabilities. It should be underscored that in a simple random walk each step is random and completely independent of the past history – that is the essence of the Markov property, which leads us to the discussion of “Markov Chains”.
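A quick Monte-Carlo sketch in Python (the sample size and seed are arbitrary choices of mine) that reproduces the t = 2 probabilities above:

import random

random.seed(42)
n_walks, n_steps = 100_000, 2
counts = {}
for _ in range(n_walks):
    position = 0
    for _ in range(n_steps):
        position += random.choice([-1, +1])   # unbiased step: left or right with probability 0.5
    counts[position] = counts.get(position, 0) + 1

for pos in sorted(counts):
    print(pos, counts[pos] / n_walks)   # approximately -2: 0.25, 0: 0.5, +2: 0.25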


Framework of Markov Chains:

Let us now introduce the framework of Markov chains. Markov chains make use of three fundamental concepts:

  • State Space – indicating where you can travel. In the case of the one-dimensional random walk described above, the state space comprises the fixed positions on the number line.


  • Transition Probabilities – the second concept, as pointed out in the paragraphs above. In the case of a random walk in two dimensions, where the random walker may move in any of the four directions – left, right, up, down – as shown in Figure 12 below, each state has four neighbours and there is no bias towards any direction, so the transition probabilities are all 1/4.

Figure 12: Transition probabilities in a two-dimensional unbiased random walk



  • Initial Distribution – the third concept, indicating where you start the random walk from.


Markov Property:

The most important point about Markov chains is the “Markov property”. This is the simple idea that once you have reached a particular state, you have “forgotten” how you got there: the next move depends only on the current state and not on the path that led to it. In particular, nothing prevents the random walker from immediately returning to the state it just came from. This was also respected in the one-dimensional simple random walk above.


Random Walks in Graphs:

In the case of a graph, imagine one starts traversing the graph from a node (say node 1), as shown in Figure 5 and illustrated again in Figure 13 below. At random, one picks a neighbouring node and hops to it (note: this is analogous to moving from one word in a sentence to the next). Then one repeats the whole process until a predefined walk length is reached – the walk length is a hyperparameter.

The walk length parameter defines how long the sentences (the context of a node) will be. For every node in the graph, the Node2Vec algorithm generates a series of random walks with that node as the starting node. One also has to define how many random walks start from a particular node, via the walks-per-node parameter (another hyperparameter).


The Node2Vec algorithm thus uses random walks to generate a number of “sentences” starting from each node in the graph. The starting node is analogous to the first word in a sentence, and as we hop to other nodes through the random walk, we collect more nodes (“words”).
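As an illustration, here is a small Python sketch of generating such walk “sentences” with uniform (first-order) random walks; the adjacency dictionary, walk length and walks-per-node values are placeholder assumptions of mine:

import random

random.seed(0)

# Placeholder adjacency list of an undirected graph
adjacency = {
    1: [2, 3, 5], 2: [1, 8, 9], 3: [1, 4], 4: [3],
    5: [1, 6, 7], 6: [5], 7: [5], 8: [2], 9: [2],
}

def random_walk(start, walk_length):
    """Uniform first-order random walk of a fixed length starting at `start`."""
    walk = [start]
    while len(walk) < walk_length:
        walk.append(random.choice(adjacency[walk[-1]]))
    return walk

walks_per_node, walk_length = 2, 5
sentences = [random_walk(node, walk_length)
             for node in adjacency
             for _ in range(walks_per_node)]
print(sentences[0])   # one "sentence" of node ids, e.g. [1, 5, 1, 3, 1]

These walk sentences can then be fed (with node ids converted to strings) to a skip-gram implementation, just as text sentences would be.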


Figure 13: Random Walks in Graphs



First order Random Walks and Second Order Random Walks:

The random walk in Node2Vec uses a second-order Markov chain – in contrast to the plain Markov property discussed above, which corresponds to a first-order random walk. In a first-order random walk, at any time step one can return to the previous state (the previous node), and the transition probabilities, as indicated in Figure 12, are unbiased.

In second-order random walks, we take into account both the current and the previous state. In second-order biased random walks we have two hyperparameters:

  • p – the return parameter
  • q – the in-out parameter

which control the trade-off between breadth-first and depth-first search, as explained in section 3. A sketch of how these parameters bias the walk is given below.
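The following Python sketch is my own simplified rendering of the search bias described in the Node2Vec paper, with a placeholder adjacency list; it shows how p and q re-weight the next step given the previous node:

import random

# Placeholder undirected adjacency list
adjacency = {
    1: [2, 3, 5], 2: [1, 8, 9], 3: [1, 4], 4: [3],
    5: [1, 6, 7], 6: [5], 7: [5], 8: [2], 9: [2],
}

def biased_step(prev, curr, p, q):
    """Choose the next node given the previous and current node (second-order walk)."""
    neighbours = adjacency[curr]
    weights = []
    for nxt in neighbours:
        if nxt == prev:                      # returning to where we came from
            weights.append(1.0 / p)
        elif nxt in adjacency[prev]:         # one hop from the previous node (BFS-like move)
            weights.append(1.0)
        else:                                # two hops from the previous node (DFS-like move)
            weights.append(1.0 / q)
    return random.choices(neighbours, weights=weights, k=1)[0]

def node2vec_walk(start, walk_length, p, q):
    walk = [start, random.choice(adjacency[start])]   # the first step is uniform
    while len(walk) < walk_length:
        walk.append(biased_step(walk[-2], walk[-1], p, q))
    return walk

print(node2vec_walk(1, walk_length=6, p=0.25, q=4))   # low p, high q -> walks stay local (BFS-like)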


6. Formulation of Node Embeddings – Mathematical details:


It is now time to formulate the problem from a mathematical perspective and explore how Node2Vec solves the optimisation problem. Let us consider a single node in the graph – the node “u”.

To start with, let the node “u” be the source node for which we want to find the node embedding, as shown in the figure below:


Figure 14: Source node “u” for which we want to find the embedding


Using the random walk approach elaborated in section 5, we can find the context of the source node “u” using a sampling strategy S. Let us say we end up with a context of the source node “u” as:



Where N_S(u) denotes the neighbourhood (context) of the node “u” obtained with the sampling strategy S.


Now, let us say we have a node “v” as an element of the context of node “u”.


Having the above in hand – the node “u”, its context N_S(u) and a node “v” which is an element of the context of node “u” – we want to find a mapping function which maps node “u” to a d-dimensional vector, denoted as:



Where f(u) is a vector of d dimensions.
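Written out in the notation of the original Node2Vec paper, the mapping can be expressed as:

$$ f : V \rightarrow \mathbb{R}^d, \qquad f(u) \in \mathbb{R}^d $$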


Having f(u) and f(v), we can calculate the similarity between the two nodes as the dot product of the two vectors. The smaller the angle between the vectors, the greater the similarity:


Figure 15: Angle θ between vectors u and v


Thus, the angle between the two vectors should be small for the similarity to be large. The similarity in the embedding space is denoted as f(u)·f(v) – the dot product of the two vectors. Having the similarity score in the embedding space, we want to transform the score into a probability by applying the softmax, which squashes the scores into values between 0 and 1 that sum to 1.


We want to maximise the probability of the node v being in the context of node u, denoted as the conditional probability:
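For reference, this conditional probability is parameterised with a softmax over the dot products, consistent with the original Node2Vec paper:

$$ \Pr\big(v \mid f(u)\big) = \frac{\exp\big(f(u) \cdot f(v)\big)}{\sum_{n \in V} \exp\big(f(u) \cdot f(n)\big)} $$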



Now, we want to maximise the probability not only of the node v but of all the nodes that are in the context of node u.


Therefore, the conditional probability expression can be written as:
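Assuming the context nodes are conditionally independent given f(u) – the standard assumption in the Node2Vec paper – this is:

$$ \Pr\big(N_S(u) \mid f(u)\big) = \prod_{v \in N_S(u)} \Pr\big(v \mid f(u)\big) $$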



Considering all the nodes in the graph, the likelihood function that needs to be maximised is written as:
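In the notation of the original paper, this (log-)likelihood objective reads:

$$ \max_{f} \; \sum_{u \in V} \log \Pr\big(N_S(u) \mid f(u)\big) $$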



Carrying out this maximisation over every node in the network results in the node embeddings for each node in the network.
