Semi-Supervised Classification with Graph Convolutional Networks
Thomas N Kipf and Max Welling, ICLR 2017
Kipf and Welling (2017) introduce Graph Convolutional Networks (GCNs), which use the features of each node and leverage the edges of the graph to derive class similarity between nodes in a semi-supervised setting.
Traditionally, semi-supervised learning on graph-structured data has relied heavily on the assumption that the edges in the graph represent class similarity (i.e., nodes with similar classes have an edge between them). For example, in an image segmentation task, an image can be viewed as a grid graph (a node for every pixel and edges between neighboring pixels, as shown in the figure below). Such a representation of an image encodes the assumption that neighboring pixels belong to the same class (hence an edge between them).
With this assumption, the semi-supervised learning task is then posed as an optimization problem with the following loss function:

$$\mathcal{L} = \mathcal{L}_0 + \lambda \mathcal{L}_{reg}, \quad \mathcal{L}_{reg} = \sum_{i,j} A_{ij} \left\| f(X_i) - f(X_j) \right\|^2 = f(X)^\top \Delta f(X)$$

Here $\mathcal{L}_0$ is some form of cross-entropy loss on the labeled examples, and $\mathcal{L}_{reg}$ is a regularizing function that penalizes differences in the labels of connected nodes. $f(X)^\top \Delta f(X)$ is the Laplacian quadratic form, where $\Delta = D - A$ is the unnormalized graph Laplacian.
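As a quick sanity check, the Laplacian quadratic form can be verified numerically; the 3-node path graph and the candidate labeling below are toy assumptions, not from the paper:

```python
import numpy as np

# Toy 3-node path graph 0 - 1 - 2 (an assumed example)
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
D = np.diag(A.sum(axis=1))   # degree matrix
Delta = D - A                # unnormalized graph Laplacian

f = np.array([0., 1., 1.])   # a candidate labeling of the nodes

# Laplacian quadratic form = sum of squared label differences over edges
quad = f @ Delta @ f
edge_sum = 0.5 * sum(A[i, j] * (f[i] - f[j]) ** 2
                     for i in range(3) for j in range(3))
```

The only disagreeing edge is (0, 1), so both quantities equal 1; minimizing this form smooths labels over the graph.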
Graph Convolutional Networks
GCN proposes a way to do graph convolutions by using the following layer-wise propagation rule:

$$H^{(l+1)} = \sigma \left( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)} \right)$$

where $\tilde{A} = A + I_N$ is the adjacency matrix with added self-loops, $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$ is its degree matrix, $W^{(l)}$ is a layer-specific trainable weight matrix, $\sigma$ is an activation function, and $H^{(0)} = X$ is the input feature matrix.
The paper proposes a two-layer Graph Convolutional Network, which amounts to the following model:

$$Z = \mathrm{softmax} \left( \hat{A}\, \mathrm{ReLU} \left( \hat{A} X W^{(0)} \right) W^{(1)} \right), \quad \hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$$
- The ReLU activation function is used in the first layer because the gradient of ReLU is 1 for positive inputs, so with multiple layers the gradient values do not vanish (tend to zero), which is the case with saturating non-linear functions such as sigmoid or tanh. Softmax is used in the top layer because the final outputs expected are class probabilities.
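The two-layer model can be sketched in a few lines of NumPy; this is a minimal illustration, and the toy graph, feature dimensions, and random weights below are assumptions rather than the paper's setup:

```python
import numpy as np

def normalize_adj(A):
    """Symmetrically normalized adjacency with self-loops:
    A_hat = D~^{-1/2} (A + I) D~^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(d ** -0.5)
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))  # numerically stable
    return e / e.sum(axis=1, keepdims=True)

def gcn_forward(A, X, W0, W1):
    """Two-layer GCN: Z = softmax(A_hat ReLU(A_hat X W0) W1)."""
    A_hat = normalize_adj(A)
    H = np.maximum(A_hat @ X @ W0, 0)   # first layer, ReLU
    return softmax(A_hat @ H @ W1)      # second layer, softmax

# toy graph and random weights (only the shapes are meaningful)
rng = np.random.default_rng(0)
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
X = rng.normal(size=(3, 4))     # 3 nodes, 4 input features
W0 = rng.normal(size=(4, 8))    # 8 hidden units
W1 = rng.normal(size=(8, 2))    # 2 classes
Z = gcn_forward(A, X, W0, W1)
```

Each row of Z is a probability distribution over classes for one node.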
GCN uses gradient descent to learn the weight matrices $W^{(0)}$ and $W^{(1)}$ that minimize the following cross-entropy error over all the labeled nodes ($\mathcal{Y}_L$):

$$\mathcal{L} = - \sum_{l \in \mathcal{Y}_L} \sum_{f=1}^{F} Y_{lf} \ln Z_{lf}$$
The loss function is then a combination of the cross-entropy loss on the labeled nodes and some regularization.
In the paper, the authors observe that L2-regularization of the first-layer weight matrix $W^{(0)}$ alone is sufficient.
- Regularization is used to avoid overfitting by penalizing the weight matrices of the hidden layers. L2-regularization in particular uses the squared 2-norm of the weight matrix as the penalty. This pushes the weights toward zero.
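A sketch of the loss: masked cross-entropy over the labeled nodes plus an L2 penalty on the first-layer weights. The toy predictions, labels, and weight-decay value are assumptions for illustration:

```python
import numpy as np

def semi_supervised_loss(Z, Y, labeled_mask, W0, weight_decay=5e-4):
    """Cross-entropy over the labeled nodes plus an L2 penalty on the
    first-layer weight matrix W0 (the only matrix regularized)."""
    ce = -np.sum(Y[labeled_mask] * np.log(Z[labeled_mask] + 1e-12))
    return ce + weight_decay * np.sum(W0 ** 2)

# toy values: 2 nodes, 2 classes, only node 0 is labeled
Z = np.array([[0.9, 0.1], [0.5, 0.5]])   # predicted class probabilities
Y = np.array([[1., 0.], [0., 1.]])       # one-hot labels (node 1 unused)
mask = np.array([True, False])
W0 = np.ones((2, 2))
loss = semi_supervised_loss(Z, Y, mask, W0, weight_decay=0.1)
```

Only the labeled rows contribute to the cross-entropy term; unlabeled nodes influence the loss indirectly, through the graph propagation in the forward pass.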
Connection between CNN and GCN
In a CNN, the input feature map (blue grid in the image below) is convolved with a discrete kernel to produce the output feature map (green grid). This can be seen as a message-passing algorithm in which the messages from the neighboring pixels and from the pixel itself are multiplied by the kernel weights, and the output feature map is obtained by summing up these messages.
GCN similarly generates the output feature map $H^{(l+1)}$ by convolving the input feature map $H^{(l)}$ with the weight matrix $W^{(l)}$.
This can be seen as a message-passing algorithm where the messages, or input features, of each node in the graph are multiplied by the weights $W^{(l)}$, and the output feature of a node is generated by summing the weighted messages of all the neighboring nodes along with the message from the node itself (note: the self-loop in $\tilde{A} = A + I_N$ takes care of the message from the node itself).
In a CNN the number of neighbors for each node is fixed by the kernel size, so the number of messages received by every node in the output layer is the same. In a GCN, since the structure of the graph is dictated by the adjacency matrix, the number of messages received at each node (its degree) is not the same. Hence the need for normalizing the adjacency matrix (i.e., $\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$) arises in GCN but not in CNN.
Although theoretically there is no limitation on the number of convolution layers, the GCN paper proposes a two-layer network. Since every layer aggregates over immediate neighbors, $k$ layers effectively convolve over $k$-th order neighbors. A two-layer GCN therefore convolves only up to 2nd-order neighbors. The authors' empirical evaluations suggest that 2nd-order neighbors are enough for most domains.
In the traditional approach, since we use the regularizer $f(X)^\top \Delta f(X)$, which is a function of the (unnormalized) adjacency matrix, nodes with dense neighborhoods receive a high penalty, and hence the model will overfit on the local neighborhoods of such nodes (consequently it will under-fit the nodes with sparse neighborhoods). Normalization of the adjacency matrix helps alleviate this problem.
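A small numerical experiment illustrates why the normalization matters; the star graph below is an assumed toy example:

```python
import numpy as np

# Assumed toy star graph: hub node 0 connected to four leaves
A = np.zeros((5, 5))
A[0, 1:] = 1
A[1:, 0] = 1
A_tilde = A + np.eye(5)          # add self-loops

X = np.ones((5, 1))              # identical unit features on every node

unnorm = (A_tilde @ X).ravel()   # hub sums 5 messages, each leaf only 2

d = A_tilde.sum(axis=1)
D_inv_sqrt = np.diag(d ** -0.5)
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt
norm = (A_hat @ X).ravel()       # degree-aware rescaling narrows the gap
```

Without normalization the hub's activation is 2.5x a leaf's even though all nodes carry identical features; after symmetric normalization the spread shrinks, so high-degree nodes no longer dominate.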
Extending GCN to graph classification and supervised learning tasks
The GCN formulation can be leveraged for the graph classification problem, where the task is to predict a single variable for a whole graph. The adjacency matrices of the individual graphs (shown in the figure below) are concatenated into a sparse block-diagonal matrix $A$. This adjacency matrix and the feature matrix $X \in \mathbb{R}^{n \times c}$ ($n$ = total number of nodes in all the graphs, $c$ = number of input features) can be used to train the GCN, and the output matrix $Z$ can then be pooled per block to obtain a class label for each graph.
For a supervised learning task, two graphs can be created from the training set (labeled nodes) and the test set (unlabeled nodes). The adjacency matrices of the two graphs can again be concatenated into a block-diagonal matrix and the GCN trained on it. The output matrix $Z$ will then contain the class probabilities for the test set as well.
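A sketch of the block-diagonal batching described above; the two tiny graphs and the helper name `block_diagonal` are assumptions for illustration:

```python
import numpy as np

def block_diagonal(mats):
    """Stack per-graph adjacency matrices into one block-diagonal matrix
    so a single GCN forward pass covers the whole batch of graphs."""
    n = sum(m.shape[0] for m in mats)
    out = np.zeros((n, n))
    offset = 0
    for m in mats:
        k = m.shape[0]
        out[offset:offset + k, offset:offset + k] = m
        offset += k
    return out

A1 = np.array([[0., 1.],
               [1., 0.]])                  # graph 1: 2 nodes
A2 = np.array([[0., 1., 1.],
               [1., 0., 0.],
               [1., 0., 0.]])              # graph 2: 3 nodes
A = block_diagonal([A1, A2])               # 5 x 5 batched adjacency
```

Because the off-diagonal blocks are zero, messages never cross graph boundaries; the rows of the output $Z$ can then be pooled per block (e.g. mean pooling) to get one prediction per graph.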
Spectral Graph Convolutions
The convolution theorem states that the convolution of two signals is equivalent to point-wise multiplication in the Fourier domain, i.e. $f * g = \mathcal{F}^{-1}\{ \mathcal{F}\{f\} \cdot \mathcal{F}\{g\} \}$, where $\mathcal{F}$ denotes the Fourier transform operator. In signal processing, the spectral transformation usually refers to the transformation from the time domain to the frequency domain (Fourier domain), but in graph theory the spectral transformation refers to the transformation into the eigenvector basis of the graph Laplacian ($L = I_N - D^{-1/2} A D^{-1/2} = U \Lambda U^\top$). So the convolution theorem on graphs gives the following equation:

$$g * x = U \left( (U^\top g) \odot (U^\top x) \right)$$
For spectral graph convolution, we directly define the convolution filter in the eigenvalue (spectral) domain as some function $g_\theta(\Lambda)$ of the eigenvalues. With filter $g_\theta$, spectral convolution on a graph is defined as follows:

$$g_\theta \star x = U\, g_\theta(\Lambda)\, U^\top x$$
Note the difference between the convolution symbol ($*$) and the spectral convolution symbol used in the paper ($\star$): the $\star$ signifies that $g_\theta$ is already defined in the spectral (eigenvalue) domain.
$g_\theta(\Lambda)$ can be approximated by a truncated expansion in terms of Chebyshev polynomials $T_k$ as:

$$g_{\theta'}(\Lambda) \approx \sum_{k=0}^{K} \theta'_k T_k(\tilde{\Lambda}), \quad \tilde{\Lambda} = \frac{2}{\lambda_{max}} \Lambda - I_N$$

where $T_k(x) = 2x\, T_{k-1}(x) - T_{k-2}(x)$, with $T_0(x) = 1$ and $T_1(x) = x$.
Observe that $U U^\top = I_N$ (the eigenvectors are orthonormal) and $(U \Lambda U^\top)^k = U \Lambda^k U^\top = L^k$.
Using the above two results, we can approximate the convolution without needing the eigen-decomposition, directly using the graph Laplacian:

$$g_{\theta'} \star x \approx \sum_{k=0}^{K} \theta'_k T_k(\tilde{L})\, x, \quad \tilde{L} = \frac{2}{\lambda_{max}} L - I_N$$
So we effectively reduced the complexity of the convolution from $\mathcal{O}(N^3)$ to $\mathcal{O}(|E|)$, i.e. from cubic in the number of nodes ($N$, the cost of the eigen-decomposition) to linear in the number of edges ($|E|$). Since $\tilde{L}$ has $\mathcal{O}(|E|)$ non-zero elements, each product $T_k(\tilde{L})\, x$ is a sparse matrix–vector multiplication, which can be done in $\mathcal{O}(|E|)$.
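The equivalence between the Chebyshev recurrence applied directly to the rescaled Laplacian and filtering in the eigenbasis can be checked numerically; the 3-node path graph, the input signal, and the filter coefficients below are arbitrary toy choices:

```python
import numpy as np

def cheb_filter(apply_M, x, theta):
    """Apply sum_k theta_k T_k(M) x via the Chebyshev recurrence
    T_k = 2 M T_{k-1} - T_{k-2}; only mat-vec products are needed."""
    t_prev, t_cur = x, apply_M(x)
    out = theta[0] * t_prev + theta[1] * t_cur
    for th in theta[2:]:
        t_next = 2 * apply_M(t_cur) - t_prev
        out = out + th * t_next
        t_prev, t_cur = t_cur, t_next
    return out

# assumed toy path graph 0 - 1 - 2
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
D_inv_sqrt = np.diag(A.sum(axis=1) ** -0.5)
L = np.eye(3) - D_inv_sqrt @ A @ D_inv_sqrt   # normalized Laplacian
lam, U = np.linalg.eigh(L)                    # L = U diag(lam) U^T
lmax = lam.max()
L_tilde = 2.0 / lmax * L - np.eye(3)          # eigenvalues in [-1, 1]

x = np.array([1.0, -2.0, 0.5])
theta = [0.5, -0.3, 0.2]                      # K = 2, arbitrary values

# spatial side: matrix-vector products only, no eigen-decomposition
spatial = cheb_filter(lambda v: L_tilde @ v, x, theta)

# spectral side: same polynomial evaluated on the rescaled eigenvalues
lam_tilde = 2.0 / lmax * lam - 1.0
g = cheb_filter(lambda v: lam_tilde * v, np.ones(3), theta)
spectral = U @ (g * (U.T @ x))
```

The spatial side touches only (sparse, in practice) matrix-vector products with the rescaled Laplacian, while the spectral side needs the full eigen-decomposition; both give the same result.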
Connection between GCN and spectral convolutions
When we use a single shared parameter $\theta = \theta'_0 = -\theta'_1$, we are assigning the same weight to the node itself and to all of its 1st-order neighbors (and, with stacked layers, uniformly to 2nd-order neighbors as well):

$$g_\theta \star x \approx \theta \left( I_N + D^{-1/2} A D^{-1/2} \right) x$$

If we used different $\theta'_k$ per order, we would assign different weights to different hop neighbors.
Renormalization Trick: With $I_N + D^{-1/2} A D^{-1/2}$, the eigenvalues are in the range [0, 2]. Repeated application of an operator with eigenvalues greater than 1 amplifies the signal, and eigenvalues much smaller than 1 shrink it, so numerical instabilities and exploding/vanishing gradients may occur even with only two layers. With the renormalization trick

$$I_N + D^{-1/2} A D^{-1/2} \;\rightarrow\; \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}, \quad \tilde{A} = A + I_N,\ \tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$$

the eigenvalues lie in $(-1, 1]$ and the maximum eigenvalue is exactly 1. So we avoid the instability.
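A quick numerical check of the eigenvalue ranges before and after the trick, on an assumed 3-node path graph:

```python
import numpy as np

# assumed toy path graph 0 - 1 - 2
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
I = np.eye(3)

# operator before the trick: I + D^{-1/2} A D^{-1/2}
D_inv_sqrt = np.diag(A.sum(axis=1) ** -0.5)
M = I + D_inv_sqrt @ A @ D_inv_sqrt
eig_before = np.linalg.eigvalsh(M)        # spans [0, 2]

# operator after the trick: D~^{-1/2} (A + I) D~^{-1/2}
A_tilde = A + I
Dt_inv_sqrt = np.diag(A_tilde.sum(axis=1) ** -0.5)
M_hat = Dt_inv_sqrt @ A_tilde @ Dt_inv_sqrt
eig_after = np.linalg.eigvalsh(M_hat)     # stays within (-1, 1]
```

Before the trick the largest eigenvalue is 2, so powers of the operator blow up; after the trick it is exactly 1, so repeated application stays bounded.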
Since this still gives us an $N \times C$ matrix, we add a weight matrix ($\Theta \in \mathbb{R}^{C \times F}$) to linearly transform the result to $N \times F$:

$$Z = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} X \Theta$$
GCN gives equal importance to the self node and the neighboring nodes. It also does not allow different weights for different neighbors, which is allowed in a CNN.
Since GCN is a graph convolution in the spectral domain, the graph structure is fixed. Spatial-domain convolutions, on the other hand, allow for graph modifications.
Although GCN considers node features, it still relies heavily on node locality.
What is the dimensionality of , and how are the rows and columns of the matrix calculated? Link to Answer
What is the time complexity of computing the eigen-decomposition? Link to Answer
What is the relation between  and GCN? Link to Answer
Prove . Link to Answer
Prove that the complexity of  is . Link to Answer
Describe the intuition: why can GCN alleviate the problem of overfitting on local neighborhood structures for graphs with very wide node degree distributions? Link to Answer
Why is $\lambda_{max} \approx 2$ a valid approximation?
Probably because the eigenvalues of $D^{-1/2} A D^{-1/2}$ lie in $[-1, 1]$, so the eigenvalues of $L = I_N - D^{-1/2} A D^{-1/2}$ lie in the range $[0, 2]$. We need this rescaling because the Chebyshev polynomials need input in the range $[-1, 1]$.
Prove . Link to Answer
Why does repeated application of $I_N + D^{-1/2} A D^{-1/2}$ result in numerical instability? Link to Answer
Why does renormalization trick help? Link to Answer
Show in detail how to apply GCN for graph classification and supervised learning problems. Link to Answer
Derive time and memory complexity of Link to Answer
Why use 2 layers in GCN? Link to Answer
Explain L-2 regularization. Link to Answer
- Thomas N Kipf and Max Welling, Semi-supervised classification with Graph Convolutional Networks, ICLR 2017
- Notes by Prof. Feng Chen: https://personal.utdallas.edu/~fxc190007/courses/20S-7301/GCN-notes.pdf