Harsha Kokel
Semi-supervised Classification with Graph Convolutional Networks
Thomas N. Kipf and Max Welling, ICLR 2017
Kipf and Welling (2017) introduce Graph Convolutional Networks (GCNs), which use the features of each node and leverage the edges of the graph to derive class similarity between nodes in a semi-supervised setting.
Traditionally, semi-supervised learning on graph-structured data heavily relied on the assumption that the edges of the graph represent class similarities (i.e. nodes with similar classes have an edge between them). For example, in an image segmentation task, an image can be thought of as a grid graph (a node for every pixel and edges between neighboring pixels, as shown in the figure below). Such a representation of an image encodes the assumption that neighboring pixels belong to the same class (hence the edge between them).
With this assumption, the semi-supervised learning task is posed as an optimization problem with the following loss function:

$$\mathcal{L} = \mathcal{L}_0 + \lambda \mathcal{L}_{reg}, \qquad \mathcal{L}_{reg} = \sum_{i,j} A_{ij}\,\|f(X_i) - f(X_j)\|^2 = f(X)^\top \Delta f(X)$$
Here $\mathcal{L}_0$ is some form of cross-entropy loss on the supervised examples and $\mathcal{L}_{reg}$ is a regularizing term that penalizes differences between the labels of connected nodes. $f(X)^\top \Delta f(X)$ is the graph Laplacian quadratic form, with $\Delta = D - A$ the unnormalized graph Laplacian.
Graph Convolutional Networks
GCN proposes a way to do graph convolutions using the following layer-wise propagation rule:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$

where $\tilde{A} = A + I_N$ is the adjacency matrix with added self-loops, $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, $W^{(l)}$ is a layer-specific trainable weight matrix, $\sigma$ is a nonlinearity, and $H^{(0)} = X$.
The paper proposes a two-layer Graph Convolutional Network, which amounts to the following model:

$$Z = \mathrm{softmax}\left(\hat{A}\,\mathrm{ReLU}\left(\hat{A} X W^{(0)}\right) W^{(1)}\right), \qquad \hat{A} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$$
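As a concrete sketch, the two-layer forward pass can be written in a few lines of NumPy. This is a minimal illustration with a toy graph and random weights; the helper names, shapes, and data are all made up for the example and are not from the paper:

```python
import numpy as np

def normalize_adj(A):
    # A_hat = D~^{-1/2} (A + I) D~^{-1/2}  (renormalized adjacency)
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    return np.diag(1.0 / np.sqrt(d)) @ A_tilde @ np.diag(1.0 / np.sqrt(d))

def gcn_forward(A, X, W0, W1):
    # Z = softmax(A_hat . ReLU(A_hat X W0) W1)
    A_hat = normalize_adj(A)
    H = np.maximum(A_hat @ X @ W0, 0.0)            # first layer: ReLU
    logits = A_hat @ H @ W1                        # second layer
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)        # row-wise softmax

# toy graph: 4 nodes on a path, 3 input features, 2 classes
rng = np.random.default_rng(0)
A = np.array([[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]], dtype=float)
X = rng.standard_normal((4, 3))
Z = gcn_forward(A, X, rng.standard_normal((3, 8)), rng.standard_normal((8, 2)))
print(Z.shape)        # (4, 2); each row is a probability distribution
```

Each row of `Z` is one node's class distribution, produced from its own features and those of its (up to 2nd-order) neighbors.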
The ReLU activation function is used in the first layer because the gradient of ReLU is 1 for positive inputs, so across multiple layers the gradients do not vanish (tend to zero), as they can with other nonlinear functions such as sigmoid or tanh. Softmax is used in the top layer because the final outputs expected are class probabilities.
GCN uses gradient descent to learn the weight matrices $W^{(0)}$ and $W^{(1)}$ that minimize the following cross-entropy error over all labeled nodes ($\mathcal{Y}_L$):

$$\mathcal{L} = -\sum_{l \in \mathcal{Y}_L} \sum_{f=1}^{F} Y_{lf} \ln Z_{lf}$$
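A sketch of this masked cross-entropy in NumPy (the helper and data are hypothetical; the key point is that only rows in `labeled_idx` contribute to the loss):

```python
import numpy as np

def masked_cross_entropy(Z, Y, labeled_idx):
    # L = -sum_{l in Y_L} sum_f Y_lf ln Z_lf  (only labeled nodes contribute)
    Z_l = np.clip(Z[labeled_idx], 1e-12, 1.0)   # clip to avoid log(0)
    return -np.sum(Y[labeled_idx] * np.log(Z_l))

Z = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])  # predicted class probs
Y = np.array([[1, 0], [0, 1], [0, 0]])              # one-hot; node 2 unlabeled
loss = masked_cross_entropy(Z, Y, labeled_idx=[0, 1])
print(round(loss, 4))   # -(ln 0.9 + ln 0.8) = 0.3285
```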
The full loss function is then a combination of the cross-entropy loss on the supervised labels and some regularization.
In the paper, the authors observe that L2-regularization of the first-layer weight matrix alone is sufficient.
Regularization is used to avoid overfitting by penalizing the weight matrices of the hidden layers. L2-regularization in particular adds the squared 2-norm of the weight matrix as a penalty, which pushes the weights toward zero.
Connection between CNN and GCN
In a CNN, the input feature map (blue grid in the image below) is convolved with a discrete kernel to produce the output feature map (green grid). This can be seen as a message-passing algorithm: the messages from the neighboring pixels and from the pixel itself are multiplied with the kernel weights, and each entry of the output feature map is obtained by summing up these messages.
GCN similarly generates the output feature map ($H^{(l+1)}$) by convolving the input feature map ($H^{(l)}$) with the weight matrix ($W^{(l)}$).
This can be seen as a message-passing algorithm where the messages, or input features ($H^{(l)}$), of each node are multiplied by the weights ($W^{(l)}$), and the output feature map ($H^{(l+1)}$) is generated by summing the weighted messages of all the neighboring nodes along with the message from the node itself (note: the self-loop in $\tilde{A} = A + I_N$ takes care of the message from the node itself).

In a CNN the number of neighbors for each node is fixed by the kernel size, so the number of messages received by every node in the output layer is the same. In a GCN, since the structure of the graph is dictated by the adjacency matrix, the number of messages received at each node varies with its degree. Hence the need for normalization (i.e., $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$) arises in GCN but not in CNN.
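The effect of the normalization can be seen on a toy star graph, where one node has degree 4 and the rest degree 1 (a made-up example for illustration): without normalization, aggregated messages scale with degree; with symmetric normalization they are rescaled by the degrees.

```python
import numpy as np

# star graph: node 0 connected to nodes 1..4 (degree 4 vs degree 1)
n = 5
A = np.zeros((n, n)); A[0, 1:] = 1; A[1:, 0] = 1
A_tilde = A + np.eye(n)                  # add self-loops
x = np.ones((n, 1))                      # constant feature on every node

# unnormalized aggregation: message count = degree + 1
print((A_tilde @ x).ravel())             # [5. 2. 2. 2. 2.]

# symmetric normalization D~^{-1/2} A~ D~^{-1/2}:
# entry (i, j) becomes A~_ij / sqrt(d_i d_j)
d = A_tilde.sum(axis=1)
A_hat = A_tilde / np.sqrt(np.outer(d, d))
print((A_hat @ x).ravel())               # degree-rescaled, no 5x blow-up
```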

Although theoretically there is no limitation on the number of convolution layers, the GCN paper proposes a two-layer network. Since every layer convolves over the immediate neighbors, $k$ layers effectively convolve over $k$-th order neighbors, so the proposed GCN convolves only up to 2nd-order neighbors. The empirical evaluations suggest that 2nd-order neighborhoods are enough for most domains.

In the traditional approach, since the regularizer $f(X)^\top \Delta f(X)$ is a function of the adjacency matrix, nodes with dense neighborhoods incur a high penalty, and hence the model overfits the local neighborhoods of such nodes (and consequently underfits the nodes with sparse neighborhoods). Normalization of the adjacency matrix helps alleviate this problem.
Extending GCN to graph classification and supervised learning tasks
The GCN formulation can be leveraged for the graph classification problem, where the task is to predict a single variable for the whole graph. The adjacency matrices of the individual graphs are concatenated into a sparse block-diagonal matrix $A$ as shown in the figure below. This adjacency matrix and the stacked feature matrix $X \in \mathbb{R}^{n \times c}$ (n = total number of nodes across all the graphs, c = number of input features) can be used to train the GCN, and the output matrix $Z$ can then be pooled per block to obtain a class label for each graph.
For a supervised learning task, two graphs can be created from the training set (labeled nodes) and the test set (unlabeled nodes). The adjacency matrices of the two graphs can again be concatenated as a block-diagonal matrix and the GCN trained on it. The output matrix $Z$ will then contain the class probabilities for the test set as well.
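A sketch of this batching with SciPy sparse matrices (the two toy graphs are made up; `scipy.sparse.block_diag` builds the combined adjacency with no cross-graph edges):

```python
import numpy as np
from scipy.sparse import block_diag, csr_matrix

# two small graphs: a triangle (3 nodes) and a single edge (2 nodes)
A1 = csr_matrix(np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]]))
A2 = csr_matrix(np.array([[0, 1], [1, 0]]))

# stack the adjacency matrices into one sparse block-diagonal matrix
A = block_diag([A1, A2]).tocsr()    # shape (5, 5)
# stack the per-graph feature matrices the same way
X = np.vstack([np.ones((3, 2)), 2 * np.ones((2, 2))])

print(A.shape)                      # (5, 5)
print(A[0, 3])                      # 0 -> no edges between the two graphs
```

Because the off-diagonal blocks are zero, messages never cross graph boundaries, so one GCN forward pass processes all graphs at once.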
Spectral Graph Convolutions
The convolution theorem states that the convolution of two signals is equivalent to pointwise multiplication in the Fourier domain, i.e.

$$f * g = \mathcal{F}^{-1}\big(\mathcal{F}(f) \odot \mathcal{F}(g)\big)$$

where $\mathcal{F}$ denotes the Fourier transform operator. In signal processing, the spectral transformation usually refers to the transformation from the time to the frequency domain (Fourier domain), but in graph theory the spectral transformation refers to the transformation into the eigenvector basis of the graph Laplacian ($U^\top x$, where $L = U \Lambda U^\top$). So the convolution theorem on graphs becomes:

$$g * x = U\big((U^\top g) \odot (U^\top x)\big)$$
For spectral graph convolution, we directly parameterize the convolution filter in the spectral domain as some function $g_\theta(\Lambda)$ of the eigenvalues. With this filter, spectral convolution on a graph is defined as follows:

$$g_\theta \star x = U g_\theta(\Lambda) U^\top x$$
Note the difference between the convolution symbol ($*$) and the spectral convolution symbol used in the paper ($\star$): the $\star$ signifies that the filter $g_\theta$ is already defined in the spectral (eigenvalue) domain.
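To make the operation concrete, here is a small NumPy sketch of $g_\theta \star x = U g_\theta(\Lambda) U^\top x$ on a 4-node cycle. The filter $e^{-\lambda}$ is an arbitrary choice for illustration, not from the paper:

```python
import numpy as np

# normalized graph Laplacian of a 4-node cycle: L = I - D^{-1/2} A D^{-1/2}
n = 4
A = np.array([[0,1,0,1],[1,0,1,0],[0,1,0,1],[1,0,1,0]], dtype=float)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
L = np.eye(n) - D_inv_sqrt @ A @ D_inv_sqrt

# eigendecomposition L = U Lambda U^T (U = graph Fourier basis)
lam, U = np.linalg.eigh(L)

x = np.array([1.0, 0.0, 0.0, 0.0])   # a signal on the nodes
g = np.exp(-lam)                     # a filter defined on the eigenvalues
y = U @ (g * (U.T @ x))              # g_theta * x = U g(Lambda) U^T x
print(y.round(3))
```

With the identity filter ($g(\lambda) = 1$), the operation returns `x` unchanged, since $U U^\top = I$.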
$g_\theta(\Lambda)$ can be approximated by a truncated expansion in terms of Chebyshev polynomials $T_k$ as:

$$g_{\theta'}(\Lambda) \approx \sum_{k=0}^{K} \theta'_k T_k(\tilde{\Lambda}), \qquad \tilde{\Lambda} = \frac{2}{\lambda_{max}} \Lambda - I_N$$
The Chebyshev polynomial $T_k(\tilde{\Lambda})$ is a matrix with the same dimensions as $\Lambda$ (i.e. $n \times n$, where n = number of nodes). Since $\tilde{\Lambda}$ is diagonal, its entries are obtained by applying the Chebyshev polynomial definition element-wise, via the recurrence $T_k(x) = 2x\,T_{k-1}(x) - T_{k-2}(x)$ with $T_0(x) = 1$ and $T_1(x) = x$.
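The recurrence carries over directly to matrix arguments; a minimal sketch (the 3-node path graph and the assumption $\lambda_{max} = 2$ for the rescaling are just for illustration):

```python
import numpy as np

def chebyshev_matrices(L_tilde, K):
    # T_0 = I, T_1 = L~, T_k = 2 L~ T_{k-1} - T_{k-2}
    n = L_tilde.shape[0]
    T = [np.eye(n), L_tilde]
    for _ in range(2, K + 1):
        T.append(2 * L_tilde @ T[-1] - T[-2])
    return T[: K + 1]

# rescaled Laplacian of a 3-node path, assuming lambda_max = 2
A = np.array([[0,1,0],[1,0,1],[0,1,0]], dtype=float)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(1)))
L = np.eye(3) - D_inv_sqrt @ A @ D_inv_sqrt
L_tilde = L - np.eye(3)            # (2/lambda_max) L - I with lambda_max = 2

T = chebyshev_matrices(L_tilde, K=2)
print(len(T), T[2].shape)          # 3 (3, 3)
```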
Substituting this approximation into the spectral convolution, we get:

$$g_{\theta'} \star x \approx U \sum_{k=0}^{K} \theta'_k T_k(\tilde{\Lambda})\, U^\top x$$
Observe that $L = U \Lambda U^\top$ and $(U \Lambda U^\top)^k = U \Lambda^k U^\top$, and hence, since $T_k$ is a polynomial,

$$U\, T_k(\tilde{\Lambda})\, U^\top = T_k(\tilde{L}), \qquad \tilde{L} = \frac{2}{\lambda_{max}} L - I_N$$
Using the above two results, we can approximate the convolution without needing the eigendecomposition, directly using the graph Laplacian:

$$g_{\theta'} \star x \approx \sum_{k=0}^{K} \theta'_k T_k(\tilde{L})\, x$$
So we effectively reduced the complexity of the convolution from $O(n^3)$ to $O(|E|)$, i.e. from cubic in the number of nodes (the cost of the eigendecomposition) to linear in the number of edges. Since $\tilde{L}$ has $|E|$ nonzero elements, $T_k(\tilde{L})\,x$ can be computed with repeated sparse matrix–vector multiplications, each in $O(|E|)$.
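The equivalence can be checked numerically: filtering via the eigendecomposition and filtering via the Chebyshev recurrence on $\tilde{L}$ give the same result. The 6-node cycle and the coefficients $\theta'$ below are arbitrary toy choices:

```python
import numpy as np

n = 6
A = np.zeros((n, n))
for i in range(n):                          # 6-node cycle graph
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(1)))
L = np.eye(n) - D_inv_sqrt @ A @ D_inv_sqrt

lam, U = np.linalg.eigh(L)
lam_max = lam.max()
L_tilde = (2.0 / lam_max) * L - np.eye(n)   # rescaled Laplacian
lam_tilde = (2.0 / lam_max) * lam - 1.0     # its eigenvalues

theta = np.array([0.5, -1.0, 0.25])         # K = 2 Chebyshev coefficients
x = np.arange(n, dtype=float)               # a signal on the nodes

# (a) spectral route: U (sum_k theta_k T_k(Lambda~)) U^T x
def cheb_scalar(vals, K):
    T = [np.ones_like(vals), vals]
    for _ in range(2, K + 1):
        T.append(2 * vals * T[-1] - T[-2])
    return T

g = sum(c * Tk for c, Tk in zip(theta, cheb_scalar(lam_tilde, 2)))
y_spectral = U @ (g * (U.T @ x))

# (b) direct route: sum_k theta_k T_k(L~) x, no eigendecomposition,
# using the recurrence on vectors: t_k = 2 L~ t_{k-1} - t_{k-2}
t0, t1 = x, L_tilde @ x
y_direct = theta[0] * t0 + theta[1] * t1 + theta[2] * (2 * L_tilde @ t1 - t0)

print(np.allclose(y_spectral, y_direct))    # True
```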
Connection between GCN and spectral convolutions
GCN can be seen as an approximation of this spectral convolution with $K = 1$, $\lambda_{max} \approx 2$, and a single parameter $\theta = \theta'_0 = -\theta'_1$:

$$g_\theta \star x \approx \theta'_0 x + \theta'_1 (L - I_N)\,x = \theta\left(I_N + D^{-\frac{1}{2}} A D^{-\frac{1}{2}}\right) x$$
If we apply this convolution repeatedly over multiple layers, the values can keep growing, since the eigenvalues of $I_N + D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ lie in the range $[0, 2]$. The renormalization trick is used to avoid this.
Since GCN uses $K = 1$ and approximates $\lambda_{max} \approx 2$, which reduces the number of parameters, we can intuitively expect that GCN will be less likely to overfit the graph.
When we use the single parameter $\theta = \theta'_0 = -\theta'_1$, we assign the same weight to all the 1st- and 2nd-order neighbors. If we used different $\theta'_k$, we would assign different weights to different-hop neighbors.
Renormalization Trick: With $I_N + D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$, the eigenvalues are in the range $[0, 2]$. Repeatedly applying an operator with eigenvalues larger than 1 can make activations and gradients explode, while eigenvalues much smaller than 1 can make them vanish, even with only two layers. With the renormalization trick $I_N + D^{-\frac{1}{2}} A D^{-\frac{1}{2}} \rightarrow \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ (where $\tilde{A} = A + I_N$ and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$), the eigenvalues are bounded in magnitude by 1, with the maximum eigenvalue exactly 1. So we avoid the instability.
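The eigenvalue ranges can be checked numerically. On an 8-node cycle (a toy example), the pre-renormalization operator reaches the upper bound 2, while the renormalized operator tops out at 1:

```python
import numpy as np

n = 8
A = np.zeros((n, n))
for i in range(n):                     # 8-node cycle graph
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1

# before the trick: I + D^{-1/2} A D^{-1/2}
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(1)))
pre = np.eye(n) + D_inv_sqrt @ A @ D_inv_sqrt

# after the trick: D~^{-1/2} A~ D~^{-1/2} with A~ = A + I
A_tilde = A + np.eye(n)
Dt_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(1)))
post = Dt_inv_sqrt @ A_tilde @ Dt_inv_sqrt

print(np.linalg.eigvalsh(pre).max().round(3))    # 2.0  (can explode)
print(np.linalg.eigvalsh(post).max().round(3))   # 1.0  (bounded)
```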
With the renormalized operator, the convolution becomes:

$$g_\theta \star x \approx \theta\,\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}\, x$$
Since this still produces a single output channel, we generalize to a signal $X \in \mathbb{R}^{n \times c}$ with $c$ input channels and add a weight matrix $\Theta \in \mathbb{R}^{c \times f}$ to linearly transform the result to $f$ output feature maps.
So,

$$Z = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} X \Theta$$
With $\hat{A} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$, we have $Z = \hat{A} X \Theta$, which is the graph convolution. The complexity of the sparse–dense matrix multiplication $\hat{A}(X\Theta)$ is $O(|E| f)$, since $\hat{A}$ is a sparse matrix with $|E|$ nonzero elements – not sure how the paper arrives at its stated $O(|E| f c)$.
Limitations

GCN gives equal importance to the node itself and to its neighboring nodes. It also does not allow assigning different weights to different neighbors, which is possible in a CNN.

GCN being a graph convolution in the spectral domain, the graph structure is fixed. Spatial-domain convolutions, on the other hand, allow for graph modifications.

Although GCN considers node features, it still relies heavily on node locality.
Questions

What is the dimensionality of , how to calculate row and column of matrix ? Link to Answer

What is the time complexity of computing the eigendecomposition? Link to Answer

What is the relation between and GCN? Link to Answer

Prove . Link to Answer

Prove complexity of is . Link to Answer

Describe the intuition, why GCN can alleviate the problem of overfitting on local neighborhood structures for graphs with very wide node degree distributions? Link to Answer

Why is a valid approximation?
Probably because the eigenvalues of the normalized Laplacian $L$ lie in the range $[0, 2]$, so $\lambda_{max} \le 2$. We need this rescaling because the Chebyshev polynomials need inputs in the range $[-1, 1]$.

Prove . Link to Answer

Why does repeated application of result in numerical instability? Link to Answer

Why does renormalization trick help? Link to Answer

Show in detail how to apply GCN for graph classification and supervised learning problem? Link to Answer

Derive time and memory complexity of Link to Answer

Why use 2 layers in GCN? Link to Answer

Explain L2 regularization. Link to Answer
References
 Thomas N. Kipf and Max Welling, Semi-supervised Classification with Graph Convolutional Networks, ICLR 2017
 Notes by Prof. Feng Chen: https://personal.utdallas.edu/~fxc190007/courses/20S7301/GCNnotes.pdf
 https://tkipf.github.io/graph-convolutional-networks/
 https://www.inference.vc/how-powerful-are-graph-convolutions-review-of-kipf-welling-2016-2/
 https://dtsbourg.me/thoughts/posts/semi-supervised-classification-gcn
 https://medium.com/@BorisAKnyazev