Convolutional neural networks (CNNs) have been extremely successful in the domain of Computer Vision and Image Analysis. Can we apply them to graph data too?

That’s the question which the emerging field of Deep Learning on Graphs aims to answer. We extend convolution to graphs *by analogy,* using ideas from Spectral Graph Theory.

I’ll aim to give a short overview of the spectral derivation of Graph Convolution. I’d recommend that anyone interested in this topic watch this excellent NIPS 2017 tutorial, from which I have borrowed many ideas and much of the notation.

I’ll start by introducing a few seemingly unconnected ideas and later tie them all together.

The convolution of two functions, $(f * g)$, is defined as follows. Note that the convolution is also a function (of $t$):

$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$$

We can picture this (in a discrete form) as “inverting” $g$, placing it over $f$ such that $g(t - \tau)$ overlaps with $f(\tau)$, multiplying the overlapping values and adding them up to get the value of the convolution at that value of $t$. Then we move to the next value of $t$ and repeat. The following image illustrates this.

Of course, this is only the simplified discrete case where the integral is replaced by a summation. But this is also the form we are used to seeing in Convolutional Neural Networks. One function is the convolutional “filter”, and the other function represents the matrix of input values. In that case, the filter moves in **two** dimensions, not just one.
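To make the discrete case concrete, here is a toy NumPy sketch of the flip-and-slide computation (the built-in `np.convolve` computes the same thing):

```python
import numpy as np

def conv1d(f, g):
    """Direct discrete convolution: (f*g)[t] = sum_k f[k] * g[t-k]."""
    n = len(f) + len(g) - 1
    out = np.zeros(n)
    for t in range(n):
        for k in range(len(f)):
            if 0 <= t - k < len(g):
                out[t] += f[k] * g[t - k]
    return out

f = np.array([1.0, 2.0, 3.0])
g = np.array([0.0, 1.0, 0.5])
# Matches NumPy's built-in full convolution.
print(np.allclose(conv1d(f, g), np.convolve(f, g)))  # True
```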

Now let’s talk about the **Laplacian operator**, $\Delta$. It takes a function, $f$, as input, and returns another function, $\Delta f$, given by

$$\Delta f = \sum_i \frac{\partial^2 f}{\partial x_i^2}$$

$\Delta f(x)$ indicates how fast the average value of $f$ changes as we move away from $x$.

Now by *analogy*, if we have a function $f$ on an (unweighted) graph, the unnormalized **Graph Laplacian** at a particular vertex $i$ is simply the total difference between the node and its neighbours:

$$(\Delta f)(i) = \sum_{j \in \mathcal{N}(i)} \left( f(i) - f(j) \right)$$

Further, if the function is represented as a vector $\mathbf{f}$ (one entry per vertex), then the Laplacian operator can be represented as a matrix $L$:

$$L = D - A$$

where $D$ is the diagonal degree matrix and $A$ is the adjacency matrix.
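As a quick sanity check on these definitions, here is a NumPy sketch that builds $L = D - A$ for a small path graph and verifies that applying $L$ to a vector computes neighbour differences:

```python
import numpy as np

# Unnormalized graph Laplacian L = D - A for the path graph 0-1-2.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))   # degree matrix
L = D - A

# (L f)[i] = sum over neighbours j of (f[i] - f[j])
f = np.array([1.0, 4.0, 9.0])
print(L @ f)   # [-3. -2.  5.]
```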

A function $f$ can be expressed in its Fourier series form as

$$f(x) = \sum_{k} \hat{f}(k)\, e^{ikx}$$

This can also be expressed as an inner product (we need to take the complex conjugate of $e^{ikx}$):

$$\hat{f}(k) = \langle f, e^{ikx} \rangle = \int f(x)\, e^{-ikx}\, dx$$

The functions $e^{ikx}$ are known as **Fourier basis** functions. And as it happens, the Fourier basis functions are **Laplacian eigenfunctions**. That means that when you apply the Laplacian operator to a Fourier basis function, you get back the same function times a constant:

$$\Delta\, e^{ikx} = -k^2\, e^{ikx}$$

In the case of a graph, the same holds. Since we represent the Laplacian as a *matrix* that takes a *vector* as input, instead of eigenfunctions we have **eigenvectors**. So $L u_k = \lambda_k u_k$, and

$$\mathbf{f} = \sum_k \hat{f}(k)\, u_k$$

Converting $\mathbf{f}$ to a vector of Fourier coefficients is called taking the **Fourier transform** of $\mathbf{f}$, and is denoted by $\hat{\mathbf{f}}$. In matrix-vector form,

$$\hat{\mathbf{f}} = U^{-1}\, \mathbf{f}$$

$U$ is the matrix whose $k$-th column is the eigenvector $u_k$.
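A small NumPy sketch of the graph Fourier transform, using `eigh` for the symmetric Laplacian (the path graph here is just an illustrative choice):

```python
import numpy as np

# Path graph 0-1-2: build its unnormalized Laplacian.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A

# For a symmetric Laplacian, eigh returns orthonormal eigenvectors:
# the columns of U form the graph Fourier basis.
lam, U = np.linalg.eigh(L)

f = np.array([1.0, 2.0, 0.5])
f_hat = U.T @ f          # graph Fourier transform
f_rec = U @ f_hat        # inverse transform recovers f
print(np.allclose(f_rec, f))  # True
```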

So now that we have these pieces in place – the formal definition of convolution, the Graph Laplacian, and the Fourier transform on graphs, we can put them together to arrive at a definition of convolution on a general graph.

And the thing that ties them all together is the Convolution Theorem. It states that convolution can be computed in the Fourier domain as follows:

$$f * g = \mathcal{F}^{-1}\left( \mathcal{F}(f) \odot \mathcal{F}(g) \right)$$

where $\odot$ denotes element-wise multiplication.

So all we need to do to convolve $f$ and $g$ is to take their individual Fourier transforms, multiply them element-wise, and take the inverse Fourier transform of the result.

Here is where a key assumption comes in – so far, we have made no assumptions on whether the graph is directed or undirected. But now we will restrict ourselves to **undirected** graphs, where the adjacency matrix, and hence the Laplacian, are **symmetric**.

So the Laplacian eigenvectors are all mutually orthogonal, and hence $U U^T = I$, i.e. $U^{-1} = U^T$. So the convolution becomes simply:

$$f * g = U \left( (U^T f) \odot (U^T g) \right)$$

The element-wise product can be replaced by a conventional matrix multiplication, by converting the vector $U^T g$ into a diagonal matrix:

$$f * g = U\, W\, U^T f$$

$W$ is a learnable diagonal matrix of ‘filter’ coefficients. So we finally have a way to perform convolutions on general graphs! But as you might already have noticed, there are a few significant problems with this approach.
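Putting the pieces together, a minimal NumPy sketch of the spectral convolution $U W U^T f$ (with the filter stored as a plain vector of diagonal entries; `spectral_conv` is an illustrative name):

```python
import numpy as np

def spectral_conv(L, f, w):
    """f * g = U diag(w) U^T f, where w holds the spectral filter
    coefficients (one per Laplacian eigenvalue)."""
    lam, U = np.linalg.eigh(L)
    return U @ (w * (U.T @ f))

# Triangle graph as a toy example.
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
f = np.array([1.0, -1.0, 2.0])

# An all-ones filter is the identity operation.
print(np.allclose(spectral_conv(L, f, np.ones(3)), f))  # True
```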

For starters, the filters that we learn are dependent on the eigenvectors. The same filters can’t be applied to another graph without significant “distortions”.

Further, there’s no guarantee that these filters are localized in space. A traditional convolutional filter is applied on a small, connected patch of an image, but in our above formulation of the graph convolution, the filter can be applied across many disconnected portions of a graph.

**Localization** in space corresponds to **smoothness** in the frequency domain. If we want to have filters that are localized on the graph, we need to have smooth filters in the frequency domain.

The easiest way to ensure this is to parameterize the filter using a smooth **spectral transfer function** $\tau_\theta(\Lambda)$ with learnable parameters $\theta$, and that is precisely what Defferrard et al. (2016) [1] propose. Here $\Lambda$ is the diagonal matrix of Laplacian eigenvalues, so the filtered signal is $U\, \tau_\theta(\Lambda)\, U^T f$.

Specifically, Defferrard et al. use a family of polynomials called Chebyshev polynomials to parameterize $\tau_\theta$. The Chebyshev polynomials are given by the recurrence $T_k(x) = 2x\, T_{k-1}(x) - T_{k-2}(x)$, with $T_0(x) = 1$ and $T_1(x) = x$. So we take the first $K$ polynomials to get the following truncated expression for the filters:

$$\tau_\theta(\tilde{\Lambda}) = \sum_{k=0}^{K-1} \theta_k\, T_k(\tilde{\Lambda})$$

$\theta_k$ are learnable Chebyshev coefficients, and $\tilde{\Lambda}$ is a diagonal matrix of eigenvalues, scaled down such that each eigenvalue lies in the range $[-1, 1]$.

The Laplacian matrix can be written in its eigen-decomposed form as

$$L = U \Lambda U^T$$

And since $U^T U = I$, we can easily see that

$$L^k = U \Lambda^k U^T$$

So our final expression for the convolution is:

$$f * g = \sum_{k=0}^{K-1} \theta_k\, T_k(\tilde{L})\, f$$

Here, $\tilde{L}$ represents the scaled-down Laplacian, i.e., $\tilde{L} = \frac{2L}{\lambda_{max}} - I$.
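The truncated Chebyshev filter can be sketched directly in NumPy using the recurrence, so that no eigendecomposition is needed at filtering time (`cheb_conv` and its arguments are illustrative names; $\lambda_{max}$ is computed here for simplicity):

```python
import numpy as np

def cheb_conv(L, f, theta):
    """Apply sum_k theta[k] * T_k(L_scaled) to f via the Chebyshev
    recursion T_k(x) = 2x*T_{k-1}(x) - T_{k-2}(x), T_0 = 1, T_1 = x."""
    n = L.shape[0]
    lmax = np.linalg.eigvalsh(L).max()
    L_s = 2.0 * L / lmax - np.eye(n)     # eigenvalues mapped into [-1, 1]
    t_prev, t_curr = f, L_s @ f          # T_0 f and T_1 f
    out = theta[0] * t_prev
    if len(theta) > 1:
        out = out + theta[1] * t_curr
    for k in range(2, len(theta)):
        t_next = 2.0 * (L_s @ t_curr) - t_prev
        out = out + theta[k] * t_next
        t_prev, t_curr = t_curr, t_next
    return out

A = np.array([[0, 1], [1, 0]], dtype=float)   # toy 2-node graph
L = np.diag(A.sum(axis=1)) - A
f = np.array([3.0, -1.0])
# theta = [1, 0, 0] makes the filter the identity.
print(np.allclose(cheb_conv(L, f, [1.0, 0.0, 0.0]), f))  # True
```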

Note that this expression depends on the K-th power of the Laplacian, and hence the computed value at a node can depend only on nodes that are in its K-hop neighbourhood.

Kipf and Welling (2017) [2] take the idea discussed in the previous section and simplify it even further, truncating the Chebyshev expansion to just the first two terms, $\theta_0 + \theta_1 \tilde{L}$.

Their rationale is that although these are just linear functions of the Graph Laplacian, multiple ‘stacked’ layers of these interspersed by non-linearities (much like conventional neural networks) can express a broad class of functions. Further, these layered linear functions with non-linearities are not limited by the Chebyshev polynomial parameterization, thus potentially giving them **more** expressive power. They call their model “Graph Convolutional Networks”, or GCNs for short.

I’m not going to go into the details of further assumptions and simplification that they make, but suffice to say that they end up with the following expression for the convolution:

$$f * g = \tilde{D}^{-1/2}\, \tilde{A}\, \tilde{D}^{-1/2}\, f\, \theta$$

$\tilde{A} = A + I$ (self-loops are enforced), and $\tilde{D}$ is the degree matrix corresponding to the new adjacency matrix $\tilde{A}$.

Here, they introduce an assumption: the function does not just assign a real value to each vertex, but rather assigns a *feature vector* to each vertex. So $f$ can be represented by a matrix $X$ of size $N \times C$, where each row contains the $C$-dimensional feature vector for a particular node.

So accordingly, the equation now becomes:

$$Z = \tilde{D}^{-1/2}\, \tilde{A}\, \tilde{D}^{-1/2}\, X\, W$$

$W$ is a $C \times F$ matrix of learnable filter parameters, and $Z$ is the $N \times F$ convolved output.

This has a nice, simple interpretation – we first multiply each node’s feature vector with $W$ to get a new feature vector of size $F$ for each node.

Then, we multiply by the normalized adjacency matrix $\tilde{D}^{-1/2}\, \tilde{A}\, \tilde{D}^{-1/2}$. This corresponds to **averaging** the immediate neighbours of a node, to get the new vector for that node. Note that the new adjacency matrix has enforced self-loops, so the node’s own feature vector will always be included in the aggregation.

In this fashion, we can build more Graph Convolutional Networks by convolving, adding a non-linearity such as a ReLU, and repeating this process. For instance, a simple 2-layer GCN could be given by

$$Z = \hat{A}\; \mathrm{ReLU}\!\left( \hat{A} X W^{(0)} \right) W^{(1)}$$

where $\hat{A} = \tilde{D}^{-1/2}\, \tilde{A}\, \tilde{D}^{-1/2}$.
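A minimal NumPy sketch of this two-layer forward pass (the final softmax is omitted for brevity; shapes and function names are illustrative):

```python
import numpy as np

def normalized_adjacency(A):
    """A_hat = D~^{-1/2} (A + I) D~^{-1/2}, with self-loops enforced."""
    A_t = A + np.eye(A.shape[0])
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_t.sum(axis=1)))
    return d_inv_sqrt @ A_t @ d_inv_sqrt

def gcn_two_layer(A_hat, X, W0, W1):
    """Z = A_hat @ relu(A_hat @ X @ W0) @ W1 (softmax omitted)."""
    H = np.maximum(A_hat @ X @ W0, 0.0)
    return A_hat @ H @ W1

rng = np.random.default_rng(0)
A = np.array([[0, 1], [1, 0]], dtype=float)   # toy 2-node graph
X = rng.normal(size=(2, 4))                   # 4 input features per node
W0 = rng.normal(size=(4, 8))                  # hidden size 8
W1 = rng.normal(size=(8, 3))                  # 3 output classes

Z = gcn_two_layer(normalized_adjacency(A), X, W0, W1)
print(Z.shape)  # (2, 3)
```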

GCNs have shown state-of-the-art performance in semi-supervised classification on graphs, where the task is to assign a label to each vertex given a small set of pre-labelled nodes. The following diagram (taken from Kipf and Welling, 2017) illustrates the architecture of a GCN (left), and the embeddings produced by a GCN trained for a semi-supervised classification task.

There have been multiple variants of GCNs proposed in follow-up works, which build on the simpler models described here. I hope to talk about some of those variants in future blogs.

Note – I’ve edited this post to focus only on adversarial *attacks*, and I’ve moved the stuff about defences into a follow-up post.

Szegedy et al. in 2014 [1] made an interesting discovery – neural networks could be forced to misclassify an image by applying a hardly perceptible perturbation to the image! For example, the images in the left column (below) are all classified correctly, whereas the images in the right column are all classified as an “Ostrich”! The middle column represents the hardly perceptible noise added to the image on the left.

It was clear that these “adversarial examples”, as they termed them, represented fundamental blind spots in several state-of-the-art neural network models, and for a while it was not clear why these existed.

Goodfellow et al. [2] in a follow-up work in 2015 attempted to explain this behaviour as arising **not** from the extreme non-linearity of neural networks, as would be expected, but rather from the *linearity* of their individual layers. For example, consider a simple dot product between a weight vector and an input feature vector, $w^T x$. Now if we were to perturb the input as $\tilde{x} = x + \eta$, then the resulting output would be:

$$w^T \tilde{x} = w^T x + w^T \eta$$

If $x$ has a large number of dimensions, then many infinitesimal changes in the individual elements can add up to a large change in the output. That is, even if each element of $\eta$ is small, $w^T \eta$ can be very large.

Now suppose that the $\ell_\infty$ norm (i.e., the maximum magnitude of an element) of $\eta$ is constrained to be less than some small $\epsilon$ (since we want the change to be as imperceptible as possible). Then we can intuitively see that $w^T \eta$ can be maximized by setting each element of $\eta$ to $\epsilon$ if the corresponding element of $w$ is positive and $-\epsilon$ if it is negative. That is,

$$\eta = \epsilon\, \mathrm{sign}(w)$$
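A quick NumPy check of this intuition (the dimensionality and weights here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=1000)
eps = 0.01

# Best L-infinity-bounded perturbation: match the sign of each weight.
eta = eps * np.sign(w)

# Any other perturbation inside the box does no better.
other = rng.uniform(-eps, eps, size=1000)
print(w @ eta >= w @ other)    # True
print(w @ eta)                 # equals eps * sum(|w|): large in high dimension
```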

Now, to generalize this to neural networks, they observe that several neural network models are intentionally designed to behave in very linear ways. In fact, sigmoid networks are carefully tuned so that they operate in the non-saturating, nearly linear region of the curve. So if the cost function used to train the network is $J(\theta, x, y)$, then this can be approximated around a point $x$ as:

$$J(\theta, x + \eta, y) \approx J(\theta, x, y) + \eta^T \nabla_x J(\theta, x, y)$$

As above, to maximize this, we choose

$$\eta = \epsilon\, \mathrm{sign}\left( \nabla_x J(\theta, x, y) \right)$$

This is the **Fast Gradient Sign** method of generating adversarial examples. Using the above method, Goodfellow et al. produced the following example of a panda which was misclassified by their target network.
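As an illustration, here is a hedged sketch of FGSM on a plain logistic-regression “network”, where the gradient of the loss can be written analytically (the weights, input, and function name are all made up for this example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_logreg(w, x, y, eps):
    """FGSM for logistic regression with label y in {-1, +1}.
    Loss J = log(1 + exp(-y * w.x)); grad wrt x is -y * w * sigmoid(-y * w.x)."""
    grad_x = -y * w * sigmoid(-y * np.dot(w, x))
    return x + eps * np.sign(grad_x)

w = np.array([2.0, -1.0, 0.5])
x = np.array([1.0, 1.0, 1.0])     # w.x = 1.5 -> predicted class +1
x_adv = fgsm_logreg(w, x, y=+1, eps=0.8)
print(np.dot(w, x), np.dot(w, x_adv))  # the margin flips from positive to negative
```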

Now the next question is – can we force misclassification of an image belonging to a certain class, *to a particular target class*? This is a much more challenging problem than just plain misclassification, because forcing a Pomeranian to be classified as a Labrador is much easier than forcing it to be classified as an aeroplane, say.

This is the question which Papernot et al. [3] address. Their strategy is to identify **regions of the input** which, when increased, increase the probability of the target class while simultaneously decreasing the cumulative probability of the other classes. Using this, they construct a **“saliency map”** that expresses how important each pixel of the input is in misclassifying the given image as the target class.

In other words, if a particular pixel makes the probability of the target class decrease, or makes the overall probability of the other classes increase, then its saliency is $0$. However, if neither of those is the case, then its saliency is proportional to the increase in the target class probability multiplied by the magnitude of the decrease in the other classes’ probabilities.

Input features (pixels) are perturbed one by one in order of decreasing saliency until the $\ell_0$ norm of the perturbation (i.e., the number of perturbed features) exceeds some fixed threshold.

The following image shows the results of targeted misclassification attacks on a classifier trained on the famous MNIST dataset.

So far we have the Fast Gradient Sign Method (FGSM) and the Jacobian Saliency Map Attack (JSMA). But the method we’re about to discuss is, in some sense, the most “powerful” first-order adversary (an adversary that uses only first-order gradient information).

Projected Gradient Descent (PGD) for generation of adversarial examples was proposed by Kurakin et al. in 2016 [4].

FGSM is a one-step process:

$$x_{adv} = x + \epsilon\, \mathrm{sign}\left( \nabla_x J(\theta, x, y) \right)$$

PGD, by contrast, is an iterative process that takes a step and then clips (projects) the result to within $\epsilon$ of the original point:

$$x^{(t+1)} = \mathrm{clip}_{x, \epsilon}\left( x^{(t)} + \alpha\, \mathrm{sign}\left( \nabla_x J(\theta, x^{(t)}, y) \right) \right)$$

A suitable norm can be chosen for the projection, and in practice this is usually the $\ell_\infty$ norm.

Instead of *maximizing* the loss function $J(\theta, x, y)$ on the true class, we could alternatively *minimize* the loss on the *target* adversarial class, $J(\theta, x, y_{target})$.

These are two formulations of the PGD attack. Madry et al. (2017) [5] showed that PGD is a ‘universal’ first-order adversary: robustness against a PGD attack implies robustness against *all* first-order adversaries.
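A minimal NumPy sketch of the untargeted $\ell_\infty$ PGD loop, using a toy quadratic loss in place of a real network (all names here are illustrative):

```python
import numpy as np

def pgd_linf(x0, grad_fn, eps, alpha, steps):
    """Repeated gradient-sign ascent steps, each followed by projection
    back onto the L-infinity ball of radius eps around x0."""
    x = x0.copy()
    for _ in range(steps):
        x = x + alpha * np.sign(grad_fn(x))
        x = np.clip(x, x0 - eps, x0 + eps)  # the projection step
    return x

# Toy loss ||x||^2 / 2, whose gradient is simply x.
x0 = np.array([0.5, -0.2])
x_adv = pgd_linf(x0, lambda x: x, eps=0.1, alpha=0.05, steps=10)
print(np.max(np.abs(x_adv - x0)))  # ~0.1: never leaves the epsilon-ball
```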

Carlini and Wagner (2017) [6] go back to the fundamental definition of adversarial examples. Given an input point $x$, we want to *change the classification* of the point (possibly to a fixed target class $t$), by applying the *smallest possible distortion* $\delta$. Formally, we can write:

$$\text{minimize } D(x, x + \delta) \quad \text{such that } C(x + \delta) = t, \quad x + \delta \in [0, 1]^n$$

Here, each pixel (or dimension) of $x$ is normalized to lie between $0$ and $1$. $D$ denotes some appropriate distance measure (like the $\ell_2$ distance), and $C$ denotes the classifier.

Now the constraint $C(x + \delta) = t$ is not very well-suited for optimization, so we transform it to an equivalent function $f$, such that $f(x + \delta) \le 0$ iff $C(x + \delta) = t$. An example of such a function might be:

$$f(x') = \max\left( \max_{i \neq t} Z(x')_i - Z(x')_t,\; 0 \right)$$

Note: here, $Z(x')$ denotes the output of the network before the softmax layer (the logits).

The authors propose 6 other alternative choices for $f$ that are all equivalent to the original condition. Now the optimization problem takes the following much more familiar form:

$$\text{minimize } D(x, x + \delta) \quad \text{such that } f(x + \delta) \le 0$$

We can apply the good ol’ method of Lagrangian (or KKT) multipliers, and get the following form:

$$\text{minimize } D(x, x + \delta) + c \cdot f(x + \delta)$$

The multiplier $c$ can be found by a heuristic-based search (for the nitty-gritties, take a look at the paper).

Now that we have this loss function (sometimes referred to as the **CW loss**), a simple way to optimize it while preserving the condition that $x + \delta \in [0, 1]^n$ (or the “box constraint”) is to perform PGD updates.

Another way is to perform a change of variables, like so:

$$x + \delta = \frac{1}{2}\left( \tanh(w) + 1 \right)$$

So $x + \delta$ will always be in the interval $[0, 1]$, and we can just optimize on $w$ without any constraints.
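A one-line NumPy check that the change of variables really does respect the box constraint:

```python
import numpy as np

# x + delta = (tanh(w) + 1) / 2 lies in [0, 1] for any real w,
# so the optimizer can search over w completely unconstrained.
w = np.linspace(-10.0, 10.0, 101)
x_adv = 0.5 * (np.tanh(w) + 1.0)
print(x_adv.min() >= 0.0 and x_adv.max() <= 1.0)  # True
```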

So far, we have discussed attack scenarios where an adversary has complete unrestricted knowledge of the architecture of the model and its weights. Papernot et al. [7] in 2017 proposed the first adversary that could attack a “black-box” model.

The only capability of the adversary in this setting is to observe the model’s output on any chosen input. Further, they assume that the adversary does not have access to a large training set (including the set the original model was trained on), thus adding additional constraints.

The plan of attack is to train a *substitute* model that learns to mimic the behaviour of the black-box model. Since the substitute model is transparent to us, we can perform FGSM, JSMA or PGD, and expect that these adversarial examples are *transferable* to the original model.

To start off, we need a small number of initial samples. For instance, on MNIST, the authors claim that just 10 images of each digit are sufficient. Further, these examples **don’t** have to come from the dataset or distribution on which the original model was trained.

The next step is to choose an architecture relevant to the problem at hand. For example, for an image classification setting, one would choose a ResNet or similar. They found that adversarial examples can transfer across architectures, and the specific choice of architecture does not make a significant difference.

The key idea for the next step is to **augment** this small dataset so that our *substitute model learns the decision boundaries of the original model*. Also keep in mind that we can’t add too many new points (in the real world, apart from concerns over practicality, this also incurs the risk of the adversary being detected). So we must pick new training points efficiently.

To do this, we look at the directions in which the substitute model’s output is changing around each training point. The intuition is that regions of large variation require more input-label pairs to capture accurately.

Rather than look at the full vector of outputs generated by the substitute model, we choose to look only at the output corresponding to the label predicted by the *original* model. Let $J_F$ denote the **Jacobian** matrix of the substitute model $F$ (if a model has $n$ inputs and $m$ outputs, then the Jacobian is an $m \times n$ matrix of partial derivatives of each output with respect to each input). Then the perturbation applied to each training point $x$ is:

$$x_{new} = x + \lambda\, \mathrm{sign}\left( J_F(x)\left[ O(x) \right] \right)$$

where $O(x)$ is the label assigned to $x$ by the original model, and $\lambda$ is a small step size.

The dataset is augmented, the model is re-trained and this process repeats until the substitute model learns to mimic the original model well. Then we can apply FGSM or JSMA on this substitute model.

In a follow-up post, I’ve explained how these black-box attacks expose a fundamental limitation of some defences, called “Gradient Masking”. There is a wonderful paper by Athalye, Carlini and Wagner on “Obfuscated Gradients” [8] that certainly deserves a mention here, but I thought it would make more sense to talk about it in the context of defences. So I’ve moved the discussion about that paper to the follow-up post.

Next, let’s look at black-box attacks with even more constraints.

Ilyas et al. [9] in a recent paper consider three restrictive settings, or “threat models”, under which to perform black-box attacks. They are:

- **The Query-limited Setting:** In several real-world examples, where each query comes with a cost, we want to keep the number of queries made by the adversary to an absolute bare minimum.
- **The Partial-information Setting:** In certain cases, we may not have access to the full vector output by a model; rather, we may have access to only the top-k labels with their (possibly relative) confidences. In the extreme case, we have access to only the top label and its confidence score.
- **The Label-only Setting:** We only have access to the top-k labels, *without* their associated confidences. Again, in the extreme case, we only know the top label.

The central idea of their paper is adapting Natural Evolutionary Strategies (NES), a black-box function optimizer, to estimate the gradient. NES takes a function $F(\cdot)$ and a search distribution $\pi(\theta \mid x)$, which is a distribution over the parameters $\theta$ to search, given a value of the current parameters $x$.

The authors use a Gaussian search distribution centred at $x$ to estimate the gradient at $x$. So $\theta = x + \sigma \delta$, where $\delta \sim \mathcal{N}(0, I)$. Plugging this into the NES formulation leaves us with the following estimate of the gradient (averaging over $n$ samples $\delta_i$):

$$\nabla\, \mathbb{E}\left[ F(\theta) \right] \approx \frac{1}{\sigma n} \sum_{i=1}^{n} \delta_i\, F(x + \sigma \delta_i)$$
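A hedged NumPy sketch of the NES gradient estimator on a toy function whose true gradient we know (antithetic sampling is used for variance reduction, as the authors also do; the function name and parameters are illustrative):

```python
import numpy as np

def nes_gradient(F, x, sigma=0.1, n=4000, rng=None):
    """NES gradient estimate at x: average delta_i * F(x + sigma*delta_i)
    over Gaussian samples, scaled by 1/sigma. Antithetic pairs (+d, -d)
    reduce the variance of the estimate."""
    rng = rng or np.random.default_rng(0)
    g = np.zeros_like(x)
    for _ in range(n // 2):
        d = rng.normal(size=x.shape)
        g += d * F(x + sigma * d)
        g -= d * F(x - sigma * d)   # antithetic sample
    return g / (n * sigma)

F = lambda x: np.sum(x ** 2)    # toy black-box function; true gradient is 2x
x = np.array([1.0, -2.0])
print(nes_gradient(F, x))       # roughly [2., -4.]
```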

Once we have the approximate gradient, we are back on familiar ground, and we can simply use PGD to generate adversarial examples. This is (unsurprisingly) much more efficient and effective than training a substitute model and relying on transferability.

Using a combination of NES and PGD, the authors are successfully able to attack models in the Query-limited setting. The use of NES ensures that the number of queries is low.

When we try to extend this to the Partial-information setting, there is a problem. If the adversarial target class drops out of the top-k, then we can’t estimate its gradient using NES, because that relies on being able to observe the model’s output at several points in the vicinity of the current point.

So the strategy is this :

- Start with an image of the **target class**.
- Clip it to within the smallest possible $\epsilon$ of the **original image**, such that the target class doesn’t slip out of the top-k predictions.
- Then perform a step of PGD with this smallest $\epsilon$ to get a new image, and repeat.

The below image shows a successful attack on the Google Cloud Vision API, which corresponds to the partial information setting. As you can see, the top prediction changes from “Skiing” to “Dog”.

Finally for the label-only setting, in place of a measure of confidence, the authors use robustness of the model to small Gaussian random distortions. This is shown in the picture below.

Once we have these confidence scores, this becomes identical to the Partial information setting.

And that brings us to the end. I’ve attempted to cover as much as I can, but I’m sure there’s several important papers and ideas that I’ve missed. Hopefully I’ll be able to keep this up regularly and do more such blogs in the future, where I can cover those. Until then, bye!

1. Szegedy et al., Intriguing properties of neural networks
2. Goodfellow et al., Explaining and Harnessing Adversarial Examples
3. Papernot et al., The Limitations of Deep Learning in Adversarial Settings
4. Kurakin et al., Adversarial Machine Learning at Scale
5. Madry et al., Towards Deep Learning Models Resistant to Adversarial Attacks
6. Carlini and Wagner, Towards Evaluating the Robustness of Neural Networks
7. Papernot et al., Practical Black-Box Attacks against Machine Learning
8. Athalye et al., Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples
9. Ilyas et al., Black-box Adversarial Attacks with Limited Queries and Information

- 2091 lines of code added
- 719 lines of code removed
- 3 Pull Requests
- 66 commits

My project was titled “**Adding more convolution operations**”, and I have added the following features.

**1. Unshared Convolution:**

A different filter is applied over every region of the input image. An application of these convolutions is in DeepFace.

The relevant pull requests are:

Note: The first PR needed to be rebased and was already rather messy. So I closed it and opened the second PR, which was ultimately merged.

The interface created was a new boolean parameter ‘unshared’ in the existing 2D convolution interface. This involved the slow (default) Python implementation and C code for the CPU and GPU implementations.

**2. Asymmetric Padding and Dilated Causal Convolution**

Dilated Causal Convolutions were introduced in the WaveNet paper by DeepMind. The idea is to model sequences by having the output at ‘time’ t depend only on inputs up to time t-1.

The input and output tensors are a batch of stacks of feature vectors of constant length. So this is actually a 1D convolution. It can be implemented as a 2D convolution by reshaping the tensor to add an extra dimension with size 1.

Causality can be implemented by padding with zeros on the left side only. The number of zeros to pad with can be calculated from the size of the filter.
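A small NumPy sketch of this padding calculation, using a ‘valid’ convolution on the left-padded signal (the function name is illustrative; dilation defaults to 1):

```python
import numpy as np

def causal_pad_1d(x, filter_len, dilation=1):
    """Pad on the left only, so output[t] never sees inputs after time t.
    The amount of padding is dilation * (filter_len - 1) zeros."""
    return np.concatenate([np.zeros(dilation * (filter_len - 1)), x])

x = np.arange(1.0, 6.0)                  # [1, 2, 3, 4, 5]
xp = causal_pad_1d(x, filter_len=3)
# A 'valid' convolution on the padded signal preserves the length of x.
y = np.convolve(xp, np.ones(3), mode='valid')
print(y)   # [ 1.  3.  6.  9. 12.]
```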

Rather than implement just causal convolution, it was decided to implement asymmetric padding and use this asymmetric padding to implement causal convolution.

The pull request for this feature is :

https://github.com/Theano/Theano/pull/6331.

Note: As of 29/08, this feature is finished but not yet merged.

The interface created was a new supported format for the border_mode parameter. One can specify a tuple containing pairs of integers to indicate the left and right padding along each dimension. Also, a causal_conv1d function was created that would call the Convolution Op with the appropriate parameters.

**Tests and Documentation:**

An important part of the whole project was adding tests and documentation for the above-mentioned features. The tests are in 3 parts – for the Python code with no optimization, for the CPU GEMM-based convolution and for the GPU convolution.

Apart from this, docstrings and comments needed to be added/updated to reflect the new interfaces.

**Future Enhancements:**

- Unshared and Grouped Convolutions can be sped up (benchmarking required) using Strided Batched GEMM in CUDA 8. This was part of the original plan but couldn’t be completed in time.
- Unshared convolution could be extended to 3D if there is any interest in doing so.
- Asymmetric padding in 3D needs to be implemented.
- Asymmetric padding can be computed through cuDNN if the tuple contains two identical integers, i.e., it is actually symmetric. This will require a change to the optimizer.
- The causal_conv1d function requires filter_shape to be passed as an argument, which is inelegant. This should be done away with.

That’s all folks.

Vikram Nitin

29/08/2017


Then I began work on Dilated Causal Convolution while waiting for reviews on my code. After implementing some of my changes (and a rebase, because 3D grouped convolution had been merged in the meantime), I discovered a very strange error that I’m still working on. It only occurs some of the time; for example, it doesn’t show up the first time after purging the cache and running.

It’s been two days but I’ve been unable to get to the bottom of this. Hopefully there is a light at the end of the tunnel.

Dilated Causal Convolution is in a primitive state right now. I’ll need to really hurry things up for the final review to get it merge-ready. It’s going to be a fun ten days or so.

As a result I missed my targets for that week, and with the second evaluations due the following week, I was not sure I would even be allowed to continue. But I’ve scraped through, and now I’ve resolved to put my head down and focus on what needs to be done in this last month of the program.

Finishing the summer with just unshared convolution implemented would certainly not be desirable, so I’ll need to scramble to integrate grouped convolution (which Mohammed Affan had been working on). The plan as of now is to move on to Dilated Causal Convolution once this is done, with the hope that it will prove to be an easy task.

Anyway, tough times ahead, lots to do. Hope to push through and wind up GSoC on a good note.

Although I don’t have an NVIDIA GPU on my laptop (Intel Integrated Graphics only), I was able to use OpenCL on the CPU to test the code. I believe that libgpuarray is largely compatible with OpenCL too.

When I tried out the code, I was confronted with a mysterious “Error 11”, with GEMV failing (matrix-vector multiplications for the forward pass). I replaced the GEMV calls with GEMM (parameters suitably modified) and the error disappeared. I still have no idea what I was doing wrong.

Then the further debugging also took a couple of days, with bugs related to sampling, filter dilation, etc.

By this time, my branch had become incompatible for merging, since the gpuarray/corr_gemm.c file had been cleaned up since I forked the repo. I performed a rebase to resolve the conflict.

Within a couple of days, my fork had some new conflicts with the base branch, since Mohammad Affan’s grouped convolution had been merged.

So my next task was integrating the grouped with the unshared convolution. That should be done in the next day or so.

This coming week, I’ll be participating in the RoboCup in Nagoya, Japan. So I don’t expect to make any progress in this next week.

Once the slow Python implementation was done, the next thing I had to work on was the CPU and GPU implementations (in C). Specifically, the operations (forward, grad weights, grad inputs) have to be expressed as one or more matrix multiplications, so that they can exploit the fast algorithms available for General Matrix-Matrix multiplication (GEMM).

Expressing the operations in this form was the easier part. The irritating bit was converting leading dimensions and sizes to column-major order (used by GEMM). For unshared convolution, all operations were expressed as matrix-vector multiplications called in a loop over the output regions.

I tried to integrate my primitive unshared convolution tests into the larger framework of the general AbstractConv tests. This was unexpectedly difficult, and I had to deal with literally thousands of lines of sometimes uninformative error messages. The entire process took nearly 5 days, instead of the expected 1 or 2.

Even after the tests passed locally, Travis would cut them short due to the long run-time (double the previous, due to the extra unshared parameter in the outer loop). So we decided to leave them as is for now and move on to the GPU implementation.

Here I faced a somewhat foreseen problem, namely not having a GPU (Intel Integrated HD Graphics only). I went ahead and wrote the GPU GEMM along similar lines as the CPU GEMM, without yet actually testing to see if it worked. In the last couple of days, I’ve installed OpenCL with CLBlast on my laptop (a nasty, long process), and have installed libgpuarray along with them.

So the next step will be filling in the missing pieces in the GPU implementation and trying the whole thing out. This should be done soon.

Until next fortnight then.

In that time, I’ve gotten used to staying up late into the night staring at my laptop and (figuratively) pencil-chewing, thinking of how to efficiently compute a gradient or something along those lines. It sometimes does feel odd when the output of 6 hours’ work is 10 lines of code added and 20 lines removed.

But it is satisfying to know that the code you’ve left behind is cleaner and better than the one you started out with at the beginning of the day, and that keeps me going.

As an aside, another thing I’ve learnt is that my abilities of estimation really can’t be relied upon. Somewhere in early June, I thought that the Python part of the code would be a breeze and that I would be done in a week or less. The C (CPU) code would take another week at most, I thought, and I could start work on the GPU code by the 20th at best. Anyway, it’s now the 27th and I still have a few days to go before I finish the CPU code.

It’s been frustrating to see what I estimate as a day’s work sometimes stretch on to 3 days or more, but I hope that the upward climb is now over and it will be smoother from here on.

On another note, one topic of discussion this week was the shape the weights array would take. Lasagne, Keras and Pylearn2 all use different 6 or 7-dimensional tensors. And none of these shapes match with my original choice for the implementation! After I started working on the C code, I realised that I probably should change the dimension ordering to make things simpler, so I’m making do with a dimshuffle while calling the C code. I’ll have to go back to the Python code and make the necessary changes later, however.

Anyway, here’s looking forward to more challenges and discoveries ahead.

My proposal titled “Adding more Convolution Operations” was accepted by Theano (under the Python Software Foundation) on May 4th, providing a welcome distraction from my then-ongoing exams.

There are many variants of the basic convolution operation, like Grouped, Unshared and Separable Convolution. Although higher-level libraries like Keras and Lasagne provide support for some of these, having native implementations in Theano is desirable.

The first thing to be done was to divide up the list of tasks in my proposal into smaller, manageable bits. It was decided that I would begin work on Unshared Convolution, and we would plan further tasks as we go along. Unshared Convolution is a variant where there is a different kernel for every ‘region’ of the input, instead of the complete weight-sharing in a conventional convolution.

The different steps involved in the implementation (for any convolution operation) are designing the three passes: forward, gradient with respect to the weights, and gradient with respect to the inputs. Each of these has a slow default Python implementation (primarily for debugging) and an underlying C implementation (optimised for speed). In addition, tests for these need to be included in the Travis build.

So currently, I’m working on implementing these components for Unshared Convolutions. Looking forward to an amazing learning experience over the next few months!
