In this post, we summarize our paper, A Study on Encodings for Neural Architecture Search, which will appear at NeurIPS 2020 as a spotlight paper. This work provides a theoretical and empirical study of different encodings that can be used within neural architecture search. The results act as an ablation study for prior work, act as a guideline for future work, and demonstrate that encodings are an important design decision for neural architecture search.
ArXiv paper: https://arxiv.org/abs/2007.04965
Source code: https://github.com/naszilla/nas-encodings
In the past eight years, deep learning has been the dominant paradigm in artificial intelligence due to its strong performance on computer vision and natural language tasks. However, one of the downsides of deep learning is that the best neural architectures are often highly complex and highly dependent on the dataset. For this reason, the field of neural architecture search (NAS) has seen a steep rise in interest, due to the promise of using algorithms to automatically design a specialized neural architecture for any given problem.
In NAS, given a dataset D and a search space A of architectures, the goal is to find the architecture in A with the highest validation accuracy when trained on D (with a fixed set of hyperparameters). The search space A can be very large, for example, size 1018. A neural architecture in the search space is typically represented as a labeled directed acyclic graph (DAG), where the nodes (or edges) are operations such as CONV3x3, POOL3x3, or Skip Connect. The nodes and edges represent paths for the data to flow from the input layer to the output layer.
When designing a NAS algorithm, a natural question is, how should we encode the neural architecture to maximize performance? The representation of the DAG-based architectures may significantly change the outcome of NAS subroutines such as perturbing or manipulating architectures. Although the majority of prior work has not explicitly considered this question, two recent papers have shown that even small changes to the architecture encoding can make a substantial difference in the final performance of the NAS algorithm.
In this work, we provide the first formal study on NAS encoding schemes. We define an encoding as a multi-function from an architecture to a real-valued tensor. The encoding is a fixed transformation, independent of the dataset. We define eight different encodings (four are shown below) based on two paradigms: adjacency matrix-based encodings, and path-based encodings. Adjacency matrix approaches represent the architecture as a list of edges and operations, while path-based approaches represent the architecture as a set of paths from the input to the output.
The path-based encodings are not one-to-one functions, since multiple neural architectures can map to the same encoding (see part (a) below). The adjacency matrix is a multi-function (not a regular function) because one architecture can map to multiple encodings.
First, we theoretically characterize the scalability of each encoding by quantifying the information loss from truncation. We show that the path encoding can be significantly truncated with little loss of information, while the adjacency encoding cannot be truncated. Specifically, given a random neural architecture with k edges, n nodes, and r choices of operations on each node, truncating the encoding to length rk/n contains almost the same information as the full path encoding, and it cannot be truncated any smaller. Furthermore, the adjacency matrix cannot be truncated without losing a significant amount of information. We run experiments to verify these conclusions.
Next, we experimentally test the effectiveness of each encoding. We identify three major encoding-dependent subroutines. These three subroutines are the only encoding-dependent building blocks necessary for many NAS algorithms.
Sample random architecture – draw an architecture randomly from the search space.
Perturb architecture – make a small change to a given architecture.
Train predictor model – train a model (e.g., Gaussian process, or neural network) to predict the validation accuracies of unseen architectures.
We show which of the encodings perform best for each subroutine by testing each encoding within each subroutine for many popular NAS algorithms. Our experiments retroactively provide an ablation study for prior work by disentangling the algorithmic contributions from the encoding-based contributions.
Overall, our results show that NAS encodings are an important design decision which must be taken into account not only at the algorithmic level, but at the subroutine level, and which can have a significant impact on the final performance. Our experimental results allow us to disentangle the algorithmic and encoding-based contributions of prior work, and act as a guideline for the encodings to use in future work. To see the full details, please read our paper or look at our code.