Neural Architecture Search with Reinforcement Learning
The primary author is a Google Brain Resident (https://research.google.com/teams/brain/residency/). The Brain Residency is a great program for starting a career in deep learning and ML research, and I'm impressed by how quickly these new researchers churn out strong work like this.
disclosure: I work at Google Brain
I think this is how neural networks achieve some modicum of generality - by chaining them together.
Say you have a robot and you want it to grab a can of beer off the counter. You say "grab that beer" and point to it. The first neural network interprets the speech and visual input. A second neural network then chooses the right networks to continue the task, based on what the first one interpreted - it picks one for walking and one for grabbing.
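Here's a minimal sketch of that chaining idea in Python. Everything in it - the function names, the skill table, the task format - is a made-up stand-in for what would really be learned networks:

    def interpret(speech, image):
        # Stand-in for a perception net: maps raw input to a task description.
        return {"action": "grab", "object": "beer", "location": (1.0, 2.0)}

    SKILLS = {
        "walk_to": lambda loc: print("walking to", loc),
        "grasp": lambda obj: print("grasping", obj),
    }

    def controller(task):
        # Stand-in for a routing net: picks which skill nets to run, in order.
        return [("walk_to", task["location"]), ("grasp", task["object"])]

    task = interpret("grab that beer", image=None)
    for skill, arg in controller(task):
        SKILLS[skill](arg)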
Extreme paper tl;dr - humans usually construct neural network components, and the graph of how they fit together, by hand. This work sets up a "controller" neural network that constructs, via reinforcement learning, two core components found in many neural networks: a recurrent (RNN) cell and a convolutional (CNN) architecture. The search is intensive and slow, requiring 400 CPUs and 800 GPUs for the RNN and CNN respectively, but achieves better-than or near state-of-the-art results on language modeling and image classification respectively.
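To make the controller idea concrete, here is a minimal REINFORCE sketch in plain numpy. The two-choice search space, the reward function, and the hyperparameters are illustrative stand-ins, not the paper's actual setup; in the real thing, evaluate() means fully training a child network, which is where the hundreds of GPUs go:

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy search space: one discrete choice per architectural decision.
    SPACE = {"filter_size": [3, 5, 7], "num_filters": [32, 64, 128]}
    logits = {k: np.zeros(len(v)) for k, v in SPACE.items()}

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def evaluate(arch):
        # Stand-in for training the child network and measuring
        # validation accuracy.
        return 0.5 + 0.1 * (arch["filter_size"] == 5) \
                   + 0.1 * (arch["num_filters"] == 64)

    baseline, lr = 0.0, 0.5
    for _ in range(200):
        # Controller samples an architecture from its current policy.
        choices = {k: rng.choice(len(v), p=softmax(logits[k]))
                   for k, v in SPACE.items()}
        arch = {k: SPACE[k][i] for k, i in choices.items()}
        reward = evaluate(arch)
        baseline = 0.9 * baseline + 0.1 * reward  # moving-average baseline
        # REINFORCE: nudge the sampled choices' probabilities up or down
        # in proportion to (reward - baseline).
        for k, i in choices.items():
            grad = -softmax(logits[k])
            grad[i] += 1.0  # d log p(i) / d logits = one_hot(i) - p
            logits[k] += lr * (reward - baseline) * grad

    print({k: SPACE[k][int(np.argmax(lg))] for k, lg in logits.items()})

The moving-average baseline is one common variance-reduction trick; the paper's actual controller is an RNN that emits a long sequence of such choices rather than independent logits per decision.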
This paper is currently under review for ICLR 2017 and is one of the papers I was most excited about. I previously wrote an article, "In deep learning, architecture engineering is the new feature engineering"[1], which discusses the (often ignored) fact that much of the work in modern deep learning papers is just assembling components in different combinations. This is one of the first works that I feel provides a viable answer to my complaint/concern.
The paper itself tackles two problems: first, that optimizing an architecture is usually black magic directed poorly by humans, and second, that humans rarely spend their time tailoring architectures to a specific task, instead seeking generality. Zoph and Le address both by having one neural network generate the architecture of another through a large series of experiments. They run experiments in both vision (image classification) and text (language modeling), replacing the convolutional neural network component and the recurrent neural network component respectively.
The first problem is that many of the choices made in constructing a neural network architecture are somewhat arbitrary, hit upon experimentally by the practitioners themselves. Andrej Karpathy noted in one of his lectures (paraphrased): "Start with an architecture that works, then modify from there" - mainly because there's a lot of "black magic" in these architectures that has only been discovered by spilling blood to the experimental god of a hundred GPUs and/or by "graduate student descent" (i.e. where you lock a poor grad student in a room for an indeterminate period of time and tell them to do better on task X). Being able to have a neural network run this painful search for you instead is a good idea - assuming you have the large number of GPUs or CPUs necessary. In the paper they use 400 CPUs for the language modeling search and 800 GPUs for the CNN classification search!
The second is whether we should generalize or specialize these architectures. There are many architecture variants that are never built for or tested against each possible new task. Within recurrent neural networks (RNNs) alone we have the RNN/GRU/LSTM/QRNN/RHN/... and a million minor variants between them, each of which performs slightly differently depending on the task. While we'd like to imagine that human-designed architectures get progressively closer to "the perfect generic RNN cell" over time, it makes sense that certain cells could or should be optimized for a specific task. Seeking generality isn't always the correct answer. Humans seek generality because we don't have the time to tailor to each specific task - but what if we could? Maybe in that situation Occam's razor is actually an impediment to our thinking.
While these are early days, and the approach is hugely resource intensive, it is likely to become more feasible over time, either as we get more computing power or as we become smarter about how we use it. As a researcher in neural networks, I don't consider this a threat but a useful tool, in much the same way that the compiler helped rather than threatened assembly programmers.
If people are interested, I can write an article covering many of the details of this paper, like I did for Google's Neural Machine Translation architecture[2]. In that article I try to step through how these systems work from the ground up, and the reasoning behind many of the decisions in the paper, hopefully in a manner understandable to a general audience.
P.S. Merity et al. is one of the baselines they beat in the language modeling section, so you may read this entire post in a bitter tone if you'd like ;)
P.P.S. This paper has been out since November 2016 or earlier - I think a recent MIT Tech Review article may have resurfaced it. (oops: wrote Wired initially, meant MIT Tech Review - thanks @saycheese)
[1]: http://smerity.com/articles/2016/architectures_are_the_new_f...
I would love to see, as a future research project, a Neural Architecture Search that creates Neural Architecture Searches. Meta-meta-learning. I like the idea of improving the network which creates other networks.
Also, the size of the network could be used as part of the evaluation, so the search minimizes network size while maximizing accuracy.
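A sketch of what that might look like: fold a parameter-count penalty into the score the controller maximizes. The weight and normalization below are made up for illustration:

    def reward(val_accuracy, num_params, max_params=10_000_000, size_weight=0.1):
        # Penalize parameter count so the search prefers small, accurate nets.
        return val_accuracy - size_weight * (num_params / max_params)

    print(reward(0.92, 2_500_000))   # small net wins: ~0.895
    print(reward(0.93, 9_000_000))   # big net loses:  ~0.84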
I ran across this research while reading the article "AI Software Learns to Make AI Software", which has already been posted here:
This is pretty old, and neural nets can train neural nets too (better than humans, as usual). Check out "Learning to Learn by Gradient Descent by Gradient Descent".
Wait... if the neural net can design other neural nets, can it be taught to design itself?