RL²: Fast Reinforcement Learning via Slow Reinforcement Learning (2016)
This seems to be the important bit, the passage that describes what makes their learning "fast":
"The objective (...) is to maximize the expected total discounted reward accumulated during a single trial rather than a single episode. Maximizing this objective is equivalent to minimizing the cumulative pseudo-regret (Bubeck & Cesa-Bianchi, 2012). Since the underlying MDP changes across trials, as long as different strategies are required for different MDPs, the agent must act differently according to its belief over which MDP it is currently in. Hence, the agent is forced to integrate all the information it has received, including past actions, rewards, and termination flags, and adapt its strategy continually. Hence, we have set up an end-to-end optimization process, where the agent is encouraged to learn a “fast” reinforcement learning algorithm"
However, one would note that this is still bounded by the learning of the RNN itself, so I don't really see how this approach makes the algorithm much faster than "slow" RL, as loads of trials would still be required for any real learning. Maybe someone more knowledgeable could pitch in.
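To make the mechanism concrete, here's a minimal sketch of how I read the setup (my own reading, not the authors' code): an RNN policy receives the previous action, previous reward, and a termination flag at every step, and its hidden state carries over across episodes within a trial but is reset between trials. The toy bandit environment, network sizes, and names below are placeholders; the "slow" part is training the RNN weights across many trials, while the "fast" adaptation happens inside the hidden state within a single trial.

```python
# Sketch of the RL^2 interaction structure (assumed, not the authors' code).
import torch
import torch.nn as nn

n_arms, hidden_size = 5, 32
# Input per step: one-hot previous action + previous reward + done flag
# (a bandit has no separate observation, so none is included here).
cell = nn.GRUCell(n_arms + 2, hidden_size)
policy_head = nn.Linear(hidden_size, n_arms)

def run_trial(arm_probs, episodes=3, steps_per_episode=10):
    """One trial = several episodes of the SAME (unknown) bandit.
    The hidden state h persists across episodes, so the agent can exploit
    what it learned earlier in the trial; it is reset only between trials."""
    h = torch.zeros(1, hidden_size)           # reset once per trial
    prev_action = torch.zeros(1, n_arms)
    prev_reward = torch.zeros(1, 1)
    total_reward = 0.0
    for _ in range(episodes):
        for t in range(steps_per_episode):
            done = torch.ones(1, 1) if t == steps_per_episode - 1 else torch.zeros(1, 1)
            x = torch.cat([prev_action, prev_reward, done], dim=-1)
            h = cell(x, h)                     # "fast" adaptation lives in h
            logits = policy_head(h)
            action = torch.distributions.Categorical(logits=logits).sample()
            reward = float(torch.rand(1) < arm_probs[action])
            total_reward += reward
            prev_action = nn.functional.one_hot(action, n_arms).float()
            prev_reward = torch.full((1, 1), reward)
    # The "slow" RL algorithm would maximize this total across trials by
    # updating the RNN weights; that outer loop is omitted here.
    return total_reward

# Each trial samples a fresh MDP (here, fresh arm probabilities):
print(run_trial(torch.rand(n_arms)))
```

So the speedup claim, as I understand it, is about adaptation within a trial, not about the outer-loop training being cheap.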
If this works, this seems like it could be very significant.
Broadly, slowness is a serious problem in current machine learning approaches and so anything that speeds things up is significant.
The approach of learning the learning process would seem to get bonus points for being interesting and general.
Edit: Published in November, this paper didn't seem to get any comments on the machine learning Reddit, which is my go-to for informed commentary on this stuff. I'd love to have someone who knew what they were doing comment here.
Note: The original title noted OpenAI's role in the paper, which may be seen by loading the PDF and reading the author credits.
So when are we going to see the paper where we use an RL net to speed up another RL net for discovering a neural architecture for learning how to do gradient descent (by gradient descent)?