This paper out of DeepMind used a Google StreetView dataset and set out to train a network capable of navigating to a given goal destination, without knowing where it was on any birds-eye map, and with its only input being photographic viewpoint images of its current location and orientation. This was done through a framework of reinforcement learning, where the model is conditioned on a representation of its goal, and given the image features of its current view of the world, and has to take actions like “turn left,” “turn sharply left”, “go forward”, etc, in order to navigate. Rather than lat-long, goals are specified in city-specific ways, in terms of the distance between the goal position and a reference set of landmarks. I don’t entirely understand the motivation behind this approach; the authors say it’s more scalable, but it wasn’t obvious to me why that would be the case. https://i.imgur.com/V3UATsK.png - The authors construct different architectures that combine these two fundamental pieces of input data - current image and the goal you’re trying to reach - in different ways. In the simplest model, called GoalNav, there’s a single LSTM that combines the goal information with the output of a convolutional encoder processing images of your current viewpoint. - In the next most complex, CityNav, there are two LSTMs: one for processing your goal, and the other for combining the output of the goal network with your convolutional inputs, in order to decide on an action. Notionally, this separation of tasks corresponds to “figure out what absolute to go in, given your goal”, and “figure out how to go in that absolute direction from where you are now”. As a way to support training, the goal network is trained with an auxiliary loss function where it needs to predict how far its current orientation is from North. Note that this does pass some amount of information about current location into the model (since the network gets to know its actual orientation relative to true north), but this is only available during training, with the hope that the model will have gotten good enough at predicting orientation to perform well. - The final model, similar to above, is called MultiCityNav, and is explicitly designed for transfer learning. Instead of training multiple cities on a single shared network, only the convolutional encoder and policy network (the “how do I go in the absolute direction needed to reach my goal” parts) are shared between cities, and the goal processing LSTM (the “which direction should I be going in” part) is re-trained per city. This is designed to allow for transfer in the parts of learning you would expect to generalize, but allow the network to learn a city-specific approach for converting between goal specifications (in terms of city landmarks) and direction. In order to get over the fact that reward in this setting is very sparse (i.e. you only get reward when you reach the goal), the authors (1) train in a curriculum fashion, starting with tasks very nearby the model’s starting point, and gradually getting longer, and (2) add a small amount of reward shaping, where you get rewarded for moving in the direction of the goal, but only if you’re within 200m of it. This last is a bit of a concession on the realism front, and the authors say as much, but it’s just quite hard to train RL with purely dense rewards, and it makes sense that reward shaping would help here. Ultimately, they were able to get performance (in terms of goal-reaching rewards) around ¾ as strong as an Oracle model, who had access to the full map and could calculate the true shortest path.