[link]
TLDR: The authors propose Progressive Neural Networks (ProgNN), a new way to do transfer learning without forgetting prior knowledge (as happens in finetuning). ProgNNs train a neural network on task 1, freeze its parameters, and then train a new network on task 2 while introducing lateral connections and adapter functions from network 1 to network 2. This process can be repeated with further columns (networks). The authors evaluate ProgNNs on 3 RL tasks and find that they outperform finetuning-based approaches.

#### Key Points

- Finetuning is a destructive process that forgets previous knowledge. We don't want that.
- Layer $h_k$ in network 3 gets additional lateral connections from layers $h_{k-1}$ in network 2 and network 1. The parameters of those connections are learned, but networks 1 and 2 are frozen during training of network 3.
- Downside: the number of parameters grows quadratically with the number of tasks. The paper discusses some approaches to address this, but it's not clear how well they work in practice.
- Metric: AUC (average score per episode during training) as opposed to final score. Transfer score = relative performance compared with a single-network baseline.
- The authors use Average Perturbation Sensitivity (APS) and Average Fisher Sensitivity (AFS) to analyze which features/layers from previous networks are actually used in the newly trained network.
- Experiment 1: Variations of the Pong game. The baseline that finetunes only the final layer fails to learn. ProgNN beats the other baselines, and APS shows reuse of knowledge.
- Experiment 2: Different Atari games. ProgNNs result in positive transfer 8/12 times and negative transfer 2/12 times. Negative transfer may be a result of optimization problems. Finetuning only the final layer fails again. ProgNN beats the other approaches.
- Experiment 3: Labyrinth, a 3D maze. Much the same result as the other experiments.

#### Notes

- It seems like the assumption is that layer $k$ always wants to transfer knowledge from layer $k-1$. But why is that true? The networks are trained on different tasks, so the layer representations, or even the numbers of layers, may be completely different. And once you introduce lateral connections from all layers to all other layers, the approach no longer scales.
- Old tasks cannot learn from new tasks, unlike humans.
- Gating or residuals for the lateral connections could make sense, to allow the network to "easily" reuse previously learned knowledge.
- Why use the AUC metric? I also would've liked to see the final score. Maybe there's a good reason for this, but the paper doesn't explain it.
- It's scary that finetuning only the final layer fails in most experiments. That's a very commonly used approach in non-RL domains.
- Someone should try this on non-RL tasks.
- What happens to training time and optimization difficulty as you add more columns? Seems prohibitively expensive.
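To make the column/lateral-connection idea concrete, here is a minimal NumPy sketch of a two-column progressive net forward pass. Layer sizes, ReLU activations, and the random "trained" weights (`W1`, `W2`, `U2`) are all hypothetical placeholders, not the paper's architecture; the point is only that column 2 reads the frozen column 1's hidden activations through learned lateral matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def forward_column(x, weights, prev_columns=(), laterals=None):
    # acts[i] holds h_i; acts[0] is the (shared) input.
    acts = [x]
    for i, W in enumerate(weights):
        pre = W @ acts[-1]
        if i > 0:  # the input is shared, so laterals start at the second layer
            for j, prev_acts in enumerate(prev_columns):
                # lateral connection U_i^{(k:j)} h_{i-1}^{(j)}
                pre = pre + laterals[i][j] @ prev_acts[i]
        acts.append(relu(pre))
    return acts

dims = [4, 8, 8, 2]  # hypothetical layer sizes
x = rng.normal(size=dims[0])

# Column 1: trained on task 1 (random weights stand in for trained ones), then frozen.
W1 = [rng.normal(0, 0.1, (dims[i + 1], dims[i])) for i in range(3)]
acts1 = forward_column(x, W1)

# Column 2: fresh weights for task 2, plus lateral matrices U reading column 1.
W2 = [rng.normal(0, 0.1, (dims[i + 1], dims[i])) for i in range(3)]
U2 = [None] + [[rng.normal(0, 0.1, (dims[i + 1], dims[i]))] for i in range(1, 3)]
acts2 = forward_column(x, W2, prev_columns=[acts1], laterals=U2)
print(acts2[-1].shape)  # task-2 output, informed by the frozen column
```

During training of column 2, only `W2` and `U2` would receive gradients; `W1` stays frozen, which is exactly why prior knowledge is not forgotten.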
[link]
Rusu et al. propose progressive networks, sets of networks allowing transfer learning over multiple tasks without forgetting. The key idea of progressive networks is very simple. Instead of finetuning a model (for transfer learning), the pretrained model is taken and its weights are fixed. Another network is then trained from scratch while receiving features from the pretrained network as additional input. Specifically, the authors consider a sequence of tasks. For the first task, a deep neural network (e.g. a multilayer perceptron) is trained. Assuming $L$ layers with hidden activations $h_i^{(1)}$ for $i \leq L$, each layer computes $h_i^{(1)} = f(W_i^{(1)} h_{i-1}^{(1)})$, where $f$ is an activation function and, for $i = 1$, the network input is used. After training until convergence, a second network is trained – now on a different task. The parameters of the first network are fixed, but the second network can use the features of the first one: $h_i^{(2)} = f(W_i^{(2)} h_{i-1}^{(2)} + U_i^{(2:1)} h_{i-1}^{(1)})$. This idea can be generalized to the $k$-th network, which can use the activations from all the previous networks: $h_i^{(k)} = f(W_i^{(k)} h_{i-1}^{(k)} + \sum_{j < k} U_i^{(k:j)} h_{i-1}^{(j)})$. For three networks, this is illustrated in Figure 1.

https://i.imgur.com/ndyymxY.png

Figure 1: An illustration of the feature transfer between networks.

In practice, however, this approach results in an explosion of parameters and computation. Therefore, the authors apply a dimensionality reduction to the $h_{i-1}^{(j)}$ for $j < k$. Additionally, an individual scaling factor is used to account for the different ranges used in the different networks (also depending on the input data). Then, the above equation can be rewritten as $h_i^{(k)} = f(W_i^{(k)} h_{i-1}^{(k)} + U_i^{(k)} f(V_i^{(k)} \alpha_i^{(:k)} h_{i-1}^{(:k)}))$. (Note that the notation has been adapted slightly, as I found the original notation misleading.)
Here, $h_{i-1}^{(:k)}$ denotes the concatenated features from all networks $j < k$. Similarly, for each network, one $\alpha_i^{(j)}$ is learned to scale its features (note that the notation above implies an elementwise multiplication by the $\alpha_i^{(j)}$'s repeated in a vector, or equivalently a matrix-vector product). $V_i^{(k)}$ then performs a dimensionality reduction; overall, a one-layer perceptron is used to “transfer” features from the networks $j < k$ to the current network. The same approach can also be applied to convolutional layers (e.g. a $1 \times 1$ convolution can be used for the dimensionality reduction). In experiments, the authors show that progressive networks allow efficient transfer learning (efficient in terms of faster training). Additionally, they study which features are actually transferred. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
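The adapter described above (scale, concatenate, reduce, project) can be sketched in a few lines of NumPy. All sizes here are hypothetical (two frozen columns with 8-dimensional activations, a 4-dimensional bottleneck), and `alphas`, `V`, and `U` stand in for the learned scales $\alpha_i^{(j)}$, the reduction $V_i^{(k)}$, and the projection $U_i^{(k)}$; this is a sketch of the equation, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x):
    return np.maximum(x, 0.0)

def adapter(prev_feats, alphas, V, U):
    # Scale each frozen column's features by its learned alpha, concatenate,
    # reduce dimensionality with V, then project into the new column with U:
    # U f(V [alpha_j * h_{i-1}^{(j)}]_{j < k})
    scaled = np.concatenate([a * h for a, h in zip(alphas, prev_feats)])
    return U @ relu(V @ scaled)

# Hypothetical sizes: two frozen columns with 8-dim activations at layer i-1,
# reduced to 4 dims, feeding an 8-dim layer of the new column.
h_prev = [rng.normal(size=8), rng.normal(size=8)]
alphas = np.ones(2)              # one learned scale per frozen column
V = rng.normal(0, 0.1, (4, 16))  # dimensionality reduction of concatenated features
U = rng.normal(0, 0.1, (8, 4))   # projection into the new column's layer

lateral = adapter(h_prev, alphas, V, U)
h_new = relu(rng.normal(0, 0.1, (8, 8)) @ rng.normal(size=8) + lateral)
print(h_new.shape)
```

With this one-layer adapter, the lateral parameter count scales with the bottleneck width rather than with the full concatenated feature size, which is the point of the dimensionality reduction.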