Summaries from Neural Information Processing Systems Conference on ShortScience.org

nips.djvuzone.org
sci-hub
scholar.google.com

The Cascade-Correlation Learning Architecture
Fahlman, Scott E. and Lebiere, Christian
Neural Information Processing Systems Conference - 1989 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 7 years ago

Cascade Correlation is an algorithm to create feed-forward neural network architectures. However, those architectures are not the typical layered architectures. See [my YouTube video](https://www.youtube.com/watch?v=1E3XZr-bzZ4) for a short explanation of the constructed architecture.

For the "correlation" part, see [this question](http://datascience.stackexchange.com/q/9672/8820).
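As a rough sketch of the quantity behind the "correlation" in the name: in the Fahlman and Lebiere paper, a candidate unit is trained to maximize $S = \sum_o \left| \sum_p (V_p - \bar{V})(E_{p,o} - \bar{E}_o) \right|$, where $V_p$ is the candidate's output on pattern $p$ and $E_{p,o}$ is the residual error at output $o$. The helper name `candidate_score` and the toy numbers below are assumptions made for illustration, not taken from the paper.

```python
def candidate_score(values, errors):
    """S = sum over outputs o of |sum over patterns p of (V_p - mean V) * (E_po - mean E_o)|.

    values: candidate activations, one per training pattern.
    errors: errors[p][o] is the residual error at output o for pattern p.
    """
    n_patterns = len(values)
    n_outputs = len(errors[0])
    v_mean = sum(values) / n_patterns
    score = 0.0
    for o in range(n_outputs):
        e_mean = sum(errors[p][o] for p in range(n_patterns)) / n_patterns
        cov = sum((values[p] - v_mean) * (errors[p][o] - e_mean)
                  for p in range(n_patterns))
        score += abs(cov)
    return score


# Toy data (hypothetical): the candidate fires exactly where the error is positive.
values = [0.0, 1.0, 0.0, 1.0]
errors = [[-0.5], [0.5], [-0.5], [0.5]]
score = candidate_score(values, errors)

# A constant candidate carries no information about the error.
constant_score = candidate_score([0.5, 0.5, 0.5, 0.5], errors)
```

A candidate whose activation tracks the remaining error scores high, while a constant candidate scores zero, which is why the former is the one added to the network.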

## Related work

See the [Meiosis Networks summary](http://www.shortscience.org/paper?bibtexKey=conf/nips/Hanson89#martinthoma) for a list of topology learning papers.


Meiosis Networks
Hanson, Stephen Jose
Neural Information Processing Systems Conference - 1989 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 7 years ago

This paper is about topology learning (also called *structural learning*, in contrast to *parametric learning*) for neural networks. Instead of using deterministic weights, each weight $w_{ij}$ is normally distributed ($w_{ij} \sim \mathcal{N}(\mu_{ij}, \sigma^2_{ij})$). Hence every connection has two learned parameters: $\mu_{ij}$ and $\sigma^2_{ij}$.
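A minimal sketch of what sampling such weights could look like, using only the Python standard library. The function names, the toy inputs, and the choice to pass the standard deviation $\sigma$ (rather than the variance $\sigma^2$) are assumptions for illustration, not details from the paper.

```python
import random


def sample_weight(mu, sigma, rng):
    """Draw one weight value from N(mu, sigma^2); sigma is the standard deviation."""
    return rng.gauss(mu, sigma)


def stochastic_forward(inputs, mus, sigmas, rng):
    """Weighted sum of the inputs, with every weight freshly sampled."""
    return sum(x * sample_weight(m, s, rng)
               for x, m, s in zip(inputs, mus, sigmas))


rng = random.Random(0)
inputs = [1.0, 0.5]
mus = [0.2, -0.4]

# With zero spread every sample equals its mean, so the unit is deterministic.
deterministic = stochastic_forward(inputs, mus, [0.0, 0.0], rng)

# With non-zero spread, repeated forward passes give different results.
noisy = [stochastic_forward(inputs, mus, [0.3, 0.3], rng) for _ in range(5)]
```

The larger $\sigma_{ij}$ is relative to $\mu_{ij}$, the more the sampled value is dominated by noise rather than by the learned mean.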

Meiosis is a form of cell division. Accordingly, meiosis networks split nodes under some conditions.

The "topology" learning seems to be limited to adding single neurons to the given layers. It cannot add new layers or skip connections.

## Chapters

* 1.1 Learning and Search: The author seems to describe the [VC dimension](https://en.wikipedia.org/wiki/VC_dimension).
* 1.2 Stochastic Delta Rule: Explains how to update the weight parameters.
* 1.3 Meiosis: The network's variances are initialized randomly with $\sigma_i^2 \sim U([-10, 10])$ (negative variance???). A node $j$ is split when the random part dominates the value of the sampled weights: $$\frac{\sum_i \sigma_{ij}}{\sum_i \mu_{ij}} > 1 \text{ and } \frac{\sum_k \sigma_{jk}}{\sum_k \mu_{jk}} > 1$$ The means of the new nodes are sampled around the old mean (TODO: how is it sampled?); half the variance is assigned to the new connections.
* 1.4 Examples: XOR, 3-bit parity, blood NMR data, learning curves. Learning rate of $\eta = 0.5$, momentum of $\alpha = 0.75$.
* 1.5 Conclusion:
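
The split criterion from 1.3 can be sketched as a small check. This follows the formula exactly as written in the summary (plain sums, no absolute values) and assumes the sums of the means are non-zero; the function name `should_split` and the example numbers are hypothetical.

```python
def should_split(sigma_in, mu_in, sigma_out, mu_out):
    """Return True when both ratios of the split criterion exceed 1.

    sigma_in, mu_in:   spreads and means of the weights feeding into node j.
    sigma_out, mu_out: spreads and means of the weights leaving node j.
    Assumes sum(mu_in) and sum(mu_out) are non-zero.
    """
    ratio_in = sum(sigma_in) / sum(mu_in)
    ratio_out = sum(sigma_out) / sum(mu_out)
    return ratio_in > 1 and ratio_out > 1


# Noise dominates on both sides of the node: split it.
noisy_node = should_split([0.9, 0.8], [0.5, 0.4], [1.2], [0.6])

# Incoming weights are well determined: keep the node as it is.
stable_node = should_split([0.1, 0.1], [0.5, 0.4], [1.2], [0.6])
```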

## What I don't understand

1. In the present approach, weights reflect a coarse prediction history as coded by a distribution of values and parameterized in the mean and standard deviation of these weight distributions. 
2. The first formula (1)
3. What negative variance is.
4. How exactly the means are sampled


## Related work

* Constructive Methods
  * 1989: [The Cascade-Correlation Learning Architecture](http://www.shortscience.org/paper?bibtexKey=conf/nips/FahlmanL89)
  * 1989: [Dynamic Node Creation in Backpropagation Networks](http://www.shortscience.org/paper?bibtexKey=ash:dynamic): Only one hidden layer
* Pruning methods
  * 1989: [Optimal Brain Damage](http://www.shortscience.org/paper?bibtexKey=conf/nips/CunDS89)
  * 1993: [Optimal Brain Surgeon](http://www.shortscience.org/paper?bibtexKey=conf/nips/HassibiS92)
  * 2015: [Learning both Weights and Connections for Efficient Neural Networks](http://www.shortscience.org/paper?bibtexKey=journals/corr/1506.02626)
  * 2016: [Neural networks with differentiable structure](http://www.shortscience.org/paper?bibtexKey=journals%2Fcorr%2F1606.06216#martinthoma)


Optimal Brain Damage
LeCun, Yann and Denker, John S. and Solla, Sara A.
Neural Information Processing Systems Conference - 1989 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 7 years ago

Optimal Brain Damage (OBD) is a technique to make a network smaller by pruning weights that have little impact on the training error.


## Idea

* use second-derivative information to make tradeoff between network complexity and training error
* do this while training to prevent overfitting / reduce the need for data / reduce training time
* **How to choose what to delete**: Weights which have least impact on training error. This is estimated by approximating the function with a Taylor series.
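
A sketch of the Taylor expansion this refers to, in the paper's notation, where $u_i$ are the parameters, $g_i$ the components of the gradient and $h_{ij}$ the entries of the Hessian of the error $E$:

$$\delta E = \sum_i g_i \delta u_i + \frac{1}{2} \sum_i h_{ii} \delta u_i^2 + \frac{1}{2} \sum_{i \neq j} h_{ij} \delta u_i \delta u_j + O(\|\delta U\|^3)$$

Assuming the network sits at a local minimum (the gradient term vanishes), the Hessian is treated as diagonal (the cross terms are dropped) and the error is roughly quadratic (higher-order terms are dropped), this reduces to

$$\delta E \approx \frac{1}{2} \sum_i h_{ii} \delta u_i^2$$

Deleting a parameter $u_k$ corresponds to $\delta u_k = -u_k$, so the estimated increase in error is $h_{kk} u_k^2 / 2$.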

## Recipe

(Directly copied from the paper):

The OBD procedure can be carried out as follows:

1. Choose a reasonable network architecture
2. Train the network until a reasonable solution is obtained
3. Compute the second derivatives $h_{kk}$ for each parameter
4. Compute the saliencies for each parameter: $s_k = h_{kk} u_k^2 /2$
5. Sort the parameters by saliency and delete some low-saliency parameters
6. Iterate to step 2

Deleting a parameter is defined as setting it to 0 and freezing it there. Several variants of the procedure can be devised, such as decreasing the values of the low-saliency parameters instead of simply setting them to 0, or allowing the deleted parameters to adapt again after they have been set to 0.
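
Steps 4 and 5 of the recipe can be sketched in a few lines of plain Python. The function names, the mask convention and the toy numbers are assumptions for illustration; in particular the diagonal Hessian entries are made up here, whereas the paper computes them with a backpropagation-like procedure.

```python
def saliencies(params, hessian_diag):
    """s_k = h_kk * u_k^2 / 2 for each parameter u_k."""
    return [h * u * u / 2 for u, h in zip(params, hessian_diag)]


def prune_lowest(params, hessian_diag, n_delete):
    """Set the n_delete lowest-saliency parameters to 0.

    Returns (new_params, mask); mask[k] is False for deleted parameters,
    so a training loop can keep them frozen at 0.
    """
    s = saliencies(params, hessian_diag)
    order = sorted(range(len(params)), key=lambda k: s[k])
    deleted = set(order[:n_delete])
    new_params = [0.0 if k in deleted else u for k, u in enumerate(params)]
    mask = [k not in deleted for k in range(len(params))]
    return new_params, mask


# Hypothetical parameters and second derivatives.
params = [0.1, -2.0, 0.5, 1.0]
hessian_diag = [4.0, 0.001, 1.0, 2.0]
new_params, mask = prune_lowest(params, hessian_diag, n_delete=2)
```

In this toy example the weight with the largest magnitude ($-2.0$) is deleted because its curvature is tiny, which illustrates how saliency-based pruning can differ from simply removing the smallest weights.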

## See also

* 1989: Optimal Brain Damage ([original pdf](https://papers.nips.cc/paper/250-optimal-brain-damage.pdf), [nice pdf](http://yann.lecun.com/exdb/publis/pdf/lecun-90b.pdf), [txt](https://github.com/NicolasEstrada/nlp/blob/master/nipstxt/nips02/0598.txt))
* 1993: [Optimal Brain Surgeon](http://www.shortscience.org/paper?bibtexKey=conf/nips/HassibiS92) ([pdf](https://papers.nips.cc/paper/647-second-order-derivatives-for-network-pruning-optimal-brain-surgeon.pdf) and [follow-up](http://www.shortscience.org/paper?bibtexKey=conf/nips/HassibiSW93), [2](http://www.shortscience.org/paper?bibtexKey=conf/epia/EndischHS07))
* 1998: LeNet-5
* 2012: AlexNet
* 2015: [Learning both Weights and Connections for Efficient Neural Networks](http://www.shortscience.org/paper?bibtexKey=journals/corr/1506.02626)
* 2016: [Neural networks with differentiable structure](http://www.shortscience.org/paper?bibtexKey=journals%2Fcorr%2F1606.06216#martinthoma)