# Meiosis Networks
Hanson, Stephen Jose, 1989

Paper summary by martinthoma

This paper is about topology learning (also called *structural learning*, in contrast to *parametric learning*) for neural networks. Instead of using deterministic weights, each weight $w_{ij}$ is normally distributed ($w_{ij} \sim \mathcal{N}(\mu_{ij}, \sigma^2_{ij})$). Hence every connection has two learned parameters: $\mu_{ij}$ and $\sigma^2_{ij}$.
Meiosis is cell division; accordingly, meiosis networks split nodes when certain conditions are met.
The "topology" being learned seems limited to adding single neurons to existing layers; the method cannot add new layers or skip connections.
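A minimal numpy sketch of this parameterization (the layer sizes and initial values here are made up, not from the paper): each forward pass draws a fresh weight realization from the per-connection Gaussians.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes; every connection carries a learned mean and variance.
n_in, n_out = 4, 3
mu = rng.normal(0.0, 0.1, size=(n_in, n_out))  # learned means mu_ij
sigma2 = np.full((n_in, n_out), 0.01)          # learned variances sigma^2_ij

def sample_weights(mu, sigma2, rng):
    """Draw one realization w_ij ~ N(mu_ij, sigma^2_ij) for a forward pass."""
    return rng.normal(mu, np.sqrt(sigma2))

w = sample_weights(mu, sigma2, rng)
print(w.shape)  # (4, 3)
```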
## Chapters
* 1.1 Learning and Search: The author seems to describe the [VC dimension](https://en.wikipedia.org/wiki/VC_dimension).
* 1.2 Stochastic Delta Rule: Explains how to update the weights parameters.
* 1.3 Meiosis: The network's variances are initialized randomly with $\sigma_i^2 \sim U([-10, 10])$ (negative variance???). A node $j$ is split when the random part dominates the value of the sampled weights: $$\frac{\sum_i \sigma_{ij}}{\sum_i \mu_{ij}} > 1 \text{ and } \frac{\sum_k \sigma_{jk}}{\sum_k \mu_{jk}} > 1$$ The means of the new nodes are sampled around the old mean (TODO: how is it sampled?), and half the variance is assigned to the new connections.
* 1.4 Examples: XOR, 3-bit parity, blood NMR data, learning curves. Learning rate of $\eta = 0.5$, momentum of $\alpha = 0.75$.
* 1.5 Conclusion:
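The split criterion from section 1.3 can be sketched as follows. This is a hypothetical helper, not the paper's code; since it is unclear how the new means are sampled, only the criterion check itself is shown.

```python
import numpy as np

def should_split(sigma_in, mu_in, sigma_out, mu_out):
    """Check the meiosis criterion for a node j: split when the summed
    standard deviations dominate the summed means on both the incoming
    and the outgoing connections."""
    return bool(sigma_in.sum() / mu_in.sum() > 1.0
                and sigma_out.sum() / mu_out.sum() > 1.0)

# Noisy node: std sums exceed mean sums on both sides -> split.
print(should_split(np.array([2.0, 2.0]), np.array([1.0, 1.0]),
                   np.array([3.0]), np.array([1.0])))  # True
```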
## What I don't understand
1. The statement: "In the present approach, weights reflect a coarse prediction history as coded by a distribution of values and parameterized in the mean and standard deviation of these weight distributions."
2. The first formula (1)
3. What negative variance is.
4. How exactly the means are sampled
## Related work
* Constructive Methods
* 1989: [The Cascade-Correlation Learning Architecture](http://www.shortscience.org/paper?bibtexKey=conf/nips/FahlmanL89)
* 1989: [Dynamic Node Creation in Backpropagation Networks](http://www.shortscience.org/paper?bibtexKey=ash:dynamic): Only one hidden layer
* Pruning methods
* 1989: [Optimal Brain Damage](http://www.shortscience.org/paper?bibtexKey=conf/nips/CunDS89)
* 1993: [Optimal Brain Surgeon](http://www.shortscience.org/paper?bibtexKey=conf/nips/HassibiS92)
* 2015: [Learning both Weights and Connections for Efficient Neural Networks](http://www.shortscience.org/paper?bibtexKey=journals/corr/1506.02626)
* 2016: [Neural networks with differentiable structure](http://www.shortscience.org/paper?bibtexKey=journals%2Fcorr%2F1606.06216#martinthoma)
