Predicting Parameters in Deep LearningPredicting Parameters in Deep LearningDenil, Misha and Shakibi, Babak and Dinh, Laurent and Ranzato, Marc'Aurelio and de Freitas, Nando2013
Paper summarynipsreviewsMotivated by recent attempts to learn very large networks this work proposes an approach for reducing the number of free parameters in neural-network type architectures. The method is based on the intuition that there is typically strong redundancy in the learned parameters (for instance, the first layer filters of of NNs applied to images are smooth): The authors suggest to learn only a subset of the parameter values and to then predicted the remaining ones through some form of interpolation. The proposed approach is evaluated for several architectures (MLP, convolutional NN, reconstruction-ICA) and different vision datasets (MNIST, CIFAR, STL-10). The results suggest that in general it is sufficient to learn fewer than 50% of the parameters without any loss in performance (significantly fewer parameters seem sufficient for MNIST).
The method is relatively simple: The authors assume a low-rank decomposition of the weight matrix and then further fix one of the two matrices using prior knowledge about the data (e.g., in the vision case, exploiting the fact that nearby pixels - and weights - tend to be correlated). This can be interpreted as predicting the "unobserved" parameters from the subset of learned filter weights via kernel ridge regression, where the kernel captures prior knowledge about the topology / "smoothness" of the weights. For the situation when such prior knowledge is not available the authors describe a way to learn a suitable kernel from data.
The idea of reducing the number of parameters in NN-like architectures through connectivity constraints in itself is of course not novel, and the authors provide a pretty good discussion of related work in section 5. Their method is very closely related to the idea of factorizing weight matrices as is, for instance, commonly done for 3-way RBMs (e.g. ref  in the paper), but also occasionally for standard RBMs (e.g. [R1], missing in the paper). The present papers differs from these in that the authors propose to exploit prior knowledge to constrain one of the matrices. As also discussed by the authors, the approach can further be interpreted as a particular type of pooling -- a strategy commonly employed in convolutional neural networks. Another view of the proposed approach is that the filters are represented as a linear combination of basis functions (in the paper, the particular form of the basis functions is determined by the choice of kernel). Such representations have been explored in various forms and to various ends in the computer vision and signal processing literature (see e.g. [R2,R3,R4,R5]). [R4,R5], for instance, represent filters in terms of a linear combination of basis functions that reduce the computational complexity of the filtering process).