This paper developed a semantically rich representation for natural sound using unlabeled videos as a bridge to transfer discriminative visual knowledge from well-established visual recognition models into the sound modality. The learned sound representation yields significant performance improvements on standard benchmarks for acoustic scene classification task. ### Key Points - The natural synchronization between vision and sound can be leveraged as a supervision signal for each other. - Cross-modal learning can overcome overfitting if the target modal have much fewer data than other modals, which is essential for deep networks to work well. - In the sound classification task, **pool5** and **conv6** extracted from SoundNet achieve best performance. ### Model - The authors proposed a student-teacher training procedure to transfer discriminative visual knowledge from visual recognition models trained on ImageNet and Places into the SoundNet by minimizing KL divergence between their predictions. !(https://cloud.githubusercontent.com/assets/7057863/20856609/05fe12d6-b94e-11e6-8c92-995ee84fe0d7.png) - Two reasons to use CNN for sound: 1. invariant to translations; 2. stacking layers to detect higher-level concepts. ### Exp - Adding a linear SVM upon representation learned from SoundNet outperforms other existing methods 10%. - Using lots of unlabeled videos as supervision signals enable the deeper SoundNet to work, or otherwise the 8-layer networks performs poorly due to overfitting. - Simultaneous Using Places and ImageNet as supervision beats using only one of them 3%. - Multi-modal recognition models use visual and sound data together yields 2% gain in classification accuracy. ### Thought I think this paper is really complete since it contains good intuition, ablation analysis, representation visualization, hidden unit visualization, and significent performance imporvements. ### Questions - Although paper said that "To handle variable-temporal-length of input sound, this model uses a fully convolutional network and produces an output over multiple timesteps in video.", but the code seems to set the length of each excerpts fixed to 5 seconds. - It looks not clear for me about the data augmentation technique used in training.