The submission presents an experimental setup for analyzing the success of gradient-based optimization and the performance of networks with large numbers of parameters. The authors propose to train a convolutional network on MNIST and analyze the gradient descent paths through weight space. The trajectories are compared and evaluated using PCA. This is very similar to the approach taken by Goodfellow et al., and it is difficult to see what new contributions this submission makes. The results are mostly well known at this point, although there is certainly room for further research in this area. The demonstration of divergence during training caused by shuffled inputs is interesting but not surprising. There are no new visualizations or qualitative results, and the quantitative results are limited to two numbers (the variance explained by the top 2 and top 10 principal components), which are meaningless without more extensive comparison and analysis. As it stands, the paper is too misleading, and takes too much credit for previously known ideas, for me to endorse it.