Memory Limited, Streaming PCAMemory Limited, Streaming PCAMitliagkas, Ioannis and Caramanis, Constantine and Jain, Prateek2013
Paper summarynipsreviewsThis paper proposes an approach to one-pass SVD based on a blocked variant of the power method, which variance is reduced within each block of streaming data, and compares to exact batch SVD.
Figure 1d is offered as an example where the proposed Algo 1 can scale to data for which the authors claim to be so large that "traditional batch methods" could not be run and reported. Yet there are many existing well-known SVD methods which are routinely used for even larger data sets than the largest here (sparse 8.2M vectors for 120k dimensions). These include the EMPCA (Roweiss 1998) and fast randomized SVD (Haiko et at 2011), both of which the author's cite. Why were these methods (both very simple to implement efficiently even in Matlab, etc.) not reported for this data? Especially necessary to compare against is the randomized SVD, since it too can be done in one-pass (see Haiko et al); although that cited paper discusses the tradeoffs in doing multiple passes -- something this paper does not even discuss. The authors say it took "a few hours" for Algo 1 to extract the top 7 components. Methods like the randomized SVD family of Haiko et al scale linearly in those parameters (n=8.2M and d=120k and k=7 and the number of non-zeros of the sparse data) and typically run in less than 1 hour for even larger data sets. So, demonstrating both the speed and accuracy of the proposed Algo 1 compared to the randomized algorithms seems necessary at this point, to establish the practical significance of this proposed approach.
This paper identifies and resolves a basic gap in the design of streaming PCA algorithms. It is shown that a block stochastic streaming version of the power method recovers the dominant rank-k PCA subspace with optimal memory requirements and sample complexity not too worse than batch PCA (which maintains the covariance matrix explicitly), assuming that streaming data is drawn from a natural probabilistic generative model. The paper is excellently written and provides intuitions for the analysis, starting with exact rank 1 and exact rank k case to the general rank k approximation problem. Some empirical analysis is also provided illustrating the approach for PCA on large document-term matrices.
This paper proposes an approach to one-pass SVD based on a blocked variant of the power method, which variance is reduced within each block of streaming data, and compares to exact batch SVD.
Figure 1d is offered as an example where the proposed Algo 1 can scale to data for which the authors claim to be so large that "traditional batch methods" could not be run and reported. Yet there are many existing well-known SVD methods which are routinely used for even larger data sets than the largest here (sparse 8.2M vectors for 120k dimensions). These include the EMPCA (Roweiss 1998) and fast randomized SVD (Haiko et at 2011), both of which the author's cite. Why were these methods (both very simple to implement efficiently even in Matlab, etc.) not reported for this data? Especially necessary to compare against is the randomized SVD, since it too can be done in one-pass (see Haiko et al); although that cited paper discusses the tradeoffs in doing multiple passes -- something this paper does not even discuss. The authors say it took "a few hours" for Algo 1 to extract the top 7 components. Methods like the randomized SVD family of Haiko et al scale linearly in those parameters (n=8.2M and d=120k and k=7 and the number of non-zeros of the sparse data) and typically run in less than 1 hour for even larger data sets. So, demonstrating both the speed and accuracy of the proposed Algo 1 compared to the randomized algorithms seems necessary at this point, to establish the practical significance of this proposed approach.
This paper identifies and resolves a basic gap in the design of streaming PCA algorithms. It is shown that a block stochastic streaming version of the power method recovers the dominant rank-k PCA subspace with optimal memory requirements and sample complexity not too worse than batch PCA (which maintains the covariance matrix explicitly), assuming that streaming data is drawn from a natural probabilistic generative model. The paper is excellently written and provides intuitions for the analysis, starting with exact rank 1 and exact rank k case to the general rank k approximation problem. Some empirical analysis is also provided illustrating the approach for PCA on large document-term matrices.