Visualizing Large-scale and High-dimensional Data Visualizing Large-scale and High-dimensional Data
Paper summary #### Introduction * LargeVis - a technique to visualize large-scale and high-dimensional data in low-dimensional space. * Problem relates to both information visualization and machine learning (and data mining) domain. * [Link to the paper]( #### t-SNE * State of the art method for visualization problem. * Start by constructing a similarity structure from the data and then project the structure to 2/3 dimensional space. * An accelerated version proposed that uses a K-nearest Neighbor (KNN) graph as the similarity structure. * Limitations * Computational cost of constructing KNN graph for high-dimensional data. * Efficiency of graph visualization techniques breaks down for large data ($O(NlogN)$ complexity). * Parameters are sensitive to the dataset. #### LargeVis * Constructs KNN graph (in a more efficient manner as compared to t-SNE). * Uses a principled probabilistic model for graph visualization. * Let us say the input is in the form of $N$ datapoints in $d$ dimensional space. ##### KNN Graph Construction * Random projection tree used for nearest-neighborhood search in the high-dimensional space with euclidean distance as metric of distance. * Tree is constructed by partitioning the entire space and the nodes in the tree correspond to subspaces. * For every non-leaf node of the tree, select a random hyperplane that splits the subspace (corresponding to the leaf node) into two children. * This is done till the number of nodes in the subspace reaches a threshold. * Once the tree is constructed, each data point gets assigned a leaf node and points in the subspace of the leaf node are the candidate nearest neighbors of that datapoint. * For high accuracy, a large number of trees should be constructed (which increases the computational cost). * The paper counters this bottleneck by using the idea "a neighbor of my neighbor is also my neighbor" to increase the accuracy of the constructed graph. * Basically construct an approximate KNN graph using random projection trees and then for each node, search its neighbor's neighbors as well. * Edges are assigned weights just like in t-SNE. ##### Probabilistic Model For Graph Visualization * Given a pair of vertices, the probability of observing an edge between them is given as a function of the distance between the projection of the pair of vertices in the lower dimensional space. * The probability of observing an edge with weight $w$ is given as $w_{th}$ power of probability of observing an edge. * Given a weighted graph, $G$, the likelihood of the graph can be modeled as the likelihood of observed edges plus the likelihood of negative edges (vertex pairs without edges). * To simplify the objective function, some negative edges are sampled corresponding to each vertex using a noisy distribution. * The edges are sampled with probability proportional to their weight and then treated as binary edges. * The resulting objective function can be optimized using asynchronous stochastic gradient descent (very effective on sparse graphs). * The overall complexity is $O(sMN)$, $s$ is the dimension of lower dimensional space, $M$ is the number of negative samples and $N$ is the number of vertices. #### Experiments * Data is first preprocessed with embedding learning techniques like SkipGram and LINE and brought down to 100-dimensional space. * One limitation is that the data is preprocessed to reduce the number of dimensions to 100. This seems to go against the claim of scaling for hundreds of dimensions. * KNN construction is fastest for LargeVis followed by random projection trees, NN Descent, and vantage point trees (used in t-SNE). * LargeVis requires very few iterations to create highly accurate KNN graphs. * KNN Classifier is used to measure the accuracy of graph visualization quantitatively. * Compared to t-SNE, LargeVis is much more stable with respect to learning rate across datasets. * LargeVis benefits from its linear complexity (as compared to t-SNE's log linear complexity) and for default learning rate, outperforms t-SNE for larger datasets.
Visualizing Large-scale and High-dimensional Data
Jian Tang and Jingzhou Liu and Ming Zhang and Qiaozhu Mei
arXiv e-Print archive - 2016 via arXiv
Keywords: cs.LG, cs.HC


Your comment:

Short Science allows researchers to publish paper summaries that are voted on and ranked!