Visualizing data using t-SNE Visualizing data using t-SNE
Paper summary #### Introduction * Method to visualize high-dimensional data points in 2/3 dimensional space. * Data visualization techniques like Chernoff faces and graph approaches just provide a representation and not an interpretation. * Dimensionality reduction techniques fail to retain both local and global structure of the data simultaneously. For example, PCA and MDS are linear techniques and fail on data lying on a non-linear manifold. * t-SNE approach converts data into a matrix of pairwise similarities and visualizes this matrix. * Based on SNE (Stochastic Neighbor Embedding) * [Link to paper]( #### SNE * Given a set of datapoints $x_1, x_2, ...x_n, p_{i|j}$ is the probability that $x_i$ would pick $x_j$ as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at $x_i$. Calculation of $\sigma_i$ is described later. * Similarly, define $q_{i|j}$ as conditional probability corresponding to low-dimensional representations of $y_i$ and $y_j$ (corresponding to $x_i$ and $x_j$). The variance of Gaussian in this case is set to be $1/\sqrt{2}$ * Argument is that if $y_i$ and $y_j$ captures the similarity between $x_i$ and $x_j$, $p_{i|j}$ and $q_{i|j}$ should be equal. So objective function to be minimized is Kullback-Leibler (KL) Divergence measure for $P_i$ and $Q_i$, where $P_i$ ($Q_i$) represent conditional probability distribution given $x_i$ ($y_i$) * Since KL Divergence is not symmetric, the objective function focuses on retaining the local structure. * Users specify a value called perplexity (measure of effective number of neighbors). A binary search is performed to find $\sigma_i$ which produces the $P_i$ having same perplexity as specified by the user. * Gradient Descent (with momentum) is used to minimize objective function and Gaussian noise is added in early stages to perform simulated annealing. #### t-SNE (t-Distributed SNE) ##### Symmetric SNE * A single KL Divergence between P (joint probability distribution in high-dimensional space) and Q (joint probability distribution in low-dimensional space) is minimized. * Symmetric because $p_{i|j} = p_{j|i}$ and $q_{i|j} = q_{j|i}$ * More robust to outliers and has a simpler gradient expression. ##### Crowding Problem * When we model a high-dimensional dataset in 2 (or 3) dimensions, it is difficult to segregate the nearby datapoints from moderately distant datapoints and gaps can not form between natural clusters. * One way around this problem is to use UNI-SNE but optimization of the cost function, in that case, is difficult. ##### t-SNE * Instead of Gaussian, use a heavy-tailed distribution (like Student-t distribution) to convert distances into probability scores in low dimensions. This way moderate distance in high-dimensional space can be modeled by larger distance in low-dimensional space. * Student-t distribution is an infinite mixture of Gaussians and density for a point under this distribution can be computed very fast. * The cost function is easy to optimize. ##### Optimization Tricks ###### Early Compression * At the start of optimization, force the datapoints (in low-dimensional space) to stay close together so that datapoints can easily move from one cluster to another. * Implemented an L2-penalty term proportional to the sum of the squared distance of datapoints from the origin. ###### Early Exaggeration * Scale all the $p_{i|j}$'s so that large $q_{i|j}$*'s are obtained with the effect that natural clusters in the data form tight, widely separated clusters as a lot of empty space is created in the low-dimensional space. ##### t-SNE on large datasets * Space and time complexity is quadratic in the number of datapoints so infeasible to apply on large datasets. * Select a random subset of points (called landmark points) to display. * for each landmark point, define a random walk starting at a landmark point and terminating at any other landmark point. * $p_{i|j}$ is defined as fraction of random walks starting at $x_i$ and finishing at $x_j$ (both these points are landmark points). This way, $p_{i|j}$ is not sensitive to "short-circuits" in the graph (due to noisy data points). #### Advantages of t-SNE * Gaussian kernel employed by t-SNE (in high-dimensional) defines a soft border between the local and global structure of the data. * Both nearby and distant pair of datapoints get equal importance in modeling the low-dimensional coordinates. * The local neighborhood size of each datapoint is determined on the basis of the local density of the data. * Random walk version of t-SNE takes care of "short-circuit" problem. #### Limitations of t-SNE * It is unclear t-SNE would perform on general **Dimensionality Reduction** for more than 3 dimensions. For such higher (than 3) dimensions, Student-t distribution with more degrees of freedom should be more appropriate. * t-SNE reduces the dimensionality of data mainly based on local properties of the data which means it would fail in data which has intrinsically high dimensional structure (**curse of dimensionality**). * The cost function for t-SNE is not convex requiring several optimization parameters to be chosen which would affect the constructed solution. allows researchers to publish paper summaries that are voted on and ranked!

Sponsored by: and