[link]
#### Introduction * Method to visualize highdimensional data points in 2/3 dimensional space. * Data visualization techniques like Chernoff faces and graph approaches just provide a representation and not an interpretation. * Dimensionality reduction techniques fail to retain both local and global structure of the data simultaneously. For example, PCA and MDS are linear techniques and fail on data lying on a nonlinear manifold. * tSNE approach converts data into a matrix of pairwise similarities and visualizes this matrix. * Based on SNE (Stochastic Neighbor Embedding) * [Link to paper](http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf) #### SNE * Given a set of datapoints $x_1, x_2, ...x_n, p_{ij}$ is the probability that $x_i$ would pick $x_j$ as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at $x_i$. Calculation of $\sigma_i$ is described later. * Similarly, define $q_{ij}$ as conditional probability corresponding to lowdimensional representations of $y_i$ and $y_j$ (corresponding to $x_i$ and $x_j$). The variance of Gaussian in this case is set to be $1/\sqrt{2}$ * Argument is that if $y_i$ and $y_j$ captures the similarity between $x_i$ and $x_j$, $p_{ij}$ and $q_{ij}$ should be equal. So objective function to be minimized is KullbackLeibler (KL) Divergence measure for $P_i$ and $Q_i$, where $P_i$ ($Q_i$) represent conditional probability distribution given $x_i$ ($y_i$) * Since KL Divergence is not symmetric, the objective function focuses on retaining the local structure. * Users specify a value called perplexity (measure of effective number of neighbors). A binary search is performed to find $\sigma_i$ which produces the $P_i$ having same perplexity as specified by the user. * Gradient Descent (with momentum) is used to minimize objective function and Gaussian noise is added in early stages to perform simulated annealing. #### tSNE (tDistributed SNE) ##### Symmetric SNE * A single KL Divergence between P (joint probability distribution in highdimensional space) and Q (joint probability distribution in lowdimensional space) is minimized. * Symmetric because $p_{ij} = p_{ji}$ and $q_{ij} = q_{ji}$ * More robust to outliers and has a simpler gradient expression. ##### Crowding Problem * When we model a highdimensional dataset in 2 (or 3) dimensions, it is difficult to segregate the nearby datapoints from moderately distant datapoints and gaps can not form between natural clusters. * One way around this problem is to use UNISNE but optimization of the cost function, in that case, is difficult. ##### tSNE * Instead of Gaussian, use a heavytailed distribution (like Studentt distribution) to convert distances into probability scores in low dimensions. This way moderate distance in highdimensional space can be modeled by larger distance in lowdimensional space. * Studentt distribution is an infinite mixture of Gaussians and density for a point under this distribution can be computed very fast. * The cost function is easy to optimize. ##### Optimization Tricks ###### Early Compression * At the start of optimization, force the datapoints (in lowdimensional space) to stay close together so that datapoints can easily move from one cluster to another. * Implemented an L2penalty term proportional to the sum of the squared distance of datapoints from the origin. ###### Early Exaggeration * Scale all the $p_{ij}$'s so that large $q_{ij}$*'s are obtained with the effect that natural clusters in the data form tight, widely separated clusters as a lot of empty space is created in the lowdimensional space. ##### tSNE on large datasets * Space and time complexity is quadratic in the number of datapoints so infeasible to apply on large datasets. * Select a random subset of points (called landmark points) to display. * for each landmark point, define a random walk starting at a landmark point and terminating at any other landmark point. * $p_{ij}$ is defined as fraction of random walks starting at $x_i$ and finishing at $x_j$ (both these points are landmark points). This way, $p_{ij}$ is not sensitive to "shortcircuits" in the graph (due to noisy data points). #### Advantages of tSNE * Gaussian kernel employed by tSNE (in highdimensional) defines a soft border between the local and global structure of the data. * Both nearby and distant pair of datapoints get equal importance in modeling the lowdimensional coordinates. * The local neighborhood size of each datapoint is determined on the basis of the local density of the data. * Random walk version of tSNE takes care of "shortcircuit" problem. #### Limitations of tSNE * It is unclear tSNE would perform on general **Dimensionality Reduction** for more than 3 dimensions. For such higher (than 3) dimensions, Studentt distribution with more degrees of freedom should be more appropriate. * tSNE reduces the dimensionality of data mainly based on local properties of the data which means it would fail in data which has intrinsically high dimensional structure (**curse of dimensionality**). * The cost function for tSNE is not convex requiring several optimization parameters to be chosen which would affect the constructed solution.
Your comment:
