`Update 2015/11/23: Since I first wrote this note, I became involved in the next iterations of this work, which became v2 of the arXiv manuscript. The notes below were made based on v1.` This paper considers the problem of Maximum Inner Product Search (MIPS). In MIPS, given a query $q$ and a set of inputs $x_i$, we want to find the input (or the top n inputs) with highest inner product, i.e. $argmax_i q' x_i$. Recently, it was shown that a simple transformation to the query and input vectors made it possible to approximately solve MIPS using hashing methods for Maximum Cosine Similarity Search (MCSS), a problem for which solutions are readily available (see section 2.4 for a brief but very clear description of the transformation). In this paper, the authors combine this approach with clustering, in order to improve the quality of retrieved inputs. Specifically, they consider the spherical kmeans algorithm, which is a variant of kmeans in which data points are clustered based on cosine similarity instead of the euclidean similarity (in short, data points are first scaled to be of unit norm, then in the training inner loop points are assigned to the cluster centroid with highest dot product and cluster centroids are updated as usual, except that they are always rescaled to unit norm). Moreover, they consider a bottomup application of the algorithm to yield a hierarchical clustering tree. They propose to use such a hierarchical clustering tree to find the topn candidates for MIPS. The key insight here is that, since spherical kmeans relies on cosine similarity for finding the best cluster, and since we have a transformation that allows the maximisation of inner product to be approximated by the maximisation of cosine similarity, then a tree to find MIPS candidates could be constructed by running spherical kmeans on the inputs transformed by the same transformation used for hashingbased MIPS. In order to make the search more robust to border issues when a query is close to the frontier between clusters, at each level of the tree they consider more than one candidate cluster during topdown search, so as to merge the candidates in several leaves of the tree at the very end of a full top down query. Their experiments using search with word embeddings show that the quality of the top 1, 10 and 100 MIPS candidates using their spherical kmeans approach is better than using two hashingbased search methods.
Your comment:
