**Motivation** A possible approach to industrialising deep and shallow embeddings in various retrieval tasks is to decouple the representations of documents (video clips, or items in recommendations) and queries (users in recsys). Under this scheme, the degree of matching ("relevance") $r$ of the document with representation $x$ and the query with representation $q$ is calculated as a softmax function: $$ r := \frac{e^{q^Tx}}{\sum_i e^{q^Tx_i}} $$ At learning time, both representations are independent until the last softmax layer. This allows a convenient runtime scheme: the representations of the documents are precalculated and stored in some form of index, and the query-document relevance is calculated online, at query time. Since, for a fixed query, the relevance is monotone w.r.t. the dot product $q^Tx$, the problem of efficient ranking boils down to efficiently finding documents/items with high values of this dot product. Such an approach is used in the papers describing the YouTube [1] and Yahoo [2] systems.

This leads to the following **problem statement** (informal): given a dataset of items $\mathbb{X}$ and a sample of queries $\mathbb{Q}$, how can one compress the dataset $\mathbb{X}$ such that the compressed dataset allows calculating dot-product queries w.r.t. the uncompressed $q$ with as little distortion as possible? Squared Euclidean distance similarity queries are also considered in the paper. Formally, we are looking for a compressed representation $\hat x$ of each $x$ that minimises the error:

(in the case of the dot product) $$ L_{dot} = \sum_{x \in \mathbb{X}, q \in \mathbb{Q}} \left(q^Tx - q^T \hat x \right)^T\left(q^Tx - q^T \hat x \right) $$

(in the case of the squared Euclidean distance) $$ L_{Eucl} = \sum_{x \in \mathbb{X}, q \in \mathbb{Q}} \left((q - x)^2 - (q - \hat x)^2 \right)^2 $$

At the centre of the paper is the observation that these two *pairwise* loss functions can be reduced to per-point distortion losses in some modified space.
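The decoupled scheme above can be sketched in a few lines of NumPy; the shapes, names, and random data are illustrative assumptions, not part of the paper. The sketch also checks the monotonicity claim: ranking by the raw dot products $q^Tx_i$ gives the same ordering as ranking by the softmax relevance.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 100                    # embedding dim, number of documents (assumed)
X = rng.standard_normal((n, d))  # document representations, precomputed offline
q = rng.standard_normal(d)       # query representation, computed online

# Online scoring: dot product of the query with every indexed document.
scores = X @ q

# Softmax relevance over the collection (shifted by the max for stability).
r = np.exp(scores - scores.max())
r /= r.sum()

# Softmax is strictly increasing in q^T x, so both rankings coincide.
assert np.array_equal(np.argsort(-scores), np.argsort(-r))
```

In production one would of course not score the full collection; the point of the paper is to make the dot-product estimates cheap on a compressed index.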
In the case of $L_{dot}$: $$ L_{dot} = \sum_{x \in \mathbb{X}, q \in \mathbb{Q}} \left(q^Tx - q^T \hat x \right)^T\left(q^Tx - q^T \hat x \right) = \sum_{x \in \mathbb{X}}(x - \hat x)^T \left( \sum_{q \in \mathbb{Q}} q q^T \right)(x - \hat x) $$ Since $\left( \sum_{q \in \mathbb{Q}} q q^T \right)$ is a positive semidefinite matrix, we can factorise it as $C^TC$ and plug it back: $$ L_{dot} =\sum_{x \in \mathbb{X}}(Cx - C\hat x)^T(Cx - C\hat x)=\sum_{z \in \mathbb{Z}}(z - \hat z)^T(z - \hat z) $$ where $\mathbb{Z} = \{z = C x \mid x \in \mathbb{X}\}$ is the dataset obtained by applying the matrix $C$ to $\mathbb{X}$. That reduces the problem of minimising the distortion of dot-product estimates to the problem of minimising the distortion of individual points in $\mathbb{Z}$. Assuming that we have an efficient way to compress and store individual points while minimising the squared-distance distortion $(z - \hat z)^2$, we can turn $\mathbb{X}$ into $\mathbb{Z}$, then compress & index $\mathbb{Z}$. Luckily, we have Product Quantisation, Optimised Product Quantisation, Additive Quantisation, etc. to perform the required per-point compression $z \rightarrow \hat z$ that minimises the per-point loss.

How can one use the compressed $\mathbb{Z}$ at runtime? The query itself must be modified: $q \rightarrow r = (C^{-1})^Tq$. Then, by finding an estimate of $r^T z$, we will find our estimate of $q^Tx$: $$ r^T z = q^TC^{-1}z = q^TC^{-1} \cdot C x = q^T x $$ A similar reduction is performed for the pairwise squared Euclidean distance loss. Next, the authors demonstrate that the obtained estimates for the distance and dot product are unbiased when the underlying compression in $\mathbb{Z}$-space is performed by Optimised Product Quantisation (OPQ). The paper concludes with several experiments demonstrating that Pairwise Quantisation recovers the dot product and the squared Euclidean distance better than OPQ.

**Overall**, the approach seems practical and elegant, and it addresses an important problem.
One possible issue (the authors mention it) is that the optimisation is performed w.r.t. *all* pairs of sampled queries and data points. In practical applications, one only needs to accurately recover the dot product/distance to the closest vectors. For instance, the correct ranking of the documents at positions 100-200 does not matter, while positions 1-10 are extremely important.

[1] https://static.googleusercontent.com/media/research.google.com/ru//pubs/archive/45530.pdf

[2] http://www.kdd.org/kdd2016/papers/files/adf0361yinA.pdf