Deep Image Retrieval: Learning global representations for image search
Gordo, Albert and Almazán, Jon and Revaud, Jérome and Larlus, Diane, 2016
Paper summary by shanexia
**Contributions:**
- A triplet ranking loss, implemented in a three-stream Siamese network.
- A region proposal network integrated into the system. All operations are differentiable, making the system end-to-end trainable.
- A dataset cleaning method, which is critical to the performance gain.
- Performance surpasses previous global descriptors and most local-descriptor-based methods on landmark retrieval benchmarks.
**Training:**
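- Sample triplets (query $I_q$, relevant image $I^+$, non-relevant image $I^-$) and minimize a triplet hinge loss on their L2-normalized descriptors $q$, $d^+$, $d^-$ with margin $m$:
$L(I_q, I^+, I^-)=\max(0, m + q^T d^- - q^T d^+)$
- Since the CNN uses only convolutional layers and the aggregation does not require a fixed input size, images can be processed at full resolution.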
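The hinge loss above can be illustrated with a minimal NumPy sketch. The function names, the toy 2-D descriptors, and the margin value of 0.1 are assumptions chosen for illustration, not values from the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    return x / max(np.linalg.norm(x), eps)

def triplet_hinge_loss(q, d_pos, d_neg, margin=0.1):
    """max(0, m + q.d_neg - q.d_pos) for L2-normalized descriptors.

    The margin default is an illustrative assumption.
    """
    return max(0.0, margin + float(q @ d_neg) - float(q @ d_pos))

# Toy descriptors (hypothetical): d_pos is close to q, d_neg is orthogonal to it.
q = l2_normalize(np.array([1.0, 0.0]))
d_pos = l2_normalize(np.array([1.0, 0.1]))
d_neg = l2_normalize(np.array([0.0, 1.0]))

# The positive is already ranked ahead by more than the margin: no penalty.
loss_easy = triplet_hinge_loss(q, d_pos, d_neg)
# Swapping the roles violates the ranking, so the loss becomes positive.
loss_hard = triplet_hinge_loss(q, d_neg, d_pos)
```

Only triplets that violate the margin contribute a non-zero loss, so well-separated triplets produce no gradient.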
**Network data flow:**
- Use the convolutional layers of a pre-trained network to extract activation maps.
- Max-pool the activations over different regions, using a multi-scale rigid grid with overlapping cells. Note that ROI pooling is differentiable.
- L2-normalize the region features, whiten them with PCA, and L2-normalize again. The PCA projection can be implemented as a shift (mean subtraction) followed by a fully connected layer.
- Aggregate: sum the region features and L2-normalize the result.
- The dot-product similarity between two image vectors then approximates a many-to-many matching between their regions.
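The data flow above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: `grid_regions` uses a hypothetical overlapping grid layout rather than the exact grid from the paper, random arrays stand in for CNN activations, and `mean` and `proj` are placeholders for a learned PCA shift and projection.

```python
import numpy as np

def l2n(x, eps=1e-12):
    """L2-normalize along the last axis."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def grid_regions(h, w, scales=(1, 2, 3)):
    """Hypothetical multi-scale grid: s x s overlapping cells at scale s."""
    regions = []
    for s in scales:
        ch = int(np.ceil(2 * h / (s + 1)))  # cell height
        cw = int(np.ceil(2 * w / (s + 1)))  # cell width
        ys = np.linspace(0, h - ch, s).round().astype(int)
        xs = np.linspace(0, w - cw, s).round().astype(int)
        regions += [(y, x, ch, cw) for y in ys for x in xs]
    return regions

def global_descriptor(fmap, mean, proj, scales=(1, 2, 3)):
    """fmap: (C, H, W) activations; mean: (C,); proj: (C, D)."""
    _, h, w = fmap.shape
    feats = np.stack([
        fmap[:, y:y + ch, x:x + cw].max(axis=(1, 2))  # max-pool one region
        for y, x, ch, cw in grid_regions(h, w, scales)
    ])
    feats = l2n(feats)                  # L2-normalize each region vector
    feats = l2n((feats - mean) @ proj)  # shift + linear layer, then re-normalize
    return l2n(feats.sum(axis=0)), feats  # sum-aggregate and normalize

rng = np.random.default_rng(0)
C = 8
mean, proj = np.zeros(C), np.eye(C)  # placeholders for learned PCA parameters
fa = rng.random((C, 12, 16))         # stand-ins for conv activations
fb = rng.random((C, 10, 20))         # a different spatial size is fine
da, ra = global_descriptor(fa, mean, proj)
db, rb = global_descriptor(fb, mean, proj)

# Before the final normalization, the dot product of the summed vectors
# equals the sum of all pairwise region similarities (many-to-many matching).
pairwise_sum = (ra @ rb.T).sum()
summed_dot = ra.sum(axis=0) @ rb.sum(axis=0)
```

The two inputs have different spatial sizes yet yield descriptors of the same dimension, which is what allows full-resolution images to be used.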
**Region Proposal Network**
- The objective is a multi-task loss that combines a classification loss with a box-regression loss.
- At inference time, apply non-maximum suppression and keep the top K proposals for each image.
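The proposal filtering step can be illustrated with a generic greedy non-maximum suppression followed by a top-K cut. This is a minimal sketch, not the paper's implementation: the function names, the IoU threshold, the value of K, and the toy boxes are all assumptions.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms_top_k(boxes, scores, iou_thresh=0.5, k=2):
    """Greedy NMS: keep the best box, drop overlapping ones, stop after k."""
    order = np.argsort(-scores)  # indices sorted by descending score
    keep = []
    while order.size > 0 and len(keep) < k:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Discard remaining boxes that overlap the kept box too much.
        order = rest[iou(boxes[i], boxes[rest]) <= iou_thresh]
    return keep

# Toy proposals (hypothetical): box 1 heavily overlaps box 0.
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11],
                  [20, 20, 30, 30], [0, 0, 5, 5]], dtype=float)
scores = np.array([0.9, 0.8, 0.7, 0.6])
kept_top2 = nms_top_k(boxes, scores, iou_thresh=0.5, k=2)
kept_all = nms_top_k(boxes, scores, iou_thresh=0.5, k=10)
```

Here box 1 is suppressed because its IoU with the higher-scoring box 0 is about 0.68, above the assumed threshold of 0.5.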
**Landmark Dataset Cleaning**
- Construct an image graph whose edges are weighted by pairwise similarity scores. The scores are computed offline, using invariant keypoint matching and spatial verification.
- Extract the connected components of the graph. They correspond to different profiles of a landmark.
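The graph step can be sketched with a plain depth-first search over thresholded edges. The pairwise scores are assumed to be precomputed (for example, a count of spatially verified matches); the function name, the `min_score` threshold, and the toy edge list are hypothetical.

```python
from collections import defaultdict

def connected_components(n_images, scored_edges, min_score):
    """Group image indices linked by edges scoring at least min_score."""
    adj = defaultdict(set)
    for i, j, score in scored_edges:
        if score >= min_score:  # ignore weak, unreliable matches
            adj[i].add(j)
            adj[j].add(i)

    seen, components = set(), []
    for start in range(n_images):
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:  # iterative depth-first traversal
            u = stack.pop()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        components.append(sorted(comp))
    return components

# Toy graph (hypothetical scores): the weak edge (2, 3) is discarded.
edges = [(0, 1, 40), (1, 2, 35), (3, 4, 50), (2, 3, 3)]
groups = connected_components(5, edges, min_score=10)
```

With the weak edge removed, the five images split into two groups, which would correspond to two distinct views of the same landmark.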