First published: 2016/06/30
Abstract: The problem of arbitrary object tracking has traditionally been tackled by
learning a model of the object's appearance exclusively online, using as sole
training data the video itself. Despite the success of these methods, their
online-only approach inherently limits the richness of the model they can
learn. Recently, several attempts have been made to exploit the expressive
power of deep convolutional networks. However, when the object to track is not
known beforehand, it is necessary to perform Stochastic Gradient Descent online
to adapt the weights of the network, severely compromising the speed of the
system. In this paper we equip a basic tracking algorithm with a novel
fully-convolutional Siamese network trained end-to-end on the ILSVRC15 video
object detection dataset. Our tracker operates at frame-rates beyond real-time
and, despite its extreme simplicity, achieves state-of-the-art performance in
the VOT2015 benchmark.
This paper proposes computing a correlation score between a query (exemplar) image and every sub-window of a search image. The fully convolutional Siamese architecture they describe produces these scores for all sub-windows of the search image in a single forward pass of the network. For each video, the features of the tracked object are computed once and reused for the entire duration of the video when computing the correlation.
This is in the same spirit as the GOTURN tracker. Although being fully convolutional helps with translation invariance, that is not by itself an advantage over predicting bounding boxes directly, as the GOTURN paper does. The results are also not directly comparable, since this tracker was trained on a different dataset.
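As a rough illustration of the sub-window correlation idea, here is a minimal NumPy sketch. It assumes the exemplar and search feature maps have already been produced by some embedding network (not shown), and the function name `cross_correlation_map` and the array shapes are hypothetical choices for this example, not the paper's code. A real implementation would normally run this as a convolution on the GPU.

```python
import numpy as np

def cross_correlation_map(exemplar_feat, search_feat):
    """Correlate an exemplar feature map with every sub-window of a search map.

    exemplar_feat: array of shape (C, h, w), computed once per video.
    search_feat:   array of shape (C, H, W), computed once per frame.
    Returns a score map of shape (H - h + 1, W - w + 1).
    """
    c, h, w = exemplar_feat.shape
    if search_feat.shape[0] != c:
        raise ValueError("exemplar and search features need the same channel count")
    # View of all (h, w) sub-windows: shape (C, H - h + 1, W - w + 1, h, w).
    windows = np.lib.stride_tricks.sliding_window_view(
        search_feat, (h, w), axis=(1, 2)
    )
    # Inner product of each sub-window with the exemplar, summed over channels.
    return np.einsum("cijhw,chw->ij", windows, exemplar_feat)

# Toy usage: plant the exemplar inside an otherwise empty search map.
rng = np.random.default_rng(0)
exemplar = rng.uniform(0.5, 1.5, size=(4, 3, 3))
search = np.zeros((4, 10, 12))
search[:, 2:5, 6:9] = exemplar
scores = cross_correlation_map(exemplar, search)
peak = np.unravel_index(np.argmax(scores), scores.shape)  # location of best match
```

All sub-window scores come out of one dense operation, which is the property the fully convolutional design provides; the exemplar features can be cached and reused across frames.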
First published: 2016/04/06
Abstract: Machine learning techniques are often used in computer vision due to their
ability to leverage large amounts of training data to improve performance.
Unfortunately, most generic object trackers are still trained from scratch
online and do not benefit from the large number of videos that are readily
available for offline training. We propose a method for offline training of
neural networks that can track novel objects at test-time at 100 fps. Our
tracker is significantly faster than previous methods that use neural networks
for tracking, which are typically very slow to run and not practical for
real-time applications. Our tracker uses a simple feed-forward network with no
online training required. The tracker learns a generic relationship between
object motion and appearance and can be used to track novel objects that do not
appear in the training set. We test our network on a standard tracking
benchmark to demonstrate our tracker's state-of-the-art performance. Further,
our performance improves as we add more videos to our offline training set. To
the best of our knowledge, our tracker is the first neural-network tracker that
learns to track generic objects at 100 fps.
This paper introduces several tricks that make sense for visual tracking. They are as follows:
1. At test time, the network is given a crop centered at the center of the previous frame's bounding box and larger than that bounding box, along with the search area in the current frame.
2. The network is trained offline on a large set of videos (where object bounding boxes are given for a subset of frames) and on still images with object bounding boxes.
3. The network takes two images: i) a crop of the image/frame around the bounding box and ii) the image centered at the center of the bounding box. Given the latter, the network regresses the bounding box in i).
4. The above crops are sampled such that the ground-truth bounding box center in i) is not very far from the center in ii), so the network prefers smooth motion.
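The cropping and sampling steps in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names, the `context_factor` of 2, the Laplace `scale` of 0.2, and the `(x1, y1, x2, y2)` box convention are all choices made for this example. The zero-centered Laplace draw is one way to favor small shifts between the two crops.

```python
import numpy as np

def crop_region(box, context_factor=2.0):
    """Crop centered on `box` and `context_factor` times as wide and tall."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w = (x2 - x1) * context_factor
    h = (y2 - y1) * context_factor
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)

def to_crop_coords(box, crop):
    """Express `box` relative to `crop`, with the crop spanning [0, 1]."""
    cx1, cy1, cx2, cy2 = crop
    cw, ch = cx2 - cx1, cy2 - cy1
    x1, y1, x2, y2 = box
    return ((x1 - cx1) / cw, (y1 - cy1) / ch,
            (x2 - cx1) / cw, (y2 - cy1) / ch)

def sample_shifted_box(box, rng, scale=0.2):
    """Shift `box` by a small random offset, as a fraction of its size.

    Offsets come from a zero-centered Laplace distribution (an assumption
    of this sketch), so small motions are sampled far more often than
    large ones.
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    dx, dy = rng.laplace(0.0, scale, size=2)
    return (x1 + dx * w, y1 + dy * h, x2 + dx * w, y2 + dy * h)

# Toy usage: build one training target from a still image annotation.
rng = np.random.default_rng(0)
prev_box = (40.0, 40.0, 60.0, 80.0)           # annotated box
crop = crop_region(prev_box)                   # search crop around it
moved = sample_shifted_box(prev_box, rng)      # simulated "next frame" box
target = to_crop_coords(moved, crop)           # regression target in crop units
```

Because the shifted box is synthesized from a single annotation, the same recipe works for still images as well as for video frames.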
My take: This is a very nice way to use still images to train an image correlation task, which can then be used for tracking. The speed on a GPU is very impressive, but the speed on CPUs is still not comparable.