Summary by shiyu 1 month ago
## Boundary sensitive network
### **keyword**: action detection in video; accurate proposal
**Summary**: To generate precise temporal boundaries and improve recall with fewer proposals, Tianwei Lin et al. propose the Boundary-Sensitive Network (BSN). It first combines temporal locations that have high boundary probability into proposals, and then selects proposals by evaluating whether each one contains an action (confidence score + boundary probability).
**Model**:
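1. video feature encoding: a two-stream extractor produces the input of BSN, $F = \{f_{t_n}\}_{n=1}^{l_s} = \{(f_{S,t_n}, f_{T,t_n})\}_{n=1}^{l_s}$, where $f_{S,t_n}$ and $f_{T,t_n}$ are the spatial-stream and temporal-stream features of the $n$-th snippet.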
2. BSN:
* temporal evaluation: takes the feature sequence as input and applies a 3-layer temporal CNN whose last layer has 3 filters with sigmoid activation, producing start, end, and actionness probabilities for each temporal location
* proposal generation: 1. pair candidate start and end locations to form proposals, where a location is a candidate if its start/end probability is high or is a local peak; 2. build a proposal feature for each proposal by sampling the actionness probability over the proposal region.
* proposal evaluation: a perceptron with 1 hidden layer estimates a confidence score from the proposal feature.
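Each proposal is $\varphi =(t_s,t_e,p_{conf},p_{t_s}^s,p_{t_e}^e)$, where $t_s$ and $t_e$ are the start and end times, $p_{t_s}^s$ is the start probability at $t_s$, $p_{t_e}^e$ is the end probability at $t_e$, and $p_{conf}$ is the confidence score.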
https://i.imgur.com/VjJLQDc.png
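
The proposal generation step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `threshold=0.9` default, and `num_samples=16` are assumptions chosen for the example, and the feature here samples only inside the proposal region, as the summary describes.

```python
import numpy as np

def candidate_boundaries(prob, threshold=0.9):
    """Return indices where the probability is high or is a strict local peak.

    `threshold` is an assumed value for illustration.
    """
    prob = np.asarray(prob, dtype=float)
    idx = []
    for t in range(len(prob)):
        is_high = prob[t] > threshold
        is_peak = (0 < t < len(prob) - 1
                   and prob[t] > prob[t - 1]
                   and prob[t] > prob[t + 1])
        if is_high or is_peak:
            idx.append(t)
    return idx

def generate_proposals(p_start, p_end, threshold=0.9):
    """Pair every candidate start with every later candidate end."""
    starts = candidate_boundaries(p_start, threshold)
    ends = candidate_boundaries(p_end, threshold)
    return [(s, e) for s in starts for e in ends if e > s]

def proposal_feature(actionness, t_s, t_e, num_samples=16):
    """Sample the actionness curve at evenly spaced points in [t_s, t_e]
    using linear interpolation (hypothetical helper)."""
    positions = np.linspace(t_s, t_e, num_samples)
    return np.interp(positions, np.arange(len(actionness)), actionness)

# Toy curves: one clear start at t=1, an end peak at t=3, a high end at t=5.
p_start = [0.1, 0.95, 0.2, 0.1, 0.1, 0.1]
p_end = [0.1, 0.1, 0.2, 0.6, 0.3, 0.95]
print(generate_proposals(p_start, p_end))  # [(1, 3), (1, 5)]
```

Pairing every start with every later end is quadratic in the number of candidates, which is why the proposal-related steps can dominate cost when many candidates survive.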
**Training**:
* **Learn to generate probability curve**:
To train the probability curves, the loss of the temporal evaluation module is calculated as follows:
$L_{TEM} = \lambda L^{action} + L ^{start} + L^{end}$;
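$L = -\frac{1}{l_w} \sum_{i=1}^{l_w}\left(\frac{l_w}{l^+}\, b_i \log(p_i)+\frac{l_w}{l^-}\, (1-b_i) \log(1-p_i)\right)$, with $l^+ = \sum_i b_i$ and $l^- = l_w - l^+$, so that the rarer class receives the larger weight
$b_i = \mathrm{sign}(g_i-\theta_{IoP}) \in \{0,1\}$, i.e. $b_i = 1$ when the matching score $g_i$ exceeds the threshold $\theta_{IoP}$ and $b_i = 0$ otherwise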
Thus, if a location in the start region overlaps strongly with the ground truth ($b_i = 1$), raising its start probability lowers the loss. After training, the ground-truth regions have been used to teach the network accurate start probabilities. The actionness and end probabilities are trained by the same rule.
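
As a rough sketch of the class-balanced logistic loss used for each of the three curves: the code below assumes the positive term is weighted by $l_w / l^+$ and the negative term by $l_w / l^-$ (the usual balancing convention, so the rarer class gets the larger weight), negates the sum so it can be minimized, and uses an assumed threshold `theta_iop=0.5`. The function name and the `eps` guard are illustrative choices.

```python
import numpy as np

def balanced_logistic_loss(p, g, theta_iop=0.5, eps=1e-8):
    """Weighted binary logistic loss over one window of l_w positions.

    p: predicted probabilities, g: matching scores against ground truth.
    theta_iop and eps are assumed values for this sketch.
    """
    p = np.asarray(p, dtype=float)
    g = np.asarray(g, dtype=float)
    b = (g > theta_iop).astype(float)      # binarised labels in {0, 1}
    l_w = b.size
    l_pos = b.sum()
    l_neg = l_w - l_pos
    alpha_pos = l_w / max(l_pos, 1.0)      # guard against empty classes
    alpha_neg = l_w / max(l_neg, 1.0)
    log_lik = (alpha_pos * b * np.log(p + eps)
               + alpha_neg * (1.0 - b) * np.log(1.0 - p + eps))
    return -log_lik.mean()

g = [0.9, 0.1, 0.1, 0.1]                   # one positive, three negatives
good = balanced_logistic_loss([0.9, 0.1, 0.1, 0.1], g)
bad = balanced_logistic_loss([0.1, 0.9, 0.9, 0.9], g)
print(good < bad)  # True
```

With one positive out of four positions, the single positive term carries as much total weight as the three negative terms together, which keeps sparse boundary locations from being ignored.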
* **Learn to choose right proposal**:
To choose the right proposals by confidence score, the score is pushed to match the IoU between the proposal and the ground truth. The loss for this is:
$L_p = \frac{1}{N_{train}} \sum_{i=1}^{N_{train}} (p_{conf,i}-g_{iou,i})^2$, where $N_{train}$ is the number of training proposals, sampled with a positive-to-negative ratio of 1:2, and $g_{iou,i}$ is the $i$-th proposal's IoU with its corresponding ground truth.
During testing and prediction, the final confidence score of each proposal is $p_f = p_{conf} \cdot p_{t_s}^s \cdot p_{t_e}^e$. Redundant proposals are then suppressed using Soft-NMS with a Gaussian decay function.
Thus, after training, the confidence score should reflect the IoU between a proposal and its corresponding ground truth, estimated from the proposal feature built from the actionness probability, while the final proposals are obtained by ranking the final confidence scores.
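
The test-time scoring and suppression can be sketched as below. This uses the standard Gaussian Soft-NMS decay $\exp(-\mathrm{IoU}^2/\sigma)$ applied to every remaining proposal; the paper's exact variant and hyperparameters may differ, and `sigma`, `top_k`, and all function names are assumptions made for illustration.

```python
import numpy as np

def final_scores(p_conf, p_start, p_end):
    """p_f = p_conf * p_start * p_end for each proposal."""
    return np.asarray(p_conf) * np.asarray(p_start) * np.asarray(p_end)

def temporal_iou(seg, segs):
    """IoU between one (start, end) segment and an array of segments."""
    inter = np.maximum(
        0.0,
        np.minimum(seg[1], segs[:, 1]) - np.maximum(seg[0], segs[:, 0]))
    union = (seg[1] - seg[0]) + (segs[:, 1] - segs[:, 0]) - inter
    return inter / np.maximum(union, 1e-8)

def soft_nms(segments, scores, sigma=0.5, top_k=100):
    """Gaussian Soft-NMS: repeatedly keep the best proposal and decay the
    scores of the remaining ones according to their overlap with it."""
    segments = np.asarray(segments, dtype=float)
    scores = np.asarray(scores, dtype=float).copy()
    remaining = list(range(len(scores)))
    kept = []
    while remaining and len(kept) < top_k:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        seg = (float(segments[best][0]), float(segments[best][1]))
        kept.append((seg, float(scores[best])))
        if remaining:
            rest = np.array(remaining)
            iou = temporal_iou(segments[best], segments[rest])
            scores[rest] *= np.exp(-(iou ** 2) / sigma)
    return kept

segments = [(0, 10), (1, 10), (20, 30)]
p_f = final_scores([0.9, 0.8, 0.7], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0])
for seg, score in soft_nms(segments, p_f):
    print(seg, round(score, 3))
```

In this toy input, the proposal `(1, 10)` overlaps heavily with the top-ranked `(0, 10)`, so its score is decayed and it falls below the non-overlapping `(20, 30)` rather than being discarded outright.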
**Conclusion**: Unlike methods that rely on segment proposals or use an RNN to decide where to look next, this paper generates proposals from boundary probabilities and selects them using a confidence score that estimates the IoU between the proposal and its corresponding ground truth. With sufficient data, it can provide accurate boundary probabilities and confidence scores. The highlight of the paper is that boundaries can be located very accurately within the feature sequence.
*However, it only samples part of the video to build the feature sequence, so the sampling may skip over a true boundary point. If an accurate policy for deciding where to sample were used, accuracy could be boosted further.*
* **computation complexity**: within this network, computation includes
1. two-stream feature extractor for video samples
2. probability generation module: 3-layer CNN over the generated sequence
3. proposal generation using combination
4. sampler to generate proposal feature
5. 1-hidden layer perceptron to generate confidence score.
The major computational cost should be attributed to the feature extractor (step 1) and, if many proposals are generated, to the proposal-related modules (steps 3 and 4).
**Performance**: when combined with the SCNN classifier, it reaches mAP@0.5 = 36.9 on the THUMOS14 dataset.
