Temporal Action Detection with Structured Segment Networks on ShortScience.org

Temporal Action Detection with Structured Segment Networks
Yue Zhao and Yuanjun Xiong and Limin Wang and Zhirong Wu and Xiaoou Tang and Dahua Lin
arXiv e-Print archive - 2017 via Local arXiv
Keywords: cs.CV

Summary by shiyu 5 years ago

## Structured Segment Network

### **Keywords**: action detection in video; reducing computational complexity; structured proposals

**Abstract**: The paper uses a temporal actionness grouping (TAG) scheme to generate accurate proposals, a structured temporal pyramid to model the temporal structure of each action instance (addressing the problem that detected actions are often incomplete), two classifiers to determine the class and the completeness of each proposal, and a per-category regressor to further refine the temporal boundaries. Yue Zhao et al. mainly tackle two problems in video action detection: high computational complexity, which they reduce by sampling video frames and removing redundant proposals, and the lack of action stage modeling.

**Model**: 
1. Generate proposals: find contiguous temporal regions with mostly high actionness, giving $P = \{ p_i = [s_i,e_i]\}_{i = 1}^N$.
2. Split each proposal into three stages (start, course, and end): first extend the proposal to twice its length, symmetrically about its center. The course stage is the original proposal; the start and end stages are the left and right parts of the extended proposal that lie outside the original.
3. Build a temporal pyramid representation for each stage: sample $L$ snippets from the augmented proposal, apply the two-stream feature extractor to each of them, and pool the resulting features within each stage.
4. Build a global representation for each proposal by concatenating the stage-level representations.
5. Feed the global representation of each proposal to the classifiers.
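
Step 1 can be illustrated with a deliberately simplified sketch. It assumes per-snippet actionness scores are already available and groups snippets by plain thresholding, bridging short dips; the function name `group_actionness` and the parameters `threshold` and `max_gap` are hypothetical, and the paper's TAG scheme is more elaborate than this.

```python
def group_actionness(actionness, threshold, max_gap=0):
    """Group snippets with high actionness into proposals.

    actionness: list of per-snippet scores in [0, 1].
    threshold:  a snippet counts as "high" if its score >= threshold.
    max_gap:    number of consecutive low snippets tolerated inside a
                proposal (hypothetical stand-in for "mostly high").
    Returns a list of (start, end) snippet indices, both inclusive.
    """
    proposals = []
    start = None      # first high snippet of the current region
    last_high = None  # most recent high snippet seen so far
    for t, score in enumerate(actionness):
        if score < threshold:
            continue
        if start is None:
            start = t
        elif t - last_high - 1 > max_gap:
            # The dip was too long: close the region and open a new one.
            proposals.append((start, last_high))
            start = t
        last_high = t
    if start is not None:
        proposals.append((start, last_high))
    return proposals
```

Raising `max_gap` merges neighboring regions into longer proposals, so running the function with several settings and pooling the results is one way to obtain proposals at multiple granularities.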

* Input: $\{S_t\}_{t = 1}^{T}$, a sequence of $T$ snippets representing the video. Each snippet consists of its frames plus an optical flow stack.
* Network: two linear classifiers, $L$ two-stream feature extractors, and several pooling layers.
* Output: the category, the completeness, and the boundary refinement for each proposal.
https://i.imgur.com/thM9oWz.png
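
Steps 2 to 4 of the model can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes per-snippet features have already been extracted (here plain Python lists standing in for two-stream features), samples evenly spaced positions, and uses simple average pooling per stage instead of a full temporal pyramid. All function names are hypothetical.

```python
def augment_proposal(start, end):
    """Extend [start, end] by half its duration on each side (2x length)."""
    half = (end - start) / 2.0
    return start - half, end + half


def mean_pool(vectors):
    """Average a non-empty list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def structured_representation(snippet_feats, start, end, num_samples=9):
    """Concatenate pooled start / course / end features for one proposal.

    snippet_feats: list of per-snippet feature vectors (assumed given).
    start, end:    proposal boundaries in snippet indices.
    num_samples:   number of positions sampled from the augmented proposal.
    """
    if num_samples < 3:
        raise ValueError("need at least 3 samples to cover three stages")
    aug_start, aug_end = augment_proposal(start, end)
    step = (aug_end - aug_start) / (num_samples - 1)
    last = len(snippet_feats) - 1
    stages = {"start": [], "course": [], "end": []}
    for k in range(num_samples):
        pos = aug_start + k * step
        # Clip to the video so augmented spans may run past its borders.
        idx = min(max(int(round(pos)), 0), last)
        if pos < start:
            name = "start"
        elif pos <= end:
            name = "course"
        else:
            name = "end"
        stages[name].append(snippet_feats[idx])
    if not all(stages.values()):
        raise ValueError("a stage received no samples; increase num_samples")
    representation = []
    for name in ("start", "course", "end"):
        representation.extend(mean_pool(stages[name]))
    return representation
```

With $D$-dimensional snippet features, the result has length $3D$; this is the global representation that would be passed to the two classifiers.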

**Training**: 
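* Joint loss for the classifiers: $L_{cls} = -\log\big(P(c_i \mid p_i) \cdot P(b_i \mid c_i, p_i)\big)$, where $c_i$ is the class label and $b_i$ the completeness label of proposal $p_i$.
* Loss for location regression: $\lambda \cdot \mathbf{1}(c_i \geq 1, b_i = 1) \cdot L(u_i,\varphi_i;p_i)$, which is applied only to proposals that belong to an action class ($c_i \geq 1$) and are complete ($b_i = 1$).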
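
The two losses can be sketched for a single proposal as below. Several details are assumptions rather than statements from the summary: class index 0 is treated as background, the completeness term is skipped for background proposals (mirroring the $c_i \geq 1$ indicator in the regression term), and smooth L1 is used as the regression function $L$. The function names are hypothetical.

```python
import math


def classification_loss(p_class, p_complete, label, is_complete):
    """Joint loss -log(P(c|p) * P(b|c,p)) for one proposal.

    p_class:     list of class probabilities; index 0 = background (assumed).
    p_complete:  predicted probability that the proposal is complete
                 for class `label`.
    label:       ground-truth class index c_i.
    is_complete: ground-truth completeness b_i as a bool.
    """
    loss = -math.log(p_class[label])
    if label >= 1:  # assumption: completeness only matters for action classes
        p_b = p_complete if is_complete else 1.0 - p_complete
        loss -= math.log(p_b)
    return loss


def smooth_l1(x):
    """Smooth L1 penalty (assumed choice for the regression function L)."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def regression_loss(pred, target, label, is_complete, lam):
    """lam * 1(c_i >= 1, b_i = 1) * L(pred, target) for one proposal.

    pred, target: pairs of boundary-refinement values (u_i, phi_i).
    lam:          weight of the regression term (lambda in the formula).
    """
    if label < 1 or not is_complete:
        return 0.0  # the indicator is zero: no regression signal
    return lam * sum(smooth_l1(p - t) for p, t in zip(pred, target))
```

Summing `classification_loss` and `regression_loss` over all proposals in a batch would give the total training objective under these assumptions.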

**Summary**:
This paper has three highlights:
1. Parallelism: the network structure lets proposals be processed in parallel, which shortens processing time on a GPU.
2. Temporal structure modeling and regression: each proposal is given an explicit structure so that its completeness can be assessed.
3. Reduced computational complexity, via two tricks: removing video redundancy by sampling frames, and removing redundant proposals.
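
The summary does not say how redundant proposals are removed. A common choice in temporal detection is non-maximum suppression (NMS) over temporal intervals, so the sketch below assumes that technique; `temporal_iou`, `temporal_nms`, and the `iou_threshold` parameter are illustrative names, not taken from the paper.

```python
def temporal_iou(a, b):
    """Intersection over union of two intervals given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_nms(proposals, scores, iou_threshold):
    """Greedy NMS: keep the best-scoring proposal, drop heavy overlaps.

    proposals:     list of (start, end) intervals.
    scores:        one confidence score per proposal.
    iou_threshold: proposals overlapping a kept one by more than this
                   are treated as redundant.
    Returns the indices of the kept proposals, best score first.
    """
    order = sorted(range(len(proposals)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(temporal_iou(proposals[i], proposals[j]) <= iou_threshold
               for j in keep):
            keep.append(i)
    return keep
```

A high threshold removes only near-duplicates, while a low one prunes aggressively; which setting is appropriate depends on how dense the proposal set is.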
