Simonyan, Karen, and Andrew Zisserman. "Two-stream convolutional networks for action recognition in videos." Advances in Neural Information Processing Systems. 2014.
Motivation:
Recognizing human actions in videos is a challenging task: compared with still-image classification, the temporal component of videos provides an additional and vital cue for recognition.

Contributions:
A deep Convolutional Network architecture extended to video, with two streams that leverage both spatial (appearance) and temporal (motion) information.
Technical summarization:
Two-stream architecture:
As the architecture figure in the paper shows, the spatial stream captures information about scenes and objects, which is useful since some actions are strongly associated with particular objects. The temporal stream, whose input is a stack of optical-flow displacement fields computed between several consecutive frames, conveys the movement of the observer (the camera) and of the objects.
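A minimal PyTorch sketch of this idea, under my own assumptions: the backbone below is a small stand-in, not the paper's actual CNN-M-2048 configuration, and the class count and flow-stack length L=10 follow the UCF-101 setup described in the paper. The spatial stream sees a single RGB frame (3 channels), the temporal stream sees 2L = 20 stacked horizontal/vertical flow fields, and the class scores are fused by averaging (one of the fusion schemes the paper reports; they also try an SVM on the scores).

import torch
import torch.nn as nn

def stream(in_channels, num_classes):
    # Hypothetical small backbone standing in for CNN-M-2048.
    return nn.Sequential(
        nn.Conv2d(in_channels, 96, kernel_size=7, stride=2), nn.ReLU(),
        nn.MaxPool2d(3, stride=2),
        nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(),
        nn.MaxPool2d(3, stride=2),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(256, num_classes),
    )

class TwoStreamNet(nn.Module):
    def __init__(self, num_classes=101, flow_len=10):
        super().__init__()
        self.spatial = stream(3, num_classes)               # single RGB frame
        self.temporal = stream(2 * flow_len, num_classes)   # stacked x/y flow fields

    def forward(self, rgb, flow):
        # Late fusion: average the per-stream class probabilities.
        s = torch.softmax(self.spatial(rgb), dim=1)
        t = torch.softmax(self.temporal(flow), dim=1)
        return (s + t) / 2

Usage, with random tensors in place of real frames and precomputed optical flow:

net = TwoStreamNet()
rgb = torch.randn(1, 3, 224, 224)
flow = torch.randn(1, 20, 224, 224)
probs = net(rgb, flow)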

Multi-task learning:
To address the limited size of available video datasets, they adopt multi-task learning to exploit additional training data. Two softmax classification layers are placed on top of the last fully-connected layer, one for HMDB-51 and one for UCF-101; each softmax computes a loss only on the videos coming from its own dataset, and the overall training loss is the sum of the individual tasks' losses.
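A hedged sketch of that multi-task head, assuming a shared backbone that already produces a feature vector (the feature dimension and helper names below are my own, not the paper's): two linear-plus-softmax classifiers share all earlier layers, each batch element contributes only to the loss of the dataset it came from, and the two task losses are summed.

import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    def __init__(self, feat_dim=2048):
        super().__init__()
        self.head_ucf = nn.Linear(feat_dim, 101)    # UCF-101 classifier
        self.head_hmdb = nn.Linear(feat_dim, 51)    # HMDB-51 classifier

    def forward(self, feats):
        return self.head_ucf(feats), self.head_hmdb(feats)

def multitask_loss(logits_ucf, logits_hmdb, labels, is_ucf):
    # Each sample only feeds the loss of its own dataset;
    # the overall training loss is the sum of the two task losses.
    ce = nn.CrossEntropyLoss()
    loss = 0.0
    if is_ucf.any():
        loss = loss + ce(logits_ucf[is_ucf], labels[is_ucf])
    if (~is_ucf).any():
        loss = loss + ce(logits_hmdb[~is_ucf], labels[~is_ucf])
    return loss

For example, on a mixed batch where is_ucf marks which dataset each sample belongs to:

feats = torch.randn(4, 2048)
head = MultiTaskHead()
logits_ucf, logits_hmdb = head(feats)
labels = torch.tensor([3, 50, 7, 0])
is_ucf = torch.tensor([True, False, True, False])
loss = multitask_loss(logits_ucf, logits_hmdb, labels, is_ucf)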
My comment:
Besides comparing against state-of-the-art methods, they also run extensive experiments to find the best parameter configuration. For example, distinct training settings are validated for the spatial ConvNet, and various input configurations are tried for the temporal ConvNet. To demonstrate the effect of multi-task learning, they report accuracy under the different training regimes. These thorough experiments provide concrete support for their method.