Wednesday, June 1, 2016

[AMMAI] [Lecture 14] - "Sequence to Sequence – Video to Text"

Paper Information:
  Venugopalan, Subhashini, et al. "Sequence to sequence-video to text." Proceedings of the IEEE International Conference on Computer Vision. 2015.

Motivation:
  Video captioning has important applications in human-robot interaction, video indexing, and describing movies for the blind.

Contributions:
   An end-to-end sequence-to-sequence model to generate captions for videos, S2VT, which learns to directly map a sequence of frames to a sequence of words.

Technical summarization:

  Model overview:
  
    The picture above depicts their model. A stacked LSTM first encodes the frames one by one, taking as input the output of a CNN applied to each input frame's intensity values. In addition, to model the temporal dynamics of the activities typically shown in videos, they also compute the optical flow between pairs of consecutive frames.
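To make the data flow concrete, here is a toy numpy sketch of the two-layer encode/decode scheme: layer 1 consumes per-frame CNN features, layer 2 consumes layer 1's hidden state concatenated with a word embedding (zeros during encoding). All sizes, weights, and the output projection are illustrative stand-ins, not the paper's trained components.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W):
    """One LSTM step: W maps [x; h] to the four gate pre-activations."""
    z = W @ np.concatenate([x, h])
    d = h.size
    i, f, o = sigmoid(z[:d]), sigmoid(z[d:2*d]), sigmoid(z[2*d:3*d])
    g = np.tanh(z[3*d:])
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
feat_dim, hid, emb = 500, 32, 16   # toy sizes (the paper embeds 4096-d CNN features to 500-d)
T_enc, T_dec = 8, 5                # number of encoded frames / decoded words

W1 = rng.normal(0, 0.1, (4 * hid, feat_dim + hid))    # layer 1: visual frames
W2 = rng.normal(0, 0.1, (4 * hid, hid + emb + hid))   # layer 2: [h1; word embedding]
Wout = rng.normal(0, 0.1, (emb, hid))                 # stand-in for softmax + embedding lookup
h1 = c1 = h2 = c2 = np.zeros(hid)

# Encoding stage: frames enter layer 1; layer 2 sees a zero-padded word slot.
frames = rng.normal(size=(T_enc, feat_dim))
zero_word = np.zeros(emb)
for t in range(T_enc):
    h1, c1 = lstm_step(frames[t], h1, c1, W1)
    h2, c2 = lstm_step(np.concatenate([h1, zero_word]), h2, c2, W2)

# Decoding stage: no more frames (zero-pad layer 1); layer 2 sees the previous word.
zero_frame = np.zeros(feat_dim)
word = rng.normal(size=emb)        # stands in for the <BOS> embedding
for t in range(T_dec):
    h1, c1 = lstm_step(zero_frame, h1, c1, W1)
    h2, c2 = lstm_step(np.concatenate([h1, word]), h2, c2, W2)
    word = np.tanh(Wout @ h2)      # next-word stand-in; real model samples from a softmax

print(h2.shape)
```

The key design point this sketch mirrors is weight sharing: the same two LSTMs are unrolled over both the encoding and decoding stages, with zero padding marking whichever input stream (frames or words) is inactive.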

  LSTM for sequence modeling:
 
    The model maximizes the log-likelihood of the predicted output sentence given the hidden representation of the visual frame sequence and the words it has already seen.
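In the paper's notation, with $n$ input frames, $m$ output words $y_t$, and model parameters $\theta$, the training objective is

$$\theta^{*} = \operatorname*{argmax}_{\theta} \sum_{t=1}^{m} \log p\left(y_t \mid h_{n+t-1},\, y_{t-1};\, \theta\right)$$

where $h_{n+t-1}$ is the hidden state after the $n$ encoding steps plus the first $t-1$ decoding steps, so each word is conditioned on both the visual encoding and the sentence prefix.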

  
    The picture above shows the LSTM stack unrolled over time. The top LSTM layer models the visual frame sequence, and the second layer models the output word sequence. A <BOS> token prompts the second LSTM layer to start decoding. To trade off memory consumption against the number of frames, they unroll the LSTM to a fixed 80 time steps: videos with fewer than 80 steps are padded with zeros, and longer ones are truncated.
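The pad-or-truncate step can be sketched in a few lines of numpy; the helper name and the (num_frames, feat_dim) layout are my own assumptions, with 80 taken from the fixed unroll length described above.

```python
import numpy as np

def fix_length(features, target_len=80):
    """Zero-pad or truncate a (num_frames, feat_dim) feature array
    to exactly target_len frames, matching the fixed LSTM unroll."""
    t, d = features.shape
    if t >= target_len:
        return features[:target_len]          # truncate long videos
    pad = np.zeros((target_len - t, d), dtype=features.dtype)
    return np.concatenate([features, pad])    # zero-pad short videos

short = fix_length(np.ones((30, 500)))   # padded up to 80 frames
long_ = fix_length(np.ones((120, 500)))  # truncated down to 80 frames
print(short.shape, long_.shape)
```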

My comment:
  

  The tables show the METEOR scores on the different datasets. Because METEOR is computed from the alignment between a hypothesis sentence and a set of candidate reference sentences, and the MPII-MD dataset has only one reference sentence per video, the scores on MPII-MD are all lower than those on MSVD.

  Learning from comments on social media may be a new direction for captioning. However, social-media comments are full of advertisements, and the sentences are quite short. An attention-based model might be a solution for generating captions from social media.

