Wednesday, June 1, 2016

[AMMAI] [Lecture 14] - "Sequence to Sequence – Video to Text"

Paper Information:
  Venugopalan, Subhashini, et al. "Sequence to sequence-video to text." Proceedings of the IEEE International Conference on Computer Vision. 2015.

Motivation:
  Video captioning has important applications in human-robot interaction, video indexing, and describing movies for the blind.

Contributions:
   An end-to-end sequence-to-sequence model to generate captions for videos, S2VT, which learns to directly map a sequence of frames to a sequence of words.

Technical summarization:

  Model overview:
  
    The figure in the paper depicts their model. A stacked LSTM first encodes the frames one by one, taking as input the output of a CNN applied to each input frame's intensity values. In addition, to model the temporal aspects of activities typically shown in videos, they also compute the optical flow between pairs of consecutive frames.

  LSTM for sequence modeling:
 
    The model maximizes the log-likelihood of the predicted output sentence given the hidden representation of the visual frame sequence and the previous words it has seen.
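    As a rough sketch of this objective (my notation, not copied verbatim from the paper): with input frames (x_1, ..., x_n) and output words (y_1, ..., y_m), training looks for

        \theta^{*} = \arg\max_{\theta} \sum_{t=1}^{m} \log p\!\left(y_t \mid h_{n+t-1},\, y_{t-1};\, \theta\right)

    where h_{n+t-1} is the LSTM hidden state after reading all n frames and the first t-1 words.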

  
    The paper's figure shows the LSTM stack unrolled over time. The top LSTM layer models the visual frame sequence and the second models the output word sequence. A <BOS> token is used to prompt the second LSTM layer to start decoding. As a trade-off between memory consumption and the number of frames, they unroll the LSTM to a fixed 80 time steps: videos with fewer than 80 time steps are padded with zeros, and longer ones are truncated.
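    A minimal PyTorch-style sketch of the stacked two-layer LSTM idea, assuming precomputed CNN features; the sizes (feat_dim=4096, hidden=500) and all names are my own placeholders, not the paper's exact configuration:

    import torch
    import torch.nn as nn

    class S2VTSketch(nn.Module):
        """Two-layer (stacked) LSTM: the first layer reads frame features,
        the second reads the first layer's output concatenated with word embeddings."""
        def __init__(self, feat_dim=4096, hidden=500, vocab_size=10000):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, hidden)
            self.lstm1 = nn.LSTM(feat_dim, hidden, batch_first=True)    # visual stream
            self.lstm2 = nn.LSTM(2 * hidden, hidden, batch_first=True)  # language stream
            self.out = nn.Linear(hidden, vocab_size)

        def forward(self, frames, captions):
            # frames: (B, T_enc, feat_dim) CNN features, zero-padded/truncated to a fixed length
            # captions: (B, T_dec) word indices starting with <BOS>
            B, T_enc, _ = frames.shape
            # Encoding stage: LSTM1 reads frames; LSTM2 sees zeros in place of words.
            h1_enc, state1 = self.lstm1(frames)
            zero_words = torch.zeros(B, T_enc, self.embed.embedding_dim, device=frames.device)
            _, state2 = self.lstm2(torch.cat([h1_enc, zero_words], dim=-1))
            # Decoding stage: LSTM1 sees zero frames; LSTM2 sees the previous words.
            zero_frames = torch.zeros(B, captions.size(1), frames.size(-1), device=frames.device)
            h1_dec, _ = self.lstm1(zero_frames, state1)
            h2_dec, _ = self.lstm2(torch.cat([h1_dec, self.embed(captions)], dim=-1), state2)
            return self.out(h2_dec)   # per-step logits over the vocabulary

    Training then applies a cross-entropy loss between these logits and the next ground-truth word at each decoding step, which corresponds to the log-likelihood objective above.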

My comment:
  

  The tables report METEOR scores on different datasets. Because METEOR is computed from the alignment between a given hypothesis sentence and a set of candidate reference sentences, and the MPII-MD dataset has only one sentence per video, the scores on MPII-MD are all lower than those on MSVD.

  Learning from comments on social media may be a new direction for captioning. However, social-media comments are full of advertisements and the sentences are quite short. An attention-based model might be a solution for captioning trained on social media.


Tuesday, May 24, 2016

[AMMAI] [Lecture 13] - "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups."

Paper Information:
  Hinton, Geoffrey, et al. "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups." Signal Processing Magazine, IEEE 29.6 (2012): 82-97.

Motivation:
   Gaussian mixture models (GMMs) are used to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. However, GMMs have a serious drawback: they are statistically inefficient for modeling data that lie on or near a nonlinear manifold in the data space. Deep neural network methods have been shown to outperform GMMs on a variety of speech recognition benchmarks.

Contributions:
  The paper demonstrates the progress of DNN methods for acoustic modeling, as seen by four research groups.

Technical summarization:
  Restricted Boltzmann machine (RBM):
    An RBM is a generative model trained with an efficient approximate learning algorithm. It consists of a layer of stochastic binary "visible" units that represent binary input data, connected to a layer of stochastic binary hidden units that learn to model significant dependencies between the visible units. It is a type of MRF whose graph is bipartite: there are no visible-visible or hidden-hidden connections.
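    In standard notation (visible units v, hidden units h, biases a and b, weights W; the notation is mine, not copied verbatim from the paper), the energy and the resulting conditionals are

        E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i w_{ij} h_j

        p(h_j = 1 \mid v) = \sigma\Big(b_j + \sum_i v_i w_{ij}\Big), \qquad
        p(v_i = 1 \mid h) = \sigma\Big(a_i + \sum_j w_{ij} h_j\Big)

    The bipartite structure is what makes these conditionals factorize, so inference over the hidden units is cheap.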
  
  Stacking RBMs to make a deep belief network:
    For real-valued data, a Gaussian–Bernoulli RBM (GRBM) is adopted. By stacking RBMs, the model can represent progressively more complex statistical structure in the data. After a DBN has been learned by training a stack of RBMs, it can be used to initialize all the feature-detecting layers of a deterministic feedforward DNN; a final softmax layer is then added and the whole DNN is trained discriminatively.
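    Each RBM in the stack can be trained greedily with contrastive divergence. Below is a minimal numpy sketch of one CD-1 update for a binary RBM, just to make the "approximate learning" concrete; the learning rate, sizes, and iteration count are illustrative assumptions, not the paper's recipe.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd1_update(v0, W, a, b, lr=0.01):
        """One CD-1 step for a binary RBM: a, b are visible/hidden biases, W the weights."""
        ph0 = sigmoid(v0 @ W + b)                        # p(h = 1 | v0)
        h0 = (rng.random(ph0.shape) < ph0).astype(float) # sample hidden states
        pv1 = sigmoid(h0 @ W.T + a)                      # reconstruction p(v = 1 | h0)
        ph1 = sigmoid(pv1 @ W + b)                       # p(h = 1 | reconstruction)
        n = v0.shape[0]
        W += lr * (v0.T @ ph0 - pv1.T @ ph1) / n         # <v h>_data - <v h>_recon
        a += lr * (v0 - pv1).mean(axis=0)
        b += lr * (ph0 - ph1).mean(axis=0)
        return W, a, b

    # Toy usage: 100 binary samples, 6 visible and 4 hidden units.
    v = (rng.random((100, 6)) < 0.5).astype(float)
    W = 0.01 * rng.standard_normal((6, 4))
    a, b = np.zeros(6), np.zeros(4)
    for _ in range(50):
        W, a, b = cd1_update(v, W, a, b)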

My comment:
  Phonetic classification and recognition on TIMIT:
    TIMIT is the benchmark dataset for speech recognition. It is always helpful to find a benchmark dataset related to our research, because many existing techniques have already been tested on it, which greatly reduces the time spent reproducing others' work. For each type of DBN-DNN, the architecture that performed best on the development set is reported.
    

  This paper discusses pre-training at length, including the much faster approximate learning method, contrastive divergence (CD). Recent CNN methods are likewise built on pre-training on ImageNet. Pre-training indeed helps related work by saving time on tedious training and by reducing overfitting. However, parallelizing the fine-tuning of DNNs is still a major issue. Combining it with the ideas in "Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding" may help reduce fine-tuning time.

Wednesday, May 18, 2016

[AMMAI] [Lecture 12] - "Text Understanding from Scratch"

Paper Information:
  Zhang, Xiang, and Yann LeCun. "Text Understanding from Scratch." arXiv preprint arXiv:1502.01710 (2015).

Motivation:
  ConvNets are quite successful in the image domain; therefore, the authors apply them to the text domain, hoping that a character-level ConvNet can learn the relationships within text.

Contributions:
  Demonstrating the ability of a deep learning system to understand text from scratch, without embedded knowledge of words or language.

Technical summarization:
  ConvNet Model Design:
    
The model is built from temporal (1-D) convolution and max-pooling modules applied to character sequences (a sketch of the definitions is given below), and ReLU is used as the thresholding (non-linearity) function.
For the network architecture, they design ConvNets with 6 convolutional layers and 3 fully connected layers.
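The paper's equations are not reproduced here, but as I recall them they are the standard temporal (1-D) definitions: for an input function g(x) on [1, l], a kernel f(x) on [1, k], and stride d,

    h(y) = \sum_{x=1}^{k} f(x) \cdot g(y \cdot d - x + c), \qquad
    h_{\max}(y) = \max_{x=1}^{k} g(y \cdot d - x + c),

with an offset constant c = k - d + 1; treat this as my reconstruction rather than a quotation.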

  Data Augmentation:
    Just as rotating, scaling, and flipping achieve augmentation in image recognition, they choose synonym replacement as their method for obtaining invariance in text (see the sketch below).
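    A toy Python sketch of the idea; the thesaurus entries and the uniform replacement probability are made up for illustration (the paper draws its replacements from a real English thesaurus with geometric distributions):

    import random

    random.seed(0)

    # Hypothetical thesaurus entries, not taken from the paper.
    SYNONYMS = {"good": ["great", "fine"], "movie": ["film", "picture"]}

    def augment(text, p=0.5):
        """Randomly swap words that have synonyms (simplified sketch)."""
        out = []
        for w in text.split():
            if w in SYNONYMS and random.random() < p:
                out.append(random.choice(SYNONYMS[w]))
            else:
                out.append(w)
        return " ".join(out)

    print(augment("a good movie"))  # e.g. "a fine film"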

My comment:


  It's interesting that they compare their quantization with Braille, which is used to assist blind people in reading. In this analogy, the ConvNet is like a blind person trying to learn from the binary encoding.

For the experiments, they evaluate on many datasets such as DBpedia, Amazon reviews, and Yahoo! Answers. All the experiments show better results compared to bag-of-words or word2vec baselines.

  Besides, they also demonstrate the ability to handle Chinese on the Sogou News corpus. This experiment shows the method's generality across languages; it is quite impressive that the accuracy remains high even when the language changes.

   

Wednesday, May 11, 2016

[AMMAI] [Lecture 11] - "DeepFace: Closing the Gap to Human-Level Performance in Face Verification"

Paper Information:
  Taigman, Yaniv, et al. "Deepface: Closing the gap to human-level performance in face verification." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.

Motivation:
  There is still a gap between machines and humans in face recognition on unconstrained images.

Contributions:
  A system that generalizes well to other datasets and closes most of the remaining gap to human performance on the most popular benchmark.

Technical summarization:
  Face alignment:
    The deep learning architecture assumes that alignment has already been completed, so face alignment is an important preprocessing step. In this paper, they use explicit 3D modeling of the face based on fiducial points.

    A short summary of their face alignment method: first, they detect 6 fiducial points and use them to crop the image; second, 67 fiducial points are localized and a Delaunay triangulation is built on them; third, they align a 3D face shape to the 2D image plane, and based on the 3D model a frontalized crop is generated.
  Representation:
   
    The first three layers extract low-level features. Although max-pooling makes the network more robust, it also causes the network to lose information, so it is only applied after the first convolutional layer. The following three layers are locally connected layers: they need more training data, but they can exploit the different local statistics of different face regions. The last layer uses softmax for classification, and the output of F7 is used as the raw face representation.
  Verification metric:
    They tried several methods for verification, such as a weighted χ² distance and a Siamese network.
For the χ² distance, the similarity is a weighted χ² distance between the two feature vectors, where the weight vector w is learned by a linear SVM.

For the Siamese network, the distance has a similar form, with weights α learned by the network itself. Hedged versions of both distances are sketched below.
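As a reminder of the two forms (my reconstruction from memory of the paper, with f_1 and f_2 the two feature vectors):

    \chi^2(f_1, f_2) = \sum_i w_i \frac{\left(f_1[i] - f_2[i]\right)^2}{f_1[i] + f_2[i]}, \qquad
    d(f_1, f_2) = \sum_i \alpha_i \left| f_1[i] - f_2[i] \right|

where w is learned with a linear SVM and α is learned by backpropagation in the Siamese setting.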


My comment:
  From this paper, I discovered that there are quite a lot of facial datasets available for training. They leverage these datasets in their experiments to demonstrate their method. It achieves remarkable accuracy, close to human performance on the LFW dataset. However, other methods also achieve acceptable accuracy, so face recognition may be nearly saturated in terms of raw accuracy. Dealing with real-world scenarios, with multiple people and varied lighting, might be the next challenge; dataset collection could be a problem there, and so could privacy issues.
  

Wednesday, May 4, 2016

[AMMAI] [Lecture 10] - "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks"

Paper Information:
  Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.

Motivation:
  For state-of-the-art object detection networks, the region proposal computation is a bottleneck.

Contributions:
  A nearly cost-free region proposal method, the Region Proposal Network (RPN), which shares full-image convolutional features with the detection network.

Technical summarization:
  Region proposal networks:

    An RPN takes an image as input and outputs a set of rectangular object proposals, each with an objectness score. Each sliding window is mapped to a lower-dimensional vector, which is then fed into two sibling FC layers: a box-regression layer (reg) and a box-classification layer (cls).

  Translation invariance:
    No matter where the object is located in the image, the same function should be able to predict the proposal at that location.

  A Loss Function for Learning Region Proposals:

 
To train the RPN, they minimize a multi-task objective in which p_i is the predicted probability that anchor i is an object, p_i* is the ground-truth label, t_i is a vector of the predicted box coordinates, and t_i* is the ground-truth box.
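The equation itself is not reproduced here; as I recall, the multi-task objective has roughly the form

    L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^{*})
        + \lambda \frac{1}{N_{reg}} \sum_i p_i^{*} L_{reg}(t_i, t_i^{*})

where L_cls is a log loss over object vs. not-object, L_reg is a robust (smooth L1) regression loss that is only active for positive anchors (p_i* = 1), and λ balances the two terms.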



  Optimization: 
    To prevent a bias toward negative samples, they randomly sample 256 anchors in an image, with a ratio of positive to negative anchors of up to 1:1.

  Sharing Convolutional Features for Region Proposal and Object Detection:
    There are 4 steps for sharing convolutional features:
      1. Train the RPN.
      2. Train the Fast R-CNN detector using the proposals generated by the RPN.
      3. Use the detector network to initialize RPN training, but fix the shared conv layers and fine-tune only the layers unique to the RPN.
      4. Keeping the shared conv layers fixed, fine-tune the fc layers of the Fast R-CNN.

My comment:

    The timing results show that the RPN is much faster than Selective Search (SS). Moreover, they also reveal that the RPN can be trained to outperform a pre-defined proposal algorithm.


  The recall-vs-IoU curves show that the RPN's proposals remain quite stable even when the number of proposals varies.

  Besides, the comparison with Selective Search demonstrates that the RPN benefits from network training. However, the RPN needs bounding-box annotations for training while SS does not. SS is still quite a good method for object proposals, because bounding boxes for large-scale datasets are usually not easy to obtain.

Thursday, April 28, 2016

[AMMAI] [Lecture 09] - "Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding"

Paper Information:
  Han, Song, Huizi Mao, and William J. Dally. "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding." arXiv preprint arXiv:1510.00149 (2015).

Motivation:
    The demand for running neural networks on embedded systems is growing. However, limited hardware resources are an obstacle to such applications.

Contributions:
   They reduce the storage and energy required by large networks with pruning, trained quantization, and Huffman coding.

Technical summarization:
The three stages of the pipeline (pruning, trained quantization, and Huffman coding) are described in the following parts.

  Network pruning:
    First, the network learns the connectivity via normal training. Second, weights below a threshold are removed. Finally, the remaining sparse connections are retrained.
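    A minimal numpy sketch of the magnitude-based pruning step; the threshold value and matrix size are arbitrary placeholders:

    import numpy as np

    def prune(weights, threshold):
        """Zero out connections whose magnitude is below the threshold and return
        the mask; during retraining the mask keeps pruned weights at zero."""
        mask = (np.abs(weights) > threshold).astype(float)
        return weights * mask, mask

    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(4, 4))
    W_sparse, mask = prune(W, threshold=0.05)
    # In the retraining loop, gradient updates would be masked: W -= lr * grad * mask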

  Trained quantization and weight sharing
   They use k-means clustering to find the shared weights. Since centroid initialization affects the quality of clustering and the larger weights are vital, linear initialization (centroids evenly spaced over the weight range) is chosen, because it preserves the large weights better.
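   A minimal numpy sketch of k-means weight sharing with linear centroid initialization; the cluster count and iteration count are arbitrary, and the actual method additionally fine-tunes the centroids with gradients afterwards:

    import numpy as np

    def kmeans_share(weights, n_clusters=16, iters=20):
        """1-D k-means over the weights with linear centroid initialization
        (centroids evenly spaced between the min and max weight)."""
        w = weights.ravel()
        centroids = np.linspace(w.min(), w.max(), n_clusters)  # linear initialization
        for _ in range(iters):
            assign = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
            for k in range(n_clusters):
                members = w[assign == k]
                if members.size:
                    centroids[k] = members.mean()
        # Each weight is replaced by its cluster centroid; only the small codebook
        # (centroids) plus per-weight cluster indices need to be stored.
        return centroids[assign].reshape(weights.shape), centroids

    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(32, 32))
    W_shared, codebook = kmeans_share(W)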

  Huffman coding
   The main concept of Huffman coding is that more common symbols are represented with fewer bits.
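   A small Python sketch that computes Huffman code lengths from symbol frequencies, just to illustrate why frequent symbols end up with shorter codes; it is not the paper's implementation:

    import heapq
    from collections import Counter

    def huffman_code_lengths(symbols):
        """Return {symbol: code length}; more common symbols get fewer bits."""
        freq = Counter(symbols)
        # Heap of (frequency, tie-breaker, {symbol: current code length}).
        heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freq.items())]
        heapq.heapify(heap)
        counter = len(heap)
        while len(heap) > 1:
            f1, _, d1 = heapq.heappop(heap)
            f2, _, d2 = heapq.heappop(heap)
            merged = {s: length + 1 for s, length in {**d1, **d2}.items()}
            heapq.heappush(heap, (f1 + f2, counter, merged))
            counter += 1
        return heap[0][2]

    print(huffman_code_lengths("aaaabbc"))  # e.g. {'a': 1, 'b': 2, 'c': 2}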

My comment:
This paper provides a lot of visualization to crystallize the otherwise abstract weight distributions of a CNN. Two examples are discussed below.

Viewing the weight distribution as a histogram is quite a direct way to understand it. Furthermore, the skew in the distribution is shown clearly, which is concrete evidence for why Huffman coding helps.
The weight distribution of the conv3 layer forms a bimodal distribution.

The results also show that the overhead of the codebook is very small and often negligible. When I first saw the use of a codebook, I thought it would cost considerable space, but it turns out it does not. Therefore, decoding time might not be a problem either, because the codebook itself is quite small.

Wednesday, April 20, 2016

[AMMAI] [Lecture 08] - "Two-Stream Convolutional Networks for Action Recognition in Videos"

Paper Information:
  Simonyan, Karen, and Andrew Zisserman. "Two-stream convolutional networks for action recognition in videos." Advances in Neural Information Processing Systems. 2014.

Motivation:
    Recognition of human actions in videos is a challenging task. Compared to still-image classification, the temporal component of videos provides an additional and vital cue for recognition.

Contributions:
   A deep Convolutional Network architecture extended to leverage both spatial and temporal streams.

Technical summarization:
  Two-stream architecture:
The spatial stream captures information about scenes and objects, since some actions are strongly associated with particular objects. The temporal stream, whose input is a stack of optical-flow displacement fields between several consecutive frames, conveys the movement of the observer (the camera) and of the objects.


   Multi-task learning:
   To cope with the limited size of video datasets, they adopt multi-task learning to exploit additional training data. Two softmax classification layers are placed on top of the last FC layer, one for HMDB-51 and one for UCF-101, and the overall training loss is computed as the sum of the individual tasks' losses. A sketch of this setup is given below.
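   A minimal PyTorch-style sketch of the multi-task setup, assuming a shared backbone with two dataset-specific heads; the backbone and layer sizes here are placeholders, not the paper's ConvNet:

    import torch
    import torch.nn as nn

    # Shared backbone (placeholder) with two dataset-specific softmax heads.
    backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 256), nn.ReLU())
    head_ucf101 = nn.Linear(256, 101)
    head_hmdb51 = nn.Linear(256, 51)
    criterion = nn.CrossEntropyLoss()

    def multitask_loss(x_ucf, y_ucf, x_hmdb, y_hmdb):
        """Overall loss = sum of the two tasks' losses; each sample only
        contributes through the head of the dataset it came from."""
        loss_ucf = criterion(head_ucf101(backbone(x_ucf)), y_ucf)
        loss_hmdb = criterion(head_hmdb51(backbone(x_hmdb)), y_hmdb)
        return loss_ucf + loss_hmdb

    # Toy usage with random tensors standing in for frames / optical-flow stacks.
    x1, y1 = torch.randn(4, 3, 224, 224), torch.randint(0, 101, (4,))
    x2, y2 = torch.randn(4, 3, 224, 224), torch.randint(0, 51, (4,))
    loss = multitask_loss(x1, y1, x2, y2)
    loss.backward()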

My comment:
  Besides comparing with state-of-the-art methods, they also run many experiments to find the best parameter configuration. For example, distinct training settings are validated for the spatial ConvNet, and various input configurations are tried for the temporal ConvNet. To show the effect of multi-task learning, they report accuracy under different settings. Such thorough experiments provide concrete support for their method.