Tuesday, May 24, 2016

[AMMAI] [Lecture 13] - "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups."

Paper Information:
  Hinton, Geoffrey, et al. "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups." Signal Processing Magazine, IEEE 29.6 (2012): 82-97.

Motivation:
   Gaussian mixture models (GMMs) are used to determine how well each state of each HMM fits a frame, or a short window of frames, of coefficients that represents the acoustic input. However, GMMs have a serious drawback: they are statistically inefficient at modeling data that lie on or near a nonlinear manifold in the data space. Deep neural network methods have been shown to outperform GMMs on a variety of speech recognition benchmarks.

Contributions:
  The paper demonstrates the progress of DNN methods for acoustic modeling, presenting the shared views of four research groups.

Technical summarization:
  Restricted Boltzmann machine (RBM):
    An RBM consists of a layer of stochastic binary "visible" units, representing binary input data, connected to a layer of stochastic binary hidden units that learn to model significant dependencies between the visible units. It is a type of MRF whose graph is bipartite: there are no visible-visible or hidden-hidden connections, which makes an efficient approximate learning algorithm (contrastive divergence) possible.
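    As a concrete illustration, here is a minimal sketch of one contrastive-divergence (CD-1) update for a binary RBM, assuming numpy and illustrative names (W for the weight matrix, vb/hb for visible/hidden biases); this is the standard recipe, not code from the paper:

  import numpy as np

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  def cd1_update(v0, W, vb, hb, lr=0.1, rng=np.random.default_rng(0)):
      # Positive phase: sample hidden units given the visible data.
      ph0 = sigmoid(v0 @ W + hb)                    # p(h=1 | v0)
      h0 = (rng.random(ph0.shape) < ph0) * 1.0
      # Negative phase: one Gibbs step back to a reconstruction.
      pv1 = sigmoid(h0 @ W.T + vb)                  # p(v=1 | h0)
      v1 = (rng.random(pv1.shape) < pv1) * 1.0
      ph1 = sigmoid(v1 @ W + hb)                    # p(h=1 | v1)
      # Approximate gradient: <v h>_data - <v h>_reconstruction.
      W += lr * (v0.T @ ph0 - v1.T @ ph1) / v0.shape[0]
      vb += lr * (v0 - v1).mean(axis=0)
      hb += lr * (ph0 - ph1).mean(axis=0)
      return W, vb, hb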
  
  Stacking RBMs to make a deep belief network:
    For real-valued data, the Gaussian-Bernoulli RBM (GRBM) is adopted. By stacking RBMs, the model can represent progressively more complex statistical structure in the data. After a DBN has been learned by training a stack of RBMs, its weights can be used to initialize all the feature-detecting layers of a deterministic feedforward DNN; a final softmax layer is then added and the whole DNN is trained discriminatively.
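    A sketch of the greedy stacking recipe under the same assumptions (train_rbm is a hypothetical helper wrapping repeated CD updates such as the one above):

  import numpy as np

  def pretrain_dbn(data, layer_sizes, train_rbm):
      # Greedy layer-wise pretraining: train one RBM per hidden layer,
      # then feed its mean activations upward as data for the next RBM.
      sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
      weights, x = [], data
      for n_hidden in layer_sizes:
          W, vb, hb = train_rbm(x, n_hidden)  # hypothetical helper
          weights.append((W, hb))
          x = sigmoid(x @ W + hb)
      return weights

  # The (W, hb) pairs initialize the feature-detecting layers of a
  # feedforward DNN; a randomly initialized softmax layer is added on
  # top and the whole network is fine-tuned with backpropagation.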

My comment:
  Phonetic classification and recognition on TIMIT:
    TIMIT is a standard benchmark dataset for speech recognition. It is always helpful to find a benchmark dataset related to our research, because many existing techniques have already been tested on it, which greatly reduces the time needed to reproduce others' work. For each type of DBN-DNN, the architecture that performed best on the development set is reported.
    

  This paper discusses pre-training at length, including the much faster approximate learning method, contrastive divergence (CD). Recent CNN methods similarly rely on pre-training on ImageNet; it indeed helps related work by saving tedious training time and reducing overfitting. However, parallelizing the fine-tuning of DNNs is still a major issue. Combining it with the ideas of "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding" may help to reduce fine-tuning time.

Wednesday, May 18, 2016

[AMMAI] [Lecture 12] - "Text Understanding from Scratch"

Paper Information:
  Zhang, Xiang, and Yann LeCun. "Text Understanding from Scratch." arXiv preprint arXiv:1502.01710 (2015).

Motivation:
  ConvNets are quite successful in the image domain; therefore, the authors apply them to the text domain, hoping that a character-level ConvNet can learn relationships within text.

Contributions:
  Demonstrating that a deep learning system can understand text without knowledge of words or other embedded linguistic structure.

Technical summarization:
  ConvNet Model Design:
    
The 1-D convolution between a discrete kernel function $f(x)$ of size $k$ and an input function $g(x)$ of length $l$, with stride $d$, is defined as

  $h(y) = \sum_{x=1}^{k} f(x) \cdot g(y \cdot d - x + c)$

and the 1-D max-pooling function is defined analogously as

  $h(y) = \max_{x=1}^{k} g(y \cdot d - x + c)$

where $c = k - d + 1$ is an offset constant.

Finally, they use the rectifier (ReLU), $h(x) = \max\{0, x\}$, as the thresholding function.
For the network architecture, they design ConvNets with 6 convolutional layers followed by 3 fully connected layers.
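A direct numpy transcription of the two definitions above (a sketch under the stated indexing conventions, not the authors' code):

  import numpy as np

  def conv1d(f, g, d):
      # h(y) = sum_{x=1..k} f(x) * g(y*d - x + c), with c = k - d + 1.
      k, l = len(f), len(g)
      c = k - d + 1
      out = np.empty((l - k) // d + 1)
      for y in range(1, len(out) + 1):       # 1-indexed as in the paper
          out[y - 1] = sum(f[x - 1] * g[y * d - x + c - 1]
                           for x in range(1, k + 1))
      return out

  def maxpool1d(g, k, d):
      # h(y) = max_{x=1..k} g(y*d - x + c), same indexing convention.
      c = k - d + 1
      return np.array([max(g[y * d - x + c - 1] for x in range(1, k + 1))
                       for y in range(1, (len(g) - k) // d + 2)])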

  Data Augmentation:
    Just as rotation, scaling, and flipping are used for data augmentation in image recognition, they adopt synonym replacement as their method for achieving invariance in text.
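    A toy sketch of synonym replacement (the paper samples replacements from an English thesaurus using geometric distributions; the tiny table here is purely illustrative):

  import random

  SYNONYMS = {"good": ["great", "fine"], "film": ["movie", "picture"]}

  def augment(text, p=0.5, rng=random.Random(0)):
      # Replace each word that has synonyms with probability p.
      return " ".join(rng.choice(SYNONYMS[w])
                      if w in SYNONYMS and rng.random() < p else w
                      for w in text.split())

  print(augment("a good film"))  # e.g., "a great movie"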

My comment:


  It's interesting that they compare their quantization to the Braille encoding used to assist blind reading. In this analogy, the ConvNet is like a blind person trying to learn directly from the binary encoding.

For the experiments, they evaluate on many datasets, such as DBpedia, Amazon reviews, and Yahoo! Answers. All the experiments show better results than bag-of-words or word2vec baselines.

  Besides, they also demonstrate the ability to deal with Chinese on the Sogou News corpus. This experiment shows the method's generality across languages; remarkably, the accuracy remains high even when crossing languages.

   

Wednesday, May 11, 2016

[AMMAI] [Lecture 11] - "DeepFace: Closing the Gap to Human-Level Performance in Face Verification"

Paper Information:
  Taigman, Yaniv, et al. "Deepface: Closing the gap to human-level performance in face verification." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.

Motivation:
  There is still a gap between machines and humans when dealing with face recognition in unconstrained images.

Contributions:
  A face representation system that generalizes well to other datasets and closes most of the remaining gap to human performance on the most popular benchmark.

Technical summarization:
  Face alignment:
    The deep learning architecture assumes that alignment has already been completed; therefore, face alignment is an important preprocessing step. In this paper, they use analytical 3D modeling of the face based on fiducial points.

    Their alignment method can be summarized as follows. First, 6 fiducial points are detected and used to crop the face. Second, 67 fiducial points are localized on the 2D-aligned crop and a Delaunay triangulation is applied to them. Third, they align a 3D face shape to the 2D image plane and use the 3D model to generate a frontalized crop.
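    The 3D frontalization itself is involved; as a rough illustration of only the initial 2D-alignment step, here is a least-squares similarity transform fitted between detected fiducial points and template points (a hedged numpy sketch, not the authors' pipeline):

  import numpy as np

  def similarity_transform(src, dst):
      # Fit x' = a*x - b*y + tx, y' = b*x + a*y + ty in least squares,
      # where src/dst are (N, 2) arrays of corresponding fiducial points.
      A, t = [], []
      for (x, y), (xp, yp) in zip(src, dst):
          A.append([x, -y, 1, 0]); t.append(xp)
          A.append([y,  x, 0, 1]); t.append(yp)
      a, b, tx, ty = np.linalg.lstsq(np.array(A), np.array(t), rcond=None)[0]
      return np.array([[a, -b, tx], [b, a, ty]])  # 2x3 warp matrix

  # The 2x3 matrix can then be used to warp the image (e.g., with
  # cv2.warpAffine) to produce the aligned crop before 3D modeling.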
  Representation:
   
    The first three layers are used to extract low-level features. Although max-pooling makes the network more robust to local translation, it also causes the network to lose information, so it is applied only to the first convolutional layer. The following three layers are locally connected; they need more training data, but they can capture the different local statistics of different face regions. The last layer is a softmax for classification, and the output of layer F7 is used as the raw face representation.
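    The key difference from a convolution is that a locally connected layer does not share its filters across positions, which is why its parameter count (and data requirement) grows; a minimal 1-D illustration (a sketch, not the paper's layer):

  import numpy as np

  def locally_connected_1d(x, W):
      # W has shape (out_len, k): a separate length-k filter for every
      # output position, unlike a convolution which reuses one filter.
      out_len, k = W.shape
      return np.array([x[i:i + k] @ W[i] for i in range(out_len)])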
  Verification metric:
    They tried several methods for verification, such as the weighted $\chi^2$ distance and a Siamese network.
The weighted $\chi^2$ similarity is computed as

  $\chi^2(f_1, f_2) = \sum_i w_i \, (f_1[i] - f_2[i])^2 / (f_1[i] + f_2[i])$

where $f_1, f_2$ are the face representations and the weights $w_i$ are learned by a linear SVM.

For the Siamese network, the induced metric is similar in spirit: a weighted absolute difference $d(f_1, f_2) = \sum_i \alpha_i \, |f_1[i] - f_2[i]|$, where the parameters $\alpha_i$ are learned by the network.
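Both metrics are cheap to evaluate once the parameters are learned; a small numpy sketch (w and alpha stand for the learned weights):

  import numpy as np

  def chi2_similarity(f1, f2, w, eps=1e-8):
      # Weighted chi-squared: sum_i w[i]*(f1[i]-f2[i])^2 / (f1[i]+f2[i]);
      # eps guards against zero denominators (features are non-negative).
      return np.sum(w * (f1 - f2) ** 2 / (f1 + f2 + eps))

  def siamese_distance(f1, f2, alpha):
      # Weighted L1 distance between the two face representations.
      return np.sum(alpha * np.abs(f1 - f2))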


My comment:
  From this paper, I discovered that quite a lot of facial datasets are available for training, and the authors leverage them in their experiments to validate the method. It achieves remarkable accuracy on the LFW dataset, close to human performance (97.35% vs. 97.53%), although other methods also reach acceptable accuracy. Verification accuracy on such benchmarks may be close to saturated; handling real scenarios with multiple people and varied lighting might be the next challenge. However, collecting such a dataset could be a problem, and so could privacy issues.
  

Wednesday, May 4, 2016

[AMMAI] [Lecture 10] - "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks"

Paper Information:
  Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.

Motivation:
  For state-of-the-art object detection networks, the region proposal computation is a bottleneck.

Contributions:
  A nearly cost-free region proposal method, the Region Proposal Network (RPN), which shares full-image convolutional features with the detection network.

Technical summarization:
  Region proposal networks:

    An RPN takes an image as input and outputs a set of rectangular object proposals, each with an objectness score. A small network slides over the shared convolutional feature map; each sliding window is mapped to a lower-dimensional vector, which is then fed into two sibling fully connected layers: a box-regression layer (reg) and a box-classification layer (cls).
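    A minimal sketch of anchor generation at a single sliding position, using the paper's default 3 scales x 3 aspect ratios = 9 anchors:

  import numpy as np

  def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
      # Returns 9 (x1, y1, x2, y2) boxes centered at (cx, cy); each box
      # keeps area ~ scale^2 while its width/height ratio follows r.
      boxes = []
      for s in scales:
          for r in ratios:
              w, h = s * np.sqrt(r), s / np.sqrt(r)
              boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
      return np.array(boxes)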

  Translation invariance:
    No matter how the position of an object changes within the image, the same function should be able to predict the proposal at the new location.

  A Loss Function for Learning Region Proposals:

    To train the RPN, they minimize the multi-task objective

      $L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$

    where $p_i$ is the predicted probability that anchor $i$ is an object, $p_i^*$ is the ground-truth label (1 for a positive anchor, 0 otherwise), $t_i$ is the vector of 4 parameterized coordinates of the predicted box, and $t_i^*$ is that of the ground-truth box associated with a positive anchor; the term $p_i^* L_{reg}$ activates the regression loss only for positive anchors.
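    A numpy sketch of this objective, with the smooth-L1 regression loss from Fast R-CNN (for simplicity the regression term is normalized here by the number of positive anchors, whereas the paper normalizes by the number of anchor locations; lambda = 10 is the paper's default):

  import numpy as np

  def smooth_l1(x):
      # Robust regression loss from Fast R-CNN, applied elementwise.
      ax = np.abs(x)
      return np.where(ax < 1, 0.5 * x ** 2, ax - 0.5)

  def rpn_loss(p, p_star, t, t_star, lam=10.0, eps=1e-8):
      # p: (N,) predicted objectness; p_star: (N,) labels in {0, 1};
      # t, t_star: (N, 4) parameterized box coordinates.
      l_cls = -(p_star * np.log(p + eps)
                + (1 - p_star) * np.log(1 - p + eps)).mean()
      n_pos = max(p_star.sum(), 1.0)
      l_reg = (p_star[:, None] * smooth_l1(t - t_star)).sum() / n_pos
      return l_cls + lam * l_reg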



  Optimization: 
    To prevent bias toward negative examples, they randomly sample 256 anchors in an image, with a ratio of positive to negative anchors of up to 1:1.

  Sharing Convolutional Features for Region Proposal and Object Detection:
    There are 4 steps for sharing convolutional features:
      1. Train the RPN, initialized with an ImageNet pre-trained model.
      2. Train a separate Fast R-CNN detection network using the proposals generated by the RPN.
      3. Use the detector network to initialize RPN training, but fix the shared convolutional layers and fine-tune only the layers unique to the RPN.
      4. Keeping the shared convolutional layers fixed, fine-tune the fully connected layers of Fast R-CNN.

My comment:

    The paper's timing comparison shows that the RPN is indeed much faster than Selective Search (SS). It also reveals that a proposal mechanism that is trained can outperform a pre-defined algorithm.


  The paper also reports the recall of RPN proposals at different IoU thresholds, which shows that the RPN stays quite stable even as the number of proposals varies.

  Besides, compared to Selective Search, the results demonstrate that the RPN benefits from network training. However, the RPN needs bounding-box annotations for training while SS does not, so SS remains a good method for object proposals, since bounding boxes for large-scale datasets are usually not easy to obtain.