Wednesday, June 1, 2016

[AMMAI] [Lecture 14] - "Sequence to Sequence – Video to Text"

Paper Information:
  Venugopalan, Subhashini, et al. "Sequence to sequence-video to text." Proceedings of the IEEE International Conference on Computer Vision. 2015.

Motivation:
  Video captioning has important applications in human-robot interaction, video indexing, and describing movies for the blind.

Contributions:
   An end-to-end sequence-to-sequence model to generate captions for videos, S2VT, which learns to directly map a sequence of frames to a sequence of words.

Technical summarization:

  Model overview:
  
    The picture above depicts their model. A stacked LSTM first encodes the frames one by one, taking as input the output of a CNN applied to each input frame's intensity values. Besides, to model the temporal aspects of activities typically shown in videos, they also compute the optical flow between pairs of consecutive frames.

  LSTM for sequence modeling:
 
    The model maximizes the log-likelihood of the predicted output sentence given the hidden representation of the visual frame sequence and the previous words it has seen.

  
    The picture above shows the LSTM stack unrolled over time. The top LSTM layer is used to model the visual frame sequence and the next is used to model the output word sequence. Moreover, <BOS> is used to prompt the second LSTM layer to start decoding. As a trade-off between memory consumption and the number of frames, they unroll the LSTM to a fixed 80 time steps: videos with fewer than 80 steps are padded with zeros and longer ones are truncated.
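
    A minimal PyTorch sketch of this stacked encode-then-decode flow is given below; the feature size, hidden size, vocabulary, and the way the LSTM states are carried between the two stages are my own assumptions for illustration, not the authors' released implementation.

    import torch
    import torch.nn as nn

    class S2VTSketch(nn.Module):
        """Toy sketch of the stacked-LSTM idea: the first LSTM models the frame
        sequence, the second models the word sequence on top of it."""
        def __init__(self, feat_dim=4096, hidden=500, vocab=10000):
            super().__init__()
            self.lstm1 = nn.LSTM(feat_dim, hidden, batch_first=True)
            self.lstm2 = nn.LSTM(hidden + hidden, hidden, batch_first=True)
            self.embed = nn.Embedding(vocab, hidden)
            self.out = nn.Linear(hidden, vocab)

        def forward(self, frames, captions):
            # frames:   (B, T_enc, feat_dim) CNN features, padded/truncated to a fixed length
            # captions: (B, T_dec) word indices starting with <BOS>
            B, T_enc, feat_dim = frames.shape
            # encoding: LSTM1 reads frames; LSTM2 reads LSTM1's outputs plus zero "words"
            h1, state1 = self.lstm1(frames)
            pad_words = frames.new_zeros(B, T_enc, self.embed.embedding_dim)
            _, state2 = self.lstm2(torch.cat([h1, pad_words], dim=2))
            # decoding: LSTM1 receives zero frame features; LSTM2 receives word embeddings
            T_dec = captions.size(1)
            pad_frames = frames.new_zeros(B, T_dec, feat_dim)
            h1_dec, _ = self.lstm1(pad_frames, state1)
            h2, _ = self.lstm2(torch.cat([h1_dec, self.embed(captions)], dim=2), state2)
            return self.out(h2)  # (B, T_dec, vocab) logits; trained with next-word cross-entropy

    model = S2VTSketch()
    logits = model(torch.randn(2, 80, 4096), torch.randint(0, 10000, (2, 15)))
    print(logits.shape)  # torch.Size([2, 15, 10000])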

My comment:
  

  The tables show the METEOR scores on different datasets. Because METEOR is computed from the alignment between a given hypothesis sentence and a set of candidate reference sentences, and the MPII-MD dataset has only one reference sentence per video, the scores on MPII-MD are all lower than those on MSVD.

  Learning from comments in social media may be a new direction for captioning. However, such comments are full of advertisements and the sentences are quite short. An attention-based model might be a solution for captioning from social media.


Tuesday, May 24, 2016

[AMMAI] [Lecture 13] - "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups."

Paper Information:
  Hinton, Geoffrey, et al. "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups." Signal Processing Magazine, IEEE 29.6 (2012): 82-97.

Motivation:
   Gaussian mixture models (GMMs) are used to determine how well each state of each HMM fits a frame, or a short window of frames, of coefficients that represent the acoustic input. However, GMMs have a serious drawback: they are statistically inefficient for modeling data that lie on or near a nonlinear manifold in the data space. Deep neural network methods have been shown to outperform GMMs on a variety of speech recognition benchmarks.

Contributions:
  The paper aims to demonstrate the progress of DNN-based acoustic modeling, summarizing the shared views of four research groups.

Technical summarization:
  Restricted Boltzmann machine (RBM):
    An RBM is trained with an approximate learning algorithm and consists of a layer of stochastic binary "visible" units, which represent binary input data, connected to a layer of stochastic binary hidden units that learn to model significant dependencies between the visible units. It is a type of MRF whose graph is bipartite: there are no visible-visible or hidden-hidden connections.
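
    As a concrete illustration of how an RBM is trained, here is a minimal NumPy sketch of one contrastive-divergence (CD-1) update for a binary-binary RBM; the learning rate, shapes, and sampling details are assumptions for illustration, not the exact recipe from the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd1_step(v0, W, b_vis, b_hid, lr=0.01):
        """One CD-1 update for a binary-binary RBM.
        v0: (batch, n_vis) binary visible data; W: (n_vis, n_hid) weights."""
        # positive phase: sample hidden units given the data
        p_h0 = sigmoid(v0 @ W + b_hid)
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
        # negative phase: one step of alternating Gibbs sampling (the "reconstruction")
        p_v1 = sigmoid(h0 @ W.T + b_vis)
        v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
        p_h1 = sigmoid(v1 @ W + b_hid)
        # approximate gradient of the log-likelihood: <v h>_data - <v h>_recon
        batch = v0.shape[0]
        W += lr * (v0.T @ p_h0 - v1.T @ p_h1) / batch
        b_vis += lr * (v0 - v1).mean(axis=0)
        b_hid += lr * (p_h0 - p_h1).mean(axis=0)
        return W, b_vis, b_hid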
  
  Stacking RBMs to make a deep belief network:
    For real-valued data, a Gaussian-Bernoulli RBM (GRBM) is adopted. By stacking RBMs, the model can represent progressively more complex statistical structure in the data. After learning a DBN by training a stack of RBMs, the DBN can be used to initialize all the feature-detecting layers of a deterministic feedforward DNN; then a final softmax layer is added and the whole DNN is trained discriminatively.

My comment:
  Phonetic classification and recognition on TIMIT:
    TIMIT is a benchmark dataset for speech recognition. It is always helpful to find a benchmark dataset related to our research, because many existing techniques have already been tested on it, which greatly reduces the time needed to reproduce others' work. For each type of DBN-DNN, the architecture that performed best on the development set is reported.
    

  This paper says a lot about pre-training, including the much faster approximate learning method, contrastive divergence (CD). Recent CNN methods are likewise based on pre-training on ImageNet. It indeed helps related work by saving the time of tedious training and reducing overfitting. However, parallelizing the fine-tuning of DNNs is still a major issue. Combining it with the concept of "Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding" may help to improve fine-tuning time.

Wednesday, May 18, 2016

[AMMAI] [Lecture 12] - "Text Understanding from Scratch"

Paper Information:
  Zhang, Xiang, and Yann LeCun. "Text Understanding from Scratch." arXiv preprint arXiv:1502.01710 (2015).

Motivation:
  ConvNets are quite successful in the image domain; therefore, the authors apply them to the text domain, hoping that a character-level ConvNet can learn the relationships within text.

Contributions:
  Demonstrating that a deep learning system can understand text from character-level inputs, without embedded knowledge of words or syntax.

Technical summarization:
  ConvNet Model Design:
    
The paper works at the character level with temporal (1-D) convolution. Given an input function g(x) defined on [1, l] and a kernel f(x) defined on [1, k], the convolution with stride d is defined as
    h(y) = Σ_{x=1}^{k} f(x) · g(y·d − x + c),
and the max-pooling function is defined analogously as
    h(y) = max_{x=1..k} g(y·d − x + c),
where c = k − d + 1 is an offset constant.

Finally, they use the ReLU thresholding function h(x) = max{0, x} as the nonlinearity.
For the network architecture, they design ConvNets with 6 convolutional layers followed by 3 fully connected layers.
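
For intuition, a small PyTorch sketch of character quantization followed by a toy temporal ConvNet is shown below; the alphabet, input length, and layer widths are placeholders rather than the paper's exact 6-conv / 3-fc configuration.

    import torch
    import torch.nn as nn

    ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 "   # placeholder alphabet
    CHAR2IDX = {c: i for i, c in enumerate(ALPHABET)}

    def quantize(text, max_len=128):
        """One-hot encode characters; unknown characters stay all-zero columns."""
        x = torch.zeros(len(ALPHABET), max_len)
        for t, ch in enumerate(text.lower()[:max_len]):
            if ch in CHAR2IDX:
                x[CHAR2IDX[ch], t] = 1.0
        return x

    # toy character-level ConvNet: temporal conv + ReLU + max-pool, then a classifier
    char_cnn = nn.Sequential(
        nn.Conv1d(len(ALPHABET), 64, kernel_size=7), nn.ReLU(), nn.MaxPool1d(3),
        nn.Conv1d(64, 64, kernel_size=3), nn.ReLU(), nn.MaxPool1d(3),
        nn.Flatten(),
        nn.Linear(64 * 12, 4),   # 4 output classes, e.g. news topics
    )

    logits = char_cnn(quantize("character level convnets read raw text").unsqueeze(0))
    print(logits.shape)   # torch.Size([1, 4])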

  Data Augmentation:
    Just as rotating, scaling, and flipping are used for augmentation in image recognition, they adopt synonym replacement as their augmentation method to achieve invariance.

My comment:


  It's interesting that they compare quantization with the Braille system used for assisting blind reading. In this situation, the ConvNet is just like a blind person trying to learn from the binary encoding.

For the experiments, they evaluate on many datasets such as DBpedia, Amazon reviews, and Yahoo! Answers. All the experiments show better results compared with bag-of-words or word2vec baselines.

  Besides, they also demonstrate the ability to handle Chinese on the Sogou News corpus. This experiment shows the method's generality across languages; it is remarkable that the accuracy stays high even when the language changes.

   

Wednesday, May 11, 2016

[AMMAI] [Lecture 11] - "DeepFace: Closing the Gap to Human-Level Performance in Face Verification"

Paper Information:
  Taigman, Yaniv, et al. "Deepface: Closing the gap to human-level performance in face verification." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.

Motivation:
  There is still a gap between machines and humans when dealing with face recognition in unconstrained images.

Contributions:
  A system that generalizes well to other datasets and has closed the majority of the remaining gap in the most popular benchmark.

Technical summarization:
  Face alignment:
    The deep learning architecture is based on the assumption that alignment has already been done; therefore, face alignment is an important preprocessing step. In this paper, they use analytical 3D modeling of the face based on fiducial points.

    A short summary of their face alignment method: first, they detect 6 fiducial points and crop the image accordingly; second, 67 fiducial points are localized and a Delaunay triangulation is built from them; third, they align a 3D shape to the 2D image plane and, based on the 3D model, generate a frontalized crop.
  Representation:
   
    The first three layers are used to extract low-level features. Although the max-pooling layer can make the network more robust, it also causes the network to lose information; therefore, it is only applied to the first convolutional layer. The following three layers are locally connected layers, which need more training data during the training stage but can capture different local statistics in different regions of the face. At the last layer, they use softmax for classification. Besides, the output of F7 is used as the raw face representation.
  Verification metric:
    They have tried several methods for verification, such as the weighted chi-squared (χ²) distance and a Siamese network.
For the weighted χ² similarity, χ²(f1, f2) = Σ_i w_i (f1[i] − f2[i])² / (f1[i] + f2[i]), where the weights w can be learned by a linear SVM.

For the Siamese network, the induced distance is quite similar: d(f1, f2) = Σ_i α_i |f1[i] − f2[i]|, where the parameters α are learned by the network.
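
A few lines of NumPy sketching the weighted chi-squared similarity are shown below; the weight vector here is random for illustration, whereas in the paper it would be learned with a linear SVM.

    import numpy as np

    def weighted_chi2(f1, f2, w, eps=1e-8):
        """Weighted chi-squared similarity: sum_i w_i * (f1_i - f2_i)^2 / (f1_i + f2_i)."""
        return np.sum(w * (f1 - f2) ** 2 / (f1 + f2 + eps))

    rng = np.random.default_rng(0)
    f1, f2 = rng.random(4096), rng.random(4096)   # non-negative face representations (e.g. F7 features)
    w = rng.random(4096)                          # placeholder; learned with a linear SVM in the paper
    print(weighted_chi2(f1, f2, w))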


My comment:
  From this paper, I discovered that there are quite a lot of facial datasets available for training. They leverage these datasets in the experiments to demonstrate their method. As we can see below, it achieves remarkable accuracy on the LFW dataset, even compared with humans. However, other methods also achieve acceptable accuracy, so face recognition may be nearly saturated in terms of accuracy. Dealing with real scenarios, with multiple people and various lighting conditions, might be the next challenge. But collecting such datasets might be a problem; besides, privacy issues might also be a concern.
  

Wednesday, May 4, 2016

[AMMAI] [Lecture 10] - "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks"

Paper Information:
  Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.

Motivation:
  For state-of-the-art object detection networks, the region proposal computation is a bottleneck.

Contributions:
  A nearly cost-free region proposal method, the Region Proposal Network (RPN), which shares full-image convolutional features with the detection network.

Technical summarization:
  Region proposal networks:

    An RPN takes an image as input and outputs a set of rectangular object proposals, each with an objectness score. Each sliding window is mapped to a lower-dimensional vector. Furthermore, this vector is fed into two sibling FC layers: a box-regression layer (reg) and a box-classification layer (cls).
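
    A minimal PyTorch sketch of such an RPN head is given below; the 512-d intermediate vector and 9 anchors per position are common settings assumed here for illustration, not taken from the text above.

    import torch
    import torch.nn as nn

    class RPNHead(nn.Module):
        def __init__(self, in_channels=512, mid_channels=512, num_anchors=9):
            super().__init__()
            # 3x3 "sliding window" conv: each spatial position -> a mid_channels-d vector
            self.conv = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
            # sibling 1x1 convs: objectness scores (cls) and box deltas (reg) per anchor
            self.cls = nn.Conv2d(mid_channels, num_anchors * 2, kernel_size=1)
            self.reg = nn.Conv2d(mid_channels, num_anchors * 4, kernel_size=1)

        def forward(self, feat):
            x = torch.relu(self.conv(feat))
            return self.cls(x), self.reg(x)

    feat = torch.randn(1, 512, 38, 50)        # shared conv feature map
    scores, deltas = RPNHead()(feat)
    print(scores.shape, deltas.shape)         # (1, 18, 38, 50) and (1, 36, 38, 50)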

  Translation invariance:
    No matter how the position of an object in the image changes, the same function should be able to predict the proposal in either location.

  A Loss Function for Learning Region Proposals:

    L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)

To train the RPN, they minimize the objective above, where p_i is the predicted probability that anchor i is an object, p_i* is the ground-truth label (1 for a positive anchor, 0 otherwise), t_i is a vector of the 4 parameterized coordinates of the predicted bounding box, and t_i* is that of the ground-truth box associated with a positive anchor.



  Optimization: 
    To prevent bias toward negative samples, they randomly sample 256 anchors in an image, with a positive-to-negative ratio of up to 1:1.

  Sharing Convolutional Features for Region Proposal and Object Detection:
    There are 4 steps for sharing convolutional features:
      1. Train the RPN.
      2. Train Fast R-CNN using the proposals generated by the RPN.
      3. Use the detector network to initialize RPN training, but fix the shared conv layers and only fine-tune the layers unique to the RPN.
      4. Keeping the shared conv layers fixed, fine-tune the FC layers of Fast R-CNN.

My comment:

    The picture below shows that RPN is much faster than Selective Search (SS). Moreover, it also reveals that RPN can be trained to perform better than a pre-defined algorithm.


The picture below shows the recall of RPN proposals at different IoU ratios.
  It shows that RPN is quite stable even when the number of proposals changes.

   Besides, compared with Selective Search, RPN demonstrates the benefit of learned networks. However, RPN needs bounding boxes for training while SS does not; SS is still quite a good method for object proposals because bounding boxes for large-scale datasets are usually not easy to obtain.

Thursday, April 28, 2016

[AMMAI] [Lecture 09] - "Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding"

Paper Information:
  Han, Song, Huizi Mao, and William J. Dally. "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding." arXiv preprint arXiv:1510.00149 (2015).

Motivation:
    The demand for running neural networks on embedded systems is growing. However, limited hardware resources are an obstacle to such applications.

Contributions:
   They reduce the storage and energy required by large networks with pruning, trained quantization, and Huffman coding.

Technical summarization:
The three stages below are described in the following parts.

  Network pruning:
    First, the network learns the connectivity via normal training. Second, weights below a threshold are removed. Finally, the network is retrained to learn the final weights for the remaining sparse connections.

  Trained quantization and weight sharing
   They use k-means clustering to identify the shared weights for each layer. Since centroid initialization impacts the quality of clustering and the larger weights play a vital role, linear initialization (spreading centroids evenly between the minimum and maximum weight) is chosen for initialization.
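
   A small NumPy/scikit-learn sketch of this weight-sharing step with linear centroid initialization is shown below; the cluster count and the toy "pruned" layer are assumptions for illustration.

    import numpy as np
    from sklearn.cluster import KMeans

    def quantize_weights(weights, n_clusters=16):
        """Share weights via k-means; linear init spreads centroids evenly over
        [min, max] so that rare but important large weights keep a centroid."""
        nz = weights != 0                                   # pruned weights stay exactly zero
        w = weights[nz].reshape(-1, 1)
        init = np.linspace(w.min(), w.max(), n_clusters).reshape(-1, 1)
        km = KMeans(n_clusters=n_clusters, init=init, n_init=1).fit(w)
        codebook = km.cluster_centers_.ravel()              # the small shared-weight codebook
        quantized = weights.copy()
        quantized[nz] = codebook[km.labels_]                # replace each weight by its centroid
        return codebook, quantized

    rng = np.random.default_rng(0)
    layer = rng.normal(size=(64, 64))
    layer[np.abs(layer) < 0.5] = 0.0                        # toy "pruning" by magnitude
    codebook, q = quantize_weights(layer)
    print(codebook.size, np.unique(q).size)                 # 16 shared values (plus zero)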

  Huffman coding
   The main concept of Huffman coding is that more common symbols are represented with fewer bits.
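
   A tiny Python sketch of building such a code over a stream of quantization indices is shown below, just to illustrate that frequent symbols receive shorter bit strings; it is not the storage format used in the paper.

    import heapq
    from collections import Counter

    def huffman_code(symbols):
        """Build a Huffman code: more frequent symbols receive shorter bit strings."""
        freq = Counter(symbols)
        # heap entries: (frequency, tie-breaker, {symbol: code-so-far})
        heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
        heapq.heapify(heap)
        tie = len(heap)
        while len(heap) > 1:
            f1, _, c1 = heapq.heappop(heap)
            f2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + code for s, code in c1.items()}
            merged.update({s: "1" + code for s, code in c2.items()})
            heapq.heappush(heap, (f1 + f2, tie, merged))
            tie += 1
        return heap[0][2]

    # toy stream of quantization indices: index 3 dominates, so it gets the shortest code
    print(huffman_code([3, 3, 3, 3, 3, 1, 1, 2, 0]))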

My comment:
This paper indeed provides many visualizations to crystallize the abstract weight distributions of a CNN. Two examples are shown below.

Viewing the weight distribution as a histogram is quite a straightforward approach. Furthermore, the bias in the distribution is shown clearly, so it is concrete evidence of the reason to use Huffman coding.
The weight distribution of the conv3 layer is shown above; as we can see, it forms a bimodal distribution.

The following picture shows that the overhead of the codebook is very small and often negligible. When I first saw the use of a codebook, I thought it would cost some space, but this picture shows it does not consume much. Therefore, decoding time might also not be a problem, because the codebook is quite small.

Wednesday, April 20, 2016

[AMMAI] [Lecture 08] - "Two-Stream Convolutional Networks for Action Recognition in Videos"

Paper Information:
  Simonyan, Karen, and Andrew Zisserman. "Two-stream convolutional networks for action recognition in videos." Advances in Neural Information Processing Systems. 2014.

Motivation:
    Recognition of human actions in videos is a challenging task; compared to still image classification, the temporal component of videos provides an additional and vital clue for recognition.

Contributions:
   A deep Convolutional Network architecture extended to leverage both spatial and temporal streams.

Technical summarization:
  Two-stream architecture:
As shown in the picture, the spatial stream depicts information about scenes and objects, since some actions are strongly associated with particular objects. Besides, taking as input stacked optical flow displacement fields between several consecutive frames, the temporal stream conveys the movement of the observer (the camera) and of the objects.


   Multi-task learning:
   To solve the problem of insufficient training data, they adopt multi-task learning to exploit additional training data. There are two softmax classification layers on top of the last FC layer, one for HMDB-51 and one for UCF-101. The overall training loss is computed as the sum of the individual tasks' losses, as sketched below.
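
   A minimal PyTorch sketch of this two-head setup; the shared trunk below stands in for the ConvNet, and all sizes and names are placeholders.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TwoHeadNet(nn.Module):
        def __init__(self, feat_dim=4096, n_ucf=101, n_hmdb=51):
            super().__init__()
            self.trunk = nn.Sequential(nn.Linear(feat_dim, 2048), nn.ReLU())  # stands in for the shared ConvNet
            self.head_ucf = nn.Linear(2048, n_ucf)      # softmax head for UCF-101
            self.head_hmdb = nn.Linear(2048, n_hmdb)    # softmax head for HMDB-51

        def forward(self, x):
            h = self.trunk(x)
            return self.head_ucf(h), self.head_hmdb(h)

    net = TwoHeadNet()
    x_ucf, y_ucf = torch.randn(8, 4096), torch.randint(0, 101, (8,))
    x_hmdb, y_hmdb = torch.randn(8, 4096), torch.randint(0, 51, (8,))
    logits_ucf, _ = net(x_ucf)     # UCF samples only contribute loss through the UCF head
    _, logits_hmdb = net(x_hmdb)   # HMDB samples only through the HMDB head
    loss = F.cross_entropy(logits_ucf, y_ucf) + F.cross_entropy(logits_hmdb, y_hmdb)
    loss.backward()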

My comment:
  Besides comparing with state-of-the-art methods, they also run many experiments to find the best configuration of parameters. For example, for the spatial ConvNet, different training settings are validated, and for the temporal ConvNet they try various input configurations. To prove the effect of multi-task learning, they show the accuracy in different settings. Thorough experiments give concrete support to their method.



Wednesday, April 6, 2016

[AMMAI] [Lecture 06] - "A bayesian hierarchical model for learning natural scene categories"

Paper Information:
  Fei-Fei, Li, and Pietro Perona. "A bayesian hierarchical model for learning natural scene categories." Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on. Vol. 2. IEEE, 2005.

Motivation:
    Hand-annotating images is tedious and expensive; therefore, they propose an approach that recognizes natural scene categories with unsupervised learning.

Contributions:
   An algorithm that learns relevant intermediate representations of scenes automatically and without supervision. Besides, it is flexible and can group images into a sensible hierarchy.

Technical summarization:
    
  The goal of learning is to obtain a model that best represents the distribution of codewords in each category of scenes. For recognition, they first identify all the codewords in the unknown image, and then find the category model that best fits the distribution of codewords in that particular image.

  Codebook formation

    Given the collection of detected patches from the training images of all categories, they learn the codebook by k-means.
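
    A short scikit-learn sketch of codebook formation and codeword assignment is given below; the descriptor dimensionality and codebook size here are placeholders, not the paper's settings.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    patches = rng.random((5000, 128))                 # e.g. 128-d descriptors of detected patches
    kmeans = KMeans(n_clusters=200, n_init=4).fit(patches)   # codebook size is a placeholder
    codebook = kmeans.cluster_centers_

    image_patches = rng.random((300, 128))            # patches detected in one unseen image
    codeword_ids = kmeans.predict(image_patches)      # each patch mapped to its nearest codeword
    hist = np.bincount(codeword_ids, minlength=len(codebook))   # per-image codeword histogram
    print(hist.shape)                                 # (200,)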

  Model Structure

    1. Choose a category label c.
    2. Draw a parameter π that determines the distribution of the intermediate themes by choosing it from a distribution conditioned on the category (a Dirichlet prior in the paper).
    3. For each of the N patches x_n in the image:
       1. Choose a theme z_n from the multinomial distribution governed by π.
       2. Choose a patch x_n from a multinomial distribution over the codewords in the codebook, conditioned on the theme z_n.
   
  Bayesian Decision
    Given x, they want to compute the probability of each scene class.
Therefore, the goal is to maximize the log-likelihood term log p(x|θ, β, c) by estimating the optimal θ and β. Using Jensen's inequality, the log-likelihood can be lower-bounded; consequently, maximizing the lower bound L(γ, φ; θ, β) with the EM algorithm in turn estimates the model parameters θ and β.

My comment:
  
  With clear visualizations, the paper gives an intuitive understanding of the distribution of the 40 intermediate themes and the distribution of codewords. Besides, for incorrectly categorized images, the significant codewords of the model tend to occur less frequently; this is a useful finding, meaning there are not enough reliable codewords found in the image.

  Based on the theme distribution, they demonstrate that the model can group images into a hierarchy with semantic meaning.

  Though they did not run the related algorithms on the same dataset, it is still convincing that their unsupervised method indeed performs well.

Monday, March 28, 2016

[AMMAI] [Lecture 05] - "Nonlinear dimensionality reduction by locally linear embedding"

Paper Information:
  Roweis, Sam T., and Lawrence K. Saul. "Nonlinear dimensionality reduction by locally linear embedding." Science 290.5500 (2000): 2323-2326.

Motivation:
  The need to analyze large amounts of multivariate data raises the fundamental problem of dimensionality reduction.

Contributions:
   They introduce locally linear embedding (LLE), an unsupervised learning algorithm that computes low-dimensional, neighborhood-preserving embeddings of high-dimensional inputs. Moreover, LLE maps its inputs into a single global coordinate system of lower dimensionality, and its optimizations do not involve local minima.

Technical summarization:
  The LLE algorithm can be summarized in the figure:

1. Select neighbors (for example, the K nearest neighbors of each point).
2. Reconstruct each point with linear weights from its neighbors.
Minimize the reconstruction cost ε(W) = Σ_i |X_i − Σ_j W_ij X_j|² subject to two constraints:
  first, each X_i is reconstructed only from its neighbors (W_ij = 0 if X_j is not among the neighbors of X_i);
  second, the rows of the weight matrix sum to one (Σ_j W_ij = 1).
The weights can be solved by constrained least-squares problems.

3. Map to embedded coordinates.
Fixing the weights W, the low-dimensional coordinates Y are found by minimizing the embedding cost Φ(Y) = Σ_i |Y_i − Σ_j W_ij Y_j|², which can be solved as a sparse NxN eigenvalue problem.
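
These three steps are wrapped in a single call in scikit-learn; a compact sketch on toy Swiss-roll data is shown below (the parameter values are arbitrary).

    from sklearn.datasets import make_swiss_roll
    from sklearn.manifold import LocallyLinearEmbedding

    X, _ = make_swiss_roll(n_samples=1500, random_state=0)   # 3-D points lying on a 2-D manifold
    lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2)
    Y = lle.fit_transform(X)   # K-NN graph, reconstruction weights, then the sparse eigenproblem
    print(Y.shape)             # (1500, 2) neighborhood-preserving embedding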


My comment:



LLE is able to identify the underlying structure of the manifold, which PCA and MDS cannot achieve.

A straightforward visualization gives a clear point of view to understand the advantage of the method. Besides, applying LLE to various domains shows that the coordinates of these embedding spaces are related to meaningful attributes, such as the pose and expression of human faces.


Wednesday, March 16, 2016

[AMMAI] [Lecture 04] - "Online dictionary learning for sparse coding"

Paper Information:
  Mairal, Julien, et al. "Online dictionary learning for sparse coding." Proceedings of the 26th annual international conference on machine learning. ACM, 2009.

Motivation:
  While learning the dictionary has proven to be critical to achieve (or improve upon) state-of-the-art results, effectively solving the corresponding optimization problem is a significant computational challenge, particularly in the context of the large-scale datasets involved in image processing tasks, that may include millions of training samples.
  To address these issues, this paper proposes an online approach that processes one element (or a small subset) of the training set at a time.

Contributions:
   A new online optimization algorithm for dictionary learning, based on stochastic approximations, which scales up gracefully to large datasets with millions of training samples.

Technical summarization:

  Classical dictionary learning techniques:
    The main idea is to model data vectors as sparse linear combinations of basis elements. The empirical cost is f_n(D) = (1/n) Σ_{i=1}^{n} l(x_i, D), with l(x, D) = min_α (1/2)||x − Dα||² + λ||α||₁, so the loss is small if the dictionary D is "good" at representing the signals x.



    Online Dictionary Learning:
      The algorithm alternates between the two variables (the sparse codes and the dictionary), minimizing over one while keeping the other fixed.

  
      In practice, the algorithm can be sped up by replacing lines 5 and 6 of Algorithm 1 with a mini-batch extension, drawing several samples at each iteration instead of one.

      To update the dictionary, the algorithm uses block-coordinate descent with warm restarts. Besides, one of its main advantages is that it is parameter-free and does not require any learning rate tuning.
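
      A short scikit-learn sketch of this mini-batch/online formulation on random toy "patches" is given below; all sizes and the regularization weight are placeholders.

    import numpy as np
    from sklearn.decomposition import MiniBatchDictionaryLearning

    rng = np.random.default_rng(0)
    X = rng.random((10000, 64))          # toy signals, e.g. flattened 8x8 image patches

    # online dictionary learning: processes mini-batches of samples at a time
    dico = MiniBatchDictionaryLearning(n_components=256, alpha=1.0,
                                       batch_size=200, random_state=0)
    D = dico.fit(X).components_          # learned dictionary, one atom per row: (256, 64)
    codes = dico.transform(X[:5])        # sparse codes of the first five signals
    print(D.shape, codes.shape)          # (256, 64) (5, 256)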


My comment:
  Compared with the batch setting, the online setting is more realistic and scalable; moreover, the parameter-free dictionary update makes the experiments stable and objective. Besides, the application of removing text from a damaged image is impressive.

[AMMAI] [Lecture 03] - "Iterative Quantization: A Procrustean Approach to Learning Binary Codes"

Paper Information:
  Gong, Yunchao, and Svetlana Lazebnik. "Iterative quantization: A procrustean approach to learning binary codes." Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011.

Motivation:
  This paper wants to solve the problem of learning similarity preserving binary codes for efficient retrieval in large-scale image collections.


Contributions:
  In this paper, they show that the performance of PCA-based binary coding schemes can be greatly improved by simply rotating the projected data, and they propose a very natural and effective iterative quantization method for refining this rotation. Iterative quantization (ITQ) has connections to multi-class spectral clustering and to the orthogonal Procrustes problem, and it can be used both with unsupervised data embeddings such as PCA and supervised embeddings such as canonical correlation analysis (CCA).

Technical summarization:
  Unsupervised Code Learning:
    The major novelty of the method is that they try to preserve the locality structure of the projected data by rotating it so as to minimize the discretization error.
 
  Binary Quantization:

 Beginning with the random initialization of R, they adopt a k-means-like iterative quantization (ITQ) procedure to find a local minimum of the quantization loss (2). In each iteration, each data point is first assigned to the nearest vertex of the binary hypercube, and then R (any orthogonal c*c matrix) is updated to minimize the quantization loss given this assignment.
 
  V = XW, where the columns of W are the top c eigenvectors of the data covariance matrix (i.e., the PCA projection of the centered data matrix X).
  They alternate between updates to B and R for several iterations to find a locally optimal solution, as sketched below.
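
  A NumPy sketch of this alternation: fixing R gives B = sgn(VR), and fixing B reduces to an orthogonal Procrustes problem solved with an SVD; the PCA projection and iteration count below are my own assumptions.

    import numpy as np

    def itq(V, n_iter=50, seed=0):
        """V: (n, c) zero-centered, PCA-projected data. Returns binary codes and rotation R."""
        rng = np.random.default_rng(seed)
        R, _ = np.linalg.qr(rng.normal(size=(V.shape[1], V.shape[1])))  # random orthogonal init
        for _ in range(n_iter):
            B = np.sign(V @ R)                  # assign each point to the nearest hypercube vertex
            B[B == 0] = 1
            U, _, Vt = np.linalg.svd(B.T @ V)   # orthogonal Procrustes: rotate V toward B
            R = (U @ Vt).T
        B = np.sign(V @ R)
        B[B == 0] = 1
        return ((B + 1) // 2).astype(np.uint8), R

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 128))
    X -= X.mean(axis=0)
    _, _, Wt = np.linalg.svd(X, full_matrices=False)
    V = X @ Wt[:32].T                           # V = XW with the top-32 principal directions
    codes, R = itq(V)
    print(codes.shape)                          # (1000, 32) binary codes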
  Leveraging Label Information:
    Their method can be used with any orthogonal basis projection method. Therefore, supervised dimensionality reduction method can be used to capture the semantic structure of the dataset.
    They refine their codes in a supervised setting using Canonical Correlation Analysis (CCA), which has proven to be an effective tool for extracting a common latent space from two views and is robust to noise. The goal of CCA is to find projection directions for feature and label vectors to maximize the correlation between the projected data. 
My comment:
  Since there are no ground-truth class labels for the dataset, they define the ground truth by Euclidean nearest neighbors. This may be useful when facing the same situation in our own research.
  As we can see from the results, PCA really helps to preserve semantic consistency for the smallest code sizes. Therefore, it is important to apply dimensionality reduction to the data in order to capture its class structure.