April 28, 2016 (Thursday)

[AMMAI] [Lecture 09] - "Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding"

Paper Information:
  Han, Song, Huizi Mao, and William J. Dally. "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding." arXiv preprint arXiv:1510.00149 (2015).

Motivation:
    The demand for running neural networks on embedded systems is growing rapidly. However, limited hardware resources are an obstacle to such applications.

Contributions:
   They reduce the storage and energy required by large networks through pruning, trained quantization, and Huffman coding.

Technical summarization:
The three stages of the pipeline are described below.

  Network pruning:
    First, the network learns which connections are important via normal training. Second, weights below a threshold are removed. Finally, the remaining sparse network is retrained.
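The three steps above can be sketched in NumPy (a minimal illustration, not the paper's actual implementation; the function name and toy weights are mine):

```python
import numpy as np

def prune_by_magnitude(weights, threshold):
    """Zero out weights whose magnitude falls below the threshold,
    returning the pruned weights and a binary mask for retraining."""
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

# Toy weight matrix standing in for one trained layer.
w = np.array([[0.8, -0.02, 0.5],
              [0.01, -0.9, 0.03]])
pruned, mask = prune_by_magnitude(w, threshold=0.1)

# During retraining, gradients at pruned positions are masked out
# so the removed connections stay removed.
grad = np.ones_like(w)
masked_grad = grad * mask
```

The mask is what makes the retraining step work: the surviving weights keep adapting while the pruned ones stay exactly zero.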

  Trained quantization and weight sharing
   They use k-means clustering to find the shared weights. Since centroid initialization impacts the quality of clustering, and the few large weights are vital, linear initialization is chosen because it preserves the large-weight range.
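A minimal sketch of this clustering step, assuming plain k-means over the flattened weights with linearly spaced initial centroids (the function name and loop count are mine, not from the paper):

```python
import numpy as np

def quantize_weights(weights, n_clusters, n_iter=20):
    """Cluster flat weights into n_clusters shared values using k-means
    with linear centroid initialization: centroids evenly spaced over
    [min, max], so the large-magnitude tail keeps representatives."""
    flat = weights.ravel()
    centroids = np.linspace(flat.min(), flat.max(), n_clusters)
    for _ in range(n_iter):
        # Assign each weight to its nearest centroid.
        idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        # Move each centroid to the mean of its assigned weights.
        for k in range(n_clusters):
            if np.any(idx == k):
                centroids[k] = flat[idx == k].mean()
    return centroids[idx].reshape(weights.shape), idx.reshape(weights.shape)

w = np.array([[0.1, 0.12, -0.5],
              [-0.52, 0.9, 0.88]])
quantized, indices = quantize_weights(w, n_clusters=3)
```

After quantization, only the small centroid table plus the per-weight cluster indices need to be stored.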

  Huffman coding
   The main concept of Huffman coding is that more common symbols are represented with fewer bits.
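A small self-contained sketch of Huffman code construction (illustrative only; in the paper the symbols are the quantized weights and sparse-matrix indices, not characters):

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Build a Huffman code table: frequent symbols get shorter codes."""
    freq = Counter(symbols)
    # Each heap entry: (frequency, unique tie-breaker, {symbol: code-so-far}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate case: only one distinct symbol
        return {next(iter(freq)): "0"}
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, i2, c2 = heapq.heappop(heap)
        # Prefix 0/1 onto the codes of the two merged subtrees.
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, i2, merged))
    return heap[0][2]

codes = huffman_codes("aaaabbc")  # 'a' is most frequent
```

Because the quantized weight distribution is biased (far from uniform), the average code length drops well below a fixed-width encoding.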

My comment:
This paper provides many visualizations to crystallize the abstract weight distributions of a CNN. Two examples are discussed below.

Viewing the weight distribution as a histogram is an intuitive choice. The histogram of the conv3 layer's weights, shown in the paper, forms a bimodal distribution; this clear bias in the distribution is concrete evidence for why Huffman coding pays off.

Another figure shows that the overhead of the codebook is very small and often negligible. When I first saw the use of a codebook, I assumed it would cost noticeable space, but the figure shows it does not. Decoding time should therefore not be a problem either, since the codebook is quite small.

April 20, 2016 (Wednesday)

[AMMAI] [Lecture 08] - "Two-Stream Convolutional Networks for Action Recognition in Videos"

Paper Information:
  Simonyan, Karen, and Andrew Zisserman. "Two-stream convolutional networks for action recognition in videos." Advances in Neural Information Processing Systems. 2014.

Motivation:
    Recognition of human actions in videos is a challenging task: compared to still-image classification, the temporal component of videos provides an additional and vital cue for recognition, but exploiting it is non-trivial.

Contributions:
   An extension of deep Convolutional Networks that leverages both spatial and temporal streams.

Technical summarization:
  Two-stream architecture:
As shown in the figure, the spatial stream depicts information about scenes and objects, since some actions are strongly associated with particular objects. The temporal stream, whose input is a stack of optical-flow displacement fields between several consecutive frames, conveys the movement of the observer (the camera) and the objects.
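One simple way the two streams are combined is late fusion of their class scores; the sketch below averages the two softmax outputs (the paper also fuses with a linear SVM, and the stream weighting here is a hypothetical parameter):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def two_stream_predict(spatial_logits, temporal_logits, w_spatial=0.5):
    """Late fusion: weighted average of the two streams' class scores,
    then pick the highest-scoring action class."""
    fused = (w_spatial * softmax(spatial_logits)
             + (1 - w_spatial) * softmax(temporal_logits))
    return fused.argmax(axis=-1)

# Toy logits: the spatial stream is confident about class 0,
# the temporal stream mildly prefers class 1.
spatial = np.array([5.0, 0.0, 0.0])
temporal = np.array([0.0, 1.0, 0.0])
pred = two_stream_predict(spatial, temporal)
```

The appeal of averaging is that neither stream needs retraining; fusion happens purely at prediction time.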


   Multi-task learning:
   To overcome the limited size of the available datasets, they adopt multi-task learning to exploit additional training data. Two softmax classification layers sit on top of the last fully connected layer, one for HMDB-51 and one for UCF-101. The overall training loss is computed as the sum of the individual tasks' losses.
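The summed loss can be sketched as follows, assuming a shared feature vector feeding two independent softmax heads (the feature size, random weights, and labels below are toy stand-ins, not the paper's architecture details):

```python
import numpy as np

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single example, computed stably."""
    shifted = logits - logits.max()
    return -(shifted[label] - np.log(np.exp(shifted).sum()))

rng = np.random.default_rng(0)
feats = rng.standard_normal(16)           # shared last-FC features (toy size)
w_ucf = rng.standard_normal((16, 101))    # UCF-101 head: 101 classes
w_hmdb = rng.standard_normal((16, 51))    # HMDB-51 head: 51 classes

# Each head scores the shared features against its own label space;
# the overall training loss is the sum of the per-task losses.
loss_ucf = cross_entropy(feats @ w_ucf, label=3)
loss_hmdb = cross_entropy(feats @ w_hmdb, label=7)
total_loss = loss_ucf + loss_hmdb
```

Because both heads backpropagate into the same shared layers, examples from either dataset improve the common representation.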

My comment:
  Besides comparing against state-of-the-art methods, they also run many experiments to find the best parameter configuration. For example, distinct training settings are validated for the spatial ConvNet, and various input configurations are tried for the temporal ConvNet. To demonstrate the effect of multi-task learning, they report accuracy under different settings. Such thorough experiments give concrete support to their method.



April 6, 2016 (Wednesday)

[AMMAI] [Lecture 06] - "A bayesian hierarchical model for learning natural scene categories"

Paper Information:
  Fei-Fei, Li, and Pietro Perona. "A bayesian hierarchical model for learning natural scene categories." Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on. Vol. 2. IEEE, 2005.

Motivation:
    Hand-annotating images is tedious and expensive; therefore, they propose an approach that recognizes natural scene categories via unsupervised learning.

Contributions:
   An algorithm that learns relevant intermediate representations of scenes automatically and without supervision. It is also flexible and can group images into a sensible hierarchy.

Technical summarization:
    
  The goal of learning is to obtain a model that best represents the distribution of codewords in each category of scenes. In recognition, therefore, they first identify all the codewords in the unknown image, then find the category model that best fits the distribution of that image's codewords.

  Codebook formation

    Given the collection of detected patches from the training images of all categories, they learn the codebook by k-means.
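The codebook step can be sketched with plain k-means over patch descriptors (a toy implementation; `build_codebook` is my name, and in the paper the descriptors are real local features extracted from the training images):

```python
import numpy as np

def build_codebook(descriptors, n_codewords, n_iter=15, seed=0):
    """Plain k-means over patch descriptors: the centroids become the
    codewords, and each patch is encoded by its nearest centroid index."""
    rng = np.random.default_rng(seed)
    init = rng.choice(len(descriptors), n_codewords, replace=False)
    centroids = descriptors[init]
    for _ in range(n_iter):
        # Squared distances from every descriptor to every centroid.
        d = ((descriptors[:, None] - centroids[None]) ** 2).sum(-1)
        assign = d.argmin(1)
        for k in range(n_codewords):
            if np.any(assign == k):
                centroids[k] = descriptors[assign == k].mean(0)
    return centroids, assign

# Two well-separated 1-D "patch descriptor" blobs.
descriptors = np.array([[0.0], [0.1], [10.0], [10.1]])
centroids, assign = build_codebook(descriptors, n_codewords=2)
```

Once the codebook exists, any image reduces to a bag of codeword indices, which is exactly the representation the category models are fit on.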

  Model Structure

    1. Choose a category label c.
    2. Draw a parameter π that determines the distribution of the intermediate themes, conditioned on c.
    3. For each of the N patches x_n in the image:
       1. Choose a theme z_n according to π.
       2. Choose a codeword x_n according to the theme z_n, from the theme-specific distribution over the codewords in the codebook.
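The generative story above can be sketched as an ancestral-sampling routine, assuming LDA-style Dirichlet/multinomial choices (all sizes and priors below are toy values, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)

n_themes, n_codewords = 4, 10
# Per-category Dirichlet prior over theme mixtures (stands in for theta).
theta_prior = np.ones(n_themes)
# Theme-to-codeword multinomials (stands in for beta), one row per theme.
beta = rng.dirichlet(np.ones(n_codewords), size=n_themes)

def generate_image(n_patches, category_theta_prior):
    """Sample an image as a bag of codewords: draw a theme mixture for
    the image, then for each patch draw a theme z_n and a codeword x_n
    from that theme's distribution."""
    pi = rng.dirichlet(category_theta_prior)            # theme mixture
    themes = rng.choice(n_themes, size=n_patches, p=pi)  # z_n per patch
    return np.array([rng.choice(n_codewords, p=beta[z]) for z in themes])

words = generate_image(50, theta_prior)
```

Learning then runs this story in reverse: given bags of codewords, it estimates the per-category theme priors and the theme-to-codeword distributions.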
   
  Bayesian Decision
    Given x, they want to compute the probability of each scene class.
The goal is therefore to maximize the log-likelihood log p(x|θ, β, c) by estimating the optimal θ and β. Using Jensen's inequality, the log-likelihood can be lower-bounded. Consequently, the lower bound L(γ, φ; θ, β) is maximized with the EM algorithm, which in turn estimates the model parameters θ and β.
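Following the LDA-style variational derivation this alludes to, the bound obtained from Jensen's inequality can be written as (symbols as in the summary; the exact form in the paper may differ slightly):

```latex
\log p(x \mid \theta, \beta)
  \;\ge\; \mathbb{E}_q\!\left[\log p(\pi, z, x \mid \theta, \beta)\right]
        - \mathbb{E}_q\!\left[\log q(\pi, z \mid \gamma, \phi)\right]
  \;=\; L(\gamma, \phi;\, \theta, \beta)
```

The E-step then maximizes L over the variational parameters (γ, φ) for each image, and the M-step maximizes it over the model parameters (θ, β).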

My comment:
  
  The clear visualizations give an intuitive understanding of the distribution of the 40 intermediate themes and of the codeword distributions. Moreover, for incorrectly categorized images, the model's significant codewords tend to occur less frequently. This is a great finding: it means there were not enough reliable codewords found in those images.

  Based on the theme distributions, they demonstrate that the model can group images into a hierarchy with semantic meaning.

  Though they did not run the related algorithms on the same dataset, it is still convincing that their unsupervised method indeed performs well.