2012年3月14日 星期三

Aggregating local descriptors into a compact image representation

In this paper, they focus on the large-scale image retrieval, and want two fulfill three factors at the same time: the search accuracy, its efficiency and the memory usage.

In order to fulfill these three factor, they proposed a new features representation method, VLAD (vector of locally aggregated descriptors). It can obtain a significantly higher accuracy with about 20-byte representation.

The other contribution of this paper is jointly optimizing the dimensional reduction and the indexation algorithm. Not like previous works, they often consider these two factors separately.

Thus, I will focus on these two things in the following.

VLAD: vector of locally aggregated descriptors
The concept of VLAD is derive from BOF (Bof of features) and Fisher kernel.
They first learn a codebook C = {c1 , ..., ck } of k visual words with k-means.
Each local descriptor x is assigned to its nearest visual word ci = NN(x).
The idea of the VLAD descriptor is to accumulate, for each visual word ci, the differences x−ci of the vectors x assigned to ci.

Write the above things as the mathematical equation. where x_j and c_i, j are the j-th component of x and ci

The following are the visual representation of the features and their images.
Each grid represent one centroid, and each point in the grid has 8 directions (like the representation as SIFT).
Thus, every centroid is a 16*8 = 128 dimension vector.
You can find that the images having similar visual contents tend to have the similar differences with centroids. So aggregating them together is reasonable.

●Converting vectors to codes
In this paper, they use the asymmetric distance computation (ADC), which only encodes the vectors of the database, but not the query vector.

For every features in the database, they first use PCA to do dimension reduction. The number of dimensions will reduce from D to D'. That is, we can find a D'×D matrix M such that the original feature x is transformed to x' = Mx.

After doing this, split x' into m subvectors, and quantized each subvector separately.
Centroids in each subvetors are computed separately by k-means.
The following is the presentation slide from the author. (he use y to represent x')
For every subvectors, they use the same number of centroids. However, after PCA, the variance of the different components of x′ is not balanced. The variance of first principle component tends to be much larger than the last one. Thus, using the same number of centroids will introduce more quantization error for first principle component.

To solve this problem, they try to find another transformation matrix Q, such that the transformed vector x'' = Qx' tends to have the same variance.

The other thing needs to be considered is the number of dimensions remained (D') after PCA.
If D' is large, it will introduce more quantization error. If D' is small, it will generate more reduction error. Thus, they consider these two factors together to find a suitable D'.

Despite they spend lots of efforts on this part, they finally find that the performance is the same as random projection.
The following is the experiment table cut from the paper.
It tells us that we can simply use random projection to reduce the dimensions.
That is much easier to implement, and can get the same performance.

========================================================
Comment:
The main contribution of this paper is VLAD. They said it is derived from Fisher kernel, but I can't find relation between these two things in this paper. I think it is much more likely that they first develop the VLAD and want to find some supports. Then, fisher kernel is selected.
Besides, it's quite strange to introduce compression method in such order. They first introduce ADC, and then PCA. But in practice, PCA is done first, and followed by ADC.
The description of VLAD is much more clear compared with the compression method.