Wednesday, March 14, 2012

Efficient visual search of videos cast as text retrieval

The main idea of this paper is to convert images into "visual words", which are visual analogies of text words. It then becomes possible to apply text-retrieval techniques to image retrieval. The motivation is that text retrieval has been developed for much longer than image retrieval.
They apply several well-known text-retrieval techniques, such as tf-idf weighting and vector-space representation, to these visual words.

The following is the retrieval algorithm from the paper.
The main differences from text retrieval are feature extraction, visual word construction, and spatial verification. I will focus on these parts and go through the rest quickly.

●Feature extraction (first two steps in pre-processing)
This paper uses two kinds of feature detectors and combines them.
The first is the Shape Adapted (SA) detector, which tends to be centered on corner-like features.
The other is the Maximally Stable (MS) detector, which tends to find blobs of high contrast with respect to their surroundings.
After finding interesting regions with these two detectors, a SIFT descriptor is computed to describe each region (you can find a detailed introduction here).
To obtain more stable features, any region that does not survive for more than three frames is discarded.
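The temporal stability filter above can be sketched as follows. The `tracks` structure (region id mapped to the frames it appears in) is a hypothetical representation for illustration, not from the paper.

```python
# Sketch of the temporal stability filter: keep only regions whose
# track survives for more than three frames. The `tracks` structure
# (region id -> list of frame indices) is an assumed representation.

def stable_regions(tracks, min_frames=3):
    """Return ids of regions tracked in more than `min_frames` frames."""
    return [rid for rid, frames in tracks.items() if len(frames) > min_frames]

tracks = {
    "r1": [0, 1, 2, 3, 4],  # survives 5 frames -> kept
    "r2": [7, 8],           # only 2 frames -> discarded
    "r3": [3, 4, 5, 6],     # 4 frames -> kept
}
print(stable_regions(tracks))  # -> ['r1', 'r3']
```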

●Visual word construction
After extracting features, they run k-means clustering on the descriptors and use the resulting centroids as "visual words". In this paper, they use 6,000 clusters for Shape Adapted regions and 10,000 clusters for Maximally Stable regions.
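A minimal sketch of this step, assuming toy 8-D descriptors and only 4 clusters for readability (the paper uses 128-D SIFT descriptors with 6,000 and 10,000 clusters):

```python
import numpy as np

# Sketch of visual-word construction: cluster descriptors with k-means,
# then quantize any new descriptor to its nearest centroid. The data
# and k=4 are toy values for illustration.

rng = np.random.default_rng(0)
descriptors = rng.normal(size=(200, 8))  # pretend these are SIFT descriptors

def kmeans(X, k, iters=20):
    centroids = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        # assign each descriptor to the nearest centroid
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        # move each centroid to the mean of its assigned descriptors
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids

vocab = kmeans(descriptors, k=4)

def quantize(desc, vocab):
    """Map a descriptor to its visual word (nearest centroid index)."""
    return int(np.linalg.norm(vocab - desc, axis=1).argmin())

word = quantize(descriptors[0], vocab)
```

In practice a library implementation (e.g. scikit-learn's `KMeans`) would be used instead of this hand-rolled loop; the point is only that each visual word is a centroid index.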


Each frame is then represented as a vector, as in a traditional text-retrieval system.
Each element in the vector is weighted by tf-idf, and similarity is measured by cosine similarity.
They also use a stop list to remove some of the most frequent visual words, and found that really useful, as shown in the figure below.
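The tf-idf weighting and cosine similarity described above can be sketched as follows; the visual-word ids and frames are made-up toy data.

```python
import math

# Sketch of tf-idf weighting and cosine similarity over visual words.
# Each "document" is a frame's list of visual-word ids (toy data).

frames = [
    [1, 1, 2, 3],
    [1, 2, 2, 4],
    [3, 4, 4, 4],
]

def tfidf_vector(doc, docs, vocab):
    n = len(docs)
    vec = []
    for w in vocab:
        tf = doc.count(w) / len(doc)            # term frequency in this frame
        df = sum(1 for d in docs if w in d)     # number of frames containing w
        idf = math.log(n / df) if df else 0.0   # inverse document frequency
        vec.append(tf * idf)
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

vocab = sorted({w for d in frames for w in d})
vecs = [tfidf_vector(d, frames, vocab) for d in frames]
sim = cosine(vecs[0], vecs[1])  # similarity between frame 0 and frame 1
```

A stop list simply drops the highest-df words from `vocab` before building the vectors, which also shrinks the index.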
●Spatial verification
One thing worth mentioning is that images have geometric information while documents do not.
So they further perform spatial verification on the matched pairs.
For every matched pair, the 15 nearest spatial neighbors are found in both the query and the target frame. If none of these 30 points (15 in the query, 15 in the target) form another matched pair, this pair is rejected.
The following is the illustration (it only searches 5 nearest neighbors).
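The neighborhood check above can be sketched as follows; the point coordinates and match list are made up, and k=2 is used instead of the paper's k=15 to keep the example tiny.

```python
import math

# Sketch of spatial verification: a match (q, t) is kept only if some
# other matched pair falls within q's k nearest neighbors in the query
# frame AND t's k nearest neighbors in the target frame. Toy data;
# the paper uses k = 15.

def k_nearest(idx, points, k):
    """Indices of the k points closest to points[idx] (excluding itself)."""
    d = sorted(
        (math.dist(points[idx], points[j]), j)
        for j in range(len(points)) if j != idx
    )
    return {j for _, j in d[:k]}

def verify(q, t, q_pts, t_pts, matches, k=15):
    """Keep (q, t) only if a neighboring match supports it."""
    qn = k_nearest(q, q_pts, k)
    tn = k_nearest(t, t_pts, k)
    return any(a in qn and b in tn for a, b in matches if (a, b) != (q, t))

q_pts = [(0, 0), (1, 0), (0, 1), (9, 9)]   # feature positions, query frame
t_pts = [(0, 0), (1, 1), (0, 2), (9, 8)]   # feature positions, target frame
matches = [(0, 0), (1, 1), (3, 3)]          # putative matched pairs

print(verify(0, 0, q_pts, t_pts, matches, k=2))  # match (1, 1) supports it
```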
========================================================
Comment:
Converting a new problem into an old one is a clever way to solve problems, because well-developed methods already exist for the old one. This paper gives us a very good example of doing that. After converting the problem, we still need to check whether the two problems differ in any property; in this paper, that is why they add spatial verification at the end.
