Tuesday, April 10, 2012

Probabilistic latent semantic indexing

In this paper, the author proposes a useful method for learning latent semantics probabilistically: pLSA.

In pLSA, every document is modeled as a mixture of latent classes, and each word is generated depending only on its latent class: given the class, the word is independent of the document.
They model the problem as follows:
d denotes a document, w denotes a word, and z denotes a latent class.
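This generative story can be sketched as follows; all probability tables here are toy numbers I made up purely for illustration:

```python
import random

# Toy model: 2 documents, 2 latent classes, 3 words (all values illustrative).
P_d = {"d1": 0.5, "d2": 0.5}                                  # P(d)
P_z_given_d = {"d1": {"z1": 0.8, "z2": 0.2},                  # P(z|d)
               "d2": {"z1": 0.3, "z2": 0.7}}
P_w_given_z = {"z1": {"apple": 0.6, "pie": 0.3, "car": 0.1},  # P(w|z)
               "z2": {"apple": 0.1, "pie": 0.2, "car": 0.7}}

def sample(dist):
    """Draw one key from a {outcome: probability} dict."""
    r, acc = random.random(), 0.0
    for k, p in dist.items():
        acc += p
        if r < acc:
            return k
    return k  # guard against floating-point rounding

def generate_word():
    d = sample(P_d)              # pick a document
    z = sample(P_z_given_d[d])   # pick a latent class given the document
    w = sample(P_w_given_z[z])   # pick a word given ONLY the latent class
    return d, z, w
```

Note that the word is sampled from P(w|z) alone, which is exactly the conditional-independence assumption stated above.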

In this paper, they want to minimize the error between the original data and the data reconstructed through the latent classes (measured with KL divergence). After a series of calculations, the original problem can be converted into maximizing the following log-likelihood:

L = Σ_{d,w} n(d,w) log P(d,w),  with  P(d,w) = Σ_z P(z) P(d|z) P(w|z),

where n(d,w) denotes the term frequency.
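As a concrete reading of that objective, here is a sketch (the function and array names are my own) that evaluates L = Σ_{d,w} n(d,w) log Σ_z P(z) P(d|z) P(w|z):

```python
import numpy as np

def log_likelihood(n_dw, P_z, P_w_z, P_d_z):
    """pLSA log-likelihood.

    n_dw : (D, W) term-frequency matrix
    P_z  : (Z,)   class prior P(z)
    P_w_z: (Z, W) word distribution per class P(w|z)
    P_d_z: (Z, D) document distribution per class P(d|z)
    """
    # P(d,w) = sum_z P(z) P(d|z) P(w|z)  ->  shape (D, W)
    P_dw = np.einsum("z,zd,zw->dw", P_z, P_d_z, P_w_z)
    # Tiny epsilon guards log(0) for unseen (d, w) pairs.
    return float(np.sum(n_dw * np.log(P_dw + 1e-12)))
```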

This optimization problem can be solved with the EM algorithm, as follows.
E step:
Compute the posterior probability of each latent class z from the parameter values of the previous round:

P(z|d,w) = P(z) P(d|z) P(w|z) / Σ_{z'} P(z') P(d|z') P(w|z')

M step:
Use P(z|d,w), weighted by the term frequencies n(d,w), to update P(w|z), P(d|z), and P(z).
Repeat the E and M steps until the change in the result falls below a threshold.

The above is the traditional EM method for solving the optimization problem. However, the result depends heavily on the initial probabilities, and plain EM may get stuck in a local maximum.
Thus, they proposed tempered EM (TEM), a concept derived from deterministic annealing. The formula in the E step becomes:

P_β(z|d,w) = P(z) [P(d|z) P(w|z)]^β / Σ_{z'} P(z') [P(d|z') P(w|z')]^β,  with β ≤ 1.
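The tempered E step is a small change to the standard posterior: the conditionals are raised to a power β ≤ 1 before normalizing, which smooths the posterior and discourages early commitment to one class (a sketch with my own naming):

```python
import numpy as np

def tempered_e_step(P_z, P_d_z, P_w_z, beta):
    """Tempered posterior: P_beta(z|d,w) ∝ P(z) [P(d|z) P(w|z)]^beta."""
    joint = P_z[:, None, None] * (P_d_z[:, :, None] * P_w_z[:, None, :]) ** beta
    return joint / joint.sum(axis=0, keepdims=True)   # shape (Z, D, W)
```

With β = 1 this reduces to the standard E step; lowering β flattens the posterior toward uniform.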

The physical meaning of pLSA can be viewed as finding the bases from which documents are generated.
P(w|z1), P(w|z2), ..., P(w|zn) can be viewed as basis vectors for representing a document. Because the entries of each basis vector sum to 1, the reconstructed document must lie on the simplex spanned by these vectors. The error can be viewed as the distance from the original point to that simplex, and we want to learn bases that minimize the distance between the learned simplex and the training points.
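A tiny numeric illustration of this view (the bases and mixing weights below are made up): a convex combination of word distributions is itself a word distribution, so the reconstruction always stays on the simplex the bases span:

```python
import numpy as np

# Hypothetical bases: each row is a word distribution P(w|z_k) over 4 words.
bases = np.array([[0.7, 0.2, 0.05, 0.05],
                  [0.1, 0.1, 0.4, 0.4]])
mix = np.array([0.6, 0.4])   # mixing weights P(z|d) for some document d

# Reconstructed document distribution: a convex combination of the bases.
P_w_d = mix @ bases
print(P_w_d.sum())           # ≈ 1.0: the point lies on the simplex
```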

========================================================
Comment:
I think the most important part of this paper is the physical meaning of pLSA. It is not just a bunch of mathematical formulas and optimization; it also has a strong geometric interpretation, which convinced me that it can be applied to almost any kind of data rather than being an ad hoc solution. This paper is a good example of explaining complex math in a simple geometric way.
