Wednesday, May 16, 2012

Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is an improved version of pLSI (if you want to check the details, please click here). Its concept is very similar to pLSI's: pLSI tries to learn all the probabilities p(w_i|z_n) that word i appears under latent topic n, and LDA does the same. However, pLSI has some critical problems. Because pLSI only learns topic mixtures p(z|d) for the documents it is trained on, there is no natural way to use it to assign a probability to a previously unseen document. The other problem is that the number of parameters pLSI needs to learn is kV + kM, which grows linearly with the number of training documents M. This linear growth suggests that the model is prone to overfitting and, empirically, overfitting is indeed a serious problem for pLSI.

LDA solves these problems by treating the topic mixture weights as a k-parameter hidden random variable rather than as a separate set of parameters tied to each training document. Thus, it only needs to learn k + kV parameters: the k parameters of the Dirichlet prior α that generates the topic mixture weights, and the kV parameters of β, the per-topic word distributions. Furthermore, the k + kV parameters of a k-topic LDA model do not grow with the size of the training corpus, so LDA does not suffer from the same kind of overfitting.
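To make the gap concrete, plug in some illustrative (made-up) sizes, say k = 100 topics, V = 10,000 vocabulary terms, and M = 1,000,000 training documents:

\[
\underbrace{kV + kM}_{\text{pLSI}} = 100 \cdot 10^{4} + 100 \cdot 10^{6} \approx 1.01 \times 10^{8},
\qquad
\underbrace{k + kV}_{\text{LDA}} = 100 + 100 \cdot 10^{4} \approx 10^{6} .
\]

pLSI's count is dominated by the kM document-specific mixture weights and keeps growing as documents are added, while LDA's count stays the same no matter how large the corpus becomes.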

The following are the mathematical details.
Assume that, given the parameters α and β, the joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w is

\[
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta),
\]

where p(θ|α) is a Dirichlet distribution.
Its graphical representation is the usual three-level plate diagram: α and β are corpus-level parameters, θ is drawn once per document, and z and w are drawn once per word.
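The joint distribution above corresponds to a simple generative process: draw a topic mixture θ from the Dirichlet, then for each word position draw a topic z_n from θ and a word w_n from that topic's word distribution. Here is a minimal sketch of that process with NumPy; the sizes, α, and β below are made-up toy values, not parameters from any real corpus.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (made-up, for illustration only).
k, V, N = 3, 8, 20                         # topics, vocabulary size, words per document
alpha = np.full(k, 0.5)                    # symmetric Dirichlet prior over topic mixtures
beta = rng.dirichlet(np.ones(V), size=k)   # beta[i, j] = p(word j | topic i)

def generate_document():
    theta = rng.dirichlet(alpha)           # theta ~ Dir(alpha): this document's topic mixture
    words = []
    for _ in range(N):
        z = rng.choice(k, p=theta)         # z_n ~ Multinomial(theta)
        w = rng.choice(V, p=beta[z])       # w_n ~ Multinomial(beta_{z_n})
        words.append(w)
    return theta, words

theta, words = generate_document()
print("topic mixture:", np.round(theta, 3))
print("word ids:", words)
```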
Then we can get p(w|α, β) by integrating over θ and summing over z:

\[
p(\mathbf{w} \mid \alpha, \beta)
  = \int p(\theta \mid \alpha)
    \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta .
\]

After that, we can derive p(D|α, β) by multiplying together the probabilities p(w|α, β) of all the documents in the corpus D:

\[
p(D \mid \alpha, \beta)
  = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha)
    \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d .
\]

That is the mathematical formulation of LDA.
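The integral over θ has no closed form, which is why LDA relies on approximate inference (variational methods or Gibbs sampling) in practice. Just to make the quantity concrete, here is a naive Monte Carlo sketch of p(w|α, β) for a toy model; all the values are made up and the estimator is far too crude for real corpora.

```python
import numpy as np

rng = np.random.default_rng(0)
k, V = 3, 8                                   # toy sizes (made-up)
alpha = np.full(k, 0.5)                       # Dirichlet prior over topic mixtures
beta = rng.dirichlet(np.ones(V), size=k)      # beta[i, j] = p(word j | topic i)
words = [0, 3, 3, 7, 1]                       # a toy document as word ids

def log_likelihood_mc(words, alpha, beta, n_samples=5000):
    """Naive Monte Carlo estimate of log p(w | alpha, beta):
    draw theta ~ Dir(alpha), average prod_n sum_z p(z | theta) p(w_n | z, beta)."""
    thetas = rng.dirichlet(alpha, size=n_samples)   # (S, k) samples of theta
    word_probs = thetas @ beta[:, words]            # (S, N): p(w_n | theta) for each sample
    return np.log(word_probs.prod(axis=1).mean())

print("log p(w | alpha, beta) ~", log_likelihood_mc(words, alpha, beta))
```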

The main difference between pLSI and LDA can be highlighted with the following figure.
pLSI learns a latent topic combination for every document in the training set, so each learned combination can be viewed as one point in the figure (a point on the topic simplex). Instead of learning a separate combination for each training document, LDA learns a distribution that describes how probable each combination of latent topics is. In other words, it describes the corpus by a distribution over the latent topic space in which documents live.
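To illustrate that last point: the corpus-level object LDA learns is just the Dirichlet parameter α, and any point on the topic simplex can be scored under it or sampled from it. The sketch below uses SciPy with hypothetical, made-up values for a 3-topic model.

```python
import numpy as np
from scipy.stats import dirichlet

# Hypothetical learned Dirichlet parameters for a 3-topic model (made-up values).
alpha = np.array([2.0, 5.0, 1.0])

# Any topic mixture can be scored, even one never seen during training.
theta_new = np.array([0.2, 0.7, 0.1])
print("density p(theta | alpha):", dirichlet.pdf(theta_new, alpha))

# Sampling from p(theta | alpha) shows which topic combinations the corpus favours.
print(np.round(dirichlet.rvs(alpha, size=5, random_state=0), 2))
```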

When a new document arrives, both pLSI and LDA keep the learned latent topics fixed and try to find the point in the latent topic space that best reconstructs the document. They need roughly the same computational cost to locate the new document in that space. However, once the document is located, LDA provides one more piece of information than pLSI: it can tell us how probable the new document's topic mixture is under the corpus-level distribution, while pLSI implicitly gives every document the same weight and therefore cannot provide that probability.
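As a rough illustration of this fold-in step, here is a sketch using scikit-learn's LatentDirichletAllocation (which fits LDA with variational inference). The tiny corpus and all the settings are made up; the point is only that a previously unseen document can be mapped to a topic mixture with the learned topics held fixed.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# A made-up toy corpus with two obvious themes.
train_docs = [
    "apple banana fruit juice",
    "goal match football score",
    "fruit smoothie banana apple",
    "football player score goal",
]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# Fit a 2-topic LDA model on the training word counts.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X_train)

# Fold in a previously unseen document: infer its topic mixture
# with the topics (and the Dirichlet prior) held fixed.
X_new = vectorizer.transform(["banana juice and apple smoothie"])
print(lda.transform(X_new))   # e.g. something like [[0.9, 0.1]]: mostly the fruit topic
```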
