LDA solves these problems by treating the topic mixture weights as a k-parameter hidden random variable rather than as a set of individual parameters learned separately for each training document. Thus it only needs to learn k + kV parameters: k parameters for the Dirichlet prior α that generates the topic mixture weights, and kV parameters for β. Furthermore, the k + kV parameters of a k-topic LDA model do not grow with the size of the training corpus, so the model is far less prone to overfitting.
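As a rough sanity check, here is a back-of-the-envelope comparison; the concrete values of k, V, and the corpus sizes below are made up purely for illustration, and the pLSI count kV + kM follows the usual formulation in which each training document gets its own topic mixture.

```python
# Hypothetical sizes chosen only for illustration.
k = 100        # number of latent topics
V = 10_000     # vocabulary size

lda_params = k + k * V           # alpha (k) + beta (k x V); independent of corpus size
for M in (1_000, 100_000):       # number of training documents
    plsi_params = k * V + k * M  # topic-word probabilities + one topic mixture per document
    print(f"M={M:>7}: LDA={lda_params:,}  pLSI={plsi_params:,}")
```

With these numbers LDA stays at 1,000,100 parameters regardless of corpus size, while pLSI grows from 1,100,000 to 11,000,000 parameters as the corpus grows.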
The following are the mathematical details.
Assume that, given the parameters α and β, the joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w is

p(θ, z, w|α, β) = p(θ|α) ∏_{n=1}^{N} p(z_n|θ) p(w_n|z_n, β),

where p(θ|α) is a Dirichlet distribution.
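For reference, the Dirichlet density over the (k−1)-simplex, written in its standard form, is:

```latex
p(\theta \mid \alpha)
  = \frac{\Gamma\!\left(\sum_{i=1}^{k} \alpha_i\right)}
         {\prod_{i=1}^{k} \Gamma(\alpha_i)}
    \, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}
```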
Its graphical representation is the standard three-level plate diagram, with α and β at the corpus level, θ at the document level, and z and w at the word level.
Then we can get p(w|α, β) by integrating over θ and summing over z.
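Written out, this is the per-document marginal, following the usual LDA formulation:

```latex
p(\mathbf{w} \mid \alpha, \beta)
  = \int p(\theta \mid \alpha)
    \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right)
    d\theta
```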
After that, we can obtain p(D|α, β) by multiplying together the probabilities p(w|α, β) of all the documents in the corpus D.
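In symbols, writing w_d for the words of document d and N_d for its length (notation introduced here just for clarity), the corpus likelihood is:

```latex
p(D \mid \alpha, \beta)
  = \prod_{d=1}^{M}
    \int p(\theta_d \mid \alpha)
    \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right)
    d\theta_d
```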
Together, these equations define the LDA model.
The main difference between pLSI and LDA can be highlighted in the following figure.
pLSI learns a latent topic mixture for every document in the training set, so the learned mixture of one document can be viewed as a single point in the above figure. Instead of learning a separate mixture for every training document, LDA learns a distribution over the latent topic space that describes how likely each combination of latent topics is to appear. In other words, it describes the corpus by a distribution over documents in the latent topic space.
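A small numerical sketch of this distinction (the number of topics, the α values, and the document mixtures below are all made up): pLSI memorizes one mixture per training document, while LDA keeps only α and can score or sample arbitrary points of the simplex.

```python
import numpy as np
from scipy.stats import dirichlet

rng = np.random.default_rng(0)

k = 3                               # number of latent topics (illustrative)
alpha = np.array([0.8, 0.5, 0.3])   # made-up Dirichlet parameters learned by LDA

# pLSI: one explicit topic mixture stored for each of M training documents.
M = 5
plsi_mixtures = rng.dirichlet(np.ones(k), size=M)   # stands in for learned per-document points

# LDA: no per-document parameters; it keeps only alpha and describes the whole
# simplex with a density, from which new topic mixtures can be sampled or scored.
sampled_theta = dirichlet.rvs(alpha, size=3, random_state=rng)
density_at_points = dirichlet.pdf(plsi_mixtures.T, alpha)

print("pLSI stores", plsi_mixtures.shape, "numbers that grow with the corpus")
print("LDA stores only alpha =", alpha)
print("Dirichlet density at the pLSI points:", np.round(density_at_points, 3))
```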
When a new document arrives, both pLSI and LDA keep the latent topics fixed and try to find one point in the latent topic space that minimizes the reconstruction error, and they need roughly the same computational cost to locate the new document in that space. However, once the document is located, LDA provides one more piece of information than pLSI: it can tell us how probable the new document's topic mixture is under the corpus-level distribution, whereas pLSI gives every document the same weight and therefore cannot tell us that probability.
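Continuing the sketch above: once a topic mixture has been inferred for an unseen document (by whatever inference procedure is used), only LDA can attach a probability density to it. The mixture used here is hypothetical.

```python
import numpy as np
from scipy.stats import dirichlet

alpha = np.array([0.8, 0.5, 0.3])        # same made-up corpus-level Dirichlet as above

# Suppose inference has placed a new, unseen document at this point of the simplex.
theta_new = np.array([0.7, 0.2, 0.1])    # hypothetical inferred topic mixture

# LDA: the Dirichlet assigns a density to this mixture, i.e. a notion of how
# typical such a document is for the corpus.
print("LDA density at theta_new:", dirichlet.pdf(theta_new, alpha))

# pLSI: the model only has the mixtures of the training documents; a brand-new
# point carries no probability under the model, so no comparable score exists.
```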