Wednesday, May 16, 2012

Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is an improved version of pLSI (if you want to check the details, please click here). Its concept is very similar to pLSI's: pLSI tries to learn all the probabilities p(wi|zn) that word i appears in latent topic n, and so does LDA. However, pLSI has some critical problems. Because pLSI only learns topic mixtures p(z|d) for those documents on which it is trained, there is no natural way to use it to assign probability to a previously unseen document. The other problem is that the number of parameters pLSI needs to learn is kV + kM (k latent topics, V vocabulary words, M documents), which grows linearly with the number of training documents. The linear growth in parameters suggests that the model is prone to overfitting, and, empirically, overfitting is indeed a serious problem for pLSI.

LDA solves these problems by treating the topic mixture weights as a k-parameter hidden random variable rather than learning a set of individual parameters for each training document. Thus, it only needs to learn k + kV parameters: k parameters for the Dirichlet prior that generates the topic mixture weights (α), and kV parameters for the topic-word distributions (β). Furthermore, the k + kV parameters of a k-topic LDA model do not grow with the size of the training corpus, so LDA does not suffer from the same overfitting problem.
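For example, with k = 100 topics, a vocabulary of V = 10,000 words, and M = 1,000,000 training documents (hypothetical sizes, just for illustration), pLSI would have to fit kV + kM ≈ 1.01 × 10^8 parameters, while LDA needs only k + kV = 1,000,100, no matter how large M becomes.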

The following are the mathematical details.
Assume that, given the parameters α and β, the joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w is

p(θ, z, w | α, β) = p(θ | α) ∏_{n=1}^{N} p(z_n | θ) p(w_n | z_n, β),

where p(θ|α) is a Dirichlet distribution.
Its graphical representation is the usual three-level plate diagram: α and β are corpus-level parameters, the topic mixture θ is sampled once per document, and the topic z_n and word w_n are sampled once for each of the N words in the document.
Then we can get p(w | α, β) by integrating over θ and summing over z:

p(w | α, β) = ∫ p(θ | α) ( ∏_{n=1}^{N} ∑_{z_n} p(z_n | θ) p(w_n | z_n, β) ) dθ

After that, we can derive p(D | α, β) by multiplying together the probabilities p(w | α, β) of all the documents in the corpus D:

p(D | α, β) = ∏_{d=1}^{M} ∫ p(θ_d | α) ( ∏_{n=1}^{N_d} ∑_{z_dn} p(z_dn | θ_d) p(w_dn | z_dn, β) ) dθ_d

These are the central equations of LDA.

The main difference between pLSI and LDA can be highlighted in the following figure.
pLSI learns a latent topic combination for every document in the training set, so the learned combination of one document can be viewed as one point in the figure. Instead of learning the combination of every document in the training set, LDA tries to learn a distribution describing how probable each combination of latent topics is. That is, it learns a distribution over the latent topic space that describes where the documents of the corpus appear.

When new data comes, both pLSI and LDA fix the latent topics and try to find one point in the latent topic space that minimizes the reconstruction error. They need almost the same computational cost to locate the new document in the latent topic space. However, after the document is located, LDA can give one more piece of information than pLSI: LDA can tell us how probable the new document is under the current corpus model, while pLSI gives every document the same weight and thus cannot tell us that probability.
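
To make this concrete, here is a minimal sketch of folding in an unseen document, assuming scikit-learn's LatentDirichletAllocation (the toy documents and parameter values are illustrative only, not from the original post):

```python
# A rough sketch, not from the original post: scikit-learn's LDA on toy data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

train_docs = ["the cat sat on the mat",
              "dogs and cats are common pets",
              "stock prices fell sharply today",
              "the market rallied after the news"]
new_doc = ["cats watched the market"]   # a previously unseen document

# Bag-of-words counts; the vocabulary V is learned from the training set.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# Fit a k-topic model (doc_topic_prior corresponds to alpha in the paper's notation).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X_train)

# Because the topic mixture theta has a learned prior, an unseen document can be
# folded in without retraining:
X_new = vectorizer.transform(new_doc)
print(lda.transform(X_new))   # posterior topic mixture of the new document
print(lda.score(X_new))       # approximate log-likelihood of the new document
```

Here transform plays the role of locating the new document in the latent topic space, and score reports (an approximation of) its probability under the learned model.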

Wednesday, May 2, 2012

A Global Geometric Framework for Nonlinear Dimensionality Reduction

The goal of this paper is almost the same as the paper I read last week (Nonlinear Dimensionality Reduction by Locally Linear Embedding). Actually, they come from the same issue of Science, and this paper appears right before that one. The same concept was developed by two different teams at almost the same time, which is quite interesting.

In this paper, they also want to find the low-dimensional manifold structure underlying the high-dimensional space.
The example is also the "Swiss roll".

(Figure: the Swiss roll example from the paper.)

"A" shows the original data in the high-dimensional space, and they want to recover the underlying structure and unfold it as in "C".
They also assume that the geodesic distance can be approximated by adding up a sequence of "short hops" between neighboring points (this concept is the same as in the previous paper).
However, they use a different method to recover the underlying structure.
They propose the isometric feature mapping (Isomap) algorithm to solve it. The following table is an overview of their algorithm.

(Table: overview of the three steps of the Isomap algorithm.)

In the first step, they find the neighbors of every point (either all points whose distance is less than a threshold ε, or the K nearest neighbors), and construct a weighted graph G over the data points by connecting neighboring points.
Every point is viewed as a node in G, and the weight of an edge is defined as dX(i, j), the distance between the neighboring points i and j.
After constructing this graph, they compute the shortest path distance between every pair of nodes with a dynamic programming method (the Floyd–Warshall algorithm), and denote this distance by dG(i, j).
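
A minimal sketch of these first two steps, assuming numpy, scipy, and scikit-learn as helpers (the data, K, and the use of Dijkstra instead of Floyd–Warshall are my own choices for illustration):

```python
# A rough sketch of Isomap steps 1-2, not the authors' original code.
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

X = np.random.rand(500, 3)   # toy high-dimensional data standing in for a Swiss roll

# Step 1: connect each point to its K nearest neighbors, edge weight = dX(i, j).
# (For a real data set, check that the resulting graph is connected.)
K = 10
G = kneighbors_graph(X, n_neighbors=K, mode='distance')

# Step 2: dG(i, j) = length of the shortest path between i and j in G.
# The paper uses Floyd-Warshall; Dijkstra from every node gives the same distances.
D_G = shortest_path(G, method='D', directed=False)
```
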
Then we know the (approximate geodesic) distance between every two points on the underlying structure, so we can start placing the points in a new low-dimensional space. They use classical multidimensional scaling (MDS) to embed the points into a d-dimensional Euclidean space Y that best preserves the distances on the graph.
This is accomplished by minimizing the following cost function.

E = ‖τ(DG) − τ(DY)‖_{L²}

where DY denotes the matrix of Euclidean distances {dY(i, j) = ‖yi − yj‖} and DG the matrix of graph distances {dG(i, j)}. The τ operator converts distances to inner products, which supports efficient optimization; in the paper it is defined as τ(D) = −HSH/2, where S is the matrix of squared distances and H is the centering matrix. This cost function is minimized by setting the coordinates yi to the top d eigenvectors of the matrix τ(DG).
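
A minimal sketch of this last step, continuing from the snippet above (numpy only; the function name classical_mds is mine):

```python
import numpy as np

def classical_mds(D_G, d):
    """Step 3: embed the points in d dimensions from the graph distance matrix D_G."""
    N = D_G.shape[0]
    # The tau operator: tau(D) = -H S H / 2, with S the squared distances and
    # H the centering matrix; it converts distances to inner products.
    S = D_G ** 2
    H = np.eye(N) - np.ones((N, N)) / N
    B = -0.5 * H @ S @ H
    # Coordinates y_i come from the top d eigenvectors of tau(D_G),
    # scaled by the square roots of their eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(B)
    top = np.argsort(eigvals)[::-1][:d]
    return eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))

# e.g. Y = classical_mds(D_G, d=2)
```
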
The true dimensionality of the data can be estimated from how the error decreases as the dimensionality of Y is increased. That is shown in the figure below (the true dimensionality of the underlying structure is marked by arrows).

(Figure: error as a function of the dimensionality of Y for each data set; arrows mark the true dimensionality.)

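A hedged sketch of that estimate, reusing D_G and classical_mds from the snippets above (the residual variance 1 − R² between graph distances and embedding distances is used as the error):

```python
import numpy as np

for d in range(1, 11):
    Y = classical_mds(D_G, d)
    # Pairwise Euclidean distances in the d-dimensional embedding.
    D_Y = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1))
    # Residual variance: 1 - R^2 between graph and embedding distances.
    r = np.corrcoef(D_G.ravel(), D_Y.ravel())[0, 1]
    print(d, 1.0 - r ** 2)
```

The dimensionality at which this error stops decreasing noticeably is the estimate.
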
They report experiments on three data sets in this paper. For all three data sets, the natural appearance of linear interpolations between distant points in the low-dimensional coordinate space confirms that Isomap has captured the data's perceptually relevant structure.