Probabilistic latent semantic indexing on ShortScience.org

Probabilistic latent semantic indexing
Hofmann, Thomas
- 1999 via Local Bibsonomy
Keywords: semantic, latent, probabilistic, indexing

Summary by Evan Su 8 years ago

Probabilistic latent semantic indexing (PLSI) is an approach to document retrieval that models the joint probability of words and documents as a mixture of conditionally independent multinomial distributions, with the mixture taken over latent semantic classes. The model rests on two independence assumptions. First, the observed (document, word) pairs are assumed to be generated independently of one another, which corresponds to the bag-of-words approach. Second, conditioned on the latent class, words are generated independently of the specific document identity. Because the number of classes is smaller than the number of documents, the latent class acts as a bottleneck variable in predicting the distribution of words conditioned on documents.

Technical details

Given a word $w$ and a document $d$, their joint probability distribution is modeled as follows.

$$P(d,w) = P(d)P(w|d),$$

where

$$P(w|d) = \sum_{z\in Z} P(w|z)P(z|d)$$

and $z$ denotes a latent class.
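
As a rough illustration of the mixture decomposition, the sketch below evaluates $P(w|d)$ and $P(d,w)$ for a toy model. All numbers, sizes, and function names (`word_given_doc`, `joint`) are made up for this example and do not come from the paper.

```python
# Toy PLSI parameters (hypothetical numbers, for illustration only):
# 2 latent classes, 3 vocabulary words, 2 documents.
p_w_given_z = [      # p_w_given_z[z][w] = P(w|z); each row sums to 1
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
]
p_z_given_d = [      # p_z_given_d[d][z] = P(z|d); each row sums to 1
    [0.9, 0.1],
    [0.2, 0.8],
]
p_d = [0.5, 0.5]     # P(d)


def word_given_doc(d):
    """Return [P(w|d) for every word w], with P(w|d) = sum_z P(w|z) P(z|d)."""
    n_classes = len(p_w_given_z)
    n_words = len(p_w_given_z[0])
    return [
        sum(p_w_given_z[z][w] * p_z_given_d[d][z] for z in range(n_classes))
        for w in range(n_words)
    ]


def joint(d, w):
    """Return P(d,w) = P(d) P(w|d)."""
    return p_d[d] * word_given_doc(d)[w]


for d in range(len(p_d)):
    print("document", d, [round(p, 3) for p in word_given_doc(d)])
```

Each document's word distribution is a weighted blend of the two class-specific distributions, so a document can only differ from the others through its mixing weights $P(z|d)$. This is the bottleneck effect described above.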

Following the likelihood principle, one determines the distributions $P(d)$, $P(z|d)$, and $P(w|z)$ by maximizing the log-likelihood function

$$\mathcal{L} = \sum_{d \in D} \sum_{w \in W} n(d,w) \log P(d,w)$$

where $n(d,w)$ denotes the number of times word $w$ occurs in document $d$. The maximization is carried out with the Expectation-Maximization (EM) algorithm.
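
The summary does not spell out the EM updates, so the following is only a minimal sketch using the standard updates for this parameterization. The E-step computes the posterior $P(z|d,w) \propto P(w|z)P(z|d)$. The M-step re-estimates $P(w|z) \propto \sum_d n(d,w)P(z|d,w)$ and $P(z|d) \propto \sum_w n(d,w)P(z|d,w)$, while $P(d)$ is set once to $n(d)/N$. The function names, the toy count matrix, and the random initialisation are assumptions made for illustration. This is plain EM with no tempering or early stopping, and it assumes every document contains at least one word.

```python
import math
import random


def normalize(values):
    """Scale non-negative values so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def log_likelihood(counts, p_d, p_z_d, p_w_z):
    """L = sum_{d,w} n(d,w) log P(d,w), with P(d,w) = P(d) sum_z P(w|z) P(z|d)."""
    ll = 0.0
    for d, row in enumerate(counts):
        for w, n in enumerate(row):
            if n > 0:
                p_w_d = sum(p_w_z[z][w] * p_z_d[d][z] for z in range(len(p_w_z)))
                ll += n * math.log(p_d[d] * p_w_d)
    return ll


def plsi_em(counts, n_classes, n_iters=50, seed=0):
    """Fit P(d), P(z|d), P(w|z) to counts[d][w] = n(d,w) with plain EM."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    # P(d) has a closed-form maximum-likelihood estimate: n(d) / N.
    p_d = normalize([sum(row) for row in counts])
    # Random, strictly positive initialisation of P(z|d) and P(w|z).
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_classes)])
             for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_classes)]
    history = []
    for _ in range(n_iters):
        new_w_z = [[0.0] * n_words for _ in range(n_classes)]
        new_z_d = [[0.0] * n_classes for _ in range(n_docs)]
        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: posterior P(z|d,w) proportional to P(w|z) P(z|d).
                post = normalize([p_w_z[z][w] * p_z_d[d][z]
                                  for z in range(n_classes)])
                # Accumulate expected counts n(d,w) P(z|d,w) for the M-step.
                for z in range(n_classes):
                    new_w_z[z][w] += n * post[z]
                    new_z_d[d][z] += n * post[z]
        # M-step: renormalise the expected counts into distributions.
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]
        history.append(log_likelihood(counts, p_d, p_z_d, p_w_z))
    return p_d, p_z_d, p_w_z, history


# Hypothetical 4-document, 4-word count matrix with two rough "topics".
toy_counts = [
    [4, 3, 0, 0],
    [5, 2, 1, 0],
    [0, 0, 3, 4],
    [0, 1, 2, 5],
]
p_d, p_z_d, p_w_z, history = plsi_em(toy_counts, n_classes=2)
print("log-likelihood: first", round(history[0], 3), "last", round(history[-1], 3))
```

The recorded `history` should never decrease from one iteration to the next, which is a convenient sanity check for an EM implementation. Different seeds can converge to different local maxima.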

Results

![](http://1.bp.blogspot.com/-eSKjS0950ac/VUHFJ2Vv7fI/AAAAAAAAA4U/bMGoVNh5_O0/s1600/result.png)