Probabilistic latent semantic indexing
Paper summary

Probabilistic latent semantic indexing (PLSI) is an approach to document retrieval that models the joint probability of words and documents as a mixture of conditionally independent multinomial distributions, with the mixture components indexed by latent semantic classes. The model rests on two independence assumptions. First, the observed (document, word) pairs are assumed to be generated independently of one another, which corresponds to the bag-of-words view. Second, conditioned on the latent class, words are generated independently of the specific document identity. Because the number of classes is smaller than the number of documents, the latent class acts as a bottleneck variable in predicting the distribution of words conditioned on documents.

Technical details

Given a word $w$ and a document $d$, their joint probability is modeled as

$$P(d,w) = P(d)P(w|d), \tag{1}$$

where

$$P(w|d) = \sum_{z\in Z} P(w|z)P(z|d) \tag{2}$$

and $z$ denotes a latent class. Following the likelihood principle, one determines the distributions in (1) and (2), namely $P(d)$, $P(w|z)$ and $P(z|d)$, by maximizing the log-likelihood function

$$\mathcal{L} = \sum_{d \in D} \sum_{w \in W} n(d,w) \log P(d,w),$$

where $n(d,w)$ is the number of times word $w$ occurs in document $d$. The maximization is carried out with the Expectation Maximization (EM) algorithm.
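The EM fit mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: it runs plain EM (no tempering or regularization) on the parameters $P(d)$, $P(w|z)$ and $P(z|d)$ of the model given above. The function name `plsi_em`, its arguments, and the toy count matrix are hypothetical and chosen only for this example. The E-step computes the posterior $P(z|d,w) \propto P(z|d)P(w|z)$, and the M-step re-estimates $P(w|z)$ and $P(z|d)$ from the counts $n(d,w)$ weighted by that posterior.

```python
import numpy as np


def plsi_em(counts, n_topics, n_iter=50, seed=0):
    """Fit P(d), P(w|z), P(z|d) to a (docs x words) count matrix with plain EM.

    Hypothetical helper for illustration; assumes every document has at
    least one word so that each row of P(z|d) can be normalized.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    eps = 1e-12  # guards against division by zero and log(0)

    # Random initialization; each row is normalized to a distribution.
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)

    # P(d) has a closed-form maximum-likelihood estimate: relative document length.
    p_d = counts.sum(axis=1) / counts.sum()

    log_liks = []
    for _ in range(n_iter):
        # Log-likelihood under the current parameters: sum n(d,w) log P(d,w).
        p_w_d = p_z_d @ p_w_z  # shape (docs, words)
        log_liks.append(float(np.sum(counts * np.log(p_d[:, None] * p_w_d + eps))))

        # E-step: posterior P(z|d,w), shape (docs, words, topics).
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        post = joint / (joint.sum(axis=2, keepdims=True) + eps)

        # M-step: expected counts n(d,w) * P(z|d,w), then renormalize.
        weighted = counts[:, :, None] * post
        p_w_z = weighted.sum(axis=0).T  # shape (topics, words)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + eps
        p_z_d = weighted.sum(axis=1)  # shape (docs, topics)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + eps

    return p_d, p_w_z, p_z_d, log_liks


# Toy example (made-up counts): two documents favor the first two words,
# two documents favor the last two.
toy_counts = np.array(
    [[4, 3, 0, 0],
     [5, 2, 1, 0],
     [0, 0, 3, 4],
     [0, 1, 2, 5]],
    dtype=float,
)
p_d, p_w_z, p_z_d, log_liks = plsi_em(toy_counts, n_topics=2, n_iter=30)
print("log-likelihood at first and last iteration:", log_liks[0], log_liks[-1])
```

The dense `(docs, words, topics)` posterior array keeps the sketch short but only suits small vocabularies; a practical implementation would iterate over the nonzero entries of $n(d,w)$ instead.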
Probabilistic latent semantic indexing
Hofmann, Thomas
- 1999 via Bibsonomy
Keywords: semantic, latent, probabilistic, indexing
