[Recommended] A Perfect Tutorial on LSI (Latent Semantic Indexing)

"instead of lecturing about SVD I want to show you how things work --step by step"

-- If you agree with this statement, then this tutorial by Dr. E. Garcia is the LSI / LSA tutorial best suited to you.




Latent Semantic Indexing (LSI) Fast Track Tutorial
Singular Value Decomposition (SVD) Fast Track Tutorial



1) is theming (analysis of themes).

2) is used by search engines to find all the nouns and verbs, and then associate them with related (substitution-useful) nouns and verbs.

3) allows search engines to "learn" which words are related and which noun concepts relate to one another.

4) is a form of on-topic analysis (term scope/subject analysis) and can be applied to collections of any size.

5) has no problem addressing polysemy (terms with different meanings).

Pasted from <http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-1-understanding.html>

II. In essence, LSI identifies words that have second-order co-occurrence at the document level and places them in the same subspace. Therefore:



A persistent myth in search marketing circles is that LSI grants contextuality; i.e., that it captures terms occurring in the same context. This is not always the case. Consider two documents, X and Y, and three terms, A, B and C, wherein:

A and B do not co-occur.

X mentions terms A and C.

Y mentions terms B and C.

∴ A---C---B

The common denominator is C, so we define this relation as an in-transit co-occurrence since both A and B occur while in transit with C. This is called second-order co-occurrence and is a special case of high-order co-occurrence.
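The A---C---B relation above can be sketched numerically. The following is a minimal illustration, not code from the tutorial: it assumes a binary term-document matrix for the two documents, uses NumPy, and keeps a single latent dimension (k = 1). The variable names are hypothetical. It shows that A and B never co-occur directly, are linked by one path through C, and end up aligned in the reduced space.

```python
import numpy as np

# Assumed toy data: rows are terms A, B, C; columns are documents X, Y.
# X mentions A and C; Y mentions B and C (binary weights, an assumption).
terms = ["A", "B", "C"]
M = np.array([
    [1.0, 0.0],  # A
    [0.0, 1.0],  # B
    [1.0, 1.0],  # C
])

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# First-order co-occurrence: entry (i, j) counts documents shared by terms i and j.
first_order = M @ M.T

# Drop self-links, then count two-step paths (term -> shared term -> term).
links = first_order - np.diag(np.diag(first_order))
second_order_paths = links @ links

# Truncated SVD: keep only the k largest singular values.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 1  # assumed number of latent dimensions for this sketch
term_vectors = U[:, :k] * s[:k]  # term coordinates in the reduced space

cos_raw = cosine(M[0], M[1])                        # A vs B, original space
cos_lsi = cosine(term_vectors[0], term_vectors[1])  # A vs B, reduced space

print("direct co-occurrences of A and B:", first_order[0, 1])
print("two-step paths from A to B:", second_order_paths[0, 1])
print("cosine(A, B) before truncation:", round(cos_raw, 3))
print("cosine(A, B) after truncation:", round(cos_lsi, 3))
```

In this toy case the cosine between A and B goes from 0 in the original space to 1 in the rank-1 space, purely because both share documents with C. The computation knows nothing about what X and Y actually discuss, so the similarity reflects in-transit co-occurrence rather than shared context.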

However, the mere fact that terms A and B are in transit with C does not grant contextuality, as the terms can be mentioned in different contexts in documents X and Y. This would be the case, for example, if X and Y discuss different topics. Long documents are more prone to this.

Even if X and Y are each monotopic, they might be discussing different subjects. Thus, it would be fallacious to assume that high-order co-occurrence between A and B while in transit with C equates to a contextuality relationship between the terms. Add polysemy to this and the scenario worsens, as LSI can fail to address polysemy.

Pasted from <http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-1-understanding.html>

