
An Introduction to Latent Semantic Analysis, by Thomas K. Landauer, Peter W. Foltz, and Darrell Laham

Latent Semantic Analysis (LSA) analyzes word-word, word-passage, and passage-passage relationships, and the similarities it extracts correspond closely to human judgments of meaning. LSA doesn't use first-hand knowledge of the world; it extracts meaning from "episodes" of word usage in text. It ignores word order and logical relationships, representing each passage as a unitary expression of meaning rather than as relationships between successive words, so a word's representation is a kind of average of its meaning across all the passages it appears in. Dimensionality is reduced to approximate human cognition, which makes LSA a theory of knowledge representation as well as a method: the reduction addresses the problem of "insufficient evidence", or the "poverty of the stimulus". LSA uses a matrix decomposition algorithm to reduce the dimensionality; ideally, the dimensionality of the reconstruction matches that of the underlying passages. Results show that LSA's meaning similarities are close to human ones, that its rate of knowledge acquisition approximates that of humans, and that its performance depends strongly on the choice of dimensionality.
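To make the word-by-passage matrix concrete, here is a minimal sketch in Python; the toy passages and vocabulary are my own invention for illustration, not examples from the paper.

    # Build a toy word-by-passage count matrix: rows are words, columns are passages.
    from collections import Counter

    passages = [
        "the cat sat on the mat",        # invented example passages
        "the dog sat on the log",
        "cats and dogs are pets",
    ]

    vocab = sorted({w for p in passages for w in p.split()})
    counts = [[Counter(p.split())[w] for p in passages] for w in vocab]

    for word, row in zip(vocab, counts):
        print(f"{word:5s} {row}")   # each row is one word's frequency in each passage

Each cell of this matrix is just a raw count; the weighting and dimensionality reduction described below are what turn it into something that behaves like meaning.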

LSA can be used to test theories of human cognition. It skips over the order of words to capture relationships in word choice, using a preprocessing step that correlates words over many passages and contexts and drawing on a very large number of such relationships. Theories of human cognition cannot be settled by theoretical and philosophical arguments alone; they need working models like this that can be tested.

LSA is an automatic mathematical algorithm for extracting relationships in how words are used across passages. It doesn't use dictionaries, external knowledge, or grammar rules. First, represent the text as a matrix: each row is a word and each column is a text passage or context. Next, apply a preliminary transformation in which each word frequency in the matrix is weighted by a function of its importance. Then LSA applies singular value decomposition (SVD), a form of factor analysis, to the matrix, and the decomposition is truncated to reduce the dimensionality. In short: build a word-by-passage matrix, compute a linear decomposition of it, then reduce its dimensionality. The reduced representation attributes knowledge to words that never appeared in a given passage, much as humans acquire most word knowledge indirectly; roughly three-fourths of LSA's total comprehension vocabulary is inferred from passages that do not contain the word being learned. This is intuitively sensible: human children show rapid growth of vocabulary and knowledge, and humans routinely draw conclusions from missing data. Reducing the dimensionality of a representation is useful when the reduced representation matches the structure of the data; the data should not be perfectly regenerated. Similarity in the reduced space is measured by the cosine between vectors.
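Here is a minimal sketch, in Python with NumPy, of the decompose-and-truncate step described above; the small matrix X, the word labels in the comments, and the choice of k = 2 dimensions are assumptions for illustration, not values from the paper.

    # SVD of a small word-by-passage matrix, truncated to k dimensions,
    # with cosine similarity computed in the reduced space.
    import numpy as np

    # Hypothetical counts: rows might be words like "cat", "dog", "pet", "food".
    X = np.array([
        [2, 1, 0, 0],
        [0, 1, 2, 0],
        [1, 1, 1, 1],
        [0, 0, 1, 2],
    ], dtype=float)

    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = 2                                  # keep only the top k dimensions
    word_vecs = U[:, :k] * s[:k]           # word vectors in the reduced space
    passage_vecs = Vt[:k, :].T * s[:k]     # passage vectors in the same space

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cosine(word_vecs[0], word_vecs[1]))        # similarity of two words
    print(cosine(passage_vecs[0], passage_vecs[1]))  # similarity of two passages

Rebuilding X from only the top k singular values smooths over the raw counts, which is exactly the point above that the data should not be perfectly regenerated.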

Before the SVD is computed, LSA applies a preprocessing transformation to the matrix: each cell becomes the log of the word frequency, weighted by an entropy measure of the word over the passages it appears in. This weights each word occurrence by an estimate of its importance in the passage and of how much knowing the word tells you about the passages it appeared in. Such matrix transformations are used both in information retrieval and in models of human cognition. A web site provides LSA-based word and passage vectors, along with similarities between words and words, words and passages, and passages and passages. LSA is able to model human conceptual knowledge, and it links information retrieval with human semantic memory. Latent Semantic Indexing (LSI), like LSA, was tested against pre-examined documents, though direct comparisons were muddied by differences in word preprocessing. LSA does well on synonym tests, since most near neighbors (by cosine) are semantically related; its vectors were created from many passages, and it captures synonymy through the vocabulary knowledge it has accumulated. On the TOEFL vocabulary test it simulates human performance at choosing among word alternatives, and its errors were compared to student errors. The role of dimension reduction was analyzed, and LSA was shown to simulate word sorting and word relationships. Subject-matter knowledge and semantic priming also show up in LSA, and LSA predicts learning and text comprehension in humans.
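The log-times-entropy weighting mentioned here can be sketched as follows; this is one common form of log-entropy weighting, and the exact function used in the paper may differ in its details.

    # Log-entropy weighting of a raw word-by-passage count matrix X.
    import numpy as np

    X = np.array([
        [2, 1, 0, 0],
        [0, 1, 2, 0],
        [1, 1, 1, 1],
        [0, 0, 1, 2],
    ], dtype=float)

    local = np.log1p(X)                                # log(1 + count) per cell

    # Global weight per word: 1 + sum(p*log p)/log(n). It is near 0 for words
    # spread evenly over all passages and near 1 for words concentrated in one.
    p = X / np.maximum(X.sum(axis=1, keepdims=True), 1e-12)
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    n_passages = X.shape[1]
    global_w = 1.0 + plogp.sum(axis=1) / np.log(n_passages)

    X_weighted = local * global_w[:, None]             # this is what the SVD sees

It is this weighted matrix, not the raw counts, that gets decomposed; the first question below is about exactly this step.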

Questions:

Landauer et al. deny that LSA is a word-frequency counter. However, the first step they take is the log of the word frequency times an entropy measure. Disingenuous, or deluding themselves?

Oh god, I hope LSA doesn't catch on, though I fear it will. Would it be better to repeat words (thus achieving high frequency count and correlations) or to try for arcane synonyms and push up the "human thesaurus" ability? Maybe I'll start to do a little "preprocessing" of my own before I hand in any LSA-graded articles.

I've always been proud of my ability to link sentences together with transitions and "leading words". I've worked hard at making my written word resemble my speech. However, LSA throws away any idea of word succession, or the talent of building up an argument by a linear progression of examples. Will all of our thoughts, quickly resemble, this type, and style, of speech? Will the (Orwellian double-speak) fuzzy-logic liberal-arts postmodern incoherent and difficult-to-read analysis of subtext win out, in the end?