Technique for information retrieval using enhanced latent semantic analysis generating rank approximation matrix by factorizing the weighted morpheme-by-document matrix
First Claim
1. A computer implemented method of information retrieval, comprising:
parsing a corpus to identify instances of wordforms within each document of the corpus;
performing a morphological tokenization on the wordforms to convert the wordforms into morphemes by separating affixes from their stems;
generating a morpheme-by-document matrix based at least in part on a number of instances of the affixes and the stems within each document of the corpus, wherein the morpheme-by-document matrix accounts for information related to past tense and present tense inflections by separately enumerating the instances of the affixes from the instances of the stems;
applying a weighting function to attribute-values within the morpheme-by-document matrix to generate a weighted morpheme-by-document matrix, wherein applying the weighting function includes:
applying the weighting function to the attribute-values of the stems; and
applying the weighting function to the attribute-values of the affixes separately from the attribute-values of the stems to separately account for relative importance of the affixes;
generating at least one lower rank approximation matrix by factorizing the weighted morpheme-by-document matrix; and
retrieving information with reference to the at least one lower rank approximation matrix.
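The steps of claim 1 can be sketched end to end in a few functions. This is a minimal illustration under stated assumptions, not the patented implementation: the suffix list, the `-` prefix used to mark affix rows, the log-entropy weighting, the `affix_scale` parameter, and the fold-in cosine ranking are all hypothetical choices made for the example, since the claim does not name a specific tokenizer, weighting function, or retrieval rule.

```python
import re
from collections import Counter

import numpy as np

# Hypothetical toy suffix list; a real morphological tokenizer is far richer.
SUFFIXES = ("ing", "ed", "s")


def tokenize_morphemes(wordform):
    """Split a wordform into a stem and, if present, one affix marked with '-'."""
    for suffix in SUFFIXES:
        if wordform.endswith(suffix) and len(wordform) - len(suffix) >= 3:
            return [wordform[: -len(suffix)], "-" + suffix]
    return [wordform]


def morpheme_by_document(corpus):
    """Count stems and affixes as separate rows of a morpheme-by-document matrix."""
    counts = [
        Counter(m for w in re.findall(r"[a-z]+", doc.lower())
                for m in tokenize_morphemes(w))
        for doc in corpus
    ]
    vocab = sorted(set().union(*counts))
    matrix = np.array([[c[m] for c in counts] for m in vocab], dtype=float)
    return vocab, matrix


def log_entropy(block):
    """Log-entropy weighting of a block of rows (an assumed weighting function)."""
    n_docs = block.shape[1]
    p = block / block.sum(axis=1, keepdims=True)
    safe = np.where(p > 0, p, 1.0)  # log(1) = 0, so empty cells contribute nothing
    global_weight = 1.0 + (p * np.log(safe)).sum(axis=1) / np.log(n_docs)
    return np.log1p(block) * global_weight[:, None]


def weight_separately(vocab, matrix, affix_scale=0.5):
    """Weight stem rows and affix rows in two separate passes.

    affix_scale is a hypothetical knob for the relative importance of affixes;
    without it, a purely row-wise weighting would give the same result either way.
    """
    is_affix = np.array([m.startswith("-") for m in vocab])
    weighted = np.empty_like(matrix)
    weighted[~is_affix] = log_entropy(matrix[~is_affix])
    weighted[is_affix] = affix_scale * log_entropy(matrix[is_affix])
    return weighted


def low_rank_factors(weighted, rank):
    """Truncated SVD: the factors of a rank-`rank` approximation."""
    u, s, vt = np.linalg.svd(weighted, full_matrices=False)
    return u[:, :rank], s[:rank], vt[:rank]


def retrieve(query, vocab, u, s, vt):
    """Rank documents by cosine similarity to the query folded into latent space.

    Assumes the retained singular values in `s` are nonzero.
    """
    index = {m: i for i, m in enumerate(vocab)}
    q = np.zeros(len(vocab))
    for w in re.findall(r"[a-z]+", query.lower()):
        for m in tokenize_morphemes(w):
            if m in index:
                q[index[m]] += 1.0
    q_hat = (q @ u) / s
    docs = vt.T  # one row per document
    sims = docs @ q_hat / (
        np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat) + 1e-12
    )
    return np.argsort(-sims)


corpus = [
    "The dog walked and barked",
    "Dogs walk and bark",
    "Stocks traded higher",
    "Traders buy stocks",
]
vocab, counts = morpheme_by_document(corpus)
weighted = weight_separately(vocab, counts)
u, s, vt = low_rank_factors(weighted, rank=2)
ranking = retrieve("dog barking", vocab, u, s, vt)  # document indices, best first
```

Keeping affixes as their own rows means that "walked" contributes one count to `walk` and one to `-ed`, so the tense marker survives tokenization instead of being discarded as it would be under plain stemming.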
Abstract
A technique for information retrieval includes parsing a corpus to identify a number of wordform instances within each document of the corpus. A weighted morpheme-by-document matrix is generated based at least in part on the number of wordform instances within each document of the corpus and based at least in part on a weighting function. The weighted morpheme-by-document matrix separately enumerates instances of stems and affixes. Additionally or alternatively, a term-by-term alignment matrix may be generated based at least in part on the number of wordform instances within each document of the corpus. At least one lower rank approximation matrix is generated by factorizing the weighted morpheme-by-document matrix and/or the term-by-term alignment matrix.
32 Citations
26 Claims
11. A computer implemented method of information retrieval, comprising:
parsing a corpus to identify instances of wordforms within each document of the corpus;
performing a morphological tokenization on the wordforms to convert the wordforms into morphemes by separating affixes from their stems;
generating a morpheme-by-document matrix based at least in part on a number of instances of the affixes and the stems within each document of the corpus, wherein the morpheme-by-document matrix accounts for information related to past tense and present tense inflections by separately enumerating the instances of the affixes from the instances of the stems;
generating a term-by-term alignment matrix based at least in part on the number of instances of the affixes and the stems within each document of the corpus and the morpheme-by-document matrix; and
generating at least one lower rank approximation matrix by factorizing the term-by-term alignment matrix.
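Claim 11 factorizes a term-by-term alignment matrix rather than the morpheme-by-document matrix itself. The claim text here does not say how the alignment scores are computed, so the sketch below makes an explicit assumption: the alignment between two morphemes is taken to be the cosine similarity of their rows in the morpheme-by-document matrix. The function names and the toy matrix are hypothetical, and the eigendecomposition is used only because the assumed matrix is symmetric and positive semidefinite.

```python
import numpy as np


def alignment_matrix(x):
    """Term-by-term matrix of row cosine similarities (an assumed alignment score)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.where(norms > 0, norms, 1.0)
    return unit @ unit.T


def low_rank_eig(a, rank):
    """Keep the `rank` largest eigenpairs of a symmetric PSD matrix.

    For a PSD matrix the eigenvalues equal the singular values, so this yields
    the best rank-`rank` approximation in the least-squares sense.
    """
    vals, vecs = np.linalg.eigh(a)
    top = np.argsort(vals)[::-1][:rank]
    return vals[top], vecs[:, top]


# Toy morpheme-by-document counts: rows are "walk", "-ed", "stock"; columns are documents.
x = np.array([
    [1.0, 1.0, 0.0],
    [2.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
])
a = alignment_matrix(x)
vals, vecs = low_rank_eig(a, rank=2)
approx = (vecs * vals) @ vecs.T  # lower rank approximation of the alignment matrix
```

Any other symmetric alignment score could be substituted for the cosine without changing the factorization step.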
20. A computer-readable storage medium that provides instructions that, when executed by a computer, will cause the computer to perform operations comprising:
parsing a corpus to identify instances of wordforms within each document of the corpus;
performing a morphological tokenization on the wordforms to convert the wordforms into morphemes by separating affixes from their stems;
generating a weighted morpheme-by-document matrix based at least in part on a number of instances of the affixes and the stems within each document of the corpus and based at least in part on a weighting function, wherein the weighted morpheme-by-document matrix accounts for information related to past tense and present tense inflections by separately enumerating the instances of the affixes from the instances of the stems, wherein the weighting function is applied to attribute-values of the affixes separately from attribute-values of the stems to separately account for relative importance of the affixes; and
generating at least one lower rank approximation matrix by factorizing the weighted morpheme-by-document matrix.
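The wherein clause about past tense and present tense inflections can be illustrated by comparing two counting schemes on a pair of sentences that differ only in tense. The two-suffix rule set and the function names below are hypothetical, chosen only to keep the example short.

```python
from collections import Counter

# Hypothetical two-suffix rule set, for illustration only.
SUFFIXES = ("ed", "s")


def split(wordform):
    """Return (stem, affix), with affix None when no suffix rule applies."""
    for suffix in SUFFIXES:
        stem = wordform[: -len(suffix)]
        if wordform.endswith(suffix) and len(stem) >= 3:
            return stem, "-" + suffix
    return wordform, None


def stem_only_counts(doc):
    """Conventional stemming: affixes are thrown away."""
    return Counter(split(w)[0] for w in doc.lower().split())


def morpheme_counts(doc):
    """Stems and affixes are enumerated as separate entries."""
    counts = Counter()
    for w in doc.lower().split():
        stem, affix = split(w)
        counts[stem] += 1
        if affix:
            counts[affix] += 1
    return counts


past = "she walked and talked"
present = "she walks and talks"

same_under_stemming = stem_only_counts(past) == stem_only_counts(present)
same_under_morphemes = morpheme_counts(past) == morpheme_counts(present)
```

With stem-only counting the two sentences produce identical columns, so tense is lost; with separate affix entries the columns differ in the `-ed` and `-s` counts.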
Specification