Text indexing system
Abstract
An apparatus for generating an index for a collection of words, the apparatus including means for selecting an input word from the collection of words; means for generating words that are lexically related to the input word, wherein the input word and the lexically related words form a group of words; and an indexing engine for representing the occurrence in the collection of words of any of the members of the group by a single member of the group.
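The indexing idea in the abstract, where any member of a group of lexically related words is recorded under a single member of that group, can be illustrated with a minimal Python sketch. The groups in `RELATED_GROUPS`, the choice of the first member as representative, and the function names are assumptions made for illustration only; in the apparatus the groups would come from morphological analysis rather than a hand-written table.

```python
from collections import defaultdict

# Hypothetical, hand-written groups of lexically related words.
# In the described apparatus these would be produced by the
# morphological analysis, not listed by hand.
RELATED_GROUPS = [
    ["index", "indexes", "indexed", "indexing"],
    ["relate", "relates", "related", "relation", "relational"],
]


def build_representative_map(groups):
    """Map every group member to one representative (assumed: the first member)."""
    representative = {}
    for group in groups:
        for word in group:
            representative[word] = group[0]
    return representative


def build_index(words, representative):
    """Record each word position under the representative of the word's group."""
    index = defaultdict(list)
    for position, word in enumerate(words):
        key = representative.get(word.lower(), word.lower())
        index[key].append(position)
    return dict(index)


text = "Indexing related words lets one indexed entry cover every relation".split()
index = build_index(text, build_representative_map(RELATED_GROUPS))
```

With this sample text, the occurrences of "Indexing" and "indexed" share the entry `index`, and "related" and "relation" share the entry `relate`, so a lookup of one representative finds every member of its group.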
16 Claims
1. An apparatus for generating an index for a collection of words, the apparatus comprising:
means for selecting an input word from said collection of words;
means for generating words that are lexically related to said input word wherein a word that is lexically related to said input word is a word having a meaning related to a meaning of said input word, and wherein said input word and said lexically related words form a group of words having related meanings, said means for generating words that are lexically related to said input word comprising a recognition engine including
a means for identifying the underlying lexical form of the input word, the underlying lexical form of the input word being comprised of the lexical morphemes forming the input word,
a means for scanning the lexical form of the input word for finding at least one lexical stem within the lexical morphemes forming said lexical form of the input word wherein a lexical stem of a word has a meaning related to the meaning of the input word, and
a means for identifying suffixes attached to said lexical stem, wherein said suffixes include ending suffixes and continuation suffixes and said stem finding means and suffix identifying means cooperate to conduct morphological analysis of the input word from the root to the affix and wherein said recognition engine performs inflectional and derivational analysis, wherein each derivational analysis is performed recursively using more than two derivational suffixes within each said input word; and
an indexing engine for representing the occurrence in said collection of words of any of the members of said group by a single member of said group.

4. An apparatus for generating a plurality of topic words from a collection of words having certain informational content, said apparatus comprising:
means for selecting a subset of words from said collection of words;
a morphological analyzer for generating morphological information about each of the words in said subset of words; said morphological analyzer including a recognition engine including
a means for identifying the underlying lexical form of each of the words, the underlying lexical form of each of the words being comprised of the lexical morphemes forming the word,
a means for scanning the lexical form of each word for finding at least one lexical stem within lexical morphemes forming the lexical form of each of said words wherein each lexical stem has a meaning related to the meaning of each of said words, the scanning means scanning each of said words from the start of the lexical form of the character strings of each of said words, choosing successively longer sequences of characters, and, for each said sequence of characters, determining whether the remaining characters of each of said words contain one or more suffixes, and
a means for identifying suffixes attached to each said lexical stem, wherein said suffixes include ending suffixes and continuation suffixes and said stem finding means and suffix identifying means cooperate to conduct morphological analysis of the input word from the root to the affix and wherein said recognition engine performs inflectional and derivational analysis, wherein each derivational analysis is performed recursively using more than two derivational suffixes within each said input word;
means for evaluating whether a given word of said subset of words conveys information about the content of said collection of words, said evaluation means basing its determination upon the morphological information generated by said morphological analyzer; and
means for generating a topic word corresponding to said given word, if said evaluation means determines that said given word conveys information about the content of said collection of words.

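The scanning step recited in claim 4 (start at the beginning of the lexical form, try successively longer character sequences as candidate stems, and check whether the remaining characters consist of suffixes) can be sketched roughly as follows. This is an illustrative toy, not the patented recognition engine: the lexicon `STEMS`, the suffix lists, and the reading of "continuation" suffixes as those that may be followed by further suffixes and "ending" suffixes as those that close the word are all assumptions, and the input is assumed to be already reduced to its lexical form.

```python
# Hypothetical toy lexicon and suffix tables (assumptions for illustration).
STEMS = ("nation", "form", "help")
# Assumed reading: continuation suffixes may be followed by more suffixes.
CONTINUATION_SUFFIXES = ("al", "ist", "ic", "less", "ness", "ful")
# Assumed reading: ending suffixes terminate the word.
ENDING_SUFFIXES = ("s", "ed", "ing")


def parse_suffixes(rest):
    """Recursively split `rest` into suffixes.

    Returns the list of suffixes, or None if `rest` cannot be
    fully accounted for by the suffix tables.
    """
    if rest == "":
        return []                      # bare stem, nothing left to explain
    if rest in ENDING_SUFFIXES:
        return [rest]                  # an ending suffix closes the word
    for suffix in CONTINUATION_SUFFIXES:
        if rest.startswith(suffix):
            tail = parse_suffixes(rest[len(suffix):])  # recurse on remainder
            if tail is not None:
                return [suffix] + tail
    return None


def analyze(lexical_form):
    """Try successively longer prefixes as stems, working from root to affix."""
    analyses = []
    for end in range(1, len(lexical_form) + 1):
        stem = lexical_form[:end]
        if stem in STEMS:
            suffixes = parse_suffixes(lexical_form[end:])
            if suffixes is not None:
                analyses.append((stem, suffixes))
    return analyses


result = analyze("nationalistic")
```

Here `analyze("nationalistic")` finds the stem `nation` followed by three derivational suffixes (`al`, `ist`, `ic`), which illustrates the recursive use of more than two derivational suffixes within one word. A real analyzer would also need spelling rules to recover the lexical form from the surface form, which this sketch leaves out.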
15. A method for generating an index for a collection of words, the method comprising:
selecting an input word from said collection of words;
generating words that are lexically related to said input word wherein a word that is lexically related to said input word is a word having a meaning related to a meaning of said input word, and wherein said input word and said lexically related words form a group of lexically related words by
identifying the underlying lexical form of the input word, the underlying lexical form of the input word being comprised of the lexical morphemes forming the input word,
scanning the lexical form of the input word for finding at least one lexical stem within the lexical morphemes forming said lexical form of said input word wherein each lexical stem has a meaning related to the meaning of the input word, and
identifying suffixes attached to said lexical stem, wherein said suffixes include ending suffixes and continuation suffixes and said stem finding and suffix identifying are performed cooperatively to conduct a morphological analysis of the input word from the root to the affix and performing inflectional and derivational analysis, wherein each derivational analysis is performed recursively using more than two derivational suffixes within each said input word; and
representing the occurrence in said collection of words of any of the members of said group by a single member of said group.

16. A method for generating a plurality of topic words from a collection of words having certain information content, said method comprising:
selecting a subset of words from said collection of words;
using a morphological analyzer to generate morphological information about each of the words in said subset of words; by finding a lexical stem among the lexical morphemes within a word of each of said words, and identifying suffixes attached to said lexical stem, wherein said lexical stem finding and suffix identifying are performed cooperatively to conduct a morphological analysis of the input word from the root to the affix by scanning each said word from the start of the lexical form of the character strings of each said word, choosing successively longer sequences of characters, and, for each said sequence of characters, determining whether the remaining characters of each said word contains one or more suffixes, and performing inflectional and derivational analysis, wherein each derivational analysis is performed recursively using more than two derivational suffixes within each said input word; and
evaluating whether a given word of said subset of words conveys information about the content of said collection of words, said evaluating step basing its determination upon the morphological information generated by said morphological analyzer; and
generating a topic word corresponding to said given word, if said evaluating step determines that said given word conveys information about the content of said collection of words.

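The evaluating and generating steps of claim 16 can be sketched in a few lines, under stated assumptions. The claim only says that the evaluation is based on morphological information; treating part of speech as that information, treating only nouns as informative, and emitting the base form as the topic word are assumptions of this sketch, and `MORPH_INFO` is a hypothetical stand-in for the morphological analyzer's output.

```python
# Hypothetical analyzer output: word -> (base form, part of speech).
MORPH_INFO = {
    "indexes": ("index", "noun"),
    "documents": ("document", "noun"),
    "quickly": ("quick", "adverb"),
    "the": ("the", "determiner"),
    "of": ("of", "preposition"),
}

# Assumption: only nouns are treated as conveying content.
INFORMATIVE_POS = {"noun"}


def conveys_information(info):
    """Decide from morphological information whether a word is topical."""
    return info is not None and info[1] in INFORMATIVE_POS


def topic_words(subset):
    """Return one topic word (the base form) per informative word, without repeats."""
    topics = []
    for word in subset:
        info = MORPH_INFO.get(word.lower())
        if conveys_information(info) and info[0] not in topics:
            topics.append(info[0])
    return topics


topics = topic_words(["The", "indexes", "of", "documents", "indexes"])
```

For this sample subset the function words are skipped and the two nouns yield the topic words `index` and `document`, each reported once even though "indexes" occurs twice.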
Specification