Finite-state transduction of related word forms for text indexing and retrieval
First Claim
1. A computerized information retrieval or text indexing device, comprising:
- (a) a database stored on a computer readable medium, said database comprising a data structure for representing stem-variant relations of a language, said data structure comprising a finite state transducer (FST) encoding along a plurality of branches sets of ordered-pairs of upper and lower strings wherein the upper string of each pair is a valid word stem and the lower string of each pair is a valid word variant, said data structure being constructed such that traversing a branch of the FST via the upper string of a pair will enable retrieval of the lower string of the pair, or traversing a branch of the FST via the lower string of a pair will enable retrieval of the upper string of the pair,(b) processing means connected to the computer readable medium, in response to a user query inputting a word incorporating a stem or a variant, for traversing the data structure FST searching for a complete path through an FST branch having a lower string matching the query word, said processing means further comprising means in response to finding a complete path through a branch for outputting the upper string stem represented by that branch and corresponding to the query word or an identification of a document containing the same, or for outputting another word variant represented by that branch and having the same stem as the query word or an identification of a document containing the same.
4 Assignments
0 Petitions
Accused Products
Abstract
The present invention solves a number of problems in using stems (canonical indicators of word meanings) in full-text retrieval of natural language documents, and thus permits recall to be improved without sacrificing precision. It uses various arrangements of finite-state transducers to accurately encode a number of desirable ways of mapping back and forth between words and stems, taking into account both systematic aspects of a language'"'"'s morphological rule system and also the word-by-word irregularities that also occur. The techniques described apply generally across the languages of the world and are not just limited to simple suffixing languages like English. Although the resulting transducers can have many states and transitions or arcs, they can be compacted by finite-state compression algorithms so that they can be used effectively in resource-limited applications. The invention contemplates the information retrieval system comprising the novel finite state transducer as a database and a processor for responding to user queries, for searching the database, and for outputting proper responses, if they exist, as well as the novel database used in such a system and methods for constructing the novel database.
74 Citations
24 Claims
-
1. A computerized information retrieval or text indexing device, comprising:
-
(a) a database stored on a computer readable medium, said database comprising a data structure for representing stem-variant relations of a language, said data structure comprising a finite state transducer (FST) encoding along a plurality of branches sets of ordered-pairs of upper and lower strings wherein the upper string of each pair is a valid word stem and the lower string of each pair is a valid word variant, said data structure being constructed such that traversing a branch of the FST via the upper string of a pair will enable retrieval of the lower string of the pair, or traversing a branch of the FST via the lower string of a pair will enable retrieval of the upper string of the pair, (b) processing means connected to the computer readable medium, in response to a user query inputting a word incorporating a stem or a variant, for traversing the data structure FST searching for a complete path through an FST branch having a lower string matching the query word, said processing means further comprising means in response to finding a complete path through a branch for outputting the upper string stem represented by that branch and corresponding to the query word or an identification of a document containing the same, or for outputting another word variant represented by that branch and having the same stem as the query word or an identification of a document containing the same. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
-
-
11. A data structure stored on a computer readable medium used in a computerized device executing an information retrieval or text indexing application program,
said data structure representing stem-variant relations of a language, said data structure comprising a single combined finite state transducer (FST), said FST comprising a plurality of branches defining a plurality of FST paths from a start state via transitions to different end states, each branch forming a path from the start state to the end state representing a string, each of said transitions representing an ordered pair comprising one character or null symbol in an upper string constituting a valid word stem and one character or null symbol in a lower string constituting a valid word variant of said word stem, said single FST comprising the merger of a first FST derived from a first list of irregular word stem-variant pairs constituting a morpheme lexicon and of a second FST derived from at least a second list of linguistic rules governing the addition of affixes to word stems to form regular word variants of said stems, said single FST mapping stems of words obeying the linguistic rules to variants of said stems except where the word is an irregular word in the morpheme lexicon and is overridden by the latter, said data structure being constructed such that traversing a branch of the FST via the upper string of a pair will enable retrieval by the program of the lower string of the pair, or traversing a branch of the FST via the lower string of a pair will enable retrieval by the program of the upper string of the pair.
-
16. A method of using a database for a language for word indexing and retrieval from valid words in that language stored in the database with the aid of a single FST that maps via word-stem pairs the surface form of valid words for the language along a lower string of the pair to its lexical stem or stems along an upper string of the pair, the database being stored on a computer readable medium, the method comprising the steps of:
-
(a) inputting a query containing a valid word in the language requesting identification of any related words in the database or of a document containing such words, (b) operating the FST upward by traversing a branch of the FST via the lower string of the pair to obtain the query word'"'"'s stem or stems and then scanning the database looking for words with a matching stem, or (c) operating the FST upwards by traversing a branch of the FST via the lower string of the pair to obtain the query word'"'"'s lexical stem or stems, then operating the FST downward by traversing a branch of the FST containing the lexical stem or stems via the upper string of the pair to generate all variants of the lexical stem or stems and then scanning the database with each of the variants searching for a match. - View Dependent Claims (17)
-
-
18. A computerized language processing device, comprising:
-
(a) a database stored on a computer readable medium, said database comprising a data structure for representing stem-variant relations and variant-parts of speech-affix relations of a language, said data structure comprising a combined finite state transducer (FST) encoding along a plurality of branches sets of ordered-pairs of upper and lower labels and parts-of-speech or affix tags/label wherein the upper string of each pair is a character of a valid word stem or a part-of-speech or affix tag, and the lower string of each pair is a character of a valid word variant or nul, said data structure being constructed such that traversing a branch of the FST via the upper pair will enable retrieval of the lower pair, or traversing a branch of the FST via the lower pair will enable retrieval of the upper pair, (b) processing means connected to the computer readable medium, in response to a user query inputting a word incorporating a stem or a variant, for traversing the data structure FST searching for a complete path through an FST branch having a lower pair matching the query word, said processing means further comprising means in response to finding a complete path through a branch for outputting the upper string stem or part-of-speech represented by that branch and corresponding to the query word or an identification of a document containing the same, or for outputting another word variant represented by that branch and having the same stem as the query word or an identification of a document containing the same.
-
-
19. A data structure stored on a computer readable medium used in a computerized device executing a language processing application program,
said data structure representing stem-variant relations and variant-parts of speech-affix relations as tags of a language, said data structure comprising a single combined finite state transducer (FST), said FST comprising a plurality of branches defining a plurality of FST paths from a single start state via transitions to different end states, each branch forming a path from the start state to the end state representing a string or a string and a tag, each of said transitions representing an ordered pair comprising one character or null symbol in an upper string constituting a valid word stem or a label representing a part-of-speech or affix relation and one character or null symbol in a lower string constituting a valid word variant of said word stem, said single FST comprising the merger of a first FST derived from a first list of irregular word stem-variant pairs constituting a morpheme lexicon and of a second FST derived from at least a second list of linguistic rules, said data structure being constructed such that traversing a branch of the FST via the upper string of a pair will enable retrieval by the program of the lower string of the pair, or traversing a branch of the FST via the lower string of a pair will enable retrieval by the program of the upper string of the pair or the upper string with the tag.
-
22. A method of using a database in a language for word indexing and retrieval from documents in that language or for other language processing applications with the aid of a single FST that maps via word-stem pairs the surface form of valid words in the language along a lower string of the pair to lexical counterpart forms along an upper string of the pair, the database being stored on a computer readable medium, the method comprising the steps of:
-
(a) inputting a query containing a valid word in the language requesting identification of any related words in the database or of a document containing such words or of the query word'"'"'s part of speech, (b) operating the FST upward by traversing a branch of the FST via the lower string of the pair to obtain the query word'"'"'s stem and then scanning the database looking for a matching stem, or (c) operating the FST upwards by traversing a branch of the FST via the lower string of the pair to obtain the query word'"'"'s lexical counterpart, then operating the FST downward by traversing a branch of the FST containing the lexical counterpart via the upper string of the pair to generate all variants of the lexical counterpart and then scanning the database with each of the variants searching for a match, or (d) operating the FST upward by traversing a branch of the FST via the lower string of the pair to obtain an identification of the query word'"'"'s part-of-speech. - View Dependent Claims (23)
-
-
24. A computerized language processing device, comprising:
-
(a) a database stored on a computer readable medium, said database comprising a data structure for representing stem-variant relations and variant-parts of speech-affix relations of a language, said data structure comprising a combined finite state machine (FSM) converted from finite state transducers (FST) each encoding along a plurality of branches sets of ordered-pairs of upper and lower labels and parts-of-speech or affix tags label wherein the upper string of each pair is a character of a valid word stem or a part-of-speech or affix tag or null, and the lower string of each pair is a character of a valid word variant or null, each of said FST transitions from an original to a destination state being replaced in the FSM with an odd-numbered transition labelled to represent the upper string label or tag or null to a new state followed by an even-numbered transition labelled to represent the lower string character or null from the new state to the destination state, said data structure being constructed such that traversing a branch of the FSM via the odd-numbered transition will enable retrieval of the lower string character, or traversing a branch of the FSM via the even-numbered transition will enable retrieval of the upper string label or tag, (b) processing means connected to the computer readable medium, in response to a user query inputting a word incorporating a stem or a variant, for traversing the data structure FSM searching for a complete path through an FSM branch having a lower character matching that of the query word, said processing means further comprising means in response to finding a complete path through a branch for outputting the upper string stem or part-of-speech represented by that branch and corresponding to the query word or an identification of a document containing the same, or for outputting another word variant represented by that branch and having the same stem as the query word or an identification of a document containing the same.
-
Specification