Language independent stemming
First Claim
1. A method for stemming a word for use in a text search system running in a computing system, the method comprising the steps of:
- (a) calling a stemming algorithm to process a word;
(b) parsing the word through a main routine of said stemming algorithm;
wherein said main routine determines all possible prefixes and suffixes for the word;
(c) parsing a remaining portion of the word through a recursive subroutine called from within said main routine, wherein said recursive subroutine determines all possible roots and infixes of the remaining portion of the word;
(d) assigning through a cost calculator function of said stemming algorithm a cost for each of said prefixes, suffixes, roots, and infixes found;
(e) sequencing by said stemming algorithm said prefixes, suffixes, roots, and infixes found into one or more unique paths that traverse the word;
(f) adding up by said stemming algorithm a total cost for each of said one or more unique paths to determine a least cost path;
(g) outputting by said stemming algorithm one or more roots found in said least cost path as a stem for the word;
(h) performing a search with the text search system using said one or more roots for the word instead of the word itself in both a querying and a indexing phases of the search.
0 Assignments
0 Petitions
Accused Products
Abstract
A stemming framework for combining stemming algorithms together in a multilingual environment to obtain improved stemming behavior over any individual stemming algorithm, together with a new language independent stemming algorithm based on shortest path techniques. The stemmer essentially treats the stemming problem as a simple instance of the shortest path problem where the cost for each path can be computed from its word component and its number of characters. The goal of the stemmer is to find the shortest path to construct the entire word. The stemmer uses dynamic dictionaries constructed as lexical analyzer state transition tables to recognize the various allowable word parts for any given language in order to obtain maximum speed. The stemming framework provides the necessary logic to combine multiple stemmers in parallel and to merge their results to obtain the best behavior. Mapping dictionaries handle irregular plurals, tense, phrase mapping and proper name recognition.
-
Citations
25 Claims
-
1. A method for stemming a word for use in a text search system running in a computing system, the method comprising the steps of:
-
(a) calling a stemming algorithm to process a word; (b) parsing the word through a main routine of said stemming algorithm;
wherein said main routine determines all possible prefixes and suffixes for the word;(c) parsing a remaining portion of the word through a recursive subroutine called from within said main routine, wherein said recursive subroutine determines all possible roots and infixes of the remaining portion of the word; (d) assigning through a cost calculator function of said stemming algorithm a cost for each of said prefixes, suffixes, roots, and infixes found; (e) sequencing by said stemming algorithm said prefixes, suffixes, roots, and infixes found into one or more unique paths that traverse the word; (f) adding up by said stemming algorithm a total cost for each of said one or more unique paths to determine a least cost path; (g) outputting by said stemming algorithm one or more roots found in said least cost path as a stem for the word; (h) performing a search with the text search system using said one or more roots for the word instead of the word itself in both a querying and a indexing phases of the search. - View Dependent Claims (2, 3, 4, 5, 6, 7)
-
-
8. A system for stemming text comprising:
-
a processor for processing one or more programming instructions; logically connected to said processor, one or more storage devices; one or more lookup dictionaries, stored in said one or more storage devices, which describe character sequences in a target language corresponding to one or more of the following word components;
prefix, suffix, root, and infix;logically connected to said processor, a client application that as part of its processing presents a stream of text to be stemmed; and a first stemming algorithm, stored in said one or more storage devices, which is based on a shortest-path path technique, wherein said stream of text is passed to said first stemming algorithm said first stemming algorithm having a main routine for determining all prefixes and suffixes for each word of said stream of text, and having a recursive subroutine called from within said main routine to determine all possible roots and infixes of a remaining portion of each word of said stream of text, said stemming algorithm parsing one or more paths through each word in said stream of text in terms of combinations of said word components for which allowable combinations are identified by accessing said one or more lookup dictionaries, wherein a cost for any said one or more paths through said each word is a calculation from said word components involved and a number of characters in each of said word components, wherein a stemmed result for said each word is selected from said one or more paths that has a least cost to completely traverse said each word. - View Dependent Claims (9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25)
-
Specification