Language independent stemming

US 8,015,175 B2
Filed: 03/16/2007
Issued: 09/06/2011
Est. Priority Date: 03/16/2007
Status: Active Grant

Abstract

A stemming framework for combining stemming algorithms together in a multilingual environment to obtain improved stemming behavior over any individual stemming algorithm, together with a new language independent stemming algorithm based on shortest path techniques. The stemmer essentially treats the stemming problem as a simple instance of the shortest path problem where the cost for each path can be computed from its word component and its number of characters. The goal of the stemmer is to find the shortest path to construct the entire word. The stemmer uses dynamic dictionaries constructed as lexical analyzer state transition tables to recognize the various allowable word parts for any given language in order to obtain maximum speed. The stemming framework provides the necessary logic to combine multiple stemmers in parallel and to merge their results to obtain the best behavior. Mapping dictionaries handle irregular plurals, tense, phrase mapping and proper name recognition.
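
The shortest-path idea in the abstract can be illustrated with a minimal sketch. It assumes the default cost formulas quoted in claim 23, reads `n/2` as integer division, and uses tiny hypothetical word-part sets in place of the table-driven dictionaries the patent describes. The names `PREFIXES`, `SUFFIXES`, `ROOTS`, `INFIXES`, `cost`, `parse_roots`, and `stem` are illustrative only and do not come from the patent.

```python
# Hypothetical toy dictionaries; a real system would load these per language.
PREFIXES = {"un", "re"}
SUFFIXES = {"s", "ing", "ed", "ness"}
ROOTS = {"happi", "work", "book", "house", "boat"}
INFIXES = {"-"}


def cost(kind, n):
    """Default costs as quoted in claim 23 (n // 2 is an assumed reading of n/2)."""
    if kind == "prefix":
        return 2 * n + 1
    if kind == "suffix":
        return 2 * n - 2
    if kind == "root":
        return 2 * n - n // 2 + 1
    return 2 * n + 2  # infix


def parse_roots(text):
    """Recursive subroutine: every way to read text as Root {[Infix] Root}."""
    paths = []
    for i in range(1, len(text) + 1):
        head, rest = text[:i], text[i:]
        if head not in ROOTS:
            continue
        if not rest:
            paths.append([("root", head)])
            continue
        # Another root may follow directly ...
        for tail in parse_roots(rest):
            paths.append([("root", head)] + tail)
        # ... or after an infix.
        for j in range(1, len(rest)):
            if rest[:j] in INFIXES:
                for tail in parse_roots(rest[j:]):
                    paths.append([("root", head), ("infix", rest[:j])] + tail)
    return paths


def stem(word):
    """Main routine: try each prefix/suffix split, keep the least-cost path."""
    best = None
    for p in range(len(word)):
        prefix = word[:p]
        if prefix and prefix not in PREFIXES:
            continue
        for s in range(p + 1, len(word) + 1):
            suffix = word[s:]
            if suffix and suffix not in SUFFIXES:
                continue
            for middle in parse_roots(word[p:s]):
                path = list(middle)
                if prefix:
                    path.insert(0, ("prefix", prefix))
                if suffix:
                    path.append(("suffix", suffix))
                total = sum(cost(kind, len(piece)) for kind, piece in path)
                if best is None or total < best[0]:
                    best = (total, path)
    if best is None:
        return [word]  # assumed fallback: no path found, leave the word alone
    return [piece for kind, piece in best[1] if kind == "root"]


print(stem("unhappiness"))  # ['happi']
print(stem("houseboats"))   # ['house', 'boat']
```

The sketch enumerates paths exhaustively, which is fine for short words but not tuned for speed; the patent attributes its speed to dictionaries built as lexical analyzer state transition tables, which this example does not attempt to reproduce.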

25 Claims

1. A method for stemming a word for use in a text search system running in a computing system, the method comprising the steps of:
- (a) calling a stemming algorithm to process a word;
  
  (b) parsing the word through a main routine of said stemming algorithm;
  
  wherein said main routine determines all possible prefixes and suffixes for the word;
  
  (c) parsing a remaining portion of the word through a recursive subroutine called from within said main routine, wherein said recursive subroutine determines all possible roots and infixes of the remaining portion of the word;
  
  (d) assigning through a cost calculator function of said stemming algorithm a cost for each of said prefixes, suffixes, roots, and infixes found;
  
  (e) sequencing by said stemming algorithm said prefixes, suffixes, roots, and infixes found into one or more unique paths that traverse the word;
  
  (f) adding up by said stemming algorithm a total cost for each of said one or more unique paths to determine a least cost path;
  
  (g) outputting by said stemming algorithm one or more roots found in said least cost path as a stem for the word;
  
  (h) performing a search with the text search system using said one or more roots for the word instead of the word itself in both the querying and the indexing phases of the search.
  - 2. The method according to claim 1 wherein steps (b) and (c) further comprise:
    - parsing the word according to a following pattern:
      
      [Prefix] Root {[Infix] Root} [Suffix]
      
      wherein a pattern element [ . . . ] implies zero or one occurrence, and a pattern element { . . . } implies zero, one, or more occurrences.
  - 3. The method according to claim 1 further comprising the steps of:
    - accessing by said stemming algorithm one or more lookup dictionaries to determine if the word is a member of a set within said one or more lookup dictionaries; and
      
      if the word is said member of said set, setting delimiter characters around the word to prevent said stemming algorithm from making any changes to the word;
      
      wherein said one or more lookup dictionaries are implemented through a lexical analyzer that is table driven.
  - 4. The method according to claim 3 further comprising the step of:
    - determining if the word is a member of one of the following sets:
      
      male names, female names, surnames, place names, acronyms, and stop words.
  - 5. The method according to claim 1 further comprising the step of:
    - accessing by said stemming algorithm one or more mapping dictionaries to determine if the word is a member of said one or more mapping dictionaries; and
      
      if the word is said member of said one or more mapping dictionaries, replacing the word with one or more alternate words mapped to the word in said one or more mapping dictionaries;
      
      wherein said one or more mapping dictionaries are implemented through a lexical analyzer that is table driven combined with a string list functionality provided by a flat memory model.
  - 6. The method according to claim 5 further comprising the step of determining if the word is a member of one of the following mapping dictionaries:
    - common phrases mapped to equivalent meanings, irregular tense word forms, irregular plural word forms, or fix up.
  - 7. The method according to claim 1 wherein assigning step (d) further comprises the steps of:
    - determining a default cost for said prefixes according to a first formula 2*n+1;
      
      determining a default cost for said suffixes according to a second formula 2*n−2;
      
      determining a default cost for said roots according to a third formula 2*n−n/2+1; and
      
      determining a default cost for said infixes according to a fourth formula 2*n+2;
      
      wherein “n” is a number of characters in each of said prefixes, suffixes, roots, and infixes.
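
The dictionary steps in claims 3 through 6 can be sketched as a pre-pass that runs before stemming. This is only an illustration: it substitutes plain Python sets and dicts for the table-driven lexical analyzer the claims call for, the delimiter characters `<` and `>` are an assumption (the claims do not name them), checking lookup sets before mapping dictionaries is an assumed order, and the sample entries are hypothetical.

```python
# Hypothetical sample data; real dictionaries would be per-language and far larger.
LOOKUP_SETS = {
    "place names": {"paris", "reading"},
    "stop words": {"the", "of"},
}
MAPPINGS = {
    "mice": ["mouse"],  # irregular plural word form
    "went": ["go"],     # irregular tense word form
}
OPEN, CLOSE = "<", ">"  # assumed delimiter characters


def preprocess(word):
    """Protect lookup-set members, replace mapped words, pass others through."""
    key = word.lower()
    if any(key in members for members in LOOKUP_SETS.values()):
        # Delimiters mark the word so a later stemmer leaves it unchanged.
        return [OPEN + word + CLOSE]
    if key in MAPPINGS:
        return list(MAPPINGS[key])
    return [word]


def is_protected(token):
    """True if a token was wrapped by preprocess and must not be stemmed."""
    return token.startswith(OPEN) and token.endswith(CLOSE)


print(preprocess("Reading"))  # ['<Reading>']
print(preprocess("mice"))     # ['mouse']
print(preprocess("walking"))  # ['walking']
```

A stemmer called after this pre-pass would check `is_protected` first and return such tokens as they are.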

8. A system for stemming text comprising:
- a processor for processing one or more programming instructions;
  
  logically connected to said processor, one or more storage devices;
  
  one or more lookup dictionaries, stored in said one or more storage devices, which describe character sequences in a target language corresponding to one or more of the following word components:
  
  prefix, suffix, root, and infix;
  
  logically connected to said processor, a client application that as part of its processing presents a stream of text to be stemmed; and
  
  a first stemming algorithm, stored in said one or more storage devices, which is based on a shortest-path technique, wherein said stream of text is passed to said first stemming algorithm, said first stemming algorithm having a main routine for determining all prefixes and suffixes for each word of said stream of text, and having a recursive subroutine called from within said main routine to determine all possible roots and infixes of a remaining portion of each word of said stream of text, said stemming algorithm parsing one or more paths through each word in said stream of text in terms of combinations of said word components for which allowable combinations are identified by accessing said one or more lookup dictionaries, wherein a cost for any said one or more paths through said each word is a calculation from said word components involved and a number of characters in each of said word components, wherein a stemmed result for said each word is selected from said one or more paths that has a least cost to completely traverse said each word.
  - 9. The system of claim 8 further comprising:
    - one or more additional lookup dictionaries, stored on said one or more storage devices, which logically describe character sequences in said target language corresponding to one or more of the following logical sets:
      
      male names, female names, surnames, place names, acronyms, and stop words, wherein if a word in said stream of text is a member of one of said logical sets, delimiter characters are set around said word to prevent said first stemming algorithm from making any changes to said word.
  - 10. The system of claim 9 further comprising:
    - one or more mapping dictionaries, stored on said one or more storage devices, for replacing each said word in said stream of text with one or more alternate words mapped to each said word if each said word is found in said one or more mapping dictionaries, wherein said one or more mapping dictionaries include common phrases mapped to equivalent meanings, irregular tense word forms, irregular plural word forms, or fix up.
  - 11. The system of claim 10 further comprising:
    - a stemming framework logically connected to said processor which allows an invocation of one or more registered stemming algorithms for said target language, in addition to said first stemming algorithm, through one or more function call-backs within said stemming framework, together with a logic to select a best result returned by any of said one or more registered stemming algorithms or said first stemming algorithm, wherein said stemming framework also makes use of any of said one or more lookup dictionaries, said one or more additional lookup dictionaries, and said one or more mapping dictionaries that are available for said target language in order to map said stream of text presented for stemming to an alternate stream in order to overcome limitations inherent in said one or more registered stemming algorithms.
  - 12. The system of claim 11 wherein said one or more lookup dictionaries, said one or more additional lookup dictionaries, and said one or more mapping dictionaries are provided in a plurality of languages thereby allowing operation of the system for stemming text in multiple languages.
  - 13. The system of claim 11 wherein said one or more lookup dictionaries include a stemmed root word dictionary that maps stemmed root words in said target language to a corresponding word or phrase in a base language such that said stemming framework may choose not only to stem the said stream of text in said target language, but also to map it where possible to said base language.
  - 14. The system of claim 13 wherein said client application utilizes said stemmed root word dictionary through said first stemming algorithm and said stemming framework in order to accomplish cross language searching or machine translation.
  - 15. The system of claim 11 wherein said client application utilizes said stemming framework and said first stemming algorithm as part of a text search system.
  - 16. The system of claim 11 wherein said stemming framework implements and manages a stemmer cache to store results of stemmer calls and optionally utilizes said cache in order to avoid making subsequent identical calls, thereby improving execution speed, and further wherein said stemmer cache can be dumped to an output file or device in order to allow examination of performance of said stemmer calls.
  - 17. The system of claim 11 wherein a user interface is provided within said client application in order to allow creation and editing of any of said one or more lookup, additional lookup, and mapping dictionaries and to immediately see the results of any changes made without the need for application re-start or code recompilation.
  - 18. The system of claim 8 wherein said client application utilizes said first stemming algorithm as part of a spelling corrector algorithm.
  - 19. The system of claim 11 wherein said one or more lookup dictionaries, said one or more additional lookup dictionaries, and said one or more mapping dictionaries are implemented using a lexical analyzer state transition table responsive to said stream of text or extracted word component to index into said one or more lookup, additional lookup, and mapping dictionaries.
  - 20. The system of claim 19 wherein said lexical analyzer state transition table used to implement said one or more lookup, additional lookup, and mapping dictionaries is combined with a string list functionality provided by a flat memory model.
  - 21. The system of claim 8 wherein said first stemming algorithm attempts to parse said each word in said stream of text into its constituent components according to a following pattern:
    - [Prefix] Root {[Infix] Root} [Suffix]
      
      wherein a pattern element [ . . . ] implies zero or one occurrence, and a pattern element { . . . } implies zero, one, or more occurrences.
  - 22. The system of claim 8 wherein said stream of text is presented and directly processed in UTF-8 encoding.
  - 23. The system of claim 8 wherein said calculation of said word component costs is done by a cost calculator function that can be substituted as a function of said target language involved, wherein said cost calculator function further comprises:
    - a first formula for a default cost for said prefixes, wherein said first formula is 2*n+1;
      
      a second formula for a default cost for said suffixes, wherein said second formula is 2*n−2;
      
      a third formula for a default cost for said roots, wherein said third formula is 2*n−n/2+1; and
      
      a fourth formula for a default cost for said infixes, wherein said fourth formula is 2*n+2;
      
      where “n” is the number of characters in each of said prefixes, suffixes, roots, and infixes in said target language.
  - 24. The system of claim 8 wherein a recognition of any given root, suffix, prefix, or infix as part of recognizing a path through an input word being stemmed can optionally result in output of one or more words to an output stream thereby allowing the system to convert said input word into one or more output words and to maintain any meaning associated with affix sequences in said input word, wherein a control logic in said one or more lookup dictionaries controls an order in which output mappings for said word components are written to said output stream.
  - 25. The system of claim 8 wherein said first stemming algorithm can break a single multi-root input word into an output stream containing and preserving each of the stemmed roots as separate words.
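
The framework behavior described in claims 11 and 16 (registered stemmers invoked through call-backs, a best-result selection, and a stemmer cache that can be dumped) might look roughly like the sketch below. The claims do not say how the best result is chosen, so picking the shortest candidate is purely an assumption here. The class `StemmingFramework` and the two toy stemmers `strip_s` and `strip_ing` are hypothetical names.

```python
import io


class StemmingFramework:
    """Sketch of a framework that combines registered stemmers and caches results."""

    def __init__(self, use_cache=True):
        self.stemmers = []   # registered call-backs: word -> stemmed word
        self.cache = {}      # word -> chosen result
        self.use_cache = use_cache
        self.calls = 0       # how many times the stemmers were actually invoked

    def register(self, stemmer):
        self.stemmers.append(stemmer)

    def stem(self, word):
        if self.use_cache and word in self.cache:
            return self.cache[word]  # avoid a subsequent identical call
        self.calls += 1
        candidates = [c for c in (s(word) for s in self.stemmers) if c]
        # Assumed selection rule: the shortest non-empty candidate wins.
        best = min(candidates, key=len, default=word)
        self.cache[word] = best
        return best

    def dump_cache(self, out):
        """Write the cache to a file-like object for later examination."""
        for word, result in sorted(self.cache.items()):
            out.write(f"{word}\t{result}\n")


# Two toy stemmers standing in for registered stemming algorithms.
def strip_s(word):
    return word[:-1] if word.endswith("s") else word


def strip_ing(word):
    return word[:-3] if word.endswith("ing") else word


fw = StemmingFramework()
fw.register(strip_s)
fw.register(strip_ing)
print(fw.stem("books"), fw.stem("walking"))  # book walk

buf = io.StringIO()
fw.dump_cache(buf)
print(buf.getvalue(), end="")
```

Shortest-candidate selection is aggressive and would over-stem in practice; it is used only because it keeps the example short.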

Current Assignee
John Fairweather
Original Assignee
John Fairweather
Inventors
Fairweather, John
Primary Examiner(s)
Fleurantin; Jean B
Assistant Examiner(s)
Hershley; Mark E

Application Number

US11/687,402
Publication Number

US 20080228748A1
Time in Patent Office

1,635 Days
Field of Search

707/3, 707/713, 707/736, 707/758, 707/759
US Class Current

707/713
CPC Class Codes

G06F 16/31 Indexing; Data structures t...

G06F 16/3332 Query translation
