LEMMATIZING, STEMMING, AND QUERY EXPANSION METHOD AND SYSTEM

US 20100082333A1
Filed: 06/01/2009
Published: 04/01/2010
Est. Priority Date: 05/30/2008
Status: Active Grant

Abstract

A method of stemming text and a system therefor are described. The method comprises removing stop words from a document based on at least one stop word entry in an array of stop words; flagging as nouns words determined to be attached to definite articles and preceded by a noun array entry in an array of stop words preceding at least one noun; adding flagged nouns to a noun dictionary; flagging as verbs words determined to be preceded by a verb array entry in an array of stop words preceding at least one verb; adding flagged verbs to a verb dictionary; searching the document for nouns and verbs based on the flagged nouns and the flagged verbs; removing remaining stop words subsequent to searching the document; applying light stemming on the flagged nouns; applying a root-based stemming on the flagged verbs; and storing the stemmed document.

16 Claims

1. A computer-implemented method of stemming text comprising:
- removing stop words from a document based on at least one stop word entry in an array of stop words;
  
  flagging as nouns words determined to be attached to definite articles and preceded by a noun array entry in an array of stop words preceding at least one noun;
  
  adding flagged nouns to a noun dictionary;
  
  flagging as verbs words determined to be preceded by a verb array entry in an array of stop words preceding at least one verb;
  
  adding flagged verbs to a verb dictionary;
  
  searching the document for nouns and verbs based on the flagged nouns and the flagged verbs;
  
  removing remaining stop words subsequent to searching the document;
  
  applying light stemming on the flagged nouns;
  
  applying a root-based stemming on the flagged verbs; and
  
  storing the stemmed document.
- Dependent Claims (2, 3, 4, 9)
  - 2. The method as claimed in claim 1, further comprising:
    - storing the flagged verbs to the verb dictionary.
  - 3. The method as claimed in claim 1, further comprising:
    - storing the flagged nouns to the noun dictionary.
  - 4. The method as claimed in claim 1, wherein the document comprises Arabic text.
  - 9. A memory or a computer-readable medium storing instructions which, when executed by a processor, cause the processor to perform the method of claim 1.
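
The steps of claim 1 can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration only: the word lists, the affix lists, and the function names are hypothetical and do not come from the patent. In particular, `crude_root` is a toy stand-in for root-based stemming (it only strips a few verb affixes), and `light_stem` strips only the definite article and one suffix.

```python
# All word lists below are illustrative assumptions, not taken from the patent.
GENERAL_STOP_WORDS = {"و", "ثم", "أو"}         # conjunctions: and, then, or
NOUN_STOP_WORDS = {"في", "من", "على", "إلى"}   # prepositions, which precede nouns
VERB_STOP_WORDS = {"لم", "لن", "قد", "سوف"}    # particles, which precede verbs
DEFINITE_ARTICLE = "ال"                        # attached definite article "al-"


def light_stem(word):
    """Light stemming (assumed form): drop the article and one common suffix."""
    if word.startswith(DEFINITE_ARTICLE) and len(word) > 4:
        word = word[len(DEFINITE_ARTICLE):]
    for suffix in ("ات", "ون", "ين", "ة"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def crude_root(word):
    """Toy stand-in for root-based stemming: strip one suffix and one prefix."""
    for suffix in ("ون", "وا", "ان"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            break
    if len(word) > 3 and word[0] in "يتنأ":
        word = word[1:]
    return word


def stem_document(text):
    # Step 1: remove stop words that carry no flagging information.
    tokens = [t for t in text.split() if t not in GENERAL_STOP_WORDS]

    # Steps 2-5: flag nouns and verbs by the stop word that precedes them
    # and add them to the noun and verb dictionaries.
    noun_dictionary, verb_dictionary = set(), set()
    for prev, word in zip(tokens, tokens[1:]):
        if prev in NOUN_STOP_WORDS and word.startswith(DEFINITE_ARTICLE):
            noun_dictionary.add(word)
        elif prev in VERB_STOP_WORDS:
            verb_dictionary.add(word)

    # Steps 6-9: search the document for flagged words, drop the remaining
    # stop words, and stem nouns and verbs with different stemmers.
    remaining_stop_words = NOUN_STOP_WORDS | VERB_STOP_WORDS
    stemmed = []
    for token in tokens:
        if token in remaining_stop_words:
            continue
        if token in noun_dictionary:
            stemmed.append(light_stem(token))
        elif token in verb_dictionary:
            stemmed.append(crude_root(token))
        else:
            stemmed.append(token)

    # Step 10: the caller stores the returned stemmed document.
    return stemmed, noun_dictionary, verb_dictionary


if __name__ == "__main__":
    # "They will write in the school"
    print(stem_document("سوف يكتبون في المدرسة"))
```

The sketch keeps the noun and verb paths separate because the claim applies a different stemmer to each; how a production system would derive true roots is outside what this example shows.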

5. A system for stemming text comprising:
- a processor;
  
  a memory, storing an input document, a stemming process, a noun stop word array of stop words preceding at least one noun, and a verb stop word array of stop words preceding at least one verb, and coupled with the processor, wherein the stemming process comprises a sequence of one or more instructions which, when executed by the processor, causes the processor to:
  
  remove stop words from the input document based on at least one noun stop word entry in the noun stop word array;
  
  flag as nouns words determined to be attached to definite articles and preceded by a noun stop word array entry;
  
  flag as verbs words determined to be preceded by a verb stop word array entry;
  
  search the document for nouns and verbs based on the flagged nouns and the flagged verbs;
  
  remove remaining stop words subsequent to searching the document to create a stemmed document;
  
  apply light stemming on the flagged nouns;
  
  apply a root-based stemming on the flagged verbs; and
  
  store the stemmed document to the memory.
- Dependent Claims (6, 7, 8)
  - 6. The system as claimed in claim 5, further comprising instructions to cause the processor to store the flagged verbs to a verb dictionary stored in memory.
  - 7. The system as claimed in claim 5, further comprising instructions to cause the processor to store the flagged nouns to a noun dictionary stored in memory.
  - 8. The system as claimed in claim 5, wherein the input document comprises Arabic text.
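
One way to picture the memory contents recited in claims 5 through 7 is as a single record holding the input document, the two stop word arrays, and the two dictionaries. The Python sketch below is a hypothetical layout only: the class name `StemmerMemory`, its field names, and the sample stop words are assumptions, not structures described in the patent.

```python
from dataclasses import dataclass, field


@dataclass
class StemmerMemory:
    """Hypothetical in-memory layout for the items the claim says are stored."""
    input_document: str
    # Sample entries only; the claim does not enumerate the arrays.
    noun_stop_words: frozenset = frozenset({"في", "من", "على", "إلى"})
    verb_stop_words: frozenset = frozenset({"لم", "لن", "قد", "سوف"})
    noun_dictionary: set = field(default_factory=set)   # claim 7
    verb_dictionary: set = field(default_factory=set)   # claim 6
    stemmed_document: list = field(default_factory=list)


def flag_into_dictionaries(mem, definite_article="ال"):
    """Store flagged nouns and verbs in the dictionaries held in memory."""
    tokens = mem.input_document.split()
    for prev, word in zip(tokens, tokens[1:]):
        if prev in mem.noun_stop_words and word.startswith(definite_article):
            mem.noun_dictionary.add(word)
        elif prev in mem.verb_stop_words:
            mem.verb_dictionary.add(word)
    return mem


if __name__ == "__main__":
    mem = flag_into_dictionaries(StemmerMemory("سوف يكتبون في المدرسة"))
    print(mem.noun_dictionary, mem.verb_dictionary)
```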

10. A method of stemming text, comprising:
- tokenizing words in a received input document;
  
  normalizing words in the received input document;
  
  removing stop words from the received input document;
  
  for each word remaining in the received input document which is greater than three letters in length:
  
  performing at least one of removing suffixes or modifying endings of the each word based on the contents of a safe suffix set;
  
  applying at least one of a prefix detection model or a suffix detection model to the each word; and
  
  performing a similarity comparison between the each word and a set of words from the received input document to generate a concepts group set from matching words.
- Dependent Claims (11, 12, 13, 14, 15, 16)
  - 11. The method as claimed in claim 10, further comprising:
    - detecting plural patterns in the received input document based on the concepts group set; and

      stemming detected patterns in the received input document using the concepts group set to replace at least one detected pattern with a local stem corresponding to a concept group set.
  - 12. The method as claimed in claim 10, further comprising:
    - performing a query expansion based on the concepts group set.
  - 13. The method as claimed in claim 10, wherein the for each word remaining steps comprise performing at least one of a string match or a string search process.
  - 14. The method as claimed in claim 10, wherein the concepts group set comprises one or more clustering rules.
  - 15. The method as claimed in claim 10, further comprising performing a vocabulary size reduction based on the concepts group set.
  - 16. The method as claimed in claim 15, wherein the performing a vocabulary size reduction is performed on at least the received input document.
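
Claims 10, 12, and 15 can be sketched together in Python. The example uses English words purely for readability, and every specific is an assumption: the stop word list, the safe suffix set, the 0.8 threshold, and the use of `difflib.SequenceMatcher` as the similarity comparison are illustrative choices, not the patent's. The prefix and suffix detection models of claim 10 are omitted.

```python
import re
from difflib import SequenceMatcher

# Illustrative assumptions; the claims do not enumerate these sets.
STOP_WORDS = {"the", "a", "of", "and", "in", "to"}
SAFE_SUFFIXES = ("ing", "ers", "er", "ed", "s")


def tokenize(text):
    """Split text into runs of letters."""
    return re.findall(r"[^\W\d_]+", text)


def normalize(word):
    return word.lower()


def strip_safe_suffix(word):
    """Remove one safe suffix, only from words longer than three letters."""
    if len(word) <= 3:
        return word
    for suffix in SAFE_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def concept_groups(text, threshold=0.8):
    """Group words from the document itself whose stems are similar."""
    words = [normalize(w) for w in tokenize(text)]
    words = [w for w in words if w not in STOP_WORDS]
    groups = {}  # representative stem -> set of surface forms
    for word in words:
        stem = strip_safe_suffix(word)
        match = next(
            (rep for rep in groups
             if SequenceMatcher(None, stem, rep).ratio() >= threshold),
            stem,
        )
        groups.setdefault(match, set()).add(word)
    return groups


def expand_query(term, groups):
    """Query expansion (claim 12): return every form in the term's group."""
    stem = strip_safe_suffix(normalize(term))
    return sorted(groups.get(stem, {term}))


def reduce_vocabulary(text, groups):
    """Vocabulary reduction (claim 15): map each word to its group stem."""
    rep_of = {w: rep for rep, members in groups.items() for w in members}
    words = [normalize(w) for w in tokenize(text)]
    return [rep_of[w] for w in words if w in rep_of]


if __name__ == "__main__":
    text = "The teacher teaches teaching in the school; teachers taught."
    groups = concept_groups(text)
    print(groups)
    print(expand_query("teachers", groups))
    print(reduce_vocabulary(text, groups))
```

Because the groups are built only from words in the received document, the representative stem acts like a document-local stem. Note that the irregular form "taught" stays in its own group under this simple similarity measure.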

Current Assignee: Eiman Tamah Al-Shammari
Original Assignee: Eiman Tamah Al-Shammari
Inventors: Al-Shammari, Eiman Tamah
Granted Patent: US 8,473,279 B2
US Class Current: 704/10
CPC Class Codes:
G06F 16/3335   Syntactic pre-processing, e...
G06F 16/3338   Query expansion
G06F 40/242   Dictionaries
G06F 40/268   Morphological analysis
