LEMMATIZING, STEMMING, AND QUERY EXPANSION METHOD AND SYSTEM
First Claim
1. A computer-implemented method of stemming text comprising:
removing stop words from a document based on at least one stop word entry in an array of stop words;
flagging as nouns words determined to be attached to definite articles and preceded by a noun array entry in an array of stop words preceding at least one noun;
adding flagged nouns to a noun dictionary;
flagging as verbs words determined to be preceded by a verb array entry in an array of stop words preceding at least one verb;
adding flagged verbs to a verb dictionary;
searching the document for nouns and verbs based on the flagged nouns and the flagged verbs;
removing remaining stop words subsequent to searching the document;
applying light stemming on the flagged nouns;
applying a root-based stemming on the flagged verbs; and
storing the stemmed document.
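The steps of this claim can be read as a small pipeline. The following is a minimal sketch only, not the patented implementation. Every name, word list, prefix, and suffix in it is a hypothetical placeholder, and the tokens are made-up transliterations. It also assumes one reading of the claim: the first pass removes only general stop words, so the stop words that precede nouns or verbs are still available as markers until the later "remaining stop words" pass. The root-based stemmer is a crude stand-in that strips one prefix and one suffix.

```python
# All data below is a hypothetical toy example, not taken from the patent.
GENERAL_STOP_WORDS = {"wa", "aw"}     # assumed: removed in the first pass
NOUN_STOP_WORDS = {"fi", "ila"}       # assumed: stop words that precede nouns
VERB_STOP_WORDS = {"lam", "sawfa"}    # assumed: stop words that precede verbs
DEFINITE_ARTICLE = "al"               # assumed: article written attached to the word
NOUN_SUFFIXES = ("at", "un")          # assumed suffixes for the toy light stemmer
VERB_PREFIXES = ("ya",)               # assumed affixes for the toy root stemmer
VERB_SUFFIXES = ("u",)


def light_stem(word):
    """Toy light stemming: strip the attached article and one known suffix."""
    if word.startswith(DEFINITE_ARTICLE) and len(word) > len(DEFINITE_ARTICLE) + 2:
        word = word[len(DEFINITE_ARTICLE):]
    for suffix in NOUN_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def root_stem(word):
    """Stand-in for root-based stemming: strip one prefix and one suffix."""
    for prefix in VERB_PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            word = word[len(prefix):]
            break
    for suffix in VERB_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    return word


def stem_document(tokens):
    """Return (stemmed tokens, noun dictionary, verb dictionary)."""
    # First pass: remove general stop words, keep the noun/verb markers.
    tokens = [t for t in tokens if t not in GENERAL_STOP_WORDS]

    # Flag words by the stop word that precedes them and fill the dictionaries.
    noun_dict, verb_dict = set(), set()
    for prev, word in zip(tokens, tokens[1:]):
        if prev in NOUN_STOP_WORDS and word.startswith(DEFINITE_ARTICLE):
            noun_dict.add(word)
        elif prev in VERB_STOP_WORDS:
            verb_dict.add(word)

    # Search the whole document for flagged words, drop the remaining
    # stop words, and stem nouns and verbs with their respective stemmers.
    stemmed = []
    for token in tokens:
        if token in NOUN_STOP_WORDS or token in VERB_STOP_WORDS:
            continue
        if token in noun_dict:
            stemmed.append(light_stem(token))
        elif token in verb_dict:
            stemmed.append(root_stem(token))
        else:
            stemmed.append(token)
    return stemmed, noun_dict, verb_dict
```

With the toy input `["fi", "albaytun", "wa", "lam", "yaktubu", "albaytun", "kabir"]`, this sketch returns the stemmed tokens `["bayt", "ktub", "bayt", "kabir"]`. The second `albaytun` is stemmed as a noun even though no marker precedes it, because the search step looks it up in the noun dictionary. Storing the result is left to the caller.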
Abstract
A method of stemming text and a system therefor are described. The method comprises: removing stop words from a document based on at least one stop word entry in an array of stop words; flagging as nouns words determined to be attached to definite articles and preceded by a noun array entry in an array of stop words preceding at least one noun; adding flagged nouns to a noun dictionary; flagging as verbs words determined to be preceded by a verb array entry in an array of stop words preceding at least one verb; adding flagged verbs to a verb dictionary; searching the document for nouns and verbs based on the flagged nouns and the flagged verbs; removing remaining stop words subsequent to searching the document; applying light stemming on the flagged nouns; applying a root-based stemming on the flagged verbs; and storing the stemmed document.
16 Claims
1. A computer-implemented method of stemming text comprising:
removing stop words from a document based on at least one stop word entry in an array of stop words;
flagging as nouns words determined to be attached to definite articles and preceded by a noun array entry in an array of stop words preceding at least one noun;
adding flagged nouns to a noun dictionary;
flagging as verbs words determined to be preceded by a verb array entry in an array of stop words preceding at least one verb;
adding flagged verbs to a verb dictionary;
searching the document for nouns and verbs based on the flagged nouns and the flagged verbs;
removing remaining stop words subsequent to searching the document;
applying light stemming on the flagged nouns;
applying a root-based stemming on the flagged verbs; and
storing the stemmed document.
Dependent claims: 2, 3, 4, 9
5. A system for stemming text comprising:
a processor;
a memory, storing an input document, a stemming process, a noun stop word array of stop words preceding at least one noun, and a verb stop word array of stop words preceding at least one verb, and coupled with the processor, wherein the stemming process comprises a sequence of one or more instructions which, when executed by the processor, causes the processor to:
remove stop words from the input document based on at least one noun stop word entry in the noun stop word array;
flag as nouns words determined to be attached to definite articles and preceded by a noun stop word array entry;
flag as verbs words determined to be preceded by a verb stop word array entry;
search the document for nouns and verbs based on the flagged nouns and the flagged verbs;
remove remaining stop words subsequent to searching the document to create a stemmed document;
apply light stemming on the flagged nouns;
apply a root-based stemming on the flagged verbs; and
store the stemmed document to the memory.
Dependent claims: 6, 7, 8
10. A method of stemming text, comprising:
tokenizing words in a received input document;
normalizing words in the received input document;
removing stop words from the received input document;
for each word remaining in the received input document which is greater than three letters in length:
performing at least one of removing suffixes or modifying endings of the each word based on the contents of a safe suffix set;
applying at least one of a prefix detection model or a suffix detection model to the each word; and
performing a similarity comparison between the each word and a set of words from the received input document to generate a concepts group set from matching words.
Dependent claims: 11, 12, 13, 14, 15, 16
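The flow of claim 10 can be illustrated with a short sketch. This is a hedged illustration under stated assumptions, not the claimed implementation: the stop word list, the safe suffix set, the prefix list, and the 0.8 threshold are all hypothetical, English is used only for readability, a fixed prefix list stands in for the prefix detection model, and the similarity comparison is done with `difflib.SequenceMatcher` from the Python standard library as a substitute for whatever measure the specification defines.

```python
import re
from difflib import SequenceMatcher

# Hypothetical toy data, chosen only for illustration.
STOP_WORDS = {"the", "a", "of", "and", "is"}
SAFE_SUFFIXES = ("ing", "ed", "s")   # assumed "safe suffix set"
KNOWN_PREFIXES = ("re", "un")        # stand-in for a prefix detection model
SIMILARITY_THRESHOLD = 0.8           # assumed cut-off for "matching" words


def tokenize(text):
    """Split text into word tokens."""
    return re.findall(r"\w+", text)


def normalize(word):
    """Toy normalization: lowercase only."""
    return word.lower()


def strip_safe_suffix(word):
    """Remove the first matching safe suffix, keeping at least three letters."""
    for suffix in SAFE_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def strip_detected_prefix(word):
    """Remove the first matching known prefix, keeping at least three letters."""
    for prefix in KNOWN_PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            return word[len(prefix):]
    return word


def concept_groups(text):
    """Group words of the document whose reduced forms are similar."""
    words = [normalize(t) for t in tokenize(text)]
    words = [w for w in words if w not in STOP_WORDS]

    groups = {}  # reduced form -> set of original words
    for word in words:
        if len(word) <= 3:  # only words longer than three letters are processed
            continue
        stem = strip_detected_prefix(strip_safe_suffix(word))
        # Similarity comparison against the groups built so far.
        for key in groups:
            if SequenceMatcher(None, stem, key).ratio() >= SIMILARITY_THRESHOLD:
                groups[key].add(word)
                break
        else:
            groups[stem] = {word}
    return groups
```

For the input `"Connected connections connecting the unconnected network networks"`, this sketch yields two groups: one keyed by `connect` holding the four `connect`-derived words, and one keyed by `network` holding `network` and `networks`. Grouping `unconnected` with `connected` is a side effect of the toy prefix list and shows why a real prefix detection model would need to be more careful.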
Specification