Multi-language document search and retrieval system
Abstract
A multi-lingual indexing and search system performs tokenization and stemming in a manner which is independent of whether index entries and search terms appear as words in a dictionary. During the tokenization phase of the process, a string of text is separated into individual word tokens, and predetermined types of tokens are eliminated from further processing. The stemming phase of the process reduces words to grammatical stems by removing known word-endings associated with the various languages to be supported. Known word endings are removed from the word tokens without any effort to guarantee that the remaining stem is contained in a dictionary. In a preferred implementation, the stemming process is only applied to nouns.
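The tokenization phase described in the abstract can be illustrated with a minimal sketch. The abstract does not say which "predetermined types of tokens" are eliminated, so the filter below (pure numbers and single-character tokens) is an assumption made only for illustration, as are the names `TOKEN_RE`, `is_eliminated`, and `tokenize`.

```python
import re

# Hypothetical token pattern: a run of Unicode word characters is one token.
TOKEN_RE = re.compile(r"\w+")


def is_eliminated(token: str) -> bool:
    """Assumed examples of token types to drop; the abstract does not list them."""
    return token.isdigit() or len(token) < 2


def tokenize(text: str) -> list[str]:
    """Split text into lowercase word tokens and drop the eliminated types."""
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    return [t for t in tokens if not is_eliminated(t)]


if __name__ == "__main__":
    print(tokenize("The 3 cats, les chats!"))
```

Because the pattern relies on Unicode word characters rather than a per-language word list, the same function handles text in several languages without knowing which language it is reading.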
18 Claims
1. A method for indexing textual content in any of a plurality of languages for searching purposes, comprising the steps of:
separating a string of text into individual word tokens;
reducing the word tokens to grammatical stems by removing word endings which are associated with any one or more of the languages, without regard to whether the remaining stem is a recognized word in any combination of the plurality of languages; and
storing the stems in an index.
(Dependent claims: 2, 3, 4, 5, 6, 7)
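The stemming and storing steps of claim 1 might be sketched as follows. The claim names no specific word endings, so the `WORD_ENDINGS` list (a few English, French, and German endings), the `min_stem` length guard, and the function names are all hypothetical. The point of the sketch is that endings are stripped with no dictionary lookup, so a stem such as "nat" is stored even though it is not a recognized word.

```python
# Hypothetical ending list mixing several languages; the claim names no endings.
WORD_ENDINGS = ("ungen", "ions", "ung", "ion", "es", "en", "s")


def stem(token: str, min_stem: int = 3) -> str:
    """Strip the longest matching ending; never check the result against a dictionary."""
    for ending in sorted(WORD_ENDINGS, key=len, reverse=True):
        # min_stem is an assumed guard so that very short words are left alone.
        if token.endswith(ending) and len(token) - len(ending) >= min_stem:
            return token[: -len(ending)]
    return token


def build_index(text: str) -> set[str]:
    """Store the stem of every whitespace-separated token in a simple set index."""
    return {stem(token) for token in text.lower().split()}


if __name__ == "__main__":
    print(stem("nations"))    # nat
    print(stem("zeitungen"))  # zeit
    print(sorted(build_index("Nations Zeitungen maisons")))
```

The same ending list is applied to every token regardless of its language, which is why no language-detection step appears in the sketch.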
8. A method for searching for documents which may contain text in any of a plurality of languages, comprising the steps of:
separating text in each document to be searched into individual word tokens;
reducing the word tokens to grammatical stems by removing word endings which are associated with any one or more of the languages, without regard to whether the remaining stem is a recognized word in any of the plurality of languages;
storing the stems in an index which identifies the documents in which words containing the stems appeared;
receiving a query containing a string of text to be searched;
parsing the string of text into individual word tokens;
reducing the word tokens from said query to grammatical stems by removing word endings which are associated with any one or more of the languages, without regard to whether the remaining stem is a recognized word in any of the plurality of languages;
searching the index for entries which match the stems obtained from said query; and
displaying an identification of the documents which contained matching entries.
(Dependent claims: 9, 10)
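The search method of claim 8 can be sketched as an inverted index from stems to document identifiers, with the query passed through the same tokenizing and stemming steps as the documents. Everything named here (`ENDINGS`, `stems_of`, `build_index`, `search`, the sample file names) is hypothetical, and treating a document as a match when it shares any one stem with the query is an assumption, since the claim does not say how multiple query stems are combined.

```python
import re
from collections import defaultdict

# Hypothetical endings, ordered longest first; the claim names none.
ENDINGS = ("ungen", "ions", "ung", "ion", "es", "en", "s")


def stems_of(text: str) -> list[str]:
    """Tokenize text and strip one known ending per token, with no dictionary check."""
    stems = []
    for token in re.findall(r"\w+", text.lower()):
        for ending in ENDINGS:
            if token.endswith(ending) and len(token) - len(ending) >= 3:
                token = token[: -len(ending)]
                break
        stems.append(token)
    return stems


def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Map each stem to the identifiers of the documents in which it appeared."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for s in stems_of(text):
            index[s].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> list[str]:
    """Return identifiers of documents sharing at least one stem with the query."""
    hits = set()
    for s in stems_of(query):
        hits |= index.get(s, set())
    return sorted(hits)


if __name__ == "__main__":
    docs = {
        "a.txt": "Die Zeitungen von heute",
        "b.txt": "Two nations, one newspaper",
        "c.txt": "La nation",
    }
    idx = build_index(docs)
    print(search(idx, "nation"))   # ['b.txt', 'c.txt']
    print(search(idx, "Zeitung"))  # ['a.txt']
```

Applying one stemming routine to both documents and queries is what lets "nation" retrieve a document containing "nations" even though the shared stem "nat" is not itself a word.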
11. A system for searching for documents which may contain text in any of a plurality of languages, comprising:
a tokenizer which receives text strings from documents to be searched and user queries, and separates the text into individual word tokens;
a stemmer which reduces the word tokens to grammatical stems by removing word endings which are associated with any one or more of the plurality of languages, without regard to whether the remaining stem is a recognized word in any of the plurality of languages;
an index which stores the stems from documents and identifies the documents in which words containing the stems appeared;
a search engine which searches the index for entries which match the stems obtained from user queries; and
a display system which displays an identification of the documents which contain matching entries.
(Dependent claims: 12, 13, 14)
15. A computer-readable medium containing a program which executes the steps of:
separating a string of text into individual word tokens;
reducing the word tokens to grammatical stems by removing word endings which are associated with any one or more of the languages, without regard to whether the remaining stem is a recognized word in any of the plurality of languages; and
storing the stems in an index.
(Dependent claims: 16, 17, 18)
Specification