Multi-language document search and retrieval system
Abstract
A multi-lingual indexing and search system performs tokenization and stemming in a manner which is independent of whether index entries and search terms appear as words in a dictionary. During the tokenization phase of the process, a string of text is separated into individual word tokens, and predetermined types of tokens are eliminated from further processing. The stemming phase of the process reduces words to grammatical stems by removing known word-endings associated with the various languages to be supported. Known word endings are removed from the word tokens without any effort to guarantee that the remaining stem is contained in a dictionary. In a preferred implementation, the stemming process is only applied to nouns.
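The tokenization and stemming phases described in the abstract can be illustrated with a minimal sketch. The suffix list, the set of discarded tokens, and the minimum stem length below are all assumptions made for illustration; the abstract does not enumerate them. The sketch also does not attempt to restrict stemming to nouns.

```python
import re

# Hypothetical merged list of endings from several languages, longest first.
# These endings are illustrative and are not taken from the patent.
SUFFIXES = sorted(["s", "es", "en", "er", "e", "n", "x"], key=len, reverse=True)

# Hypothetical "predetermined types of tokens" to eliminate (here: stop words).
DISCARD = {"the", "a", "of", "der", "die", "das", "le", "la", "les"}

# Assumed guard so that short words are not stripped down to nothing.
MIN_STEM_LEN = 3


def tokenize(text):
    """Split text into lowercase word tokens and drop discarded tokens."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if t not in DISCARD]


def stem(token):
    """Remove the longest known ending, with no dictionary check on the result."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= MIN_STEM_LEN:
            return token[: -len(suffix)]
    return token


print([stem(t) for t in tokenize("The houses of Berlin")])
```

Note that `stem("houses")` yields `hous`, which is not a word in any of the languages. That is the point of the approach: the stem only has to be produced consistently at index time and at query time.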
24 Claims
1. A computer-readable medium containing a computer program for searching for documents that may contain text in any of a plurality of languages, wherein the computer program performs the steps of:
separating text in each document to be searched into individual word tokens;
reducing the word tokens to grammatical stems by removing word endings that are associated with any one or more of the languages, without regard to whether the remaining stem is a recognized word in any of the plurality of languages;
storing the stems in an index that identifies the documents in which words containing the stems appeared;
parsing a query containing a string of text into individual word tokens;
reducing the word tokens from the query to grammatical stems by removing word endings that are associated with any one or more of the languages, without regard to whether the remaining stem is a recognized word in any of the plurality of languages;
searching the index for entries that match the stems obtained from the query; and
displaying an identification of the documents that contained matching entries.
Dependent claims: 2, 3
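The steps of claim 1 can be sketched end to end as follows. This is an illustration under stated assumptions, not the patented implementation: the ending list, the in-memory dictionary used as the index, and the function names `build_index` and `search` are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical merged ending list, longest first; not taken from the patent.
SUFFIXES = ["es", "en", "er", "s", "e", "n"]


def tokenize(text):
    """Separate text into lowercase word tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def stem(token):
    """Strip one known ending without checking the stem against a dictionary."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def build_index(documents):
    """Map each stem to the set of document ids whose text produced it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[stem(token)].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents matching any stem obtained from the query."""
    hits = set()
    for token in tokenize(query):
        hits |= index.get(stem(token), set())
    return sorted(hits)


documents = {
    "doc1": "Old houses line the river",
    "doc2": "Die Häuser am Fluss",
    "doc3": "A house by the lake",
}
index = build_index(documents)
print(search(index, "house"))  # → ['doc1', 'doc3']
```

Because the same ending removal is applied to documents and to the query, "house" and "houses" both reduce to the stem `hous` and match each other even though `hous` is not a recognized word. The sketch strips at most one ending per token, so it makes no attempt to match across languages.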
4. A system for searching for documents which may contain text in any of a plurality of languages, comprising:
a tokenizer which separates text in each document to be searched into individual word tokens;
a stemmer which reduces the word tokens to grammatical stems by removing word endings which are associated with any one or more of the languages, without regard to whether the remaining stem is a recognized word in any of the plurality of languages;
an indexer which stores the stems in an index and which identifies the documents in which words containing the stems appeared;
a search engine which parses a query containing a string of text into individual word tokens, reduces the word tokens from said query to grammatical stems by removing word endings which are associated with any one or more of the languages, without regard to whether the remaining stem is a recognized word in any of the plurality of languages, and searches the index for entries which match the stems obtained from said query; and
a display which displays an identification of the documents which contained matching entries.
Dependent claims: 5, 6
7. A system for determining a relevance ranking for documents that may contain text in any of a plurality of languages, comprising:
a tokenizer that parses a string of text in a received query into individual word tokens;
a stemmer that reduces the word tokens from the query to grammatical stems by removing word endings that are associated with any one or more of the languages, without regard to whether the remaining stem is a recognized word in any of the plurality of languages;
a search engine that searches an index for entries that match the stems obtained from the query, wherein the index identifies the documents in which words containing the stems appeared, retrieves a summary for each document identified as containing matching entries, separates text in each summary into individual word tokens, reduces the word tokens from each summary to grammatical stems by removing word endings that are associated with any one or more of the languages, without regard to whether the remaining stem is a recognized word in any combination of the plurality of languages, and compares the stems obtained from the query with the stems obtained from each summary to determine the relevance ranking for each document identified as containing matching entries; and
a display that displays said relevance rankings.
Dependent claims: 8, 9, 10, 11, 12, 13, 14, 15
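The comparison of query stems with summary stems in claim 7 might look like the sketch below. The claim does not state a scoring formula, so the score used here (the fraction of distinct query stems that also occur in the summary) is an assumption, as are the ending list and the names `stems_of` and `rank`.

```python
import re

# Hypothetical merged ending list, longest first; not taken from the patent.
SUFFIXES = ["es", "en", "er", "s", "e", "n"]


def stems_of(text):
    """Tokenize text and strip one known ending per token, no dictionary check."""
    stems = []
    for token in re.findall(r"[^\W\d_]+", text.lower()):
        for suffix in SUFFIXES:
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                token = token[: -len(suffix)]
                break
        stems.append(token)
    return stems


def rank(query, summaries):
    """Score each matched document by the share of query stems in its summary.

    The scoring rule is an assumed stand-in; the claim only requires that
    query stems be compared with summary stems.
    """
    query_stems = set(stems_of(query))
    scored = []
    for doc_id, summary in summaries.items():
        summary_stems = set(stems_of(summary))
        if query_stems:
            score = len(query_stems & summary_stems) / len(query_stems)
        else:
            score = 0.0
        scored.append((doc_id, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Summaries of documents already identified by the index search.
summaries = {
    "doc1": "Old houses line the river",
    "doc3": "A house by the lake",
}
ranking = rank("river houses", summaries)
print(ranking)  # → [('doc1', 1.0), ('doc3', 0.5)]
```

Here the summary of `doc1` contains both query stems and the summary of `doc3` contains one, so `doc1` ranks first.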
16. A system for determining a relevance ranking for documents that may contain text in any of a plurality of languages, comprising:
a tokenizer that parses the string of text in a received query into individual word tokens;
a stemmer that reduces the word tokens from the query to grammatical stems by removing word endings that are associated with any one or more of the languages, without regard to whether the remaining stem is a recognized word in any of the plurality of languages;
a search engine that searches an index for entries that match the stems obtained from the query, wherein the index identifies the documents in which words containing the stems appeared, separates, into individual word tokens, text in each document identified as containing matching entries, reduces the word tokens to grammatical stems by removing word endings that are associated with any one or more of the languages, without regard to whether the remaining stem is a recognized word in any of the plurality of languages, and compares the stems obtained from the query with the stems obtained from each document identified as containing matching entries to determine the relevance ranking for each identified document; and
a display that displays said relevance rankings.
Dependent claims: 17, 18, 19, 20, 21, 22, 23, 24
Specification