Multi-language document search and retrieval system
Abstract
A multi-lingual indexing and search system performs tokenization and stemming in a manner which is independent of whether index entries and search terms appear as words in a dictionary. During the tokenization phase of the process, a string of text is separated into individual word tokens, and predetermined types of tokens are eliminated from further processing. The stemming phase of the process reduces words to grammatical stems by removing known word-endings associated with the various languages to be supported. Known word endings are removed from the word tokens without any effort to guarantee that the remaining stem is contained in a dictionary. In a preferred implementation, the stemming process is only applied to nouns.
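As a rough illustration of the two phases described in the abstract, the Python sketch below tokenizes a string and strips known endings without consulting any dictionary. The ending list, the function names, and the choice to drop purely numeric and single-character tokens are assumptions made for this example; the abstract does not enumerate the endings or the eliminated token types.

```python
import re

# Hypothetical endings drawn from more than one language, ordered longest first.
KNOWN_ENDINGS = ("ungen", "ions", "ung", "ion", "es", "en", "s")

def tokenize(text):
    """Split text into lowercase word tokens.

    Purely numeric and single-character tokens are dropped; this stands in
    for the 'predetermined types of tokens' and is an assumption.
    """
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if not t.isdigit() and len(t) > 1]

def stem(token):
    """Remove the first matching known ending, with no dictionary lookup."""
    for ending in KNOWN_ENDINGS:
        if token.endswith(ending) and len(token) > len(ending):
            return token[: -len(ending)]
    return token

stems = [stem(t) for t in tokenize("Translations of 3 documents")]
print(stems)  # ['translat', 'of', 'document']
```

Note that "translat" is not a word in any language; under the approach described, that is acceptable because the stem only has to match other stems produced the same way.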
44 Claims
1. A computer-readable medium containing a computer program for searching for documents that may contain text in any of a plurality of languages, wherein the computer program performs the steps of:
separating text in each document to be searched into individual word tokens;
reducing the word tokens to grammatical stems by removing word endings that are associated with any one or more of the languages, without regard to whether the remaining stem is a recognized word in any of the plurality of languages;
storing the stems in an index that identifies the documents in which words containing the stems appeared;
receiving a query containing a string of text to be searched;
parsing the string of text into individual word tokens;
reducing the word tokens from the query to grammatical stems by removing word endings that are associated with any one or more of the languages, without regard to whether the remaining stem is a recognized word in any of the plurality of languages;
searching the index for entries that match the stems obtained from the query; and
displaying an identification of the documents that contained matching entries.
2. The computer-readable medium of claim 1, wherein the computer program performs the step of:
displaying a matching entry along with the identification of the document in which it appears, wherein a stem is displayed together with an ending to present a full word to the user.
3. The computer-readable medium of claim 1, wherein a stem is stored in the index together with the ending that was removed from a word token to form that stem, and an entry in the index that matches a stem from a query is displayed with the stored ending.
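The indexing and display behavior recited in claims 1 to 3 might be sketched as follows. This is a minimal illustration, not the patented implementation: the nested-dictionary index layout, the `split_ending`, `build_index`, and `search` names, the whitespace tokenizer, and the ending list are all assumptions.

```python
from collections import defaultdict

# Hypothetical endings, ordered longest first.
ENDINGS = ("ungen", "ions", "ion", "es", "s")

def split_ending(token):
    """Return (stem, removed_ending); no check that the stem is a real word."""
    for ending in ENDINGS:
        if token.endswith(ending) and len(token) > len(ending):
            return token[: -len(ending)], ending
    return token, ""

def build_index(documents):
    """Map each stem to {doc_id: set of endings removed in that document}."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, text in documents.items():
        for token in text.lower().split():
            stem, ending = split_ending(token)
            index[stem][doc_id].add(ending)
    return index

def search(index, query):
    """Return sorted (doc_id, full_word) pairs for stems found in the query."""
    hits = set()
    for token in query.lower().split():
        stem, _ = split_ending(token)
        for doc_id, endings in index.get(stem, {}).items():
            for ending in endings:
                # Rejoin stem and stored ending to show a full word.
                hits.add((doc_id, stem + ending))
    return sorted(hits)

docs = {"doc1": "translation services", "doc2": "translations archive"}
index = build_index(docs)
results = search(index, "translation")
print(results)  # [('doc1', 'translation'), ('doc2', 'translations')]
```

Storing the removed ending next to the stem lets the result list show "translations" for doc2 even though only the stem "translat" was matched.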
4. A method for determining a relevance ranking for documents that may contain text in any of a plurality of languages, comprising the steps of:
receiving a query containing a string of text to be searched;
parsing the string of text into individual word tokens;
reducing the word tokens from the query to grammatical stems by removing word endings that are associated with any one or more of the languages, without regard to whether the remaining stem is a recognized word in any of the plurality of languages;
searching an index for entries that match the stems obtained from the query, wherein the index identifies the documents in which words containing the stems appeared;
retrieving a summary for each document identified as containing matching entries;
separating text in each summary into individual word tokens;
reducing the word tokens from each summary to grammatical stems by removing word endings that are associated with any one or more of the languages, without regard to whether the remaining stem is a recognized word in any combination of the plurality of languages; and
comparing the stems obtained from the query with the stems obtained from each summary to generate the relevance ranking for each document identified as containing matching entries.
5. The method of claim 4, comprising the steps of:
separating text in each document to be searched into individual word tokens;
reducing the word tokens to grammatical stems by removing word endings that are associated with any one or more of the languages, without regard to whether the remaining stem is a recognized word in any of the plurality of languages; and
storing the stems in the index.
6. The method of claim 4, comprising the step of:
displaying an identification of the documents that contained matching entries, in an order of relevance ranking.
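One way to picture the summary-based ranking of claims 4 and 6 is the sketch below. The claims do not say how the comparison produces a ranking, so the score used here (the number of query stems that also occur among a summary's stems) is an assumption, as are the function names, the ending list, and the sample summaries.

```python
# Hypothetical endings, ordered longest first.
ENDINGS = ("ungen", "ions", "ion", "es", "s")

def stems_of(text):
    """Tokenize on whitespace and reduce each token to a stem."""
    stems = set()
    for token in text.lower().split():
        for ending in ENDINGS:
            if token.endswith(ending) and len(token) > len(ending):
                token = token[: -len(ending)]
                break  # remove at most one ending per token
        stems.add(token)
    return stems

def rank_by_summary(query, summaries):
    """Return (doc_id, score) pairs, highest score first.

    Score = count of query stems present in the summary (assumed metric).
    Ties are broken by doc_id so the order is deterministic.
    """
    query_stems = stems_of(query)
    scored = [
        (len(query_stems & stems_of(summary)), doc_id)
        for doc_id, summary in summaries.items()
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, score) for score, doc_id in scored]

summaries = {
    "doc1": "archive of translations",
    "doc2": "document translation services",
    "doc3": "service documents",
}
ranking = rank_by_summary("translation documents", summaries)
print(ranking)  # [('doc2', 2), ('doc1', 1), ('doc3', 1)]
```

Because the summary is stemmed with the same routine as the query, "translations" in a summary still counts as a match for "translation" in the query.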
7. The method of claim 6, comprising the step of:
displaying a matching entry along with the identification of the document in which it appears, wherein a stem is displayed together with an ending to present a full word to the user. 8. The method of claim 6, wherein a stem is stored in the index together with the ending that was removed from a word token to form that stem, and an entry in the index that matches a stem from a query is displayed with the stored ending.
8. The method of claim 4, wherein the word endings that are removed are limited to those endings that are associated with nouns.
9. The method of claim 4, wherein a word ending is not removed if the resulting stem is less than a predetermined length.
10. The method of claim 9, wherein the predetermined length is four characters.
11. The method of claim 4, wherein the reducing steps are carried out once per word token.
12. The method of claim 11, wherein the reducing steps are performed by first examining each word token for the longest known endings, and examining the token for successively shorter endings until a known ending is identified in the word token and removed.
13. The method of claim 4, comprising the step of:
disregarding stopwords during the removing and storing steps, wherein stopwords are words that occur with relatively high frequency in at least one of the languages and that are not also significant nouns in another one of the languages.
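The refinements in the dependent claims above (a minimum stem length of four characters, a single reduction pass per token, longest-ending-first matching, and stopword removal) could be combined roughly as shown below. The ending set and stopword set are hypothetical. The sketch also assumes that when the longest matching ending would leave a stem that is too short, the search continues with shorter endings; the claims do not settle that point.

```python
# Hypothetical endings and stopwords; the claims list neither.
ENDINGS = {"s", "es", "en", "ung", "ungen", "ion", "ions"}
STOPWORDS = {"the", "and", "der", "und", "le", "et"}
MIN_STEM_LENGTH = 4  # the "four characters" predetermined length

# Examine the longest endings first, then successively shorter ones.
ENDINGS_BY_LENGTH = sorted(ENDINGS, key=len, reverse=True)

def reduce_token(token):
    """Return (stem, removed_ending); at most one ending is removed."""
    for ending in ENDINGS_BY_LENGTH:
        if token.endswith(ending):
            candidate = token[: -len(ending)]
            if len(candidate) >= MIN_STEM_LENGTH:
                return candidate, ending
            # Stem would be too short: keep looking at shorter endings
            # (an assumption, see the note above).
    return token, ""

def reduce_tokens(tokens):
    """Drop stopwords, then reduce each remaining token once."""
    return [reduce_token(t)[0] for t in tokens if t not in STOPWORDS]

print(reduce_token("lions"))      # ('lion', 's')
print(reduce_token("zeitungen"))  # ('zeit', 'ungen')
print(reduce_token("bus"))        # ('bus', '')
```

For "lions", stripping "ions" would leave the one-character stem "l", so that ending is rejected and the shorter ending "s" is removed instead.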
14. A method for determining a relevance ranking for documents that may contain text in any of a plurality of languages, comprising the steps of:
receiving a query containing a string of text to be searched;
parsing the string of text into individual word tokens;
reducing the word tokens from the query to grammatical stems by removing word endings that are associated with any one or more of the languages, without regard to whether the remaining stem is a recognized word in any of the plurality of languages;
searching an index for entries that match the stems obtained from the query, wherein the index identifies the documents in which words containing the stems appeared;
separating, into individual word tokens, text in each document identified as containing matching entries;
reducing the word tokens to grammatical stems by removing word endings that are associated with any one or more of the languages, without regard to whether the remaining stem is a recognized word in any of the plurality of languages;
comparing the stems obtained from the query with the stems obtained from each document identified as containing matching entries to generate the relevance ranking for each identified document.
displaying an identification of the documents that contained matching entries, in an order of relevance ranking.
15. The method of claim 14, comprising the steps of:
separating text in each document to be searched into individual word tokens;
reducing the word tokens to grammatical stems by removing word endings that are associated with any one or more of the languages, without regard to whether the remaining stem is a recognized word in any of the plurality of languages; and
storing the stems in the index.
16. The method of claim 15, comprising the step of:
displaying an identification of the documents that contained matching entries, in an order of relevance ranking.
17. The method of claim 16, comprising the step of:
displaying a matching entry along with the identification of the document in which it appears, wherein a stem is displayed together with an ending to present a full word to the user.
19. The method of claim 15, wherein the word endings that are removed are limited to those endings that are associated with nouns.
20. The method of claim 15, wherein a word ending is not removed if the resulting stem is less than a predetermined length.
21. The method of claim 20, wherein the predetermined length is four characters.
22. The method of claim 15, wherein the reducing steps are carried out once per word token.
23. The method of claim 22, wherein the reducing steps are performed by first examining each word token for the longest known endings, and examining the token for successively shorter endings until a known ending is identified in the word token and removed.
24. A computer-readable medium containing a computer program for determining a relevance ranking for documents that may contain text in any of a plurality of languages, wherein the computer program performs the steps of:
receiving a query containing a string of text to be searched;
parsing the string of text into individual word tokens;
reducing the word tokens from the query to grammatical stems by removing word endings that are associated with any one or more of the languages, without regard to whether the remaining stem is a recognized word in any of the plurality of languages;
searching an index for entries that match the stems obtained from the query, wherein the index identifies the documents in which words containing the stems appeared;
retrieving a summary for each document identified as containing matching entries;
separating text in each summary into individual word tokens;
reducing the word tokens from each summary to grammatical stems by removing word endings that are associated with any one or more of the languages, without regard to whether the remaining stem is a recognized word in any combination of the plurality of languages; and
comparing the stems obtained from the query with the stems obtained from each summary to generate the relevance ranking for each document identified as containing matching entries.
25. The computer-readable medium of claim 24, wherein the computer program performs the steps of:
separating text in each document to be searched into individual word tokens;
reducing the word tokens to grammatical stems by removing word endings that are associated with any one or more of the languages, without regard to whether the remaining stem is a recognized word in any of the plurality of languages; and
storing the stems in the index.
disregarding stopwords during the removing and storing steps, wherein stopwords are words that occur with relatively high frequency in at least one of the languages and that are not also significant nouns in another one of the languages.
27. The computer-readable medium of claim 26, wherein the computer program performs the step of:
displaying a matching entry along with the identification of the document in which it appears, wherein a stem is displayed together with an ending to present a full word to the user.
31. The computer-readable medium of claim 30, wherein the predetermined length is four characters.
33. The computer-readable medium of claim 32, wherein the reducing steps are performed by first examining each word token for the longest known endings, and examining the token for successively shorter endings until a known ending is identified in the word token and removed.
35. A computer-readable medium containing a computer program for determining a relevance ranking for documents that may contain text in any of a plurality of languages, wherein the computer program performs the steps of:
receiving a query containing a string of text to be searched;
parsing the string of text into individual word tokens;
reducing the word tokens from the query to grammatical stems by removing word endings that are associated with any one or more of the languages, without regard to whether the remaining stem is a recognized word in any of the plurality of languages;
searching an index for entries that match the stems obtained from the query, wherein the index identifies the documents in which words containing the stems appeared;
separating, into individual word tokens, text in each document identified as containing matching entries;
reducing the word tokens to grammatical stems by removing word endings that are associated with any one or more of the languages, without regard to whether the remaining stem is a recognized word in any of the plurality of languages;
comparing the stems obtained from the query with the stems obtained from each document identified as containing matching entries to generate the relevance ranking for each identified document.
36. The computer-readable medium of claim 35, wherein the computer program performs the steps of:
separating text in each document to be searched into individual word tokens;
reducing the word tokens to grammatical stems by removing word endings that are associated with any one or more of the languages, without regard to whether the remaining stem is a recognized word in any of the plurality of languages; and
storing the stems in the index.
37. The computer-readable medium of claim 36, wherein the computer program performs the step of:
displaying an identification of the documents that contained matching entries, in an order of relevance ranking.
38. The computer-readable medium of claim 37, wherein the computer program performs the step of:
displaying a matching entry along with the identification of the document in which it appears, wherein a stem is displayed together with an ending to present a full word to the user.
39. The computer-readable medium of claim 38, wherein a stem is stored in the index together with the ending that was removed from a word token to form that stem, and an entry in the index that matches a stem from a query is displayed with the stored ending.
40. The computer-readable medium of claim 36, wherein the word endings that are removed are limited to those endings that are associated with nouns.
41. The computer-readable medium of claim 36, wherein a word ending is not removed if the resulting stem is less than a predetermined length.
42. The computer-readable medium of claim 41, wherein the predetermined length is four characters.
43. The computer-readable medium of claim 36, wherein the reducing steps are carried out once per word token.
44. The computer-readable medium of claim 43, wherein the reducing steps are performed by first examining each word token for the longest known endings, and examining the token for successively shorter endings until a known ending is identified in the word token and removed.
Specification