Multi-language document search and retrieval system
First Claim
1. A system for indexing textual content in any of a plurality of languages for searching purposes, comprising:
- a processing device, comprising:
    a tokenizer which separates a string of text into individual word tokens,
    a stemmer which reduces the word tokens to grammatical stems by removing word endings which are associated with any one or more of the languages, without regard to whether the remaining stem is a recognized word in any combination of the plurality of languages, and
    an indexer which creates an index from the stems; and
- a computer-readable medium which stores the created index.
Abstract
A multi-lingual indexing and search system is presented that performs tokenization and stemming in a manner which is independent of whether index entries and search terms appear as words in a dictionary. The system includes a tokenizer that separates a string of text into individual word tokens, and eliminates predetermined types of tokens from further processing. The system also includes a stemmer that reduces words to grammatical stems by removing known word-endings associated with the various languages to be supported. The stemmer removes known word endings from the word tokens without any effort to guarantee that the remaining stem is contained in a dictionary. In an embodiment, the stemmer only removes those word endings which are associated with nouns. The system further includes an indexer that stores the stems in an index.
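The pipeline described in the abstract (tokenize, strip known endings without a dictionary check, index the stems) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the suffix list `SUFFIXES`, the minimum stem length `MIN_STEM_LEN`, the choice to discard numeric and single-character tokens as the "predetermined types", and the function names.

```python
import re
from collections import defaultdict

# Hypothetical endings drawn from more than one language (English, German).
# Longest endings are tried first so "ungen" wins over "en".
SUFFIXES = sorted(["ungen", "ing", "ung", "es", "en", "s", "e"], key=len, reverse=True)
MIN_STEM_LEN = 3  # assumed guard so very short words are left alone


def tokenize(text):
    """Split text into lowercase word tokens and drop assumed unwanted types."""
    tokens = re.findall(r"\w+", text.lower())
    # Assumption: numbers and single characters are the tokens to eliminate.
    return [t for t in tokens if not t.isdigit() and len(t) > 1]


def stem(token):
    """Remove one known ending; no check that the result is a real word."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= MIN_STEM_LEN:
            return token[: -len(suffix)]
    return token


def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[stem(token)].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every stemmed query term."""
    result = None
    for token in tokenize(query):
        ids = index.get(stem(token), set())
        result = ids if result is None else result & ids
    return result or set()


docs = {
    1: "Indexing documents",
    2: "Die Zeitungen von 1999",
    3: "document index",
}
index = build_index(docs)
print(search(index, "zeitung"))
print(search(index, "documents"))
```

Because the query passes through the same tokenizer and stemmer as the indexed text, "house" and "houses" both reduce to "hous" and still match each other, even though "hous" appears in no dictionary.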
7 Claims
1. Independent claim, reproduced in full under First Claim above. Dependent claims: 2, 3, 4, 5, 6, 7.
Specification