Method and system for fast indexing and searching of text in compound-word languages
First Claim
1. A method in a computer system for generating a search result that identifies objects that satisfy a search criteria, the computer system having a collection of objects and a plurality of terms, each object containing one or more of the terms, the objects being represented in different types of symbols in a compound word language such as Japanese or Chinese, the method comprising the computer-implemented steps of:
creating a content-index that contains, for each of the plurality of terms, a reference to each object that contains the term, by:
creating a preliminary index term of a first or second type of symbol for each plurality of terms delimited by a word separator or a character type transition;
for each preliminary index term of the first type, utilizing the preliminary index term as an index term;
for each preliminary index term of the second type, step indexing the symbols in the preliminary index term to create a plurality of index terms of a length equal to or less than a predetermined step size, the plurality of index terms comprising a collection of substrings of symbols selected from the preliminary index term that begins with one of the symbols in the preliminary index term and extends to a length of either the end of the preliminary index term or to the number of symbols of the predetermined step size, whichever is smaller;
creating the content-index by associating the object with each of its index terms; and
after creating the content-index, using the content-index to generate the search result.
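The step-indexing rule in this claim can be illustrated with a short sketch. It assumes one substring is produced for every starting symbol of the preliminary index term, each capped at the step size or at the end of the term, whichever comes first. The function name `step_index` and the default step size of 2 are hypothetical choices for illustration, not values taken from the claim.

```python
def step_index(term: str, step_size: int = 2) -> list[str]:
    """Return one substring per starting symbol, at most step_size symbols long."""
    if step_size < 1:
        raise ValueError("step_size must be at least 1")
    # Python slices stop at the end of the string, so the last substrings
    # are automatically shorter than step_size when few symbols remain.
    return [term[i:i + step_size] for i in range(len(term))]


# A four-symbol kanji term with an assumed step size of 2.
print(step_index("東京都庁", 2))  # ['東京', '京都', '都庁', '庁']
```

Under this reading, a term of n symbols always yields n index terms, and only the trailing ones are shorter than the step size.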
Abstract
A method and system for fast indexing and searching of text in compound-word languages such as Japanese, Chinese, Hebrew, and Arabic. Computer codings of such compound-word languages often contain different character types; for example, the Shift-JIS coding of Japanese represents kanji, katakana, hiragana, and roman characters with different codings in the same character set. These character types are used to form index terms and search terms. In a preferred embodiment, a content-index search system is invoked in response to a query on a collection of objects. The collection of objects is indexed by the content-index and may, for example, be a corpus of documents indexed by the terms contained in the documents. The content-index search system uses the content-index to generate and store an initial search result in response to the query; a direct search system is used in certain situations. The content-index contains, for each of a plurality of terms, a reference to each object that contains the term. The content-index is created by first creating a preliminary index term for each plurality of terms delimited by a word separator or a character type transition in a string of characters to be indexed. For each preliminary index term of a first type, e.g. katakana or roman, the preliminary index term is utilized as an index term. For each preliminary index term of a second type, e.g. kanji, the preliminary index term is step-indexed to create a plurality of index terms of a length equal to or less than a predetermined step size. The index terms are then added to the content-index in association with the object being indexed. A string of text entered into a search engine as a search term is processed into preliminary search terms and search terms in a similar manner.
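The indexing flow described in the abstract can be sketched roughly as follows. This is an illustration under several assumptions, not the patented implementation. Character types are detected from Unicode code point ranges rather than Shift-JIS codings. The step size is 2. Hiragana runs are left out of the index, since the abstract does not say how they are indexed. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import groupby


def char_type(ch: str) -> str:
    """Classify a character by Unicode range (assumption: the source discusses Shift-JIS)."""
    cp = ord(ch)
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if ch.isascii() and ch.isalnum():
        return "roman"
    return "separator"


def preliminary_terms(text: str) -> list[tuple[str, str]]:
    """Split text at word separators and at character-type transitions."""
    return [(kind, "".join(run))
            for kind, run in groupby(text, key=char_type)
            if kind != "separator"]


def index_terms(text: str, step_size: int = 2) -> list[str]:
    """Turn preliminary index terms into index terms."""
    terms = []
    for kind, term in preliminary_terms(text):
        if kind in ("katakana", "roman"):
            # First type: the preliminary term is used whole.
            terms.append(term)
        elif kind == "kanji":
            # Second type: step-indexed substrings of at most step_size symbols.
            terms.extend(term[i:i + step_size] for i in range(len(term)))
        # Hiragana runs are skipped in this sketch (an assumption).
    return terms


def build_content_index(docs: dict[int, str], step_size: int = 2) -> dict[str, set[int]]:
    """Map each index term to the ids of the objects that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in index_terms(text, step_size):
            index[term].add(doc_id)
    return index


docs = {1: "東京都のホテル", 2: "京都 hotel"}
index = build_content_index(docs)
print(sorted(index["京都"]))  # [1, 2]
```

In this example the kanji run 東京都 contributes the index terms 東京, 京都, and 都, so document 1 is reachable through 京都 even though no word separator marks that substring.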
309 Citations
41 Claims
1. A method in a computer system for generating a search result that identifies objects that satisfy a search criteria, the computer system having a collection of objects and a plurality of terms, each object containing one or more of the terms, the objects being represented in different types of symbols in a compound word language such as Japanese or Chinese, the method comprising the computer-implemented steps of:
creating a content-index that contains, for each of the plurality of terms, a reference to each object that contains the term, by:
creating a preliminary index term of a first or second type of symbol for each plurality of terms delimited by a word separator or a character type transition;
for each preliminary index term of the first type, utilizing the preliminary index term as an index term;
for each preliminary index term of the second type, step indexing the symbols in the preliminary index term to create a plurality of index terms of a length equal to or less than a predetermined step size, the plurality of index terms comprising a collection of substrings of symbols selected from the preliminary index term that begins with one of the symbols in the preliminary index term and extends to a length of either the end of the preliminary index term or to the number of symbols of the predetermined step size, whichever is smaller;
creating the content-index by associating the object with each of its index terms; and
after creating the content-index, using the content-index to generate the search result.
Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
11. A method in a computer system for generating a search result that identifies objects that satisfy a search criteria, the computer system having a collection of objects and a plurality of terms, each object containing one or more of the terms, the objects being represented in different types of symbols in a compound word language such as Japanese or Chinese, and an index associating terms and objects, the method comprising the computer-implemented steps of:
receiving a string of text as a preliminary search string;
creating a preliminary search term of a first or second type of symbol for each plurality of terms in the preliminary search string delimited by a word separator or a character type transition;
for each preliminary search term of the first type, utilizing the preliminary search term as a search term;
for each preliminary search term of the second type, step indexing the symbols in the preliminary search term to create a plurality of search terms of a length equal to or less than a predetermined step size, the plurality of search terms comprising a collection of substrings of symbols selected from the preliminary search term that begins with one of the symbols in the preliminary search term and extends to a length of either the end of the preliminary search term or the number of symbols of the predetermined step size, whichever is smaller; and
using the search terms in the index to generate the search result.
Dependent Claims (12, 13, 14, 15, 16, 17)
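The query side of claim 11 can be sketched in the same spirit. The sketch assumes the preliminary search terms have already been split out and labelled as first or second type. It also uses a small hand-written index and a step size of 2. The names `make_search_terms` and `look_up` are hypothetical. The claim does not spell out how the per-term lookups are combined into a final result, so the sketch stops at returning the references found for each search term.

```python
def make_search_terms(preliminary: list[tuple[str, str]], step_size: int = 2) -> list[str]:
    """Turn labelled preliminary search terms into search terms."""
    terms = []
    for kind, term in preliminary:
        if kind == "first":
            # First type: use the preliminary search term as-is.
            terms.append(term)
        else:
            # Second type: step-index into substrings of at most step_size symbols.
            terms.extend(term[i:i + step_size] for i in range(len(term)))
    return terms


def look_up(index: dict[str, set[int]], terms: list[str]) -> dict[str, set[int]]:
    """Return, for each search term, the object ids the index lists for it."""
    return {t: index.get(t, set()) for t in terms}


# Hypothetical index contents for two objects.
index = {"東京": {1}, "京都": {1, 2}, "都": {1, 2}, "ホテル": {1}, "hotel": {2}}
query = [("second", "東京都"), ("first", "ホテル")]

terms = make_search_terms(query)
print(terms)  # ['東京', '京都', '都', 'ホテル']
print(look_up(index, terms))
```

One caveat of this sketch: a plain intersection of the per-term lists would miss an object containing 京都市 for the query 京都, because the trailing search term 都 is not among that object's index terms (京都, 都市, 市) when the step size is 2.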
18. A method in a computer system for providing a search result that identifies objects in a compound word language such as Japanese or Chinese that satisfy a search criteria, the objects contained in a collection of objects, the search criteria having a content-index search portion used with a content-index to determine a set of objects of the collection that satisfy the content-index search portion, the search criteria having a direct search portion, the direct search portion further restricting the set of objects that satisfy the content-index search portion in order to satisfy the search criteria, the method comprising the computer-implemented steps of:
receiving a string of text in a compound word language such as Japanese or Chinese as a preliminary search string, the compound word language having symbols of a first type such as kanji, katakana, and roman and symbols of a second type such as hiragana;
creating a preliminary search term for each plurality of terms in the preliminary search string delimited by a word separator or a character type transition;
for each preliminary search term of the first type, utilizing the preliminary search term as a search term in the search criteria;
for each preliminary search term of the second type, setting a direct search indicator;
in response to the direct search indicator, generating a proposed list of references to objects that satisfy the direct search portion of the search criteria by directly searching the collection of objects with the preliminary search term of the second type;
generating a proposed list of references to objects that satisfy the content-index portion of the search criteria by searching the content-index with the search term of the first type; and
providing the search result by listing the collection of objects that match the search criteria of the content-index searching and of the direct searching.
Dependent Claims (19, 20, 21, 22)
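The combined content-index and direct search of claim 18 might look like the sketch below. It assumes preliminary search terms arrive already labelled by character type, and that first-type terms are looked up whole in a hand-written index. It also assumes the direct search is a plain substring scan, run only over the objects that survive the index lookup. That last choice reflects the "further restricting" wording of the claim, but it is an interpretation. The function name `hybrid_search` is hypothetical.

```python
def hybrid_search(docs: dict[int, str],
                  index: dict[str, set[int]],
                  preliminary: list[tuple[str, str]]) -> set[int]:
    """Combine content-index lookups with a direct scan for hiragana terms."""
    index_hits = None   # None means the query has no content-index portion
    direct_terms = []   # non-empty plays the role of the direct search indicator
    for kind, term in preliminary:
        if kind == "hiragana":
            # Second type: remember the term for direct searching.
            direct_terms.append(term)
        else:
            # First type (kanji, katakana, roman): look the term up in the index.
            postings = index.get(term, set())
            index_hits = postings if index_hits is None else index_hits & postings
    candidates = set(docs) if index_hits is None else index_hits
    # Direct search: scan the text of each candidate object for every hiragana term.
    return {d for d in candidates if all(t in docs[d] for t in direct_terms)}


docs = {1: "東京のホテル", 2: "きれいなホテル"}
index = {"東京": {1}, "ホテル": {1, 2}}  # hypothetical whole-term index

query = [("hiragana", "きれいな"), ("katakana", "ホテル")]
print(hybrid_search(docs, index, query))  # {2}
```

Scanning only the index candidates keeps the slower direct search off most of the collection. A query made of hiragana alone falls back to scanning every object in this sketch.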
23. A computer system for generating a search result that identifies objects that satisfy a search criteria, the computer system storing a collection of objects and a plurality of terms, each object containing one or more of the terms, the objects being represented in different types of symbols in a compound word language such as Japanese or Chinese, comprising:
a content-index that contains, for each of the plurality of terms, a reference to each object that contains the term;
a preliminary index term generator that generates, for each plurality of terms delimited by a word separator or a character type transition, a preliminary search term of a first or second type of symbol;
an indexer that, for each preliminary index term of the first type, utilizes the preliminary index term as an index term;
the indexer, for each preliminary index term of the second type, also step indexing the symbols in the preliminary index term to create a plurality of index terms of a length equal to or less than a predetermined step size, the plurality of index terms comprising a collection of substrings of symbols selected from the preliminary index term that begins with one of the symbols in the preliminary index term and extends to a length of either the end of the preliminary index term or the number of symbols of the predetermined step size, whichever is smaller;
an object/index term associator that creates the content-index by associating the object with each of its index terms; and
a search engine that, after creating the content-index, uses the content-index to generate the search result.
Dependent Claims (24, 25, 26, 27, 28, 29, 30, 31, 32)
33. A computer system for generating a search result that identifies objects that satisfy a search criteria, the computer system storing a collection of objects in a compound word language such as Japanese or Chinese and a plurality of terms, each object containing one or more of the terms, and an index associating terms and objects, comprising:
an input device for providing a string of text as a preliminary search string;
a preliminary search term generator that generates a kanji preliminary search term for each plurality of kanji terms in the preliminary search string delimited by a word separator or a character type transition;
a search term generator that provides, for each kanji preliminary search term, a plurality of search terms of a length equal to or less than a predetermined step size by step indexing the symbols in the preliminary kanji search term, the plurality of search terms comprising a collection of substrings of symbols selected from the preliminary kanji search term that begins with one of the symbols in the preliminary kanji search term and extends to a length of either the end of the preliminary kanji search term or the number of symbols of the predetermined step size, whichever is smaller; and
a search engine that uses the search terms in the index to generate the search result.
Dependent Claims (34, 35, 36)
37. A computer system for providing a search result that identifies objects in a compound word language such as Japanese or Chinese that satisfy a search criteria, the objects contained in a collection of objects, the search criteria having a content-index search portion used with a content-index to determine a set of objects of the collection that satisfy the content-index search portion, the search criteria having a direct search portion, the direct search portion further restricting the set of objects that satisfy the content-index search portion in order to satisfy the search criteria, comprising:
an input device for providing a string of text in a compound word language such as Japanese or Chinese as a preliminary search string, the compound word language having symbols of a first type such as kanji, katakana, and roman and symbols of a second type such as hiragana;
a preliminary search term generator for creating a preliminary search term for each plurality of terms in the preliminary search string delimited by a word separator or a character type transition;
a search term generator that, for each preliminary search term of the first type, utilizes the preliminary search term as a search term in the search criteria;
the search term generator, for each preliminary search term of the second type, setting a direct search indicator;
a search engine that, in response to the direct search indicator, generates a proposed list of references to objects that satisfy the direct search portion of the search criteria by directly searching the collection of objects with the preliminary search term of the second type;
the search engine also generating a proposed list of references to objects that satisfy the content-index portion of the search criteria by searching the content-index with the search term of the first type; and
an output device that provides the search results by listing the collection of objects that match the search criteria of the content-index searching and of the direct searching.
Dependent Claims (38, 39, 40, 41)
Specification