Dictionary and index creating system and document retrieval system
Abstract
A high-speed document retrieval system creates a regular expression dictionary and a word index on the basis of a retrieval document and a word dictionary to conduct retrieval to a document through the regular expression dictionary and the word index at a high speed. A regular expression dictionary expressing a set of character strings having the same length is created from a word dictionary. In terms of a character string included in a retrieval document and matching with a regular expression in the regular expression dictionary, an index element is recorded in a word index when there is no different index element which allows an observing index element to be deducible, which eventually produces a word index capable of achieving a high-speed full-text retrieval without the noticeable increase in the index capacity. The document retrieval system performs the retrieval of the retrieval document through the use of the word dictionary, the regular expression dictionary and the word index, so that a high-speed full-text retrieval is possible without the impairment of retrieval efficiency even if the retrieval character string is covered with words having a small number of characters and making less overlap.
14 Claims
1. A dictionary and index creating system designed to create a regular expression dictionary and a word index on the basis of a retrieval document undergoing retrieval and a word dictionary including words w (where w is one of n words w1, w2, w3, . . . , wn, n being a number greater than 1), said system comprising:
a retrieval document storage unit for storing said retrieval document composed of a lineup of a finite number of characters included in a predetermined character set;
a word dictionary storage unit for storing said word dictionary in which are registered a finite number of words each being a lineup of one or more characters included in said character set;
means for reading out each word w of said words w1, . . . , wn from said word dictionary in said word dictionary storage unit and further for making out m sets of regular expressions a, b, where m represents a variable to establish one or more regular expression sets and depends on each word w of said words w1, . . . , wn and where each of a, b is indicative of a set of equal length character strings having lengths which are equal to each other, except null sets on said character set, with said regular expressions a, b being determined according to a rule depending on each said word w of said words w1, . . . , wn, itself, or on an attribute of each said word w;
a regular expression dictionary storage unit for joining said regular expressions a, b to before and after each word w of said words w1, . . . , wn, respectively, to make out m sets of regular expressions awb corresponding to each word w and further for collecting all of said regular expressions awb made out to produce said regular expression dictionary, different from said word dictionary, according to a predetermined rule depending on each word w of said words w1, . . . , wn and for storing said regular expression dictionary;
means for retrieving a character string matching with a regular expression in said regular expression dictionary from said retrieval document storage unit and further for creating an index element comprising a set of said regular expressions and a matching character positional range in said retrieval document; and
a word index storage unit for storing a word index made out by a collection of said index elements decided as being non-deducible from other index elements.
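The flow recited in claim 1 (join a and b around each word w, match the resulting expressions awb against the retrieval document, and keep only non-deducible index elements) can be illustrated with a minimal Python sketch. All names here (`IndexElement`, `make_awb`, `scan`, `non_deducible`) are hypothetical. The claim does not define deducibility, so the sketch assumes a simplified test: an element counts as deducible when its positional range lies inside the range of a strictly longer element.

```python
import re
from typing import List, NamedTuple, Tuple


class IndexElement(NamedTuple):
    pattern: str  # the regular expression awb that matched
    start: int    # inclusive start offset in the document
    end: int      # exclusive end offset in the document


def make_awb(word: str, contexts: List[Tuple[str, str]]) -> List[str]:
    """Join each (a, b) pair before and after the word to form awb.

    Each a and b is assumed to be a regex fragment that matches strings
    of one fixed length (for example a single character class).
    """
    w = re.escape(word)
    return [a + w + b for a, b in contexts]


def scan(document: str, patterns: List[str]) -> List[IndexElement]:
    """Record every match of every pattern, including overlapping ones."""
    elements = []
    for pat in patterns:
        # A lookahead wrapper lets finditer report overlapping matches.
        for m in re.finditer("(?=(" + pat + "))", document):
            elements.append(IndexElement(pat, m.start(1), m.end(1)))
    return elements


def non_deducible(elements: List[IndexElement]) -> List[IndexElement]:
    """Keep elements not covered by a strictly longer element.

    ASSUMPTION: this containment test stands in for the patent's
    deducibility rule, which the claim text does not spell out.
    """
    kept = []
    for e in elements:
        covered = any(
            o.start <= e.start and e.end <= o.end
            and (o.end - o.start) > (e.end - e.start)
            for o in elements
        )
        if not covered:
            kept.append(e)
    return kept


patterns = make_awb("ab", [("", ""), ("", "[a-z]")])
word_index = non_deducible(scan("abcab", patterns))
```

Under the assumed rule, the bare match of "ab" at the start of the document is dropped because the longer match of "ab[a-z]" covers it, while the match at the end of the document is kept.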
5. A dictionary and index creating system made to create a regular expression dictionary and a word index on the basis of a retrieval document undergoing retrieval, a word dictionary and word frequency data, said system comprising:
a retrieval document storage unit for storing a retrieval document composed of a lineup of a finite number of characters included in a predetermined character set;
a word dictionary storage unit for storing a word dictionary in which are registered a finite number of words each being a lineup of one or more characters included in said character set;
a word frequency data storage unit for storing word frequency data indicative of an occurrence frequency of each of a plurality of words of said word dictionary in a sample document comprising a lineup of a finite number of characters included in said predetermined character set;
means for reading out each word w (where w is one of n words w1, w2, w3, . . . , wn, n being a number greater than 1) from said word dictionary in said word dictionary storage unit and further for making out m sets of regular expressions a, b, where m represents a variable to establish one or more regular expression sets and depends on each word w of said words w1, . . . , wn, and where each of a, b is indicative of a set of equal length character strings having lengths which are equal to each other, except null sets on said character set, with said regular expressions a, b being determined according to a rule depending on the frequency of each word w of said words w1, . . . , wn in said word frequency data;
a regular expression dictionary storage unit for joining said regular expressions a, b to before and after each word w of said words w1, . . . , wn, respectively, to make out m sets of regular expressions awb corresponding to each word w of said words w1, . . . , wn and further for collecting all said regular expressions awb made out in connection with all said words in said word dictionary to produce said regular expression dictionary different from said word dictionary and for storing said regular expression dictionary;
means for retrieving a character string matching with a regular expression in said regular expression dictionary from said retrieval document storage unit and further for creating an index element comprising a set of said regular expressions and a matching character positional range in said retrieval document; and
a word index storage unit for storing a word index made out by a collection of said index elements decided as being non-deducible from other index elements.
means for making out a regular expression composed of a specific word w itself if the occurrence frequency of said specific word w recorded in said word frequency data is below a first frequency limit value;
means for joining a character class a being an element in an mth left-side character class set and a character class b being an element in an mth right-side character class set to said word w to make out regular expressions awb in relation to all the possible character classes a, b if the occurrence frequency of said word w recorded in said word frequency data is higher than a mth frequency limit value but is lower than a m+1th frequency limit value; and
means for joining a character class a being an element in an Nth left-side character class set and a character class b being an element in an Nth right-side character class set to make out regular expressions awb in relation to all the possible character classes a, b if the occurrence frequency of said word w recorded in said word frequency data is higher than an N−1th frequency limit value.
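The frequency-dependent rule recited in the three means above can be sketched in Python as a lookup from a word's occurrence frequency to a tier of character classes. The function name `expressions_for` and its parameters are hypothetical. As a simplifying assumption, the sketch uses one (left, right) class-set pair per frequency limit and treats a frequency equal to a limit as falling in the lower tier, so the claim's own numbering of the top tier is not reproduced exactly.

```python
import re
from bisect import bisect_left
from itertools import product
from typing import List, Sequence, Tuple


def expressions_for(
    word: str,
    freq: int,
    limits: Sequence[int],
    class_sets: Sequence[Tuple[List[str], List[str]]],
) -> List[str]:
    """Return the regular expressions awb for one word.

    limits     -- ascending frequency limit values (hypothetical data)
    class_sets -- one (left classes, right classes) pair per limit;
                  every class is assumed to match exactly one character
    """
    # Number of limits strictly below freq selects the tier.
    tier = bisect_left(limits, freq)
    w = re.escape(word)
    if tier == 0:
        # Rare word: the expression is the word itself.
        return [w]
    left, right = class_sets[tier - 1]
    # Frequent word: one expression per possible (a, b) combination.
    return [a + w + b for a, b in product(left, right)]


limits = [10, 100]
class_sets = [
    (["[a-m]", "[n-z]"], ["[a-z]"]),
    (["[a-f]", "[g-m]", "[n-z]"], ["[a-m]", "[n-z]"]),
]
rare = expressions_for("the", 5, limits, class_sets)
common = expressions_for("the", 50, limits, class_sets)
very_common = expressions_for("the", 500, limits, class_sets)
```

The effect is that frequent words are split across more expressions, so each expression's list of index elements stays shorter than a single list for the bare word would be.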
7. A dictionary and index creating system as defined in claim 5, wherein said sample document is made up of all or a portion of said retrieval document.
9. A dictionary and index creating system as defined in claim 5, wherein an enlarged character set is used which is prepared by adding as a terminal character one special character not included in said retrieval document, and said terminal character is added to before and after said retrieval document as occasion demands to produce an enlarged retrieval document, so that said enlarged character set is employed as a character set and said enlarged retrieval document is used as a retrieval document.
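A minimal Python sketch of the enlargement described in claim 9 follows. The function name `enlarge`, the choice of `"\x00"` as terminal character, and the `pad` parameter are assumptions made for illustration. Padding presumably lets an expression awb match a word that sits at the very start or end of the document, where no real neighbouring character exists.

```python
from typing import Set, Tuple


def enlarge(
    document: str,
    charset: Set[str],
    terminal: str = "\x00",
    pad: int = 1,
) -> Tuple[Set[str], str]:
    """Add a terminal character to the character set and pad the document.

    The terminal must not already occur in the character set or the
    document, mirroring the claim's "not included in said retrieval
    document" condition.
    """
    if terminal in charset or terminal in document:
        raise ValueError("terminal character already in use")
    enlarged_charset = charset | {terminal}
    enlarged_document = terminal * pad + document + terminal * pad
    return enlarged_charset, enlarged_document


enlarged_charset, enlarged_document = enlarge("abc", set("abc"))
```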
10. A dictionary and index creating system as defined in claim 5, further comprising means for, if a word composed of only c which is an arbitrary character in a determined character set is not included in a given word dictionary, creating an extended word dictionary by adding said word to said word dictionary, and means for creating a regular expression dictionary and a word index through the use of said extended word dictionary as said word dictionary.
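The dictionary extension of claim 10 can be sketched in Python as follows. The function name `extend_dictionary` is hypothetical, and appending the missing single-character words in sorted order is an assumption; the claim only requires that they be added.

```python
from typing import Iterable, List, Set


def extend_dictionary(words: Iterable[str], charset: Set[str]) -> List[str]:
    """Return the word list plus every single-character word it lacks."""
    original = list(words)
    present = set(original)
    # Each character c of the character set becomes a one-character word
    # unless the dictionary already registers it.
    missing = sorted(c for c in charset if c not in present)
    return original + missing


extended = extend_dictionary(["ab", "a"], {"a", "b", "c"})
```

With every single character registered as a word, any character string over the character set can be covered by dictionary words, which is presumably why the claim adds them.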
11. A dictionary and index creating system made to create a regular expression dictionary and a word index on the basis of a retrieval document undergoing retrieval, a word dictionary and a sample document, said system comprising:
a retrieval document storage unit for storing a retrieval document composed of a lineup of a finite number of characters included in a predetermined character set;
a word dictionary storage unit for storing a word dictionary in which are registered a finite number of words each being a lineup of one or more characters included in said character set;
a sample document storage unit for storing a sample document comprising a lineup of a finite number of characters included in a predetermined character set;
means for retrieving a character string matching with a word in said word dictionary from said sample document storage unit and further for creating an index element being a set of said words and a matching character positional range in said retrieval document to check whether or not said index element is deducible from other index elements and for collecting said index elements decided as being non-deducible from the other index elements to produce a first word index;
means for producing word frequency data in a manner that the number of index elements for each of a plurality of words in said first word index is handled as a word frequency;
means for reading out each word w (where w is one of n words w1, w2, w3, . . . , wn, n being a number greater than 1) from said word dictionary in said word dictionary storage unit and further for making out m sets of regular expressions a, b, where m represents a variable to establish one or more regular expression sets and depends on each word w of said words w1, . . . , wn, and where each of a, b is indicative of a set of equal length character strings having lengths which are equal to each other, except null sets on said character set, with said regular expressions a, b being determined according to a rule depending on said frequency of each word w of said words w1, . . . , wn, in said word frequency data;
a regular expression dictionary storage unit for joining said regular expressions a, b to before and after each word w of said words w1, . . . , wn, respectively, to make out m sets of regular expressions awb corresponding to each word w of said words w1, . . . , wn, and further for collecting all said regular expressions awb made out in connection with all said words in said word dictionary to produce said regular expression dictionary different from said word dictionary and for storing said regular expression dictionary;
means for retrieving a character string matching with a regular expression in said regular expression dictionary from said retrieval document storage unit and further for creating an index element comprising a set of said regular expressions and a matching character positional range in said retrieval document; and
a word index storage unit for storing a second word index made out by a collection of said index elements decided as being non-deducible from other index elements.
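In claim 11 the word frequency data is not supplied from outside but derived from a first word index: the number of index elements recorded for a word is taken as that word's frequency. A minimal Python sketch, assuming each element of the first word index is a hypothetical (word, start, end) tuple and that deducible elements have already been filtered out:

```python
from collections import Counter
from typing import Iterable, Tuple


def word_frequencies(first_index: Iterable[Tuple[str, int, int]]) -> Counter:
    """Count index elements per word; the count serves as the frequency."""
    return Counter(word for word, _start, _end in first_index)


# Hypothetical first word index built from a sample document.
first_index = [("ab", 0, 2), ("ab", 3, 5), ("c", 2, 3)]
frequencies = word_frequencies(first_index)
```

The resulting counts would then drive the frequency-dependent choice of a and b when the regular expression dictionary and the second word index are built.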
means for making out a regular expression composed of a specific word w itself if the occurrence frequency of said specific word w recorded in said word frequency data is below a first frequency limit value;
means for joining a character class a being an element in an mth left-side character class set and a character class b being an element in an mth right-side character class set to said word w to make out regular expressions awb in relation to all the possible character classes a, b if the occurrence frequency of said word w recorded in said word frequency data is higher than a mth frequency limit value but is lower than a m+1th frequency limit value; and
means for joining a character class a being an element in an Nth left-side character class set and a character class b being an element in an Nth right-side character class set to make out regular expressions awb in relation to all the possible character classes a, b if the occurrence frequency of said word w recorded in said word frequency data is higher than an N−1th frequency limit value.
13. A dictionary and index creating system as defined in claim 11, wherein an enlarged character set is used which is prepared by adding as a terminal character one special character not included in said retrieval document, and said terminal character is added to before and after said retrieval document as occasion demands to produce an enlarged retrieval document, so that said enlarged character set is employed as a character set and said enlarged retrieval document is used as a retrieval document.
14. A dictionary and index creating system as defined in claim 11, further comprising means for, if a word composed of only c which is an arbitrary character in a determined character set is not included in a given word dictionary, creating an extended word dictionary by adding said word to said word dictionary, and means for creating a regular expression dictionary and a word index through the use of said extended word dictionary as said word dictionary.
Specification