Document data processing method and apparatus for document retrieval
First Claim
1. A document data processing method for retrieving a document containing at least a search term designated by an operator from a document database registering therein document information in terms of character code data while referring to textual content of said document, comprising steps of:
- upon registration of text documents in said document database, creating condensed texts by decomposing each of textual character strings of the documents to be registered into fragmental character strings on the basis of at least one of character species including katakana character, hiragana character, kanji character, alphabetic character, numeric character, and symbol character and checking mutual inclusion relations possibly existing among said fragmental character strings resulting from said decomposition, to thereby create the condensed texts each constituted by a set of the fragmental character strings in which any character string found to be included by other character string is eliminated;
creating a component character table in which characters occurring in each of said condensed texts are registered without duplication; and
registering in said document database said condensed texts together with said component character table in addition to the texts of the document to be registered; and
upon retrieval of the document containing the designated search term, executing first a component character table search for thereby extracting those documents which contain all species of characters constituting the search term designated by the operator by consulting said component character table;
executing subsequently a condensed text search by consulting the condensed texts of the documents extracted through said component character table search for extracting only the documents corresponding to the condensed texts which contain the fragmental character strings constituting the search term designated by the operator to thereby select the documents containing the designated search term; and
executing finally a text body search for extracting a document which satisfies query condition imposed on the search term by consulting the texts of the documents extracted through said component character table search and said condensed text search.
Abstract
A high-speed full document retrieval method and system capable of returning retrieval results within a practically acceptable search time. Upon registration of documents in a document database, condensed texts are created by decomposing each textual character string of the documents to be registered into fragmental character strings according to character species and by checking mutual inclusion relations among the fragmental character strings. A component character table is created in which the characters occurring in each of the condensed texts are registered without duplication. The condensed texts and the component character table are registered in the database together with the texts of the documents to be registered. Upon retrieval of a document containing a search term designated by a user, a component character table search is first executed to extract those documents which contain all species of characters constituting the search term by consulting the component character table, and subsequently a condensed text search is executed by consulting the condensed texts of the extracted documents. Finally, a text body search is executed to extract a document which satisfies the query condition imposed on the search term by consulting the texts of the documents extracted through the component character table search and the condensed text search.
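The registration side described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the patented implementation: the function names (`species`, `decompose`, `condense`, `component_characters`) and the Unicode ranges used to tell the character species apart are assumptions introduced here, and whitespace is treated as a separator rather than as a symbol.

```python
from itertools import groupby


def species(ch: str) -> str:
    """Classify one character into a coarse character species (assumed ranges)."""
    if ch.isspace():
        return "space"
    if "\u30a0" <= ch <= "\u30ff":
        return "katakana"
    if "\u3040" <= ch <= "\u309f":
        return "hiragana"
    if "\u4e00" <= ch <= "\u9fff":
        return "kanji"
    if ch.isascii() and ch.isalpha():
        return "alphabetic"
    if ch.isascii() and ch.isdigit():
        return "numeric"
    return "symbol"


def decompose(text: str) -> list[str]:
    """Split text into maximal runs of a single character species."""
    return [
        "".join(run)
        for kind, run in groupby(text, key=species)
        if kind != "space"
    ]


def condense(text: str) -> set[str]:
    """Keep only the fragments that are not contained in another fragment."""
    unique = set(decompose(text))
    return {f for f in unique if not any(f != g and f in g for g in unique)}


def component_characters(condensed: set[str]) -> set[str]:
    """Characters occurring in the condensed text, without duplication."""
    return set("".join(condensed))
```

For a text such as "文書検索 system 検索 sys 2024", `condense` drops "検索" and "sys" because each is contained in a longer fragment, which is what makes the condensed text smaller than the original.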
143 Citations
50 Claims
-
1. A document data processing method for retrieving a document containing at least a search term designated by an operator from a document database registering therein document information in terms of character code data while referring to textual content of said document, comprising steps of:
-
upon registration of text documents in said document database, creating condensed texts by decomposing each of textual character strings of the documents to be registered into fragmental character strings on the basis of at least one of character species including katakana character, hiragana character, kanji character, alphabetic character, numeric character, and symbol character and checking mutual inclusion relations possibly existing among said fragmental character strings resulting from said decomposition, to thereby create the condensed texts each constituted by a set of the fragmental character strings in which any character string found to be included by other character string is eliminated; creating a component character table in which characters occurring in each of said condensed texts are registered without duplication; and registering in said document database said condensed texts together with said component character table in addition to the texts of the document to be registered; and upon retrieval of the document containing the designated search term, executing first a component character table search for thereby extracting those documents which contain all species of characters constituting the search term designated by the operator by consulting said component character table; executing subsequently a condensed text search by consulting the condensed texts of the documents extracted through said component character table search for extracting only the documents corresponding to the condensed texts which contain the fragmental character strings constituting the search term designated by the operator to thereby select the documents containing the designated search term; and executing finally a text body search for extracting a document which satisfies query condition imposed on the search term by consulting the texts of the documents extracted through said component character table search and said condensed text search. 
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
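The three retrieval stages recited in claim 1 can be illustrated with a small Python sketch. It is a simplified reading under stated assumptions, not the claimed implementation: the `Record` class, the `FRAGMENT` regular expression, and the `search` function are hypothetical names, the Unicode ranges are approximations of the character species, and the "query condition" of the final stage is reduced to a plain substring test.

```python
import re
from dataclasses import dataclass, field

# One alternative per character species; the ranges are assumptions.
FRAGMENT = re.compile(
    r"[A-Za-z]+|[0-9]+|[\u3040-\u309f]+|[\u30a0-\u30ff]+|[\u4e00-\u9fff]+"
    r"|[^\sA-Za-z0-9\u3040-\u30ff\u4e00-\u9fff]+"
)


def fragments(text: str) -> list[str]:
    """Decompose text into runs of a single character species."""
    return FRAGMENT.findall(text)


@dataclass
class Record:
    text: str
    condensed: set = field(init=False)
    chars: set = field(init=False)

    def __post_init__(self):
        unique = set(fragments(self.text))
        # Drop any fragment contained in another fragment.
        self.condensed = {
            f for f in unique if not any(f != g and f in g for g in unique)
        }
        # Component character table entry for this document.
        self.chars = set("".join(self.condensed))


def search(records, term):
    """Return the survivors of each stage so the narrowing is visible."""
    term_frags = fragments(term)
    term_chars = set("".join(term_frags))
    # Stage 1: component character table search.
    stage1 = [r for r in records if term_chars <= r.chars]
    # Stage 2: condensed text search on the stage-1 survivors only.
    stage2 = [
        r for r in stage1
        if all(any(f in g for g in r.condensed) for f in term_frags)
    ]
    # Stage 3: text body search on the stage-2 survivors only.
    stage3 = [r for r in stage2 if term in r.text]
    return stage1, stage2, stage3
```

Each stage consults a progressively larger structure for a progressively smaller set of documents, so the full text is scanned only for documents that the two cheaper filters could not rule out.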
-
-
9. A document data processing method for retrieving a document containing at least a search term designated by an operator from a document database registering therein document information in terms of character code data while referring to textual content of said document, comprising steps of:
-
upon registration of text documents in said document database, creating condensed texts by decomposing each of textual character strings of the documents to be registered into fragmental character strings on the basis of at least one of character species including katakana character, hiragana character, kanji character, alphabetic character, numeric character, and symbol character and checking mutual inclusion relations possibly existing among said fragmental character strings resulting from said decomposition, to thereby create the condensed texts each constituted by a set of the fragmental character strings in which any character string found to be included by other character string is eliminated; creating a component character table in which characters occurring in registered texts are registered without duplication; and registering in said document database said condensed texts together with said component character table in addition to the texts of the document to be registered; and upon retrieval of the document containing the designated search term, executing first a component character table search for thereby extracting those documents which contain all species of characters constituting the search term designated by the operator by consulting said component character table; executing subsequently a condensed text search by consulting the condensed texts of the documents extracted through said component character table search for extracting only the documents corresponding to the condensed texts which contain the fragmental character strings constituting the search term designated by the operator to thereby select the documents containing the designated search term; and executing finally a text body search for extracting a document which satisfies query condition imposed on the search term by consulting the texts of the documents extracted through said component character table search and said condensed text search.
- View Dependent Claims (10, 11, 12)
-
-
13. A document data processing method for retrieving a document containing all of plural search terms designated by an operator from a document database registering therein document information in terms of character code data while referring to textual content of said document, comprising steps of:
-
upon registration of text documents in said document database, creating condensed texts by decomposing each of textual character strings of the documents to be registered into fragmental character strings on the basis of at least one of character species including katakana character, hiragana character, kanji character, alphabetic character, numeric character and symbol character and checking mutual inclusion relations possibly existing among said fragmental character strings resulting from said decomposition, to thereby create the condensed texts each constituted by a set of the fragmental character strings in which any character string found to be included by other character string is eliminated; creating a component character table in which characters occurring in registered texts are registered without duplication; and registering in said document database said condensed texts together with said component character table in addition to the texts of the document to be registered; and upon retrieval of the document containing the designated search term, executing first a component character table search for thereby extracting those documents which contain all species of characters constituting each of said search terms designated by the operator by consulting said component character table; executing subsequently a condensed text search by consulting the condensed texts of the documents extracted through said component character table search for extracting only the documents corresponding to the condensed texts which contain all the fragmental character strings constituting each of said search terms designated by the operator to thereby select the documents containing the designated search terms; and executing finally a text body search for extracting a document which satisfies query condition imposed on said search terms such as positional relation thereof in the text by consulting the texts of the documents extracted through said component character table search and said condensed text search.
-
-
14. A document data processing method for retrieving a document containing any one of search terms designated by an operator from a document database registering therein document information in terms of character code data while referring to textual content of said document, comprising steps of:
-
upon registration of text documents in said document database, creating condensed texts by decomposing each of textual character strings of the documents to be registered into fragmental character strings on the basis of at least one of character species including katakana character, hiragana character, kanji character, alphabetic character, numeric character and symbol character and checking mutual inclusion relations possibly existing among said fragmental character strings resulting from said decomposition, to thereby create the condensed texts each constituted by a set of the fragmental character strings in which any character string found to be included by other character string is eliminated; creating a component character table in which characters occurring in registered texts are registered without duplication; and registering in said document database said condensed texts together with said component character table in addition to the texts of the document to be registered; and upon retrieval of the document containing the designated search term, executing first a component character table search for thereby extracting those documents which contain all species of characters constituting any one of said search terms designated by the operator by consulting said component character table; executing subsequently a condensed text search by consulting the condensed texts of the documents extracted through said component character table search for extracting only the documents corresponding to the condensed texts which contain all the fragmental character strings constituting any one of said search terms designated by the operator to thereby select the documents containing the designated search terms; and executing finally a text body search for extracting a document which satisfies query condition imposed on said search terms by consulting the texts of the documents extracted through said component character table search and said condensed text search.
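Claims 13 and 14 differ only in how several search terms are combined: claim 13 requires every term to pass each stage, claim 14 requires any one of them to pass. A minimal Python sketch of that difference at the component character table stage follows. The function name `table_search`, the `require_all` flag, and the dictionary layout of the table are assumptions made for illustration.

```python
def table_search(tables, terms, require_all):
    """Component character table search over several search terms.

    tables      maps a document id to the set of characters registered for it
    terms       is the list of search terms designated by the operator
    require_all selects the claim 13 reading (True) or the claim 14 reading (False)
    """
    combine = all if require_all else any
    return [
        doc_id
        for doc_id, chars in tables.items()
        # A term passes when every one of its characters is in the table entry.
        if combine(set(term) <= chars for term in terms)
    ]
```

With a table holding the characters of "文書検索", "画像検索" and "音声認識", the terms "文書" and "検索" select only the first document when all terms are required, and the first two when any one term suffices.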
-
-
15. A document data processing method for retrieving a document containing at least a search term designated by an operator from a document database registering therein document information in terms of character code data while referring to textual content of said document, comprising steps of:
-
upon registration of text documents in said document database, creating condensed texts by decomposing each of textual character strings of the documents to be registered into fragmental character strings in dependence on character species each of the fragmental character strings being able to include one of katakana character string, hiragana character string, kanji character string, alphabetic character string, numeric character string and symbol character string, and checking mutual inclusion relations possibly existing among said fragmental character strings resulting from said decomposition, while checking said hiragana character string by consulting a basic word dictionary and conjunction rules as to whether said hiragana character string represents a succession of subsidiary words having semantically no meaning as the search term, to thereby create the condensed texts each constituted by a set of the fragmental character strings in which any character string found to be included by other character string and any hiragana character string found to be a succession of the semantically meaningless subsidiary words are excluded; creating a component character table in which characters occurring in registered texts are registered without duplication; and registering in said document database said condensed texts together with said component character table in addition to the texts of the document to be registered; and upon retrieval of the document containing the designated search term, executing first a component character table search for thereby extracting those documents which contain all species of characters constituting the search term designated by the operator by consulting said component character table; executing subsequently a condensed text search by consulting the condensed texts of the documents extracted through said component character table search for extracting only the documents corresponding to the condensed texts which contain the fragmental character strings constituting the search term unless said fragmental character strings have been determined to be a succession of semantically meaningless words as the search term after the check of said fragmental character strings by using the basic word dictionary and the conjunction rules; and executing finally a text body search for extracting a document which satisfies query condition imposed on the search term by consulting the texts of the documents extracted through said component character table search and said condensed text search while consulting the registered texts of the documents extracted through said component character table search when any one of said fragmental character strings has been determined to be a succession of the semantically meaningless words, for thereby extracting a document which contains each of the fragmental character strings and which satisfies the retrieval condition imposed on the search term concerning the positional relation thereof.
- View Dependent Claims (16)
-
-
17. A document data processing method for retrieving a document containing at least a search term designated by an operator from a document database registering therein document information in terms of character code data while referring to textual content of said document, comprising steps of:
-
upon registration of text documents in said document database, creating condensed texts by decomposing each of textual character strings of the documents to be registered into fragmental character strings in dependence on character species each of the fragmental character strings being able to include one of katakana character string, hiragana character string, kanji character string, alphabetic character string, numeric character string and symbol character string, and checking mutual inclusion relations possibly existing among said fragmental character strings resulting from said decomposition, after having eliminated all the hiragana character strings, to thereby create the condensed texts each constituted by a set of the fragmental character strings in which any character string found to be included by other character string is excluded; creating a component character table in which characters occurring in registered texts are registered without duplication; and registering in said document database a plurality of said condensed texts corresponding to said character species, respectively, together with said component character table in addition to the texts of the documents to be registered; and upon retrieval of the document containing the designated search term, executing first a component character table search for thereby extracting those documents which contain all species of characters constituting the search term designated by the operator by consulting said component character table; executing subsequently a condensed text search by consulting the condensed texts of the documents extracted through said component character table search for extracting only the documents corresponding to the condensed texts which contain the fragmental character strings constituting the search term designated by the operator provided that said fragmental character strings constituting the search term designated by the operator have been determined as including none of the hiragana character strings as a result of a corresponding decision step; and executing finally a text body search for extracting a document which satisfies query condition imposed on the search term by consulting the texts of the documents extracted or alternatively for extracting a document containing the designated fragmental character strings and satisfying said query condition by consulting the original text of the document extracted through said component character table search.
- View Dependent Claims (18)
-
-
19. A document data processing method for retrieving a document containing at least a search term designated by an operator from a document database registering therein document information in terms of character code data while referring to textual content of said document, comprising steps of:
-
upon registration of text documents in said document database, creating condensed texts by decomposing each of textual character strings of the documents to be registered into fragmental character strings on the basis of at least one of character species including katakana character, hiragana character, kanji character, alphabetic character, numeric character and symbol character and checking mutual inclusion relations possibly existing among said fragmental character strings resulting from said decomposition, to thereby create a plurality of condensed texts separately on a character species basis, each of said condensed texts being constituted by the fragmental character strings of a same character species while excluding any character string found to be included by other character string; creating a component character table describing the species of the characters occurring in registered texts; registering in said document database said plurality of character-species based condensed texts together with said component character table in addition to the text of the document to be registered; and upon retrieval of the document containing the designated search term, executing first a component character table search for thereby extracting those documents which contain all the species of characters constituting the search term designated by the operator by consulting said component character table; executing subsequently a condensed text search by consulting the condensed text corresponding to the character species of the fragmental character strings constituting the search term designated by the operator in the documents extracted through said component character table search for extracting only the documents corresponding to the condensed texts which contain the fragmental character strings constituting the search term designated by the operator to thereby select the documents containing the designated search term; and executing finally a text body search for extracting a document which satisfies query condition imposed on the search term by consulting the texts of the documents extracted through said component character table search and said condensed text search.
- View Dependent Claims (20)
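Claim 19 keeps one condensed text per character species, so that the condensed text search only has to read the species that actually occur in the search term. The Python sketch below illustrates that idea under stated assumptions: `SPECIES`, `split_by_species`, `condensed_by_species`, and `condensed_search` are hypothetical names, and the Unicode ranges are approximations of the character species listed in the claim.

```python
import re
from collections import defaultdict

# Named groups let each fragment report its own character species.
SPECIES = re.compile(
    r"(?P<katakana>[\u30a0-\u30ff]+)"
    r"|(?P<hiragana>[\u3040-\u309f]+)"
    r"|(?P<kanji>[\u4e00-\u9fff]+)"
    r"|(?P<alphabetic>[A-Za-z]+)"
    r"|(?P<numeric>[0-9]+)"
    r"|(?P<symbol>[^\sA-Za-z0-9\u3040-\u30ff\u4e00-\u9fff]+)"
)


def split_by_species(text):
    """Return (species, fragment) pairs in order of appearance."""
    return [(m.lastgroup, m.group()) for m in SPECIES.finditer(text)]


def condensed_by_species(text):
    """Build one condensed text (a set of fragments) per character species."""
    buckets = defaultdict(set)
    for kind, frag in split_by_species(text):
        buckets[kind].add(frag)
    # Within each species, drop fragments contained in another fragment.
    return {
        kind: {f for f in frags if not any(f != g and f in g for g in frags)}
        for kind, frags in buckets.items()
    }


def condensed_search(per_species, term):
    """Check each fragment of the term against the condensed text of its species only."""
    return all(
        any(frag in g for g in per_species.get(kind, ()))
        for kind, frag in split_by_species(term)
    )
```

A hit at this stage is still only a candidate: the term "データ検索" passes against a document whose fragments are "データベース" and "検索" even though the two parts are not adjacent in the text, which is why the text body search remains the final step.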
-
-
21. A document data processing method for retrieving a document containing a search term designated by an operator from a document database registering therein document information in terms of character code data while referring to textual content of said document, comprising steps of:
-
upon registration of text documents in said document database, creating condensed texts by decomposing each of textual character strings of the documents to be registered into fragmental character strings on the basis of at least one of character species including katakana character, hiragana character, kanji character, alphabetic character, numeric character and symbol character and checking mutual inclusion relations possibly existing among said fragmental character strings resulting from said decomposition, to thereby create the condensed texts each constituted by a set of the fragmental character strings in which any character string found to be included by other character string is eliminated; creating a component character table in which characters occurring in each of said condensed texts are registered without duplication; and registering in said document database said condensed texts together with said component character table in addition to the text of the document to be registered; and upon retrieval of the document containing the designated search term, executing first a component character table search for thereby extracting those documents which contain all species of characters constituting the search term designated by the operator by consulting said component character table; and executing subsequently a condensed text search by consulting the condensed texts of the documents extracted through said component character table search for thereby extracting only the documents corresponding to the condensed texts which contain the fragmental character strings constituting the search term designated by the operator to thereby extract the documents containing the designated search term; creating a component character table in which characters occurring in texts are registered without duplication; and registering in said document database said component character table in addition to the texts of the documents to be registered; and upon retrieval of the document containing the designated search term, executing first a component character table search for thereby extracting those documents which contain all species of characters constituting the search term designated by the operator by consulting said component character table; and executing subsequently a text body search by consulting the texts of the documents extracted through said component character table search for thereby extracting only the document which contains the designated search term and which satisfies query condition imposed on the search term such as positional relation thereof in the text, whereby a full text retrieval is carried out at an equivalently increased speed.
-
-
22. A document data processing method for retrieving a document containing a search term designated by an operator from a document database registering therein document information in terms of character code data while referring to textual content of said document, comprising the steps of:
-
upon registration of text documents in said document database, creating a component character table in which characters occurring in texts are registered without duplication; and registering in said document database said component character table in addition to the texts of the documents to be registered; and upon retrieval of the document containing the designated search term, executing first a component character table search for thereby extracting those documents which contain all species of characters constituting the search term designated by the operator by consulting said component character table; and executing subsequently a text body search by consulting the texts of the documents extracted through said component character table search for thereby extracting only the document which contains the designated search term and which satisfies query condition imposed on the search term.
-
-
23. A document data processing method for retrieving a document containing at least a search term designated by an operator from a document database registering therein document information in terms of character code data while referring to textual content of said document, comprising steps of:
-
upon registration of text documents in said document database, creating condensed texts by decomposing each of textual character strings of the documents to be registered into fragmental character strings on the basis of at least one of character species including katakana character, hiragana character, kanji character, alphabetic character, numeric character and symbol character and checking mutual inclusion relations possibly existing among said fragmental character strings resulting from said decomposition, to thereby create the condensed texts each constituted by a set of the fragmental character strings in which any character string found to be included by other character string is eliminated; and registering in said document database said condensed texts in addition to the texts of the documents to be registered; and upon retrieval of the document containing the designated search term, executing a condensed text search by consulting the condensed texts of the documents for extracting only the documents corresponding to the condensed texts which contain the fragmental character strings constituting the search term designated by the operator to thereby select the documents containing the designated search term; and executing a text body search for extracting a document which satisfies query condition imposed on the search term by consulting the texts of the documents extracted through said condensed text search.
- View Dependent Claims (24)
-
-
25. A document data processing system for retrieving a document containing a search term designated by an operator from a document database registering therein document information in terms of character code data while referring to textual content of said document, comprising:
-
for registration of text documents in said document database, means for creating condensed texts by decomposing each of textual character strings of the documents to be registered into fragmental character strings on the basis of at least one of character species including hiragana character, kanji character, alphabetic character, numeric character and symbol character and checking mutual inclusion relations possibly existing among said fragmental character strings resulting from said decomposition, to thereby create the condensed texts each constituted by a set of the fragmental character strings in which any character string found to be included by other character string is eliminated; means for creating a component character table in which characters occurring in each of said condensed texts are registered without duplication; and means for registering in said document database said condensed texts together with said component character table in addition to the texts of the documents to be registered; and for document retrieval, component character table search means for extracting those documents which contain all species of characters constituting the search term designated by the operator by consulting said component character table; condensed text search means for extracting only the documents corresponding to the condensed texts which contain the fragmental character strings constituting the search term designated by the operator by consulting the condensed texts of the documents extracted through the component character table search; and text body search means for extracting a document which satisfies query condition imposed on the search term by consulting the texts of the documents extracted.
-
-
26. A document data processing system for retrieving a document containing a search term designated by an operator from a document database registering therein document information in terms of character code data while referring to textual content of said document, comprising:
-
for registration of text documents in said document database, means for creating condensed texts by decomposing each of textual character strings of the documents to be registered into fragmental character strings on the basis of at least one of character species including hiragana character, kanji character, alphabetic character, numeric character and symbol character and checking mutual inclusion relations possibly existing among said fragmental character strings resulting from said decomposition, to thereby create the condensed texts each constituted by a set of the fragmental character strings in which any character string found to be included by other character string is eliminated; means for creating a component character table in which characters occurring in each of said condensed texts are registered without duplication; means for registering in said document database said condensed texts together with said component character table in addition to the texts of the documents to be registered; and means for storing the condensed text data in a RAM disk while storing the component character table in a semiconductor memory; and for document retrieval, component character table search means for extracting those documents which contain all species of characters constituting the search term designated by the operator by consulting said component character table; condensed text search means for extracting only the documents corresponding to the condensed texts which contain the fragmental character strings constituting the search term designated by the operator by consulting the condensed texts of the documents extracted through the component character table search; and text body search means for extracting a document which satisfies query condition imposed on the search term by consulting the texts of the documents extracted.
-
-
27. A document data processing system for retrieving a document containing a search term designated by an operator from a document database registering therein document information in terms of character code data while referring to textual content of said document, comprising:
-
for registration of text documents in said document database, means for creating condensed texts by decomposing each of textual character strings of the documents to be registered into fragmental character strings on the basis of at least one of character species including hiragana character, kanji character, alphabetic character, numeric character and symbol character and checking mutual inclusion relations possibly existing among said fragmental character strings resulting from said decomposition, to thereby create the condensed texts each constituted by a set of the fragmental character strings in which any character string found to be included by other character string is eliminated; means for creating a component character table in which characters occurring in each of said condensed texts are registered without duplication; and means for registering in said document database said condensed texts together with said component character table in addition to the texts of the documents to be registered and storing the text data and the condensed text data in a magnetic disk while storing said component character table in a semiconductor memory; and for document retrieval, component character table search means for extracting those documents which contain all species of characters constituting the search term designated by the operator by consulting said component character table; means for checking the number of the documents extracted through the component character table search; condensed text search means for reading out all of said condensed texts by neglecting the result of the component character table search, when said number of said extracted documents has attained a predetermined number, to thereby extract only the documents corresponding to the condensed texts which contain the fragmental character strings constituting the search term designated by the operator, while consulting the condensed texts of the documents extracted through said component character table search to thereby extract only the documents corresponding to the condensed text containing the fragmental character strings which constitute the search term designated by the operator, when said number of said extracted documents is smaller than said predetermined number; and text body search means for extracting a document which satisfies query condition imposed on the search term by consulting the texts of the documents extracted.
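The switching behaviour recited in claim 27 can be sketched as follows. This is a hedged illustration only: the function name, the dictionary-based tables, and the returned count of condensed texts read are hypothetical, and the sketch assumes that a full scan is preferred once the candidate count reaches the predetermined number.

```python
def condensed_text_search(term_fragments, char_table, condensed, threshold):
    """Return (matching document ids, number of condensed texts read)."""
    term_chars = set("".join(term_fragments))
    # Component character table search: every character of the term must occur.
    candidates = [d for d, chars in char_table.items() if term_chars <= chars]
    if len(candidates) >= threshold:
        scope = list(condensed)   # neglect the table result and read all condensed texts
    else:
        scope = candidates        # read only the condensed texts of the candidates
    hits = [d for d in scope
            if all(any(f in frag for frag in condensed[d]) for f in term_fragments)]
    return hits, len(scope)

condensed = {"d1": ["abc", "xyz"], "d2": ["bca"], "d3": ["xyz"]}
char_table = {d: set("".join(frags)) for d, frags in condensed.items()}
print(condensed_text_search(["abc"], char_table, condensed, threshold=2))
print(condensed_text_search(["abc"], char_table, condensed, threshold=10))
```

Both calls select the same document; only the number of condensed texts read differs, since the component character table is a necessary condition rather than a sufficient one.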
-
-
28. A document data processing system for retrieving a document containing a search term designated by an operator from a document database registering therein document information in terms of character code data while referring to textual content of said document, comprising:
-
for registration of text documents in said document database, means for creating condensed texts by decomposing each of textual character strings of the documents to be registered into fragmental character strings on the basis of at least one of character species including hiragana character, kanji character, alphabetic character, numeric character and symbol character and checking mutual inclusion relations possibly existing among said fragmental character strings resulting from said decomposition, to thereby create the condensed texts each constituted by a set of the fragmental character strings in which any character string found to be included by other character string is eliminated; means for creating a component character table in which characters occurring in each of said condensed texts are registered without duplication; and means for registering in said document database said condensed texts together with said component character table in addition to the texts of the documents to be registered and storing the text data and the condensed text data in a magnetic disk while storing said component character table in a semiconductor memory; and for document retrieval, component character table search means for extracting those documents which contain all species of characters constituting the search term designated by the operator by consulting said component character table; means for checking the number of the documents extracted through the component character table search; condensed text search means for reading out all of said condensed texts by neglecting the result of the component character table search only when said number of said extracted documents has attained a predetermined number, to thereby extract only the documents corresponding to the condensed texts which contain the fragmental character strings constituting the search term designated by the operator; and text body search means for extracting a document which satisfies query condition imposed on the search term by consulting the texts of the documents extracted, while consulting the condensed texts of the documents extracted through said component character table search to thereby extract only the document corresponding to the condensed text containing the fragmental character strings which constitute the search term designated by the operator, when said number of said extracted documents is smaller than said predetermined number.
-
-
29. A document data processing method for retrieving a document containing at least a search term designated by an operator from a document database registering therein document information in terms of character code data while referring to textual content of said document, comprising steps of:
-
upon registration of text documents in said document database, creating condensed texts by decomposing each of textual character strings of the documents to be registered into fragmental character strings on the basis of at least one of character species including hiragana character, katakana character, kanji character, alphabetic character, numeric character and symbol character and checking mutual inclusion relations possibly existing among said fragmental character strings resulting from said decomposition, to thereby create the condensed texts each constituted by a set of the fragmental character strings in which any character string found to be included by other character string is eliminated; creating a concatenated component character table by preparing, for each of the documents, information of all usable character strings each composed of at least two characters, said information including first information indicating those character strings which are used in the document to be registered and second information indicating those character strings unused in the document to be registered; and registering in said document database said condensed texts together with said concatenated component character table in addition to the texts of the document to be registered; and upon retrieval of the document containing the designated search term, executing a component character table search for extracting all the documents in which all the character strings contained in the search term designated by the operator and each composed of at least two characters are used, by consulting said concatenated component character table; executing a condensed text search by consulting the condensed texts corresponding to the documents extracted through said component character table search for thereby extracting only the documents which contain the fragmental character strings constituting the search term designated by the operator; and executing finally a text body search for 
extracting a document from the documents selected through said condensed text search which document satisfies query condition imposed on the search term by consulting the texts of the documents extracted through said concatenated component character table search and said condensed text search. - View Dependent Claims (30, 31, 32, 33, 34, 35, 36, 37, 38, 39)
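The concatenated component character table of claim 29 can be pictured as one bit per usable two-character string for each document, with 1 standing for "used" (the first information) and 0 for "unused" (the second information). The following minimal sketch assumes a hypothetical three-letter alphabet and hypothetical function names; it is not drawn from the specification.

```python
from itertools import product

ALPHABET = "abc"  # hypothetical tiny alphabet, for illustration only
BIGRAMS = ["".join(p) for p in product(ALPHABET, repeat=2)]  # all usable 2-character strings

def bigram_table(text):
    """One bit per usable bigram: 1 if used in the document, 0 if unused."""
    used = {text[i:i + 2] for i in range(len(text) - 1)}
    return [1 if b in used else 0 for b in BIGRAMS]

def table_search(term, tables):
    """Extract documents in which every bigram of the search term is used."""
    needed = [BIGRAMS.index(term[i:i + 2]) for i in range(len(term) - 1)]
    return [doc for doc, bits in tables.items() if all(bits[j] for j in needed)]

tables = {"d1": bigram_table("abca"), "d2": bigram_table("acb")}
print(tables["d1"])                 # [0, 1, 0, 0, 0, 1, 1, 0, 0]
print(table_search("bca", tables))  # ['d1']
print(table_search("cab", tables))  # ['d1'], although "cab" does not occur in "abca"
```

The last call shows why the claim follows the table search with a condensed text search and a text body search: the table only narrows the candidates and can admit documents that use the right bigrams in the wrong arrangement.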
-
-
40. A document data processing method for retrieving a document containing at least a search term designated by an operator from a document database registering therein document information in terms of character code data while referring to textual content of said document, comprising steps of:
-
upon registration of text documents in said document database, creating condensed texts by decomposing each of textual character strings of the documents to be registered into fragmental character strings on the basis of at least one of character species including hiragana character, katakana character, kanji character, alphabetic character, numeric character and other symbol character and checking mutual inclusion relations possibly existing among said fragmental character strings resulting from said decomposition, to thereby create the condensed texts each constituted by a set of the fragmental character strings in which any character string found to be included by other character string is eliminated; creating a single component character table and a concatenated component character table by preparing, for each of the documents, information of all usable single characters and character strings each composed of at least two characters, said information including first information indicating those single-character and character strings which are used in the document to be registered and second information indicating those single-character and character strings unused in the document to be registered, respectively; and registering in said document database said condensed texts together with said concatenated component character table in addition to the texts of the document to be registered; and upon retrieval of the document containing the designated search term, executing a component character table search for extracting all the documents in which all the character strings contained in the search term designated by the operator and each composed of at least two characters are used, by consulting said concatenated component character table; executing a condensed text search by consulting the condensed texts corresponding to the documents extracted through said component character table search for thereby extracting only the documents which contain the fragmental 
character strings constituting the search term designated by the operator; and executing finally a text body search for extracting a document from the documents selected through said condensed text search which document satisfies query condition imposed on the search term by consulting the texts of the documents extracted through said concatenated component character table search and said condensed text search. - View Dependent Claims (41, 42)
-
-
43. A text data creating method for creating a text database for storing document information as character code data, comprising steps of:
-
(1) fetching text data; (2) determining frequencies at which individual character strings each constituted by a predetermined number n of characters are used in the text data and rearraying said character strings in a sequential order in dependence on said frequencies; (3) establishing correspondences between said character strings and a number of entries which is smaller than the number of said character strings and storing said correspondences in the form of a hash table; and (4) storing at the entry corresponding to the character strings used in said text data said character strings in the form of a component character table.
-
-
44. A full text retrieval method for retrieving a document containing a search term designated by an operator from a text data database registering therein document information as character code data while referring to textual content of said document, comprising steps of:
-
(1) fetching text data; (2) determining frequencies at which individual character strings each constituted by a predetermined number n of characters are used in the text data and rearraying said character strings in a sequential order in dependence on said frequencies; (3) establishing correspondences between said character strings and a number of entries which is smaller than the number of said character strings and storing said correspondences in the form of a hash table; (4) storing at the entry corresponding to the character strings used in said text data said character strings in the form of a component character table; (5) decomposing the search term designated by the operator into fragmental character strings each composed of n characters; (6) extracting from said component character table those entries which correspond to said fragmental character strings resulting from said decomposition; and (7) retrieving said document in which all the character strings constituting said search terms exist, by consulting the entries extracted from said component character table.
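Steps (1) to (7) of claim 44 can be sketched in Python as below. The frequency-ranked mapping used here (the most frequent n-character strings receive their own entry and all remaining strings share the last entry) is only one assumed way of obtaining fewer entries than character strings; the claim does not fix the hash function, and all names are hypothetical.

```python
from collections import Counter

def ngrams(text, n=2):
    """All character strings of n characters occurring in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_hash_table(docs, entries, n=2):
    """Steps (2)-(3): count frequencies, order by frequency, map strings to entries."""
    freq = Counter(g for text in docs.values() for g in ngrams(text, n))
    ranked = sorted(freq, key=lambda g: (-freq[g], g))
    # Assumed mapping: rarer strings collide in the final entry.
    return {g: min(rank, entries - 1) for rank, g in enumerate(ranked)}

def build_table(docs, hash_table, entries, n=2):
    """Step (4): record, per entry, which documents use a string mapped to it."""
    table = [set() for _ in range(entries)]
    for doc_id, text in docs.items():
        for g in ngrams(text, n):
            table[hash_table[g]].add(doc_id)
    return table

def retrieve(term, hash_table, table, n=2):
    """Steps (5)-(7): decompose the term, fetch its entries, intersect them."""
    result = None
    for g in ngrams(term, n):
        if g not in hash_table:
            return set()          # string never registered, so no document matches
        docs = table[hash_table[g]]
        result = docs if result is None else result & docs
    return result or set()

docs = {1: "abcab", 2: "bcd", 3: "abd"}
hash_table = build_hash_table(docs, entries=3)
table = build_table(docs, hash_table, entries=3)
print(retrieve("abc", hash_table, table))  # {1}
print(retrieve("bcd", hash_table, table))  # {1, 2}
```

Document 1 appears in the second result only because "cd" shares an entry with "ca"; this is the cost of using fewer entries than character strings, and a later search of the text itself would be needed to discard such candidates.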
-
-
45. A document data processing system for retrieving a document containing a search term designated by an operator from a document database registering therein document information in terms of character code data while referring to textual content of said document, comprising:
-
for registration of text documents in said document database, means for registering texts of documents to be registered; means for creating condensed texts by decomposing each of textual character strings of the documents to be registered into fragmental character strings on the basis of at least one of character species including hiragana character, katakana character, kanji character, alphabetic character, numeric character and symbol character and checking mutual inclusion relations possibly existing among said fragmental character strings resulting from said decomposition, to thereby create and register the condensed texts each constituted by a set of the fragmental character strings in which any character string found to be included by other character string is eliminated; and means for creating a concatenated component character table by preparing, for each of the documents, information of all usable character strings each composed of at least two characters, said information including first information indicating those character strings which are used in the document to be registered and second information indicating those character strings unused in the document to be registered and registering said concatenated component character table in said database; and for retrieval of the document containing the designated search term, component character table search means for extracting all the documents in which all the character strings contained in the search term designated by the operator and each composed of at least two characters are used, by consulting said concatenated component character table; condensed text search means for executing a condensed text search by consulting the condensed texts corresponding to the documents extracted through said component character table search for thereby extracting only the documents which contain the fragmental character strings constituting the search term designated by the operator; and text body search means for 
executing a text body search for extracting a document from the documents selected through said condensed text search which document satisfies query condition imposed on the search term by consulting the texts of the documents extracted through said concatenated component character table search and said condensed text search.
-
-
46. A document data processing system for retrieving a document containing a search term designated by an operator from a document database registering therein document information in terms of character code data while referring to textual content of said document, comprising:
-
for registration of text documents in said document database, means for registering texts of documents to be registered; means for creating condensed texts by decomposing each of textual character strings of the documents to be registered into fragmental character strings on the basis of at least one of character species including hiragana character, katakana character, kanji character, alphabetic character, numeric character and symbol character and checking mutual inclusion relations possibly existing among said fragmental character strings resulting from said decomposition, to thereby create and register the condensed texts each constituted by a set of the fragmental character strings in which any character string found to be included by other character string is eliminated; means for creating a hash table by checking frequencies at which said fragmental character strings are used, determining a hash function on the basis of the frequency information and mapping said fragmental character strings to a bit list having entries in a number smaller than that of combinations of actually used character; and means for creating a concatenated component character table by preparing, for each of the documents, information of all usable character strings each composed of at least two characters by consulting said hash table, said information including first information indicating those character strings which are used in the document to be registered and second information indicating those character strings unused in the document to be registered and registering said concatenated component character table in said database; and for retrieval of the document containing the designated search term, component character table search means for extracting all the documents in which all the character strings contained in the search term designated by the operator and each composed of at least two characters are used, by consulting said concatenated component character table; condensed text search means for executing a condensed text search by consulting the condensed texts corresponding to the documents extracted through said component character table search for thereby extracting only the documents which contain the fragmental character strings constituting the search term designated by the operator; and text body search means for executing a text body search for extracting a document from the documents selected through said condensed text search which document satisfies query condition imposed on the search term by consulting the texts of the documents extracted through said concatenated component character table search and said condensed text search.
-
-
47. An index creating apparatus, comprising:
-
means for fetching data for retrieval; counting means for determining frequencies at which characters contained in said data for retrieval are used; sorting means for rearraying said characters in the order of frequencies at which said characters are used; means for establishing correspondences between said characters and a number of bits, respectively, said bit number being smaller than that of said characters; means for converting character codes of said characters to the corresponding bits; and means for manipulating said bits on a bit-by-bit basis.
-
-
48. A document retrieval apparatus, comprising:
-
input means for inputting a search term; means for extracting bit lists corresponding to character strings constituting said search term from a component character table; means for logically ANDing said bit lists; and means for transforming result of said ANDing operation into a document identifier affixed to a document.
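The retrieval flow of claim 48 (extract bit lists, AND them, convert the result to document identifiers) can be sketched with Python integers serving as bit lists. The table layout, in which each two-character string maps to an integer whose bit at position p marks the p-th registered document, and all names are assumptions made for illustration.

```python
def build_bit_lists(docs, n=2):
    """Build a component character table mapping each n-character string to a bit list."""
    doc_ids = list(docs)
    bit_lists = {}
    for pos, doc_id in enumerate(doc_ids):
        text = docs[doc_id]
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            bit_lists[gram] = bit_lists.get(gram, 0) | (1 << pos)
    return doc_ids, bit_lists

def search(term, doc_ids, bit_lists, n=2):
    """AND the bit lists of the term's strings, then map set bits to identifiers."""
    grams = [term[i:i + n] for i in range(len(term) - n + 1)]
    if not grams:
        return []
    bits = (1 << len(doc_ids)) - 1          # start with every document selected
    for gram in grams:
        bits &= bit_lists.get(gram, 0)      # logical AND of the extracted bit lists
    return [doc_ids[pos] for pos in range(len(doc_ids)) if (bits >> pos) & 1]

docs = {"D001": "abcab", "D002": "bcd", "D003": "abd"}
doc_ids, bit_lists = build_bit_lists(docs)
print(search("abc", doc_ids, bit_lists))  # ['D001']
print(search("bc", doc_ids, bit_lists))   # ['D001', 'D002']
```

Holding one bit per document keeps the AND step to a single integer operation per character string, which is the appeal of the bit-list representation.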
-
-
49. A document data processing method for retrieving a document containing a search term designated by an operator from a document database registering therein document information in terms of character code data while referring to the textual content of said document, comprising steps of:
-
upon registration of text documents in said document database, creating a concatenated component character table in which character strings, each being constituted with n-characters (n≥2) and occurring in the text documents, are registered without duplication for each of the text documents, and registering in said document database said component character table in addition to the texts of the documents to be registered; and upon retrieval of a document containing the designated search term, executing first a component character table search for thereby extracting those documents which contain all species of characters constituting the search term designated by the operator by consulting said concatenated component character table; and executing subsequently a text body search by consulting the texts of the documents extracted through said component character table search for thereby extracting only the document which contains the designated search term and which satisfies a query condition imposed on the search term. - View Dependent Claims (50)
-
Specification