Apparatus for retrieving similar documents and apparatus for extracting relevant keywords
Abstract
Three kinds of data are produced: a keyword frequency-of-occurrence, a document length, and a keyword weight. A document profile vector and a keyword profile vector are then calculated. A weighted principal component analysis that takes the document length and the keyword weight into account is performed independently on each set of profile vectors, yielding a document feature vector and a keyword feature vector. Finally, the documents and keywords with the highest similarity to the feature vectors calculated from the retrieval and extracting conditions are obtained and displayed.
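As a rough illustration of the first two kinds of data named in the abstract, the following sketch counts keyword occurrences per document and derives a document length. It assumes that documents arrive already split into keywords and that the length of a document is its total number of keyword occurrences; neither assumption is stated in the text above. The function name `frequency_and_length` is hypothetical, and the keyword weight is left out because its formula is not reproduced in this listing.

```python
from collections import Counter
from typing import List, Tuple

def frequency_and_length(docs: List[List[str]]) -> Tuple[List[Counter], List[int]]:
    """Return per-document keyword counts F and document lengths L.

    Assumption: the length l_d is the total number of keyword
    occurrences in document d (the source does not define it here).
    """
    F = [Counter(tokens) for tokens in docs]
    L = [sum(counts.values()) for counts in F]
    return F, L

# Hypothetical toy corpus of pre-tokenized documents.
docs = [["patent", "claim", "claim"], ["vector", "claim"]]
F, L = frequency_and_length(docs)
```

Here `F[d][t]` plays the role of fdt and `L[d]` the role of ld.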
52 Claims
-
1. A similar document retrieving apparatus applicable to a document database D which stores N document data containing a total of M kinds of keywords and is machine processible, for designating a retrieval condition consisting of a document group including at least one document x1, - - - , xr selected from said document database D and for retrieving documents similar to said document group of said retrieval condition from said document database D, said similar document retrieving apparatus comprising:
-
keyword frequency-of-occurrence calculating means for calculating a keyword frequency-of-occurrence data F which represents a frequency-of-occurrence fdt of each keyword t appearing in each document d stored in said document database D;
document length calculating means for calculating a document length data L which represents a length ld of said each document d;
keyword weight calculating means for calculating a keyword weight data W which represents a weight wt of each keyword t of said M kinds of keywords appearing in said document database D;
document profile vector producing means for producing a M-dimensional document profile vector Pd having components respectively representing a relative frequency-of-occurrence pdt of each keyword t in the concerned document d;
document principal component analyzing means for performing a principal component analysis on a document profile vector group of a document group in said document database D and for obtaining a predefined (K)-dimensional document feature vector Ud corresponding to said document profile vector Pd for said each document d; and
similar document retrieving means for receiving said retrieval condition consisting of the document group including at least one document x1, - - - , xr selected from said document database D, calculating a similarity between each document d and said retrieval condition based on a document feature vector of said received document group and the document feature vector of each document d in said document database D, and outputting a designated number of similar documents in order of the calculated similarity.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
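The chain of means recited in claim 1 can be sketched end to end as follows. This is a minimal illustration under several assumptions that the claim does not fix: the principal component analysis is plain and unweighted (the abstract mentions a weighted variant), the feature vector of a document group is taken as the mean of its members' feature vectors, and similarity is measured by cosine. All function names and the toy count matrix are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]

def profiles(F: Matrix) -> Matrix:
    """Row-normalize counts into profile vectors: p_dt = f_dt / sum_j f_dj."""
    out = []
    for row in F:
        total = sum(row) or 1.0  # guard against empty documents
        out.append([f / total for f in row])
    return out

def center(P: Matrix) -> Matrix:
    """Subtract the column means so the analysis sees zero-mean data."""
    means = [sum(col) / len(P) for col in zip(*P)]
    return [[x - mu for x, mu in zip(row, means)] for row in P]

def top_components(X: Matrix, k: int, iters: int = 500) -> Matrix:
    """Leading k eigenvectors of X^T X by power iteration with deflation."""
    m = len(X[0])
    C = [[sum(r[i] * r[j] for r in X) for j in range(m)] for i in range(m)]
    comps: Matrix = []
    for c in range(k):
        # Deterministic, non-uniform start vector (a random start is more usual).
        v = [1.0 / (i + 1 + c) for i in range(m)]
        lam = 0.0
        for _ in range(iters):
            w = [sum(C[i][j] * v[j] for j in range(m)) for i in range(m)]
            lam = math.sqrt(sum(x * x for x in w))
            if lam < 1e-12:
                break
            v = [x / lam for x in w]
        comps.append(v)
        # Deflation: remove the component just found before the next pass.
        C = [[C[i][j] - lam * v[i] * v[j] for j in range(m)] for i in range(m)]
    return comps

def feature_vectors(F: Matrix, k: int) -> Matrix:
    """Project each centered profile vector onto the k leading components."""
    X = center(profiles(F))
    comps = top_components(X, k)
    return [[sum(x * c for x, c in zip(row, comp)) for comp in comps] for row in X]

def cosine(a: List[float], b: List[float]) -> float:
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def retrieve(U: Matrix, query_docs: List[int], top_n: int) -> List[int]:
    """Rank all documents by cosine similarity to the mean query feature vector."""
    k = len(U[0])
    q = [sum(U[d][j] for d in query_docs) / len(query_docs) for j in range(k)]
    ranked = sorted(range(len(U)), key=lambda d: cosine(U[d], q), reverse=True)
    return ranked[:top_n]

# Hypothetical 4-document, 4-keyword count matrix with two topical groups.
F = [[3, 1, 0, 0], [2, 2, 0, 0], [0, 0, 1, 3], [0, 0, 2, 2]]
U = feature_vectors(F, k=2)
similar = retrieve(U, query_docs=[0], top_n=2)
```

With document 0 as the retrieval condition, the ranking starts with document 0 itself, followed by document 1, which shares its keywords. A production system would more likely rely on a linear algebra library for the eigendecomposition; the power iteration is used here only to keep the sketch dependency-free.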
-
10. The similar document retrieving apparatus in accordance with claim 1, wherein said keyword weight calculating means calculates the weight wt of the concerned keyword t according to the following formula
-
11. The similar document retrieving apparatus in accordance with claim 1, wherein said document profile vector producing means calculates the relative frequency-of-occurrence pdt of each keyword t in the concerned document d by dividing the frequency-of-occurrence fdt of each keyword t in the concerned document d by a sum Σj fdj of the frequency-of-occurrence values of all keywords j appearing in the concerned document d.
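The normalization recited in claim 11 divides each keyword count in a document by the total of all keyword counts in that document. A minimal sketch, with a hypothetical function name and toy counts:

```python
from typing import Dict

def document_profile(freqs: Dict[str, int]) -> Dict[str, float]:
    """Relative frequency of each keyword within one document.

    Each count f_dt is divided by the sum of all counts f_dj in
    the same document d.
    """
    total = sum(freqs.values())
    if total == 0:
        # An empty document has no meaningful profile; return zeros.
        return {t: 0.0 for t in freqs}
    return {t: f / total for t, f in freqs.items()}

# Hypothetical keyword counts for a single document.
doc = {"retrieval": 3, "keyword": 1, "vector": 4}
profile = document_profile(doc)
```

The components of the resulting profile sum to 1, so documents of different lengths become directly comparable.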
-
-
12. A similar document retrieving apparatus applicable to a document database D which stores N document data containing a total of M kinds of keywords and is machine processible, for designating a retrieval condition consisting of a keyword group including at least one keyword y1, - - - , ys selected from said document database D and for retrieving documents relevant to said retrieval condition from said document database D, said similar document retrieving apparatus comprising:
-
keyword frequency-of-occurrence calculating means for calculating a keyword frequency-of-occurrence data F which represents a frequency-of-occurrence fdt of each keyword t appearing in each document d stored in said document database D;
document length calculating means for calculating a document length data L which represents a length ld of said each document d;
keyword weight calculating means for calculating a keyword weight data W which represents a weight wt of each keyword t of said M kinds of keywords appearing in said document database D;
document profile vector producing means for producing a M-dimensional document profile vector Pd having components respectively representing a relative frequency-of-occurrence pdt of each keyword t in the concerned document d;
keyword profile vector producing means for producing a N-dimensional keyword profile vector Qt having components respectively representing a relative frequency-of-occurrence qdt of the concerned keyword t in each document d;
document principal component analyzing means for performing a principal component analysis on a document profile vector group of a document group in said document database D and for obtaining a predefined (K)-dimensional document feature vector Ud corresponding to said document profile vector Pd for said each document d;
keyword principal component analyzing means for performing a principal component analysis on a keyword profile vector group of a keyword group in said document database D and for obtaining a predefined (K)-dimensional keyword feature vector Vt corresponding to said keyword profile vector Qt for said each keyword t, said keyword feature vector having the same dimension as that of said document feature vector, as well as for obtaining a keyword contribution factor (i.e., eigenvalue of a correlation matrix) θj of each dimension j;
retrieval condition feature vector calculating means for receiving said retrieval condition consisting of the keyword group including at least one keyword y1, - - - , ys, and for calculating a retrieval condition feature vector corresponding to said retrieval condition based on said keyword weight data of the received keyword group, said keyword feature vector and said keyword contribution factor; and
similar document retrieving means for calculating a similarity between each document d and said retrieval condition based on the calculated retrieval condition feature vector and a document feature vector of said each document d, and outputting a designated number of similar documents in order of the calculated similarity.
Dependent claims: 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26
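Claim 12 names three inputs for the retrieval condition feature vector (keyword weights, keyword feature vectors, and keyword contribution factors) but this listing does not give the formula that combines them. The sketch below shows one plausible combination, offered purely as an assumption: a weight-averaged keyword feature vector, with each dimension j divided by the square root of θj, in the manner of the transition formulas of correspondence analysis. The function name and all numbers are hypothetical.

```python
import math
from typing import Dict, List

def condition_vector(
    keywords: List[str],
    V: Dict[str, List[float]],   # keyword feature vectors V_t
    w: Dict[str, float],         # keyword weights w_t
    theta: List[float],          # contribution factor per dimension
) -> List[float]:
    """Map a keyword group into the document feature space.

    ASSUMPTION: weighted mean of keyword feature vectors, rescaled
    per dimension by 1/sqrt(theta_j). The source names the inputs
    but not how they are combined.
    """
    total = sum(w[y] for y in keywords)
    k = len(theta)
    mean = [sum(w[y] * V[y][j] for y in keywords) / total for j in range(k)]
    return [m / math.sqrt(th) for m, th in zip(mean, theta)]

# Hypothetical two-dimensional feature space.
V = {"a": [1.0, 0.0], "b": [0.0, 2.0]}
w = {"a": 1.0, "b": 3.0}
theta = [4.0, 1.0]
q = condition_vector(["a", "b"], V, w, theta)
```

The resulting vector can then be compared with each document feature vector, for example by cosine similarity, to rank the documents.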
-
24. The similar document retrieving apparatus in accordance with claim 12, wherein said keyword weight calculating means calculates the weight wt of the concerned keyword t according to the following formula
-
25. The similar document retrieving apparatus in accordance with claim 12, wherein said document profile vector producing means calculates the relative frequency-of-occurrence pdt of each keyword t in the concerned document d by dividing the frequency-of-occurrence fdt of each keyword t in the concerned document d by a sum Σj fdj of the frequency-of-occurrence values of all keywords j appearing in the concerned document d.
-
26. The similar document retrieving apparatus in accordance with claim 12, wherein said keyword profile vector producing means calculates the relative frequency-of-occurrence qdt of the concerned keyword t in each document d by dividing the frequency-of-occurrence fdt of the concerned keyword t in said each document d by a sum Σj fjt of the frequency-of-occurrence values of the concerned keyword t in all documents j containing the concerned keyword t.
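The keyword-side normalization of claim 26 mirrors the document-side one: the count of a keyword in each document is divided by that keyword's total count over the whole database. A minimal sketch, assuming the counts are held as one dictionary per document; the function name and data are hypothetical.

```python
from typing import Dict, List

def keyword_profile(F: List[Dict[str, int]], t: str) -> List[float]:
    """N-dimensional profile of keyword t across all N documents.

    Each entry is the count of t in document d divided by the total
    count of t over every document in the database.
    """
    column = [doc.get(t, 0) for doc in F]
    total = sum(column)
    if total == 0:
        # Keyword never occurs; return an all-zero profile.
        return [0.0] * len(F)
    return [f / total for f in column]

# Hypothetical counts for three documents.
F = [{"a": 1, "b": 2}, {"a": 3}, {"b": 2}]
q_a = keyword_profile(F, "a")
q_b = keyword_profile(F, "b")
```

Documents that do not contain the keyword contribute zero to the sum, so summing over all documents gives the same result as summing only over those containing the keyword.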
-
-
27. A relevant keyword extracting apparatus applicable to a document database D which stores N document data containing a total of M kinds of keywords and is machine processible, for designating an extracting condition consisting of a keyword group including at least one keyword y1, - - - , ys selected from said document database D and for extracting keywords relevant to said keyword group of said extracting condition from said document database D, said relevant keyword extracting apparatus comprising:
-
keyword frequency-of-occurrence calculating means for calculating a keyword frequency-of-occurrence data F which represents a frequency-of-occurrence fdt of each keyword t appearing in each document d stored in said document database D;
document length calculating means for calculating a document length data L which represents a length ld of said each document d;
keyword weight calculating means for calculating a keyword weight data W which represents a weight wt of each keyword t of said M kinds of keywords appearing in said document database D;
keyword profile vector producing means for producing a N-dimensional keyword profile vector Qt having components respectively representing a relative frequency-of-occurrence qdt of the concerned keyword t in each document d;
keyword principal component analyzing means for performing a principal component analysis on a keyword profile vector group of a keyword group in said document database D and for obtaining a predefined (K)-dimensional keyword feature vector Vt corresponding to said keyword profile vector Qt for said each keyword t; and
relevant keyword extracting means for receiving said extracting condition consisting of the keyword group including at least one keyword y1, - - - , ys selected from said document database D, calculating a relevancy between each keyword t and said extracting condition based on a keyword feature vector of said received keyword group and the keyword feature vector of each keyword t in said document database D, and outputting a designated number of relevant keywords in order of the calculated relevancy.
Dependent claims: 28, 29, 30, 31, 32, 33, 34, 35, 36, 37
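The final extracting step of claim 27 can be illustrated once keyword feature vectors are available. The sketch assumes they have already been computed, takes the feature vector of the keyword group as the mean of its members' vectors, and uses cosine similarity as the relevancy measure; the claim fixes neither of the last two choices. The function names and the vectors are hypothetical.

```python
import math
from typing import Dict, List

def cosine(a: List[float], b: List[float]) -> float:
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def relevant_keywords(
    V: Dict[str, List[float]], query: List[str], top_n: int
) -> List[str]:
    """Rank every keyword by cosine relevancy to the mean query vector."""
    k = len(next(iter(V.values())))
    q = [sum(V[y][j] for y in query) / len(query) for j in range(k)]
    ranked = sorted(V, key=lambda t: cosine(V[t], q), reverse=True)
    return ranked[:top_n]

# Hypothetical precomputed keyword feature vectors (K = 2).
V = {"engine": [0.9, 0.1], "motor": [0.8, 0.2], "poem": [-0.7, 0.6]}
top = relevant_keywords(V, ["engine"], top_n=2)
```

The query keyword itself ranks first, followed by the keyword whose feature vector points in nearly the same direction.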
-
36. The relevant keyword extracting apparatus in accordance with claim 27, wherein said keyword weight calculating means calculates the weight wt of the concerned keyword t according to the following formula
-
37. The relevant keyword extracting apparatus in accordance with claim 27, wherein said keyword profile vector producing means calculates the relative frequency-of-occurrence qdt of the concerned keyword t in each document d by dividing the frequency-of-occurrence fdt of the concerned keyword t in said each document d by a sum Σj fjt of the frequency-of-occurrence values of the concerned keyword t in all documents j containing the concerned keyword t.
-
-
38. A relevant keyword extracting apparatus applicable to a document database D which stores N document data containing a total of M kinds of keywords and is machine processible, for designating an extracting condition consisting of a document group including at least one document x1, - - - , xr selected from said document database D and for extracting keywords relevant to the document group of said extracting condition from said document database D, said relevant keyword extracting apparatus comprising:
-
keyword frequency-of-occurrence calculating means for calculating a keyword frequency-of-occurrence data F which represents a frequency-of-occurrence fdt of each keyword t appearing in each document d stored in said document database D;
document length calculating means for calculating a document length data L which represents a length ld of said each document d;
keyword weight calculating means for calculating a keyword weight data W which represents a weight wt of each keyword t of said M kinds of keywords appearing in said document database D;
document profile vector producing means for producing a M-dimensional document profile vector Pd having components respectively representing a relative frequency-of-occurrence pdt of each keyword t in the concerned document d;
keyword profile vector producing means for producing a N-dimensional keyword profile vector Qt having components respectively representing a relative frequency-of-occurrence qdt of the concerned keyword t in each document d;
document principal component analyzing means for performing a principal component analysis on a document profile vector group of a document group in said document database D and for obtaining a predefined (K)-dimensional document feature vector Ud corresponding to said document profile vector Pd for said each document d as well as for obtaining a document contribution factor (i.e., eigenvalue of a correlation matrix) λj of each dimension j;
keyword principal component analyzing means for performing a principal component analysis on a keyword profile vector group of a keyword group in said document database D and for obtaining a predefined (K)-dimensional keyword feature vector Vt corresponding to said keyword profile vector Qt for said each keyword t, said keyword feature vector having the same dimension as that of said document feature vector;
extracting condition feature vector calculating means for receiving said extracting condition consisting of the document group including at least one document x1, - - - , xr, and for calculating an extracting condition feature vector corresponding to said extracting condition based on said document length data of the received document group, said document feature vector and said document contribution factor; and
relevant keyword extracting means for calculating a relevancy between each keyword t and said extracting condition based on the calculated extracting condition feature vector and a keyword feature vector of each keyword t, and outputting a designated number of relevant keywords in order of the calculated relevancy.
Dependent claims: 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52
-
50. The relevant keyword extracting apparatus in accordance with claim 38, wherein said keyword weight calculating means calculates the weight wt of the concerned keyword t according to the following formula
-
51. The relevant keyword extracting apparatus in accordance with claim 38, wherein said document profile vector producing means calculates the relative frequency-of-occurrence pdt of each keyword t in the concerned document d by dividing the frequency-of-occurrence fdt of each keyword t in the concerned document d by a sum Σj fdj of the frequency-of-occurrence values of all keywords j appearing in the concerned document d.
-
52. The relevant keyword extracting apparatus in accordance with claim 38, wherein said keyword profile vector producing means calculates the relative frequency-of-occurrence qdt of the concerned keyword t in each document d by dividing the frequency-of-occurrence fdt of the concerned keyword t in said each document d by a sum Σj fjt of the frequency-of-occurrence values of the concerned keyword t in all documents j containing the concerned keyword t.
-
Specification