Document categorizing method, document categorizing apparatus, and storage medium on which a document categorization program is stored
First Claim
1. A document categorizing method for categorizing a plurality of documents in an electronic system according to semantic similarity, said method comprising:
obtaining a plurality of clusters of documents, each cluster having a distinctive name;
evaluating a degree of relation between at least two clusters by evaluating the similarity between the evaluated clusters based on the documents included in the respective evaluated clusters;
merging the evaluated clusters into a new combined cluster when their degree of relation is determined to be not less than a predetermined first value; and
assigning a new name to said new combined cluster based on the degree of relation between its constituent evaluated clusters;
wherein:
if the degree of relation of said constituent evaluated clusters is less than a second predetermined value, which is greater than said first predetermined value, the new name assigned to said new combined cluster conforms to a first naming convention indicative of a degree of relation between said first and second predetermined values; and
if the degree of relation of said constituent evaluated clusters is not less than said second predetermined value, the new name assigned to said new combined cluster conforms to a second naming convention indicative of a degree of relation not less than said second predetermined value; and
wherein:
said first naming convention includes a concatenation of at least a name segment of each of said constituent evaluated clusters with a first delimiter inserted between the concatenated name segments; and
said second naming convention includes a concatenation of at least a name segment of each of said constituent evaluated clusters with a second delimiter, different from said first delimiter, inserted between the concatenated name segments.
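The two-threshold merge and naming rule recited in this claim can be sketched as follows. This is a minimal illustration, not the patentee's implementation: the threshold values, the "/" and "+" delimiters, the `Cluster` type, and the use of Jaccard overlap as the degree of relation are all assumptions made for the example.

```python
from dataclasses import dataclass

FIRST_VALUE = 0.3        # assumed merge threshold (the claim's "first value")
SECOND_VALUE = 0.6       # assumed higher threshold; must exceed FIRST_VALUE
FIRST_DELIMITER = "/"    # hypothetical delimiter for moderately related clusters
SECOND_DELIMITER = "+"   # hypothetical delimiter for strongly related clusters


@dataclass(frozen=True)
class Cluster:
    name: str
    documents: frozenset  # ids of the documents in the cluster


def degree_of_relation(a: Cluster, b: Cluster) -> float:
    """Jaccard overlap of the two document sets (an assumed stand-in measure)."""
    union = a.documents | b.documents
    if not union:
        return 0.0
    return len(a.documents & b.documents) / len(union)


def merge_if_related(a: Cluster, b: Cluster):
    """Return the combined cluster, or None when the relation is below FIRST_VALUE."""
    relation = degree_of_relation(a, b)
    if relation < FIRST_VALUE:
        return None
    # The delimiter alone signals how strongly the constituents are related.
    delimiter = FIRST_DELIMITER if relation < SECOND_VALUE else SECOND_DELIMITER
    return Cluster(f"{a.name}{delimiter}{b.name}", a.documents | b.documents)


printer = Cluster("printer", frozenset({1, 2, 3, 4}))
ink = Cluster("ink", frozenset({3, 4, 5}))
toner = Cluster("toner", frozenset({1, 2, 3, 4, 5}))
paper = Cluster("paper", frozenset({9}))
```

With these sample sets, `printer` and `ink` share two of five documents (0.4) and merge under the first convention, `printer` and `toner` share four of five (0.8) and merge under the second, and `printer` and `paper` share none and are left unmerged.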
Abstract
A document categorizing apparatus includes a sentence analyzer 12 for analyzing a plurality of documents to detect titles thereof; a feature element extractor 13 for extracting feature elements from the titles detected by the sentence analyzer 12 in the respective documents; feature table generating means 14 for generating a feature table representing the relationships between the feature elements extracted from the titles and the documents including those feature elements; a document categorizing unit 15 for categorizing the documents into a plurality of clusters according to semantic similarity on the basis of the content of the feature table; a categorization result storage unit 16 for storing the clusters created by the document categorizing unit 15; a cluster merging unit 2 for performing a cluster merging process upon the clusters stored in the categorization result storage unit 16; and an output control unit 31 for outputting the result of the cluster merging process to a display unit 32.
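The front half of the pipeline in the abstract (title analysis, feature element extraction, feature table generation, initial categorization) might look roughly like the sketch below. The abstract does not say how feature elements are chosen or how clusters are formed from the table, so the word-level tokenization, the stop-word list, the `min_docs` cutoff, and all function names here are assumptions for illustration only.

```python
import re
from collections import defaultdict

# Assumed stop-word list; the source does not specify one.
STOP_WORDS = {"a", "an", "the", "of", "for", "and", "to", "in", "on"}


def extract_feature_elements(title: str) -> set:
    """Return the lowercase words of a title, minus stop words."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return {w for w in words if w not in STOP_WORDS}


def build_feature_table(titles: dict) -> dict:
    """Map each feature element to the ids of the documents whose title contains it."""
    table = defaultdict(set)
    for doc_id, title in titles.items():
        for element in extract_feature_elements(title):
            table[element].add(doc_id)
    return dict(table)


def initial_clusters(table: dict, min_docs: int = 2) -> dict:
    """Keep elements shared by at least min_docs documents; each one names a cluster."""
    return {element: docs for element, docs in table.items() if len(docs) >= min_docs}


titles = {
    1: "Replacing the ink cartridge",
    2: "Ink cartridge errors",
    3: "Printer driver setup",
}
table = build_feature_table(titles)
clusters = initial_clusters(table)
```

Here `clusters` contains one cluster named "ink" and one named "cartridge", each holding documents 1 and 2. These are the kind of named clusters the cluster merging unit would then compare.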
49 Citations
21 Claims
1. A document categorizing method for categorizing a plurality of documents in an electronic system according to semantic similarity, said method comprising:
obtaining a plurality of clusters of documents, each cluster having a distinctive name;
evaluating a degree of relation between at least two clusters by evaluating the similarity between the evaluated clusters based on the documents included in the respective evaluated clusters;
merging the evaluated clusters into a new combined cluster when their degree of relation is determined to be not less than a predetermined first value; and
assigning a new name to said new combined cluster based on the degree of relation between its constituent evaluated clusters;
wherein:
if the degree of relation of said constituent evaluated clusters is less than a second predetermined value, which is greater than said first predetermined value, the new name assigned to said new combined cluster conforms to a first naming convention indicative of a degree of relation between said first and second predetermined values; and
if the degree of relation of said constituent evaluated clusters is not less than said second predetermined value, the new name assigned to said new combined cluster conforms to a second naming convention indicative of a degree of relation not less than said second predetermined value; and
wherein:
said first naming convention includes a concatenation of at least a name segment of each of said constituent evaluated clusters with a first delimiter inserted between the concatenated name segments; and
said second naming convention includes a concatenation of at least a name segment of each of said constituent evaluated clusters with a second delimiter, different from said first delimiter, inserted between the concatenated name segments.
Dependent claims: 2, 3

4. A document categorizing method for categorizing a plurality of documents in an electronic system according to semantic similarity, said method comprising:
obtaining a plurality of clusters of documents, each cluster having a distinctive name;
evaluating a degree of relation between at least two clusters by evaluating the similarity between the evaluated clusters based on the documents included in the respective evaluated clusters;
merging the evaluated clusters into a new combined cluster when their degree of relation is determined to be not less than a predetermined first value; and
assigning a new name to said new combined cluster based on the degree of relation between its constituent evaluated clusters;
wherein:
if the degree of relation of said constituent evaluated clusters is less than a second predetermined value, which is greater than said first predetermined value, the new name assigned to said new combined cluster conforms to a first naming convention indicative of a degree of relation between said first and second predetermined values; and
if the degree of relation of said constituent evaluated clusters is not less than said second predetermined value, the new name assigned to said new combined cluster conforms to a second naming convention indicative of a degree of relation not less than said second predetermined value; and
wherein said new combined cluster constitutes a cluster combination, said method further comprising:
determining a degree of relation between a previously uncombined cluster within said plurality of said clusters with said cluster combination by evaluating their similarity based on the documents included in said uncombined cluster and said cluster combination;
merging the evaluated uncombined cluster and the evaluated cluster combination into a newer combined cluster when their degree of relation is determined to be not less than said predetermined first value;
assigning a newer name to said newer combined cluster based on the degree of relation between its constituent evaluated previously uncombined cluster and evaluated cluster combination, wherein if their degree of relation is less than said second predetermined value, the newer name assigned to said newer combined cluster conforms to a third naming convention, and wherein if their degree of relation is not less than said second predetermined value, the newer name assigned to said newer combined cluster conforms to a fourth naming convention.
Dependent claims: 5
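The further step of this claim, folding a previously uncombined cluster into an existing cluster combination, can be sketched as below. The claim does not define the third and fourth naming conventions, so the parenthesized forms used here are hypothetical, as are the threshold values and the Jaccard measure. The loop over several uncombined clusters is an extension beyond the single merge the claim recites.

```python
def jaccard(a: set, b: set) -> float:
    """Assumed degree-of-relation measure: shared documents over all documents."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def absorb_uncombined(combination, uncombined, first_value=0.3, second_value=0.6):
    """Fold related uncombined clusters into a cluster combination.

    combination: (name, set of document ids)
    uncombined:  dict mapping cluster name -> set of document ids
    Returns (newer name, newer document set, clusters left unmerged).
    """
    name, docs = combination
    leftover = {}
    for cluster_name, cluster_docs in uncombined.items():
        relation = jaccard(docs, cluster_docs)
        if relation < first_value:
            leftover[cluster_name] = cluster_docs
            continue
        # Hypothetical third ("/") and fourth ("+") conventions: the existing
        # combination is parenthesized so the merge history stays readable.
        delimiter = "/" if relation < second_value else "+"
        name = f"({name}){delimiter}{cluster_name}"
        docs = docs | cluster_docs
    return name, docs, leftover


weak = absorb_uncombined(
    ("printer/ink", {1, 2, 3, 4, 5}),
    {"cartridge": {3, 4, 5, 6}, "network": {9, 10}},
)
strong = absorb_uncombined(("a/b", {1, 2, 3, 4}), {"c": {1, 2, 3}})
```

In the first call "cartridge" shares three of six documents (0.5) and is absorbed under the weaker convention, while "network" shares none and stays in the leftover set. In the second call the relation is 0.75, so the stronger convention applies.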

6. A document categorizing method for categorizing a plurality of documents in an electronic system according to semantic similarity, said method comprising:
obtaining a plurality of clusters of documents, each cluster having a distinctive name;
evaluating a degree of relation between at least two clusters by evaluating the similarity between the evaluated clusters based on the documents included in the respective evaluated clusters;
merging the evaluated clusters into a new combined cluster when their degree of relation is determined to be not less than a predetermined first value; and
assigning a new name to said new combined cluster based on the degree of relation between its constituent evaluated clusters;
wherein:
if the degree of relation of said constituent evaluated clusters is less than a second predetermined value, which is greater than said first predetermined value, the new name assigned to said new combined cluster conforms to a first naming convention indicative of a degree of relation between said first and second predetermined values; and
if the degree of relation of said constituent evaluated clusters is not less than said second predetermined value, the new name assigned to said new combined cluster conforms to a second naming convention indicative of a degree of relation not less than said second predetermined value; and
wherein said new combined cluster constitutes a cluster combination, said method further comprising:
obtaining a plurality of said cluster combinations, each cluster combination having a distinctive name;
determining a degree of relation between at least two cluster combinations by evaluating the similarity between the evaluated cluster combinations based on the documents included in the respective evaluated cluster combinations;
merging the evaluated cluster combinations into a new combined cluster combination when their degree of relation is determined to be not less than said predetermined first value;
assigning a new name to said new combined cluster combination based on the degree of relation between its constituent cluster combinations, wherein if the degree of relation of its constituent cluster combinations is less than said second predetermined value, the new name assigned to said new cluster combination conforms to a fifth naming convention indicative of a degree of relation between said first and second predetermined values, and wherein if the degree of relation of its constituent cluster combinations is not less than said second predetermined value, the new name assigned to said new combined cluster combination conforms to a sixth naming convention indicative of a degree of relation not less than said second predetermined value.
Dependent claims: 7
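Merging cluster combinations with one another, as this claim recites, can be illustrated with a simple greedy pass that repeatedly merges the most related pair until no pair reaches the first value. The greedy strategy is an assumed choice (a thresholded form of agglomerative clustering), not something the claim prescribes. The bracketed " | " and " & " names stand in for the unspecified fifth and sixth conventions, and the thresholds and Jaccard measure are likewise assumptions.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Assumed degree-of-relation measure: shared documents over all documents."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_combinations(groups: dict, first_value=0.3, second_value=0.6) -> dict:
    """Greedily merge cluster combinations given as {name: set of document ids}."""
    groups = dict(groups)  # work on a copy so the caller's mapping is untouched
    while len(groups) > 1:
        # Score every pair and pick the most closely related one.
        (a, b), relation = max(
            (((x, y), jaccard(groups[x], groups[y])) for x, y in combinations(groups, 2)),
            key=lambda item: item[1],
        )
        if relation < first_value:
            break  # nothing left that is related enough to merge
        # Hypothetical fifth (" | ") and sixth (" & ") naming conventions.
        delimiter = " | " if relation < second_value else " & "
        merged_docs = groups.pop(a) | groups.pop(b)
        groups[f"[{a}]{delimiter}[{b}]"] = merged_docs
    return groups


result = merge_combinations({
    "printer/ink": {1, 2, 3, 4},
    "ink+cartridge": {3, 4, 5},
    "network/wifi": {8, 9},
})
```

With this input the first two combinations share two of five documents (0.4) and are merged under the weaker convention; "network/wifi" is unrelated to the result and remains on its own.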

8. A machine readable memory medium having machine executable instructions for categorizing a plurality of documents in an electronic system according to semantic similarity, said machine readable memory medium comprising:
obtaining a plurality of clusters of documents, each cluster having a distinctive name;
evaluating a degree of relation between at least two clusters by evaluating the similarity between the evaluated clusters based on the documents included in the respective evaluated clusters;
merging the evaluated clusters into a new combined cluster when their degree of relation is determined to be not less than a predetermined first value; and
assigning a new name to said new combined cluster based on the degree of relation between its constituent evaluated clusters;
wherein:
if the degree of relation of said constituent evaluated clusters is less than a second predetermined value, which is greater than said first predetermined value, the new name assigned to said new combined cluster conforms to a first naming convention indicative of a degree of relation between said first and second predetermined values; and
if the degree of relation of said constituent evaluated clusters is not less than said second predetermined value, the new name assigned to said new combined cluster conforms to a second naming convention indicative of a degree of relation not less than said second predetermined value; and
wherein:
said first naming convention includes a concatenation of at least a name segment of each of said constituent evaluated clusters with a first delimiter inserted between the concatenated name segments; and
said second naming convention includes a concatenation of at least a name segment of each of said constituent evaluated clusters with a second delimiter, different from said first delimiter, inserted between the concatenated name segments.
Dependent claims: 9, 10

11. A machine readable memory medium having machine executable instructions for categorizing a plurality of documents in an electronic system according to semantic similarity, said machine readable memory medium comprising:
obtaining a plurality of clusters of documents, each cluster having a distinctive name;
evaluating a degree of relation between at least two clusters by evaluating the similarity between the evaluated clusters based on the documents included in the respective evaluated clusters;
merging the evaluated clusters into a new combined cluster when their degree of relation is determined to be not less than a predetermined first value; and
assigning a new name to said new combined cluster based on the degree of relation between its constituent evaluated clusters;
wherein:
if the degree of relation of said constituent evaluated clusters is less than a second predetermined value, which is greater than said first predetermined value, the new name assigned to said new combined cluster conforms to a first naming convention indicative of a degree of relation between said first and second predetermined values; and
if the degree of relation of said constituent evaluated clusters is not less than said second predetermined value, the new name assigned to said new combined cluster conforms to a second naming convention indicative of a degree of relation not less than said second predetermined value; and
wherein said new combined cluster constitutes a cluster combination, said machine readable memory medium further comprising:
determining a degree of relation between a previously uncombined cluster within said plurality of said clusters with said cluster combination by evaluating their similarity based on the documents included in said uncombined cluster and said cluster combination;
merging the evaluated uncombined cluster and the evaluated cluster combination into a newer combined cluster when their degree of relation is determined to be not less than said predetermined first value;
assigning a newer name to said newer combined cluster based on the degree of relation between its constituent evaluated previously uncombined cluster and evaluated cluster combination, wherein if their degree of relation is less than said second predetermined value, the newer name assigned to said newer combined cluster conforms to a third naming convention, and wherein if their degree of relation is not less than said second predetermined value, the newer name assigned to said newer combined cluster conforms to a fourth naming convention.
Dependent claims: 12

13. A machine readable memory medium having machine executable instructions for categorizing a plurality of documents in an electronic system according to semantic similarity, said machine readable memory medium comprising:
obtaining a plurality of clusters of documents, each cluster having a distinctive name;
evaluating a degree of relation between at least two clusters by evaluating the similarity between the evaluated clusters based on the documents included in the respective evaluated clusters;
merging the evaluated clusters into a new combined cluster when their degree of relation is determined to be not less than a predetermined first value; and
assigning a new name to said new combined cluster based on the degree of relation between its constituent evaluated clusters;
wherein:
if the degree of relation of said constituent evaluated clusters is less than a second predetermined value, which is greater than said first predetermined value, the new name assigned to said new combined cluster conforms to a first naming convention indicative of a degree of relation between said first and second predetermined values; and
if the degree of relation of said constituent evaluated clusters is not less than said second predetermined value, the new name assigned to said new combined cluster conforms to a second naming convention indicative of a degree of relation not less than said second predetermined value; and
wherein said new combined cluster constitutes a cluster combination, said memory medium further comprising:
obtaining a plurality of said cluster combinations, each cluster combination having a distinctive name;
determining a degree of relation between at least two cluster combinations by evaluating the similarity between the evaluated cluster combinations based on the documents included in the respective evaluated cluster combinations;
merging the evaluated cluster combinations into a new combined cluster combination when their degree of relation is determined to be not less than said predetermined first value;
assigning a new name to said new combined cluster combination based on the degree of relation between its constituent cluster combinations, wherein if the degree of relation of its constituent cluster combinations is less than said second predetermined value, the new name assigned to said new cluster combination conforms to a fifth naming convention indicative of a degree of relation between said first and second predetermined values, and wherein if the degree of relation of its constituent cluster combinations is not less than said second predetermined value, the new name assigned to said new combined cluster combination conforms to a sixth naming convention indicative of a degree of relation not less than said second predetermined value.
Dependent claims: 14

15. A document categorizing apparatus for categorizing a plurality of documents in an electronic system according to semantic similarity, said apparatus comprising:
means for obtaining a plurality of clusters of documents, each cluster having a distinctive name;
means for evaluating a degree of relation between at least two clusters by evaluating the similarity between the evaluated clusters based on the documents included in the respective evaluated clusters;
means for merging the evaluated clusters into a new combined cluster when their degree of relation is determined to be not less than a predetermined first value; and
means for assigning a new name to said new combined cluster based on the degree of relation between its constituent evaluated clusters;
wherein:
if the degree of relation of said constituent evaluated clusters is less than a second predetermined value, which is greater than said first predetermined value, the new name assigned to said new combined cluster conforms to a first naming convention indicative of a degree of relation between said first and second predetermined values; and
if the degree of relation of said constituent evaluated clusters is not less than said second predetermined value, the new name assigned to said new combined cluster conforms to a second naming convention indicative of a degree of relation not less than said second predetermined value; and
wherein:
said first naming convention includes a concatenation of at least a name segment of each of said constituent evaluated clusters with a first delimiter inserted between the concatenated name segments; and
said second naming convention includes a concatenation of at least a name segment of each of said constituent evaluated clusters with a second delimiter, different from said first delimiter, inserted between the concatenated name segments.
Dependent claims: 16, 17

18. A document categorizing apparatus for categorizing a plurality of documents in an electronic system according to semantic similarity, said apparatus comprising:
means for obtaining a plurality of clusters of documents, each cluster having a distinctive name;
means for evaluating a degree of relation between at least two clusters by evaluating the similarity between the evaluated clusters based on the documents included in the respective evaluated clusters;
means for merging the evaluated clusters into a new combined cluster when their degree of relation is determined to be not less than a predetermined first value; and
means for assigning a new name to said new combined cluster based on the degree of relation between its constituent evaluated clusters;
wherein:
if the degree of relation of said constituent evaluated clusters is less than a second predetermined value, which is greater than said first predetermined value, the new name assigned to said new combined cluster conforms to a first naming convention indicative of a degree of relation between said first and second predetermined values; and
if the degree of relation of said constituent evaluated clusters is not less than said second predetermined value, the new name assigned to said new combined cluster conforms to a second naming convention indicative of a degree of relation not less than said second predetermined value; and
wherein said new combined cluster constitutes a cluster combination, said apparatus further comprising:
means for determining a degree of relation between a previously uncombined cluster within said plurality of said clusters with said cluster combination by evaluating their similarity based on the documents included in said uncombined cluster and said cluster combination;
means for merging the evaluated uncombined cluster and the evaluated cluster combination into a newer combined cluster when their degree of relation is determined to be not less than said predetermined first value;
means for assigning a newer name to said newer combined cluster based on the degree of relation between its constituent evaluated previously uncombined cluster and evaluated cluster combination, wherein if their degree of relation is less than said second predetermined value, the newer name assigned to said newer combined cluster conforms to a third naming convention, and wherein if their degree of relation is not less than said second predetermined value, the newer name assigned to said newer combined cluster conforms to a fourth naming convention.
Dependent claims: 19

20. A document categorizing apparatus for categorizing a plurality of documents in an electronic system according to semantic similarity, said apparatus comprising:
means for obtaining a plurality of clusters of documents, each cluster having a distinctive name;
means for evaluating a degree of relation between at least two clusters by evaluating the similarity between the evaluated clusters based on the documents included in the respective evaluated clusters;
means for merging the evaluated clusters into a new combined cluster when their degree of relation is determined to be not less than a predetermined first value; and
means for assigning a new name to said new combined cluster based on the degree of relation between its constituent evaluated clusters;
wherein:
if the degree of relation of said constituent evaluated clusters is less than a second predetermined value, which is greater than said first predetermined value, the new name assigned to said new combined cluster conforms to a first naming convention indicative of a degree of relation between said first and second predetermined values; and
if the degree of relation of said constituent evaluated clusters is not less than said second predetermined value, the new name assigned to said new combined cluster conforms to a second naming convention indicative of a degree of relation not less than said second predetermined value; and
wherein said new combined cluster constitutes a cluster combination, said apparatus further comprising:
means for obtaining a plurality of said cluster combinations, each cluster combination having a distinctive name;
means for determining a degree of relation between at least two cluster combinations by evaluating the similarity between the evaluated cluster combinations based on the documents included in the respective evaluated cluster combinations;
means for merging the evaluated cluster combinations into a new combined cluster combination when their degree of relation is determined to be not less than said predetermined first value;
means for assigning a new name to said new combined cluster combination based on the degree of relation between its constituent cluster combinations, wherein if the degree of relation of its constituent cluster combinations is less than said second predetermined value, the new name assigned to said new cluster combination conforms to a fifth naming convention indicative of a degree of relation between said first and second predetermined values, and wherein if the degree of relation of its constituent cluster combinations is not less than said second predetermined value, the new name assigned to said new combined cluster combination conforms to a sixth naming convention indicative of a degree of relation not less than said second predetermined value.
Dependent claims: 21
Specification