Document similarity calculation apparatus, clustering apparatus, and document extraction apparatus
First Claim
1. A clustering apparatus comprising:
a memory;
a Central Processing Unit;
a similarity calculation unit which respectively calculates a similarity as a relative value between documents, with respect to combinations of a plurality of documents, using a document vector and a significance of a word included in a document;
a conversion unit which converts similarity calculated by the similarity calculation unit to an absolute value by normalization; and
a clustering unit which executes clustering of a plurality of documents, based on similarity of the absolute value;
wherein the absolute value is a sum of a ratio between a similarity having a highest value and a similarity to be converted and a ratio between a mean value of similarities and the similarity to be converted, or the absolute value is a ratio between the similarity having the highest value among the similarities not to be converted and the similarity to be converted, said normalization being carried out in accordance with the following equation:
normalized similarity = α × (similarity of target document / similarity of document of first place) + β × (similarity of target document / mean value of the similarities),
wherein α and β are coefficients, and wherein, when α is 0 and β is 1 and a number of higher ranking documents is 1, the above equation can be expressed as the following equation:
normalized similarity = (similarity of target document / highest similarity in documents other than relevant document),
and wherein a result of said normalization identifying at least one of said plurality of documents relative to the relevant document is output.
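The normalization equation in the claim can be illustrated with a minimal Python sketch. The function name `normalize_similarities` and the parameter `top_n` (standing in for the "number of higher ranking documents") are hypothetical, and the sketch assumes that the mean is taken over the `top_n` highest similarities and that similarities are non-negative; the claim itself does not spell out these details.

```python
from statistics import mean


def normalize_similarities(similarities, alpha=0.0, beta=1.0, top_n=1):
    """Convert raw (relative) similarities into normalized values.

    similarities: raw similarity of each other document to the relevant document.
    alpha, beta:  the coefficients from the claim's equation.
    top_n:        hypothetical name for the "number of higher ranking documents";
                  the mean is taken over the top_n highest values (an assumption).
    """
    if not similarities:
        return []
    ranked = sorted(similarities, reverse=True)
    first_place = ranked[0]          # similarity of the document of first place
    top_mean = mean(ranked[:top_n])  # mean value of the higher ranking similarities
    if first_place == 0 or top_mean == 0:
        # Nothing resembles the relevant document; avoid division by zero.
        return [0.0 for _ in similarities]
    return [
        alpha * (s / first_place) + beta * (s / top_mean)
        for s in similarities
    ]
```

With the defaults (α = 0, β = 1, one higher ranking document), each value is simply divided by the highest similarity, so `normalize_similarities([0.2, 0.8, 0.4])` gives `[0.25, 1.0, 0.5]`.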
Abstract
An input section inputs a document set. A normalization section calculates a similarity as a relative value between documents, with respect to combinations of a plurality of documents in the document set. The normalization section employs the tf·idf method, which uses a document vector and a significance of a word included in the document. Normalization is then performed to convert each similarity to an absolute value.
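As a rough illustration of how a relative similarity might be obtained from document vectors and word significance, the following Python sketch builds tf·idf vectors and compares them. The specific weighting (raw term frequency times log(N/df)) and the use of cosine similarity are assumptions made for the example, and the names `tfidf_vectors` and `cosine` are hypothetical; the abstract does not state which variant is used.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {word: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Word significance assumed to be tf * log(N / df).
        vectors.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "stock market falls".split(),
    "stock market rises".split(),
    "rain expected tomorrow".split(),
]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # documents sharing two words
sim_02 = cosine(vecs[0], vecs[2])  # documents sharing no words
```

The resulting values are only meaningful relative to one another within the same document set, which is why a separate normalization step is applied afterwards.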
55 Citations
7 Claims
1. A clustering apparatus comprising:
a memory;
a Central Processing Unit;
a similarity calculation unit which respectively calculates a similarity as a relative value between documents, with respect to combinations of a plurality of documents, using a document vector and a significance of a word included in a document;
a conversion unit which converts similarity calculated by the similarity calculation unit to an absolute value by normalization; and
a clustering unit which executes clustering of a plurality of documents, based on similarity of the absolute value;
wherein the absolute value is a sum of a ratio between a similarity having a highest value and a similarity to be converted and a ratio between a mean value of similarities and the similarity to be converted, or the absolute value is a ratio between the similarity having the highest value among the similarities not to be converted and the similarity to be converted, said normalization being carried out in accordance with the following equation:
normalized similarity = α × (similarity of target document / similarity of document of first place) + β × (similarity of target document / mean value of the similarities),
wherein α and β are coefficients, and wherein, when α is 0 and β is 1 and a number of higher ranking documents is 1, the above equation can be expressed as the following equation:
normalized similarity = (similarity of target document / highest similarity in documents other than relevant document),
and wherein a result of said normalization identifying at least one of said plurality of documents relative to the relevant document is output.
3. A clustering method comprising:
calculating a similarity as a relative value between documents, with respect to combinations of a plurality of documents, using a document vector and a significance of a word included in a document;
converting the calculated similarity to an absolute value by normalization; and
executing clustering of a plurality of documents, based on the similarity of the absolute value;
wherein the absolute value is a sum of a ratio between a similarity having a highest value and a similarity to be converted and a ratio between a mean value of similarities and the similarity to be converted, or the absolute value is a ratio between the similarity having the highest value among similarities not to be converted and the similarity to be converted, said normalization being carried out in accordance with the following equation:
normalized similarity = α × (similarity of target document / similarity of document of first place) + β × (similarity of target document / mean value of the similarities),
wherein α and β are coefficients, and wherein, when α is 0 and β is 1 and a number of higher ranking documents is 1, the above equation can be expressed as the following equation:
normalized similarity = (similarity of target document / highest similarity in documents other than relevant document),
and wherein a result of said normalization identifying at least one of said plurality of documents relative to the relevant document is output.
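The method steps of this claim (normalize, then cluster on the normalized values) might be sketched as follows. The claim does not name a clustering algorithm, so this example assumes a simple threshold rule: two documents are linked when the normalized similarity meets a threshold in both directions, and linked documents are merged with a union-find structure. The names `normalize_row`, `cluster`, and `threshold` are hypothetical, and the normalization uses the special case α = 0, β = 1 with one higher ranking document.

```python
def normalize_row(sim, i):
    """Divide row i by the highest similarity among documents other than i."""
    n = len(sim)
    others = [sim[i][j] for j in range(n) if j != i]
    top = max(others, default=0.0)
    return [0.0 if (j == i or top == 0) else sim[i][j] / top for j in range(n)]


def cluster(sim, threshold=0.5):
    """Group document indices using normalized similarities (assumed rule)."""
    n = len(sim)
    norm = [normalize_row(sim, i) for i in range(n)]
    parent = list(range(n))

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            # Assumed linking rule: both directions must meet the threshold.
            if norm[i][j] >= threshold and norm[j][i] >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Raw pairwise similarities for four documents (illustrative values only).
sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
clusters = cluster(sim)
```

Because each row is scaled by its own highest value, a single threshold can be applied to every document regardless of how large its raw similarities happen to be.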
4. A computer readable recording medium having a program stored therein for causing a computer to execute operations, comprising:
calculating a similarity as a relative value between documents, with respect to combinations of a plurality of documents, using a document vector and a significance of a word included in a document;
converting the calculated similarity to an absolute value by normalization; and
clustering a plurality of documents, based on the similarity of the absolute value;
wherein the absolute value is a sum of a ratio between a similarity having a highest value and a similarity to be converted and a ratio between a mean value of similarities and the similarity to be converted, or the absolute value is a ratio between the similarity having the highest value among the similarities not to be converted and the similarity to be converted, said normalization being carried out in accordance with the following equation:
normalized similarity = α × (similarity of target document / similarity of document of first place) + β × (similarity of target document / mean value of the similarities),
wherein α and β are coefficients, and wherein, when α is 0 and β is 1 and a number of higher ranking documents is 1, the above equation can be expressed as the following equation:
normalized similarity = (similarity of target document / highest similarity in documents other than relevant document),
and wherein a result of said normalization identifying at least one of said plurality of documents relative to the relevant document is output.
5. A document extraction apparatus comprising:
a memory;
a Central Processing Unit;
a similarity calculation unit which respectively calculates a similarity as a relative value between documents, with respect to combinations of a plurality of documents, using a document vector and a significance of a word included in a document;
a conversion unit which converts a similarity calculated by the similarity calculation unit to an absolute value by normalization;
a clustering unit which performs clustering of a plurality of documents, based on the similarity of the absolute value;
a cluster sort unit which sorts results of the clustering, using the number of documents constituting each cluster as a key;
a representative document selection unit which selects a representative document from each cluster, with respect to the sorted results; and
an output unit which outputs representative documents in order corresponding to the sorted results;
wherein the absolute value is a sum of a ratio between a similarity having a highest value and a similarity to be converted and a ratio between a mean value of similarities and the similarity to be converted, or the absolute value is a ratio between the similarity having the highest value among similarities not to be converted and the similarity to be converted, said normalization being carried out in accordance with the following equation:
normalized similarity = α × (similarity of target document / similarity of document of first place) + β × (similarity of target document / mean value of the similarities),
wherein α and β are coefficients, and wherein, when α is 0 and β is 1 and a number of higher ranking documents is 1, the above equation can be expressed as the following equation:
normalized similarity = (similarity of target document / highest similarity in documents other than relevant document),
and wherein a result of said normalization identifying at least one of said plurality of documents relative to the relevant document is output.
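The cluster sort, representative selection, and output steps recited in this claim could look roughly like the Python sketch below. The claim says only that clusters are sorted using the number of documents as a key and that a representative is selected from each; the descending sort order and the choice of the member with the highest total similarity to the other members are assumptions made for illustration, and `extract_representatives` is a hypothetical name.

```python
def extract_representatives(clusters, sim):
    """Return one representative document index per cluster, largest cluster first.

    clusters: list of lists of document indices.
    sim:      square matrix of pairwise similarities between documents.
    """
    # Sort clusters by the number of documents they contain (descending: assumed).
    ordered = sorted(clusters, key=len, reverse=True)
    representatives = []
    for members in ordered:
        # Assumed selection rule: the member most similar to the rest of its cluster.
        rep = max(
            members,
            key=lambda d: sum(sim[d][m] for m in members if m != d),
        )
        representatives.append(rep)
    return representatives


# Illustrative similarities: document 1 is central to the cluster {0, 1, 2}.
sim = [
    [1.0, 0.9, 0.3, 0.1],
    [0.9, 1.0, 0.8, 0.1],
    [0.3, 0.8, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]
reps = extract_representatives([[3], [0, 1, 2]], sim)
```

Here the three-document cluster is listed first, so its representative precedes that of the single-document cluster in the output order.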
6. A document extraction method comprising:
calculating a similarity as a relative value between documents, with respect to combinations of a plurality of documents, using a document vector and a significance of a word included in a document;
converting the calculated similarity to an absolute value by normalization;
clustering a plurality of documents, based on the similarity of the absolute value;
sorting results of the clustering, using the number of documents constituting each cluster as a key;
selecting a representative document from each cluster, with respect to the sorted results; and
outputting representative documents in order corresponding to the sorted results;
wherein the absolute value is a sum of a ratio between a similarity having a highest value and a similarity to be converted and a ratio between a mean value of similarities and the similarity to be converted, or the absolute value is a ratio between the similarity having the highest value among similarities not to be converted and the similarity to be converted, said normalization being carried out in accordance with the following equation:
normalized similarity = α × (similarity of target document / similarity of document of first place) + β × (similarity of target document / mean value of the similarities),
wherein α and β are coefficients, and wherein, when α is 0 and β is 1 and a number of higher ranking documents is 1, the above equation can be expressed as the following equation:
normalized similarity = (similarity of target document / highest similarity in documents other than relevant document),
and wherein a result of said normalization identifying at least one of said plurality of documents relative to the relevant document is output.
7. A computer readable recording medium having a program stored therein for causing a computer to execute operations, comprising:
calculating a similarity as a relative value between documents, with respect to combinations of a plurality of documents, using a document vector and a significance of a word included in a document;
converting the calculated similarity to an absolute value by normalization;
clustering a plurality of documents, based on the similarity of the absolute value;
sorting results of the clustering, using the number of documents constituting each cluster as a key;
selecting a representative document from each cluster, with respect to the sorted results; and
outputting representative documents in order corresponding to the sorted results;
wherein the absolute value is a sum of a ratio between a similarity having a highest value and a similarity to be converted and a ratio between a mean value of similarities and the similarity to be converted, or the absolute value is a ratio between the similarity having the highest value among similarities not to be converted and the similarity to be converted, said normalization being carried out in accordance with the following equation:
normalized similarity = α × (similarity of target document / similarity of document of first place) + β × (similarity of target document / mean value of the similarities),
wherein α and β are coefficients, and wherein, when α is 0 and β is 1 and a number of higher ranking documents is 1, the above equation can be expressed as the following equation:
normalized similarity = (similarity of target document / highest similarity in documents other than relevant document),
and wherein a result of said normalization identifying at least one of said plurality of documents relative to the relevant document is output.
Specification