Document similarity calculation apparatus, clustering apparatus, and document extraction apparatus

US 7,451,139 B2
Filed: 10/28/2002
Issued: 11/11/2008
Est. Priority Date: 03/07/2002
Status: Expired due to Fees

Abstract

An input section inputs a document set. A normalization section calculates a similarity as a relative value between documents, with respect to combinations of a plurality of documents in the document set. The normalization section employs the tf·idf method. In the tf·idf method, a document vector and a significance of a word included in the document are used to perform normalization that converts each similarity to an absolute value.
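
The relative similarity described in the abstract can be illustrated with a minimal Python sketch. This is not taken from the patent text: the function names `tfidf_vectors` and `cosine`, the raw-count times log(N/df) weighting, and the use of cosine similarity are all assumptions chosen for illustration. The abstract only states that a document vector and a word significance are used.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse tf-idf vector (dict: word -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: raw term frequency times log(N / df).
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either vector has zero length."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy documents, already tokenized.
docs = [
    "stock market prices fall".split(),
    "stock market prices rise".split(),
    "heavy rain floods city".split(),
]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # documents sharing three of four words
sim_02 = cosine(vecs[0], vecs[2])  # documents sharing no words
```

A score such as `sim_01` is only meaningful relative to other scores computed over the same document set, which is why the claims go on to normalize it.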

7 Claims

1. A clustering apparatus comprising:
- a memory;
  
  a Central Processing Unit;
  
  a similarity calculation unit which respectively calculates a similarity as a relative value between documents, with respect to combinations of a plurality of documents, using a document vector and a significance of a word included in a document;
  
  a conversion unit which converts similarity calculated by the similarity calculation unit to an absolute value by normalization; and
  
  a clustering unit which executes clustering of a plurality of documents, based on similarity of the absolute value;
  
  wherein the absolute value is a sum of a ratio between a similarity having a highest value and a similarity to be converted and a ratio between a mean value of similarities and the similarity to be converted, or the absolute value is a ratio between the similarity having the highest value among the similarities not be converted and the similarity to be converted, said normalization being carried out in accordance with a following equation;
  
  normalized similarity = α × (similarity of target document / similarity of document of first place) + β × (similarity of target document / mean value of the similarities),
  
  wherein α and β are coefficients, and wherein α is 0 and β is 1, and a number of higher ranking documents is 1, the above equation can be expressed as following equation;
  
  Normalized similarity=(similarity of target document/highest similarity in documents other than relevant document), and wherein a result of said normalization identifying at least one of said plurality of documents relative to the relevant document is output.

2. The clustering apparatus according to claim 1, further comprising an important sentence extraction unit which extracts an important sentence from each of the documents, and designates a document consisting of this important sentence as an object of the similarity calculation.

3. A clustering method comprising:
- calculating a similarity as a relative value between documents, with respect to combinations of a plurality of documents, using a document vector and a significance of a word included in a document;
  
  converting similarity calculated to an absolute value by normalization; and
  
  executing clustering of a plurality of documents, based on the similarity of the absolute value;
  
  wherein the absolute value is a sum of a ratio between a similarity having a highest value and a similarity to be converted and a ratio between a mean value of similarities and the similarity to be converted, or the absolute value is a ratio between the similarity having the highest value among similarities not be converted and the similarity to be converted, said normalization being carried out in accordance with a following equation;
  
  normalized similarity = α × (similarity of target document / similarity of document of first place) + β × (similarity of target document / mean value of the similarities),
  
  wherein α and β are coefficients, and wherein α is 0 and β is 1, and number of higher ranking documents is 1, the above equation can be expressed as following equation;
  
  Normalized similarity=(similarity of target document/highest similarity in documents other than relevant document), and wherein a result of said normalization identifying at least one of said plurality of documents relative to the relevant document is output.

4. A computer readable recording medium having a program stored therein for causing a computer to execute operations, comprising:
- calculating a similarity as a relative value between documents, with respect to combinations of a plurality of documents, using a document vector and a significance of a word included in a document;
  
  converting similarity calculated to an absolute value by normalization; and
  
  clustering a plurality of documents, based on the similarity of an absolute value;
  
  wherein the absolute value is a sum of a ratio between a similarity having a highest value and a similarity to be converted and a ratio between a mean value of similarities and the similarity to be converted, or the absolute value is a ratio between the similarity having the highest value among the similarities not be converted and similarity to be converted, said normalization being carried out in accordance with a following equation;
  
  normalized similarity = α × (similarity of target document / similarity of document of first place) + β × (similarity of target document / mean value of the similarities),
  
  wherein α and β are coefficients, and wherein α is 0 and β is 1, and number of higher ranking documents is 1, the above equation can be expressed as following equation;
  
  Normalized similarity=(similarity of target document/highest similarity in documents other than relevant document), and wherein a result of said normalization identifying at least one of said plurality of documents relative to the relevant document is output.

5. A document extraction apparatus comprising:
- a memory;
  
  a Central Processing Unit;
  
  a similarity calculation unit which respectively calculates a similarity as a relative value between documents, with respect to combinations of a plurality of documents, using a document vector and a significance of a word included in a document;
  
  a conversion unit which converts a similarity calculated by the similarity calculation unit to an absolute value by normalization;
  
  a clustering unit which performs clustering of a plurality of documents, based on the similarity of the absolute value;
  
  a cluster sort unit which sorts results of the clustering, using number of documents constituting each cluster as a key;
  
  a representative document selection unit which selects a representative document from each cluster, with respect to the sorted results; and
  
  an output unit which outputs representative documents in order corresponding to the sorted results;
  
  wherein the absolute value is a sum of a ratio between a similarity having a highest value and a similarity to be converted and a ratio between a mean value of similarities and the similarity to be converted, or the absolute value is a ratio between the similarity having the highest value among similarities not be converted and the similarity to be converted, said normalization being carried out in accordance with a following equation;
  
  normalized similarity = α × (similarity of target document / similarity of document of first place) + β × (similarity of target document / mean value of the similarities),
  
  wherein α and β are coefficients, and wherein α is 0 and β is 1, and number of higher ranking documents is 1, the above equation can be expressed as following equation;
  
  Normalized similarity=(similarity of target document/highest similarity in documents other than relevant document), and wherein a result of said normalization identifying at least one of said plurality of documents relative to the relevant document is output.

6. A document extraction method comprising:
- calculating a similarity as a relative value between documents, with respect to combinations of a plurality of documents, using a document vector and a significance of a word included in a document;
  
  converting similarity calculated to an absolute value by normalization;
  
  clustering of a plurality of documents, based on the similarity of an absolute value;
  
  sorting results of clustering, using number of documents constituting each cluster as a key;
  
  selecting a representative document from each cluster, with respect to the sorted results; and
  
  outputting representative documents in order corresponding to the sorted results;
  
  wherein the absolute value is a sum of a ratio between a similarity having a highest value and a similarity to be converted and a ratio between a mean value of similarities and the similarity to be converted, or the absolute value is a ratio between the similarity having the highest value among similarities not be converted and the similarity to be converted, said normalization being carried out in accordance with a following equation;
  
  normalized similarity = α × (similarity of target document / similarity of document of first place) + β × (similarity of target document / mean value of the similarities),
  
  wherein α and β are coefficients, and wherein α is 0 and β is 1, and number of higher ranking documents is 1, the above equation can be expressed as following equation;
  
  Normalized similarity=(similarity of target document/highest similarity in documents other than relevant document), and wherein a result of said normalization identifying at least one of said plurality of documents relative to the relevant document is output.

7. A computer readable recording medium having a program stored therein for causing a computer to execute operations, comprising:
- calculating a similarity as a relative value between documents, with respect to combinations of a plurality of documents, using a document vector and a significance of a word included in a document;
  
  converting similarity calculated to an absolute value by normalization;
  
  clustering of a plurality of documents, based on the similarity of the absolute value;
  
  sorting results of the clustering, using number of documents constituting each cluster as a key;
  
  selecting a representative document from each cluster, with respect to the sorted results; and
  
  outputting representative documents in order corresponding to the sorted results;
  
  wherein the absolute value is a sum of a ratio between a similarity having a highest value and a similarity to be converted and a ratio between a mean value of similarities and the similarity to be converted, or the absolute value is a ratio between the similarity having the highest value among similarities not be converted and the similarity to be converted, said normalization being carried out in accordance with a following equation;
  
  normalized similarity = α × (similarity of target document / similarity of document of first place) + β × (similarity of target document / mean value of the similarities),
  
  wherein α and β are coefficients, and wherein α is 0 and β is 1, and number of higher ranking documents is 1, the above equation can be expressed as a following equation;
  
  Normalized similarity=(similarity of target document/highest similarity in documents other than relevant document), and wherein a result of said normalization identifying at least one of said plurality of documents relative to the relevant document is output.
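
The normalization equation and the extraction steps recited in claims 5 to 7 (cluster, sort by cluster size, select a representative, output in order) can be sketched in Python as follows. Several points are assumptions rather than claim language: the "mean value of the similarities" is read as the mean over the `top_n` highest-ranking documents, which is what makes the α = 0, β = 1, one-document case collapse to the second equation; the linking rule in `cluster` (both directions must reach a threshold, then connected components) and the representative rule in `extract` (largest total similarity to the rest of the cluster) are hypothetical choices, since the claims do not fix either one. All function names and the toy matrix are illustrative.

```python
def normalize(sims, alpha=0.0, beta=1.0, top_n=1):
    """Normalize raw similarities of the other documents to one relevant document.

    Computes alpha * s / s_first + beta * s / mean(top_n highest similarities).
    `sims` must exclude the relevant document itself and be non-negative.
    """
    ranked = sorted(sims, reverse=True)
    first = ranked[0]
    if first <= 0.0:
        return [0.0 for _ in sims]  # nothing is similar; avoid dividing by zero
    top = ranked[:top_n]
    mean_top = sum(top) / len(top)
    return [alpha * s / first + beta * s / mean_top for s in sims]


def normalized_matrix(sim, **kwargs):
    """Row i holds the normalized similarity of every other document to document i."""
    n = len(sim)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        row = normalize([sim[i][j] for j in others], **kwargs)
        for j, value in zip(others, row):
            out[i][j] = value
    return out


def cluster(norm, threshold):
    """Assumed rule: link i and j when both directions reach the threshold,
    then return the connected components as clusters."""
    n = len(norm)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if norm[i][j] >= threshold and norm[j][i] >= threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def extract(norm, clusters):
    """Sort clusters by number of documents (largest first) and pick one
    representative per cluster, in that order."""
    ordered = sorted(clusters, key=len, reverse=True)
    reps = []
    for members in ordered:
        # Assumed rule: the member with the largest total similarity to the rest.
        reps.append(max(members, key=lambda i: sum(norm[i][j] for j in members if j != i)))
    return reps


# Hypothetical raw similarity matrix for five documents.
sim = [
    [1.0, 0.8, 0.7, 0.1, 0.1],
    [0.8, 1.0, 0.6, 0.1, 0.2],
    [0.7, 0.6, 1.0, 0.2, 0.1],
    [0.1, 0.1, 0.2, 1.0, 0.5],
    [0.1, 0.2, 0.1, 0.5, 1.0],
]
norm = normalized_matrix(sim, alpha=0.0, beta=1.0, top_n=1)
clusters = cluster(norm, threshold=0.7)
representatives = extract(norm, clusters)
```

With this toy matrix, documents 0 to 2 and documents 3 and 4 form two clusters, and `representatives` is `[0, 3]`. Note that documents 3 and 4 have a raw similarity of only 0.5 yet a normalized similarity of 1.0, which shows how dividing by the highest similarity to other documents turns a set-relative score into one that can be compared against a fixed threshold.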

Current Assignee: Fujitsu Limited
Original Assignee: Fujitsu Limited
Inventors: Namba, Isao
Primary Examiner(s): Truong; Cam Y
Assistant Examiner(s): MYINT, DENNIS Y

Application Number: US10/281,318
Publication Number: US 20030172058A1
Time in Patent Office: 2,206 Days
Field of Search: 707/1, 707/500.1, 707/3, 707/5, 382/225
US Class Current: 1/1
CPC Class Codes:
G06F 16/3347   using vector based model
Y10S 707/99931   Database or file accessing
Y10S 707/99935   Query augmenting and refini...
Y10S 707/99936   Pattern matching access
Y10S 707/99945   Object-oriented database st...
