Document clustering method and apparatus based on common information of documents

US 7,499,923 B2
Filed: 03/04/2004
Issued: 03/03/2009
Est. Priority Date: 03/05/2003
Status: Expired due to Fees

Abstract

In document (or pattern) clustering, the correct number of clusters and accurate assignment of each document (or pattern) to the correct cluster are attained. Documents (or patterns) describing the same topic (or object) are grouped, so a document (or pattern) group belonging to the same cluster has some commonality. Each topic (or object) has distinctive terms (or object features) or term (or object feature) pairs. When the closeness of each document (or pattern) to a given cluster is obtained, common information about the given cluster is extracted and used while the influence of terms (or object features) or term (or object feature) pairs not distinctive to the given cluster is excluded.

9 Claims

1. A computer implemented method of clustering documents, each clustering document having one or plural document segments in an input document, said method comprising steps:
- (a) obtaining a co-occurrence matrix for the input document by using a computer, the co-occurrence matrix is a matrix reflecting occurrence frequencies of terms and co-occurrence frequencies of term pairs, and obtaining an input document frequency matrix for a set of input documents based on occurrence frequencies of terms or term pairs appearing in the set of input documents wherein said step (a) further includes;
  
  generating an input document segment vector for each input document segment of said input document segments based on occurrence frequencies of terms appearing in said each input document segment;
  
  obtaining the co-occurrence matrix for the input document from input document segment vectors; and
  
  obtaining the input document frequency matrix from the co-occurrence matrix for each document;
  
  (b) selecting a seed document from a set of remaining documents that are not included in any cluster existing, and constructing a current cluster of an initial state based on the seed document, wherein said selecting and said constructing comprises;
  
  constructing a remaining document common co-occurrence matrix for the set of the remaining documents based on a product of corresponding components of co-occurrence matrices of all documents in the set of remaining documents; and
  
  obtaining a document commonality of each remaining document to the set of the remaining documents based on a product sum between every component of the co-occurrence matrix of each remaining document and the corresponding component of the remaining document common co-occurrence matrix;
  
  extracting a document having highest document commonality to the set of the remaining documents; and
  
  constructing initial cluster by including the seed document and neighbor documents similar to the seed document;
  
  (c) making documents, which have document commonality to a current cluster higher than a threshold, belong temporarily to the current cluster;
  
  wherein said making comprising;
  
  constructing a current cluster common co-occurrence matrix for the current cluster and a current cluster document frequency matrix of the current cluster based on occurrence frequencies of terms or term pairs appearing in the documents of the current cluster;
  
  obtaining a distinctiveness value of each term and each term pair for the current cluster by comparing the input document frequency matrix with the current cluster document frequency matrix;
  
  obtaining weights of each term and each term pair from the distinctiveness values;
  
  obtaining a document commonality to the current cluster for each document in an input document set based on a product sum between every component of the co-occurrence matrix of the input document and the corresponding component of the current cluster common co-occurrence matrix while applying the weights to said components; and
  
  making the documents having document commonality to the current cluster higher than the threshold belong temporarily to the current cluster;
  
  (d) repeating step (c) until number of documents temporarily belonging to the current cluster does not increase;
  
  (e) repeating steps (b) through (d) until a given convergence condition is satisfied; and
  
  (f) deciding, on a basis of the document commonality of each document to each cluster, a cluster to which each document belongs and outputting said cluster.
- - 2. The method according to claim 1, wherein the remaining document common co-occurrence matrix or the current cluster common co-occurrence matrix reflects co-occurrence frequencies at which pairs of different terms co-occur in each document of the remaining documents or the current cluster.
  - 3. The clustering method according to claim 1, wherein the convergence condition in said step (e) is satisfied when (i) the number of documents whose document commonalities to any current clusters are less than a threshold becomes 0, or (ii) the number is less than a threshold and does not increase.
  - 4. The clustering method according to claim 1, wherein said step (f) further includes:
    - checking existence of a redundant cluster, and removing, when the redundant cluster exists, the redundant cluster and again deciding the cluster to which each document belongs.
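
The control flow of steps (c) and (d) in claim 1 can be sketched as a simple fixed-point loop. This is only an illustration under stated assumptions: `grow_cluster` and `toy_commonality` are hypothetical names, the scoring function is a toy stand-in rather than the weighted product sum the claim describes, and returning the previous membership when the count stops increasing is one possible reading of step (d).

```python
from statistics import mean


def grow_cluster(doc_ids, initial_members, commonality, threshold):
    """Repeat step (c) until the temporary membership stops growing (step (d)).

    `commonality(doc_id, members)` is a caller-supplied scoring function
    (hypothetical signature, not taken from the patent).
    """
    members = set(initial_members)
    while True:
        # Step (c): every input document scoring above the threshold
        # belongs temporarily to the current cluster.
        candidates = {d for d in doc_ids if commonality(d, members) > threshold}
        # Step (d): stop once the member count no longer increases.
        if len(candidates) <= len(members):
            return members
        members = candidates


def toy_commonality(doc_id, members):
    # Toy stand-in: documents are points on a line, and closeness to the
    # cluster mean plays the role of document commonality.
    return 1.0 / (1.0 + abs(doc_id - mean(members)))


cluster = grow_cluster([0, 1, 2, 3, 10], {0, 1}, toy_commonality, threshold=0.35)
print(sorted(cluster))  # [0, 1, 2]
```

In the toy run, document 2 joins on the first pass, the second pass adds nothing, and the loop stops.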

5. A computer implemented method of clustering documents each having one or plural document segments in an input document, said method comprising steps:
- (a) using a computer to obtain a co-occurrence matrix for the input document, obtaining a co-occurrence matrix $S^{r}$ for an input document $D_{r}$ based on occurrence frequencies of terms or term pairs appearing in the set of input documents;
  
  wherein in step (a), each mn component $S^{r}_{mn}$ of the co-occurrence matrix $S^{r}$ of the document $D_{r}$ is determined in accordance with;
  
  $S_{mn}^{r} = \sum_{y = 1}^{Y_{r}} d_{rym} d_{ryn}$ where;
  
  m and n denote $m^{\text{th}}$ and $n^{\text{th}}$ terms, respectively, among M terms appearing in the set of input documents, $D_{r}$ is $r^{\text{th}}$ document in a document set D consisting of R documents;
  
  $Y_{r}$ is number of document segments in the document $D_{r}$, wherein said $d_{rym}$ and $d_{ryn}$ denote existence or absence of the $m^{\text{th}}$ and $n^{\text{th}}$ terms, respectively, in $y^{\text{th}}$ document segment of the document $D_{r}$, and $S^{r}_{mm}$ represents number of document segments in which the $m^{\text{th}}$ term occurs and $S^{r}_{mn}$ represents co-occurrence counts of document segments in which the $m^{\text{th}}$ and $n^{\text{th}}$ terms co-occur;
  
  (b) selecting a seed document from a set of remaining documents that are not included in any cluster existing, and constructing a current cluster of an initial state based on the seed document, wherein said selecting and said constructing comprise;
  
  constructing a remaining document common co-occurrence matrix $T^{A}$ for a set of the remaining documents based on co-occurrence matrices of all documents in the set of remaining documents;
  
  obtaining a document commonality of each remaining document to the set of the remaining documents based on the co-occurrence matrix $S^{r}$ of each remaining document and the remaining document common co-occurrence matrix $T^{A}$;
  
  extracting the document having a highest document commonality to the set of the remaining documents; and
  
  constructing an initial cluster by including the seed document and neighbor documents similar to the seed document;
  
  (c) making documents having document commonality higher than a threshold belong temporarily to the current cluster;
  
  (d) repeating step (c) until a number of documents temporarily belonging to the current cluster does not increase;
  
  (e) repeating steps (b) through (d) until a given convergence condition is satisfied; and
  
  (f) deciding, on basis of the document commonality of each document to each cluster, a cluster to which each document belongs and outputting said cluster.
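
The co-occurrence matrix of step (a) in claim 5 sums, over the segments of one document, the products of binary term indicators. A minimal sketch follows, assuming each segment has already been tokenized into a list of terms. The function name `cooccurrence_matrix` and the sample vocabulary are hypothetical and chosen only for illustration.

```python
def cooccurrence_matrix(segments, vocabulary):
    """Return S with S[m][n] = number of segments containing both term m and term n.

    The diagonal S[m][m] is therefore the number of segments containing term m.
    """
    index = {term: i for i, term in enumerate(vocabulary)}
    size = len(vocabulary)
    S = [[0] * size for _ in range(size)]
    for segment in segments:
        # d_rym is binary, so repeated terms inside one segment count once.
        present = sorted({index[t] for t in segment if t in index})
        for m in present:
            for n in present:
                S[m][n] += 1
    return S


vocabulary = ["cluster", "document", "term"]
segments = [
    ["document", "cluster"],
    ["document", "term", "term"],
    ["cluster"],
]
S = cooccurrence_matrix(segments, vocabulary)
print(S)  # [[2, 1, 0], [1, 2, 1], [0, 1, 1]]
```

The matrix is symmetric by construction, since the indicator product is the same for (m, n) and (n, m).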
- - 6. The method according to claim 5, wherein in step (b), the remaining document common co-occurrence matrix $T^{A}$ is determined on the basis of a matrix T;
    - wherein the matrix T has an mn component determined by
7. The method according to claim 6, further comprising:
- determining a modified common co-occurrence matrix $Q^{A}$ on the basis of $T^{A}$; and
  
  in step (b), obtaining the document commonality of each remaining document to the set of the remaining documents based on the co-occurrence matrix $S^{r}$ of each remaining document and the modified common co-occurrence matrix $Q^{A}$;
  
  the matrix $Q^{A}$ having an mn component determined by
  
  $Q_{mn}^{A} = \log T_{mn}^{A}$ when $T_{mn}^{A} > 1$,
  
  $Q_{mn}^{A} = 0$ otherwise.
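
The logarithmic damping in claim 7 can be illustrated as follows. The formula for the matrix T in claim 6 is not reproduced on this page, so the sketch assumes a plain component-wise product over the documents, following the wording of claim 1 ("a product of corresponding components"). The function names `elementwise_product` and `log_damped` are hypothetical.

```python
import math


def elementwise_product(matrices):
    """Multiply corresponding components of equally sized square matrices.

    Assumption: a plain product, so a zero in any one document
    zeroes that component of the result.
    """
    size = len(matrices[0])
    T = [[1] * size for _ in range(size)]
    for S in matrices:
        for m in range(size):
            for n in range(size):
                T[m][n] *= S[m][n]
    return T


def log_damped(T):
    """Q_mn = log T_mn when T_mn > 1, and 0 otherwise (as stated in claim 7)."""
    return [[math.log(t) if t > 1 else 0.0 for t in row] for row in T]


T_A = elementwise_product([[[2, 1], [1, 3]], [[4, 0], [0, 2]]])
Q_A = log_damped(T_A)
print(T_A)  # [[8, 0], [0, 6]]
```

Taking the logarithm keeps products over many documents from growing without bound, and components that are not shared map to zero.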
8. The method according to claim 7, wherein in step (b), the document commonality of each remaining document P having a co-occurrence matrix $S^{P}$ with respect to the set of remaining documents is given by

  $\mathrm{com}_{q}(D', P; Q^{A}) = \dfrac{\sum_{m=1}^{M} \sum_{n=1}^{M} Q_{mn}^{A} S_{mn}^{P}}{\sqrt{\sum_{m=1}^{M} \sum_{n=1}^{M} (Q_{mn}^{A})^{2}} \sqrt{\sum_{m=1}^{M} \sum_{n=1}^{M} (S_{mn}^{P})^{2}}}$.
9. The method according to claim 7, wherein in step (b), the document commonality of each remaining document P having a co-occurrence matrix $S^{P}$ with respect to the set of remaining documents is given by

  $\mathrm{com}_{q}(D', P; Q^{A}) = \dfrac{\sum_{m=1}^{M} \sum_{n=1}^{M} T_{mn}^{A} S_{mn}^{P}}{\sqrt{\sum_{m=1}^{M} \sum_{n=1}^{M} (T_{mn}^{A})^{2}} \sqrt{\sum_{m=1}^{M} \sum_{n=1}^{M} (S_{mn}^{P})^{2}}}$.
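
The document commonality of claims 8 and 9 can be sketched as below, assuming the formula is a cosine-style normalized product sum between a common matrix and a document's co-occurrence matrix. The function name `commonality` and the choice to return 0.0 when either matrix is all zeros are assumptions made for illustration.

```python
import math


def commonality(common, S):
    """Normalized product sum between a common matrix and a document matrix.

    `common` may be Q^A (claim 8) or T^A (claim 9). Both arguments are
    square lists of lists of the same size.
    """
    product_sum = sum(
        c * s
        for c_row, s_row in zip(common, S)
        for c, s in zip(c_row, s_row)
    )
    common_sq = sum(c * c for row in common for c in row)
    s_sq = sum(s * s for row in S for s in row)
    denom = math.sqrt(common_sq * s_sq)
    # Assumption: define the score as 0.0 when either matrix is all zeros.
    return product_sum / denom if denom else 0.0


same = commonality([[3, 0], [0, 4]], [[3, 0], [0, 4]])
disjoint = commonality([[1, 0], [0, 0]], [[0, 0], [0, 2]])
print(same, disjoint)  # 1.0 0.0
```

Under this reading, the seed document of step (b) would be the remaining document with the largest score, for example `max(doc_matrices, key=lambda S: commonality(Q_A, S))` for a hypothetical list `doc_matrices`.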

Current Assignee: Hewlett Packard Enterprise Development LP (Hewlett-Packard Enterprise Company)
Original Assignee: Hewlett-Packard Development Company, L.P. (HP Inc.)
Inventors: Kawatani, Takahiko
Primary Examiner(s): Wassum; Luke S.
Assistant Examiner(s): Pham; Michael
Application Number: US10/791,897
Publication Number: US 20040230577A1
Time in Patent Office: 1,825 Days
Field of Search: None
US Class Current: 1/1
CPC Class Codes:
G06F 16/355   Class or cluster creation o...
G06F 18/23   Clustering techniques
Y10S 707/99937   Sorting
