Method for adapting a K-means text clustering to emerging data

US 7,779,349 B2
Filed: 04/07/2008
Issued: 08/17/2010
Est. Priority Date: 09/26/2000
Status: Expired due to Fees


Abstract

A method and structure for clustering documents in datasets, which includes clustering first documents in a first dataset to produce first document classes, creating centroid seeds based on the first document classes, and clustering second documents in a second dataset using the centroid seeds, wherein the first dataset and the second dataset are related. The clustering of the first documents in the first dataset forms a first dictionary of most common words in the first dataset, generates a first vector space model by counting, for each word in the first dictionary, a number of the first documents in which the word occurs, and clusters the first documents in the first dataset based on the first vector space model. A second vector space model is further generated by counting, for each word in the first dictionary, a number of the second documents in which the word occurs. Creation of the centroid seeds includes classifying the second vector space model using the first document classes to produce a classified second vector space model and determining a mean of vectors in each class in the classified second vector space model, wherein the mean comprises the centroid seeds.
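The dictionary and vector space model steps described in the abstract can be illustrated with a minimal sketch. Everything named here (`tokenize`, `build_dictionary`, `build_vsm`, the sample documents) is hypothetical and not taken from the patent. The sketch assumes whitespace tokenization, ranks words by the number of documents they occur in, and stores per-document term counts as the vectors; the patent text may intend a different weighting.

```python
from collections import Counter

def tokenize(text):
    # Hypothetical tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()

def build_dictionary(docs, size):
    """Return the `size` words that occur in the most documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))  # count each word once per document
    # Highest document frequency first; ties broken alphabetically.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:size]]

def build_vsm(docs, dictionary):
    """One row per document, one column per dictionary word (term counts)."""
    index = {word: i for i, word in enumerate(dictionary)}
    rows = []
    for doc in docs:
        row = [0] * len(dictionary)
        for tok in tokenize(doc):
            if tok in index:
                row[index[tok]] += 1
        rows.append(row)
    return rows

first_docs = ["printer jam error", "printer driver error", "password reset"]
second_docs = ["printer offline", "password expired reset"]

dictionary = build_dictionary(first_docs, size=4)
first_vsm = build_vsm(first_docs, dictionary)
# The second model reuses the FIRST dictionary, so both datasets share one space.
second_vsm = build_vsm(second_docs, dictionary)
```

Reusing the first dictionary for the second dataset is what lets classes learned on the first dataset be applied directly to the second; words that appear only in the second dataset are ignored at this stage.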

9 Claims

1. A system for clustering documents in datasets comprising:
- a storage device storing a first dataset and a second dataset;
  
  a cluster generator operative to cluster first documents in said first dataset and produce first document classes;
  
  a centroid seed generator operative to generate centroid seeds based on said first document classes;
  
  a dictionary generator adapted to generate a first dictionary of most common words in said first dataset; and
  
  a vector space model generator adapted to generate a first vector space model by counting, for each word in said first dictionary, a number of said first documents in which said word occurs, wherein said cluster generator clusters said documents in said first dataset based on said first vector space model, wherein said cluster generator clusters second documents in said second dataset using said centroid seeds, such that said second dataset has a similar, based on said centroid seeds, clustering to that of said first dataset, wherein said second dataset comprises a new, but related, dataset different than said first dataset, wherein said vector space model generator generates a second vector space model by counting, for each word in said first dictionary, a number of said second documents in which said word occurs;
  
  a classifier adapted to classify said second documents in said second vector space model using said first document classes to produce a classified second vector space model and adapted to determine a mean of vectors in each class in said classified second vector space model, wherein said mean comprises said centroid seeds.
- 2. The system in claim 1, wherein:
  - said dictionary generator is adapted to generate a second dictionary of most common words in said second dataset,
  - said vector space model generator is adapted to generate a third vector space model by counting, for each word in said second dictionary, a number of said second documents in which said word occurs, and
  - said cluster generator is adapted to cluster said second documents in said second dataset based on said third vector space model to produce a second dataset cluster.
- 3. The system in claim 2, wherein said cluster generator is adapted to produce an adapted dataset cluster by clustering said second documents in said second dataset using said centroid seeds and said system further comprises:
  - a comparator adapted to compare classes in said adapted dataset cluster to classes in said second dataset cluster and add classes to said adapted dataset cluster based on said comparing.
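As a rough illustration of how the classifier in claim 1 could yield centroid seeds, the sketch below classifies each second-dataset vector into one of the first document classes and then takes the mean of the second-dataset vectors in each class. The nearest-centroid classifier, the squared Euclidean distance, the fallback for empty classes, and all names are assumptions made for this example; the claim does not specify them.

```python
def mean_vector(vectors):
    # Component-wise mean of a non-empty list of equal-length vectors.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid_seeds(first_vsm, first_labels, second_vsm, k):
    """Return (seeds, second_labels) for k first document classes."""
    # Centroid of each first document class (each class is assumed non-empty).
    first_centroids = []
    for c in range(k):
        members = [v for v, lab in zip(first_vsm, first_labels) if lab == c]
        first_centroids.append(mean_vector(members))

    # Assumed classifier: assign each second document to the nearest class centroid.
    second_labels = [
        min(range(k), key=lambda c: sq_dist(v, first_centroids[c]))
        for v in second_vsm
    ]

    # Seed = mean of the second-dataset vectors classified into each class.
    seeds = []
    for c in range(k):
        members = [v for v, lab in zip(second_vsm, second_labels) if lab == c]
        # Assumption: keep the first-dataset centroid if no second document lands here.
        seeds.append(mean_vector(members) if members else first_centroids[c])
    return seeds, second_labels

first_vsm = [[2, 0], [4, 0], [0, 2], [0, 4]]
first_labels = [0, 0, 1, 1]
second_vsm = [[5, 1], [3, 1], [1, 6]]

seeds, second_labels = centroid_seeds(first_vsm, first_labels, second_vsm, k=2)
```

Because the seeds are means of second-dataset vectors, they start the second clustering near the old class structure while already reflecting how the newer documents have drifted.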

4. A system for clustering documents in datasets comprising:
- a storage device storing a first dataset and a second dataset, said first dataset and said second dataset being generated by a same source during different time periods such that said first dataset and said second dataset are related but different;
  
  a cluster generator operative to cluster first documents in said first dataset and produce first document classes;
  
  a centroid seed generator operative to generate centroid seeds based on said first document classes;
  
  a dictionary generator adapted to generate a first dictionary of most common words in said first dataset; and
  
  a vector space model generator adapted to generate a first vector space model by counting, for each word in said first dictionary, a number of said first documents in which said word occurs, wherein said cluster generator clusters said documents in said first dataset based on said first vector space model, wherein said cluster generator clusters second documents in said second dataset using said centroid seeds, such that said second dataset has a similar, based on said centroid seeds, clustering to that of said first dataset, wherein said vector space model generator generates a second vector space model by counting, for each word in said first dictionary, a number of said second documents in which said word occurs;
  
  a classifier adapted to classify said second documents in said second vector space model using said first document classes to produce a classified second vector space model and adapted to determine a mean of vectors in each class in said classified second vector space model, wherein said mean comprises said centroid seeds.
- 5. The system in claim 4, wherein:
  - said dictionary generator is adapted to generate a second dictionary of most common words in said second dataset,
  - said vector space model generator is adapted to generate a third vector space model by counting, for each word in said second dictionary, a number of said second documents in which said word occurs, and
  - said cluster generator is adapted to cluster said second documents in said second dataset based on said third vector space model to produce a second dataset cluster.
- 6. The system in claim 5, wherein said cluster generator is adapted to produce an adapted dataset cluster by clustering said second documents in said second dataset using said centroid seeds and said system further comprises:
  - a comparator adapted to compare classes in said adapted dataset cluster to classes in said second dataset cluster and add classes to said adapted dataset cluster based on said comparing.
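One possible reading of the comparator in claims 3, 6 and 9 is sketched below. Since the adapted clustering and the independent second-dataset clustering partition the same second documents, the sketch compares classes by document membership rather than by centroid, which avoids the two clusterings using different dictionaries. The Jaccard measure, the 0.5 threshold, and the function names are assumptions for illustration only; the claims do not say how classes are compared.

```python
def jaccard(a, b):
    # Overlap between two sets of document ids, in the range 0.0 to 1.0.
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def classes_from_labels(labels):
    # Map each class label to the set of document ids assigned to it.
    classes = {}
    for doc_id, label in enumerate(labels):
        classes.setdefault(label, set()).add(doc_id)
    return classes

def find_new_classes(adapted_labels, independent_labels, threshold=0.5):
    """Return classes of the independent clustering that match no adapted class."""
    adapted = classes_from_labels(adapted_labels)
    independent = classes_from_labels(independent_labels)
    new_classes = []
    for _, members in sorted(independent.items()):
        best = max(jaccard(members, m) for m in adapted.values())
        if best < threshold:  # assumed cutoff: no adapted class covers this one
            new_classes.append(sorted(members))
    return new_classes

# Labels for the same seven second-dataset documents under both clusterings.
adapted_labels = [0, 0, 0, 0, 0, 1, 1]
independent_labels = [0, 0, 0, 2, 2, 1, 1]

new_classes = find_new_classes(adapted_labels, independent_labels)
```

Here documents 3 and 4 form their own class in the independent clustering but are absorbed into a larger class in the adapted one, so they are reported as a candidate class to add to the adapted dataset cluster.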

7. A system for clustering documents in datasets comprising:
- a storage device storing a first dataset and a second dataset, said first dataset and said second dataset being generated by a same source during different time periods such that said first dataset and said second dataset are related but different;
  
  a cluster generator operative to cluster first documents in said first dataset into a user-specified number of clusters and produce first document classes;
  
  a centroid seed generator operative to generate centroid seeds based on said first document classes;
  
  a dictionary generator adapted to generate a first dictionary of a user-specified number of most common words in said first dataset; and
  
  a vector space model generator adapted to generate a first vector space model by counting, for each word in said first dictionary, a number of said first documents in which said word occurs, wherein said cluster generator clusters said documents in said first dataset based on said first vector space model, wherein said cluster generator clusters second documents in said second dataset using said centroid seeds, such that said second dataset has a similar, based on said centroid seeds, clustering to that of said first dataset, wherein said vector space model generator generates a second vector space model by counting, for each word in said first dictionary, a number of said second documents in which said word occurs;
  
  a classifier adapted to classify said second documents in said second vector space model using said first document classes to produce a classified second vector space model and adapted to determine a mean of vectors in each class in said classified second vector space model, wherein said mean comprises said centroid seeds.
- 8. The system of claim 7, wherein:
  - said dictionary generator is adapted to generate a second dictionary of most common words in said second dataset,
  - said vector space model generator is adapted to generate a third vector space model by counting, for each word in said second dictionary, a number of said second documents in which said word occurs, and
  - said cluster generator is adapted to cluster said second documents in said second dataset based on said third vector space model to produce a second dataset cluster.
- 9. The system in claim 8, wherein said cluster generator is adapted to produce an adapted dataset cluster by clustering said second documents in said second dataset using said centroid seeds and said system further comprises:
  - a comparator adapted to compare classes in said adapted dataset cluster to classes in said second dataset cluster and add classes to said adapted dataset cluster based on said comparing.
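The cluster generator in claims 7 to 9 takes a user-specified number of clusters and can also start from centroid seeds. A minimal K-means sketch covering both modes follows. It is a generic textbook K-means under stated assumptions (squared Euclidean distance, a fixed iteration cap, random initial centroids when no seeds are given), not the patent's implementation, and the `kmeans` signature is hypothetical.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, seeds=None, iterations=20, rng_seed=0):
    """Cluster `vectors` into k classes, optionally starting from `seeds`."""
    if seeds is not None:
        if len(seeds) != k:
            raise ValueError("need exactly one seed per cluster")
        centroids = [list(s) for s in seeds]
    else:
        # Assumption: unseeded runs start from k randomly chosen documents.
        centroids = [list(v) for v in random.Random(rng_seed).sample(vectors, k)]

    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids

vectors = [[0, 0], [0, 1], [10, 10], [10, 11]]
labels, centroids = kmeans(vectors, k=2, seeds=[[1, 1], [9, 9]])
unseeded_labels, _ = kmeans(vectors, k=2)
```

Starting from seeds makes the run deterministic and keeps class numbering aligned with the first dataset, which is what allows the second clustering to be "similar" to the first in the sense the claims describe.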

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Spangler, William S.
Primary Examiner(s): Stork; Kyle R
Application Number: US12/098,486
Publication Number: US 20080215314A1
Time in Patent Office: 862 Days
Field of Search: 715/234, 715/243, 715/254, 704/10, 707/1, 707/3, 707/5, 707/6, 707/10, 358/403, 382/305
US Class Current: 715/234
CPC Class Codes:
- G06F 16/355 Class or cluster creation o...
- G06F 18/23213 with fixed number of cluste...
