Method for adapting a K-means text clustering to emerging data

  • US 7,779,349 B2
  • Filed: 04/07/2008
  • Issued: 08/17/2010
  • Est. Priority Date: 09/26/2000
  • Status: Expired due to Fees
First Claim

1. A system for clustering documents in datasets comprising:

    a storage device storing a first dataset and a second dataset;

    a cluster generator operative to cluster first documents in said first dataset and produce first document classes;

    a centroid seed generator operative to generate centroid seeds based on said first document classes;

    a dictionary generator adapted to generate a first dictionary of most common words in said first dataset; and

    a vector space model generator adapted to generate a first vector space model by counting, for each word in said first dictionary, a number of said first documents in which said word occurs, wherein said cluster generator clusters said documents in said first dataset based on said first vector space model, wherein said cluster generator clusters second documents in said second dataset using said centroid seeds, such that said second dataset has a similar, based on said centroid seeds, clustering to that of said first dataset, wherein said second dataset comprises a new, but related, dataset different than said first dataset, wherein said vector space model generator generates a second vector space model by counting, for each word in said first dictionary, a number of said second documents in which said word occurs;

    a classifier adapted to classify said second documents in said second vector space model using said first document classes to produce a classified second vector space model and adapted to determine a mean of vectors in each class in said classified second vector space model, wherein said mean comprises said centroid seeds.
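
The data flow recited in claim 1 can be sketched in a few lines of Python. This is an illustrative sketch only, not the patented implementation. All function names, the whitespace tokenizer, the per-document word-count vectors, the squared Euclidean distance, and the nearest-centroid classifier are assumptions made for the example; the claim does not specify them.

```python
from collections import Counter


def build_dictionary(docs, size):
    """Most common words in the first dataset, ranked by document frequency.
    Ties are broken alphabetically so the result is deterministic (assumption)."""
    df = Counter(w for d in docs for w in set(d.lower().split()))
    ranked = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:size]]


def vectorize(docs, dictionary):
    """Vector space model: one row per document, one column per dictionary word.
    Each entry is the word's occurrence count in that document (assumption)."""
    rows = []
    for d in docs:
        counts = Counter(d.lower().split())
        rows.append([float(counts[w]) for w in dictionary])
    return rows


def sqdist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(v, centroids):
    """Index of the closest centroid (lowest index wins a tie)."""
    return min(range(len(centroids)), key=lambda k: sqdist(v, centroids[k]))


def class_means(vectors, labels, fallback):
    """Mean vector of each class; an empty class keeps its fallback centroid."""
    means = []
    for k, fb in enumerate(fallback):
        members = [v for v, lab in zip(vectors, labels) if lab == k]
        if members:
            means.append([sum(col) / len(members) for col in zip(*members)])
        else:
            means.append(list(fb))
    return means


def kmeans(vectors, seeds, max_iter=50):
    """Plain K-means started from the given seeds; stops when labels are stable."""
    centroids = [list(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(v, centroids) for v in vectors]
        if new_labels == labels:
            break
        labels = new_labels
        centroids = class_means(vectors, labels, centroids)
    return labels, centroids


# Toy data, invented for illustration.
first = ["printer paper jam", "printer toner low",
         "password reset login", "login password expired"]
second = ["printer jam again", "reset my password",
          "toner printer empty", "login failed password"]

# First dataset: dictionary, vector space model, clustering into classes.
dictionary = build_dictionary(first, size=6)
vsm1 = vectorize(first, dictionary)
labels1, centroids1 = kmeans(vsm1, seeds=[vsm1[0], vsm1[2]])

# Second dataset: reuse the FIRST dictionary, classify with the first classes,
# then take the per-class means of the second vectors as centroid seeds.
vsm2 = vectorize(second, dictionary)
classified = [nearest(v, centroids1) for v in vsm2]
seeds = class_means(vsm2, classified, centroids1)

# Cluster the second dataset starting from those seeds.
labels2, centroids2 = kmeans(vsm2, seeds)
```

Because the second clustering starts from seeds derived from the first document classes, cluster 0 and cluster 1 keep the same meaning across both datasets, which is the "similar clustering" the claim describes. Words that appear only in the second dataset (such as "again" or "failed" in the toy data) are ignored here, since the sketch builds the second vector space model over the first dictionary.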
