System and method for clustering documents

  • US 8,046,363 B2
  • Filed: 01/10/2007
  • Issued: 10/25/2011
  • Est. Priority Date: 04/13/2006
  • Status: Active Grant
First Claim

1. A clustering device having software and hardware for clustering documents, the clustering device comprising:

  • a document storage unit storing documents;

    a document retrieving unit to retrieve at least two selected documents from a plurality of documents stored in the document storage unit;

    a document feature writing unit to acquire text from at least one identical or similar specific field among a plurality of fields in each of the selected documents, and to create index files of the selected documents, and to determine a feature vector for each of the selected documents using the index files, wherein the feature vector includes occurrence frequencies of a plurality of keywords located in the specific field of the selected document;

    a representative vector calculator to extract representative keywords having highest occurrence frequencies among the plurality of keywords in the index file of the selected document, to calculate percentages of occurrence frequencies of each of the representative keywords in each selected document, to choose a group of keywords from the representative keywords of the selected documents to be representative keywords used in clustering of the documents, the representative keywords corresponding to a group of keywords having greatest values obtained by adding the percentage of occurrence frequencies among same keywords in each of the selected documents, wherein the representative keywords are chosen to be components of the representative vector;

    a cluster database storage unit to store representative keywords constituting components of the representative vector with the representative vector; and

    a clustering unit comprising:

    a similarity calculator to determine a similarity of the specific field of a stored document in the document storage unit through an inner product value between a feature vector of the stored document and stored representative vector in the cluster database storage unit;

    receiving a stored document as a target document for clustering;

    clustering the stored document using the specific field of the stored document through an inner product value between a feature vector of the stored document and stored representative vector in the cluster database storage unit; and

    providing the user with a clustering result.
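
The data flow recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and everything the claim leaves open is an assumption here: the regular-expression tokenizer, the hypothetical function names, the `per_doc` and `size` cut-offs, the use of the summed percentages as the component values of the representative vector, and the rule that a document joins the cluster with the largest inner product.

```python
from collections import Counter
import re


def feature_vector(field_text):
    """Keyword occurrence frequencies for the text of one specific field.

    Assumption: keywords are lowercase alphanumeric tokens.
    """
    return Counter(re.findall(r"[a-z0-9]+", field_text.lower()))


def representative_keywords(vectors, per_doc=3, size=3):
    """Choose the keywords with the greatest summed occurrence percentages.

    per_doc and size are hypothetical cut-offs; the claim does not fix them.
    """
    totals = Counter()
    for vec in vectors:
        n = sum(vec.values())
        if n == 0:
            continue
        # Each selected document contributes only its most frequent keywords,
        # weighted by their percentage of that document's keyword occurrences.
        for word, count in vec.most_common(per_doc):
            totals[word] += 100.0 * count / n
    # Assumption: the summed percentages serve as the vector's component values.
    return dict(totals.most_common(size))


def similarity(vec, representative):
    """Inner product of a feature vector and a representative vector."""
    return sum(vec[word] * weight for word, weight in representative.items())


def assign_cluster(field_text, clusters):
    """Return the name of the cluster whose representative vector scores highest.

    Assumption: the largest inner product decides cluster membership.
    """
    vec = feature_vector(field_text)
    return max(clusters, key=lambda name: similarity(vec, clusters[name]))
```

In this sketch, `clusters` plays the role of the cluster database storage unit: a dictionary mapping a cluster name to the output of `representative_keywords` for the documents selected for that cluster. Passing the specific-field text of a stored document to `assign_cluster` corresponds to the receiving and clustering steps.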
