CLUSTERING OF NEAR-DUPLICATE DOCUMENTS

  • US 20110087668A1
  • Filed: 08/27/2010
  • Published: 04/14/2011
  • Est. Priority Date: 10/09/2009
  • Status: Active Grant
First Claim

1. In a computer system having a processor and a computer-readable storage medium, a method for grouping near-duplicate documents, the method comprising:

    for each document in a corpus of documents to be analyzed, computing, by the processor, a hash vector based on word count information for the document, the hash vector including a plurality of components;

    assigning, by the processor, each document to one of a plurality of initial clusters of documents, wherein each of the initial clusters contains a root document and at least some of the initial clusters further contain at least one child document, and wherein each of the child documents of any one of the initial clusters satisfies a first edit-distance constraint relative to the root document of that one of the initial clusters, the first edit-distance constraint being defined as an upper limit on a number of components of the hash vectors that are different between the root document and the child document;

    merging, by the processor, the initial clusters to form a plurality of final clusters, wherein during the merging, a first one and a second one of the initial clusters are merged in the event that the first one of the initial clusters and the second one of the initial clusters satisfy a second edit-distance constraint, the second edit-distance constraint being a constraint requiring similarity of topology between the first initial cluster and the second initial cluster; and

    storing in the computer readable storage medium, by the processor, a list of the documents associated with each of the final clusters.
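
The steps recited above can be illustrated with a minimal sketch. It is not the patented implementation, and several details are assumptions made only for illustration: the bucketed word-count hash in `hash_vector`, the greedy root assignment in `initial_clusters`, and, in particular, the merge test in `merge_clusters`. The claim requires "similarity of topology" between clusters without defining it here, so the sketch substitutes a simple stand-in (root hash vectors differing in at most `merge_diff` components). All function and parameter names are hypothetical.

```python
import hashlib
from collections import Counter


def hash_vector(text, k=8):
    """Assumed scheme: sum word counts into k buckets chosen by a stable word hash."""
    counts = Counter(text.lower().split())
    vec = [0] * k
    for word, n in counts.items():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % k
        vec[bucket] += n
    return tuple(vec)


def component_distance(a, b):
    """Number of hash-vector components that differ between two documents."""
    return sum(1 for x, y in zip(a, b) if x != y)


def initial_clusters(vectors, max_diff):
    """Greedy assignment: a document joins the first root within max_diff
    differing components; otherwise it becomes the root of a new cluster."""
    clusters = []
    for doc_id, vec in vectors.items():
        for cluster in clusters:
            if component_distance(vectors[cluster["root"]], vec) <= max_diff:
                cluster["children"].append(doc_id)
                break
        else:
            clusters.append({"root": doc_id, "children": []})
    return clusters


def merge_clusters(clusters, vectors, merge_diff):
    """Stand-in merge test (an assumption, not the claim's topology test):
    merge two clusters when their roots differ in at most merge_diff components."""
    parent = list(range(len(clusters)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            root_i = vectors[clusters[i]["root"]]
            root_j = vectors[clusters[j]["root"]]
            if component_distance(root_i, root_j) <= merge_diff:
                parent[find(j)] = find(i)

    final = {}
    for i, cluster in enumerate(clusters):
        members = [cluster["root"]] + cluster["children"]
        final.setdefault(find(i), []).extend(members)
    # The returned lists play the role of the stored per-cluster document lists.
    return list(final.values())
```

With this bucketing, replacing a single word in a document changes at most two components of its hash vector, which is why a small limit on differing components can serve as a near-duplicate test. A tighter `max_diff` for the initial pass and a looser `merge_diff` for the merge is one plausible configuration, not one stated in the claim.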
