×

Clustering of near-duplicate documents

  • US 9,355,171 B2
  • Filed: 08/27/2010
  • Issued: 05/31/2016
  • Est. Priority Date: 10/09/2009
  • Status: Expired due to Fees
First Claim
Patent Images

1. In a computer system having a processor and a computer-readable storage medium, a method for grouping near-duplicate documents, the method comprising:

  • for each document in a corpus of documents to be analyzed, computing, by the processor, a hash vector based on word count information for the document, the hash vector including a plurality of components;

    assigning, by the processor, each document to one of a plurality of initial clusters of documents, wherein each of the initial clusters contains a root document and at least some of the initial clusters further contain at least one child document, and wherein each of the child documents of any one of the initial clusters satisfies a first edit-distance constraint relative to the root document of that one of the initial clusters, the first edit-distance constraint being defined as an upper limit on a number of components of the hash vectors that are different between the root document and the child document;

    merging, by the processor, the initial clusters to form a plurality of final clusters based on a second edit-distance constraint; and

    storing in the computer readable storage medium, by the processor, a list of the documents associated with each of the final clusters,wherein assigning each document to one of the initial clusters includes;

    for a target one of the documents, comparing the hash vector of the target document with the hash vector of the root document of the initial cluster to determine whether the first edit distance constraint is satisfied;

    in response to a determination that the first edit distance constraint is satisfied, adding the target document as a child document to the initial cluster, by storing an identifier of the target document in the computer-readable storage medium in association with the initial cluster; and

    in response to a determination that the first edit distance constraint is not satisfied, adding a new cluster to a list of the initial clusters in the computer-readable storage medium, wherein the target document is the root document of the new cluster.

View all claims
  • 4 Assignments
Timeline View
Assignment View
    ×
    ×