CLUSTERING OF NEAR-DUPLICATE DOCUMENTS

US 20110087668A1
Filed: 08/27/2010
Published: 04/14/2011
Est. Priority Date: 10/09/2009
Status: Active Grant

Abstract

Documents likely to be near-duplicates are clustered based on document vectors that represent word-occurrence patterns in a relatively low-dimensional space. Edit distance between documents is defined based on comparing their document vectors. In one process, initial clusters are formed by applying a first edit-distance constraint relative to a root document of each cluster. The initial clusters can be merged subject to a second edit-distance constraint that limits the maximum edit distance between any two documents in the cluster. The second edit-distance constraint can be defined such that whether it is satisfied can be determined by comparing cluster structures rather than individual documents.
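
The initial clustering step summarized above can be sketched in Python. This is a minimal illustration, not the patented implementation: the names `edit_distance`, `assign_initial_clusters`, and `max_diff` are hypothetical, the document vectors are made-up tuples, and the first-match policy is borrowed from the wording of claim 4 rather than from the abstract.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Tuple[int, ...]


def edit_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the components that differ between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(1 for x, y in zip(a, b) if x != y)


def assign_initial_clusters(vectors: Dict[str, Vector], max_diff: int = 1) -> List[dict]:
    """Assign each document to the first cluster whose root is within max_diff."""
    clusters: List[dict] = []  # the list of extant clusters starts empty
    for doc_id, vec in vectors.items():
        for cluster in clusters:
            root_vec = vectors[cluster["root"]]
            if edit_distance(vec, root_vec) <= max_diff:
                cluster["children"].append(doc_id)
                break
        else:
            # No extant cluster accepted the document, so it roots a new one.
            clusters.append({"root": doc_id, "children": []})
    return clusters


# Made-up four-component vectors, for illustration only.
docs = {
    "a": (1, 2, 3, 4),
    "b": (1, 2, 3, 9),
    "c": (7, 7, 7, 7),
    "d": (1, 5, 3, 4),
}
clusters = assign_initial_clusters(docs, max_diff=1)
```

In this example "b" and "d" each differ from the root "a" in one component but differ from each other in two, which is why a constraint measured only against the root bounds the distance between two children at twice the root limit.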

21 Claims

1. In a computer system having a processor and a computer-readable storage medium, a method for grouping near-duplicate documents, the method comprising:
- for each document in a corpus of documents to be analyzed, computing, by the processor, a hash vector based on word count information for the document, the hash vector including a plurality of components;
  
  assigning, by the processor, each document to one of a plurality of initial clusters of documents, wherein each of the initial clusters contains a root document and at least some of the initial clusters further contain at least one child document, and wherein each of the child documents of any one of the initial clusters satisfies a first edit-distance constraint relative to the root document of that one of the initial clusters, the first edit-distance constraint being defined as an upper limit on a number of components of the hash vectors that are different between the root document and the child document;
  
  merging, by the processor, the initial clusters to form a plurality of final clusters, wherein during the merging, a first one and a second one of the initial clusters are merged in the event that the first one of the initial clusters and the second one of the initial clusters satisfy a second edit-distance constraint, the second edit-distance constraint being a constraint requiring similarity of topology between the first initial cluster and the second initial cluster; and
  
  storing in the computer readable storage medium, by the processor, a list of the documents associated with each of the final clusters.
  - 2. The method of claim 1 wherein the storing includes storing the list of the documents associated with each of the final clusters such that all documents associated with a same one of the final clusters are accessible by reference to any one of the documents associated with that one of the final clusters.
  - 3. The method of claim 1 wherein the components of the hash vector are orthogonal to each other.
  - 4. The method of claim 1 wherein the assigning includes, for a target one of the documents:
    - traversing a list of extant clusters stored in the computer-readable storage medium, wherein the list is initially empty;
      
      for each extant cluster in the list of extant clusters, comparing, by the processor, the hash vector for the target document with the hash vector of the root document of the extant cluster to determine whether the first edit distance constraint is satisfied;
      
      in the event that the first edit distance constraint is satisfied for a first one of the extant clusters, adding, by the processor, the target document as a child document to the first one of the extant clusters, wherein adding the target document includes storing an identifier of the target document in the computer-readable storage medium in association with the first one of the extant clusters; and
      
      in the event that first edit distance constraint is not satisfied for any one of the extant clusters, adding a new cluster to the list of extant clusters stored in the computer-readable storage medium, wherein the target document is the root document of the new cluster.
  - 5. The method of claim 4 wherein the first-edit distance constraint corresponds to an upper limit on the number of components of the hash vector that are different between the root document and the target document.
  - 6. The method of claim 5 wherein the merging includes:
    - for each of the initial clusters that has at least one child document, grouping the child documents into one or more maps, wherein all of the child documents that are grouped within a same map have hash vectors that differ from the root document of the cluster in the same one or more of the components;
      
      determining whether the number of maps for a first one of the initial clusters is at or below an upper bound;
      
      determining whether the number of maps for a second one of the initial clusters is at or below an upper bound; and
      
      in the event that the number of maps for the first initial cluster and the number of maps for the second initial cluster are both at or below the upper bound;
      
      determining whether the maps for the first initial cluster correspond to differences from the root document in the same one or more of the plurality of components of the hash vector as the maps for the second initial cluster; and
      
      merging the first initial cluster and the second initial cluster in the event that the maps for the first cluster correspond to differences in the same one or more of the plurality of components of the hash vector as the maps for the second cluster.
  - 7. The method of claim 6 wherein the upper limit on the number of components of the hash vector that are different is 1 and the upper bound on the number of maps is 2.
  - 8. The method of claim 5 wherein the second edit-distance constraint is defined such that merging does not increase a diameter of a cluster.
  - 9. The method of claim 8 further comprising selecting the second initial cluster, wherein the second initial cluster is selected from the initial clusters that have a diameter not greater than 2.
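
The map-based merge test recited in claim 6 can be sketched as follows. This is one possible reading, offered under stated assumptions: `build_maps`, `can_merge`, and `max_maps` are hypothetical names, "correspond to differences in the same components" is interpreted as the two clusters having identical sets of differing-component indices, and clusters with no children are treated as not mergeable by this test. Any condition beyond the claim text (for example, on the distance between the two roots) is not modeled.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

Vector = Tuple[int, ...]
Maps = Dict[FrozenSet[int], List[str]]


def build_maps(root: Vector, children: Dict[str, Vector]) -> Maps:
    """Group children by the set of component indices where they differ from the root."""
    maps: Dict[FrozenSet[int], List[str]] = defaultdict(list)
    for doc_id, vec in children.items():
        diff = frozenset(i for i, (r, c) in enumerate(zip(root, vec)) if r != c)
        maps[diff].append(doc_id)
    return dict(maps)


def can_merge(maps_a: Maps, maps_b: Maps, max_maps: int = 2) -> bool:
    """Apply the map-count bound, then compare which components the maps touch."""
    if not maps_a or not maps_b:
        return False  # assumption: a cluster without children has no maps to compare
    if len(maps_a) > max_maps or len(maps_b) > max_maps:
        return False
    return set(maps_a) == set(maps_b)


# Made-up vectors; limit of 1 differing component, bound of 2 maps (as in claim 7).
maps_a = build_maps((1, 2, 3, 4), {"b": (1, 2, 3, 9), "d": (1, 5, 3, 4)})
maps_b = build_maps((6, 2, 3, 4), {"f": (6, 2, 3, 8), "g": (6, 7, 3, 4)})
maps_c = build_maps((1, 2, 3, 4), {"h": (0, 2, 3, 4)})

mergeable_ab = can_merge(maps_a, maps_b)  # both clusters vary in components 1 and 3
mergeable_ac = can_merge(maps_a, maps_c)  # cluster c varies in component 0 instead
```

The point of this structure is that the decision uses only the map keys, so no child-to-child vector comparisons are needed at merge time.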

10. A system for analyzing documents, the system comprising:
- a document information data store configured to store a vector representation of each of a plurality of documents, the vector representation being based on frequency of occurrence within the document of words from a dictionary, wherein the vector representation has a dimension that is small compared to the number of words in the dictionary;
  
  a processor configured to form clusters of near-duplicate documents based on the vector representations in the document information data store and to store cluster information in the document information data store, the cluster information including a list of the documents associated with each of the clusters of near-duplicate documents, wherein the processor is further configured to form initial clusters of near-duplicate documents by applying a first edit-distance constraint to the vector representations of the documents and to form final clusters of near-duplicate documents from the initial clusters by merging some or all of the initial clusters by applying a second edit-distance constraint to the initial clusters.
  - 11. The system of claim 10 wherein the control logic is further configured such that the first edit-distance constraint is based on comparing components of vector representations of different ones of the plurality of documents and the second edit-distance constraint is based on comparing topological features of the initial clusters.
  - 12. The system of claim 10 wherein the processor is further configured to generate the vector representation of each of the plurality of documents and to store the vector representations in the document information data store.
  - 13. The system of claim 12 wherein the vector representation comprises a hash vector and wherein the control logic is further configured to generate each component of the hash vector by applying a hash function to a different subset of the words in the dictionary.
  - 14. The system of claim 13 wherein the dimension of the vector representation is less than 10.
  - 15. The system of claim 10 further comprising a user interface configured to allow a user to select a document from the corpus and to view a list of all other documents in the same final cluster as the selected document.

16. A computer-readable storage medium containing program instructions, which when executed by a processor cause the processor to execute a method of clustering documents based on similarity, the method comprising:
- for each document in a corpus of documents to be analyzed, accessing a document vector that includes a plurality of components, each of the components being based on word count information for the document;
  
  assigning each document to one of a plurality of initial clusters of documents, wherein each of the initial clusters contains a root document and at least some of the initial clusters further contain at least one child document, and wherein each of the child documents of any one of the initial clusters satisfies a first edit-distance constraint relative to the root document of that one of the initial clusters, the first edit-distance constraint being defined as a minimum degree of similarity between the document vectors of the root document and the child document;
  
  merging at least some of the initial clusters to form a plurality of final clusters, wherein during the merging, a first one and a second one of the initial clusters are merged in the event that the first one of the initial clusters satisfies a second edit-distance constraint relative to the second one of the initial clusters, the second edit-distance constraint being defined as a minimum degree of similarity between topologies of the first and second initial clusters; and
  
  storing, in a document information data store, a list of the documents associated with each of the final clusters.
  - 17. The computer-readable storage medium of claim 16 wherein accessing the document vector includes computing the document vector.
  - 18. The computer-readable storage medium of claim 17 wherein computing the document vector includes, for a target one of the documents:
    - determining a number of occurrences within the target document of each of a plurality of words from a dictionary; and
      
      for each of a plurality of subsets of the words from the dictionary, computing a hash function of a bit field representing the number of occurrences of each of the words in that subset.
  - 19. The computer-readable storage medium of claim 16 wherein accessing the document vector includes reading the document vector from a document information data store.
  - 20. The computer-readable storage medium of claim 16 wherein the first edit constraint is an upper limit on the number of components of the document vectors that are different between the root document and any of the child documents within a same one of the initial clusters.
  - 21. The computer-readable storage medium of claim 20 wherein the second edit-distance constraint is a constraint limiting a diameter of each of the final clusters.
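
The document-vector computation recited in claim 18 can be sketched as below. Everything specific here is an assumption made for illustration: the function name `hash_vector`, the regex tokenizer, the 8-bit saturating count per word, the example word subsets, and the use of the first four bytes of SHA-256 as the hash function. The claims do not specify any of these choices.

```python
import hashlib
import re
from collections import Counter
from typing import Sequence, Tuple


def hash_vector(
    text: str,
    subsets: Sequence[Sequence[str]],
    bits_per_word: int = 8,
) -> Tuple[int, ...]:
    """Return one hash component per dictionary subset, based on word counts."""
    counts = Counter(re.findall(r"[a-z0-9']+", text.lower()))
    cap = (1 << bits_per_word) - 1  # counts saturate at the field width
    components = []
    for subset in subsets:
        # Pack the (capped) count of each word in the subset into one bit field.
        field = 0
        for word in subset:
            field = (field << bits_per_word) | min(counts[word], cap)
        n_bytes = max(1, (len(subset) * bits_per_word + 7) // 8)
        digest = hashlib.sha256(field.to_bytes(n_bytes, "big")).digest()
        components.append(int.from_bytes(digest[:4], "big"))
    return tuple(components)


# Hypothetical dictionary subsets; a real dictionary would be far larger.
subsets = [["contract", "agreement"], ["party", "parties"], ["shall", "must"]]

v1 = hash_vector("The parties shall sign the contract. The contract is binding.", subsets)
v2 = hash_vector("The parties shall sign the contract! The contract is binding", subsets)
v3 = hash_vector("The parties must sign the contract. The contract is binding.", subsets)
```

Here `v1` and `v2` come from texts with identical word counts and so are equal, while `v3` swaps "shall" for "must" and therefore changes only the component for the third subset. That gives an edit distance of 1 under the component-counting definition used in claims 1 and 20.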

Current Assignee: Hewlett Packard Enterprise Development LP (Hewlett-Packard Enterprise Company)
Original Assignee: Stratify, Inc. (Open Text Corporation)
Inventors: Salaka, Vamsi; Goswami, Sauraj; Thomas, Joy
Granted Patent: US 9,355,171 B2
US Class Current: 707/738
CPC Class Codes:
G06F 16/35 Clustering; Classification
G06F 16/355 Class or cluster creation o...
