CLUSTERING OF NEAR-DUPLICATE DOCUMENTS
First Claim
1. In a computer system having a processor and a computer-readable storage medium, a method for grouping near-duplicate documents, the method comprising:
- for each document in a corpus of documents to be analyzed, computing, by the processor, a hash vector based on word count information for the document, the hash vector including a plurality of components;
assigning, by the processor, each document to one of a plurality of initial clusters of documents, wherein each of the initial clusters contains a root document and at least some of the initial clusters further contain at least one child document, and wherein each of the child documents of any one of the initial clusters satisfies a first edit-distance constraint relative to the root document of that one of the initial clusters, the first edit-distance constraint being defined as an upper limit on a number of components of the hash vectors that are different between the root document and the child document;
merging, by the processor, the initial clusters to form a plurality of final clusters, wherein during the merging, a first one and a second one of the initial clusters are merged in the event that the first one of the initial clusters and the second one of the initial clusters satisfy a second edit-distance constraint, the second edit-distance constraint being a constraint requiring similarity of topology between the first initial cluster and the second initial cluster; and
storing in the computer readable storage medium, by the processor, a list of the documents associated with each of the final clusters.
4 Assignments
0 Petitions
Accused Products
Abstract
Documents likely to be near-duplicates are clustered based on document vectors that represent word-occurrence patterns in a relatively low-dimensional space. Edit distance between documents is defined based on comparing their document vectors. In one process, initial clusters are formed by applying a first edit-distance constraint relative to a root document of each cluster. The initial clusters can be merged subject to a second edit-distance constraint that limits the maximum edit distance between any two documents in the cluster. The second edit-distance constraint can be defined such that whether it is satisfied can be determined by comparing cluster structures rather than individual documents.
-
Citations
21 Claims
-
1. In a computer system having a processor and a computer-readable storage medium, a method for grouping near-duplicate documents, the method comprising:
-
for each document in a corpus of documents to be analyzed, computing, by the processor, a hash vector based on word count information for the document, the hash vector including a plurality of components; assigning, by the processor, each document to one of a plurality of initial clusters of documents, wherein each of the initial clusters contains a root document and at least some of the initial clusters further contain at least one child document, and wherein each of the child documents of any one of the initial clusters satisfies a first edit-distance constraint relative to the root document of that one of the initial clusters, the first edit-distance constraint being defined as an upper limit on a number of components of the hash vectors that are different between the root document and the child document; merging, by the processor, the initial clusters to form a plurality of final clusters, wherein during the merging, a first one and a second one of the initial clusters are merged in the event that the first one of the initial clusters and the second one of the initial clusters satisfy a second edit-distance constraint, the second edit-distance constraint being a constraint requiring similarity of topology between the first initial cluster and the second initial cluster; and storing in the computer readable storage medium, by the processor, a list of the documents associated with each of the final clusters. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9)
-
-
10. A system for analyzing documents, the system comprising:
-
a document information data store configured to store a vector representation of each of a plurality of documents, the vector representation being based on frequency of occurrence within the document of words from a dictionary, wherein the vector representation has a dimension that is small compared to the number of words in the dictionary; a processor configured to form clusters of near-duplicate documents based on the vector representations in the document information data store and to store cluster information in the document information data store, the cluster information including a list of the documents associated with each of the clusters of near-duplicate documents, wherein the processor is further configured to form initial clusters of near-duplicate documents by applying a first edit-distance constraint to the vector representations of the documents and to form final clusters of near-duplicate documents from the initial clusters by merging some or all of the initial clusters by applying a second edit-distance constraint to the initial clusters. - View Dependent Claims (11, 12, 13, 14, 15)
-
-
16. A computer-readable storage medium containing program instructions, which when executed by a processor cause the processor to execute a method of clustering documents based on similarity, the method comprising:
-
for each document in a corpus of documents to be analyzed, accessing a document vector that includes a plurality of components, each of the components being based on word count information for the document, assigning each document to one of a plurality of initial clusters of documents, wherein each of the initial clusters contains a root document and at least some of the initial clusters further contain at least one child document, and wherein each of the child documents of any one of the initial clusters satisfies a first edit-distance constraint relative to the root document of that one of the initial clusters, the first edit-distance constraint being defined as a minimum degree of similarity between the document vectors of the root document and the child document; merging at least some of the initial clusters to form a plurality of final clusters, wherein during the merging, a first one and a second one of the initial clusters are merged in the event that the first one of the initial clusters satisfies a second edit-distance constraint relative to the second one of the initial clusters, the second edit-distance constraint being defined as a minimum degree of similarity between topologies of the first and second initial clusters; and storing, in a document information data store, a list of the documents associated with each of the final clusters. - View Dependent Claims (17, 18, 19, 20, 21)
-
Specification