Composite locality sensitive hash based processing of documents
Abstract
Reliable identification of highly similar documents allows such documents to be treated as identical for purposes of document analysis. Identification of highly similar documents can be based on a composite hash value or other value for which the likelihood of two documents having the same value is high if and only if the documents have a high degree of similarity. Prior to performing content based analysis, the composite hash value for the current document is determined and compared to composite hash values of previously analyzed documents. If a match is found, the results of the analysis of the previous document can be applied to the current document. If no match is found, the current document is analyzed.
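The flow described in the abstract can be sketched as a lookup keyed by hash value. This is a minimal illustration, not the patented implementation: the names `composite_hash`, `analyze_corpus`, and the `analyze` callback are hypothetical, and the hash shown is only a stand-in (SHA-256 over whitespace- and case-normalized text). A true composite locality sensitive hash would also map near-duplicate documents to the same value, which this placeholder does not do.

```python
import hashlib
from typing import Callable, Dict, List


def composite_hash(text: str) -> str:
    """Placeholder hash (an assumption, NOT locality sensitive).

    It only collapses differences in case and whitespace; a real
    composite LSH would also collide on highly similar documents.
    """
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def analyze_corpus(
    corpus: Dict[str, str],
    analyze: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Analyze each distinct hash value once and reuse the result.

    corpus maps a document id to its text; analyze returns the items
    of analytic metadata for one document.
    """
    metadata_by_hash: Dict[str, List[str]] = {}  # results of earlier analyses
    associations: Dict[str, List[str]] = {}      # document id -> metadata

    for doc_id, text in corpus.items():
        key = composite_hash(text)
        if key not in metadata_by_hash:
            # No previously analyzed document has this hash: analyze now.
            metadata_by_hash[key] = analyze(text)
        # Either way, store the association for the current document.
        associations[doc_id] = metadata_by_hash[key]
    return associations
```

Under these assumptions, the potentially expensive `analyze` call runs once per distinct hash value rather than once per document.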
20 Claims
1. A method of analyzing documents belonging to a corpus, the method comprising:
computing a composite hash value for a current document from the corpus;
determining whether a previous document having the same composite hash value as the current document has been analyzed;
in the event that a previous document having the same composite hash value as the current document has not been analyzed, analyzing the current document, wherein analyzing the current document includes determining one or more items of analytic metadata to be associated with the current document;
in the event that a previous document having the same composite hash value as the current document has been analyzed, associating the current document with one or more items of analytic metadata determined from analyzing the previous document; and
storing a representation of the association between the one or more items of analytic metadata and the current document.
Dependent claims: 2, 3, 4, 5, 6, 7

8. A computer-readable storage medium containing program instructions that, when executed by a computer system, cause the computer system to execute a method of analyzing a plurality of documents in a corpus of documents, the method comprising:
selecting a current document from the corpus;
computing a composite hash value for the current document;
determining whether a previous document having the same composite hash value as the current document has been analyzed;
in the event that a previous document having the same composite hash value as the current document has not been analyzed, analyzing the current document, wherein analyzing the current document includes determining one or more items of analytic metadata to be associated with the current document;
in the event that a previous document having the same composite hash value as the current document has been analyzed, associating the current document with one or more items of analytic metadata determined from analyzing the previous document;
storing a representation of the association between the one or more items of analytic metadata and the current document; and
repeating the acts of selecting, computing, determining, either analyzing or associating, and storing until all of the plurality of documents have been analyzed.
Dependent claims: 9, 10, 11, 12

13. A computer system comprising:
a storage subsystem configured to maintain a document information data store; and
a processor coupled to the storage subsystem, the processor being configured to perform one or more content-based analysis operations on documents in a corpus of documents and to store results of the analysis in the document information data store, wherein the processor is further configured to:
compute a composite hash value for each of the documents;
assign each document to one of a plurality of groups, wherein all documents in a same group have a same composite hash value;
perform the one or more content-based analysis operations on one document from each of the plurality of groups; and
store a result of the one or more content-based analysis operations in the document information data store in association with each of the documents in the group.
Dependent claims: 14, 15, 16, 17, 18, 19, 20
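Claim 13 describes a batch variant: group first, then analyze one document per group. The sketch below illustrates that ordering under stated assumptions. The function `analyze_by_group`, the caller-supplied `hash_fn`, and the in-memory dictionary standing in for the document information data store are all hypothetical; the claim does not specify which document in a group is analyzed, so taking the first one is an arbitrary choice made here.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List


def analyze_by_group(
    corpus: Dict[str, str],
    hash_fn: Callable[[str], Hashable],
    analyze: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Group documents by hash value, then analyze one per group.

    hash_fn is an assumed stand-in for the composite hash computation.
    The returned dict plays the role of the document information data
    store, mapping each document id to the result for its group.
    """
    # Step 1: assign each document to the group for its hash value.
    groups: Dict[Hashable, List[str]] = defaultdict(list)
    for doc_id, text in corpus.items():
        groups[hash_fn(text)].append(doc_id)

    # Step 2: analyze one representative and store the result for
    # every document in the same group.
    store: Dict[str, List[str]] = {}
    for doc_ids in groups.values():
        representative = doc_ids[0]  # arbitrary choice (assumption)
        result = analyze(corpus[representative])
        for doc_id in doc_ids:
            store[doc_id] = result
    return store
```

Compared with checking each document against previously analyzed ones as it arrives, this form needs the whole corpus hashed up front, but it makes the group membership explicit before any analysis is run.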
Specification