System for Document De-Duplication and Modification Detection
First Claim
1. A method for the organization and collection of documents, comprising:
- collecting a first document in response to a document collection request;
- generating a first hash code corresponding to a non-metadata portion of the first document;
- comparing the first hash code to a plurality of hash codes, each hash code of the plurality of hash codes corresponding to a non-metadata portion of a corresponding document of a plurality of documents, each document collected in response to the document collection request;
- if the first hash code does not match any hash code of the plurality of hash codes, storing the first document, including a metadata portion and the non-metadata portion, on a data storage; and
- if the first hash code matches any hash code of the plurality of hash codes, extracting metadata corresponding to the first document; and
- storing on the data storage the extracted metadata, but not the non-metadata portion of the first document, in conjunction with the particular document corresponding to the hash code that matches the first hash code.
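The steps recited in this claim can be illustrated with a minimal sketch. It is not the patented implementation: the claim does not name a hash function or a storage layout, so the use of SHA-256, the in-memory dictionary, and the names `StoredDocument`, `CollectionStore`, and `collect` are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class StoredDocument:
    """One stored copy of the content plus every metadata record that points at it."""
    content: bytes
    metadata: list = field(default_factory=list)


class CollectionStore:
    """Hypothetical store for one document collection request."""

    def __init__(self):
        # Maps content hash -> StoredDocument (assumed layout, not from the claim).
        self._by_hash = {}

    def collect(self, content: bytes, metadata: dict) -> bool:
        """Return True if the content was stored, False if only metadata was recorded."""
        # Hash only the non-metadata portion; SHA-256 is an assumed choice.
        digest = hashlib.sha256(content).hexdigest()
        existing = self._by_hash.get(digest)
        if existing is None:
            # No match: store the document, metadata and content together.
            self._by_hash[digest] = StoredDocument(content, [metadata])
            return True
        # Match: keep the extracted metadata with the already stored document
        # and discard the duplicate content.
        existing.metadata.append(metadata)
        return False

    def metadata_for(self, content: bytes) -> list:
        """Return all metadata records associated with the given content."""
        return self._by_hash[hashlib.sha256(content).hexdigest()].metadata


store = CollectionStore()
first = store.collect(b"quarterly report", {"path": "/a/report.doc", "custodian": "alice"})
second = store.collect(b"quarterly report", {"path": "/b/copy.doc", "custodian": "bob"})
```

In this sketch the second call stores no content; both metadata records end up attached to the single stored copy, which mirrors the "in conjunction with the particular document" language of the claim.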
Abstract
Provided is a system and method for the de-duplication and modification detection of documents collected during document production. The disclosed technology provides a simple, legally defensible, rapid and cost-efficient system for collecting responsive electronic document sets, identifying and eliminating unnecessary documents by comparing a collected document to previously collected documents and copying only information that has not been duplicated. The disclosed technology provides a method for copying the unduplicated information without transmitting or storing the duplicated portions. In addition, the claimed subject matter provides a system for detecting whether or not a document being submitted to a project archive is a modification of a previously submitted document. A document being submitted that represents a modification of a previously submitted document is prevented from being added to the project document archive.
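One way to read "copying the unduplicated information without transmitting or storing the duplicated portions" is a hash-first exchange: the collector computes the hash locally, asks the archive whether that hash is already known, and sends the content only when it is not. The sketch below assumes that reading. The abstract does not describe the exchange itself, so the `Archive` class, the `submit` function, the `bytes_received` counter, and the choice of SHA-256 are hypothetical.

```python
import hashlib


class Archive:
    """Hypothetical project archive that tracks how many content bytes it receives."""

    def __init__(self):
        self.contents = {}       # hash -> content bytes
        self.metadata = {}       # hash -> list of metadata records
        self.bytes_received = 0  # content bytes actually transmitted to the archive

    def has(self, digest: str) -> bool:
        return digest in self.contents

    def put_full(self, digest: str, content: bytes, metadata: dict) -> None:
        self.bytes_received += len(content)
        self.contents[digest] = content
        self.metadata.setdefault(digest, []).append(metadata)

    def put_metadata(self, digest: str, metadata: dict) -> None:
        self.metadata[digest].append(metadata)


def submit(archive: Archive, content: bytes, metadata: dict) -> str:
    """Send the hash first; send the content only if the archive lacks it."""
    digest = hashlib.sha256(content).hexdigest()  # assumed hash function
    if archive.has(digest):
        archive.put_metadata(digest, metadata)
        return "metadata-only"
    archive.put_full(digest, content, metadata)
    return "full"


archive = Archive()
body = b"contract draft"
r1 = submit(archive, body, {"source": "laptop-1"})
r2 = submit(archive, body, {"source": "file-server"})
```

Under these assumptions the second submission transfers no content bytes; only its metadata record reaches the archive.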
15 Claims
1. A method for the organization and collection of documents, comprising:
- collecting a first document in response to a document collection request;
- generating a first hash code corresponding to a non-metadata portion of the first document;
- comparing the first hash code to a plurality of hash codes, each hash code of the plurality of hash codes corresponding to a non-metadata portion of a corresponding document of a plurality of documents, each document collected in response to the document collection request;
- if the first hash code does not match any hash code of the plurality of hash codes, storing the first document, including a metadata portion and the non-metadata portion, on a data storage; and
- if the first hash code matches any hash code of the plurality of hash codes, extracting metadata corresponding to the first document; and
- storing on the data storage the extracted metadata, but not the non-metadata portion of the first document, in conjunction with the particular document corresponding to the hash code that matches the first hash code.
Dependent claims: 2, 3, 4, 5, 7, 12
6. A system for the organization and collection of documents, comprising:
- a processor;
- a memory coupled to the processor;
- logic, stored on the memory for execution on the processor, for collecting a first document in response to a document collection request;
- logic, stored on the memory for execution on the processor, for generating a first hash code corresponding to a non-metadata portion of the first document;
- logic, stored on the memory for execution on the processor, for comparing the first hash code to a plurality of hash codes, each hash code of the plurality of hash codes corresponding to a non-metadata portion of a corresponding document of a plurality of documents, each document collected in response to the document collection request;
- logic, stored on the memory for execution on the processor, for, if the first hash code does not match any hash code of the plurality of hash codes, storing the first document, including a metadata portion and the non-metadata portion, on a data storage; and
- logic, stored on the memory for execution on the processor, for, if the first hash code matches any hash code of the plurality of hash codes, extracting metadata corresponding to the first document; and
- storing on the data storage the extracted metadata, but not the non-metadata portion of the first document, in conjunction with the particular document corresponding to the hash code that matches the first hash code.
Dependent claims: 8, 9, 10
11. A computer programming product for the organization and collection of documents, comprising:
- a memory;
- logic, stored on the memory for execution on a processor, for collecting a first document in response to a document collection request;
- logic, stored on the memory for execution on the processor, for generating a first hash code corresponding to a non-metadata portion of the first document;
- logic, stored on the memory for execution on the processor, for comparing the first hash code to a plurality of hash codes, each hash code of the plurality of hash codes corresponding to a non-metadata portion of a corresponding document of a plurality of documents, each document collected in response to the document collection request;
- logic, stored on the memory for execution on the processor, for, if the first hash code does not match any hash code of the plurality of hash codes, storing the first document, including a metadata portion and the non-metadata portion, on a data storage; and
- logic, stored on the memory for execution on the processor, for, if the first hash code matches any hash code of the plurality of hash codes, extracting metadata corresponding to the first document; and
- storing on the data storage the extracted metadata, but not the non-metadata portion of the first document, in conjunction with the particular document corresponding to the hash code that matches the first hash code.
Dependent claims: 13, 14, 15
Specification