Method for organizing large numbers of documents
Abstract
A computer product including a data structure for organizing a plurality of documents, capable of being utilized by a processor for manipulating data of the data structure and capable of displaying selected data on a display unit. The data structure includes a plurality of directionally interlinked nodes, each node being associated with one or more documents having a header and body text. All documents associated with a given node have identical normalized body text, and all documents that have identical normalized body text are associated with the same node. One or more of the nodes is associated with more than one document. For any node that is a descendant of another node, the normalized body text of each document associated with the node is inclusive of the normalized body text of a document that is associated with the other node.
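The node invariants in the abstract can be illustrated with a minimal Python sketch. Everything named here is hypothetical: the `Document` and `Node` classes, the `normalize` rule (lowercasing and collapsing whitespace), and the reading of "inclusive of" as substring containment are assumptions made for illustration, since the abstract does not define them.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Document:
    header: dict  # header parameters, e.g. {"from": ..., "subject": ...}
    body: str     # body text


@dataclass
class Node:
    normalized_body: str
    documents: list = field(default_factory=list)
    parent: Optional["Node"] = None


def normalize(body: str) -> str:
    # Assumed normalization: collapse whitespace runs and lowercase.
    return re.sub(r"\s+", " ", body).strip().lower()


def group_into_nodes(docs):
    # One node per distinct normalized body text, so documents with
    # identical normalized text always land on the same node.
    by_text = {}
    for doc in docs:
        key = normalize(doc.body)
        node = by_text.setdefault(key, Node(key))
        node.documents.append(doc)
    return list(by_text.values())


def may_descend_from(child: Node, ancestor: Node) -> bool:
    # Assumed reading of "inclusive of": the ancestor's normalized text
    # appears inside the descendant's normalized text.
    return ancestor.normalized_body in child.normalized_body
```

Under these assumptions, two copies of a message that differ only in spacing or letter case share one node, and a reply that quotes the message in full qualifies as a descendant of that node.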
41 Citations
23 Claims
1. A computer implemented method for organizing documents into nodes, in which a node represents a group of near equivalent documents, said computer implemented method comprising:
(i) providing a plurality of original documents, each of the original documents comprising a header and a body text, and wherein said header comprises at least one header parameter and wherein said body text comprises text;
(ii) selecting a document from among said plurality of original documents and associating the selected document with a node;
(iii) comparing a fingerprint of said selected document to previously stored fingerprints of other documents from amongst said plurality of original documents, and in the case of a match between the fingerprints, merging the node associated with said selected document with a node associated with a matching document having a fingerprint matching the fingerprint of said selected document;
(iv) searching in text order through said body text of said selected document to locate a first instance of header-type text within said selected document, wherein said header-type text contains at least one header parameter;
(v) constructing a presumed document from a subset of the body text of said selected document, the constructed presumed document having (a) a header that includes one or more parameters from said header-type text located within said body text of said selected document, irrespective of whether the subject parameter of the header of said presumed document is the same as the subject parameter of the header of said selected document, and (b) body text that includes the text of said selected document located after said header-type text in said body of said selected document, and associating said presumed document with a node;
(vi) comparing a fingerprint of said presumed document to the previously stored fingerprint of at least one other document from among said plurality of original documents and in the case of a match between the fingerprints, merging a node associated with said presumed document with a node associated with a matching document having a fingerprint matching the fingerprint of said presumed document; and
(vii) if the comparing of (vi) does not result in a match, processing repeatedly a remainder of the body text of said selected document for successive instances of header-type text according to step (iv), and for each successive instance of the header-type text, constructing a corresponding presumed document according to step (v), and comparing for any matching documents to the corresponding presumed document according to step (vi), said processing of steps (iv)-(vi) is repeatedly performed until a match is found in step (vi) or until no new instances of header-type text are found in step (iv),
wherein each fingerprint comprises a representation of a corresponding document, and the plurality of nodes are arranged in terms of more than one tree, each tree comprising at least one node from the plurality of nodes, each tree comprising at least a root node and at least a leaf node, a root node being a node that is not a descendant of any other node, and a leaf node being a node that has no descendent nodes, a node not being prohibited from being both a root node and a leaf node, all nodes that are descendant from the root node are contained by the tree, each node being associated with either;
(1) one of the original documents and any matching document thereof or (2) one of the presumed documents and any matching document thereof.
Dependent claims: 2-21
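Steps (ii) through (vii) of claim 1 can be sketched in Python as follows. This is a sketch under stated assumptions, not the patented implementation: the claim only requires a fingerprint to be "a representation" of a document, so the SHA-256 hash of whitespace-normalized body text used here is an assumption, as are the `HEADER_BLOCK` pattern for recognizing header-type text and the helper names `fingerprint`, `parse_header`, and `organize`. The sketch also matches a presumed document against any stored fingerprint, including those of earlier presumed documents, which is slightly broader than step (vi).

```python
import hashlib
import re

# Assumed form of header-type text: a run of consecutive lines such as
# "From: ...", "Sent: ...", "Subject: ..." embedded in a message body.
HEADER_BLOCK = re.compile(
    r"(?:^(?:From|To|Sent|Date|Subject):[^\n]*\n)+", re.MULTILINE
)


def fingerprint(body: str) -> str:
    # Assumed fingerprint: hash of lowercased, whitespace-collapsed body text.
    normalized = " ".join(body.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def parse_header(block: str) -> dict:
    # Turn "Name: value" lines into header parameters.
    params = {}
    for line in block.strip().splitlines():
        name, _, value = line.partition(":")
        params[name.strip().lower()] = value.strip()
    return params


def organize(documents):
    """documents: iterable of (header dict, body str) pairs.

    Returns (nodes, parent): nodes maps a fingerprint to the documents
    merged into that node; parent maps a node to the node it descends from.
    """
    nodes, parent = {}, {}
    for header, body in documents:
        fp = fingerprint(body)
        # Steps (ii)-(iii): equal fingerprints share (merge into) one node.
        nodes.setdefault(fp, []).append((header, body))
        child, pos = fp, 0
        while True:
            match = HEADER_BLOCK.search(body, pos)  # step (iv)
            if match is None:
                break  # no further header-type text
            # Step (v): presumed document = embedded header + text after it.
            presumed_header = parse_header(match.group(0))
            presumed_body = body[match.end():]
            pfp = fingerprint(presumed_body)
            found = pfp in nodes  # step (vi)
            nodes.setdefault(pfp, []).append((presumed_header, presumed_body))
            parent.setdefault(child, pfp)
            if found:
                break  # step (vii): stop at the first match
            child, pos = pfp, match.end()
    return nodes, parent
```

Because nodes are keyed by fingerprint, merging is implicit: a reply that quotes an earlier message yields a presumed document whose fingerprint equals that of the original, so both end up on the same node regardless of processing order.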
22. A non-transitory computer program product comprising a storage storing computer code for performing a method for organizing documents into nodes, in which a node represents a group of near equivalent documents, the computer program product comprising:
(i) a computer code portion for providing a plurality of original documents, each of the original documents comprising a header and a body text, and wherein said header comprises at least one header parameter and wherein said body text comprises text;
(ii) a computer code portion for selecting a document from among said plurality of original documents and associating the selected document with a node;
(iii) a computer code portion for comparing a fingerprint of said selected document to previously stored fingerprints of other documents from amongst said plurality of original documents, and in the case of a match between the fingerprints, merging the node associated with said selected document with a node associated with a matching document having a fingerprint matching the fingerprint of said selected document;
(iv) a computer code portion for searching in text order through said body text of said selected document to locate a first instance of header-type text within said selected document, wherein said header-type text contains at least one header parameter;
(v) a computer code portion for constructing a presumed document from a subset of the body text of said selected document, the constructed presumed document having (a) a header that includes one or more parameters from said header-type text located within said body text of said selected document, irrespective of whether the subject parameter of the header of said presumed document is the same as the subject parameter of the header of said selected document, and (b) body text that includes the text of said selected document located after said header-type text in said body text of said selected document, and associating said presumed document with a node;
(vi) a computer code portion for comparing a fingerprint of said presumed document to the previously stored fingerprint of at least one other document from among said plurality of original documents and in the case of a match between the fingerprints, merging a node associated with said presumed document with a node associated with a matching document having a fingerprint matching the fingerprint of said presumed document; and
(vii) a computer code portion for determining if the comparing of (vi) does not result in a match, processing repeatedly a remainder of the body text of said selected document for successive instances of header-type text according to step (iv), and for each successive instance of the header-type text, constructing a corresponding presumed document according to step (v), and comparing for any matching documents to the corresponding presumed document according to step (vi), said processing of steps (iv)-(vi) is repeatedly performed until a match is found in step (vi) or until no new instances of header-type text are found in step (iv),
wherein each fingerprint comprises a representation of a corresponding document, and the plurality of nodes are arranged in terms of more than one tree, each tree comprising at least one node from the plurality of nodes, each tree comprising at least a root node and at least a leaf node, a root node being a node that is not a descendant of any other node, and a leaf node being a node that has no descendent nodes, a node not being prohibited from being both a root node and a leaf node, all nodes that are descendant from the root node are contained by the tree, each node being associated with either;
(1) one of the original documents and any matching document thereof or (2) one of the presumed documents and any matching document thereof.
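The tree definitions in the closing wherein clause can be checked with a short Python sketch. It assumes, purely for illustration, that the nodes are given as a list of identifiers plus an acyclic `parent` mapping from each non-root node to the node it descends from; `roots_and_leaves` and `tree_of` are hypothetical helper names.

```python
def roots_and_leaves(node_ids, parent):
    """parent maps a child id to its parent id; it is assumed acyclic."""
    has_child = set(parent.values())
    # A root is not a descendant of any other node.
    roots = {n for n in node_ids if n not in parent}
    # A leaf has no descendant nodes.
    leaves = {n for n in node_ids if n not in has_child}
    return roots, leaves


def tree_of(root, node_ids, parent):
    """Return the root together with every node that descends from it."""
    members = set()
    for n in node_ids:
        cur = n
        while cur in parent:  # walk up the ancestor chain to its root
            cur = parent[cur]
        if cur == root:
            members.add(n)
    return members
```

With `node_ids = ["a", "b", "c", "d"]` and `parent = {"b": "a", "c": "b"}`, there are two trees: one rooted at `a` that contains `b` and `c`, and one consisting of `d` alone, which is both a root and a leaf, as the clause permits.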
23. A system for organizing documents into nodes, in which a node represents a group of substantially near equivalent documents, the system comprising:
a memory; and
a processor configured to perform at least the following;
(i) providing a plurality of original documents, each of the original documents comprising a header and a body text, and wherein said header comprises at least one header parameter and wherein said body text comprises text;
(ii) selecting a document from among said plurality of original documents and associating the selected document with a node;
(iii) comparing a fingerprint of said selected document to previously stored fingerprints of other documents from amongst said plurality of original documents, and in the case of a match between the fingerprints, merging the node associated with said selected document with a node associated with a matching document having a fingerprint matching the fingerprint of said selected document;
(iv) searching in text order through said body text of said selected document to locate a first instance of header-type text within said selected document, wherein said header-type text contains at least one header parameter;
(v) constructing a presumed document from a subset of the body text of said selected document, the constructed presumed document having (a) a header that includes one or more parameters from said header-type text located within said body text of said selected document, irrespective of whether the subject parameter of the header of said presumed document is the same as the subject parameter of the header of said selected document, and (b) body text that includes the text of said selected document located after said header-type text in said body text of said selected document, and associating said presumed document with a node;
(vi) comparing a fingerprint of said presumed document to the previously stored fingerprint of at least one other document from among said plurality of original documents and in the case of a match between the fingerprints, merging a node associated with said presumed document with a node associated with a matching document having a fingerprint matching the fingerprint of said presumed document; and
(vii) determining if the comparing of (vi) does not result in a match, processing repeatedly a remainder of the body text of said selected document for successive instances of header-type text according to step (iv), and for each successive instance of the header-type text, constructing a corresponding presumed document according to step (v), and comparing for any matching documents to the corresponding presumed document according to step (vi), said processing of steps (iv)-(vi) is repeatedly performed until a match is found in step (vi) or until no new instances of header-type text are found in step (iv),
wherein each fingerprint comprises a representation of a corresponding document, and the plurality of nodes are arranged in terms of more than one tree, each tree comprising at least one node from the plurality of nodes, each tree comprising at least a root node and at least a leaf node, a root node being a node that is not a descendant of any other node, and a leaf node being a node that has no descendent nodes, a node not being prohibited from being both a root node and a leaf node, all nodes that are descendant from the root node are contained by the tree, each node being associated with either;
(1) one of the original documents and any matching document thereof or (2) one of the presumed documents and any matching document thereof.
Specification