Method for organizing large numbers of documents

  • US 8,938,461 B2
  • Filed: 07/20/2010
  • Issued: 01/20/2015
  • Est. Priority Date: 07/02/2007
  • Status: Active Grant
First Claim

1. A computer implemented method for organizing documents into nodes, in which a node represents a group of near equivalent documents, said computer implemented method comprising:

  • (i) providing a plurality of original documents, each of the original documents comprising a header and a body text, and wherein said header comprises at least one header parameter and wherein said body text comprises text;

    (ii) selecting a document from among said plurality of original documents and associating the selected document with a node;

    (iii) comparing a fingerprint of said selected document to previously stored fingerprints of other documents from amongst said plurality of original documents, and in the case of a match between the fingerprints, merging the node associated with said selected document with a node associated with a matching document having a fingerprint matching the fingerprint of said selected document;

    (iv) searching in text order through said body text of said selected document to locate a first instance of header-type text within said selected document, wherein said header-type text contains at least one header parameter;

    (v) constructing a presumed document from a subset of the body text of said selected document, the constructed presumed document having (a) a header that includes one or more parameters from said header-type text located within said body text of said selected document, irrespective of whether the subject parameter of the header of said presumed document is the same as the subject parameter of the header of said selected document, and (b) body text that includes the text of said selected document located after said header-type text in said body of said selected document, and associating said presumed document with a node;

    (vi) comparing a fingerprint of said presumed document to the previously stored fingerprint of at least one other document from among said plurality of original documents and in the case of a match between the fingerprints, merging a node associated with said presumed document with a node associated with a matching document having a fingerprint matching the fingerprint of said presumed document; and

    (vii) if the comparing of (vi) does not result in a match, processing repeatedly a remainder of the body text of said selected document for successive instances of header-type text according to step (iv), and for each successive instance of the header-type text, constructing a corresponding presumed document according to step (v), and comparing for any matching documents to the corresponding presumed document according to step (vi), said processing of steps (iv)-(vi) is repeatedly performed until a match is found in step (vi) or until no new instances of header-type text are found in step (iv),

    wherein

    each fingerprint comprises a representation of a corresponding document, and

    the plurality of nodes are arranged in terms of more than one tree, each tree comprising at least one node from the plurality of nodes, each tree comprising at least a root node and at least a leaf node, a root node being a node that is not a descendant of any other node, and a leaf node being a node that has no descendent nodes, a node not being prohibited from being both a root node and a leaf node, all nodes that are descendant from the root node are contained by the tree, each node being associated with either;

    (1) one of the original documents and any matching document thereof or (2) one of the presumed documents and any matching document thereof.
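
The loop in steps (ii) through (vii) can be illustrated with a short Python sketch. It is not the patented implementation, and it rests on several assumptions that the claim does not specify: header-type text is taken to be a run of e-mail style lines such as "From:" or "Subject:", the fingerprint is a SHA-256 hash of the whitespace-normalized, lower-cased body text, and node merging uses a simple union-find structure. The names Document, Nodes, HEADER_BLOCK, fingerprint, parse_header and organize are hypothetical, and the arrangement of nodes into trees is not modelled.

```python
import hashlib
import re
from dataclasses import dataclass

# Assumption: "header-type text" is one or more consecutive e-mail style lines.
HEADER_BLOCK = re.compile(
    r"(?:^(?:From|To|Cc|Sent|Date|Subject):[^\n]*\n?)+", re.MULTILINE
)


@dataclass
class Document:
    header: dict  # header parameters, e.g. {"From": "alice"}
    body: str     # body text


def fingerprint(doc):
    # Assumption: the fingerprint represents only the normalized body text.
    normalized = " ".join(doc.body.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def parse_header(block):
    # Turn "Name: value" lines into a dict of header parameters.
    pairs = (line.split(":", 1) for line in block.splitlines())
    return {name.strip(): value.strip() for name, value in pairs}


class Nodes:
    """Union-find over integer node ids; merged nodes share one representative."""

    def __init__(self):
        self.parent = []

    def new(self):
        self.parent.append(len(self.parent))
        return len(self.parent) - 1

    def find(self, n):
        while self.parent[n] != n:
            self.parent[n] = self.parent[self.parent[n]]  # path halving
            n = self.parent[n]
        return n

    def merge(self, a, b):
        self.parent[self.find(a)] = self.find(b)


def organize(docs):
    """Return one representative node id per original document."""
    nodes = Nodes()
    stored = {}  # fingerprint -> node id of the first document seen with it

    def place(doc):
        # Associate the document with a node, then compare fingerprints.
        node = nodes.new()
        fp = fingerprint(doc)
        if fp in stored:
            nodes.merge(node, stored[fp])
            return node, True
        stored[fp] = node
        return node, False

    assigned = []
    for doc in docs:
        node, _ = place(doc)                   # steps (ii) and (iii)
        assigned.append(node)
        rest = doc.body
        while True:
            m = HEADER_BLOCK.search(rest)      # step (iv)
            if m is None:
                break                          # no further header-type text
            rest = rest[m.end():]              # text after the header-type text
            presumed = Document(parse_header(m.group(0)), rest)  # step (v)
            _, matched = place(presumed)       # step (vi)
            if matched:
                break                          # step (vii): stop on a match
    return [nodes.find(n) for n in assigned]
```

In this sketch, a reply that quotes an earlier message yields a presumed document whose fingerprint equals that of the earlier message, so the two end up in the same node, while the reply itself stays in a node of its own.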
