Method for organizing large numbers of documents

US 8,938,461 B2
Filed: 07/20/2010
Issued: 01/20/2015
Est. Priority Date: 07/02/2007
Status: Active Grant

Abstract

A computer product includes a data structure for organizing a plurality of documents. The data structure can be utilized by a processor for manipulating its data and for displaying selected data on a display unit. The data structure includes a plurality of directionally interlinked nodes, each node being associated with one or more documents having a header and body text. All documents associated with a given node have identical normalized body text, and all documents that have identical normalized body text are associated with the same node. One or more of the nodes is associated with more than one document. For any node that is a descendant of another node, the normalized body text of each document associated with the node is inclusive of the normalized body text of a document that is associated with the other node.
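
The node properties stated in the abstract can be pictured with a small Python sketch. Everything in it is illustrative rather than taken from the patent: the `Document` and `Node` classes, the `normalize` function (lowercasing and whitespace collapsing), and the reading of "inclusive of" as a substring test are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Document:
    header: dict   # header parameters, e.g. {"From": ..., "Subject": ...}
    body: str


@dataclass
class Node:
    normalized_body: str                      # shared by every document on the node
    documents: list = field(default_factory=list)
    parent: Optional["Node"] = None           # directional link to another node


def normalize(body: str) -> str:
    # Hypothetical normalization: lowercase and collapse whitespace.
    return " ".join(body.lower().split())


def check_invariants(nodes: list) -> bool:
    seen = set()
    for node in nodes:
        # Documents with identical normalized body text must share one node.
        if node.normalized_body in seen:
            return False
        seen.add(node.normalized_body)
        # Every document on a node has that node's normalized body text.
        if any(normalize(d.body) != node.normalized_body for d in node.documents):
            return False
        # A descendant's text is inclusive of its parent's text
        # (modeled here, as an assumption, by substring containment).
        if node.parent and node.parent.normalized_body not in node.normalized_body:
            return False
    return True
```

Checking only the direct parent is enough in this model, because substring containment carries over to every further ancestor.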

41 Citations

23 Claims

1. A computer implemented method for organizing documents into nodes, in which a node represents a group of near equivalent documents, said computer implemented method comprising:
- (i) providing a plurality of original documents, each of the original documents comprising a header and a body text, and wherein said header comprises at least one header parameter and wherein said body text comprises text;
  
  (ii) selecting a document from among said plurality of original documents and associating the selected document with a node;
  
  (iii) comparing a fingerprint of said selected document to previously stored fingerprints of other documents from amongst said plurality of original documents, and in the case of a match between the fingerprints, merging the node associated with said selected document with a node associated with a matching document having a fingerprint matching the fingerprint of said selected document;
  
  (iv) searching in text order through said body text of said selected document to locate a first instance of header-type text within said selected document, wherein said header-type text contains at least one header parameter;
  
  (v) constructing a presumed document from a subset of the body text of said selected document, the constructed presumed document having (a) a header that includes one or more parameters from said header-type text located within said body text of said selected document, irrespective of whether the subject parameter of the header of said presumed document is the same as the subject parameter of the header of said selected document, and (b) body text that includes the text of said selected document located after said header-type text in said body of said selected document, and associating said presumed document with a node;
  
  (vi) comparing a fingerprint of said presumed document to the previously stored fingerprint of at least one other document from among said plurality of original documents and in the case of a match between the fingerprints, merging a node associated with said presumed document with a node associated with a matching document having a fingerprint matching the fingerprint of said presumed document; and
  
  (vii) if the comparing of (vi) does not result in a match, processing repeatedly a remainder of the body text of said selected document for successive instances of header-type text according to step (iv), and for each successive instance of the header-type text, constructing a corresponding presumed document according to step (v), and comparing for any matching documents to the corresponding presumed document according to step (vi), said processing of steps (iv)-(vi) is repeatedly performed until a match is found in step (vi) or until no new instances of header-type text are found in step (iv),
  
  wherein
  
  each fingerprint comprises a representation of a corresponding document, and
  
  the plurality of nodes are arranged in terms of more than one tree, each tree comprising at least one node from the plurality of nodes, each tree comprising at least a root node and at least a leaf node, a root node being a node that is not a descendant of any other node, and a leaf node being a node that has no descendent nodes, a node not being prohibited from being both a root node and a leaf node, all nodes that are descendant from the root node are contained by the tree, each node being associated with either:
  
  (1) one of the original documents and any matching document thereof or (2) one of the presumed documents and any matching document thereof.
- Dependent claims (2-21):
  - 2. The computer implemented method of claim 1, further comprising:
    - (viii) storing the document fingerprint for future comparison with other document fingerprints.
  - 3. The computer implemented method of claim 2, further comprising:
    - displaying on a display unit symbols indicative of said nodes, and
      
      affiliating for each node a body text and subject parameter of at least one document associated with the node.
  - 4. The computer implemented method of claim 3, further comprising:
    - affiliating each node with a plurality of header parameters from each document associated with the node, said plurality of header parameters being arranged in a table.
  - 5. The computer implemented method of claim 4, wherein said documents are emails and wherein said plurality of header parameters comprises two or more fields from an email header selected from the group of fields consisting of:
    - “To”, “From”, “Subject”, and “Date”.
  - 6. The computer implemented method of claim 2, further comprising:
    - displaying the nodes; and
      
      suppressing nodes associated with a presumed document from the display.
  - 7. The computer implemented method of claim 6, further comprising:
    - affiliating each displayed node with header parameters of each document associated with said displayed node; and
      
      affiliating header parameters of documents associated with suppressed nodes with a node associated with a document from which the presumed document associated with said suppressed node is constructed.
  - 8. The computer implemented method of claim 2, wherein (ii) further comprises comparing for near-duplication at least a portion of the body text of said selected document to at least a portion of the body texts of other documents from amongst said plurality of documents.
  - 9. The computer implemented method of claim 8, further comprising:
    - creating an association between nodes that are associated with documents found to near-duplicate to each other.
  - 10. The computer implemented method of claim 9, further comprising:
    - enabling a user to define a degree of similarity between documents for documents to be considered near-duplicating.
  - 11. The computer implemented method of claim 8, further comprising:
    - associating documents with document sets by associating to a document set:
      
      a first document, and documents that are associated with a node that is linked to the node associated with said first document already associated with said document set, and documents that near-duplicate to a document already associated with said document set.
  - 12. The computer implemented method of claim 2, further comprising:
    - creating an association between nodes that are associated with documents having related Conversation ID or related Message ID indicators.
  - 13. The computer implemented method of claim 2 further comprising:
    - removing at least one member of the group consisting of:
      
      disclaimers, signatures, program added text and attachment notifications from the body text of documents, and replacing unique text of each removed member with a unique short text identifier prior to said comparing in (ii), wherein said fingerprint of (iii) is a fingerprint of said body text after said replacing.
  - 14. The computer implemented method of claim 2, further comprising:
    - displaying documents in a data structure able to be sorted according to one or more members of the group consisting of:
      
      document identifier, document set, node address, inclusive flag, first copy of an inclusive flag.
  - 15. The computer implemented method of claim 2, further comprising:
    - (ix) if a document is found to be a duplicate of a prior document, suppressing step (viii).
  - 16. The computer implemented method of claim 15, further comprising:
    - (x) forming a subset of a large number of documents by including each document into the subset except for documents that duplicate to another document already in the subset, and except for documents that duplicate to a presumed document whereby only a single copy of inclusive documents are in the subset.
  - 17. The computer implemented method of claim 16, further comprising:
    - affiliating each document in said subset with other documents that duplicate to said document and with documents that duplicate to a presumed document derived from said document.
  - 18. The computer implemented method of claim 1, wherein (ii) is applied to selected documents from amongst said plurality of original documents.
  - 19. The computer implemented method of claim 1, wherein
    - said (ii) further includes creating a fingerprint for each of at least a portion of a normalized body text and a normalized subject parameter of said selected document,
      
      said normalized body text and said normalized subject parameter are processed from said body text and a subject parameter of said header parameters,
      
      said (iii) includes comparing the created fingerprint of said selected document to previously stored created fingerprints of other documents from amongst said plurality of documents, and in the case of a match, merging the node associated with said selected document with a node associated with the matching document, wherein said comparison for detecting and indicating duplicating documents; and
      
      said (vi) includes creating a fingerprint for each of at least a portion of the normalized body text and the normalized subject parameter of said presumed document and comparing the created fingerprint of said presumed document to a previously stored created fingerprint of at least one other document from among said plurality of documents and in the case of a match, merging a node associated with said presumed document with a node associated with the matching document.
  - 20. The computer implemented method for organizing documents of claim 19, further comprising a step of linking nodes, in which linking implies that the normalized body text of a document on a first side of said link is inclusive of the normalized body text of a document on a second side of said link, and wherein (vi) further comprises linking the associated node to be a parent of the node stipulated in (iii); and wherein (vii) comprises linking the associated node to be a parent of the associated node of the most recent iteration of (vi).
  - 21. The computer implemented method of claim 1, wherein said (vii) includes:
    - for each of said instances, constructing said corresponding presumed document irrespective of whether the subject parameter of the header of said corresponding presumed document is the same as the subject parameter of the header of said original selected document or of previous constructed presumed document.
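
Claims 13 and 19 describe normalizing the body text and subject parameter before fingerprinting. A minimal Python sketch of that idea follows. The boilerplate patterns, the placeholder tokens, the reply-prefix stripping, and the use of a single SHA-256 digest over both fields are assumptions chosen for illustration; the claims do not specify any of them.

```python
import hashlib
import re

# Hypothetical boilerplate patterns, each paired with a short text identifier.
BOILERPLATE = [
    ("[DISCLAIMER]", re.compile(r"this e-?mail is confidential[^\n]*", re.IGNORECASE)),
    ("[ATTACHMENT]", re.compile(r"<<[^>\n]+>>")),
]

# Assumed subject normalization: drop leading reply/forward prefixes.
REPLY_PREFIX = re.compile(r"^((re|fw|fwd):\s*)+", re.IGNORECASE)


def normalize(text: str) -> str:
    # Replace each boilerplate span with its identifier, then
    # lowercase and collapse whitespace.
    for token, pattern in BOILERPLATE:
        text = pattern.sub(token, text)
    return " ".join(text.lower().split())


def fingerprint(subject: str, body: str) -> str:
    # One digest over normalized subject and normalized body (an assumption;
    # separate fingerprints per field would also fit the claim language).
    norm_subject = normalize(REPLY_PREFIX.sub("", subject.strip()))
    payload = norm_subject + "\n" + normalize(body)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Under these assumptions, two messages that differ only in letter case, spacing, reply prefixes, attachment names, or disclaimer wording produce the same fingerprint, so they would land on the same node.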

22. A non-transitory computer program product comprising a storage storing computer code for performing a method for organizing documents into nodes, in which a node represents a group of near equivalent documents, the computer program product comprising:
- (i) a computer code portion for providing a plurality of original documents, each of the original documents comprising a header and a body text, and wherein said header comprises at least one header parameter and wherein said body text comprises text;
  
  (ii) a computer code portion for selecting a document from among said plurality of original documents and associating the selected document with a node;
  
  (iii) a computer code portion for comparing a fingerprint of said selected document to previously stored fingerprints of other documents from amongst said plurality of original documents, and in the case of a match between the fingerprints, merging the node associated with said selected document with a node associated with a matching document having a fingerprint matching the fingerprint of said selected document;
  
  (iv) a computer code portion for searching in text order through said body text of said selected document to locate a first instance of header-type text within said selected document, wherein said header-type text contains at least one header parameter;
  
  (v) a computer code portion for constructing a presumed document from a subset of the body text of said selected document, the constructed presumed document having (a) a header that includes one or more parameters from said header-type text located within said body text of said selected document, irrespective of whether the subject parameter of the header of said presumed document is the same as the subject parameter of the header of said selected document, and (b) body text that includes the text of said selected document located after said header-type text in said body text of said selected document, and associating said presumed document with a node;
  
  (vi) a computer code portion for comparing a fingerprint of said presumed document to the previously stored fingerprint of at least one other document from among said plurality of original documents and in the case of a match between the fingerprints, merging a node associated with said presumed document with a node associated with a matching document having a fingerprint matching the fingerprint of said presumed document; and
  
  (vii) a computer code portion for determining if the comparing of (vi) does not result in a match, processing repeatedly a remainder of the body text of said selected document for successive instances of header-type text according to step (iv), and for each successive instance of the header-type text, constructing a corresponding presumed document according to step (v), and comparing for any matching documents to the corresponding presumed document according to step (vi), said processing of steps (iv)-(vi) is repeatedly performed until a match is found in step (vi) or until no new instances of header-type text are found in step (iv),
  
  wherein
  
  each fingerprint comprises a representation of a corresponding document, and
  
  the plurality of nodes are arranged in terms of more than one tree, each tree comprising at least one node from the plurality of nodes, each tree comprising at least a root node and at least a leaf node, a root node being a node that is not a descendant of any other node, and a leaf node being a node that has no descendent nodes, a node not being prohibited from being both a root node and a leaf node, all nodes that are descendant from the root node are contained by the tree, each node being associated with either:
  
  (1) one of the original documents and any matching document thereof or (2) one of the presumed documents and any matching document thereof.

23. A system for organizing documents into nodes, in which a node represents a group of substantially near equivalent documents, the system comprising:
- a memory; and
  
  a processor configured to perform at least the following:
  
  (i) providing a plurality of original documents, each of the original documents comprising a header and a body text, and wherein said header comprises at least one header parameter and wherein said body text comprises text;
  
  (ii) selecting a document from among said plurality of original documents and associating the selected document with a node;
  
  (iii) comparing a fingerprint of said selected document to previously stored fingerprints of other documents from amongst said plurality of original documents, and in the case of a match between the fingerprints, merging the node associated with said selected document with a node associated with a matching document having a fingerprint matching the fingerprint of said selected document;
  
  (iv) searching in text order through said body text of said selected document to locate a first instance of header-type text within said selected document, wherein said header-type text contains at least one header parameter;
  
  (v) constructing a presumed document from a subset of the body text of said selected document, the constructed presumed document having (a) a header that includes one or more parameters from said header-type text located within said body text of said selected document, irrespective of whether the subject parameter of the header of said presumed document is the same as the subject parameter of the header of said selected document, and (b) body text that includes the text of said selected document located after said header-type text in said body text of said selected document, and associating said presumed document with a node;
  
  (vi) comparing a fingerprint of said presumed document to the previously stored fingerprint of at least one other document from among said plurality of original documents and in the case of a match between the fingerprints, merging a node associated with said presumed document with a node associated with a matching document having a fingerprint matching the fingerprint of said presumed document; and
  
  (vii) determining if the comparing of (vi) does not result in a match, processing repeatedly a remainder of the body text of said selected document for successive instances of header-type text according to step (iv), and for each successive instance of the header-type text, constructing a corresponding presumed document according to step (v), and comparing for any matching documents to the corresponding presumed document according to step (vi), said processing of steps (iv)-(vi) is repeatedly performed until a match is found in step (vi) or until no new instances of header-type text are found in step (iv),
  
  wherein
  
  each fingerprint comprises a representation of a corresponding document, and
  
  the plurality of nodes are arranged in terms of more than one tree, each tree comprising at least one node from the plurality of nodes, each tree comprising at least a root node and at least a leaf node, a root node being a node that is not a descendant of any other node, and a leaf node being a node that has no descendent nodes, a node not being prohibited from being both a root node and a leaf node, all nodes that are descendant from the root node are contained by the tree, each node being associated with either:
  
  (1) one of the original documents and any matching document thereof or (2) one of the presumed documents and any matching document thereof.
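
Steps (ii) through (vii), which recur in claims 1, 22 and 23, can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the patented implementation: the `HEADER_LINE` pattern for recognizing header-type text, the SHA-256 fingerprint over a lowercased and whitespace-collapsed body, and the names `NodeIndex`, `attach`, `next_header_block` and `process` are all hypothetical. Merging is simplified to reusing the existing node whenever a fingerprint is already stored.

```python
import hashlib
import re

# Assumption: "header-type text" is one or more consecutive lines that start
# with a known header parameter name. Real mail clients quote headers in
# many other layouts.
HEADER_LINE = re.compile(r"^(From|To|Subject|Date):[ \t]*(.*)$", re.MULTILINE)


def fingerprint(body: str) -> str:
    # Assumed fingerprint: SHA-256 of the lowercased, whitespace-collapsed body.
    return hashlib.sha256(" ".join(body.lower().split()).encode("utf-8")).hexdigest()


def next_header_block(body: str):
    """Step (iv): find the first header-type block; return (params, text after it)."""
    first = HEADER_LINE.search(body)
    if first is None:
        return None
    params, pos = {}, first.start()
    # Consume consecutive header-type lines starting at the first hit.
    while (m := HEADER_LINE.match(body, pos)) is not None:
        params[m.group(1)] = m.group(2).strip()
        pos = min(m.end() + 1, len(body))  # skip the newline after the line
    return params, body[pos:]


class NodeIndex:
    def __init__(self):
        self.node_of = {}   # fingerprint -> node id
        self.members = {}   # node id -> [(label, header parameters), ...]

    def attach(self, label, header, body):
        """Associate a document with a node. A fingerprint match reuses the
        existing node, standing in for 'create a node, then merge'."""
        fp = fingerprint(body)
        matched = fp in self.node_of
        node = self.node_of.setdefault(fp, len(self.node_of))
        self.members.setdefault(node, []).append((label, header))
        return node, matched


def process(index, label, header, body):
    """Steps (ii)-(vii) for one selected document; returns the node ids visited."""
    node, _ = index.attach(label, header, body)             # steps (ii)-(iii)
    chain, rest, depth = [node], body, 0
    while (found := next_header_block(rest)) is not None:   # step (iv)
        params, rest = found                                # step (v): presumed document
        depth += 1
        node, matched = index.attach(f"{label}~presumed{depth}", params, rest)  # step (vi)
        chain.append(node)
        if matched:                                         # step (vii): stop on a match
            break
    return chain


index = NodeIndex()
process(index, "msg1", {"Subject": "lunch"}, "Lunch at noon?")
chain = process(index, "msg2", {"Subject": "RE: lunch"},
                "Sure.\nFrom: alice@example.com\nSubject: lunch\nLunch at noon?")
print(chain, index.members)
```

In the usage at the end, the reply `msg2` quotes `msg1`, so the presumed document built from the quoted portion is attached to the node that already holds `msg1`. The returned chain of node ids could then be used to link nodes as parent and child in the manner of claim 20, which this sketch leaves out.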

Current Assignee: Microsoft Israel Research And Development (2002) Ltd (Microsoft Corporation)
Original Assignee: Equivio Ltd. (Microsoft Corporation)
Inventors: Ravid, Yiftach; Milo, Amir
Primary Examiner(s): Morrison, Jay
Assistant Examiner(s): GORTAYO, DANGELINO N
Application Number: US12/839,976
Publication Number: US 20100287466A1
Time in Patent Office: 1,645 Days
Field of Search: 707/1, 707/100, 707/101, 707/748, 707/749, 707/752, 707/754, 707/755, 707/758, 707/773, 707/803
US Class Current: 707/749

CPC Class Codes
G06F 16/248   Presentation of query results
G06F 16/285   Clustering or classification
G06F 16/35   Clustering; Classification
G06F 16/93   Document management systems
G06F 40/194   Calculation of difference b...
G06V 30/416   Extracting the logical stru...
G06V 30/418   Document matching, e.g. of ...
H04L 51/216   Handling conversation histo...
