Method for Organizing Large Numbers of Documents

US 20090012984A1
Filed: 01/02/2008
Published: 01/08/2009
Est. Priority Date: 07/02/2007
Status: Abandoned Application

Abstract

A computer product including a data structure for organizing a plurality of documents, capable of being utilized by a processor for manipulating data of the data structure and capable of displaying selected data on a display unit. The data structure includes a plurality of directionally interlinked nodes, each node being associated with one or more documents having a header and body text. All documents associated with a given node have identical normalized body text, and all documents that have identical normalized body text are associated with the same node. One or more of the nodes is associated with more than one document. For any node that is a descendant of another node, the normalized body text of each document associated with the descendant node is inclusive of the normalized body text of a document that is associated with the other node.
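The structure described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the application: the names `Document`, `Node`, `normalize`, `build_nodes`, and `link_nodes` are hypothetical, normalization is assumed to mean lowercasing and collapsing whitespace, and plain substring containment stands in for the "inclusive" relation between body texts.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Document:
    header: dict
    body: str


def normalize(text: str) -> str:
    """Assumed normalization: lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


@dataclass(eq=False)  # identity comparison; nodes reference each other
class Node:
    normalized_body: str
    documents: list = field(default_factory=list)
    parent: Optional["Node"] = None
    children: list = field(default_factory=list)


def build_nodes(docs):
    """Group documents so that identical normalized bodies share one node."""
    nodes = {}
    for doc in docs:
        key = normalize(doc.body)
        nodes.setdefault(key, Node(key)).documents.append(doc)
    return nodes


def link_nodes(nodes):
    """Attach each node under the longest shorter body that it contains."""
    ordered = sorted(nodes.values(), key=lambda n: len(n.normalized_body))
    for i, node in enumerate(ordered):
        # Only shorter (earlier) bodies can be included in this one.
        included = [m for m in ordered[:i]
                    if m.normalized_body in node.normalized_body]
        if included:
            parent = max(included, key=lambda m: len(m.normalized_body))
            node.parent = parent
            parent.children.append(node)


docs = [
    Document({"Subject": "Hi"}, "Lunch on Friday?"),
    Document({"Subject": "RE: Hi"}, "Sure.  lunch on friday?"),
    Document({"Subject": "Hi"}, "lunch  on friday?"),
]
nodes = build_nodes(docs)
link_nodes(nodes)
```

In this example the two copies of the original message share one node, and the reply, whose body contains the original text, becomes its descendant.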

59 Claims

1. ) A computer product including a data structure for organizing of a plurality of documents, and capable of being utilized by a processor for manipulating data of said data structure and capable of displaying selected data on a display unit;
- said data structure comprising;
  
  a plurality of directionally interlinked nodes, each node being associated with at least one document having at least a header and body text; and
  
  wherein all documents associated with a given node having substantially identical normalized body text, and wherein all documents having substantially identical normalized body text being associated with the same node, and wherein at least one node being associated with more than one document;
  
  for any first node of said nodes that is a descendent of a second node of said nodes, the normalized body text of each document associated with said first node is substantially inclusive of the normalized body text of each document that is associated with said second node.
  - 2. ) The computer product of claim 1, wherein all documents associated with a given node further having substantially identical normalized subject parameter in said header.
  - 3. ) The computer product of claim 1 wherein said documents are emails.
  - 4. ) The computer product of claim 1, wherein said plurality of directionally interlinked nodes including at least a node, a first descendant node descendant from said node, and a second descendant node descendant from said first descendant node.
  - 5. ) The computer product of claim 1 wherein said plurality of directionally interlinked nodes including at least a node, and two descendant nodes, each independently descendant from said node.
  - 6. ) The computer product of claim 1 wherein said plurality of nodes being arranged in terms of more than one tree, wherein each tree comprises at least one node from said plurality of directionally interlinked nodes and wherein each tree comprises at least a root node and at least a leaf node, wherein a root node is a node that is not a descendant of any other node, and a leaf node is a node that has no descendent nodes;
    - and wherein a node is not prohibited from being both a root node and a leaf node, and wherein all nodes that are descendant from said root node are contained by said tree.
  - 7. ) The computer product of claim 6 wherein said plurality of nodes being arranged in terms of at least a first tree and a second tree that contain a link to one another;
    - said link is indicative of the fact that said first tree contains a node that is associated with a document that near-duplicates to a document that is associated with a node in said second tree.
  - 8. ) A processor and associated display communicating with the data structure of claim 7, and capable of manipulating data of said data structure and displaying selected data on a display unit, wherein said processor further being configured to display said first tree and a node from said second tree in close proximity on said display unit.
  - 9. ) A processor and associated display communicating with the data structure of claim 6, and capable of manipulating data of said data structure and displaying selected data on a display unit wherein said processor further being configured to indicate on the display unit which nodes are the leaf nodes.
  - 10. ) The processor and associated display of claim 9 wherein said processor is configured to mark for said display unit an entire thread inclusively including all the nodes directly between a root node and given leaf node.
  - 11. ) The processor and associated display of claim 10 wherein in response to user command said processor is configured to mark nodes, on said display unit, in order to indicate (i) whether a thread has been read, or (ii) the relevance of the thread, or (iii) the level of importance of the thread;
    - said processor is further configured to allow the addition of reviewer comments on said display unit.
  - 12. ) A processor and associated display communicating with the data structure of claim 1, and capable of manipulating data of said data structure and displaying selected data on a display unit, wherein said processor further being configured to compare text of two documents that are associated with different nodes.
  - 13. ) A processor and associated display communicating with the data structure of claim 1, and capable of manipulating data of said data structure and displaying selected data on a display unit wherein said processor further being configured to display the subject and body text of a document that is associated with said node.
  - 14. ) The processor and associated display of claim 13 wherein said node is represented by a clickable icon, and wherein said processor further being configured to display the subject and body text of a document that is associated with said node, in response to clicking on said icon.
  - 15. ) The processor and associated display of claim 14 wherein said processor further being configured to display a plurality of header parameters for the documents associated with the node.
  - 16. ) The processor and associated display of claim 15 wherein said plurality of header parameters are arranged in tabular form.
  - 17. ) The processor and associated display of claim 13 wherein at least one of said documents includes at least one member of a group that includes:
    - signature, disclaimers, attachment notification, and at least one attachment, and wherein said processor is configured to suppress the display of at least one of said members.
  - 18. ) A processor and associated display communicating with the data structure of claim 1, and capable of manipulating data of said data structure and displaying selected data on a display unit, wherein said processor further being configured to display a list of documents that are associated with leaf nodes, wherein a leaf node comprises a node that has no descendant nodes.
  - 19. ) The processor and associated display of claim 18 in which entries in said displayed list comprise a listing of the documents associated with nodes of which said leaf node is a descendant.
  - 20. ) The computer product according to claim 1 wherein said documents are emails, and wherein at least two from among said emails are obtained from different email archives.
  - 21. ) The computer product of claim 1 wherein said documents are emails, and further comprising additional nodes associated with presumed documents.
  - 22. ) The computer product of claim 1, wherein the body text of each document associated with said first node is substantially inclusive of the body text of each document that is associated with said second node, irrespective of whether a normalized subject parameter from a header of a document associated with said first node and a normalized subject parameter from a header of a document associated with said second nodes are identical.
  - 23. ) The computer product of claim 22, wherein all documents associated with a given node further having substantially the same normalized subject parameter in said header.
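Claims 6, 9, and 10 refer to root nodes, leaf nodes, and a thread running from a root to a given leaf. A small sketch of those notions follows, assuming the links are stored as a hypothetical `parent_of` mapping from each node label to its parent; the function names are illustrative only.

```python
def roots_and_leaves(nodes, parent_of):
    """Return (roots, leaves): nodes with no parent, nodes with no children."""
    has_children = set(parent_of.values())
    roots = [n for n in nodes if n not in parent_of]
    leaves = [n for n in nodes if n not in has_children]
    return roots, leaves


def thread_to(leaf, parent_of):
    """All nodes on the path from the root down to the leaf, inclusive."""
    path = [leaf]
    while path[-1] in parent_of:
        path.append(parent_of[path[-1]])
    return path[::-1]


# Two trees: A -> B -> C and A -> D form one; E stands alone.
nodes = ["A", "B", "C", "D", "E"]
parent_of = {"B": "A", "C": "B", "D": "A"}
roots, leaves = roots_and_leaves(nodes, parent_of)
thread = thread_to("C", parent_of)
```

Node `E` appears in both lists, which matches the statement in claim 6 that a node is not prohibited from being both a root node and a leaf node.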

24. ) A method for organizing documents into nodes, in which a node represents a group of substantially equivalent documents, said method comprising:
- (i) providing a plurality of original documents, each comprising a header and a body, and wherein said header comprises at least one parameter and wherein said body comprises text,
  
  (ii) selecting a document from among said documents and associating the document with a node, comparing at least a portion of the body text of said document to at least a portion of the body texts of other documents from amongst said plurality of documents, and in the case of a match, merging the node associated with said document with a node associated with the matching document,
  
  (iii) searching the body of said document to locate a first instance of header-type text, wherein said header-type text contains at least one header parameter;
  
  (iv) constructing a presumed document comprising a header and a body, wherein said header of said presumed document comprises one or more parameters from said header-type text located within said body of said original document, and wherein said body of said presumed document substantially comprises the text located after said header-type text in said body of said original document, and associating said presumed document with a node;
  
  (v) comparing at least a portion of the body text of the presumed document to at least a portion of the body texts of at least one other document from among said plurality of documents and in the case of a match, merging a node associated with said presumed document with a node associated with the matching document,
  
  (vi) if the comparison of (v) does not find a match, processing repeatedly the remainder of the body of said document for successive instances of header-type text, as stipulated in stages (iii)-(v), and for each instance, constructing a presumed document, comparing for any matching documents to the presumed document, and if found, merging the nodes associated with the matching documents, until no new instances of header-type text are found.
  - 25. ) The method of claim 24 further comprising (vii) storing at least a portion of the document, or a fingerprint thereof, for future comparison with other documents.
  - 26. ) The method of claim 24 wherein (ii) is applied to selected documents from amongst said plurality of documents.
  - 27. ) The method for organizing documents of claim 25, further comprising the step of linking nodes, in which linking implies that the text of a document on a first side of said link is substantially inclusive of the text of a document on a second side of said link, and wherein (v) further comprising linking the associated node to be a parent of the node stipulated in (ii);
    - and wherein (vi) comprising linking the associated node to be a parent of the associated node of the most recent iteration of (v).
  - 28. ) The method of claim 25, wherein (ii) and (v) comprising comparing both of at least a portion of the body text and a normalized subject parameter, with at least a portion of the body text and a normalized subject parameter of said other documents.
  - 29. ) The method of claim 28 further comprising displaying on the display unit symbols indicative of said nodes, and further comprising affiliating for each node a body text and subject parameter of at least one document associated with the node.
  - 30. ) The method of claim 29 further comprising affiliating each node with a plurality of header parameters from each document associated with the node, said plurality of header parameters being arranged in a table.
  - 31. ) The method of claim 30 wherein said documents are emails and wherein said plurality of header parameters comprises two or more fields from the email header selected from the group of fields consisting of:
    - “To”, “From”, “Subject”, and “Date”.
  - 32. ) The method of claim 25 further comprising displaying the nodes;
    - and suppressing nodes associated with a presumed document from the display.
  - 33. ) The method of claim 32 further comprising affiliating each displayed node with header parameters of each document associated with said displayed node;
    - and affiliating, header parameters of documents associated with suppressed nodes with a node associated with a document from which the presumed document associated with said suppressed node is constructed.
  - 34. ) The method of claim 25 wherein (ii) further comprises comparing for near-duplication at least a portion of the body text of said document to at least a portion of the body texts of other documents from amongst said plurality of documents.
  - 35. ) The method of claim 34 further comprising creating an association between nodes that are associated with documents found to near-duplicate to each other.
  - 36. ) The method of claim 35 further including enabling a user to define the degree of similarity between documents for documents to be considered near-duplicating.
  - 37. ) The method of claim 25 further comprising creating an association between nodes that are associated with documents having related Conversation ID or related Message ID indicators.
  - 38. ) The method of claim 25 further comprising removing at least one member of the group consisting of disclaimers, signatures, program added text and attachment notifications from the body text of documents, and replacing unique text of each removed member with a unique short text identifier prior to said comparing in (ii), wherein said comparing is applied to at least a portion of said body text after said replacing.
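Steps (iii) through (vi) of claim 24, together with the linking of claim 27, can be sketched roughly as follows. The regular expression for "header-type text" is an assumption (a `From:` line followed by other common field lines), as are the helper names `presumed_documents` and `organize`, the integer node ids, and the use of exact equality of normalized text as the match test.

```python
import re

# Assumed shape of header-type text: a From: line plus following field lines.
HEADER_BLOCK = re.compile(
    r"^From:[^\n]*\n(?:(?:To|Sent|Date|Subject|Cc):[^\n]*\n)*",
    re.MULTILINE,
)


def normalize(text):
    """Assumed normalization: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def presumed_documents(body):
    """Yield (header, body) for each embedded message, in text order."""
    for match in HEADER_BLOCK.finditer(body):
        header = {}
        for line in match.group(0).splitlines():
            name, _, value = line.partition(":")
            header[name.strip()] = value.strip()
        # The presumed body is all text after the header-type text.
        yield header, body[match.end():].strip()


def organize(bodies):
    """Return (node_of, parent_of) for a list of message bodies."""
    node_of, parent_of = {}, {}

    def node_for(text):
        key = normalize(text)
        is_new = key not in node_of
        if is_new:
            node_of[key] = len(node_of)
        return node_of[key], is_new

    for body in bodies:
        child, is_new = node_for(body)
        if not is_new:
            continue  # duplicate: merged into the existing node
        for _header, inner in presumed_documents(body):
            parent, is_new = node_for(inner)
            parent_of[child] = parent  # presumed document becomes the parent
            if not is_new:
                break  # matched a known document; stop searching this body
            child = parent
    return node_of, parent_of


original = "Lunch on Friday?"
reply = "Sounds good.\nFrom: Alice\nSubject: Lunch\nLunch on Friday?\n"
reply2 = "Booked.\nFrom: Bob\nSubject: RE: Lunch\n" + reply
node_of, parent_of = organize([original, reply, reply2])
```

Processing `reply2` alone would still produce three nodes, since the two embedded messages are reconstructed as presumed documents even though their originals were never supplied.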

39. ) A method for reducing duplicate document display of a large number of documents, said method comprising:
- a) comparing a fingerprint of a document with previously stored document fingerprints, wherein a fingerprint is formed for each of at least a portion of the normalized body text and a normalized subject parameter of a document, wherein said comparison for detecting and indicating duplicating documents;
  
  b) searching the document for instances of header-type text, searching in text order through the normalized body text of the document, and if header-type text is found in said search,
  
  i) deriving a presumed document comprising a header and a body text, by treating parameters from the instance of header-type text in the document as parameters of a header for the presumed document, and by treating all ensuing body text of the normalized body text of the document as the body text of the presumed document, and applying step a) to the presumed documents, and
  
  ii) if the fingerprint of the presumed document is unique, continuing to search the normalized body text of the document from which the presumed document is derived for further instances of header-type text, searching in text order through the normalized body text of the document, and if a further instance of header-type text is found in said search, applying step i) to derive and process an additional presumed document, and
  
  iii) repeating step ii) until no more instances of header-type text are found.
  - 40. ) The method of claim 39 wherein a) is applied to selected documents from amongst said large group of documents.
  - 41. ) The method of claim 39 further comprising providing a plurality of nodes, and associating each document having a unique fingerprint with a unique node, and associating each document detected as duplicating to a prior document with the node associated with the prior document.
  - 42. ) The method of claim 41 further comprising linking nodes to provide that a node associated with a first presumed document becomes the parent of the node associated with the document from which the first presumed document is derived, and to provide that the node associated with each sequentially derived presumed document derived from the same document becomes a parent of the node associated with the previously derived presumed document.
  - 43. ) The method of claim 42 further comprising removing each of disclaimers, signatures, program added text and attachment notifications from the body text of documents, and replacing each unique disclaimer, signature, program added text, and attachment notification with a unique short identifier prior to a), wherein said fingerprint of a) is a fingerprint of the body text after said replacement.
  - 44. ) The method of claim 42 further comprising displaying said nodes in a computer format, and affiliating each node with the body text and subject parameter of the document associated with the node.
  - 45. ) The method of claim 44 further comprising affiliating each node with a plurality of header parameters from each document associated with the node, said plurality of header parameters being arranged in a table.
  - 46. ) The method of claim 45 wherein said documents are emails and wherein said headers comprise fields from the email headers, including “To”, “From”, “Subject”, and “Date”.
  - 47. ) The method of claim 42 further comprising displaying documents in a data structure able to be sorted according to one or more members of the group consisting of:
    - document identifier, document set, node address, inclusive flag, first copy of an inclusive flag.
  - 48. ) The method of claim 39 wherein said comparison of a) further for detecting and indicating near-duplicating documents.
  - 49. ) The method of claim 48 further comprising enabling a user to set the degree of similarity for documents to be considered as near-duplicating.
  - 50. ) The method of claim 48 further comprising associating documents with document sets by associating to a document set:
    - a first document, and documents that are associated with a node that is linked to the node associated with a document already associated with said document set, and documents that near-duplicate to a document already associated with said document set.
  - 51. ) The method of claim 50 further comprising associating to a document set documents that have related Conversation ID or related Message ID indicators with a document already associated with said same document set.
  - 52. ) The method of claim 39 further comprising:
    - c) storing the document fingerprint, for future comparison with other document fingerprints.
  - 53. ) The method of claim 52 further comprising:
    - d) if a document is found to be a duplicate of a prior document, suppressing step c).
  - 54. ) The method of claim 39 further comprising:
    - e) forming a subset of the large number of documents by including each document into the subset except for documents that duplicate to another document already in the subset, and except for documents that duplicate to a presumed document whereby only a single copy of inclusive documents are in the subset.
  - 55. ) The method of claim 54 further comprising affiliating each document in said subset with other documents that duplicate to said document and with documents that duplicate to a presumed document derived from said document.
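Step a) of claim 39 and the storage behavior of claims 52 and 53 can be illustrated with a short sketch. The claims do not name a fingerprinting function, so SHA-256 from Python's `hashlib` is used here purely as a stand-in; stripping `RE:`/`FW:`/`FWD:` prefixes is an assumed form of subject normalization, and `FingerprintStore` is a hypothetical name.

```python
import hashlib
import re


def normalize_subject(subject):
    """Assumed normalization: drop leading reply/forward prefixes, lowercase."""
    stripped = re.sub(r"^(?:\s*(?:re|fw|fwd):)+\s*", "", subject,
                      flags=re.IGNORECASE)
    return stripped.strip().lower()


def fingerprint(subject, body):
    """Hash the normalized subject and normalized body text together."""
    text = normalize_subject(subject) + "\x00" + " ".join(body.lower().split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class FingerprintStore:
    """Keeps one stored fingerprint (and node id) per unique document."""

    def __init__(self):
        self.node_of = {}

    def add(self, subject, body):
        """Return (node_id, is_duplicate); store only unseen fingerprints."""
        fp = fingerprint(subject, body)
        is_duplicate = fp in self.node_of
        if not is_duplicate:
            self.node_of[fp] = len(self.node_of)
        return self.node_of[fp], is_duplicate


store = FingerprintStore()
first = store.add("Lunch", "Lunch on Friday?")
second = store.add("RE: lunch", "lunch   on friday?")
third = store.add("Lunch", "Dinner instead?")
```

A duplicate is assigned to the node of the earlier document and adds nothing to the store, so the store grows only with unique documents.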

56. ) A computer product including a data structure for organizing of a plurality of documents and capable of being utilized by a processor for manipulating data of said data structure and capable of displaying selected data on a display unit;
- said data structure comprising;
  
  one or more trees, wherein a tree comprises at least a trunk and at least one node, wherein said at least one node being associated with a document having at least a header and body text, and wherein a trunk being associated with zero or more documents having at least a header and a body text and wherein all documents whose body text includes the same included document are associated with the same tree, and wherein a unique inclusive document, as well as documents that duplicate to said unique inclusive document, are associated with one of one or more unique nodes of said tree, and wherein an included document, as well as documents that duplicate to said included document, are associated with said trunk of said tree.
  - 57. ) The computer product of claim 56 wherein said documents are associated with said trunk irrespective of whether a normalized subject parameter of documents associated with said trunk is identical to a normalized subject parameter of documents associated with said node.
  - 58. ) The computer product of claim 56 wherein documents qualify as duplicating if they comprise the same normalized body text, and the same normalized subject parameter.
  - 59. ) The computer product of claim 56 wherein two documents qualify as duplicating according to whether they comprise the same normalized body text, and, if the normalized subject parameter for both documents is not blank, additionally according to whether they comprise.
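One possible reading of the trunk-and-node arrangement in claim 56 is sketched below. This is an interpretation rather than the application's implementation: the set of included documents is assumed to be known in advance, substring containment of normalized text stands in for "includes", and `build_trees` is a hypothetical helper.

```python
from collections import defaultdict


def normalize(text):
    """Assumed normalization: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def build_trees(bodies, included_bodies):
    """Build one tree per included document, keyed by its normalized text."""
    trees = {
        normalize(inc): {"trunk": [], "nodes": defaultdict(list)}
        for inc in included_bodies
    }
    for body in bodies:
        key = normalize(body)
        for inc, tree in trees.items():
            if key == inc:
                # The included document and its duplicates sit on the trunk.
                tree["trunk"].append(body)
            elif inc in key:
                # Each unique inclusive document gets its own node;
                # duplicates of it share that node.
                tree["nodes"][key].append(body)
    return trees


bodies = [
    "lunch on friday?",
    "Sure. Lunch on Friday?",
    "sure.  lunch on friday?",
    "No. Lunch on Friday?",
]
trees = build_trees(bodies, ["Lunch on Friday?"])
tree = trees["lunch on friday?"]
```

The trunk stays empty when the included document survives only as quoted text inside other documents, consistent with the claim's allowance for a trunk associated with zero documents. In this simplified version a body that contains several included documents would be placed in several trees.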

Current Assignee: Equivio Ltd. (Microsoft Corporation)
Original Assignee: Equivio Ltd. (Microsoft Corporation)
Inventors: MILO, Amir; RAVID, Yiftach

Application Number: US11/968,433
Publication Number: US 20090012984A1
US Class Current: 1/1
CPC Class Codes

G06F 16/248   Presentation of query results

G06F 16/285   Clustering or classification

G06F 16/35   Clustering; Classification

G06F 16/93   Document management systems

G06F 40/194   Calculation of difference b...

G06V 30/416   Extracting the logical stru...

G06V 30/418   Document matching, e.g. of ...

H04L 51/216   Handling conversation histo...
