METHOD FOR ORGANIZING LARGE NUMBERS OF DOCUMENTS

US 20100287466A1
Filed: 07/20/2010
Published: 11/11/2010
Est. Priority Date: 07/02/2007
Status: Active Grant

Abstract

A computer product including a data structure for organizing a plurality of documents, capable of being utilized by a processor for manipulating data of the data structure and of displaying selected data on a display unit. The data structure includes a plurality of directionally interlinked nodes, each node being associated with one or more documents having a header and body text. All documents associated with a given node have identical normalized body text, and all documents that have identical normalized body text are associated with the same node. One or more of the nodes is associated with more than one document. For any node that is a descendant of another node, the normalized body text of each document associated with the node is inclusive of the normalized body text of a document that is associated with the other node.
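
The node structure described in the abstract can be pictured roughly as follows. This is a minimal sketch and not the patented implementation: the names `Node`, `NodeIndex`, and `normalize`, and the choice of lowercasing plus whitespace collapsing as the normalization rule, are assumptions made for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass, field


def normalize(body: str) -> str:
    """Assumed normalization: lowercase and collapse runs of whitespace."""
    return " ".join(body.lower().split())


@dataclass
class Node:
    normalized_body: str
    documents: list[dict] = field(default_factory=list)  # header/body records
    parent: Node | None = None                            # directional link

    def ancestors(self):
        """Walk the parent links upward from this node."""
        node = self.parent
        while node is not None:
            yield node
            node = node.parent


class NodeIndex:
    """Keeps exactly one node per distinct normalized body text."""

    def __init__(self):
        self._by_body: dict[str, Node] = {}

    def add(self, header: dict, body: str) -> Node:
        key = normalize(body)
        node = self._by_body.setdefault(key, Node(key))
        node.documents.append({"header": header, "body": body})
        return node

    def link(self, child: Node, parent: Node) -> None:
        # A descendant's text must be inclusive of its ancestor's text.
        if parent.normalized_body not in child.normalized_body:
            raise ValueError("child text must include parent text")
        child.parent = parent


index = NodeIndex()
original = index.add({"From": "a@example.com"}, "Shall we meet Monday?")
copy = index.add({"From": "a@example.com"}, "shall we  meet monday?")
reply = index.add({"From": "b@example.com"}, "Yes. > Shall we meet Monday?")
index.link(child=reply, parent=original)
```

In this sketch the two copies of the first message land on the same node, and the reply's node is a descendant of it because the reply's normalized text contains the earlier message.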

32 Claims

1. A method for organizing documents into nodes, in which a node represents a group of substantially equivalent documents, said method comprising:
- (i) providing a plurality of original documents, each comprising a header and a body, and wherein said header comprises at least one parameter and wherein said body comprises text,
  
  (ii) selecting a document from among said documents and associating the document with a node, comparing at least a portion of the body text of said document to at least a portion of the body texts of other documents from amongst said plurality of documents, and in the case of a match, merging the node associated with said document with a node associated with the matching document,
  
  (iii) searching the body of said document to locate a first instance of header-type text, wherein said header-type text contains at least one header parameter;
  
  (iv) constructing a presumed document comprising a header and a body, wherein said header of said presumed document comprises one or more parameters from said header-type text located within said body of said original document, and wherein said body of said presumed document substantially comprises the text located after said header-type text in said body of said original document, and associating said presumed document with a node;
  
  (v) comparing at least a portion of the body text of the presumed document to at least a portion of the body texts of at least one other document from among said plurality of documents and in the case of a match, merging a node associated with said presumed document with a node associated with the matching document,
  
  (vi) if the comparison of (v) does not find a match, processing repeatedly the remainder of the body of said document for successive instances of header-type text, as stipulated in stages (iii)-(v), and for each instance, constructing a presumed document, comparing for any matching documents to the presumed document, and if found, merging the nodes associated with the matching documents, until no new instances of header-type text are found.
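
Steps (i) through (vi) of claim 1 can be sketched as follows. This is an illustrative reading, not the patented implementation: the `HEADER_RE` pattern used to recognize header-type text, the `norm` normalization, the use of exact matching on normalized text, and the plain dictionary used as the node store are all assumptions.

```python
import re

# Assumed shape of header-type text: a "From:" line optionally followed by
# further "Field: value" lines, as commonly found above quoted emails.
HEADER_RE = re.compile(
    r"^From:.*(?:\n(?:To|Cc|Sent|Date|Subject):.*)*\n?", re.MULTILINE
)


def norm(text):
    """Assumed normalization: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def presumed_documents(body):
    """Yield (header, body) per instance of header-type text, in text order."""
    for match in HEADER_RE.finditer(body):
        header = {}
        for line in match.group().strip().splitlines():
            name, _, value = line.partition(":")
            header[name.strip()] = value.strip()
        # The presumed body is everything after the header-type text.
        yield header, body[match.end():]


def organize(documents):
    """Map each normalized body text (one node) to the headers merged into it."""
    nodes = {}
    for header, body in documents:
        nodes.setdefault(norm(body), []).append(header)          # step (ii)
        for p_header, p_body in presumed_documents(body):        # steps (iii)-(iv)
            key = norm(p_body)
            matched = key in nodes                               # step (v)
            nodes.setdefault(key, []).append(p_header)
            if matched:
                break                                            # step (vi) stops on a match
    return nodes


first = ({"From": "ann", "Subject": "Lunch"}, "Lunch on Friday?")
reply = (
    {"From": "bob", "Subject": "RE: Lunch"},
    "Sure.\nFrom: ann\nSubject: Lunch\nLunch on Friday?",
)
nodes = organize([reply, first])
```

Here the reply is processed first, so the quoted message becomes a presumed document with its own node. When the real first message arrives later, it matches that node and is merged into it instead of creating a new one.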
- - 2. The method of claim 1 further comprising:
    - (vii) storing at least a portion of the document, or a fingerprint thereof, for future comparison with other documents.
  - 3. The method of claim 1 wherein (ii) is applied to selected documents from amongst said plurality of documents.
  - 4. The method for organizing documents of claim 2, further comprising the step of linking nodes, in which linking implies that the text of a document on a first side of said link is substantially inclusive of the text of a document on a second side of said link, and wherein (v) further comprising linking the associated node to be a parent of the node stipulated in (ii);
    - and wherein (vi) comprising linking the associated node to be a parent of the associated node of the most recent iteration of (v).
  - 5. The method of claim 2, wherein (ii) and (v) comprising comparing both of at least a portion of the body text and a normalized subject parameter, with at least a portion of the body text and a normalized subject parameter of said other documents.
  - 6. The method of claim 5 further comprising displaying on the display unit symbols indicative of said nodes, and further comprising affiliating for each node a body text and subject parameter of at least one document associated with the node.
  - 7. The method of claim 6 further comprising affiliating each node with a plurality of header parameters from each document associated with the node, said plurality of header parameters being arranged in a table.
  - 8. The method of claim 7 wherein said documents are emails and wherein said plurality of header parameters comprises two or more fields from the email header selected from the group of fields consisting of:
    - “To”, “From”, “Subject”, and “Date”.
  - 9. The method of claim 2 further comprising displaying the nodes;
    - and suppressing nodes associated with a presumed document from the display.
  - 10. The method of claim 9 further comprising affiliating each displayed node with header parameters of each document associated with said displayed node;
    - and affiliating header parameters of documents associated with suppressed nodes with a node associated with a document from which the presumed document associated with said suppressed node is constructed.
  - 11. The method of claim 2 wherein (ii) further comprises comparing for near-duplication at least a portion of the body text of said document to at least a portion of the body texts of other documents from amongst said plurality of documents.
  - 12. The method of claim 11 further comprising creating an association between nodes that are associated with documents found to near-duplicate to each other.
  - 13. The method of claim 12 further including enabling a user to define the degree of similarity between documents for documents to be considered near-duplicating.
  - 14. The method of claim 2 further comprising creating an association between nodes that are associated with documents having related Conversation ID or related Message ID indicators.
  - 15. The method of claim 2 further comprising removing at least one member of the group consisting of:
    - disclaimers, signatures, program added text and attachment notifications from the body text of documents, and replacing unique text of each removed member with a unique short text identifier prior to said comparing in (ii), wherein said comparing is applied to at least a portion of said body text after said replacing.
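
The replacement step of claim 15 might look something like the sketch below. It assumes the boilerplate blocks (disclaimers, signatures, and so on) have already been detected and are passed in as plain strings; how they are detected is outside this sketch. The class name `BoilerplateReplacer` and the `[[B<n>]]` identifier format are hypothetical.

```python
class BoilerplateReplacer:
    """Swap each distinct boilerplate block for a short, stable identifier."""

    def __init__(self):
        self._ids = {}  # boilerplate text -> short identifier

    def replace(self, body, boilerplate_blocks):
        for block in boilerplate_blocks:
            if block and block in body:
                # The same block always maps to the same identifier;
                # a new block gets the next unused number.
                token = self._ids.setdefault(block, f"[[B{len(self._ids) + 1}]]")
                body = body.replace(block, token)
        return body


replacer = BoilerplateReplacer()
disclaimer = "This message is confidential."
signature = "-- \nAnn, Sales"
known_blocks = [disclaimer, signature]

a = replacer.replace("See attached.\n" + disclaimer, known_blocks)
b = replacer.replace(
    "See attached.\n" + disclaimer + "\n" + signature, known_blocks
)
```

Because each distinct block keeps its own identifier, two bodies that differ only in which boilerplate they carry remain distinguishable after replacement, while the text that is actually compared becomes much shorter.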

16. A method for reducing duplicate document display of a large number of documents, said method comprising:
- a) comparing a fingerprint of a document with previously stored document fingerprints, wherein a fingerprint is formed for each of at least a portion of the normalized body text and a normalized subject parameter of a document, wherein said comparison for detecting and indicating duplicating documents;
  
  b) searching the document for instances of header-type text, searching in text order through the normalized body text of the document, and if header-type text is found in said search,
  
  i) deriving a presumed document comprising a header and a body text, by treating parameters from the instance of header-type text in the document as parameters of a header for the presumed document, and by treating all ensuing body text of the normalized body text of the document as the body text of the presumed document, and applying step a) to the presumed documents, and
  
  ii) if the fingerprint of the presumed document is unique, continuing to search the normalized body text of the document from which the presumed document is derived for further instances of header-type text, searching in text order through the normalized body text of the document, and if a further instance of header-type text is found in said search, applying step i) to derive and process an additional presumed document, and
  
  iii) repeating step ii) until no more instances of header-type text are found.
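
The fingerprint comparison of claim 16 could be sketched as below. Everything concrete here is an assumption for illustration: SHA-256 as the fingerprint function, stripping "RE:"/"FW:" prefixes as subject normalization, whitespace and case folding as body normalization, and the `HEADER_RE` pattern for header-type text. The claim itself does not name any of these.

```python
import hashlib
import re

HEADER_RE = re.compile(
    r"^From:.*(?:\n(?:To|Cc|Sent|Date|Subject):.*)*\n?", re.MULTILINE
)
SUBJECT_RE = re.compile(r"^Subject:(.*)$", re.MULTILINE)
PREFIX_RE = re.compile(r"^(?:\s*(?:re|fw|fwd)\s*:)+\s*", re.IGNORECASE)


def _digest(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def fingerprint(subject, body):
    """One digest for the normalized subject, one for the normalized body."""
    norm_subject = PREFIX_RE.sub("", subject).strip().lower()
    norm_body = " ".join(body.lower().split())
    return _digest(norm_subject), _digest(norm_body)


def process(subject, body, seen):
    """Return duplicate flags for the document and each presumed document."""
    flags = []
    fp = fingerprint(subject, body)               # step a)
    flags.append(fp in seen)
    seen.add(fp)
    for match in HEADER_RE.finditer(body):        # step b), in text order
        found = SUBJECT_RE.search(match.group())
        presumed_subject = found.group(1) if found else ""
        fp = fingerprint(presumed_subject, body[match.end():])  # step i)
        duplicate = fp in seen
        flags.append(duplicate)
        seen.add(fp)
        if duplicate:
            break                                 # step ii) continues only while unique
    return flags


seen = set()
reply_flags = process(
    "RE: Lunch", "Sure.\nFrom: ann\nSubject: Lunch\nLunch on Friday?", seen
)
first_flags = process("Lunch", "Lunch on Friday?", seen)
```

In this run the reply and its presumed document are both new, while the original message that arrives afterwards is flagged as a duplicate of the presumed document already stored.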
- - 17. The method of claim 16 wherein a) is applied to selected documents from amongst said large group of documents.
  - 18. The method of claim 16 further comprising providing a plurality of nodes, and associating each document having a unique fingerprint with a unique node, and associating each document detected as duplicating to a prior document with the node associated with the prior document.
  - 19. The method of claim 18 further comprising linking nodes to provide that a node associated with a first presumed document becomes the parent of the node associated with the document from which the first presumed document is derived, and to provide that the node associated with each sequentially derived presumed document derived from the same document becomes a parent of the node associated with the previously derived presumed document.
  - 20. The method of claim 19 further comprising removing each of disclaimers, signatures, program added text and attachment notifications from the body text of documents, and replacing each unique disclaimer, signature, program added text, and attachment notification with a unique short identifier prior to a), wherein said fingerprint of a) is a fingerprint of the body text after said replacement.
  - 21. The method of claim 19 further comprising displaying said nodes in a computer format, and affiliating each node with the body text and subject parameter of the document associated with the node.
  - 22. The method of claim 21 further comprising affiliating each node with a plurality of header parameters from each document associated with the node, said plurality of header parameters being arranged in a table.
  - 23. The method of claim 22 wherein said documents are emails and wherein said headers comprises fields from the email headers, including “To”, “From”, “Subject”, and “Date”.
  - 24. The method of claim 19 further comprising displaying documents in a data structure able to be sorted according to one or more members of the group consisting of:
    - document identifier, document set, node address, inclusive flag, first copy of an inclusive flag.
  - 25. The method of claim 16 wherein said comparison of a) further for detecting and indicating near-duplicating documents.
  - 26. The method of claim 25 further comprising enabling a user to set the degree of similarity for documents to be considered as near-duplicating.
  - 27. The method of claim 25 further comprising associating documents with document sets by associating to a document set:
    - a first document, and documents that are associated with a node that is linked to the node associated with a document already associated with said document set, and documents that near-duplicate to a document already associated with said document set.
  - 28. The method of claim 27 further comprising associating to a document set documents that have related Conversation ID or related Message ID indicators with a document already associated with said same document set.
  - 29. The method of claim 16 further comprising:
    - c) storing the document fingerprint, for future comparison with other document fingerprints.
  - 30. The method of claim 29 further comprising:
    - d) if a document is found to be a duplicate of a prior document, suppressing step c).
  - 31. The method of claim 16 further comprising:
    - e) forming a subset of the large number of documents by including each document into the subset except for documents that duplicate to another document already in the subset, and except for documents that duplicate to a presumed document whereby only a single copy of inclusive documents are in the subset.
  - 32. The method of claim 31 further comprising affiliating each document in said subset with other documents that duplicate to said document and with documents that duplicate to a presumed document derived from said document.
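
The subset of claim 31 can be illustrated with a two-pass sketch. It assumes each document has already been reduced to a hypothetical triple `(doc_id, own_key, presumed_keys)`, where `own_key` stands for the document's fingerprint and `presumed_keys` for the fingerprints of the presumed documents derived from it. Collecting all presumed fingerprints in a first pass is a design choice of this sketch so that the result does not depend on input order; the claim does not prescribe it.

```python
def inclusive_subset(documents):
    """Keep one copy of each document that no other document contains.

    documents: list of (doc_id, own_key, presumed_keys) triples.
    """
    # Pass 1: every fingerprint that appears as a presumed document
    # is contained in some other document.
    contained = set()
    for _, _, presumed_keys in documents:
        contained.update(presumed_keys)

    # Pass 2: skip exact duplicates and anything contained elsewhere.
    subset, kept = [], set()
    for doc_id, own_key, _ in documents:
        if own_key in kept or own_key in contained:
            continue
        kept.add(own_key)
        subset.append(doc_id)
    return subset


docs = [
    ("m1", "a", []),                 # original message
    ("m2", "b>a", ["a"]),            # reply quoting m1
    ("m3", "b>a", ["a"]),            # duplicate of m2
    ("m4", "c>b>a", ["b>a", "a"]),   # reply quoting m2 and m1
    ("m5", "x", []),                 # unrelated message
]
subset = inclusive_subset(docs)
```

Only `m4` and `m5` survive: `m4` contains the whole thread, so reviewing it covers `m1` through `m3`, and `m5` is not contained in anything else.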

Current Assignee
Microsoft Israel Research And Development (2002) Ltd (Microsoft Corporation)
Original Assignee
Equivio Ltd. (Microsoft Corporation)
Inventors
Ravid, Yiftach; Milo, Amir

Granted Patent

US 8,938,461 B2
US Class Current

715/256
CPC Class Codes

G06F 16/248   Presentation of query results

G06F 16/285   Clustering or classification

G06F 16/35   Clustering; Classification

G06F 16/93   Document management systems

G06F 40/194   Calculation of difference b...

G06V 30/416   Extracting the logical stru...

G06V 30/418   Document matching, e.g. of ...

H04L 51/216   Handling conversation histo...
