Method and system for constructing a document redundancy graph

US 8,914,720 B2
Filed: 07/31/2009
Issued: 12/16/2014
Est. Priority Date: 07/31/2009
Status: Active Grant

Abstract

A system and method for constructing a document redundancy graph with respect to a document set. The redundancy graph can be constructed with a node for each paragraph associated with the document set such that each node in the redundancy graph represents a unique cluster of information. The nodes can be linked in an order with respect to the information provided in the document set and bundles of redundant information from the document set can be mapped to individual nodes. A data structure (e.g., a hash table) of a paragraph identifier associated with a probability value can be constructed for eliminating inconsistencies with respect to node redundancy. Additionally, a sequence of unique nodes can also be integrated into the graph construction process. The nodes can be connected to the paragraphs associated with the document set via a hyperlink and/or via a label with respect to each node.
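
As an illustration only, the starting point described in the abstract (one node per paragraph, nodes linked in document order) might be sketched in Python as follows. The names `Node`, `build_initial_nodes`, the `p1`, `p2`, ... identifier scheme, and the input shape are assumptions made for this sketch and do not come from the patent.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Node:
    """One graph node: a cluster of paragraphs that share content (hypothetical type)."""
    paragraph_ids: set = field(default_factory=set)  # paragraphs mapped to this node
    label: str = ""  # e.g. an identifier, a summary, or the paragraph text


def build_initial_nodes(documents):
    """Create one node per paragraph and link nodes in document order.

    `documents` maps a document name to its list of paragraph strings.
    Returns (nodes, edges, paragraphs): nodes keyed by paragraph id,
    edges as (from_id, to_id) pairs, and paragraphs as id -> (document, text).
    """
    next_id = count(1)
    nodes, edges, paragraphs = {}, [], {}
    for doc_name, texts in documents.items():
        previous = None
        for text in texts:
            pid = f"p{next(next_id)}"  # unique paragraph identifier (assumed format)
            paragraphs[pid] = (doc_name, text)
            nodes[pid] = Node({pid}, label=pid)
            if previous is not None:
                edges.append((previous, pid))  # order within the same document
            previous = pid
    return nodes, edges, paragraphs
```

Before any merging, every node holds exactly one paragraph; redundant nodes are only identified and combined in later steps.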

18 Claims

1. A method for constructing a document redundancy graph, said method comprising:
- representing each paragraph associated with a document set as a node among a plurality of nodes, wherein each node among said plurality of nodes with respect to said redundancy graph represents a unique cluster of information related to said each paragraph;
  
  providing said each paragraph with a unique paragraph identifier;
  
  constructing a hash table of all paragraph identifiers comprising identifiers of all paragraphs reachable from said each paragraph;
  
  merging said plurality of nodes associated with redundant information by configuring said hash table with respect to a pair of paragraph identifiers in association with a probability value, wherein said probability value sorts a plurality of information matches in an order of decreasing certainty of common content, wherein a pair of said paragraph identifiers associated with an increased certainty of common content are selected to merge; and
  
  combining said plurality of nodes unique to a single document by expressing a pair of nodes with overlapping common content as a combined node, wherein said combined node comprises an empty intersection of said pair of nodes and comparing each paragraph identifier among said pair of paragraph identifiers to a probability value associated with an entry in said hash table in an order wherein said hash table eliminates inconsistency associated with said plurality of information matches.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
  - 2. The method of claim 1 further comprising configuring at least one paragraph identifier among said pair of paragraph identifiers to include a list of identifiers associated with at least one information element.
  - 3. The method of claim 1 wherein merging said plurality of nodes associated with said redundant information further comprises:
    - combining said plurality of nodes into a single node if an intersection of said document set reachable from each node is empty.
  - 4. The method of claim 1 wherein merging said plurality of nodes associated with said redundant information further comprises:
    - updating said hash table that describes information combinations after combining a pair of nodes.
  - 5. The method of claim 1 wherein combining said plurality of nodes unique to said single document further comprises:
    - setting a flag to indicate said node is a combined node if said hash table comprises said node.
  - 6. The method of claim 1 wherein combining said plurality of nodes unique to said single document further comprises:
    - initiating a chain node if said node follows said combined node by checking said flag in order to thereafter clear said flag.
  - 7. The method of claim 6 wherein combining said plurality of nodes unique to said single document further comprises:
    - adding said node to said chain node if said paragraph does not follow said combined node.
  - 8. The method of claim 6 further comprising adding an edge to said redundant graph for every transition from said chain node to said combined node and vice versa.
  - 9. The method of claim 1 further comprising linking said plurality of nodes with respect to said at least one paragraph via a hyperlink.
  - 10. The method of claim 1 further comprising linking said plurality of nodes with respect to said at least one paragraph via a label.
  - 11. The method of claim 10 wherein said label comprises at least one of the following types of data:
    - a cryptic paragraph identifier;
      
      a summary associated with said paragraph;
      
      or a paragraph content.
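
One possible reading of the merging step in claims 1, 3 and 4 is sketched below in Python. It is not the patented implementation. The function name `merge_redundant`, the `(probability, id_a, id_b)` match format, and the interpretation of the "empty intersection" condition as "the two clusters share no source document" are all assumptions made for illustration.

```python
def merge_redundant(paragraph_doc, matches):
    """Merge paragraph clusters, most certain match first.

    paragraph_doc: dict mapping paragraph id -> document name.
    matches: iterable of (probability, id_a, id_b) candidate matches.
    Returns a dict mapping every paragraph id to the frozenset of ids
    in its final cluster.
    """
    # Hash table: paragraph id -> set of all paragraph ids in the same cluster.
    cluster = {pid: {pid} for pid in paragraph_doc}

    # Visit matches in order of decreasing certainty of common content.
    for _prob, a, b in sorted(matches, key=lambda m: m[0], reverse=True):
        ca, cb = cluster[a], cluster[b]
        if ca is cb:
            continue  # already merged through an earlier, more certain match
        docs_a = {paragraph_doc[p] for p in ca}
        docs_b = {paragraph_doc[p] for p in cb}
        if docs_a & docs_b:
            continue  # clusters share a document: treated as an inconsistent match
        merged = ca | cb
        for p in merged:
            cluster[p] = merged  # update the table after combining the pair
    return {pid: frozenset(ids) for pid, ids in cluster.items()}
```

Processing the most certain matches first means that a weaker match which would place two paragraphs of the same document in one node is rejected rather than overriding an earlier merge.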

12. A system for constructing a document redundancy graph, said system comprising:
- a processor;
  
  a data bus coupled to said processor; and
  
  a computer-usable mass storage device embodying computer code, said computer-usable medium being coupled to said data bus, said computer program code comprising instructions executable by said processor and configured for;
  
  representing each paragraph associated with a document set as a node among a plurality of nodes, wherein each node among said plurality of nodes with respect to said redundancy graph represents a unique cluster of information related to said each paragraph;
  
  providing said each paragraph with a unique paragraph identifier;
  
  constructing a hash table of all paragraph identifiers comprising identifiers of all paragraphs reachable from said each paragraph;
  
  merging said plurality of nodes associated with redundant information by configuring said hash table with respect to a pair of paragraph identifiers in association with a probability value, wherein said probability value sorts a plurality of information matches in an order of decreasing certainty of common content, wherein a pair of said paragraph identifiers associated with an increased certainty of common content are selected to merge; and
  
  combining said plurality of nodes unique to a single document by expressing a pair of nodes with overlapping common content as a combined node, wherein said combined node comprises an empty intersection of said pair of nodes and comparing each paragraph identifier among said pair of paragraph identifiers to a probability value associated with an entry in said hash table in an order wherein said hash table eliminates inconsistency associated with said plurality of information matches.
- Dependent claims (13, 14, 15, 16, 17)
  - 13. The system of claim 12 wherein said instructions are further configured for modifying at least one paragraph identifier among said pair of paragraph identifiers to include a list of identifiers associated with at least one information element.
  - 14. The system of claim 12 wherein said instructions are further configured for adding an edge to said redundant graph for every transition from said chain node to said combined node and vice versa.
  - 15. The system of claim 12 wherein said instructions are further configured for linking said plurality of nodes with respect to said at least one paragraph via a hyperlink.
  - 16. The system of claim 12 wherein said instructions are further configured for linking said plurality of nodes with respect to said at least one paragraph via a label.
  - 17. The system of claim 16 wherein said label comprises at least one of the following types of data:
    - a cryptic paragraph identifier;
      
      a summary associated with said paragraph;
      
      or a paragraph content.
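
The chain-node handling recited in claims 5 to 8 and 14 could be pictured roughly as in the Python sketch below. This is an interpretation under stated assumptions, not the patented procedure: the function name `build_chains`, the tuple representation of nodes, the initial value of the flag, and the choice to record an edge for every transition between consecutive nodes (including combined to combined) are all hypothetical.

```python
def build_chains(doc_paragraphs, combined):
    """Group runs of paragraphs unique to one document into chain nodes.

    doc_paragraphs: paragraph ids of a single document, in reading order.
    combined: dict mapping paragraph id -> id of the combined node it was
              merged into (only merged paragraphs appear as keys).
    Returns (nodes, edges): the ordered nodes visited by the document and
    the transitions between consecutive nodes.
    """
    sequence = []
    after_combined = True  # flag; starts set so the first unique paragraph opens a chain
    for pid in doc_paragraphs:
        if pid in combined:
            sequence.append(("combined", combined[pid]))
            after_combined = True  # set the flag: this node is a combined node
        elif after_combined:
            sequence.append(("chain", [pid]))  # start a new chain node
            after_combined = False  # clear the flag
        else:
            sequence[-1][1].append(pid)  # extend the current chain node
    # Freeze chain contents so nodes can be compared and used in edge pairs.
    nodes = [(kind, tuple(v) if kind == "chain" else v) for kind, v in sequence]
    edges = list(zip(nodes, nodes[1:]))
    return nodes, edges
```

For a document whose third paragraph was merged into a combined node, the first two paragraphs collapse into one chain node, the remaining unique paragraphs into another, and two edges connect them through the combined node.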

18. A computer-usable mass storage for constructing a document redundancy graph, said computer-usable mass storage storing computer program code, said computer program code comprising program instructions executable by a processor, said program instructions comprising:
- program instructions to represent each paragraph associated with a document set as a node among a plurality of nodes, wherein each node among said plurality of nodes with respect to said redundancy graph represents a unique cluster of information related to said each paragraph;
  
  program instructions to provide said each paragraph with a unique paragraph identifier;
  
  program instructions to construct a hash table of all paragraph identifiers comprising identifiers of all paragraphs reachable from said each paragraph;
  
  program instructions to merge said plurality of nodes associated with redundant information by configuring said hash table with respect to a pair of paragraph identifiers in association with a probability value, wherein said probability value sorts a plurality of information matches in an order of decreasing certainty of common content, wherein a pair of said paragraph identifiers associated with an increased certainty of common content are selected to merge; and
  
  program instructions to combine said plurality of nodes unique to a single document by expressing a pair of nodes with overlapping common content as a combined node, wherein said combined node comprises an empty intersection of said pair of nodes and comparing each paragraph identifier among said pair of paragraph identifiers to a probability value associated with an entry in said hash table in an order wherein said hash table eliminates inconsistency associated with said plurality of information matches.

Current Assignee: Xerox Corporation (Xerox Holdings Corp.)
Original Assignee: Xerox Corporation (Xerox Holdings Corp.)
Inventors: Harrington, Steven J.
Primary Examiner(s): Paula, Cesar
Assistant Examiner(s): Cortes, Howard
Application Number: US12/533,901
Publication Number: US 20110029952A1
Time in Patent Office: 1,964 Days
Field of Search: 715/254, 715/204
US Class Current: 715/254
CPC Class Codes:
G06F 16/345   Summarisation for human users
G06F 40/131   Fragmentation of text files...
G06F 40/194   Calculation of difference b...
