Method and system for analyzing the logical structure of a document
First Claim
1. In a method for analyzing the logical structure of a document stored in readable form in a computer system in processing by said computer system,a method for analyzing the logical structure of a document comprising the steps of:
- (a) reading said document into the memory region of said computer system on a line-by-line basis;
(b) determining an attribute of said line comprising a start of a list, a continuation of a list, and the attributes of an ordinary line which is read into said memory region in accordance with a predetermined line attribute determination criterion to allow for the determination of a plurality of attributes and associate with teach determination a cost value representing a measure for the validity of the determination;
(c) in response to the completion of the attribute determination for all of the lines of said document, generating a starting node and then, for the first line of said document, generating as many nodes as the number of attribute determinations of said first line, and, in accordance with a predetermined attribute-based node linking rule, linking said starting node with the nodes generated for said first line;
(d) for the next line, generating as many nodes as the number of attribute determinations of said next line and in accordance with a predetermined node linking rule based on said attributes and cost value linking through links the nodes generated for the line preceding said next line with the nodes generated for said next line, and giving costs to said generated nodes and links;
(e) sequentially performing said step (d) until reaching the last line to construct a directed graph directed from said starting node to the ending node;
(f) traveling said directed graph from said starting node while summing the costs for the nodes and links en route to find a plurality of paths reaching the ending node; and
(g) ranking said plurality of paths found based on the sum of the costs associated with said individual paths.
1 Assignment
0 Petitions
Accused Products
Abstract
An input document is matched with predetermined patterns on a line-by-line basis, whereby it can be assigned a plurality of pairs of attributes and costs. When the process for the whole document is completed, in accordance with a rule specifying the combination of attributes between the adjacent lines, the nodes of a graph are generated, the nodes are linked with each other, and costs are given to the node and links. There is a plurality of paths for traveling the graph from the root node to the final node, and each of them means the interpretation of a possible logical structure of the document. By summing the costs for the traveled nodes and links, a total cost value can be associated with each path, and by prioritizing by this total cost value, a plurality of logical structure interpretations can be sequentially shown from the most plausible path (logical structure interpretation). A chosen logical structure is tagged as required.
-
Citations
20 Claims
-
1. In a method for analyzing the logical structure of a document stored in readable form in a computer system in processing by said computer system,
a method for analyzing the logical structure of a document comprising the steps of: -
(a) reading said document into the memory region of said computer system on a line-by-line basis; (b) determining an attribute of said line comprising a start of a list, a continuation of a list, and the attributes of an ordinary line which is read into said memory region in accordance with a predetermined line attribute determination criterion to allow for the determination of a plurality of attributes and associate with teach determination a cost value representing a measure for the validity of the determination; (c) in response to the completion of the attribute determination for all of the lines of said document, generating a starting node and then, for the first line of said document, generating as many nodes as the number of attribute determinations of said first line, and, in accordance with a predetermined attribute-based node linking rule, linking said starting node with the nodes generated for said first line; (d) for the next line, generating as many nodes as the number of attribute determinations of said next line and in accordance with a predetermined node linking rule based on said attributes and cost value linking through links the nodes generated for the line preceding said next line with the nodes generated for said next line, and giving costs to said generated nodes and links; (e) sequentially performing said step (d) until reaching the last line to construct a directed graph directed from said starting node to the ending node; (f) traveling said directed graph from said starting node while summing the costs for the nodes and links en route to find a plurality of paths reaching the ending node; and (g) ranking said plurality of paths found based on the sum of the costs associated with said individual paths. - View Dependent Claims (2, 3, 4, 5)
-
-
6. In a system for analyzing the logical structure of a document stored in readable form in a computer system in processing by said computer,
a system for analyzing the logical structure of a document comprising: -
(a) a memory means for storing a document in a form readable by said computer system on a line-by-line basis; (b) a reading means for reading out said document from said memory means on a line-by-line basis; (c) a text format analyzing means having a plurality of patterns and a pair of at least one attribute and a cost value associated with said individual patterns, having a pattern expression dictionary in which at least two or more pairs of an attribute and a cost value are associated with one pattern, and applying said plurality of patterns to the individual lines of said document read out by said reading means and in response to the occurrence of pattern matching, associating and storing all of the pairs of the attributes and cost values associated with the pattern for which said matching has occurred; (d) a graph generating means for performing the node generation of a graph and the linking between the nodes in accordance with said attributes and cost values and previously having a rule for giving costs to said nodes and internode links, and in response to the completion of the processing for all of the lines of said document by said text format analyzing means, applying said rule from the first line of said document to link the nodes generated for the preceding line with the nodes generated for the next line, thereby generating a directed graph; (e) a graph analyzing means for traveling said generated directed graph from the starting node until reaching the ending node while summing the costs given to said nodes and links to find a plurality of paths, and ranking said paths found based on the cost values calculated for them; and (f) a means for giving tags indicating the logical structure of the document based on said ranked paths of the graph. - View Dependent Claims (7, 8, 9)
-
-
10. In a method for analyzing the logical structure of a document printed on paper in a predetermined format by a computer system processing,
a method for analyzing the logical structure of a document comprising the steps of: -
(a) optically scanning the paper on which said document is printed, thereby to create a document image which can be processed by said computer system; (b) analyzing said document image to determine the margins and paragraphs of the document; (c) performing character column slicing and character recognition based on the determination of the margins and paragraphs of said document, and storing the result in a memory means in a form which can be processed by said computer system on a line-by-line basis for the document; (d) reading said document on a line-by-line basis into the memory region of said compute system; (e) determining attributes of line comprising the start of a list, the continuation of a list, and the attributes of an ordinary line read into said memory region in accordance with a predetermined line attribute determination criterion to allow for the determining of a plurality of attributes and associate with each determination a cost value indicating the validity of the determination; (f) in response to the completion of the attribute determination for all of the lines of said document, generating a starting node and then, for the first line of said document, generating as many nodes as the number of the attribute determinations of said first line, and, in accordance with a predetermined attribute-based node linking rule, linking said starting node with the nodes generated for said first line; (g) for the next line, generating as many nodes as the number of attribute determinations of said next line, and, in accordance with a predetermined node linking rule based on said attribute and cost value, linking through links the nodes generated for the line preceding said next line with the nodes generated for said next line, and giving costs to said generated nodes and links; (h) sequentially performing said step (g) until reaching the last line to construct a directed graph directed from said starting node t the ending node; (i) traveling said directed graph from said starting node while summing the costs for the nodes an links en route to find a plurality of paths reaching the ending node; and (j) ranking said plurality of paths found based on the sum of the costs associated with said individual paths. - View Dependent Claims (11, 12, 13, 14, 15, 16)
-
-
17. In a system for analyzing the logical structure of a document printed on paper in a predetermined format by a computer system processing,
a system for analyzing the logical structure of a document comprising: -
(a) an image scanner means for optically scanning the paper on which said document is printed, thereby to create a document image which can be processed by said computer system; (b) a means for analyzing said document image to determine the margins and paragraphs of the document, and storing the result in a form which can be processed by said computer system; (c) a means for performing character column slicing and character recognition based on the determination of the margins and paragraphs of said document, and storing the result in the memory means in a form which can be processed by said computer system on a line-by-line basis for the document; (d) a reading means for said document form said memory means on a line-by-line basis; (e) a text format analyzing means having a plurality of patterns and a pair of at least one attribute and a cost value associated with said individual patterns, having a pattern expression dictionary in which at least tow or more pairs of an attribute and a cost value are associated with at least one pattern, and applying said plurality of patterns to the individual lines of said document read out by said reading means and, in response to the occurrence of a pattern match, associating and storing all of the pairs of attribute and cost values associated with the pattern for which said match has occurred; (f) a graph generating means for performing the node generation of a graph and the linking between the nodes in accordance with said attributes and cost values and previously having a rule for giving costs to aid nodes and internode links, and, in response to the completion of the processing for all of the liens of said document by said text format analyzing means, applying said rule from the first line of said document to link the nodes generated for the preceding line with the nodes generated for the next line, thereby to generate a directed graph; (g) a graph analyzing means for traveling said generated directed graph from the starting node until reaching the ending node while summing the costs given to said nodes and links to find a plurality of paths, and ranking said paths found based on the cost values calculated for them; and (h) means for giving tags indicating the logical structure of the document based on said ranked paths of the graph. - View Dependent Claims (18, 19, 20)
-
Specification