System for document layout analysis
Abstract
The present invention is a system for providing information on the structure of a document page so as to complement the textual information provided in an optical character recognition system. The system employs a method that can be used to produce a file editable in a native word-processing environment from input data including the content and characteristics of regions of at least one page forming the document. The method includes the steps of: (a) identifying sections within the page; (b) identifying captions; (c) determining boundaries of at least one column on the page; and, optionally, (d) resizing at least one element of the page of the document so that all pages of the document are of a common size.
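Optional step (d) can be illustrated with a minimal sketch. The abstract does not say how elements are resized, so the uniform scale factor used here, along with the `Region` type and the `scale_to_common_size` function, are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical bounding box of a page element, in page units."""
    x: float
    y: float
    width: float
    height: float


def scale_to_common_size(regions, page_size, target_size):
    """Scale every region from page_size to target_size.

    Both sizes are (width, height) tuples. A single uniform factor is
    used (an assumption) so that elements keep their aspect ratio.
    """
    scale = min(target_size[0] / page_size[0], target_size[1] / page_size[1])
    return [
        Region(r.x * scale, r.y * scale, r.width * scale, r.height * scale)
        for r in regions
    ]


# Example: bring a 1000 x 2000 page down to a common 500 x 1000 size.
scaled = scale_to_common_size([Region(100, 200, 300, 50)], (1000, 2000), (500, 1000))
```

Taking the smaller of the two axis ratios guarantees that the scaled elements still fit inside the common page size.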
9 Claims
1. A document layout analysis method for determining document structure data from input data including the content and characteristics of regions of at least one page forming the document, the method including the steps of:
- identifying particular types of sections within the page, wherein the step of identifying sections within the page further comprises the steps of identifying rows and columns within a table, identifying headers and footers on a page, and classifying the headers and footers based upon a horizontal location;
- identifying captions, wherein the step of identifying captions further comprises identifying an association between a non-text region of the page and the caption identified; and
- determining boundaries of at least one column on the page, wherein the step of determining boundaries of at least one column on the page further comprises the steps of identification of text regions, grouping of text regions on the page into sections, grouping of the text regions within a section into the at least one column, determining a column width for the at least one column on the page, and determining, using the column width, the number of columns in each section.
Dependent claims: 2, 3, 4, 5, 6
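The column-boundary step of claim 1 can be sketched as follows. The claim names the sub-steps (grouping text regions into columns, determining a column width, and deriving the number of columns from that width) but not how each is computed. The left-edge grouping, the median width, the gutter handling, and all names below (`TextRegion`, `group_into_columns`, `column_width`, `count_columns`) are therefore assumptions, not the patented procedure.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TextRegion:
    """Hypothetical text region described by its bounding box."""
    left: float
    right: float
    top: float
    bottom: float


def group_into_columns(regions, tolerance=10.0):
    """Group regions whose left edges lie within `tolerance` of each other."""
    columns = []
    for region in sorted(regions, key=lambda r: r.left):
        if columns and abs(region.left - columns[-1][0].left) <= tolerance:
            columns[-1].append(region)
        else:
            columns.append([region])
    return columns


def column_width(columns):
    """Median horizontal extent over the grouped columns."""
    return median(
        max(r.right for r in col) - min(r.left for r in col) for col in columns
    )


def count_columns(section_width, width, gutter=0.0):
    """Estimate how many columns of `width` fit in the section."""
    return max(1, round((section_width + gutter) / (width + gutter)))


# Example: a 600-unit-wide section holding two 290-unit columns.
regions = [
    TextRegion(0, 290, 0, 100), TextRegion(2, 288, 110, 200),
    TextRegion(310, 600, 0, 100), TextRegion(312, 598, 110, 200),
]
columns = group_into_columns(regions)
width = column_width(columns)
n_columns = count_columns(600, width, gutter=20)
```

Deriving the count from the measured width, rather than from the number of groups found, mirrors the claim's wording ("determining, using the column width, the number of columns in each section").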
7. A document analysis method comprising the steps of:
- digitizing, using an image input terminal, a document to produce rasterized data representing the document;
- processing the rasterized data representing the document with a character recognition engine, wherein the character recognition engine operates to identify characters and characteristics for each of a plurality of words identified within the rasterized data, and where output of the character recognition engine includes descriptions of characters, fonts, and locations on a page of the document; and
- determining document structure data from output of the character recognition engine including the content and characteristics of regions of at least one page forming the document, said step of determining document structure data including the steps of
  - identifying sections within the page, wherein the step of identifying sections within the page further comprises the steps of identifying rows and columns within a table, identifying headers and footers on a page, and classifying the headers and footers based upon a horizontal location,
  - identifying captions, wherein the step of identifying captions further comprises the steps of training a neural net with a predefined set of features, employing a trained neural net to generate an algorithm based upon the predefined set of features, and applying the algorithm generated by the neural net to the input data to identify text regions determined to be captions; and
  - determining boundaries of at least one column on the page, wherein the step of determining boundaries further comprises the steps of identification of text regions, grouping of text regions on the page into sections, grouping of the text regions within a section into the at least one column, determining a column width for the at least one column on the page, and determining, using the column width, the number of columns in each section.
Dependent claims: 8, 9
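Both independent claims recite classifying headers and footers "based upon a horizontal location" without defining the classes. The sketch below assumes three classes (left, center, right) decided by which third of the page contains the region's midpoint. The function name `classify_horizontal` and the thirds rule are illustrative assumptions only.

```python
def classify_horizontal(left, right, page_width):
    """Label a header or footer region by the page third holding its midpoint."""
    midpoint = (left + right) / 2
    if midpoint < page_width / 3:
        return "left"
    if midpoint > 2 * page_width / 3:
        return "right"
    return "center"


# Example on a 600-unit-wide page.
labels = [
    classify_horizontal(50, 150, 600),   # midpoint 100
    classify_horizontal(250, 350, 600),  # midpoint 300
    classify_horizontal(450, 580, 600),  # midpoint 515
]
```

A midpoint test is only one possible rule. Comparing a region's edges to the page margins would also distinguish, for example, a left-aligned running title from a centered one.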
Specification