Method for graph-based table recognition
Abstract
The present invention is a method for bottom-up recognition of tables within a document. The method is based on the graph-rewriting paradigm. First, the document image is transformed into a layout graph whose nodes and edges represent document entities and their interrelations, respectively. This graph is subsequently rewritten using a set of rules designed from a priori document knowledge and general formatting conventions. The resulting graph provides a logical view of the document content and can be parsed to provide general format analysis information.
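As an illustration of what a layout graph might look like, the following minimal Python sketch builds one from bounding boxes. The `Entity` fields, the relation names `left_of` and `above`, and the overlap test are assumptions made for this example and are not taken from the patent. For brevity the sketch connects every qualifying pair of entities, not only nearest neighbors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A segmented document entity with a bounding box (hypothetical fields)."""
    name: str
    x0: float
    y0: float
    x1: float
    y1: float


def overlaps(a0, a1, b0, b1):
    """True if the 1-D intervals (a0, a1) and (b0, b1) overlap."""
    return a0 < b1 and b0 < a1


def build_layout_graph(entities):
    """Return (nodes, edges); each edge is (source, relation, target)."""
    edges = []
    for a in entities:
        for b in entities:
            if a is b:
                continue
            # b starts to the right of a and shares some vertical extent
            if b.x0 >= a.x1 and overlaps(a.y0, a.y1, b.y0, b.y1):
                edges.append((a.name, "left_of", b.name))
            # b starts below a and shares some horizontal extent
            if b.y0 >= a.y1 and overlaps(a.x0, a.x1, b.x0, b.x1):
                edges.append((a.name, "above", b.name))
    return [e.name for e in entities], edges


# A toy 2 x 2 grid of text blocks (y grows downward, as in image coordinates).
cells = [
    Entity("c00", 0, 0, 40, 10), Entity("c01", 50, 0, 90, 10),
    Entity("c10", 0, 20, 40, 30), Entity("c11", 50, 20, 90, 30),
]
nodes, edges = build_layout_graph(cells)
```

In this toy grid the graph contains only row and column adjacencies, which is the kind of regular structure a table-oriented rewriting rule could later look for.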
22 Claims
1. A graph-based method for recognizing tables present in a document represented as digital image data, including the steps of:
- segmenting the digital image data to identify textual and image entities within the document;
- building a layout graph of the document using the entities;
- automatically tagging each of the text entities with a label from a document node alphabet to produce a labeled graph;
- automatically rewriting the labeled graph, by manipulating the text entities therein, using at least one rewriting rule, to identify the logical structure of the document, wherein the logical structure includes an identified table and the labeled graph is a host graph and where the step of automatically rewriting the labeled graph by manipulating the text entities comprises the steps of identifying at least two entities in the layout graph that are isomorphic instances of subgraphs, and replacing, in the host graph, the at least two entities that are isomorphic instances of the subgraphs with a nonterminal entity.
Dependent claims: 2-20.
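The replacement step recited in claim 1 can be pictured with a small Python sketch. It is a simplification under stated assumptions: the pattern is restricted to a single labeled edge instead of an arbitrary subgraph, and the labels `cell` and `row`, the relation `left_of`, and the function names are hypothetical, not taken from the patent.

```python
def rewrite_once(labels, edges, pattern, nonterminal):
    """Apply one rewriting step to the host graph.

    labels: dict mapping node -> label; edges: set of (src, relation, dst).
    pattern: (src_label, relation, dst_label), a single-edge subgraph.
    Returns (labels, edges, changed).
    """
    src_label, relation, dst_label = pattern
    for (u, rel, v) in sorted(edges):
        if rel == relation and labels[u] == src_label and labels[v] == dst_label:
            new = f"{nonterminal}({u},{v})"
            # Remove the matched entities and add the nonterminal entity.
            new_labels = {n: l for n, l in labels.items() if n not in (u, v)}
            new_labels[new] = nonterminal
            # Redirect edges that touched the match to the nonterminal.
            new_edges = set()
            for (a, r, b) in edges:
                a2 = new if a in (u, v) else a
                b2 = new if b in (u, v) else b
                if a2 != b2:  # drop edges internal to the match
                    new_edges.add((a2, r, b2))
            return new_labels, new_edges, True
    return labels, edges, False


def rewrite(labels, edges, rules):
    """Apply rules until none matches; each step removes one node, so it ends."""
    changed = True
    while changed:
        changed = False
        for pattern, nonterminal in rules:
            labels, edges, changed = rewrite_once(labels, edges, pattern, nonterminal)
            if changed:
                break
    return labels, edges


# Three cells in a line are folded, bottom-up, into a single "row" nonterminal.
host_labels = {"a": "cell", "b": "cell", "c": "cell"}
host_edges = {("a", "left_of", "b"), ("b", "left_of", "c")}
rules = [
    (("cell", "left_of", "cell"), "row"),
    (("row", "left_of", "cell"), "row"),
]
final_labels, final_edges = rewrite(host_labels, host_edges, rules)
```

A fuller implementation would need general subgraph matching in place of the single-edge test used here.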
21. A graph-based method for recognizing tables present in a document represented as digital image data, including the steps of:
- segmenting the digital image data to identify textual and image entities within the document;
- building a layout graph of the document using the entities;
- automatically tagging each of the text entities with a label from a document node alphabet to produce a labeled graph;
- automatically rewriting the labeled graph, by manipulating the text entities therein, using at least one rewriting rule, to identify the logical structure of the document, wherein the logical structure includes an identified table and where the rewriting rule comprises a production rule, an embedding function, an attribute transfer function, and an application condition.
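One way to picture the four components of a rewriting rule named in claim 21 is as fields of a small record. The Python sketch below is illustrative only: the `RewriteRule` class, its field names, the helper functions, and the bounding-box attributes are assumptions for this example and do not come from the patent.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RewriteRule:
    lhs: tuple           # production rule: labels of the matched nodes
    rhs: str             # production rule: nonterminal label replacing them
    condition: Callable  # application condition: matched nodes -> bool
    transfer: Callable   # attribute transfer: matched nodes -> new attributes
    embed: Callable      # embedding: (edge, match, new_id) -> edge or None


def apply_rule(rule, nodes, edges, match):
    """nodes: id -> {"label": ..., "box": (x0, y0, x1, y1)}; match: node ids."""
    matched = [nodes[m] for m in match]
    if tuple(n["label"] for n in matched) != rule.lhs:
        return None
    if not rule.condition(matched):
        return None
    new_id = "+".join(match)
    new_nodes = {k: v for k, v in nodes.items() if k not in match}
    new_nodes[new_id] = {"label": rule.rhs, **rule.transfer(matched)}
    new_edges = set()
    for e in edges:
        e2 = rule.embed(e, match, new_id)
        if e2 is not None:
            new_edges.add(e2)
    return new_nodes, new_edges


def same_row(matched):
    """Condition: the two boxes overlap vertically."""
    a, b = (n["box"] for n in matched)
    return a[1] < b[3] and b[1] < a[3]


def union_box(matched):
    """Attribute transfer: the new node covers all matched boxes."""
    boxes = [n["box"] for n in matched]
    return {"box": (min(b[0] for b in boxes), min(b[1] for b in boxes),
                    max(b[2] for b in boxes), max(b[3] for b in boxes))}


def redirect(edge, match, new_id):
    """Embedding: reattach outside edges to the new node, drop internal ones."""
    a, r, b = edge
    a2 = new_id if a in match else a
    b2 = new_id if b in match else b
    return None if a2 == b2 else (a2, r, b2)


rule = RewriteRule(("cell", "cell"), "row", same_row, union_box, redirect)
nodes = {
    "t": {"label": "title", "box": (0, 0, 90, 10)},
    "c1": {"label": "cell", "box": (0, 20, 40, 30)},
    "c2": {"label": "cell", "box": (50, 20, 90, 30)},
}
edges = {("c1", "left_of", "c2"), ("t", "above", "c1"), ("t", "above", "c2")}
new_nodes, new_edges = apply_rule(rule, nodes, edges, ("c1", "c2"))
```

Keeping the condition, attribute transfer, and embedding as separate callables mirrors the claim's four-part description of a rule, although the patent may define each part differently.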
22. An automated document structure recognition system for performing a graph-based method for recognizing tables present in a document represented as digital image data, including:
- a segmentation subsystem for segmenting the digital image data to identify textual and image entities within the document;
- a graph construction module for building a layout graph of the document using the entities;
- an entity recognition subsystem for automatically tagging each of the text entities within the layout graph with a label from a document node alphabet to produce a labeled graph; and
- a graph rewriting system for automatically rewriting the labeled graph, by manipulating the text entities therein, using at least one rewriting rule, to identify the logical structure of the document, wherein the logical structure includes an identified table, wherein the labeled graph is a host graph and where the rewriting system identifies at least two entities in the layout graph that are isomorphic instances of subgraphs, and replaces, in the host graph, the at least two entities that are isomorphic instances of the subgraphs with a nonterminal entity.
Specification