Identification of tables in an unstructured document
Abstract
Some embodiments provide a method for analyzing an unstructured document that includes a number of glyphs. The method identifies boundaries between sets of glyphs. The method identifies that several of the boundaries form a table. The method defines a tabular structural element based on the table. The tabular structural element includes several cells arranged in a plurality of rows and columns, each of which includes an associated set of glyphs.
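The tabular structural element described in the abstract can be pictured as a grid of cells, each carrying its own glyphs. The following is a minimal sketch under simplifying assumptions, not the patented implementation: the `Cell` class and `build_table` function are hypothetical names, cells are assumed to be axis-aligned rectangles whose edges line up exactly, the y coordinate is assumed to grow downward, and cells that span several rows or columns are not handled.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Cell:
    """Hypothetical cell: a rectangle plus the glyphs it contains."""
    left: float
    top: float
    right: float
    bottom: float
    glyphs: List[str] = field(default_factory=list)


def build_table(cells: List[Cell]) -> List[List[Optional[Cell]]]:
    """Arrange cells into a row-major grid keyed by their top and left edges.

    Assumes every cell in a row shares the same top edge and every cell in a
    column shares the same left edge (no tolerance, no spanning cells).
    """
    row_edges = sorted({c.top for c in cells})
    col_edges = sorted({c.left for c in cells})
    grid: List[List[Optional[Cell]]] = [
        [None] * len(col_edges) for _ in row_edges
    ]
    for c in cells:
        grid[row_edges.index(c.top)][col_edges.index(c.left)] = c
    return grid
```

A real document would need a tolerance when comparing edges, since glyph boundaries rarely align to exact coordinates.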
95 Citations
26 Claims
1. A non-transitory machine readable medium storing a program which when executed by at least one processing unit analyzes a document comprising a plurality of primitive elements, the program comprising sets of instructions for:
identifying regions that comprise sets of primitive elements, the regions arranged in a hierarchy such that a first region at a first level of the hierarchy comprising a first set of primitive elements encompasses a second region at a second lower level of the hierarchy comprising a second set of primitive elements, the second set of primitive elements a subset of the first set of primitive elements, the first and second regions linked in the hierarchy;
determining that a particular region at a particular level in the hierarchy does not share a border with any linked region at a higher level than the particular level and encompasses at least two additional linked regions at a lower level in the hierarchy than the particular level;
identifying the particular region as a table and the encompassed regions as cells of the table; and
defining a tabular structural element for the table, the tabular structural element comprising a plurality of cells arranged in a plurality of rows and columns, each cell comprising an associated set of primitive elements.
Dependent claims: 2-20.
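The determination step of claim 1 can be illustrated with a small tree walk. This is a sketch under stated assumptions rather than the claimed implementation: `Region`, `shares_border`, `is_table`, and `find_tables` are hypothetical names; regions are modeled as axis-aligned rectangles; "sharing a border" is approximated as one side of the inner rectangle coinciding with the corresponding side of an enclosing ancestor; and the root region (the page) is assumed never to be a table candidate.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (left, top, right, bottom)


@dataclass(eq=False)  # identity comparison; avoids recursing through parent links
class Region:
    bounds: Rect
    parent: Optional["Region"] = None
    children: List["Region"] = field(default_factory=list)

    def add_child(self, child: "Region") -> "Region":
        """Link a lower-level region beneath this one."""
        child.parent = self
        self.children.append(child)
        return child


def shares_border(inner: Rect, outer: Rect, tol: float = 1e-6) -> bool:
    """Assumption: a border is shared when any corresponding side coincides."""
    return any(abs(a - b) <= tol for a, b in zip(inner, outer))


def ancestors(region: Region) -> Iterator[Region]:
    """Yield every linked region at a higher level than `region`."""
    node = region.parent
    while node is not None:
        yield node
        node = node.parent


def is_table(region: Region) -> bool:
    if region.parent is None:  # assumption: the page itself is not a candidate
        return False
    isolated = not any(
        shares_border(region.bounds, a.bounds) for a in ancestors(region)
    )
    return isolated and len(region.children) >= 2


def find_tables(root: Region) -> List[Region]:
    """Collect every region in the hierarchy that passes the table test."""
    found, stack = [], [root]
    while stack:
        node = stack.pop()
        if is_table(node):
            found.append(node)
        stack.extend(node.children)
    return found
```

Under this reading, the cells of a detected table are not themselves reported as tables when they touch the table's outer border, because they share that border with a higher-level region.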
21. A method for defining a program for (i) analyzing a document comprising a plurality of primitive elements and (ii) generating structural elements that define structure in said document based on said analysis, the method comprising:
defining a module for identifying regions that comprise sets of primitive elements, the regions arranged in a hierarchy such that a first parent region comprising a first set of primitive elements encompasses a second child region comprising a second set of primitive elements, the second set of primitive elements a subset of the first set of primitive elements;
defining a module for determining that a particular region (i) is rectangular, (ii) encompasses at least two additional rectangular child regions, and (iii) only shares borders with the additional rectangular child regions;
defining a module for identifying the particular region as a table and the encompassed child regions as cells of the table; and
defining a module for defining a tabular structural element based on the table, the tabular structural element comprising a plurality of cells arranged in a plurality of rows and columns, each cell comprising an associated set of primitive elements.
Dependent claims: 22, 23.
24. A non-transitory machine readable medium storing a program which when executed by at least one processing unit analyzes a document comprising a plurality of primitive elements and generates structural elements that define structure in the document based on the analysis, the program comprising sets of instructions for:
identifying regions that comprise sets of primitive elements, the regions arranged in a hierarchy such that a first parent region comprising a first set of primitive elements encompasses a second child region comprising a second set of primitive elements, the second set of primitive elements a subset of the first set of primitive elements;
determining that a particular region (i) is rectangular, (ii) encompasses at least two additional rectangular child regions, and (iii) only shares borders with the additional rectangular child regions;
identifying the particular region as a table and the encompassed child regions as cells of the table; and
defining a tabular structural element based on the table, the tabular structural element comprising a plurality of cells arranged in a plurality of rows and columns, each cell comprising an associated set of primitive elements.
Dependent claims: 25, 26.
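The three-part test recited in claims 21 and 24 can be sketched as a single predicate. The names here (`Region`, `border_neighbors`, `is_rectangular`, `is_rectangular_table`) are hypothetical, and the sketch assumes that an earlier analysis step has already recorded, for each region, which other regions it shares borders with. The rectangle check only inspects the vertex set of an axis-aligned outline and ignores vertex order.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass(eq=False)  # regions are compared by identity
class Region:
    outline: List[Point]                      # vertices of the region boundary
    children: List["Region"] = field(default_factory=list)
    # Assumption: filled in by an earlier step that detects shared borders.
    border_neighbors: List["Region"] = field(default_factory=list)


def is_rectangular(region: Region) -> bool:
    """True when the outline is four distinct corners of an axis-aligned box."""
    xs = {x for x, _ in region.outline}
    ys = {y for _, y in region.outline}
    return (
        len(region.outline) == 4
        and len(set(region.outline)) == 4
        and len(xs) == 2
        and len(ys) == 2
    )


def is_rectangular_table(region: Region) -> bool:
    rect_children = [c for c in region.children if is_rectangular(c)]
    return (
        is_rectangular(region)                      # condition (i)
        and len(rect_children) >= 2                 # condition (ii)
        and all(n in rect_children                  # condition (iii)
                for n in region.border_neighbors)
    )
```

Keeping the border relationships as precomputed data separates the geometric detection of shared borders from the structural test, which is a design choice of this sketch rather than something the claims require.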
Specification