Identification of tables in an unstructured document

  • US 8,443,278 B2
  • Filed: 06/07/2009
  • Issued: 05/14/2013
  • Est. Priority Date: 01/02/2009
  • Status: Active Grant
First Claim

1. A non-transitory machine readable medium storing a program which when executed by at least one processing unit analyzes a document comprising a plurality of primitive elements, the program comprising sets of instructions for:

  • identifying regions that comprise sets of primitive elements, the regions arranged in a hierarchy such that a first region at a first level of the hierarchy comprising a first set of primitive elements encompasses a second region at a second lower level of the hierarchy comprising a second set of primitive elements, the second set of primitive elements a subset of the first set of primitive elements, the first and second regions linked in the hierarchy;

  • determining that a particular region at a particular level in the hierarchy does not share a border with any linked region at a higher level than the particular level and encompasses at least two additional linked regions at a lower level in the hierarchy than the particular level;

  • identifying the particular region as a table and the encompassed regions as cells of the table; and

  • defining a tabular structural element for the table, the tabular structural element comprising a plurality of cells arranged in a plurality of rows and columns, each cell comprising an associated set of primitive elements.
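
The steps recited in the claim can be illustrated with a small sketch. This is not the patented implementation. It is a minimal Python illustration under several assumptions: regions are modeled as axis-aligned rectangles, "sharing a border" is taken to mean that one side of a region coincides with the same side of an enclosing region, and cells are placed into rows and columns by their top and left coordinates. The names `Region`, `shares_border`, `is_table`, and `build_table` are hypothetical and do not appear in the patent.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Region:
    # Hypothetical model: bounds are (left, top, right, bottom).
    bounds: Tuple[float, float, float, float]
    elements: List[str] = field(default_factory=list)   # primitive elements
    children: List["Region"] = field(default_factory=list)
    parent: Optional["Region"] = None

    def add(self, child: "Region") -> "Region":
        """Link a lower-level region beneath this one in the hierarchy."""
        child.parent = self
        self.children.append(child)
        return child


def ancestors(region: Region) -> Iterator[Region]:
    """Yield every linked region at a higher level than `region`."""
    node = region.parent
    while node is not None:
        yield node
        node = node.parent


def shares_border(inner: Region, outer: Region, tol: float = 1e-6) -> bool:
    # Assumption: for nested rectangles, a shared border means that one
    # side (left, top, right, or bottom) lies at the same coordinate.
    return any(abs(a - b) <= tol for a, b in zip(inner.bounds, outer.bounds))


def is_table(region: Region) -> bool:
    """Apply the two conditions of the 'determining' step."""
    if len(region.children) < 2:
        return False
    return not any(shares_border(region, a) for a in ancestors(region))


def build_table(region: Region) -> List[List[Optional[List[str]]]]:
    """Arrange the encompassed regions into rows and columns of cells."""
    tops = sorted({c.bounds[1] for c in region.children})
    lefts = sorted({c.bounds[0] for c in region.children})
    grid: List[List[Optional[List[str]]]] = [[None] * len(lefts) for _ in tops]
    for cell in region.children:
        row = tops.index(cell.bounds[1])
        col = lefts.index(cell.bounds[0])
        grid[row][col] = cell.elements
    return grid


# Example hierarchy: a page containing one 2x2 grid of regions.
page = Region((0, 0, 100, 100))
table = page.add(Region((10, 10, 90, 50)))
table.add(Region((10, 10, 50, 30), ["A"]))
table.add(Region((50, 10, 90, 30), ["B"]))
table.add(Region((10, 30, 50, 50), ["C"]))
table.add(Region((50, 30, 90, 50), ["D"]))

tables = [r for r in (page, table, *table.children) if is_table(r)]
```

In this example only the inner region qualifies: the page has a single child, and each cell has no children and also shares a side with the region that encloses it. Note that a top-level region with two or more children would pass this simplified check because it has no higher-level regions to share a border with; a fuller treatment would need to handle that case.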
