Document layout extraction

  • US 8,250,469 B2
  • Filed: 12/03/2007
  • Issued: 08/21/2012
  • Est. Priority Date: 12/03/2007
  • Status: Active Grant
First Claim
1. One or more computer-readable storage media comprising a memory and computer-executable instructions embodied thereon that, when executed, perform a method for extracting information from a document in an electronic format to produce a representation containing structure and layout metadata, the method comprising:

  • receiving one or more textual data in the electronic format, the textual data in the electronic format including a set of available structure and layout information;

    converting the textual data from the electronic format to a generic, independent interface format, different from the electronic format by extracting the set of available structure and layout information from the textual data in the electronic format, the independent interface format including coordinates to one or more structural elements of the textual data and the independent interface format enables common analysis procedures to be carried out on textual data received in a variety of electronic formats;

    performing a structure and layout analysis of the textual data in the independent interface format to generate a set of additional structure and layout information by:

    having a set of specialized sub engines for extracting metadata;

    applying a subset of the set of specialized sub engines to the textual data of the independent interface format, each of the set of specialized sub engines extracting metadata from the textual data of the independent interface format;

    the subset of the set of specialized sub engines extracting only metadata for generating the set of additional structure and layout information that is not already identified in the set of available structure and layout information from the electronic format, thereby avoiding redundant extraction of metadata;

    converting the independent interface format into an enriched interface format by integrating the additional structure and layout information that includes the metadata extracted from each of the specialized sub engines into the available structure and layout information of the independent interface format, and storing the textual data, the additional structure and layout information, and the extracted set of available structure and layout information in the enriched interface format that is different from both the electronic format and the independent interface format, and the enriched interface format providing for search and navigation of the textual data.
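
The steps recited in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation; every name in it (`Element`, `InterfaceDoc`, `EnrichedDoc`, `SUB_ENGINES`, `detect_title`, `detect_reading_order`, `enrich`) is hypothetical, and the two sub engines are toy heuristics chosen only to show the control flow. The sketch assumes the source format has already been converted into a list of text elements with bounding-box coordinates, then runs only those sub engines whose metadata is not already available and merges the result into a separate enriched structure.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Element:
    """One structural element with its coordinates (hypothetical layout)."""
    text: str
    bbox: Tuple[float, float, float, float]  # x0, y0, x1, y1; y grows downward


@dataclass
class InterfaceDoc:
    """Stand-in for the format-independent interface representation."""
    elements: List[Element]
    available: Dict[str, object] = field(default_factory=dict)


@dataclass
class EnrichedDoc:
    """Stand-in for the enriched representation (a distinct type on purpose)."""
    elements: List[Element]
    available: Dict[str, object]
    additional: Dict[str, object]


def detect_title(doc: InterfaceDoc) -> str:
    # Toy heuristic: treat the tallest element as the title.
    return max(doc.elements, key=lambda e: e.bbox[3] - e.bbox[1]).text


def detect_reading_order(doc: InterfaceDoc) -> List[int]:
    # Toy heuristic: order element indices top to bottom, then left to right.
    return sorted(
        range(len(doc.elements)),
        key=lambda i: (doc.elements[i].bbox[1], doc.elements[i].bbox[0]),
    )


# Registry of specialized sub engines, keyed by the metadata each one produces.
SUB_ENGINES: Dict[str, Callable[[InterfaceDoc], object]] = {
    "title": detect_title,
    "reading_order": detect_reading_order,
}


def enrich(doc: InterfaceDoc) -> Tuple[EnrichedDoc, List[str]]:
    """Run only the sub engines whose metadata is not already available."""
    to_run = [key for key in SUB_ENGINES if key not in doc.available]
    additional = {key: SUB_ENGINES[key](doc) for key in to_run}
    enriched = EnrichedDoc(
        elements=list(doc.elements),
        available=dict(doc.available),
        additional=additional,
    )
    return enriched, to_run


doc = InterfaceDoc(
    elements=[
        Element("Body text", (0, 50, 100, 60)),
        Element("Heading", (0, 0, 100, 30)),
    ],
    available={"title": "Heading"},  # already supplied by the source format
)
enriched, ran = enrich(doc)
```

In this example the source format already supplies a title, so only the `reading_order` sub engine runs, which mirrors the claim's point about avoiding redundant extraction. Keeping `available` and `additional` as separate fields is a design choice of the sketch; the claim only requires that both end up stored in the enriched format.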