Document layout extraction
First Claim
1. One or more computer-readable storage media comprising a memory and computer-executable instructions embodied thereon that, when executed, perform a method for extracting information from a document in an electronic format to produce a representation containing structure and layout metadata, the method comprising:
receiving one or more textual data in the electronic format, the textual data in the electronic format including a set of available structure and layout information;
converting the textual data from the electronic format to a generic, independent interface format, different from the electronic format by extracting the set of available structure and layout information from the textual data in the electronic format, the independent interface format including coordinates to one or more structural elements of the textual data and the independent interface format enables common analysis procedures to be carried out on textual data received in a variety of electronic formats;
performing a structure and layout analysis of the textual data in the independent interface format to generate a set of additional structure and layout information by;
having a set of specialized sub engines for extracting metadata;
applying a subset of the set of specialized sub engines to the textual data of the independent interface format, each of the set of specialized sub engines extracting metadata from the textual data of the independent interface format;
the subset of the set of specialized sub engines extracting only metadata for generating the set of additional structure and layout information that is not already identified in the set of available structure and layout information from the electronic format, thereby avoiding redundant extraction of metadata;
converting the independent interface format into an enriched interface format by, integrating the additional structure and layout information that includes the metadata extracted from each of the specialized sub engines into the available structure and layout information of the independent interface format, and storing the textual data, the additional structure and layout information, and the extracted set of available structure and layout information in the enriched interface format that is different from both the electronic format and the independent interface format, and the enriched interface format providing for search and navigation of the textual data.
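For illustration only, the selection and integration steps recited in claim 1 can be sketched in Python. Everything named here is an assumption introduced for the example: the `Element`, `IndependentDoc`, and `SubEngine` classes, the two toy sub engines, and the dictionary standing in for the enriched interface format. The claim itself does not prescribe a data model or any particular heuristics.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Element:
    text: str
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) coordinates on the page


@dataclass
class IndependentDoc:
    """Hypothetical stand-in for the independent interface format."""
    elements: List[Element]
    # Structure and layout information already recovered from the source
    # format, keyed by metadata kind (for example "title").
    available: Dict[str, object] = field(default_factory=dict)


@dataclass
class SubEngine:
    kind: str  # the metadata kind this engine produces
    extract: Callable[[IndependentDoc], object]


def find_title(doc: IndependentDoc) -> str:
    # Toy heuristic: treat the element closest to the top of the page as the title.
    return min(doc.elements, key=lambda e: e.bbox[1]).text


def find_page_numbers(doc: IndependentDoc) -> List[str]:
    # Toy heuristic: any element made up only of digits is a page number.
    return [e.text for e in doc.elements if e.text.isdigit()]


ENGINES = [
    SubEngine("title", find_title),
    SubEngine("page_numbers", find_page_numbers),
]


def select_engines(doc: IndependentDoc, engines: List[SubEngine]) -> List[SubEngine]:
    # Apply only the engines whose metadata is not already available,
    # which is what avoids redundant extraction.
    return [e for e in engines if e.kind not in doc.available]


def enrich(doc: IndependentDoc, engines: List[SubEngine]) -> dict:
    additional = {e.kind: e.extract(doc) for e in select_engines(doc, engines)}
    # The "enriched" form keeps the text, the available information,
    # and the newly extracted information together.
    return {
        "elements": [(e.text, e.bbox) for e in doc.elements],
        "available": dict(doc.available),
        "additional": additional,
    }


doc = IndependentDoc(
    elements=[
        Element("Annual Report", (72, 40, 300, 60)),
        Element("Revenue grew.", (72, 100, 400, 115)),
        Element("1", (290, 760, 300, 770)),
    ],
    available={"title": "Annual Report"},  # the source format already gave a title
)
enriched = enrich(doc, ENGINES)
```

In this sketch the title engine is skipped because the source format already supplied a title, so only the page-number engine runs.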
Abstract
Computer-readable media, systems, and methods for document layout extraction are described. In embodiments, textual data in an electronic format is received and the textual data is converted from the electronic format to an independent interface format, the independent interface format including coordinates to one or more structural elements of the textual data. Further, in embodiments, a structure and layout analysis of the textual data is performed to generate a set of structure and layout information. Still further, in embodiments, the textual data and the set of structure and layout information are stored in an enriched interface format, the enriched interface format providing for search and navigation of the textual data.
9 Claims
1. One or more computer-readable storage media comprising a memory and computer-executable instructions embodied thereon that, when executed, perform a method for extracting information from a document in an electronic format to produce a representation containing structure and layout metadata, the method comprising:
receiving one or more textual data in the electronic format, the textual data in the electronic format including a set of available structure and layout information;
converting the textual data from the electronic format to a generic, independent interface format, different from the electronic format by extracting the set of available structure and layout information from the textual data in the electronic format, the independent interface format including coordinates to one or more structural elements of the textual data and the independent interface format enables common analysis procedures to be carried out on textual data received in a variety of electronic formats;
performing a structure and layout analysis of the textual data in the independent interface format to generate a set of additional structure and layout information by;
having a set of specialized sub engines for extracting metadata;
applying a subset of the set of specialized sub engines to the textual data of the independent interface format, each of the set of specialized sub engines extracting metadata from the textual data of the independent interface format;
the subset of the set of specialized sub engines extracting only metadata for generating the set of additional structure and layout information that is not already identified in the set of available structure and layout information from the electronic format, thereby avoiding redundant extraction of metadata;
converting the independent interface format into an enriched interface format by, integrating the additional structure and layout information that includes the metadata extracted from each of the specialized sub engines into the available structure and layout information of the independent interface format, and storing the textual data, the additional structure and layout information, and the extracted set of available structure and layout information in the enriched interface format that is different from both the electronic format and the independent interface format, and the enriched interface format providing for search and navigation of the textual data.
(Dependent claims: 2, 3, 4)
5. A computerized system for extracting information from a document in an electronic format to produce a representation containing structure and layout metadata, the system comprising:
a processor;
a receiving component configured to receive textual data in the electronic format, the textual data in the electronic format including a set of available structure and layout information;
a converting component configured to convert the textual data from the electronic format to a generic independent interface format, different from the electronic format by extracting the set of available structure and layout information from the textual data in the electronic format, the independent interface format including coordinates to one or more structural elements of the textual data, and the converting component is configured to convert textual data received in a variety of electronic formats into the independent interface format such that textual data received in a first electronic format and textual data received in a second electronic format are converted to the independent interface format;
a processing component configured to analyze the textual data in the independent interface format to generate a set of additional structure and layout information and enable common analysis procedures to be carried out on textual data received in a variety of electronic formats, the processing component including one or more specialized sub engines configured to extract a set of metadata from the textual data,
a managing component configured to manage operation of the one or more specialized sub engines by applying only a subset of the one or more specialized sub engines to the textual data, the subset of the specialized sub engines extracting only metadata for generating the set of additional structure and layout information that is not already identified in the set of available structure and layout information from the electronic format, thereby avoiding redundant extraction of metadata, and
an integration component configured to convert the independent interface format into an enriched interface format by, integrating the additional structure and layout information that includes the metadata extracted from each of the specialized sub engines into the available structure and layout information of the independent interface format, and
a storing component configured to store the textual data, the additional structure and layout information, and the extracted set of available structure and layout information in the enriched interface format that is different from both the electronic format and the independent interface format, and the enriched interface format providing for search and navigation of the textual data.
(Dependent claims: 6)
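The converting component of claim 5 maps several source formats onto one independent interface format so that a single analysis routine can serve all of them. The following Python sketch shows that idea under stated assumptions: plain text and a Markdown-like syntax stand in for the "first" and "second" electronic formats, `(line, column)` pairs stand in for coordinates, and the names `Block`, `IndependentDoc`, `convert`, and `word_count` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Block:
    text: str
    coords: Tuple[int, int]  # (line, column) where the block starts in the source


@dataclass
class IndependentDoc:
    blocks: List[Block]
    # Structure information the source format exposed directly.
    available: Dict[str, List[str]] = field(default_factory=dict)


def _indent(line: str) -> int:
    return len(line) - len(line.lstrip())


def from_plain_text(raw: str) -> IndependentDoc:
    # Plain text carries no explicit structure, so `available` stays empty.
    blocks = [
        Block(line.strip(), (i, _indent(line)))
        for i, line in enumerate(raw.splitlines())
        if line.strip()
    ]
    return IndependentDoc(blocks)


def from_markdown(raw: str) -> IndependentDoc:
    # Heading markers are structure the format already provides, so they
    # are recorded as available information during conversion.
    blocks: List[Block] = []
    headings: List[str] = []
    for i, line in enumerate(raw.splitlines()):
        if not line.strip():
            continue
        if line.startswith("#"):
            text = line.lstrip("#").strip()
            headings.append(text)
        else:
            text = line.strip()
        blocks.append(Block(text, (i, _indent(line))))
    return IndependentDoc(blocks, {"headings": headings})


CONVERTERS: Dict[str, Callable[[str], IndependentDoc]] = {
    "txt": from_plain_text,
    "md": from_markdown,
}


def convert(raw: str, fmt: str) -> IndependentDoc:
    return CONVERTERS[fmt](raw)


def word_count(doc: IndependentDoc) -> int:
    # A "common analysis procedure": it only sees the independent format.
    return sum(len(b.text.split()) for b in doc.blocks)


doc_a = convert("Title\n\n  Body text here", "txt")
doc_b = convert("# Title\n\nBody text here", "md")
```

Both inputs end up in the same structure, so `word_count` needs no format-specific branches; only the Markdown-like input arrives with heading information already available.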
7. A method for converting a document in an electronic format into a representation containing structure and layout metadata, the method comprising:
sending textual data, including a set of available structure and layout information, in a first electronic format to a layout extraction engine, the layout extraction engine configured to convert the textual data from the first electronic format to an independent interface format different from the first electronic format by extracting the set of available structure and layout information from the textual data in the first electronic format, the independent interface format including coordinates to one or more structural elements of the textual data, the independent interface format enabling common analysis procedures to be carried out on textual data received in a variety of electronic formats, the layout extraction engine configured to perform a structure and layout analysis of the textual data to generate a set of additional structure and layout information by;
having a set of specialized sub engines for extracting metadata, applying a subset of the set of specialized sub engines to the textual data of the independent interface format, each of the specialized sub engines extracting only metadata for generating the set of additional structure and layout information that is not already identified in the set of available structure and layout information from the first electronic format, thereby avoiding redundant extraction of metadata, and integrating the additional structure and layout information metadata extracted from each of the specialized sub engines into the available structure and layout information of the independent interface format to convert the independent interface format into an enriched interface format;
sending textual data in a second electronic format to the layout extraction engine, wherein the layout extraction engine is configured to convert the textual data from the second electronic format to the independent interface format different from both the first electronic format and the second electronic format; and
receiving the textual data, the additional structure and layout information, and the extracted set of available structure and layout information in the enriched interface format, the enriched interface format providing for search and navigation of the textual data.
(Dependent claims: 8, 9)
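The claims state that the enriched interface format provides for search and navigation of the textual data but do not say how. One plausible reading, sketched below in Python, is that each stored element carries its text, coordinates, and a structural role, so that a search hit can be located on the page and an outline can be built from headings. The `EnrichedElement` class, its `role` values, and the `outline` and `search` functions are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class EnrichedElement:
    text: str
    page: int
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) on the page
    role: str                        # "heading" or "paragraph" (hypothetical roles)


def outline(doc: List[EnrichedElement]) -> List[Tuple[str, int]]:
    # Navigation: the headings and the pages they appear on.
    return [(e.text, e.page) for e in doc if e.role == "heading"]


def search(
    doc: List[EnrichedElement], term: str
) -> List[Tuple[Optional[str], int, Tuple[int, int, int, int]]]:
    # Search: each hit reports its enclosing section, page, and coordinates.
    hits = []
    section: Optional[str] = None
    for e in doc:
        if e.role == "heading":
            section = e.text
        if term.lower() in e.text.lower():
            hits.append((section, e.page, e.bbox))
    return hits


enriched_doc = [
    EnrichedElement("Introduction", 1, (72, 40, 200, 60), "heading"),
    EnrichedElement("Layout matters for search.", 1, (72, 80, 400, 95), "paragraph"),
    EnrichedElement("Methods", 2, (72, 40, 160, 60), "heading"),
    EnrichedElement("We analyze the enriched layout.", 2, (72, 80, 420, 95), "paragraph"),
]

toc = outline(enriched_doc)
hits = search(enriched_doc, "layout")
```

Because the structure metadata travels with the text, a hit can be reported together with its section and position without re-parsing the original file.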
Specification