Object extraction from presentation-oriented documents using a semantic and spatial approach
First Claim
1. A method for automatically extracting objects in a presentation-oriented document, comprising:
receiving as input by a computer the presentation-oriented document (POD) and a set of descriptors, the POD comprising content elements spatially arranged in a given layout organization for presenting contents to human users, wherein the computer is configured to process PODs having different formats, including webpage formats and Portable Document Format (PDF) formats, and wherein the set of descriptors define the objects to extract from the POD by defining both spatial relationships between the objects in the POD as well as semantics of the objects expressed as zero or more attributes comprising the objects;
using the set of descriptors to identify content elements in the POD that match the attributes in the set of descriptors defining the objects, and assigning semantic annotations to the identified content elements based on the descriptors;
creating a semantic and spatial document model (SSDM) containing one or more of spatial structures of the identified content elements, presentation/visual features of the identified content elements, and the semantic annotations assigned to the identified content elements;
extracting the identified content elements from the POD based on the set of descriptors and the SSDM to create a set of object instances; and
performing at least one of:
i) using the object instances to generate semantic and spatial wrappers that can be reused on a different POD, and ii) storing the object instances in a data repository.
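The steps recited in the claim (descriptor matching, annotation, model building, extraction, storage) can be illustrated with a minimal Python sketch. Everything in it is a hypothetical assumption made for illustration and is not taken from the patent: the `ContentElement` and `Descriptor` classes, the use of regular expressions to stand in for attribute semantics, the single `left_of` relation, and the toy menu data. The "SSDM" here is only a list of annotated elements in reading order, and the "data repository" is a plain list.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ContentElement:
    """A piece of text with its bounding box (top-left origin). Hypothetical type."""
    text: str
    x: float
    y: float
    w: float
    h: float
    annotations: list = field(default_factory=list)


@dataclass
class Descriptor:
    """Defines one object: attribute patterns plus spatial relations. Hypothetical type."""
    name: str
    attributes: dict   # attribute name -> regex standing in for its semantics
    relations: list    # (attribute_a, relation_name, attribute_b) triples


def annotate(elements, descriptor):
    """Tag each element with every attribute whose pattern matches its text."""
    for el in elements:
        for attr, pattern in descriptor.attributes.items():
            if re.fullmatch(pattern, el.text):
                el.annotations.append(attr)
    return [el for el in elements if el.annotations]


def build_ssdm(annotated):
    """Toy SSDM: annotated elements sorted top-to-bottom, then left-to-right."""
    return sorted(annotated, key=lambda el: (el.y, el.x))


def left_of(a, b):
    """True when a ends before b starts and the two boxes share a row."""
    same_row = a.y < b.y + b.h and b.y < a.y + a.h
    return a.x + a.w <= b.x and same_row


RELATIONS = {"left_of": left_of}


def extract(ssdm, descriptor):
    """Pair up annotated elements that satisfy the descriptor's spatial relations."""
    instances = []
    for attr_a, rel, attr_b in descriptor.relations:
        for a in ssdm:
            if attr_a not in a.annotations:
                continue
            for b in ssdm:
                if attr_b in b.annotations and RELATIONS[rel](a, b):
                    instances.append({attr_a: a.text, attr_b: b.text})
    return instances


# Toy POD: a heading and two rows of a price list (made-up data).
pod = [
    ContentElement("Menu", 10, 0, 40, 8),
    ContentElement("Espresso", 10, 10, 80, 12),
    ContentElement("$2.50", 120, 10, 40, 12),
    ContentElement("Latte", 10, 30, 80, 12),
    ContentElement("$3.75", 120, 30, 40, 12),
]
menu_item = Descriptor(
    name="menu_item",
    attributes={"item": r"[A-Z][a-z]+", "price": r"\$\d+\.\d{2}"},
    relations=[("item", "left_of", "price")],
)

ssdm = build_ssdm(annotate(pod, menu_item))
repository = []                      # stand-in for a data repository
repository.extend(extract(ssdm, menu_item))
```

In this sketch the heading "Menu" matches the `item` pattern and is annotated, but it is not extracted because no price sits to its right on the same row. That is the point of combining semantic matching with spatial constraints: either one alone would over-match.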
Abstract
Automatic extraction of objects in a presentation-oriented document comprises receiving the presentation-oriented document (POD) in which content elements are spatially arranged in a given layout organization for presenting contents to human users; receiving a set of descriptors that semantically define the objects to extract from the POD based on attributes comprising the objects; using the set of descriptors to identify content elements in the POD that match the attributes in the set of descriptors defining the objects, and assigning semantic annotations to the identified elements based on the descriptors; creating a semantic and spatial document model (SSDM) containing spatial structures of the identified content elements in the POD and the semantic annotations assigned to the identified content elements; extracting the identified content elements from the POD based on the set of descriptors and the SSDM to create a set of object instances; and performing at least one of: i) using the object instances to generate semantic and spatial wrappers that can be reused on a different POD, and ii) storing the object instances in a data repository.
22 Claims
1. A method for automatically extracting objects in a presentation-oriented document, comprising:
receiving as input by a computer the presentation-oriented document (POD) and a set of descriptors, the POD comprising content elements spatially arranged in a given layout organization for presenting contents to human users, wherein the computer is configured to process PODs having different formats, including webpage formats and Portable Document Format (PDF) formats, and wherein the set of descriptors define the objects to extract from the POD by defining both spatial relationships between the objects in the POD as well as semantics of the objects expressed as zero or more attributes comprising the objects;
using the set of descriptors to identify content elements in the POD that match the attributes in the set of descriptors defining the objects, and assigning semantic annotations to the identified content elements based on the descriptors;
creating a semantic and spatial document model (SSDM) containing one or more of spatial structures of the identified content elements, presentation/visual features of the identified content elements, and the semantic annotations assigned to the identified content elements;
extracting the identified content elements from the POD based on the set of descriptors and the SSDM to create a set of object instances; and
performing at least one of:
i) using the object instances to generate semantic and spatial wrappers that can be reused on a different POD, and ii) storing the object instances in a data repository.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. An executable software product stored on a non-transitory computer-readable storage medium containing program instructions for automatically extracting objects from a presentation-oriented document, the program instructions for:
receiving as input by a computer the presentation-oriented document (POD) and a set of descriptors, the POD comprising content elements spatially arranged in a given layout organization for presenting contents to human users, wherein the computer is configured to process PODs having different formats, including webpage formats and Portable Document Format (PDF) formats, and wherein the set of descriptors define the objects to extract from the POD by defining both spatial relationships between objects in the POD as well as semantics of the objects expressed as zero or more attributes comprising the objects;
using the set of descriptors to identify content elements in the POD that match the attributes in the set of descriptors defining the objects, and assigning semantic annotations to the identified content elements based on the descriptors;
creating a semantic and spatial document model (SSDM) containing one or more of spatial structures of the identified content elements, presentation/visual features of the identified content elements, and the semantic annotations assigned to the identified content elements;
extracting the identified content elements from the POD based on the set of descriptors and the SSDM to create a set of object instances; and
performing at least one of:
i) using the object instances to generate semantic and spatial wrappers that can be reused on a different POD, and ii) storing the object instances in a data repository.
Dependent claims: 10, 11, 12, 13, 14, 15
16. A system, comprising:
a memory;
a processor coupled to the memory; and
one or more software components executed by the processor that is configured to:
receive as input the presentation-oriented document (POD) and a set of descriptors, the POD comprising content elements spatially arranged in a given layout organization for presenting contents to human users, wherein the processor is configured to process PODs having different formats, including webpage formats and Portable Document Format (PDF) formats, and wherein the set of descriptors define the objects to extract from the POD by defining both spatial relationships between objects in the POD as well as semantics of the objects expressed as zero or more attributes comprising the objects;
use the set of descriptors to identify content elements in the POD that match the attributes in the set of descriptors defining the objects, and assign semantic annotations to the identified content elements based on the descriptors;
create a semantic and spatial document model (SSDM) containing one or more of spatial structures of the identified content elements, presentation/visual features of the identified content elements, and the semantic annotations assigned to the identified content elements;
extract the identified content elements from the POD based on the set of descriptors and the SSDM to create a set of object instances; and
perform at least one of:
i) using the object instances to generate semantic and spatial wrappers that can be reused on a different POD, and ii) storing the object instances in a data repository.
Dependent claims: 17, 18, 19, 20, 21, 22
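Each independent claim requires handling PODs in different formats, including webpage and PDF formats. One way to picture this, offered only as a hedged sketch and not as the patented mechanism, is to normalize format-specific layout boxes into one common representation before any descriptor is applied. The `Box` type, the `from_web` and `from_pdf` helpers, and the dictionary keys below are hypothetical. The sketch assumes web boxes use a top-left origin and PDF boxes use the default PDF user space, whose origin is at the bottom-left of the page.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Format-independent content element: text plus a top-left-origin box."""
    text: str
    x: float
    y: float
    w: float
    h: float


def from_web(node: dict) -> Box:
    """Web layout boxes already use a top-left origin, so copy them through."""
    return Box(node["text"], node["left"], node["top"],
               node["width"], node["height"])


def from_pdf(span: dict, page_height: float) -> Box:
    """Flip the y axis: default PDF user space measures y upward from the bottom."""
    return Box(
        span["text"],
        span["x0"],
        page_height - span["y1"],     # top edge, measured from the top of the page
        span["x1"] - span["x0"],
        span["y1"] - span["y0"],
    )


# The same visual element described in both conventions (made-up numbers).
web_box = from_web({"text": "Latte", "left": 72, "top": 80,
                    "width": 100, "height": 12})
pdf_box = from_pdf({"text": "Latte", "x0": 72, "y0": 700,
                    "x1": 172, "y1": 712}, page_height=792)
```

Once both inputs reduce to the same `Box` values, spatial relationships such as "to the left of" or "below" can be evaluated without regard to the source format.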
Specification