Object extraction from presentation-oriented documents using a semantic and spatial approach

US 9,582,494 B2
Filed: 02/22/2013
Issued: 02/28/2017
Est. Priority Date: 02/22/2013
Status: Active Grant

Abstract

Automatic extraction of objects in a presentation-oriented document comprises receiving the presentation-oriented document (POD) in which content elements are spatially arranged in a given layout organization for presenting contents to human users; receiving a set of descriptors that semantically define the objects to extract from the POD based on attributes comprising the objects; using the set of descriptors to identify content elements in the POD that match the attributes in the set of descriptors defining the objects, and assigning semantic annotations to the identified elements based on the descriptors; creating a semantic and spatial document model (SSDM) containing spatial structures of the identified content elements in the POD and the semantic annotations assigned to the identified content elements; extracting the identified content elements from the POD based on the set of descriptors and the SSDM to create a set of object instances; and performing at least one of: i) using the object instances to generate semantic and spatial wrappers that can be reused on a different POD, and ii) storing the object instances in a data repository.
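
The steps in the abstract can be sketched end to end. The following is a minimal illustration under stated assumptions, not the patented implementation: the `Element` class, the regex-based annotation descriptors, the `same_row_right_of` constraint, the in-memory `repository`, and the product/price example are all hypothetical names introduced here.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Element:
    """A content element of a POD with its rendered bounding box (assumed layout data)."""
    text: str
    x: float
    y: float
    w: float
    h: float
    annotations: set = field(default_factory=set)


# Hypothetical annotation descriptors: attribute name -> pattern on element text.
ANNOTATION_DESCRIPTORS = {
    "product": re.compile(r"^[A-Z][a-z]+( [A-Z][a-z]+)*$"),
    "price": re.compile(r"^\$\d+(\.\d{2})?$"),
}


def annotate(elements):
    """Assign semantic annotations and return the annotated elements.

    The returned list plays the role of the SSDM here: each entry keeps
    both its spatial box and its semantic annotations.
    """
    for el in elements:
        for name, pattern in ANNOTATION_DESCRIPTORS.items():
            if pattern.match(el.text):
                el.annotations.add(name)
    return [el for el in elements if el.annotations]


def same_row_right_of(a, b, tol=5.0):
    """Assumed spatial constraint: b sits on a's row and starts after a ends."""
    return abs(a.y - b.y) <= tol and b.x >= a.x + a.w


def extract(ssdm):
    """Build object instances: each product paired with the nearest price on its row."""
    instances = []
    for a in ssdm:
        if "product" not in a.annotations:
            continue
        prices = [b for b in ssdm
                  if "price" in b.annotations and same_row_right_of(a, b)]
        if prices:
            nearest = min(prices, key=lambda b: b.x - a.x)
            instances.append({"product": a.text, "price": nearest.text})
    return instances


# Example POD content, as a renderer might report it (made-up values).
elements = [
    Element("Desk Lamp", 10, 100, 80, 12),
    Element("$19.99", 200, 101, 40, 12),
    Element("Office Chair", 10, 130, 90, 12),
    Element("$89.00", 200, 130, 40, 12),
    Element("Free shipping on all orders", 10, 200, 200, 12),
]

repository = []  # stand-in for the data repository
repository.extend(extract(annotate(elements)))
print(repository)
```

In this sketch the banner line matches no descriptor and is left out of the model, while each product is paired with the price on its own row. Real descriptors would presumably be richer than regular expressions; the point is only that matching combines what an element says with where it is placed.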

22 Claims

1. A method for automatically extracting objects in a presentation-oriented document, comprising:
- receiving as input by a computer the presentation-oriented document (POD) and a set of descriptors, the POD comprising content elements spatially arranged in a given layout organization for presenting contents to human users, wherein the computer is configured to process PODs having different formats, including webpage formats and Portable Document Format (PDF) formats, and wherein the set of descriptors define the objects to extract from the POD by defining both spatial relationships between the objects in the POD as well as semantics of the objects expressed as zero or more attributes comprising the objects;
  
  using the set of descriptors to identify content elements in the POD that match the attributes in the set of descriptors defining the objects, and assigning semantic annotations to the identified content elements based on the descriptors;
  
  creating a semantic and spatial document model (SSDM) containing one or more of spatial structures of the identified content elements, presentation/visual features of the identified content elements, and the semantic annotations assigned to the identified content elements;
  
  extracting the identified content elements from the POD based on the set of descriptors and the SSDM to create a set of object instances; and
  
  performing at least one of:
  
  i) using the object instances to generate semantic and spatial wrappers that can be reused on a different POD, and ii) storing the object instances in a data repository.
  - 2. The method of claim 1, wherein the set of descriptors comprise:
    - annotation descriptors that annotate content elements found in the POD in order to identify basic pieces of information that constitute the attributes of the objects to extract; and
      
      object descriptors comprising object schemas that define which attributes defined by at least one of the annotation descriptors and other object descriptors, compose each of the objects to extract, and a set of semantic and spatial constraints and preferences that define expected spatial arrangement of the attributes defined in the object descriptors.
  - 3. The method of claim 1, wherein receiving the set of descriptors further comprises:
    - compiling the set of descriptors into a set of compiled annotation descriptors and compiled object descriptors.
  - 4. The method of claim 3, wherein using the set of descriptors to identify content elements in the POD further comprises:
    - applying the set of compiled annotation descriptors to the POD to create the SSDM.
  - 5. The method of claim 3, wherein creating the SSDM further comprises:
    - using a visualizer to automatically extract visual presentation features from the POD to create a document model;
      
      using the document model and the visualizer to extract spatial features from the POD and to create a spatial document model, where each of the extracted spatial features expresses a visualization area assigned by the visualizer; and
      
      analyzing the spatial document model based on the attributes defined by the descriptors to semantically annotate the content elements in the spatial document model according to attribute rules.
  - 6. The method of claim 1, further comprising:
    - representing the SSDM as at least one of a graph, hash mapping, and an R-tree.
  - 7. The method of claim 1, wherein the descriptors are created through a graphical user interface.
  - 8. The method of claim 2, further comprising:
    - creating object instances by searching for annotated content elements in the SSDM that compose given objects as defined by the object descriptors.
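
Claim 6 names a graph, a hash mapping, and an R-tree as possible SSDM representations. As a rough illustration of the hash-mapping option only, the sketch below stores annotated elements in a uniform grid keyed by cell coordinates, which is a simpler stand-in for an R-tree and not the structure the patent specifies. The `GridSSDM` class, its method names, the cell size, and the `(x0, y0, x1, y1)` box convention are all assumptions made for this example.

```python
from collections import defaultdict


class GridSSDM:
    """Hypothetical SSDM backed by a hash map from grid cells to element keys."""

    def __init__(self, cell=50.0):
        self.cell = cell
        self.cells = defaultdict(list)  # (col, row) -> keys of elements touching the cell
        self.items = {}                 # key -> (box, text, annotations)

    def _cells_for(self, x0, y0, x1, y1):
        # Enumerate every grid cell that the box (x0, y0, x1, y1) touches.
        c = self.cell
        for col in range(int(x0 // c), int(x1 // c) + 1):
            for row in range(int(y0 // c), int(y1 // c) + 1):
                yield (col, row)

    def insert(self, key, box, text, annotations):
        """Store an element with its box (x0, y0, x1, y1) and semantic annotations."""
        self.items[key] = (box, text, frozenset(annotations))
        for cell in self._cells_for(*box):
            self.cells[cell].append(key)

    def query(self, x0, y0, x1, y1, annotation=None):
        """Return sorted keys of elements overlapping the region, optionally
        restricted to elements carrying the given annotation."""
        found = set()
        for cell in self._cells_for(x0, y0, x1, y1):
            for key in self.cells.get(cell, ()):
                (bx0, by0, bx1, by1), _, anns = self.items[key]
                overlaps = bx0 <= x1 and bx1 >= x0 and by0 <= y1 and by1 >= y0
                if overlaps and (annotation is None or annotation in anns):
                    found.add(key)
        return sorted(found)


ssdm = GridSSDM(cell=50.0)
ssdm.insert("e1", (10, 100, 90, 112), "Desk Lamp", {"product"})
ssdm.insert("e2", (200, 101, 240, 113), "$19.99", {"price"})
ssdm.insert("e3", (200, 130, 240, 142), "$89.00", {"price"})

# Which price elements lie in the strip to the right of "Desk Lamp"?
print(ssdm.query(90, 95, 300, 115, annotation="price"))
```

A search for object instances of the kind described in claim 8 could use such a region query to fetch only the annotated elements near a candidate attribute instead of scanning the whole model. An R-tree would serve the same purpose with better behavior when element sizes vary widely.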

9. An executable software product stored on a non-transitory computer-readable storage medium containing program instructions for automatically extracting objects from a presentation-oriented document, the program instructions for:
- receiving as input by a computer the presentation-oriented document (POD) and a set of descriptors, the POD comprising content elements spatially arranged in a given layout organization for presenting contents to human users, wherein the computer is configured to process PODs having different formats, including webpage formats and Portable Document Format (PDF) formats, and wherein the set of descriptors define the objects to extract from the POD by defining both spatial relationships between objects in the POD as well as semantics of the objects expressed as zero or more attributes comprising the objects;
  
  using the set of descriptors to identify content elements in the POD that match the attributes in the set of descriptors defining the objects, and assigning semantic annotations to the identified content elements based on the descriptors;
  
  creating a semantic and spatial document model (SSDM) containing one or more of spatial structures of the identified content elements, presentation/visual features of the identified content elements, and the semantic annotations assigned to the identified content elements;
  
  extracting the identified content elements from the POD based on the set of descriptors and the SSDM to create a set of object instances;
  
  performing at least one of:
  
  i) using the object instances to generate semantic and spatial wrappers that can be reused on a different POD, and ii) storing the object instances in a data repository.
  - 10. The executable software product of claim 9, wherein the set of descriptors comprise:
    - annotation descriptors that annotate content elements found in the POD in order to identify basic pieces of information that constitute the attributes of the objects to extract; and
      
      object descriptors comprising object schemas that define which attributes defined by at least one of the annotation descriptors and other object descriptors, compose each of the objects to extract, and a set of semantic and spatial constraints and preferences that define expected spatial arrangement of the attributes defined in the object descriptors.
  - 11. The executable software product of claim 9, wherein receiving the set of descriptors further comprises program instructions for:
    - compiling the set of descriptors into a set of compiled annotation descriptors and compiled object descriptors.
  - 12. The executable software product of claim 11, wherein using the set of descriptors to identify content elements in the POD further comprises program instructions for:
    - applying the set of compiled annotation descriptors to the POD to create the SSDM.
  - 13. The executable software product of claim 11, wherein creating the SSDM further comprises program instructions for:
    - using a visualizer to automatically extract visual presentation features from the POD to create a document model;
      
      using the document model and the visualizer to extract spatial features from the POD and to create a spatial document model, where each of the extracted spatial features expresses a visualization area assigned by the visualizer; and
      
      analyzing the spatial document model based on the attributes defined by the descriptors to semantically annotate the content elements in the spatial document model according to attribute rules.
  - 14. The executable software product of claim 9, further comprising program instructions for:
    - representing the SSDM as at least one of a graph, hash mapping, and an R-tree.
  - 15. The executable software product of claim 10, further comprising program instructions for:
    - creating object instances by searching for annotated content elements in the SSDM that compose given objects as defined by the object descriptors.

16. A system, comprising:
- a memory;
  
  a processor coupled to the memory; and
  
  one or more software components executed by the processor that are configured to:
  
  receive as input the presentation-oriented document (POD) and a set of descriptors, the POD comprising content elements spatially arranged in a given layout organization for presenting contents to human users, wherein the processor is configured to process PODs having different formats, including webpage formats and Portable Document Format (PDF) formats, and wherein the set of descriptors define the objects to extract from the POD by defining both spatial relationships between objects in the POD as well as semantics of the objects expressed as zero or more attributes comprising the objects;
  
  use the set of descriptors to identify content elements in the POD that match the attributes in the set of descriptors defining the objects, and assign semantic annotations to the identified content elements based on the descriptors;
  
  create a semantic and spatial document model (SSDM) containing one or more of spatial structures of the identified content elements, presentation/visual features of the identified content elements, and the semantic annotations assigned to the identified content elements;
  
  extract the identified content elements from the POD based on the set of descriptors and the SSDM to create a set of object instances; and
  
  perform at least one of:
  
  i) using the object instances to generate semantic and spatial wrappers that can be reused on a different POD, and ii) storing the object instances in a data repository.
  - 17. The system of claim 16, wherein the set of descriptors comprise:
    - annotation descriptors that annotate content elements found in the POD in order to identify basic pieces of information that constitute the attributes of the objects to extract; and
      
      object descriptors comprising object schemas that define which attributes defined by at least one of the annotation descriptors and other object descriptors, compose each of the objects to extract, and a set of semantic and spatial constraints and preferences that define expected spatial arrangement of the attributes defined in the object descriptors.
  - 18. The system of claim 16, wherein the set of descriptors are compiled into a set of compiled annotation descriptors and compiled object descriptors.
  - 19. The system of claim 18, wherein the set of compiled annotation descriptors are applied to the POD to create the SSDM.
  - 20. The system of claim 18, wherein one or more software components create the SSDM by:
    - using a visualizer to automatically extract visual presentation features from the POD to create a document model;
      
      using the document model and the visualizer to extract spatial features from the POD and to create a spatial document model, where each of the extracted spatial features expresses a visualization area assigned by the visualizer; and
      
      analyzing the spatial document model based on the attributes defined by the descriptors to semantically annotate the content elements in the spatial document model according to attribute rules.
  - 21. The system of claim 16, wherein the SSDM is represented as at least one of a graph, hash mapping, and an R-tree.
  - 22. The system of claim 17, wherein the object instances are created by searching for annotated content elements in the SSDM that compose given objects as defined by the object descriptors.

Current Assignee: Altilia SRL
Original Assignee: Altilia SRL
Inventors: Oro, Ermelinda; Ruffolo, Massimo
Primary Examiner(s): Baderman, Scott
Assistant Examiner(s): McVicker, Matthew G
Application Number: US13/774,289
Publication Number: US 20140245122A1
Time in Patent Office: 1,467 Days
US Class Current: 1/1
CPC Class Codes: G06F 40/30 Semantic analysis
