Object extraction from presentation-oriented documents using a semantic and spatial approach

  • US 9,582,494 B2
  • Filed: 02/22/2013
  • Issued: 02/28/2017
  • Est. Priority Date: 02/22/2013
  • Status: Active Grant
First Claim

1. A method for automatically extracting objects in a presentation-oriented document, comprising:

  • receiving as input by a computer the presentation-oriented document (POD) and a set of descriptors, the POD comprising content elements spatially arranged in a given layout organization for presenting contents to human users, wherein the computer is configured to process PODs having different formats, including webpage formats and Portable Document Format (PDF) formats, and wherein the set of descriptors define the objects to extract from the POD by defining both spatial relationships between the objects in the POD as well as semantics of the objects expressed as zero or more attributes comprising the objects;

    using the set of descriptors to identify content elements in the POD that match the attributes in the set of descriptors defining the objects, and assigning semantic annotations to the identified content elements based on the descriptors;

    creating a semantic and spatial document model (SSDM) containing one or more of spatial structures of the identified content elements, presentation/visual features of the identified content elements, and the semantic annotations assigned to the identified content elements;

    extracting the identified content elements from the POD based on the set of descriptors and the SSDM to create a set of object instances; and

    performing at least one of:

    i) using the object instances to generate semantic and spatial wrappers that can be reused on a different POD, and

    ii) storing the object instances in a data repository.
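
The steps of the claim can be illustrated with a minimal Python sketch. This is not the patented implementation; every name here (`ContentElement`, `Descriptor`, `annotate`, `extract`, the regular expressions, and the distance thresholds) is a hypothetical choice made for illustration. It assumes the POD has already been parsed into text elements with page coordinates, treats the list of annotated elements as a very reduced stand-in for the SSDM, and covers only option ii) (storing object instances), not wrapper generation.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ContentElement:
    """One piece of laid-out content (hypothetical, format-independent)."""
    text: str
    x: float
    y: float  # assumed to grow downward, as in most page coordinate systems
    annotation: Optional[str] = None


@dataclass
class Descriptor:
    """Hypothetical descriptor: attribute semantics plus a spatial constraint."""
    object_name: str
    attributes: Dict[str, str]  # attribute name -> regex describing its text
    anchor: str                 # attribute that starts an object instance
    max_dx: float               # other attributes must be horizontally this close
    max_dy: float               # ...and at most this far below the anchor


def annotate(elements: List[ContentElement], d: Descriptor) -> List[ContentElement]:
    """Assign a semantic annotation to each element that matches an attribute."""
    for el in elements:
        for attr, pattern in d.attributes.items():
            if re.fullmatch(pattern, el.text):
                el.annotation = attr
                break
    # The annotated elements with their positions act as a toy SSDM.
    return [el for el in elements if el.annotation is not None]


def extract(ssdm: List[ContentElement], d: Descriptor) -> List[Dict[str, str]]:
    """Group annotated elements into object instances using the spatial rule."""
    instances = []
    for a in (el for el in ssdm if el.annotation == d.anchor):
        inst = {d.anchor: a.text}
        for el in ssdm:
            below = 0 < el.y - a.y <= d.max_dy
            aligned = abs(el.x - a.x) <= d.max_dx
            if el.annotation != d.anchor and below and aligned:
                inst.setdefault(el.annotation, el.text)
        instances.append(inst)
    return instances


# Toy input: two product cards and an unrelated footer line.
elements = [
    ContentElement("Acme Kettle", 10, 100),
    ContentElement("$24.99", 10, 120),
    ContentElement("Site footer", 10, 400),
    ContentElement("Bolt Toaster", 200, 100),
    ContentElement("$39.50", 200, 120),
]
product = Descriptor(
    object_name="product",
    attributes={"title": r"[A-Z][a-z]+ [A-Z][a-z]+", "price": r"\$\d+\.\d{2}"},
    anchor="title",
    max_dx=50,
    max_dy=30,
)

ssdm = annotate(elements, product)
repository = extract(ssdm, product)  # stand-in for the data repository
print(repository)
```

Because the sketch works only on text and coordinates, the same descriptor could in principle be applied to elements parsed from a webpage or from a PDF, which is the format independence the claim describes.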
