Structured metadata extraction

  • US 8,954,438 B1
  • Filed: 05/31/2012
  • Issued: 02/10/2015
  • Est. Priority Date: 05/31/2012
  • Status: Active Grant
First Claim

1. A method comprising:

    accessing, using processing equipment, one or more documents from which to extract structured metadata from each of a plurality of hosts;

    extracting, using the processing equipment, a plurality of entity names from one or more documents from a first host of the plurality of hosts using an entity name pattern;

    determining whether to extract a first element list based at least in part on one or more heuristic rules or based at least in part on one or more pattern-based rules;

    in response to determining to extract the first element list based at least in part on one or more heuristic rules, extracting, using the processing equipment, the first element list from the one or more documents based at least in part on a first entity name of the plurality of entity names and based at least in part on one or more heuristic rules;

    validating the first element list based at least in part on a comparison of the first element list with one or more reference lists, wherein the first element list and the one or more reference lists are each associated with a same one of the plurality of entity names;

    generating, using the processing equipment, an element list pattern based at least in part on the first element list and on the structured metadata of at least the first host, wherein generating the element list pattern includes determining an element node pattern, wherein the element node pattern is based at least in part on a document object model tree path from a document root node to an element node;

    determining whether to extract a second element list based at least in part on one or more heuristic rules or based at least in part on one or more pattern-based rules; and

    in response to determining to extract the second element list based at least in part on one or more pattern-based rules, extracting, using the processing equipment, the second element list from the one or more documents based at least in part on a second entity name and one or more pattern-based rules comprising the generated element list pattern.
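
To make the flow of claim 1 easier to follow, here is a minimal Python sketch of three of its steps: validating an extracted element list against a reference list, deriving an element node pattern from the document object model tree path (document root to element node), and reusing that pattern to extract a second element list. It is an illustration under stated assumptions, not the patented implementation. The claim does not say how the comparison or the pattern matching is performed, so the Jaccard-overlap threshold, the tag-only path format, and all names (`PathCollector`, `learn_element_pattern`, `extract_with_pattern`, `validate`) are hypothetical, as is the sample HTML.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the path stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Record the tag path from the document root to each non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open tags, root first
        self.text_paths = []  # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop until the matching open tag has been removed.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.text_paths.append(("/".join(self.stack), text))


def collect_paths(html):
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.text_paths


def validate(element_list, reference_list, threshold=0.5):
    """Assumed validation rule: Jaccard overlap with a reference list."""
    a = {e.lower() for e in element_list}
    b = {e.lower() for e in reference_list}
    if not a or not b:
        return False
    return len(a & b) / len(a | b) >= threshold


def learn_element_pattern(html, known_elements):
    """Return the most common root-to-node path among nodes holding known elements."""
    known = {e.lower() for e in known_elements}
    counts = Counter(p for p, t in collect_paths(html) if t.lower() in known)
    return counts.most_common(1)[0][0] if counts else None


def extract_with_pattern(html, pattern):
    """Pattern-based extraction: keep text nodes whose path equals the pattern."""
    return [t for p, t in collect_paths(html) if p == pattern]


# Hypothetical pages from the same host; "Album A" and "Album B" play the entity names.
page_one = ("<html><body><h1>Album A</h1><ul><li>Song One</li>"
            "<li>Song Two</li><li>Song Three</li></ul><p>Footer</p></body></html>")
page_two = ("<html><body><h1>Album B</h1><ul><li>Intro</li>"
            "<li>Outro</li></ul></body></html>")

first_list = ["Song One", "Song Two", "Song Three"]  # stand-in for heuristic output
reference = ["song one", "song two", "song three", "bonus track"]

if validate(first_list, reference):
    pattern = learn_element_pattern(page_one, first_list)  # "html/body/ul/li"
    second_list = extract_with_pattern(page_two, pattern)  # ["Intro", "Outro"]
```

The sketch only learns a pattern from a validated list, mirroring the order of the claim steps. A fuller treatment would also need the heuristic extraction itself and the decision between heuristic and pattern-based rules, which the claim leaves unspecified here.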
