Structured metadata extraction
First Claim
1. A method comprising:
accessing, using processing equipment, one or more documents from which to extract structured metadata from each of a plurality of hosts;
extracting, using the processing equipment, a plurality of entity names from one or more documents from a first host of the plurality of hosts using an entity name pattern;
determining whether to extract a first element list based at least in part on one or more heuristic rules or based at least in part on one or more pattern-based rules;
in response to determining to extract the first element list based at least in part on one or more heuristic rules, extracting, using the processing equipment, the first element list from the one or more documents based at least in part on a first entity name of the plurality of entity names and based at least in part on one or more heuristic rules;
validating the first element list based at least in part on a comparison of the first element list with one or more reference lists, wherein the first element list and the one or more reference lists are each associated with a same one of the plurality of entity names;
generating, using the processing equipment, an element list pattern based at least in part on the first element list and on the structured metadata of at least the first host, wherein generating the element list pattern includes determining an element node pattern, wherein the element node pattern is based at least in part on a document object model tree path from a document root node to an element node;
determining whether to extract a second element list based at least in part on one or more heuristic rules or based at least in part on one or more pattern-based rules; and
in response to determining to extract the second element list based at least in part on one or more pattern-based rules, extracting, using the processing equipment, the second element list from the one or more documents based at least in part on a second entity name and one or more pattern-based rules comprising the generated element list pattern.
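The claim ties the element node pattern to a document object model tree path from the document root to an element node. A minimal sketch of that idea follows, assuming the pattern is simply the slash-joined tag path shared by the elements of a known-good first list. The names `PathCollector`, `learn_element_node_pattern`, and `extract_with_pattern` are hypothetical and are not taken from the patent; the claim does not prescribe this representation.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PathCollector(HTMLParser):
    """Records (root-to-node tag path, text) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open tags from the document root down
        self.nodes = []   # collected (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop up to and including the matching open tag, if there is one.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def text_nodes(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.nodes


def learn_element_node_pattern(html, known_elements):
    """Assumed rule: the most common path among nodes matching the first list."""
    paths = Counter(p for p, t in text_nodes(html) if t in known_elements)
    return paths.most_common(1)[0][0] if paths else None


def extract_with_pattern(html, pattern):
    """Pattern-based rule: keep every text node whose path equals the pattern."""
    return [t for p, t in text_nodes(html) if p == pattern]
```

In this sketch, a first list such as `{"Song A", "Song B"}` found under `html/body/ul/li` on one page yields a path that can then be applied to a second page from the same host without rerunning the heuristics.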
Abstract
Structured metadata extraction may include accessing one or more documents from which to extract the structured metadata from each of a plurality of hosts. A plurality of entity names can be extracted from the one or more documents from one of the plurality of hosts using an entity name pattern. A first element list can be extracted from the one or more documents based at least in part on the plurality of entity names and based at least in part on one or more heuristic rules. An element list pattern may be generated based at least in part on the first element list, and a second element list may be extracted from the one or more documents based at least in part on the element list pattern.
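The flow in the abstract (entity names first, a heuristic pass for the first element list, then a pattern-driven pass for later lists) can be sketched as below. Everything concrete here is an assumption for illustration: the `<h1>` entity name pattern, the "most repeated tag wins" heuristic, and the use of a bare tag name as a stand-in for the generated element list pattern are not specified by the patent.

```python
import re

# Hypothetical entity name pattern: the entity name is the page's <h1> text.
ENTITY_NAME_PATTERN = re.compile(r"<h1>(.*?)</h1>")
# Candidate list items: simple, non-nested <li> or <td> elements.
ITEM = re.compile(r"<(li|td)>(.*?)</\1>")


def heuristic_extract(doc):
    """Assumed heuristic rule: the tag with the most items holds the list."""
    by_tag = {}
    for tag, text in ITEM.findall(doc):
        by_tag.setdefault(tag, []).append(text)
    if not by_tag:
        return None, []
    tag = max(by_tag, key=lambda t: len(by_tag[t]))
    return tag, by_tag[tag]


def pattern_extract(doc, pattern):
    """Pattern-based rule: keep only items matching the learned tag."""
    return [text for tag, text in ITEM.findall(doc) if tag == pattern]


def extract_host(docs):
    """Process one host's documents; learn a pattern once, then reuse it."""
    pattern = None
    lists_by_entity = {}
    for doc in docs:
        match = ENTITY_NAME_PATTERN.search(doc)
        if not match:
            continue
        if pattern is None:
            # No pattern yet for this host: fall back to heuristic rules.
            pattern, elements = heuristic_extract(doc)
        else:
            elements = pattern_extract(doc, pattern)
        lists_by_entity[match.group(1)] = elements
    return lists_by_entity
```

The design point this is meant to illustrate is that the heuristic only has to succeed once per host; on a later page where a table happens to contain more cells than the list has items, the learned pattern still selects the list.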
16 Claims
1. A method comprising:
accessing, using processing equipment, one or more documents from which to extract structured metadata from each of a plurality of hosts;
extracting, using the processing equipment, a plurality of entity names from one or more documents from a first host of the plurality of hosts using an entity name pattern;
determining whether to extract a first element list based at least in part on one or more heuristic rules or based at least in part on one or more pattern-based rules;
in response to determining to extract the first element list based at least in part on one or more heuristic rules, extracting, using the processing equipment, the first element list from the one or more documents based at least in part on a first entity name of the plurality of entity names and based at least in part on one or more heuristic rules;
validating the first element list based at least in part on a comparison of the first element list with one or more reference lists, wherein the first element list and the one or more reference lists are each associated with a same one of the plurality of entity names;
generating, using the processing equipment, an element list pattern based at least in part on the first element list and on the structured metadata of at least the first host, wherein generating the element list pattern includes determining an element node pattern, wherein the element node pattern is based at least in part on a document object model tree path from a document root node to an element node;
determining whether to extract a second element list based at least in part on one or more heuristic rules or based at least in part on one or more pattern-based rules; and
in response to determining to extract the second element list based at least in part on one or more pattern-based rules, extracting, using the processing equipment, the second element list from the one or more documents based at least in part on a second entity name and one or more pattern-based rules comprising the generated element list pattern.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A system comprising:
one or more computers configured to:
access, using processing equipment, one or more documents from which to extract structured metadata from each of a plurality of hosts;
extract, using the processing equipment, a plurality of entity names from one or more documents from a first host of the plurality of hosts using an entity name pattern;
determine whether to extract a first element list based at least in part on one or more heuristic rules or based at least in part on one or more pattern-based rules;
in response to the determination to extract the first element list based at least in part on one or more heuristic rules, extract, using the processing equipment, a first element list from the one or more documents based at least in part on a first entity name of the plurality of entity names and based at least in part on one or more heuristic rules;
validate the first element list based at least in part on a comparison of the first element list with one or more reference lists, wherein the first element list and the one or more reference lists are each associated with a same one of the plurality of entity names;
generate, using the processing equipment, an element list pattern based at least in part on the first element list and on the structured metadata of at least the first host, wherein generating the element list pattern includes determining an element node pattern, wherein the element node pattern is based at least in part on a document object model tree path from a document root node to an element node;
determine whether to extract a second element list based at least in part on one or more heuristic rules or based at least in part on one or more pattern-based rules; and
in response to determining to extract the second element list based at least in part on one or more pattern-based rules, extract, using the processing equipment, a second element list from the one or more documents based at least in part on a second entity name and one or more pattern-based rules comprising the generated element list pattern.
Dependent claims: 10, 11, 12, 13, 14, 15, 16
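Both independent claims recite validating the first element list by comparing it with reference lists associated with the same entity name, without saying how the comparison is scored. One possible reading is sketched below using Jaccard similarity with a threshold; the metric, the `0.5` default, the normalization step, and the function names are all assumptions made for illustration.

```python
def normalize(text):
    """Lower-case and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def jaccard(candidate, reference):
    """Jaccard similarity of two lists treated as sets of normalized strings."""
    a = {normalize(x) for x in candidate}
    b = {normalize(x) for x in reference}
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def validate(entity_name, candidate, reference_lists, threshold=0.5):
    """Accept the candidate if any reference list for the same entity is close.

    reference_lists maps an entity name to a list of reference element lists
    (for example, lists already extracted for that entity from other hosts).
    """
    references = reference_lists.get(entity_name, [])
    return any(jaccard(candidate, ref) >= threshold for ref in references)
```

Under this reading, a candidate list with no reference list for its entity name is rejected rather than accepted by default; a real system might instead defer the decision.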
Specification