APPARATUS AND METHODS FOR CONCEPT-CENTRIC INFORMATION EXTRACTION
First Claim
1. A method of extracting structured information from web content, comprising:
representing web content of interest from a particular domain as one or more tree instances having a plurality of branching nodes that each correspond to a web object such that the tree instances correspond to one or more structured data instances, wherein the particular domain is associated with domain knowledge that includes one or more presentation rulesets that each specifies a particular structure for a set of data instances, a domain-specific concept labeler, one or more specified properties of the web objects in the tree instances, and a concept schema that specifies a representation of the data to be extracted from the web content;
extracting, from the one or more tree instances, a structured data instance that conforms to the concept schema based on the domain knowledge for the particular domain, wherein extraction of the structured data instances is accomplished by (i) using the domain-specific concept labeler to annotate a subset of nodes of the tree instances; and
(ii) using a locally adaptive concept annotator to extract the structured data instances based on the annotated segments and the local properties associated with such annotated segments; and
storing the extracted structured data instance as structured output records in a database.
Abstract
Disclosed are methods and apparatus for extracting (or annotating) structured information from web content. Web content of interest from a particular domain is represented as one or more tree instances having a plurality of branching nodes that each correspond to a web object such that the tree instances correspond to one or more structured data instances. The particular domain is associated with domain knowledge that includes one or more presentation rulesets that each specifies a particular structure for a set of data instances, a domain-specific concept labeler, one or more specified properties of the web objects in the tree instances, and a concept schema that specifies a representation of the data to be extracted from the web content. A structured data instance that conforms to the concept schema is extracted from the one or more tree instances based on the domain knowledge for the particular domain. Extraction of the structured data instances is accomplished by (i) using the domain-specific concept labeler to annotate a subset of nodes of the tree instances; and (ii) using a locally adaptive concept annotator to extract the structured data instances based on the annotated segments and the local properties associated with such annotated segments. The extracted structured data instance is stored as structured output records in a database.
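By way of illustration only, the flow summarized above might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the `Node` class, the two-field schema, the regular expression, the use of a CSS class as the "local property", the rule that each child of the root is one data instance, and the SQLite table are all hypothetical choices made for the example.

```python
import re
import sqlite3
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

SCHEMA = ("name", "price")  # hypothetical concept schema


@dataclass
class Node:
    """One web object in a tree instance (hypothetical structure)."""
    tag: str
    text: str = ""
    css_class: str = ""  # stands in for a "local property" of the web object
    children: list["Node"] = field(default_factory=list)
    label: Optional[str] = None


def walk(node):
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def label_concepts(root):
    """Domain-specific labeler: annotate only the nodes it recognizes."""
    for node in walk(root):
        if re.fullmatch(r"\$\d+(\.\d{2})?", node.text):
            node.label = "price"
        elif node.tag == "h2":
            node.label = "name"


def adapt_locally(root):
    """Extend labels to unlabeled nodes that share a CSS class with
    already-annotated nodes in the same tree."""
    votes = {}
    for node in walk(root):
        if node.label and node.css_class:
            votes.setdefault(node.css_class, Counter())[node.label] += 1
    for node in walk(root):
        if node.label is None and node.text and node.css_class in votes:
            node.label = votes[node.css_class].most_common(1)[0][0]


def extract_records(root):
    """Assumed presentation rule: each child of the root is one data instance."""
    records = []
    for item in root.children:
        record = {}
        for node in walk(item):
            if node.label in SCHEMA and node.label not in record:
                record[node.label] = node.text
        if all(concept in record for concept in SCHEMA):
            records.append(record)
    return records


def store(records, conn):
    """Write the extracted instances as rows of a database table."""
    conn.execute("CREATE TABLE IF NOT EXISTS items (name TEXT, price TEXT)")
    conn.executemany("INSERT INTO items VALUES (:name, :price)", records)
    conn.commit()


# A small listing page: the third item defeats the labeler's own rules
# but shares CSS classes with the first two.
page = Node("ul", children=[
    Node("li", children=[Node("h2", "Lamp", "title"),
                         Node("span", "$19.99", "cost")]),
    Node("li", children=[Node("h2", "Desk", "title"),
                         Node("span", "$120", "cost")]),
    Node("li", children=[Node("div", "Chair", "title"),
                         Node("span", "from $45", "cost")]),
])

label_concepts(page)   # step (i): annotate a subset of nodes
adapt_locally(page)    # step (ii): adapt to local properties of this page
records = extract_records(page)
conn = sqlite3.connect(":memory:")
store(records, conn)
```

In this sketch the labeler annotates only the first two items directly; the third is picked up because its nodes share CSS classes with nodes that were already annotated, which is one simple way to read "locally adaptive".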
21 Claims
1. A method of extracting structured information from web content, comprising:
representing web content of interest from a particular domain as one or more tree instances having a plurality of branching nodes that each correspond to a web object such that the tree instances correspond to one or more structured data instances, wherein the particular domain is associated with domain knowledge that includes one or more presentation rulesets that each specifies a particular structure for a set of data instances, a domain-specific concept labeler, one or more specified properties of the web objects in the tree instances, and a concept schema that specifies a representation of the data to be extracted from the web content;
extracting, from the one or more tree instances, a structured data instance that conforms to the concept schema based on the domain knowledge for the particular domain, wherein extraction of the structured data instances is accomplished by (i) using the domain-specific concept labeler to annotate a subset of nodes of the tree instances; and
(ii) using a locally adaptive concept annotator to extract the structured data instances based on the annotated segments and the local properties associated with such annotated segments; and
storing the extracted structured data instance as structured output records in a database.
Dependent claims: 2, 3, 4, 5, 6, 7
8. An apparatus comprising at least a processor and a memory, wherein the processor and/or memory are configured to perform the following operations:
representing web content of interest from a particular domain as one or more tree instances having a plurality of branching nodes that each correspond to a web object such that the tree instances correspond to one or more structured data instances, wherein the particular domain is associated with domain knowledge that includes one or more presentation rulesets that each specifies a particular structure for a set of data instances, a domain-specific concept labeler, one or more specified properties of the web objects in the tree instances, and a concept schema that specifies a representation of the data to be extracted from the web content;
extracting, from the one or more tree instances, a structured data instance that conforms to the concept schema based on the domain knowledge for the particular domain, wherein extraction of the structured data instances is accomplished by (i) using the domain-specific concept labeler to annotate a subset of nodes of the tree instances; and
(ii) using a locally adaptive concept annotator to extract the structured data instances based on the annotated segments and the local properties associated with such annotated segments; and
storing the extracted structured data instance as structured output records in a database.
Dependent claims: 9, 10, 11, 12, 13, 14
15. At least one computer readable storage medium having computer program instructions stored thereon that are arranged to perform the following operations:
representing web content of interest from a particular domain as one or more tree instances having a plurality of branching nodes that each correspond to a web object such that the tree instances correspond to one or more structured data instances, wherein the particular domain is associated with domain knowledge that includes one or more presentation rulesets that each specifies a particular structure for a set of data instances, a domain-specific concept labeler, one or more specified properties of the web objects in the tree instances, and a concept schema that specifies a representation of the data to be extracted from the web content;
extracting, from the one or more tree instances, a structured data instance that conforms to the concept schema based on the domain knowledge for the particular domain, wherein extraction of the structured data instances is accomplished by (i) using the domain-specific concept labeler to annotate a subset of nodes of the tree instances; and
(ii) using a locally adaptive concept annotator to extract the structured data instances based on the annotated segments and the local properties associated with such annotated segments; and
storing the extracted structured data instance as structured output records in a database.
Dependent claims: 16, 17, 18, 19, 20, 21
Specification