APPARATUS AND METHODS FOR CONCEPT-CENTRIC INFORMATION EXTRACTION

US 20100241639A1
Filed: 03/20/2009
Published: 09/23/2010
Est. Priority Date: 03/20/2009
Status: Abandoned Application

First Claim

1. A method of extracting structured information from web content, comprising:

representing web content of interest from a particular domain as one or more tree instances having a plurality of branching nodes that each correspond to a web object such that the tree instances correspond to one or more structured data instances, wherein the particular domain is associated with domain knowledge that includes one or more presentation rulesets that each specifies a particular structure for a set of data instances, a domain-specific concept labeler, one or more specified properties of the web objects in the tree instances, and a concept schema that specifies a representation of the data to be extracted from the web content;

extracting, from the one or more tree instances, a structured data instance that conforms to the concept schema based on the domain knowledge for the particular domain, wherein extraction of the structured data instances is accomplished by (i) using the domain-specific concept labeler to annotate a subset of nodes of the tree instances; and

(ii) using a locally adaptive concept annotator to extract the structured data instances based on the annotated segments and the local properties associated with such annotated segments; and

storing the extracted structured data instance as structured output records in a database.

3 Assignments

0 Petitions

Abstract

Disclosed are methods and apparatus for extracting (or annotating) structured information from web content. Web content of interest from a particular domain is represented as one or more tree instances having a plurality of branching nodes that each correspond to a web object such that the tree instances correspond to one or more structured data instances. The particular domain is associated with domain knowledge that includes one or more presentation rulesets that each specifies a particular structure for a set of data instances, a domain-specific concept labeler, one or more specified properties of the web objects in the tree instances, and a concept schema that specifies a representation of the data to be extracted from the web content. A structured data instance that conforms to the concept schema is extracted from the one or more tree instances based on the domain knowledge for the particular domain. Extraction of the structured data instances is accomplished by (i) using the domain-specific concept labeler to annotate a subset of nodes of the tree instances; and (ii) using a locally adaptive concept annotator to extract the structured data instances based on the annotated segments and the local properties associated with such annotated segments. The extracted structured data instance is stored as structured output records in a database.
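
The flow described in the abstract (tree representation, concept labeling, extraction, storage) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the application: the `Node` class, the toy labeling rules, and the `product` table are hypothetical. The locally adaptive annotator step is left out to keep the example short.

```python
import sqlite3
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One web object in a tree instance (hypothetical minimal form)."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)
    label: Optional[str] = None  # concept annotation, if any


def walk(node):
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def label_concepts(root):
    """Toy stand-in for a domain-specific concept labeler (assumed rules)."""
    for node in walk(root):
        if node.text.startswith("$"):
            node.label = "price"
        elif node.tag == "h2":
            node.label = "name"


def extract_record(root):
    """Collect labeled values into one attribute/value record."""
    return {n.label: n.text for n in walk(root) if n.label}


def store(records, conn):
    """Write records to a hypothetical `product` table."""
    conn.execute("CREATE TABLE IF NOT EXISTS product (name TEXT, price TEXT)")
    conn.executemany("INSERT INTO product VALUES (:name, :price)", records)
    conn.commit()


page = Node("div", children=[Node("h2", "Blue Kettle"), Node("span", "$19.99")])
label_concepts(page)
record = extract_record(page)

conn = sqlite3.connect(":memory:")
store([record], conn)
rows = conn.execute("SELECT name, price FROM product").fetchall()
```

Here the concept schema is implied by the two table columns; a fuller version would check each record against an explicit schema before storing it.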

89 Citations

21 Claims

1. A method of extracting structured information from web content, comprising:
- representing web content of interest from a particular domain as one or more tree instances having a plurality of branching nodes that each correspond to a web object such that the tree instances correspond to one or more structured data instances, wherein the particular domain is associated with domain knowledge that includes one or more presentation rulesets that each specifies a particular structure for a set of data instances, a domain-specific concept labeler, one or more specified properties of the web objects in the tree instances, and a concept schema that specifies a representation of the data to be extracted from the web content;
  
  extracting, from the one or more tree instances, a structured data instance that conforms to the concept schema based on the domain knowledge for the particular domain, wherein extraction of the structured data instances is accomplished by (i) using the domain-specific concept labeler to annotate a subset of nodes of the tree instances; and
  
  (ii) using a locally adaptive concept annotator to extract the structured data instances based on the annotated segments and the local properties associated with such annotated segments; and
  
  storing the extracted structured data instance as structured output records in a database.
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. The method as recited in claim 1, wherein the concept schema corresponds to a labeled tree having a plurality of nodes representing named concepts and a plurality of leaf nodes representing atomic concepts.
  - 3. The method as recited in claim 1, wherein the concept schema specifies one or more atomic concepts or attributes that form part of a particular entity record and the extracted structured data instances are in the form of a list of records that each have a plurality of attribute and value pairs.
  - 4. The method as recited in claim 1, wherein the one or more presentation rulesets each specify hierarchical and ordering relations between instances of a plurality of presentation groups in the one or more tree instances that facilitate mapping of annotated atomic values to concepts in the concept schema.
  - 5. The method as recited in claim 4, wherein the presentation rulesets are specified as a set of context-free grammars with respect to the one or more tree instances.
  - 6. The method as recited in claim 1, wherein using the locally adaptive concept annotator is accomplished by:
    - generating a candidate pool of annotatable segments of the one or more tree instances for the concept schema;
      
      identifying a set of predictive local features of the annotatable segments;
      
      learning a model for a locally adaptive concept annotator based on the identified set of predictive local features and the annotated segments; and
      
      executing the learned model on the candidate annotatable segments.
  - 7. The method as recited in claim 1, wherein the extraction is accomplished by:
    - (a) choosing a set of selected informative queries for annotations;
      
      (b) selecting and executing a current extraction operator from a plurality of operators for receiving the selected set of informative queries and producing a set of current annotations, wherein the selection of the current extraction operation is based on which operators have their input conditions met by a current annotated state of the tree instances and can produce the annotations of the selected informative queries;
      
      (c) repeating operations (a) and (b) until a structured data instance that conforms to the concept schema is obtained.
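
The four steps of claim 6 (candidate pool, local features, model learning, model execution) can be sketched as follows. This is only an illustration under stated assumptions: the segment fields, the `KNOWN_NAMES` seed set standing in for the domain-specific labeler, and the frequency-ratio score used as the "model" are all hypothetical choices, not the method disclosed in the application.

```python
from collections import Counter

# Candidate pool: each segment carries local, page-specific properties.
# All field names are hypothetical.
candidates = [
    {"text": "Blue Kettle", "tag": "h2", "css": "title"},
    {"text": "Zorblax 3000", "tag": "h2", "css": "title"},
    {"text": "Free shipping", "tag": "p", "css": "promo"},
    {"text": "Red Toaster", "tag": "h2", "css": "title"},
]

# Stand-in for the domain labeler: precise, but it misses unknown names.
KNOWN_NAMES = {"Blue Kettle", "Red Toaster"}


def local_features(segment):
    """Local features of a segment: here just its tag and CSS class."""
    return (segment["tag"], segment["css"])


def learn_local_model(pool, seeds):
    """Score each feature by the fraction of its segments that are seeds."""
    positive = Counter(local_features(s) for s in pool if s["text"] in seeds)
    total = Counter(local_features(s) for s in pool)
    # Only features seen with at least one seed get a score.
    return {feat: positive[feat] / total[feat] for feat in positive}


def annotate(pool, model, threshold=0.5):
    """Run the learned model over every candidate segment."""
    return [s["text"] for s in pool
            if model.get(local_features(s), 0.0) >= threshold]


model = learn_local_model(candidates, KNOWN_NAMES)
names = annotate(candidates, model)
```

The model is learned from the page being processed, so a name the labeler has never seen ("Zorblax 3000") is still picked up because it shares local presentation with the seed annotations.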

8. An apparatus comprising at least a processor and a memory, wherein the processor and/or memory are configured to perform the following operations:
- representing web content of interest from a particular domain as one or more tree instances having a plurality of branching nodes that each correspond to a web object such that the tree instances correspond to one or more structured data instances, wherein the particular domain is associated with domain knowledge that includes one or more presentation rulesets that each specifies a particular structure for a set of data instances, a domain-specific concept labeler, one or more specified properties of the web objects in the tree instances, and a concept schema that specifies a representation of the data to be extracted from the web content;
  
  extracting, from the one or more tree instances, a structured data instance that conforms to the concept schema based on the domain knowledge for the particular domain, wherein extraction of the structured data instances is accomplished by (i) using the domain-specific concept labeler to annotate a subset of nodes of the tree instances; and
  
  (ii) using a locally adaptive concept annotator to extract the structured data instances based on the annotated segments and the local properties associated with such annotated segments; and
  
  storing the extracted structured data instance as structured output records in a database.
- Dependent claims (9, 10, 11, 12, 13, 14):
  - 9. The apparatus as recited in claim 8, wherein the concept schema corresponds to a labeled tree having a plurality of nodes representing named concepts and a plurality of leaf nodes representing atomic concepts.
  - 10. The apparatus as recited in claim 8, wherein the concept schema specifies one or more atomic concepts or attributes that form part of a particular entity record and the extracted structured data instances are in the form of a list of records that each have a plurality of attribute and value pairs.
  - 11. The apparatus as recited in claim 8, wherein the one or more presentation rulesets each specify hierarchical and ordering relations between instances of a plurality of presentation groups in the one or more tree instances that facilitate mapping of annotated atomic values to concepts in the concept schema.
  - 12. The apparatus as recited in claim 11, wherein the presentation rulesets are specified as a set of context-free grammars with respect to the one or more tree instances.
  - 13. The apparatus as recited in claim 8, wherein using the locally adaptive concept annotator is accomplished by:
    - generating a candidate pool of annotatable segments of the one or more tree instances for the concept schema;
      
      identifying a set of predictive local features of the annotatable segments;
      
      learning a model for a locally adaptive concept annotator based on the identified set of predictive local features and the annotated segments; and
      
      executing the learned model on the candidate annotatable segments.
  - 14. The apparatus as recited in claim 8, wherein the extraction is accomplished by:
    - (a) choosing a set of selected informative queries for annotations;
      
      (b) selecting and executing a current extraction operator from a plurality of operators for receiving the selected set of informative queries and producing a set of current annotations, wherein the selection of the current extraction operation is based on which operators have their input conditions met by a current annotated state of the tree instances and can produce the annotations of the selected informative queries;
      
      (c) repeating operations (a) and (b) until a structured data instance that conforms to the concept schema is obtained.
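
The loop in operations (a) through (c) of claim 14 can be sketched as a small planner. The operator names, their `needs` and `produces` sets, and the rule of treating every missing concept as an "informative query" are assumptions made for this example; the claim does not say how queries are ranked or how ties between operators are broken.

```python
# Hypothetical operators: each declares the annotations it requires as
# input and the annotations it is able to produce.
OPERATORS = [
    {"name": "dictionary_labeler", "needs": set(), "produces": {"name"}},
    {"name": "regex_price", "needs": set(), "produces": {"price"}},
    {"name": "local_annotator", "needs": {"name", "price"},
     "produces": {"rating"}},
]

SCHEMA = {"name", "price", "rating"}  # concepts a complete record needs


def choose_queries(schema, state):
    """(a) Treat every concept still missing from the state as a query."""
    return schema - state


def select_operator(queries, state):
    """(b) First operator whose inputs are met and that answers a query."""
    for op in OPERATORS:
        if op["needs"] <= state and op["produces"] & queries:
            return op
    return None


def run(schema):
    """(c) Repeat (a) and (b) until the annotated state covers the schema."""
    state, trace = set(), []
    while True:
        queries = choose_queries(schema, state)
        if not queries:
            return state, trace
        op = select_operator(queries, state)
        if op is None:
            raise RuntimeError(f"no applicable operator for {sorted(queries)}")
        state |= op["produces"] & queries
        trace.append(op["name"])


state, trace = run(SCHEMA)
```

In this toy setup `local_annotator` cannot run until both `name` and `price` annotations exist, which shows how the current annotated state gates operator selection.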

15. At least one computer readable storage medium having computer program instructions stored thereon that are arranged to perform the following operations:
- representing web content of interest from a particular domain as one or more tree instances having a plurality of branching nodes that each correspond to a web object such that the tree instances correspond to one or more structured data instances, wherein the particular domain is associated with domain knowledge that includes one or more presentation rulesets that each specifies a particular structure for a set of data instances, a domain-specific concept labeler, one or more specified properties of the web objects in the tree instances, and a concept schema that specifies a representation of the data to be extracted from the web content;
  
  extracting, from the one or more tree instances, a structured data instance that conforms to the concept schema based on the domain knowledge for the particular domain, wherein extraction of the structured data instances is accomplished by (i) using the domain-specific concept labeler to annotate a subset of nodes of the tree instances; and
  
  (ii) using a locally adaptive concept annotator to extract the structured data instances based on the annotated segments and the local properties associated with such annotated segments; and
  
  storing the extracted structured data instance as structured output records in a database.
- Dependent claims (16, 17, 18, 19, 20, 21):
  - 16. The at least one computer readable storage medium as recited in claim 15, wherein the concept schema corresponds to a labeled tree having a plurality of nodes representing named concepts and a plurality of leaf nodes representing atomic concepts.
  - 17. The at least one computer readable storage medium as recited in claim 15, wherein the concept schema specifies one or more atomic concepts or attributes that form part of a particular entity record and the extracted structured data instances are in the form of a list of records that each have a plurality of attribute and value pairs.
  - 18. The at least one computer readable storage medium as recited in claim 15, wherein the one or more presentation rulesets each specify hierarchical and ordering relations between instances of a plurality of presentation groups in the one or more tree instances that facilitate mapping of annotated atomic values to concepts in the concept schema.
  - 19. The at least one computer readable storage medium as recited in claim 18, wherein the presentation rulesets are specified as a set of context-free grammars with respect to the one or more tree instances.
  - 20. The at least one computer readable storage medium as recited in claim 18, wherein using the locally adaptive concept annotator is accomplished by:
    - generating a candidate pool of annotatable segments of the one or more tree instances for the concept schema;
      
      identifying a set of predictive local features of the annotatable segments;
      
      learning a model for a locally adaptive concept annotator based on the identified set of predictive local features and the annotated segments; and
      
      executing the learned model on the candidate annotatable segments.
  - 21. The at least one computer readable storage medium as recited in claim 18, wherein the extraction is accomplished by:
    - (a) choosing a set of selected informative queries for annotations;
      
      (b) selecting and executing a current extraction operator from a plurality of operators for receiving the selected set of informative queries and producing a set of current annotations, wherein the selection of the current extraction operation is based on which operators have their input conditions met by a current annotated state of the tree instances and can produce the annotations of the selected informative queries;
      
      (c) repeating operations (a) and (b) until a structured data instance that conforms to the concept schema is obtained.
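
Claims 16 and 17 describe the concept schema as a labeled tree with atomic leaf concepts, and the output as records of attribute and value pairs. One possible encoding is sketched below. The nested-dict layout, the `"atomic"` marker, and the dotted attribute names are assumptions chosen for this example and are not taken from the application.

```python
# Hypothetical concept schema as a labeled tree: nested dicts are named
# (composite) concepts, and the string "atomic" marks a leaf concept.
SCHEMA = {
    "product": {
        "name": "atomic",
        "offer": {"price": "atomic", "currency": "atomic"},
    }
}


def atomic_paths(schema, prefix=()):
    """Return the label path of every atomic (leaf) concept in the tree."""
    paths = []
    for label, child in schema.items():
        path = prefix + (label,)
        if child == "atomic":
            paths.append(path)
        else:
            paths.extend(atomic_paths(child, path))
    return paths


def conforms(record, schema):
    """True if the record's attributes are exactly the schema's leaves."""
    expected = {".".join(path) for path in atomic_paths(schema)}
    return set(record) == expected


record = {
    "product.name": "Blue Kettle",
    "product.offer.price": "19.99",
    "product.offer.currency": "USD",
}
is_valid = conforms(record, SCHEMA)
```

A check like `conforms` is one way to decide when an extracted structured data instance "conforms to the concept schema" before it is stored.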

Current Assignee
Oath Inc. (Verizon Communications Inc.)
Original Assignee
Yahoo! Inc. (Apollo Global Management, Inc.)
Inventors
Jain, Ankur; Ramakrishnan, Raghu; Kifer, Daniel; Selvaraj, Sathiya Keerthi; Merugu, Srujana; Bohannon, Philip L.; Kirpal, Alok S.

Application Number
US12/408,450
Publication Number
US 20100241639A1
US Class Current
707/754
CPC Class Codes
G06F 16/313 Selection or weighting of t...
G06F 16/345 Summarisation for human users
