
APPARATUS AND METHODS FOR CONCEPT-CENTRIC INFORMATION EXTRACTION

  • US 20100241639A1
  • Filed: 03/20/2009
  • Published: 09/23/2010
  • Est. Priority Date: 03/20/2009
  • Status: Abandoned Application
First Claim

1. A method of extracting structured information from web content, comprising:

  • representing web content of interest from a particular domain as one or more tree instances having a plurality of branching nodes that each correspond to a web object such that the tree instances correspond to one or more structured data instances, wherein the particular domain is associated with domain knowledge that includes one or more presentation rulesets that each specifies a particular structure for a set of data instances, a domain-specific concept labeler, one or more specified properties of the web objects in the tree instances, and a concept schema that specifies a representation of the data to be extracted from the web content;

  • extracting, from the one or more tree instances, a structured data instance that conforms to the concept schema based on the domain knowledge for the particular domain, wherein extraction of the structured data instances is accomplished by

      (i) using the domain-specific concept labeler to annotate a subset of nodes of the tree instances; and

      (ii) using a locally adaptive concept annotator to extract the structured data instances based on the annotated segments and the local properties associated with such annotated segments; and

  • storing the extracted structured data instance as structured output records in a database.
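
The three claimed steps (tree representation, two-stage extraction, storage) can be illustrated with a minimal Python sketch. It is not the applicants' implementation. Every name in it (Node, SCHEMA, label_concepts, annotate_locally, extract_and_store) is hypothetical, the "domain" is a toy product listing, the labeler is a single regular expression, the "locally adaptive" step is a simple sibling heuristic, and presentation rulesets are omitted entirely.

```python
import re
import sqlite3
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One web object in a tree instance (hypothetical representation)."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)
    label: Optional[str] = None  # filled in by the labeling steps below


# Hypothetical concept schema: the fields every output record must carry.
SCHEMA = ("name", "price")


def label_concepts(node: Node) -> None:
    """Step (i), toy version: label leaf nodes whose text looks like a price."""
    if not node.children and re.fullmatch(r"\$\d+(\.\d{2})?", node.text.strip()):
        node.label = "price"
    for child in node.children:
        label_concepts(child)


def annotate_locally(node: Node, records: list) -> None:
    """Step (ii), toy version: use the neighborhood of each labeled node.

    Assumption: the first unlabeled text leaf sharing a parent with a
    price node is the product name.
    """
    prices = [c for c in node.children if c.label == "price"]
    names = [c for c in node.children
             if c.label is None and not c.children and c.text.strip()]
    if prices and names:
        names[0].label = "name"
        records.append({"name": names[0].text.strip(),
                        "price": prices[0].text.strip()})
    for child in node.children:
        annotate_locally(child, records)


def extract_and_store(root: Node, conn: sqlite3.Connection) -> list:
    """Run both extraction stages, then store schema-conforming records."""
    label_concepts(root)
    records: list = []
    annotate_locally(root, records)
    # Keep only records that carry exactly the schema's fields.
    records = [r for r in records if tuple(r) == SCHEMA]
    conn.execute("CREATE TABLE IF NOT EXISTS records (name TEXT, price TEXT)")
    conn.executemany("INSERT INTO records VALUES (:name, :price)", records)
    conn.commit()
    return records


# Usage: a two-item listing represented as a tree instance.
page = Node("ul", children=[
    Node("li", children=[Node("span", "Blue Widget"), Node("span", "$9.99")]),
    Node("li", children=[Node("span", "Red Widget"), Node("span", "$12.50")]),
])
conn = sqlite3.connect(":memory:")
records = extract_and_store(page, conn)
```

Splitting the work into a global labeling pass and a local pass mirrors the (i)/(ii) division in the claim: the first pass needs only domain-level patterns, while the second relies on properties of the nodes surrounding each annotation.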

  • 3 Assignments