Method for automatic wrapper repair
First Claim
1. A method of information extraction from a Web page using an initial wrapper which has become partially inoperative, comprising:
wherein the initial wrapper comprises an initial set of rules for extracting information and for assigning labels from a wrapper set of labels to the extracted information;
extracting strings from the Web page parsed in forward direction using the initial set of rules;
analyzing the extracted strings according to the initial set of rules for assigning labels associated with the wrapper;
assigning labels to those strings which satisfy the label rules;
extracting strings from the Web page in backward (opposite) direction using the initial set of rules;
analyzing the extracted strings according to the set of rules for assigning labels associated with the wrapper; and
assigning labels to those unlabeled strings which satisfy the label rules;
wherein the initial wrapper W comprises a triple (T, L, R), where T is an input tokenizer, L is a semantic label set and R is a set of extraction rules R = {n}, where each rule n is a triple (p, s, l), where p ∈ Sl and s ∈ Su are prefix and suffix, and l ∈ L.
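The two-pass procedure in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `Rule`, `Wrapper`, `tokenize_html` and `extract`, the regex tokenizer, and the sample page are all hypothetical. The sketch also assumes a simplified matching scheme: in the forward pass a rule fires when its prefix p immediately precedes a text token, and in the backward pass, which reads the page from the end, a rule fires when its suffix s immediately follows a still-unlabeled token.

```python
from __future__ import annotations

import re
from typing import Callable, NamedTuple


class Rule(NamedTuple):
    prefix: tuple[str, ...]  # p: tokens expected immediately before a value
    suffix: tuple[str, ...]  # s: tokens expected immediately after a value
    label: str               # l: a member of the label set L


class Wrapper(NamedTuple):
    tokenize: Callable[[str], list[str]]  # T: input tokenizer
    labels: frozenset[str]                # L: semantic label set
    rules: tuple[Rule, ...]               # R: extraction rules


def tokenize_html(page: str) -> list[str]:
    """Hypothetical tokenizer: every tag and every text run is one token."""
    parts = re.findall(r"<[^>]+>|[^<]+", page)
    return [p.strip() for p in parts if p.strip()]


def is_tag(token: str) -> bool:
    return token.startswith("<")


def extract(wrapper: Wrapper, page: str, backward: bool = True):
    """Label text tokens in a forward pass, then optionally in a backward pass."""
    tokens = wrapper.tokenize(page)
    labeled: dict[int, str] = {}

    # Forward pass (assumed semantics): match the rule prefix before the token.
    for i, tok in enumerate(tokens):
        if is_tag(tok):
            continue
        for rule in wrapper.rules:
            n = len(rule.prefix)
            if 0 < n <= i and tuple(tokens[i - n:i]) == rule.prefix:
                labeled[i] = rule.label
                break

    # Backward pass (assumed semantics): walk from the end of the page and
    # match the rule suffix after each token the forward pass left unlabeled.
    if backward:
        for i in range(len(tokens) - 1, -1, -1):
            if is_tag(tokens[i]) or i in labeled:
                continue
            for rule in wrapper.rules:
                m = len(rule.suffix)
                if m > 0 and tuple(tokens[i + 1:i + 1 + m]) == rule.suffix:
                    labeled[i] = rule.label
                    break

    return [(t, labeled.get(i)) for i, t in enumerate(tokens) if not is_tag(t)]


# Rules written for the old layout <b>title</b><i>author</i>.
wrapper = Wrapper(
    tokenize=tokenize_html,
    labels=frozenset({"title", "author"}),
    rules=(
        Rule(("<b>",), ("</b>",), "title"),
        Rule(("<i>",), ("</i>",), "author"),
    ),
)

# The page changed: the opening <i> tag gained an attribute.
page = '<b>Wrapper Repair</b><i class="by">B. Smith</i>'
print(extract(wrapper, page))
# [('Wrapper Repair', 'title'), ('B. Smith', 'author')]
```

In this example the forward pass alone leaves "B. Smith" unlabeled, because the changed opening tag no longer equals the rule's prefix. The backward pass recovers the label from the unchanged closing tag.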
Abstract
A method of information extraction from a Web page uses an initial wrapper which has become partially inoperative, where the initial wrapper comprises an initial set of rules for extracting information and for assigning labels from a wrapper set of labels to the extracted information. The method includes using the initial set of rules to extract strings from the Web page parsed in the forward direction; analyzing the extracted strings according to the initial set of rules for assigning labels associated with the wrapper; assigning labels to those strings which satisfy the label rules; using the initial set of rules to extract strings from the Web page in the backward (opposite) direction; analyzing the extracted strings according to the set of rules for assigning labels associated with the wrapper; and assigning labels to those unlabeled strings which satisfy the label rules.
4 Claims
Claim 1 is given in full under First Claim above; claims 2, 3 and 4 are dependent claims.
Specification