Method for learning and combining global and local regularities for information extraction and classification
Abstract
A method is provided for information extraction and classification which combines aspects of local regularities formulation with global regularities formulation. A candidate subset is identified. Then tentative labels are created so they can be associated with elements in the subset that have the global regularities, and the initial tentative labels are attached onto the identified elements of the candidate subset. The attached tentative labels are employed to formulate or “learn” initial local regularities. Further tentative labels are created so they can be associated with elements in the subset that have a combination of global and local regularities, and the further tentative labels are attached onto the identified elements of the candidate subset. Each new dataset is processed with reference to an increasingly-refined set of global regularities, and the output data with their associated confidence labels can be readily evaluated as to import and relevance.
15 Claims
1. In a data processing system, a method for creating a database from information found on a plurality of web pages, said information comprising global regularities and local regularities, said global regularities being patterns that are expected to be found in all said web pages, and said local regularities being patterns which are not expected to be found in all said web pages, said method comprising:
a) providing a global classifier using said global regularities;
b) identifying a candidate subset of the web pages expected to have said local regularities;
thereafter c) tentatively identifying and tagging, in said candidate subset of the web pages, elements having said global regularities, by using said global classifier to obtain first tentative labels;
d) training a local classifier using said first tentative labels, said local classifier using said local regularities for its classification;
e) tentatively identifying elements having specific combinations of said global regularities and said local regularities using said global classifier and said local classifier to obtain second tentative labels for said elements of said candidate subset; and
thereafter f) outputting said second tentative labels as permanent labels associated with said elements of said candidate subset of web pages.
Dependent claims: 2, 3, 4, 5, 6, 7
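The sequence of steps a) through f) in claim 1 can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and does not come from the patent: the job-title scenario, the keyword pattern standing in for a global regularity, the "markup" field standing in for a local regularity, the 0.5 threshold, and all function names. The list passed in is assumed to already be the candidate subset of step b), for example the elements of pages from a single site.

```python
import re
from collections import Counter

# Assumed global regularity: job titles tend to contain one of these words.
GLOBAL_PATTERN = re.compile(r"\b(engineer|manager|analyst)\b", re.IGNORECASE)


def global_classifier(element):
    """Step a): label an element using only site-independent cues."""
    return bool(GLOBAL_PATTERN.search(element["text"]))


def train_local_classifier(elements, tentative_labels):
    """Step d): for each markup cue, learn how often it was tentatively labeled positive."""
    seen, positive = Counter(), Counter()
    for element, label in zip(elements, tentative_labels):
        seen[element["markup"]] += 1
        positive[element["markup"]] += int(label)
    rates = {markup: positive[markup] / seen[markup] for markup in seen}
    return lambda element: rates.get(element["markup"], 0.0)


def label_candidate_subset(elements, local_threshold=0.5):
    # Step c): first tentative labels from the global classifier alone.
    first_labels = [global_classifier(e) for e in elements]
    # Step d): local classifier trained on those tentative labels.
    local_score = train_local_classifier(elements, first_labels)
    # Step e): second tentative labels combine both classifiers.
    second_labels = [
        global_classifier(e) or local_score(e) >= local_threshold
        for e in elements
    ]
    # Step f): the second tentative labels are output as the permanent labels.
    return second_labels


# Hypothetical candidate subset: elements scraped from one site.
elements = [
    {"text": "Software Engineer", "markup": "h3.title"},
    {"text": "Product Manager", "markup": "h3.title"},
    {"text": "Barista", "markup": "h3.title"},
    {"text": "About us", "markup": "p"},
    {"text": "Free parking", "markup": "p"},
]
labels = label_candidate_subset(elements)
print(labels)  # [True, True, True, False, False]
```

In this toy data "Barista" matches no global keyword, but it shares the markup cue that the global classifier labeled positive two times out of three, so the combined step labels it as a job title.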
8. In a data processing system, a method for learning and combining global regularities and local regularities for information extraction and classification, said global regularities being patterns which may be found over an entire dataset and said local regularities being patterns found in less than the entire dataset, said method comprising the steps of:
a) identifying a candidate subset of the dataset in which said local regularities may be found;
thereafter b) tentatively identifying elements having said global regularities in a candidate subset to obtain first tentative labels, said first tentative labels being useful for tagging information having identifiable similarities;
thereafter c) attaching said first tentative labels onto said identified elements of said candidate subset;
thereafter d) employing said attached first tentative labels via one of a class of inductive operations to formulate first local regularities;
thereafter e) tentatively identifying elements having specific combinations of said global regularities and said first local regularities to obtain attached second tentative labels;
thereafter f) rating confidence of said attached second tentative labels and converting selected ones of said attached second tentative labels to confidence labels upon achieving a preselected confidence level; and
then outputting data with said confidence labels;
otherwise g) employing said second tentative labels via said operation on said candidate subset to formulate second local regularities, and
h) repeating from step e) until said confidence labels have been fully developed.
Dependent claims: 9, 10, 11, 12, 13, 14
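The loop formed by steps e) through h) of claim 8 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the keyword pattern, the "markup" cue, the way confidence is scored, the requirement that every label reach the preselected level before output (the claim converts selected labels individually), and the max_rounds safeguard are all assumptions introduced here.

```python
import re
from collections import Counter

# Assumed global regularity (illustrative only).
PATTERN = re.compile(r"\b(engineer|manager|analyst)\b", re.IGNORECASE)


def global_score(element):
    return 1.0 if PATTERN.search(element["text"]) else 0.0


def induce_local_rates(elements, labels):
    """Steps d) and g): a simple inductive operation, the positive rate per markup cue."""
    seen, positive = Counter(), Counter()
    for element, label in zip(elements, labels):
        seen[element["markup"]] += 1
        positive[element["markup"]] += int(label)
    return {markup: positive[markup] / seen[markup] for markup in seen}


def iterate_labels(elements, confidence_level=0.9, max_rounds=10):
    # Steps b) and c): first tentative labels from the global regularity.
    labels = [global_score(e) >= 0.5 for e in elements]
    for round_number in range(1, max_rounds + 1):
        # Step d) on the first pass, step g) on later passes.
        rates = induce_local_rates(elements, labels)
        # Step e): combine global and local evidence into second tentative labels.
        scores = [max(global_score(e), rates[e["markup"]]) for e in elements]
        labels = [score >= 0.5 for score in scores]
        # Step f): rate the confidence of each label.
        confidences = [
            score if label else 1.0 - score
            for score, label in zip(scores, labels)
        ]
        if min(confidences) >= confidence_level:
            break  # "then": output data with confidence labels
        # "otherwise": step h), repeat with the refined labels
    return labels, confidences, round_number


# Hypothetical candidate subset from step a).
elements = [
    {"text": "Software Engineer", "markup": "h3.title"},
    {"text": "Product Manager", "markup": "h3.title"},
    {"text": "Barista", "markup": "h3.title"},
    {"text": "About us", "markup": "p"},
    {"text": "Free parking", "markup": "p"},
]
final_labels, final_confidences, rounds = iterate_labels(elements)
print(final_labels, rounds)  # [True, True, True, False, False] 2
```

On the first pass "Barista" is labeled positive with a confidence of only about 0.67, below the assumed level of 0.9. The second pass re-induces the local rates from the second tentative labels, the markup cue becomes fully positive, and every label clears the level, so the loop stops after two rounds.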
15. In a data processing system, a method for learning and combining regularities of a first level and regularities of at least a second level and a third level for information extraction and classification, said first, second and third levels having a hierarchy from most global to most specific, said method comprising the steps of:
a) beginning at the most global level, training a classifier at the selected level by initially providing descriptions of regularities at said selected level to a working database, said selected level regularities being patterns which are to be found over a selected portion of a selected dataset corresponding to the selected level;
thereafter b) identifying a candidate subset of each selected dataset in which next more specific regularities may be found;
thereafter c) tentatively identifying elements having said selected regularities in said candidate subset to obtain first tentative labels, said first tentative labels being useful for tagging mutually similar information;
thereafter d) attaching said first tentative labels onto said identified elements of said candidate subset;
thereafter e) employing said attached first tentative labels via one of a class of inductive operations to formulate first local regularities;
thereafter f) tentatively identifying elements having specific combinations of said global regularities and said local regularities to obtain attached second tentative labels;
thereafter g) rating confidence of said attached second tentative labels and converting selected ones of said attached second tentative labels to confidence labels upon achieving a preselected confidence level; and
then h) outputting data with said confidence labels;
otherwise i) employing said second tentative labels via said operation on said candidate subset to formulate second more specific regularities;
k) repeating from step f); and
l) repeating from step a) for each successive more selective level of regularity.
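The level-by-level descent of claim 15, from the most global regularities to the most specific, can be sketched in simplified form. The sketch below is an assumption-laden illustration: the three levels (whole dataset, site, page), the field names, the keyword pattern, and the 0.5 threshold are hypothetical, and the confidence rating and inner repetition of steps g) through k) are omitted to keep the hierarchy itself visible.

```python
import re
from collections import Counter, defaultdict

# Assumed most-global regularity (illustrative only).
PATTERN = re.compile(r"\b(engineer|manager|analyst)\b", re.IGNORECASE)

# Assumed hierarchy below the global level, most global first.
# Each entry: (field that splits the data into candidate subsets,
#              field whose values act as the regularity at that level).
LEVELS = [("site", "markup"), ("page", "position")]


def refine_by_levels(elements, threshold=0.5):
    # Most global level: tentative labels from the global pattern.
    labels = [bool(PATTERN.search(e["text"])) for e in elements]
    for group_key, cue in LEVELS:
        # Step b): candidate subsets for the next more specific level.
        subsets = defaultdict(list)
        for index, element in enumerate(elements):
            subsets[element[group_key]].append(index)
        refined = list(labels)
        for indices in subsets.values():
            # Step e): induce regularities from the labels inherited so far.
            seen, positive = Counter(), Counter()
            for i in indices:
                seen[elements[i][cue]] += 1
                positive[elements[i][cue]] += int(labels[i])
            # Step f): combine inherited labels with the newly induced regularity.
            for i in indices:
                rate = positive[elements[i][cue]] / seen[elements[i][cue]]
                refined[i] = labels[i] or rate >= threshold
        labels = refined  # step l): carry the labels down to the next level
    return labels


# Hypothetical dataset: one site, two pages.
elements = [
    {"text": "Software Engineer", "site": "a.example", "page": "a/1", "markup": "h3", "position": "top"},
    {"text": "Data Analyst", "site": "a.example", "page": "a/1", "markup": "h3", "position": "top"},
    {"text": "Barista", "site": "a.example", "page": "a/2", "markup": "h3", "position": "top"},
    {"text": "Free parking", "site": "a.example", "page": "a/2", "markup": "p", "position": "bottom"},
    {"text": "Courier", "site": "a.example", "page": "a/2", "markup": "li", "position": "top"},
    {"text": "Sales Manager", "site": "a.example", "page": "a/2", "markup": "h3", "position": "top"},
]
hierarchical_labels = refine_by_levels(elements)
print(hierarchical_labels)  # [True, True, True, False, True, True]
```

In this toy data the site-level markup cue recovers "Barista", and the page-level position cue then recovers "Courier", which neither the global pattern nor the site-level cue would have labeled on its own.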
Specification