Method for learning and combining global and local regularities for information extraction and classification

US 6,892,189 B2
Filed: 01/26/2001
Issued: 05/10/2005
Est. Priority Date: 01/26/2001
Status: Expired due to Term

5 Assignments

0 Petitions

Abstract

A method is provided for information extraction and classification which combines aspects of local regularities formulation with global regularities formulation. A candidate subset is identified. Then tentative labels are created so they can be associated with elements in the subset that have the global regularities, and the initial tentative labels are attached onto the identified elements of the candidate subset. The attached tentative labels are employed to formulate or “learn” initial local regularities. Further tentative labels are created so they can be associated with elements in the subset that have a combination of global and local regularities, and the further tentative labels are attached onto the identified elements of the candidate subset. Each new dataset is processed with reference to an increasingly-refined set of global regularities, and the output data with their associated confidence labels can be readily evaluated as to import and relevance.
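
The flow in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and does not come from the patent: the page representation as `(tag, text)` pairs, the title-prefix rule standing in for a global regularity, and the helper names `global_label`, `learn_local`, and `relabel`.

```python
from collections import Counter

# Hypothetical page: each element is (html_tag, text). Not from the patent.
PAGE = [
    ("h1", "Acme Corp Staff"),
    ("b", "Dr. Jane Roe"),
    ("b", "Dr. John Doe"),
    ("b", "Pat Smith"),
    ("p", "Contact us by phone."),
]

def global_label(text):
    """Assumed global regularity: person names often start with a title."""
    return "PERSON" if text.startswith(("Dr. ", "Mr. ", "Ms. ")) else None

def learn_local(page, tentative):
    """Local regularity: the tag that most often holds tentatively labeled elements."""
    tags = Counter(tag for (tag, _), lab in zip(page, tentative) if lab == "PERSON")
    return tags.most_common(1)[0][0] if tags else None

def relabel(page, local_tag):
    """Second pass: accept elements matching the global rule or the local formatting."""
    return [
        "PERSON" if global_label(text) or tag == local_tag else None
        for tag, text in page
    ]

first = [global_label(text) for _, text in PAGE]  # first tentative labels
local_tag = learn_local(PAGE, first)              # learned local regularity
second = relabel(PAGE, local_tag)                 # second tentative labels
```

In this toy page, "Pat Smith" carries no title, so the global rule misses it; it is picked up only in the second pass because it shares the formatting of the elements the global rule did recognize.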

79 Citations

15 Claims

1. In a data processing system, a method for creating a database from information found on a plurality of web pages, said information comprising global regularities and local regularities, said global regularities being patterns that are expected to be found in all said web pages, and said local regularities being patterns which are not expected to be found in all said web pages, said method comprising:
- a) providing a global classifier using said global regularities;
  
  b) identifying a candidate subset of the web pages expected to have said local regularities;
  
  thereafter c) tentatively identifying and tagging, in said candidate subset of the web pages, elements having said global regularities, by using said global classifier to obtain first tentative labels;
  
  d) training a local classifier using said first tentative labels, said local classifier using said local regularities for its classification;
  
  e) tentatively identifying elements having specific combinations of said global regularities and said local regularities using said global classifier and said local classifier to obtain second tentative labels for said elements of said candidate subset; and
  
  thereafter f) outputting said second tentative labels as permanent labels associated with said elements of said candidate subset of web pages.
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. The method of claim 1 further including:
    - g) deciding whether to retrain said local classifier with said second tentative labels.
  - 3. The method according to claim 2, further including:
    - h) training the local classifier using said second tentative labels.
  - 4. The method according to claim 2, further including:
    - h) collecting said permanent labels associated with said elements of said candidate subset of web pages;
    - i) training said global classifier in response to said permanent labels.
  - 5. The method according to claim 1 wherein said second classifier treats selected global regularities differently than said global classifier treats said global regularities such that said local regularities contradict said global regularities.
  - 6. The method according to claim 5 wherein said outputting step further includes ignoring training results of said global classifier.
  - 7. The method according to claim 5 wherein said outputting step further includes combining training results of said global classifier and said local classifier.
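
Claims 5 to 7 cover the case where a local regularity contradicts a global one, and the output either ignores the global classifier or combines both. One possible reading is sketched below; the function `combine`, its `mode` and `weight` parameters, and the weighted average itself are assumptions for illustration, since the claims do not specify how results are combined.

```python
def combine(global_score, local_score, mode="combine", weight=0.5):
    """Merge two classifier scores in [0, 1] into one score.

    mode="combine":       weighted average of both scores (one reading of claim 7)
    mode="ignore_global": local score only (one reading of claim 6)
    `weight` is the share given to the global score; it is an assumed parameter.
    """
    if mode == "ignore_global":
        return local_score
    if mode == "combine":
        return weight * global_score + (1.0 - weight) * local_score
    raise ValueError(f"unknown mode: {mode!r}")

# Hypothetical contradiction: the global classifier believes bold text is a
# product name, but on this site bold text has turned out to mark prices.
global_score, local_score = 0.9, 0.1
averaged = combine(global_score, local_score)
local_only = combine(global_score, local_score, mode="ignore_global")
```

Ignoring the global score lets site-specific evidence win outright, while averaging keeps the global classifier as a check on a local classifier trained from only a few tentative labels.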

8. In a data processing system, a method for learning and combining global regularities and local regularities for information extraction and classification, said global regularities being patterns which may be found over an entire dataset and said local regularities being patterns found in less than the entire dataset, said method comprising the steps of:
- a) identifying a candidate subset of the dataset in which said local regularities may be found;
  
  thereafter b) tentatively identifying elements having said global regularities in a candidate subset to obtain first tentative labels, said first tentative labels being useful for tagging information having identifiable similarities;
  
  thereafter c) attaching said first tentative labels onto said identified elements of said candidate subset;
  
  thereafter d) employing said attached first tentative labels via one of a class of inductive operations to formulate first local regularities;
  
  thereafter e) tentatively identifying elements having specific combinations of said global regularities and said first local regularities to obtain attached second tentative labels;
  
  thereafter f) rating confidence of said attached second tentative labels and converting selected ones of said attached second tentative labels to confidence labels upon achieving a preselected confidence level; and
  
  then outputting data with said confidence labels;
  
  otherwise g) employing said second tentative labels via said operation on said candidate subset to formulate second local regularities, and h) repeating from step e) until said confidence labels have been fully developed.
- Dependent claims (9, 10, 11, 12, 13, 14):
  - 9. The method according to claim 8 wherein said initial global regularity providing step comprises manually inputting descriptions of said global regularities.
  - 10. The method according to claim 8 wherein said initial global regularity providing step comprises obtaining said global regularities from a further one of said class of said inductive operations that has been applied to a subset of said dataset, said subset of said dataset having been manually labeled.
  - 11. The method according to claim 10 further including developing refined global regularities comprising the steps of:
    - i) collecting confidence labels from at least one of said candidate subsets to obtain global confidence labels;
      
      j) employing said global confidence labels on candidate subsets along with said manually labeled dataset via one of said class of inductive operations to formulate said refined global regularities;
      
      k) providing descriptions of said refined global regularities to said working database;
      
      thereafter l) identifying a next candidate subset of the dataset in which local regularities may be found;
      
      thereafter m) tentatively identifying elements having said refined global regularities in said candidate subset to obtain next tentative labels;
      
      thereafter n) attaching said next tentative labels onto said identified elements of said next candidate subset;
      
      thereafter o) employing said attached next tentative labels via one of the class of inductive operations to formulate next local regularities;
      
      thereafter p) tentatively identifying elements having specific combinations of said refined global regularities and said next local regularities to obtain attached next second tentative labels;
      
      thereafter q) rating confidence of said attached next second tentative labels and converting selected ones of said attached next second tentative labels to confidence labels upon achieving a preselected confidence level; and
      
      then r) outputting data with said confidence labels;
      
      otherwise s) employing said next second tentative labels via said operation on said candidate subset to formulate next second local regularities, and t) repeating from step o).
  - 12. The method according to claim 11 further including the steps of:
    - applying said data with confidence labels to further subsets of said dataset to investigate further subsets for local regularities.
  - 13. The method according to claim 8 further including developing refined global regularities comprising the steps of:
    - i) collecting confidence labels from at least one of said candidate subsets to obtain global confidence labels;
      
      j) employing said global confidence labels on candidate subsets via one of said class of inductive operations to formulate said refined global regularities;
      
      k) providing descriptions of said refined global regularities to said working database;
      
      thereafter l) identifying a next candidate subset of the dataset in which local regularities may be found;
      
      thereafter m) tentatively identifying elements having said refined global regularities in said candidate subset to obtain next tentative labels;
      
      thereafter n) attaching said next tentative labels onto said identified elements of said next candidate subset;
      
      thereafter o) employing said attached next tentative labels via one of the class of inductive operations to formulate next local regularities;
      
      thereafter p) tentatively identifying elements having specific combinations of said refined global regularities and said next local regularities to obtain attached next second tentative labels;
      
      thereafter q) rating confidence of said attached next second tentative labels and converting selected ones of said attached next second tentative labels to confidence labels upon achieving a preselected confidence level; and
      
      then r) outputting data with said confidence labels;
      
      otherwise s) employing said next second tentative labels via said operation on said candidate subset to formulate next second local regularities; and
      
      t) repeating from step o).
  - 14. The method according to claim 13 further including the steps of:
    - applying said data with confidence levels to further subsets of said dataset to investigate further subsets for local regularities.
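
The loop in claim 8 (label, learn local regularities, relabel, rate confidence, repeat) could look roughly like the sketch below. It is not the patented implementation: the `(context, text)` element format, the per-context label share used as a stand-in for an inductive operation, and the `accept`, `tentative`, and `max_rounds` parameters are all assumptions.

```python
from collections import Counter

def label_subset(elements, global_rule, accept=0.8, tentative=0.5, max_rounds=10):
    """Iteratively label (context, text) elements of one candidate subset.

    Returns {element index: confidence} for labels that reached `accept`.
    """
    # First tentative labels: whatever the global rule recognizes.
    labeled = [global_rule(text) for _, text in elements]
    totals = Counter(ctx for ctx, _ in elements)
    scores = []
    for _ in range(max_rounds):
        # Formulate local regularities: share of labeled elements per context.
        hits = Counter(ctx for (ctx, _), lab in zip(elements, labeled) if lab)
        local = {ctx: hits[ctx] / totals[ctx] for ctx in totals}
        # Second tentative labels: global evidence, else local evidence.
        scores = [1.0 if global_rule(text) else local[ctx] for ctx, text in elements]
        new_labeled = [s >= tentative for s in scores]
        if new_labeled == labeled:  # labels stopped changing
            break
        labeled = new_labeled
    # Convert sufficiently confident tentative labels to confidence labels.
    return {i: s for i, s in enumerate(scores) if s >= accept}

# Hypothetical subset: a table column of names plus unrelated paragraphs.
demo = [
    ("td.name", "Dr. A"), ("td.name", "Dr. B"), ("td.name", "Dr. C"),
    ("td.name", "Pat Smith"), ("p", "hello"), ("p", "world"),
]
result = label_subset(demo, lambda text: text.startswith("Dr. "))
```

In the demo, "Pat Smith" scores 0.75 in the first round, which is enough for a tentative label but below `accept`; once it is counted in the retrained local regularity, its score rises to 1.0 and it becomes a confidence label.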

15. In a data processing system, a method for learning and combining regularities of a first level and regularities of at least a second level and a third level for information extraction and classification, said first, second and third levels having a hierarchy from most global to most specific, said method comprising the steps of:
- a) beginning at the most global level, training a classifier at the selected level by initially providing descriptions of regularities at said selected level to a working database, said selected level regularities being patterns which are to be found over a selected portion of a selected dataset corresponding to the selected level;
  
  thereafter b) identifying a candidate subset of each selected dataset in which next more specific regularities may be found;
  
  thereafter c) tentatively identifying elements having said selected regularities in said candidate subset to obtain first tentative labels, said first tentative labels being useful for tagging mutually similar information;
  
  thereafter d) attaching said first tentative labels onto said identified elements of said candidate subset;
  
  thereafter e) employing said attached first tentative labels via one of a class of inductive operations to formulate first local regularities;
  
  thereafter f) tentatively identifying elements having specific combinations of said global regularities and said local regularities to obtain attached second tentative labels;
  
  thereafter g) rating confidence of said attached second tentative labels and converting selected ones of said attached second tentative labels to confidence labels upon achieving a preselected confidence level; and
  
  then h) outputting data with said confidence labels;
  
  otherwise i) employing said second tentative labels via said operation on said candidate subset to formulate second more specific regularities;
  
  k) repeating from step f); and
  
  l) repeating from step a) for each successive more selective level of regularity.
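
Claim 15 extends the same idea to a hierarchy of levels, from most global to most specific. A minimal sketch, assuming a nested dictionary of levels (for example site, then page) and a deliberately simple inductive step; the node layout and the names `learn_tag_rule` and `descend` are hypothetical.

```python
from collections import Counter

def learn_tag_rule(labeled):
    """Assumed inductive step: the most common tag among labeled items becomes a rule."""
    if not labeled:
        return None
    tag = Counter(t for t, _ in labeled).most_common(1)[0][0]
    return lambda item: item[0] == tag

def descend(node, rules, out):
    """Label a level with inherited rules, learn a more specific rule, then recurse."""
    labeled = [it for it in node["items"] if any(r(it) for r in rules)]
    local = learn_tag_rule(labeled)
    level_rules = (rules + [local]) if local else rules
    out[node["name"]] = [it for it in node["items"] if any(r(it) for r in level_rules)]
    for child in node.get("children", []):
        descend(child, level_rules, out)
    return out

# Hypothetical two-level hierarchy: a site and one of its pages.
tree = {
    "name": "site",
    "items": [("b", "Dr. A"), ("b", "Pat Smith"), ("p", "About")],
    "children": [{
        "name": "page",
        "items": [("li", "Dr. X"), ("li", "Dr. Y"), ("li", "Lee Wong"), ("p", "footer")],
    }],
}
title_rule = lambda item: item[1].startswith("Dr. ")  # most global regularity (assumed)
labels_by_level = descend(tree, [title_rule], {})
```

Each level inherits every rule learned above it, so a rule that is wrong for a particular page would propagate downward; a fuller version would rate confidence at each level before passing rules on.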

Current Assignee
SAP America Incorporated (SAP SE)
Original Assignee
Inxight Software, Inc. (SAP SE)
Inventors
Cohen, William; Mitchell, Tom M.; Quass, Dallan W.; McCallum, Andrew K.
Primary Examiner(s)
Knight, Anthony
Assistant Examiner(s)
Holmes, Michael B.

Application Number

US09/771,008
Publication Number

US 20020103775A1
Time in Patent Office

1,565 Days
Field of Search

706/15-60, 706/12, 707/1-6
US Class Current

706/12
CPC Class Codes

G06F 16/972   Access to data in other rep...

G06F 2216/09   Obsolescence

G06N 20/00   Machine learning
