Training a statistical parser on noisy data by filtering

US 20060277028A1
Filed: 06/01/2005
Published: 12/07/2006
Est. Priority Date: 06/01/2005
Status: Abandoned Application

Abstract

A filtering (or identifying) approach is disclosed and applied to unsupervised adaptation of a parsing model to a selected domain. Unannotated text from the selected domain is parsed using a first parser. A subset of the parsed text is then selected and used, with a training module, to train an improved model. The training module can be of a type that outputs a parsing model usable by the first parser, or of a type that outputs a parsing model usable by another type of parser.
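The three steps in the abstract (parse, select a subset, train) can be sketched as a small pipeline. This is only an illustration under stated assumptions: `adapt_parsing_model`, `toy_parse`, `toy_keep`, and `toy_train` are hypothetical names, and the toy callables stand in for a real parser, selection criterion, and training module, none of which the abstract specifies.

```python
from typing import Callable, Iterable, List, Tuple


def adapt_parsing_model(
    texts: Iterable[str],
    parse: Callable[[str], object],
    keep: Callable[[str, object], bool],
    train: Callable[[List[Tuple[str, object]]], object],
) -> object:
    """Parse unannotated in-domain text, keep a subset, and train on it."""
    # Step 1: parse the unannotated text with the first parser.
    parsed = [(sentence, parse(sentence)) for sentence in texts]
    # Step 2: select the subset judged more appropriate for training.
    subset = [(sentence, tree) for sentence, tree in parsed if keep(sentence, tree)]
    # Step 3: hand the subset to the training module.
    return train(subset)


# Toy stand-ins (assumptions for illustration, not the patent's components).
def toy_parse(sentence: str) -> list:
    return sentence.split()  # a "parse" that is just a token list


def toy_keep(sentence: str, tree: list) -> bool:
    return len(tree) <= 4  # keep short sentences only


def toy_train(examples: list) -> dict:
    return {"n_examples": len(examples)}  # a "model" that records its data size


model = adapt_parsing_model(
    ["the cat sat", "a very long and rambling sentence here", "dogs bark"],
    toy_parse,
    toy_keep,
    toy_train,
)
```

Passing the parser, selection rule, and trainer as callables mirrors the abstract's point that the trained model may serve either the first parser or a different type of parser.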

20 Claims

1. A computer-implemented method of creating training data to train a parser in a selected domain, comprising:
- parsing unannotated text of the selected domain using a first parser to obtain parsed text;
  
  identifying in the parsed text a subset thereof that is more appropriate than other portions for obtaining an improved parsing model in the selected domain; and
  
  creating the improved parsing model using the subset of parsed text and a training module.
- Dependent claims (2 to 15):
  - 2. The computer-implemented method of claim 1 wherein identifying comprises filtering the parsed text to obtain the subset thereof.
  - 3. The computer-implemented method of claim 2 wherein identifying comprises using a ranking function.
  - 4. The computer-implemented method of claim 3 wherein using a ranking function comprises using a ranking function based on informativeness of text items in the parsed text.
  - 5. The computer-implemented method of claim 3 wherein using a ranking function comprises using a ranking function based on accuracy of text items in the parsed text.
  - 6. The computer-implemented method of claim 3 wherein using a ranking function comprises using a ranking function based on discrimination of text items in the parsed text.
  - 7. The computer-implemented method of claim 6 wherein using a ranking function comprises using a ranking function based on uncertainty.
  - 8. The computer-implemented method of claim 7 wherein using a ranking function comprises using a ranking function based on an entropy function.
  - 9. The computer-implemented method of claim 1 wherein at least one of parsing and identifying comprises using a pre-existing model in the selected domain.
  - 10. The computer-implemented method of claim 5 wherein the first parser and a parser that utilizes the improved parsing model are identical.
  - 11. The computer-implemented method of claim 3 wherein identifying comprises identifying sentences.
  - 12. The computer-implemented method of claim 3 wherein identifying comprises identifying word pairs.
  - 13. The computer-implemented method of claim 1 wherein creating the improved parsing model comprises using known accurate textual data in addition to the subset of parsed text.
  - 14. The computer-implemented method of claim 11 wherein the known accurate textual data comprises data in the selected domain.
  - 15. The computer-implemented method of claim 11 wherein the known accurate textual data comprises out-of-domain data relative to the selected domain.
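Claims 6 to 8 narrow the ranking function from discrimination to uncertainty to an entropy function, without stating which distribution the entropy is taken over. One possible reading, sketched below, is the Shannon entropy of a parser's n-best parse scores for a sentence. The function name `parse_entropy` and the choice of n-best scores as input are assumptions made for illustration.

```python
import math
from typing import Sequence


def parse_entropy(scores: Sequence[float]) -> float:
    """Shannon entropy (in nats) of n-best parse scores after normalization.

    Assumption: `scores` are non-negative weights for competing parses of
    one sentence; they are normalized here to sum to 1.
    """
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    probs = [s / total for s in scores if s > 0]
    return -sum(p * math.log(p) for p in probs)


# A sentence where one parse dominates versus one where parses are tied.
confident = parse_entropy([0.97, 0.02, 0.01])
ambiguous = parse_entropy([0.34, 0.33, 0.33])
```

Low entropy means the parser concentrates its probability on one analysis; high entropy means it is undecided. Whether a filter should prefer low-entropy items (likely accurate) or high-entropy items (likely informative) is a design choice that the claims leave open.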

16. A computer readable medium having instructions which when performed by a computer create training data for training a parser, the instructions comprising:
- parsing unannotated text of the selected domain using a first parser to obtain parsed text;
  
  ranking portions of the parsed text to identify a subset thereof that is more appropriate than other portions for obtaining an improved parsing model in the selected domain; and
  
  creating the improved parsing model using the subset of parsed text and a training module.
- Dependent claims (17 to 20):
  - 17. The computer readable medium of claim 16 wherein ranking comprises using a ranking function based on informativeness of text items in the parsed text.
  - 18. The computer readable medium of claim 16 wherein ranking comprises using a ranking function based on accuracy of text items in the parsed text.
  - 19. The computer readable medium of claim 16 wherein ranking comprises using a ranking function based on discrimination of text items in the parsed text.
  - 20. The computer readable medium of claim 19 wherein ranking comprises using a ranking function based on an entropy function.
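Claim 16 ranks portions of the parsed text and keeps the better-ranked subset. A minimal sketch of that rank-and-cut step follows, assuming the portions are word pairs (one granularity named in claim 12) that each carry a confidence score. The helper `select_top`, the tuple layout, and the `fraction` cutoff are hypothetical and not taken from the claims.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def select_top(
    portions: Sequence[T], rank: Callable[[T], float], fraction: float
) -> List[T]:
    """Return the highest-ranked `fraction` of `portions` (at least one item)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not portions:
        return []
    k = max(1, int(len(portions) * fraction))
    # Sort by descending rank; sorted() is stable, so ties keep input order.
    ranked = sorted(portions, key=rank, reverse=True)
    return ranked[:k]


# Hypothetical portions: (head, dependent, confidence) word pairs.
pairs = [
    ("sat", "cat", 0.91),
    ("sat", "mat", 0.42),
    ("cat", "the", 0.88),
    ("mat", "the", 0.65),
]
best = select_top(pairs, rank=lambda p: p[2], fraction=0.5)
```

Any of the criteria in claims 17 to 20 (informativeness, accuracy, discrimination, entropy) could be supplied as the `rank` callable; for a criterion where smaller values are better, negate the score.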

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Jiang, Jinjing; Chen, John T.
Application Number: US11/142,703
Publication Number: US 20060277028A1
US Class Current: 704/4
CPC Class Codes: G06F 40/216 (using statistical methods)
