Method for fast semi-automatic semantic annotation

US 7,610,191 B2
Filed: 10/06/2004
Issued: 10/27/2009
Est. Priority Date: 10/06/2004
Status: Expired due to Fees

First Claim

1. A computer-implemented method in a data processing system for fast semi-automatic semantic annotation, the computer-implemented method comprising:

dividing a data set of sentences into a plurality of corpuses, wherein each of the plurality of corpuses includes an equal number of sentences;

learning, by a processor, a structure of each sentence of a first corpus using a plurality of trainers, wherein the structure is a parse tree that includes a tag, a label, and connections for each word of each sentence of the first corpus, wherein the plurality of trainers comprises a parser trainer, wherein the parser trainer is a decision tree-based statistical parser, and wherein the parser trainer fits a complete parse tree to each sentence of the first corpus;

forming, by the processor, a model based on the structure;

using the model in a set of engines to annotate new sentences, wherein each of the set of engines uses a corresponding model to output the parse tree, wherein the parse tree comprises a unique set of tags, labels, and connections for each word of each sentence of the first corpus, wherein using the model in the set of engines to annotate the new sentences further comprises;

sending each sentence of a second corpus to the set of engines;

sending the parse tree from each of the set of engines to a rover;

determining in the rover a best set of tags, labels, and connections for each word of each sentence of the second corpus based on a comparison of the unique sets of tags, labels, and connections from the each of the set of engines, wherein determining in the rover the best set of tags, labels, and connections further comprises;

responsive to the set of engines agreeing on the same parse tree, selecting the unique set of tags, labels, and connections from one of a set of agreed engines; and

responsive to the set of engines disagreeing on the parse tree, selecting the unique set of tags, labels, and connections from a support vector machines engine, and wherein the support vector machines engine determines the tag and the label of the word to be annotated by using a tag classifier built using a tag feature vector for the word; and

responsive to a parser engine and a similarity engine agreeing on the same parse tree, selecting the unique set of tags, labels, and connections from one of the parser engine and the similarity engine;

annotating each word of each sentence of the second corpus using the best set of tags, labels, and connections, wherein the similarity engine determines the tag, a label, and connections of the word to be annotated by finding a best reference sentence containing the word to be annotated using a bilingual evaluation understudy score and assigning corresponding tag, label, and connections of the word in the best reference sentence as the tag, the label, and the connections of a word to be annotated; and

tagging each sentence of the second corpus as reliable or unreliable, wherein tagging each sentence of the second corpus as reliable or unreliable further comprises;

responsive to the set of engines agreeing on the same parse tree of the annotated sentences, tagging the annotated sentence as reliable; and

responsive to the set of engines disagreeing on the same parse tree, tagging the annotated sentence as unreliable;

adding correctly annotated sentences of the second corpus to a set of training data, wherein the set of training data includes the correctly annotated sentences and sentences annotated by a human annotator for the first corpus;

annotating each sentence of a third corpus using the set of training data; and

automatically annotating, by a processor, each sentence of subsequent corpuses using the set of training data, wherein the set of training data includes correctly annotated sentences from each round of annotation.
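
The round-by-round procedure in the claim can be sketched as a small bootstrapping loop. This is a minimal illustration, not the patented implementation: the names `split_into_corpora`, `bootstrap`, `human_annotate`, `train`, and `annotate` are hypothetical. The sketch also assumes that sentences tagged unreliable are handed back to a human annotator before they join the training data; the claim itself only says that correctly annotated sentences are added.

```python
from typing import Callable, Hashable, List, Sequence, Tuple

Sentence = str
Annotation = Hashable  # hypothetical stand-in for a parse tree (tags, labels, connections)
TrainingData = List[Tuple[Sentence, Annotation]]


def split_into_corpora(sentences: Sequence[Sentence], size: int) -> List[List[Sentence]]:
    """Divide the data set into corpora that each hold `size` sentences.

    Assumption of this sketch: leftover sentences that cannot fill a whole
    corpus are dropped so that every corpus has an equal number of sentences.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    usable = len(sentences) - len(sentences) % size
    return [list(sentences[i:i + size]) for i in range(0, usable, size)]


def bootstrap(
    corpora: Sequence[Sequence[Sentence]],
    human_annotate: Callable[[Sentence], Annotation],
    train: Callable[[TrainingData], object],
    annotate: Callable[[object, Sentence], Tuple[Annotation, bool]],
) -> TrainingData:
    """Grow the training data one corpus at a time.

    The first corpus is annotated by a human. Every later corpus is annotated
    with a model trained on everything collected so far; `annotate` returns
    the chosen annotation and a reliable/unreliable flag.
    """
    training: TrainingData = [(s, human_annotate(s)) for s in corpora[0]]
    for corpus in corpora[1:]:
        model = train(training)  # retrain on all data gathered in earlier rounds
        for sentence in corpus:
            annotation, reliable = annotate(model, sentence)
            if not reliable:
                # Assumption: unreliable sentences are corrected by a human.
                annotation = human_annotate(sentence)
            training.append((sentence, annotation))
    return training
```

In this sketch the human effort per round is limited to the first corpus plus whatever the engines flag as unreliable, which is the saving the method aims for.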

Abstract

A method, an apparatus, and computer instructions are provided for fast semi-automatic semantic annotation. Given a limited annotated corpus, the present invention assigns a tag and a label to each word of the next limited annotated corpus using a parser engine, a similarity engine, and an SVM engine. A rover then combines the parse trees from the three engines and annotates the next chunk of limited annotated corpus with confidence, such that the effort required for human annotation is reduced.
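
The way the rover combines the three engines can be sketched as a short selection function that follows the conditions listed in the first claim. This is a hedged illustration only: `rover` and `RoverResult` are hypothetical names, and each engine output is assumed to be a hashable value (for example a tuple of per-word tag, label, and connection entries) so that parse trees can be compared for equality.

```python
from typing import Hashable, NamedTuple


class RoverResult(NamedTuple):
    annotation: Hashable  # the selected set of tags, labels, and connections
    reliable: bool        # True only when all engines produced the same parse tree


def rover(parser_out: Hashable, similarity_out: Hashable, svm_out: Hashable) -> RoverResult:
    """Pick one annotation from the three engine outputs.

    Selection order in this sketch, following the claim language:
      1. all engines agree              -> take the agreed parse, tag reliable
      2. parser and similarity agree    -> take their parse, tag unreliable
      3. otherwise                      -> take the SVM engine's parse, tag unreliable
    """
    if parser_out == similarity_out == svm_out:
        return RoverResult(parser_out, True)
    if parser_out == similarity_out:
        return RoverResult(parser_out, False)
    return RoverResult(svm_out, False)
```

For example, `rover(t, t, u)` returns `t` flagged as unreliable, because the full set of engines did not agree even though the parser and similarity engines did.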

Current Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Inventors: Gao, Yuqing; Sarikaya, Ruhi; Picheny, Michael Alan
Primary Examiner(s): Dorvil, Richemond
Assistant Examiner(s): Saint Cyr, Leonard
Application Number: US10/959,523
Publication Number: US 20060074634A1
Time in Patent Office: 1,847 Days
Field of Search: None
US Class Current: 704/9
CPC Class Codes:
G06F 40/211 Syntactic parsing, e.g. bas...
G06F 40/268 Morphological analysis
