
Method for fast semi-automatic semantic annotation

  • US 7,610,191 B2
  • Filed: 10/06/2004
  • Issued: 10/27/2009
  • Est. Priority Date: 10/06/2004
  • Status: Expired due to Fees
First Claim

1. A computer-implemented method in a data processing system for fast semi-automatic semantic annotation, the computer-implemented method comprising:

    dividing a data set of sentences into a plurality of corpuses, wherein each of the plurality of corpuses includes an equal number of sentences;

    learning, by a processor, a structure of each sentence of a first corpus using a plurality of trainers, wherein the structure is a parse tree that includes a tag, a label, and connections for each word of each sentence of the first corpus, wherein the plurality of trainers comprises a parser trainer, wherein the parser trainer is a decision tree-based statistical parser, and wherein the parser trainer fits a complete parse tree to each sentence of the first corpus;

    forming, by the processor, a model based on the structure;

    using the model in a set of engines to annotate new sentences, wherein each of the set of engines uses a corresponding model to output the parse tree, wherein the parse tree comprises a unique set of tags, labels, and connections for each word of each sentence of the first corpus, wherein using the model in the set of engines to annotate the new sentences further comprises:

    sending each sentence of a second corpus to the set of engines;

    sending the parse tree from each of the set of engines to a rover;

    determining in the rover a best set of tags, labels, and connections for each word of each sentence of the second corpus based on a comparison of the unique sets of tags, labels, and connections from each of the set of engines, wherein determining in the rover the best set of tags, labels, and connections further comprises:

    responsive to the set of engines agreeing on the same parse tree, selecting the unique set of tags, labels, and connections from one of a set of agreed engines; and

    responsive to the set of engines disagreeing on the parse tree, selecting the unique set of tags, labels, and connections from a support vector machines engine, and wherein the support vector machines engine determines the tag and the label of the word to be annotated by using a tag classifier built using a tag feature vector for the word; and

    responsive to a parser engine and a similarity engine agreeing on the same parse tree, selecting the unique set of tags, labels, and connections from one of the parser engine and the similarity engine;

    annotating each word of each sentence of the second corpus using the best set of tags, labels, and connections, wherein the similarity engine determines the tag, a label, and connections of the word to be annotated by finding a best reference sentence containing the word to be annotated using a bilingual evaluation understudy score and assigning the corresponding tag, label, and connections of the word in the best reference sentence as the tag, the label, and the connections of the word to be annotated; and

    tagging each sentence of the second corpus as reliable or unreliable, wherein tagging each sentence of the second corpus as reliable or unreliable further comprises:

    responsive to the set of engines agreeing on the same parse tree of the annotated sentences, tagging the annotated sentence as reliable; and

    responsive to the set of engines disagreeing on the same parse tree, tagging the annotated sentence as unreliable;

    adding correctly annotated sentences of the second corpus to a set of training data, wherein the set of training data includes the correctly annotated sentences and sentences annotated by a human annotator for the first corpus;

    annotating each sentence of a third corpus using the set of training data; and

    automatically annotating, by a processor, each sentence of subsequent corpuses using the set of training data, wherein the set of training data includes correctly annotated sentences from each round of annotation.
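
Claim 1 recites an iterative, semi-automatic loop: the data set is divided into equal-sized corpuses, the first corpus is annotated by a human annotator and used to train the engines, and each later corpus is annotated automatically, with only correctly annotated sentences fed back into the training data. The Python sketch below illustrates that control flow under stated assumptions; train_engines, run_rover, and human_annotate are hypothetical stand-ins, not the patented implementation.

    # Sketch of the semi-automatic annotation loop recited in claim 1.
    # train_engines, run_rover, and human_annotate are hypothetical stand-ins.

    def split_into_corpuses(sentences, num_corpuses):
        """Divide the data set into corpuses with an equal number of sentences."""
        size = len(sentences) // num_corpuses
        return [sentences[i * size:(i + 1) * size] for i in range(num_corpuses)]

    def annotate_data_set(sentences, num_corpuses,
                          train_engines, run_rover, human_annotate):
        corpuses = split_into_corpuses(sentences, num_corpuses)

        # The first corpus is annotated by a human annotator; those
        # (sentence, parse tree) pairs form the initial training data.
        training_data = [human_annotate(s) for s in corpuses[0]]

        for corpus in corpuses[1:]:
            # Train one model per trainer and load the models into the engines.
            engines = train_engines(training_data)
            for sentence in corpus:
                parse_tree, reliable = run_rover(engines, sentence)
                # Only reliably (correctly) annotated sentences are added to
                # the training data used in the next round of annotation.
                if reliable:
                    training_data.append((sentence, parse_tree))
        return training_data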
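
The rover element selects the best set of tags, labels, and connections by comparing the engines' parse trees: if all engines agree, the shared tree is taken and the sentence is tagged reliable; if only the parser engine and similarity engine agree, their shared tree is taken; otherwise the support vector machines engine's output is used and the sentence is tagged unreliable. A minimal sketch of that selection logic:

    # Hypothetical rover selection logic mirroring claim 1's agreement rules.

    def rover_select(parser_tree, similarity_tree, svm_tree):
        """Return (best parse tree, reliable flag) for one sentence.

        Each argument is the parse tree (tags, labels, and connections for
        each word) output by the corresponding engine; equality of trees is
        taken here to mean the engines agree.
        """
        # All engines agree on the same parse tree: take the shared tree and
        # tag the sentence as reliable.
        if parser_tree == similarity_tree == svm_tree:
            return parser_tree, True

        # Parser engine and similarity engine agree: take their shared tree,
        # but the sentence is still unreliable because not all engines agree.
        if parser_tree == similarity_tree:
            return parser_tree, False

        # Engines disagree: fall back to the support vector machines engine.
        return svm_tree, False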
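
When the engines disagree, the claim's support vector machines engine determines each word's tag and label with a tag classifier built from a tag feature vector. The sketch below shows a generic per-word SVM tag classifier using scikit-learn; the particular features (word identity, neighbors, suffix) are illustrative assumptions, not the feature vector defined in the patent.

    # Generic per-word SVM tag classifier in the spirit of claim 1's SVM engine.
    # The features below are illustrative assumptions only.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    def word_features(words, i):
        """Toy tag feature vector for the word at position i."""
        return {
            "word": words[i],
            "prev": words[i - 1] if i > 0 else "<s>",
            "next": words[i + 1] if i + 1 < len(words) else "</s>",
            "suffix2": words[i][-2:],
        }

    def train_tag_classifier(tagged_sentences):
        """tagged_sentences: list of [(word, tag), ...] drawn from the training data."""
        X, y = [], []
        for sentence in tagged_sentences:
            words = [w for w, _ in sentence]
            for i, (_, tag) in enumerate(sentence):
                X.append(word_features(words, i))
                y.append(tag)
        clf = make_pipeline(DictVectorizer(sparse=True), LinearSVC())
        clf.fit(X, y)
        return clf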
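
The similarity engine annotates a word by finding the best reference sentence containing that word, scored with a bilingual evaluation understudy (BLEU) score, and copying the word's tag, label, and connections from that reference. The sketch below uses NLTK's sentence-level BLEU as one possible scorer; the reference-corpus format is an assumption.

    # Hypothetical similarity engine: pick the highest-BLEU reference sentence
    # containing the target word and reuse its annotation for that word.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    def similarity_annotate(word, sentence_tokens, reference_corpus):
        """reference_corpus: list of (tokens, {word: (tag, label, connections)})."""
        smooth = SmoothingFunction().method1
        best_score, best_annotation = -1.0, None
        for ref_tokens, annotations in reference_corpus:
            if word not in ref_tokens or word not in annotations:
                continue
            score = sentence_bleu([ref_tokens], sentence_tokens,
                                  smoothing_function=smooth)
            if score > best_score:
                best_score, best_annotation = score, annotations[word]
        # best_annotation is the (tag, label, connections) copied from the best
        # reference sentence, or None if no reference contains the word.
        return best_annotation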
