Method for fast semi-automatic semantic annotation
First Claim
1. A computer-implemented method in a data processing system for fast semi-automatic semantic annotation, the computer-implemented method comprising:
dividing a data set of sentences into a plurality of corpuses, wherein each of the plurality of corpuses includes an equal number of sentences;
learning, by a processor, a structure of each sentence of a first corpus using a plurality of trainers, wherein the structure is a parse tree that includes a tag, a label, and connections for each word of each sentence of the first corpus, wherein the plurality of trainers comprises a parser trainer, wherein the parser trainer is a decision tree-based statistical parser, and wherein the parser trainer fits a complete parse tree to each sentence of the first corpus;
forming, by the processor, a model based on the structure;
using the model in a set of engines to annotate new sentences, wherein each of the set of engines uses a corresponding model to output the parse tree, wherein the parse tree comprises a unique set of tags, labels, and connections for each word of each sentence of the first corpus, wherein using the model in the set of engines to annotate the new sentences further comprises:
sending each sentence of a second corpus to the set of engines;
sending the parse tree from each of the set of engines to a rover;
determining in the rover a best set of tags, labels, and connections for each word of each sentence of the second corpus based on a comparison of the unique sets of tags, labels, and connections from each of the set of engines, wherein determining in the rover the best set of tags, labels, and connections further comprises:
responsive to the set of engines agreeing on the same parse tree, selecting the unique set of tags, labels, and connections from one of a set of agreed engines; and
responsive to the set of engines disagreeing on the parse tree, selecting the unique set of tags, labels, and connections from a support vector machines engine, and wherein the support vector machines engine determines the tag and the label of the word to be annotated by using a tag classifier built using a tag feature vector for the word; and
responsive to a parser engine and a similarity engine agreeing on the same parse tree, selecting the unique set of tags, labels, and connections from one of the parser engine and the similarity engine;
annotating each word of each sentence of the second corpus using the best set of tags, labels, and connections, wherein the similarity engine determines the tag, a label, and connections of the word to be annotated by finding a best reference sentence containing the word to be annotated using a bilingual evaluation understudy score and assigning corresponding tag, label, and connections of the word in the best reference sentence as the tag, the label, and the connections of a word to be annotated; and
tagging each sentence of the second corpus as reliable or unreliable, wherein tagging each sentence of the second corpus as reliable or unreliable further comprises:
responsive to the set of engines agreeing on the same parse tree of the annotated sentences, tagging the annotated sentence as reliable; and
responsive to the set of engines disagreeing on the same parse tree, tagging the annotated sentence as unreliable;
adding correctly annotated sentences of the second corpus to a set of training data, wherein the set of training data includes the correctly annotated sentences and sentences annotated by a human annotator for the first corpus;
annotating each sentence of a third corpus using the set of training data; and
automatically annotating, by a processor, each sentence of subsequent corpuses using the set of training data, wherein the set of training data includes correctly annotated sentences from each round of annotation.
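The rover's selection and reliability steps recited above can be sketched in a few lines of Python. This is an illustrative reading of the claim, not the patented implementation: the names `Parse`, `RoverDecision`, and `rover_select` are hypothetical, a parse is modeled as a tuple of (tag, label, connection) triples, and the order in which the three conditions are checked (full agreement first, then parser/similarity agreement, then the support vector machines engine as fallback) is an assumption, since the claim does not state a precedence.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical representation: one (tag, label, connection) triple per word,
# where "connection" is the index of the node the word attaches to.
WordAnnotation = Tuple[str, str, int]
Parse = Tuple[WordAnnotation, ...]


@dataclass(frozen=True)
class RoverDecision:
    parse: Parse      # the selected set of tags, labels, and connections
    source: str       # which engine(s) the selection came from
    reliable: bool    # True only when every engine agreed


def rover_select(parser: Parse, similarity: Parse, svm: Parse) -> RoverDecision:
    """Pick one parse from three engine outputs (assumed precedence)."""
    # All engines agree: take the parse from any of them, mark it reliable.
    if parser == similarity == svm:
        return RoverDecision(parser, "all", True)
    # Parser and similarity engines agree with each other (SVM differs):
    # take their shared parse, but the sentence is still tagged unreliable.
    if parser == similarity:
        return RoverDecision(parser, "parser+similarity", False)
    # Otherwise fall back to the support vector machines engine.
    return RoverDecision(svm, "svm", False)
```

Comparing whole parses by equality mirrors the claim's "agreeing on the same parse tree"; a per-word comparison would be a different design and is not what the claim recites.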
Abstract
A method, apparatus, and computer instructions are provided for fast semi-automatic semantic annotation. Given a limited annotated corpus, the present invention assigns a tag and a label to each word of the next limited annotated corpus using a parser engine, a similarity engine, and an SVM engine. A rover then combines the parse trees from the three engines and annotates the next chunk of limited annotated corpus with confidence, such that the effort required for human annotation is reduced.
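To illustrate how annotating the corpus chunk by chunk could reduce human effort, the following Python sketch walks through the rounds. It is a simplified sketch under stated assumptions, not the invention's implementation: `split_equal`, `bootstrap`, `human_annotate`, and `auto_annotate` are hypothetical names, leftover sentences that do not fill an equal-sized chunk are simply dropped, and sentences the engines mark unreliable are assumed to be sent to a human annotator before joining the training data.

```python
from typing import Callable, List, Tuple

Annotation = str  # placeholder; a real system would store a parse tree
TrainingPair = Tuple[str, Annotation]


def split_equal(sentences: List[str], n_chunks: int) -> List[List[str]]:
    """Divide sentences into n_chunks chunks of equal size (remainder dropped)."""
    size = len(sentences) // n_chunks
    return [sentences[i * size:(i + 1) * size] for i in range(n_chunks)]


def bootstrap(
    chunks: List[List[str]],
    human_annotate: Callable[[str], Annotation],
    auto_annotate: Callable[[str, List[TrainingPair]], Tuple[Annotation, bool]],
) -> Tuple[List[TrainingPair], int]:
    """Grow the training data round by round; return it with the human workload."""
    # Round 0: the first chunk is annotated entirely by a human.
    training = [(s, human_annotate(s)) for s in chunks[0]]
    human_count = len(chunks[0])
    for chunk in chunks[1:]:
        new_pairs: List[TrainingPair] = []
        for sentence in chunk:
            # The engines use only the training data from earlier rounds.
            annotation, reliable = auto_annotate(sentence, training)
            if not reliable:
                # Assumption: unreliable sentences are corrected by a human.
                annotation = human_annotate(sentence)
                human_count += 1
            new_pairs.append((sentence, annotation))
        # Correct annotations from this round join the training data.
        training.extend(new_pairs)
    return training, human_count
```

In this sketch the human workload after the first chunk is the number of sentences tagged unreliable, so it shrinks as the engines agree more often.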
Specification