Modification of annotated bilingual segment pairs in syntax-based machine translation
First Claim
1. A method for generating a tree to string annotated bilingual segment pair, the method comprising:
receiving a first annotated bilingual segment pair comprising two or more words in a source language, two or more words in a target language, and an alignment between the two or more words in the source language and the two or more words in the target language;
processing, by a computer, the first annotated bilingual segment pair to generate a target forest including a plurality of trees, each tree representing an alternative annotated bilingual segment pair including an alignment rule sequence that can be used to express a relationship between the words in the bilingual segment pair, the processing of the first annotated bilingual segment pair by the computer including re-labeling, re-structuring, and re-aligning the first annotated bilingual segment pair;
building a derivation forest from the target forest, the derivation forest including a plurality of trees;
deriving, by the computer, a plurality of rule sequences for the plurality of trees in the derivation forest, each tree in the derivation forest including a set of rule sequences derived from a tree in the target forest;
selecting one of the derived rule sequences based on a probability that the selected rule sequence is more likely than the other derived rule sequences, using an expectation-maximization algorithm;
generating, by the computer, a second annotated bilingual segment pair based on the selected rule sequence, wherein the second annotated bilingual segment pair has an alignment; and
extracting translation rules from the second annotated bilingual segment pair based on the alignment of the second annotated bilingual segment pair.
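The selection step recited above can be illustrated with a minimal sketch. It assumes that each segment pair comes with a list of candidate rule sequences, that a rule is identified by a plain string, and that rule probabilities are normalized over all rules at once. The names `em_rule_probs` and `select_sequence` are hypothetical, and the global normalization is a simplification chosen for brevity, not the normalization described in the specification.

```python
from collections import defaultdict
from math import prod


def em_rule_probs(corpus, iterations=10):
    """Estimate rule probabilities with expectation-maximization.

    corpus: list of segment pairs; each pair is a list of candidate
    rule sequences, and each rule sequence is a tuple of rule names.
    Assumption: one global multinomial over all rules.
    """
    rules = {r for cands in corpus for seq in cands for r in seq}
    prob = {r: 1.0 / len(rules) for r in rules}  # uniform start
    for _ in range(iterations):
        counts = defaultdict(float)
        for cands in corpus:
            # E-step: posterior of each candidate sequence for this pair.
            scores = [prod(prob[r] for r in seq) for seq in cands]
            total = sum(scores)
            for seq, score in zip(cands, scores):
                for r in seq:
                    counts[r] += score / total
        # M-step: renormalize the expected counts.
        norm = sum(counts.values())
        prob = {r: counts[r] / norm for r in rules}
    return prob


def select_sequence(cands, prob):
    """Pick the candidate rule sequence with the highest probability."""
    return max(cands, key=lambda seq: prod(prob[r] for r in seq))


# Toy corpus: rule "A" is shared across pairs, so sequences using it win.
corpus = [
    [("A", "B"), ("C", "D")],
    [("A", "E"), ("F", "G")],
]
prob = em_rule_probs(corpus)
best = [select_sequence(cands, prob) for cands in corpus]
```

In this toy corpus the rule `A` explains part of both pairs, so its expected count grows at each iteration and the sequences containing it are selected. With candidates of unequal length, a global multinomial favors shorter sequences, which is one reason a practical system would likely normalize differently.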
4 Assignments
0 Petitions
Abstract
Systems and methods for automatically modifying an annotated bilingual segment pair are provided. An annotated bilingual segment pair (“Pair”) may be modified to generate improved translation rules used in machine translation of documents from a source language to a target language. Because a single Pair may be used to translate a phrase, many Pairs are used in a machine translation system and manual correction of each model is impractical. Each Pair may be modified by re-labeling syntactic categories within the Pair, re-structuring a tree within the Pair, and/or re-aligning source words to target words within the Pair. In exemplary embodiments, many alternate Pairs (or portions thereof) are generated automatically, rule sequences corresponding to each are derived, and one or more rule sequences are selected. Using the selected rule sequence, a modified Pair is distilled.
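As an illustration of what a Pair holds, the following sketch represents one as a source string, a target tree, and a word alignment, and computes which source words are aligned under each tree node. The class layout, the nested-tuple tree encoding, and the function name `source_spans` are assumptions made for this example only; they are not taken from the specification.

```python
from dataclasses import dataclass


@dataclass
class Pair:
    source: list     # source-language words
    tree: tuple      # target tree: (label, child, ...) with str leaves
    alignment: set   # {(source_index, target_index), ...}


def source_spans(pair):
    """List (label, target_start, target_end, aligned_source_indices)
    for every internal node of the target tree, in post-order."""
    out = []

    def walk(node, start):
        if isinstance(node, str):        # a leaf covers one target word
            return start + 1
        end = start
        for child in node[1:]:
            end = walk(child, end)
        src = sorted(s for s, t in pair.alignment if start <= t < end)
        out.append((node[0], start, end, src))
        return end

    walk(pair.tree, 0)
    return out


# Toy Pair: French source "il dort", English target "he sleeps".
pair = Pair(
    source=["il", "dort"],
    tree=("S", ("NP", "he"), ("VP", "sleeps")),
    alignment={(0, 0), (1, 1)},
)
spans = source_spans(pair)
```

Re-labeling changes the node labels in `tree`, re-structuring changes its shape, and re-aligning changes `alignment`; each modification alters the spans reported here, and therefore the translation rules that could be extracted from the Pair.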
431 Citations
27 Claims
1. A method for generating a tree to string annotated bilingual segment pair, as recited in full under First Claim above. Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11.
12. A system for generating a tree to string annotated bilingual segment pair, the system comprising:
a processor configured to:
receive a first annotated bilingual segment pair to generate a plurality of trees, the first annotated bilingual segment pair comprising two or more words in a source language, two or more words in a target language, and an initial alignment between the two or more words in the source language and the two or more words in the target language, and
process the first annotated bilingual segment pair, the processing including: re-labeling, restructuring, and re-aligning the first annotated bilingual segment pair, to generate a plurality of target trees, each tree representing an alternative annotated bilingual segment pair including an alignment rule sequence that can be used to express a relationship between the words in the bilingual segment pair;
a derivation module configured to derive a plurality of derivation trees from the plurality of the target trees, each derivation tree including a rule sequence derived from a target tree;
a training module configured to select one of the derived rule sequences, based on a probability that the selected rule sequence is more likely than the other derived rule sequences, using an expectation-maximization algorithm; and
a distillation module configured to generate a second annotated bilingual segment pair based on the selected rule sequence, and produce translation rules from the second annotated bilingual segment pair.
Dependent claims: 13, 14, 15, 16.
17. A non-transitory computer readable storage medium having embodied thereon a program, the program being executable by a processor for performing a method for generating an annotated bilingual segment pair, the method comprising:
receiving a first annotated bilingual segment pair comprising a source string including two or more words in a source language, a target phrase represented as a first tree that includes two or more words in a target language, and an alignment between the source string and the first tree;
processing the first annotated bilingual segment pair, the processing of the first annotated bilingual pair including:
re-labeling one or more nodes of a tree representing the first annotated bilingual segment pair,
re-structuring the tree representing the annotated bilingual segment pair, including adding parent nodes to the tree,
re-aligning the first annotated bilingual segment pair, and
generating a target forest including a plurality of target trees based on the re-labeling, re-structuring, and re-aligning;
building a derivation forest of a plurality of derivation trees based on the target forest, each derivation tree representing a rule sequence that corresponds to target tree;
deriving a plurality of rule sequences from the plurality of the derivation trees;
calculating a probability using an expectation-maximization algorithm that the derived rule sequence is correct for each of the derived rule sequences;
selecting the rule sequences having the highest probability;
generating a second annotated bilingual segment pair based on the selected rule sequence, wherein the second annotated bilingual segment pair has an alignment; and
extracting translation rules from the second annotated bilingual segment pair based on the alignment of the second annotated bilingual segment pair.
Dependent claims: 18, 19, 20, 21, 22, 23, 24, 25, 26, 27.
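The re-structuring step of claim 17, which adds parent nodes to the tree, can be illustrated by binarizing a flat node in two different ways so that several alternative target trees are available. This is a sketch under stated assumptions: trees are nested tuples, the new parent nodes receive a hypothetical `-BAR` label, and only the root node is re-structured. None of these choices is taken from the specification.

```python
def left_binarize(tree):
    """Group children from the left, adding a new parent node each time."""
    label, *kids = tree
    while len(kids) > 2:
        kids = [(label + "-BAR", kids[0], kids[1])] + kids[2:]
    return (label, *kids)


def right_binarize(tree):
    """Group children from the right, adding a new parent node each time."""
    label, *kids = tree
    while len(kids) > 2:
        kids = kids[:-2] + [(label + "-BAR", kids[-2], kids[-1])]
    return (label, *kids)


def restructure(tree):
    """Return the distinct alternatives: the original tree plus its
    left and right binarizations (root node only)."""
    unique = []
    for alt in (tree, left_binarize(tree), right_binarize(tree)):
        if alt not in unique:
            unique.append(alt)
    return unique


# A flat noun phrase with four children yields three alternative trees.
flat = ("NP", ("DT", "the"), ("JJ", "big"), ("JJ", "red"), ("NN", "dog"))
alternatives = restructure(flat)
```

Each alternative keeps the same words in the same order but exposes different intermediate nodes, so a different set of rule sequences can be derived from each. A node that already has two or fewer children is returned unchanged, in which case `restructure` yields a single tree.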
Specification