Estimation of parameters for machine translation without in-domain parallel data
First Claim
1. A method for estimating parameters for features of a translation scoring function and for scoring candidate translations in a target domain comprising:
- receiving a monolingual source corpus for a target domain and deriving n-gram counts from the monolingual source corpus or receiving n-gram counts derived only from the monolingual source corpus, the monolingual source corpus comprising sentences in a source language;
generating a multi-model for the target domain based on a phrase table for each of a set of comparative domains and a measure of similarity between the n-gram counts derived only from the source corpus for the target domain and the phrase tables for the comparative domains, each of the phrase tables storing a value for each of a set of features for each of a set of biphrases, the generated target domain multi-model being a weighted combination of two or more of the phrase tables for the comparative domains;
for the target domain, computing a measure of similarity between the monolingual source corpus and the target domain multi-model;
for each of a plurality of the comparative domains, computing a measure of similarity between a source corpus for the comparative domain and a respective comparative domain multi-model that is derived from phrase tables for others of the set of the comparative domains, each of the plurality of comparative domains being associated with parameters for at least some of the features of the translation scoring function;
estimating the parameters of the translation scoring function for the target domain based on the computed measure of similarity between the source corpus and the target domain multi-model, the computed measures of similarity for the comparative domains, and the parameters for the scoring function for the comparative domains; and
with a statistical machine translation component, scoring a translation with the translation scoring function, wherein the generating of the target domain multi-model, computing the measure of similarity between the source corpus and the target domain multi-model, computing the measure of similarity between a source corpus for the comparative domains and the respective comparative domain multi-models, and the estimating the parameters for the translation scoring function are performed with a computer processor.
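The multi-model generation step of the claim can be illustrated with a minimal sketch. The claim does not say which similarity measure is used or how the weights are normalized, so the coverage measure below, the dictionary layout of a phrase table (a biphrase tuple mapped to a list of feature values), and the function names `ngram_counts`, `coverage_similarity`, and `build_multi_model` are all assumptions made for illustration only.

```python
from collections import Counter


def ngram_counts(sentences, max_n=2):
    """Count all n-grams up to length max_n in whitespace-tokenized sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts


def coverage_similarity(counts, phrase_table):
    """Assumed measure: share of n-gram occurrences whose n-gram appears
    as a source phrase in the phrase table."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    source_phrases = {src for (src, _tgt) in phrase_table}
    covered = sum(c for ngram, c in counts.items() if ngram in source_phrases)
    return covered / total


def build_multi_model(counts, phrase_tables):
    """Combine phrase tables linearly, weighting each by its normalized
    similarity to the target-domain n-gram counts."""
    sims = {d: coverage_similarity(counts, t) for d, t in phrase_tables.items()}
    norm = sum(sims.values())
    if norm == 0:
        # No table covers the corpus: fall back to uniform weights (assumption).
        weights = {d: 1.0 / len(phrase_tables) for d in phrase_tables}
    else:
        weights = {d: s / norm for d, s in sims.items()}
    multi_model = {}
    for domain, table in phrase_tables.items():
        for biphrase, features in table.items():
            # A biphrase missing from a table contributes zero (assumption).
            acc = multi_model.setdefault(biphrase, [0.0] * len(features))
            for k, value in enumerate(features):
                acc[k] += weights[domain] * value
    return multi_model, weights
```

Only source-language text is needed on the target-domain side: the similarity is computed from n-gram counts against the source side of each comparative-domain phrase table, which matches the claim's requirement that no in-domain parallel data be used.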
Abstract
A system and method for estimating parameters for features of a translation scoring function for scoring candidate translations in a target domain are provided. Given a source language corpus for a target domain, a similarity measure is computed between the source corpus and a target domain multi-model, which may be a phrase table derived from phrase tables of comparative domains, weighted as a function of their similarity with the source corpus. The parameters of the log-linear scoring function for these comparative domains are known. A mapping function is learned between the similarity measure and the parameters of the scoring function for the comparative domains. Given the mapping function and the similarity measure for the target domain corpus, the parameters of the translation scoring function for the target domain are estimated. For parameters where a mapping function with a threshold correlation is not found, another method for obtaining the target domain parameter can be used.
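The mapping step described in the abstract can be sketched as follows. The abstract does not fix the form of the mapping function, the correlation threshold, or the alternative method, so this sketch assumes a least-squares line, a Pearson correlation threshold of 0.5, and a fallback to the mean of the comparative-domain parameters. The names `pearson`, `fit_line`, and `estimate_parameter` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a * x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sxy / sxx
    return a, my - a * mx


def estimate_parameter(sims, params, target_sim, min_abs_corr=0.5):
    """Estimate one scoring-function parameter for the target domain.

    sims and params hold, per comparative domain, the similarity measure
    and the known parameter value. The threshold and fallback are assumptions.
    """
    if abs(pearson(sims, params)) >= min_abs_corr:
        a, b = fit_line(sims, params)
        return a * target_sim + b, "mapped"
    # No usable mapping: use the mean of the known parameters instead.
    return sum(params) / len(params), "fallback"
```

Calling `estimate_parameter` once per feature weight gives a full parameter vector for the target domain, with each weight individually either mapped or filled in by the fallback.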
22 Claims
-
21. A system for estimating parameters for features of a translation scoring function for performing machine translation in a target domain comprising:
-
memory which stores a monolingual source corpus for a target domain or n-grams present in the monolingual source corpus, the monolingual source corpus comprising sentences in a source language;
a similarity computation component which computes a measure of similarity between the target domain monolingual source corpus and a phrase table for each of a set of comparative domains by comparing n-grams present in the monolingual source corpus and source language phrases in the phrase table;
a multi-model computation component which generates a multi-model for the target domain based on the phrase tables for the comparative domains and the computed measures of similarity, the generated target domain multi-model being a weighted combination of two or more of the phrase tables for the comparative domains;
the similarity computation component further computing, for the target domain, a measure of similarity between the source corpus and the target domain multi-model;
the similarity computation component further computing a measure of similarity for each of the comparative domains between a respective comparative domain source corpus and a respective comparative domain multi-model that is derived from phrase tables for others of the set of the comparative domains, each of the plurality of comparative domains being associated with parameters for at least some of the features of the translation scoring function;
a parameter computation component which estimates the parameters of the translation scoring function for the target domain based on the computed measure of similarity between the source corpus and the target domain multi-model, the computed measures of similarity for the comparative domains, and the parameters for the scoring function for the comparative domains;
a statistical machine translation component which scores translations of source text with the translation scoring function, at least some of the features of the translation scoring function being computed based on the target domain multi-model; and
a processor for implementing the similarity computation component, multi-model computation component, and parameter computation component.
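The scoring performed by the statistical machine translation component can be illustrated with a generic log-linear scorer, in which a candidate's score is the weighted sum of its feature values. This is a sketch of the standard log-linear form rather than the claimed system itself. The feature names `phrase` and `lm`, the candidate layout, and the assumption that feature values are already in the log domain are illustrative choices.

```python
def log_linear_score(features, weights):
    """Score = sum over features k of weights[k] * features[k].

    Feature values are assumed to be log-domain quantities, for example
    log phrase-translation probabilities read from the multi-model.
    """
    return sum(weights[name] * value for name, value in features.items())


def best_translation(candidates, weights):
    """Return the candidate with the highest log-linear score."""
    return max(candidates, key=lambda c: log_linear_score(c["features"], weights))


# Illustrative usage with made-up feature values and weights.
weights = {"phrase": 1.0, "lm": 0.5}
candidates = [
    {"text": "candidate A", "features": {"phrase": -1.0, "lm": -2.0}},
    {"text": "candidate B", "features": {"phrase": -1.5, "lm": -0.5}},
]
chosen = best_translation(candidates, weights)
```

The weights passed to the scorer are the parameters that the parameter computation component estimates for the target domain.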
-
-
22. A method for estimating parameters for features of a translation scoring function for scoring candidate translations in a target domain comprising:
-
for each of a plurality of parameters of the translation scoring function, learning a mapping function which maps a similarity measure to the parameter of the translation scoring function, the similarity measure being computed between a source corpus for one domain and a respective multi-model derived from phrase tables of other domains;
receiving a source corpus for a target domain;
generating a multi-model for the target domain based on phrase tables of comparative domains, the multi-model for the target domain being a phrase table which includes feature values for each of a set of such biphrases, the multi-model for the target domain being formed by combining at least two of the phrase tables of the comparative domains;
computing a measure of similarity between the target domain source corpus and the target domain multi-model;
based on the computed measure of similarity and the mapping functions, estimating the plurality of parameters for the translation scoring function for the target domain;
incorporating the translation scoring function into a statistical machine translation system,
wherein the learning of the mapping function, generating of the target domain multi-model, computing the measure of similarity, and the estimating of the set of parameters for the translation scoring function are performed with a computer processor.
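The training data for the mapping functions in this claim comes from a leave-one-out procedure: each domain's source corpus is compared with a multi-model built only from the other domains' phrase tables. A minimal sketch of that loop follows. The model-building and similarity functions are passed in as callables, and the toy stand-ins `union_model` and `token_coverage` (a phrase table reduced to a set of source phrases, similarity as token coverage) are assumptions for illustration, not the claimed measures.

```python
def leave_one_out_similarities(source_corpora, phrase_tables,
                               build_multi_model, similarity):
    """For each domain, compare its source corpus with a multi-model
    derived only from the phrase tables of the other domains."""
    result = {}
    for domain, corpus in source_corpora.items():
        others = {d: t for d, t in phrase_tables.items() if d != domain}
        multi_model = build_multi_model(corpus, others)
        result[domain] = similarity(corpus, multi_model)
    return result


def training_pairs(similarities, domain_params, param_name):
    """(similarity, known parameter value) pairs for one parameter,
    from which one mapping function can be learned."""
    return [(similarities[d], domain_params[d][param_name])
            for d in sorted(similarities)]


# Toy stand-ins (assumptions): a phrase table is a set of source phrases,
# the multi-model is their union, and similarity is token coverage.
def union_model(corpus, tables):
    return set().union(*tables.values())


def token_coverage(corpus, model):
    tokens = [tok for sentence in corpus for tok in sentence.split()]
    return sum(tok in model for tok in tokens) / len(tokens)
```

One list of pairs is produced per parameter of the scoring function, so a separate mapping function can be learned for each parameter as the claim requires.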
-
Specification