Selection of domain-adapted translation subcorpora

US 8,838,433 B2
Filed: 02/08/2011
Issued: 09/16/2014
Est. Priority Date: 02/08/2011
Status: Active Grant

2 Assignments

0 Petitions

Abstract

An architecture is described that provides the capability to subselect the most relevant data from an out-domain corpus, for use either in isolation or in conjunction with in-domain data. The architecture performs domain adaptation for machine translation by selecting the most relevant sentences from a larger general-domain corpus of parallel translated sentences. The methods for selecting the data include monolingual cross-entropy, monolingual cross-entropy difference, bilingual cross-entropy, and bilingual cross-entropy difference. A translation model is trained on both the in-domain data and an out-domain subset, and the models can be interpolated together to boost performance on in-domain translation tasks.
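The four selection measures named in the abstract can be written out as follows. This is a sketch under stated assumptions, not the patent's own notation: the symbols H, I, O, s, and t are introduced here for illustration, the "combination" across source and target sides is taken to be a sum, and the sign convention is chosen so that a lower score means a more relevant sentence.

```latex
% Per-word cross-entropy of sentence s = w_1 ... w_n under language model M
H_M(s) = -\frac{1}{n} \sum_{i=1}^{n} \log p_M(w_i \mid w_1, \ldots, w_{i-1})

% Monolingual cross-entropy: score by the in-domain model I alone
\mathrm{score}_1(s) = H_I(s)

% Monolingual cross-entropy difference: in-domain model I minus out-domain model O
\mathrm{score}_2(s) = H_I(s) - H_O(s)

% Bilingual variants for a sentence pair (s, t), combined here by summation
\mathrm{score}_3(s, t) = H_{I,\mathrm{src}}(s) + H_{I,\mathrm{tgt}}(t)
\mathrm{score}_4(s, t) = \bigl[H_{I,\mathrm{src}}(s) - H_{O,\mathrm{src}}(s)\bigr]
                       + \bigl[H_{I,\mathrm{tgt}}(t) - H_{O,\mathrm{tgt}}(t)\bigr]
```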

33 Citations

20 Claims

1. A computer-implemented selection system, comprising:
- linguistic data corpora that include an in-domain corpus and an out-domain corpus for domain adaptation for machine translation model training, the in-domain corpus and the out-domain corpus including multi-lingual data translated to the corpora in parallel;
  
  a relevance component that selects relevant multi-lingual data from the out-domain corpus based on a similarity measure, the similarity measure considering a difference of cross-entropy scores according to an in-domain language model and an out-domain language model, the relevant multi-lingual data utilized in combination with the in-domain corpus or in isolation without the in-domain corpus; and
  
  a processor that executes computer-executable instructions associated with at least the relevance component.
  - 2. The system of claim 1, wherein the relevant multi-lingual data is selected based on the similarity measure that considers the difference of cross-entropy scores according to the in-domain language model and the out-domain language model on a source side and a target side.
  - 3. The system of claim 1, wherein the relevant multi-lingual data is selected based on the similarity measure that combines cross-entropy scores according to the in-domain language model on each of a source side and a target side.
  - 4. The system of claim 1, wherein the relevant multi-lingual data is selected based on the similarity measure that considers the difference of the cross-entropy score according to the in-domain language model and cross-entropy score according to the out-domain language model.
  - 5. The system of claim 1, wherein the relevant multi-lingual data is selected based on the similarity measure that includes a cross-entropy score according to the in-domain language model on each of a source side and a target side.
  - 6. The system of claim 1, wherein the multi-lingual data is sentences that are ranked based on the similarity measure for selection as the relevant multi-lingual data.
  - 7. The system of claim 1, wherein the selection is based on ranking and scoring techniques that are applied to at least one of a source side language or a target side language, and bilingual sentence pairs are selected from the out-domain corpus.
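The similarity measure of claims 1 and 2 might be sketched as below. This is a minimal illustration, not the patented implementation: the add-one smoothed unigram model stands in for whatever language models a real system would use, the names `UnigramLM`, `ce_difference`, and `bilingual_ce_difference` are hypothetical, and summing the source-side and target-side differences is an assumption, since the claims say only that both sides are considered.

```python
import math
from collections import Counter


class UnigramLM:
    """Add-one smoothed unigram model (a stand-in for a real n-gram LM)."""

    def __init__(self, sentences, vocab):
        self.counts = Counter(w for s in sentences for w in s.split())
        self.total = sum(self.counts.values())
        self.vocab_size = len(vocab) + 1  # +1 reserves mass for unseen words

    def cross_entropy(self, sentence):
        """Per-word cross-entropy of the sentence, in bits."""
        words = sentence.split()
        if not words:
            return 0.0
        log_prob = sum(
            math.log2((self.counts[w] + 1) / (self.total + self.vocab_size))
            for w in words
        )
        return -log_prob / len(words)


def ce_difference(sentence, in_lm, out_lm):
    """Cross-entropy difference; lower means more in-domain-like."""
    return in_lm.cross_entropy(sentence) - out_lm.cross_entropy(sentence)


def bilingual_ce_difference(src, tgt, lms):
    """Source-side plus target-side difference (the sum is an assumption)."""
    return (ce_difference(src, lms["in_src"], lms["out_src"])
            + ce_difference(tgt, lms["in_tgt"], lms["out_tgt"]))
```

Here `lms` is assumed to be a dictionary holding four models, one in-domain and one out-domain model for each language. Subtracting the out-domain score penalizes sentences that are merely common everywhere, so the ranking favors sentences that look specifically like the in-domain corpus.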

8. A computer-implemented selection method, comprising acts of:
- receiving a set of trained in-domain language models, one for each language of multi-lingual sentences based on an in-domain corpus and a set of trained out-domain language models, one for each language of multi-lingual sentences based on an out-domain corpus;
  
  computing similarity scores for each of the sentences of the out-domain corpus, the scores obtained using a similarity measure as applied to the sentences against the in-domain language model and the out-domain language model;
  
  ranking the sentences based on the scores;
  
  selecting a set of sentences from the out-domain corpus based on the ranked scores;
  
  building a translation model based on either the set selected from the out-domain corpus, or a combination of the set selected from the out-domain corpus and the in-domain corpus; and
  
  utilizing a processor that executes instructions stored in memory to perform at least one of the acts of computing, ranking, selecting, or building.
  - 9. The method of claim 8, further comprising ranking the sentences for selection according to similarity scores, the scores obtained as a combination of a difference of the similarity scores according to the in-domain language model and the out-domain language model on each of a source side and a target side.
  - 10. The method of claim 8, further comprising ranking the sentences for selection according to similarity scores, the scores obtained as a combination of the similarity scores according to the in-domain language model on each of a source side and a target side.
  - 11. The method of claim 8, further comprising ranking the sentences for selection according to similarity scores, the scores obtained as a difference of the similarity scores according to the in-domain language model and, a similarity score according to the out-domain language model on a source side or a target side.
  - 12. The method of claim 8, further comprising ranking sentences by similarity score for selection according to the in-domain language model.
  - 13. The method of claim 8, further comprising:
    - generating an in-domain machine translation system from the in domain corpus; and
      
      combining the in-domain machine translation system and a subselected out-domain translation system to create a domain adapted machine translation system.
  - 14. The method of claim 13, further comprising tuning the combined in-domain machine translation system and the subselected out-domain translation system using an in-domain tuning corpus.
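The ranking, selecting, and building acts of claim 8 might be sketched as below. The function names, the cut-off `k`, and the convention that a lower score ranks first are all assumptions made for illustration; the claim does not state how many sentences are kept or how the scores are ordered.

```python
def select_subcorpus(out_pairs, score_fn, k):
    """Rank out-domain sentence pairs by score (ascending) and keep the best k."""
    ranked = sorted(out_pairs, key=lambda pair: score_fn(*pair))
    return ranked[:k]


def build_training_set(selected, in_domain_pairs=None):
    """Return the selected subset alone, or combined with the in-domain pairs."""
    return list(selected) + list(in_domain_pairs or [])


# Toy usage with precomputed, made-up scores for three sentence pairs.
toy_scores = {("a", "x"): 0.9, ("b", "y"): -0.4, ("c", "z"): 0.1}
best_two = select_subcorpus(list(toy_scores), lambda s, t: toy_scores[(s, t)], 2)
training_pairs = build_training_set(best_two, in_domain_pairs=[("d", "w")])
```

In practice `score_fn` would be one of the similarity measures from the claims, and the resulting pairs would be handed to whatever translation model trainer the system uses.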

15. A computer-implemented selection method, comprising acts of:
- receiving an in-domain corpus of bilingual sentences and an out-domain corpus of bilingual sentences;
  
  generating an in-domain machine translation system from the in-domain corpus;
  
  training an in-domain language model based on the in-domain corpus and training an out-domain language model based on the out-domain corpus;
  
  applying a similarity measure to a sentence of the out-domain corpus and the in-domain language model, and to the sentence and the out-domain language model, to obtain similarity scores;
  
  selecting relevant sentences from the out-domain corpus based on the scores to create a subselected out-domain translation system;
  
  combining the in-domain machine translation system and the subselected out-domain translation system to create a domain adapted machine translation system; and
  
  utilizing a processor that executes instructions stored in memory to perform at least one of the acts of generating, training, applying, selecting, or combining.
  - 16. The method of claim 15, further comprising ranking the sentences for selection according to similarity scores, the scores obtained as a combination of a difference of the similarity scores according to the in-domain language model and the out-domain language model on each of a source side and a target side.
  - 17. The method of claim 15, further comprising ranking the sentences for selection according to similarity scores, the scores obtained as a combination of the similarity scores according to the in-domain language model on each of a source side and a target side.
  - 18. The method of claim 15, further comprising ranking the sentences for selection according to similarity scores, the scores obtained as a difference of the similarity scores according to the in-domain language model and, a similarity score according to the out-domain language model on a source side or a target side.
  - 19. The method of claim 15, further comprising ranking sentences by similarity score for selection according to the in-domain language model.
  - 20. The method of claim 15, further comprising training an out-domain machine translation system on the selected out-domain sentences to create the subselected out-domain translation system.
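One way to read the combining act of claim 15 is as an interpolation of two translation models, which the abstract mentions as an option. The sketch below assumes a simple linear interpolation of phrase translation probabilities; the claims do not give a formula, and the function name, the dictionary representation of a phrase table, and the `weight` parameter are hypothetical.

```python
def interpolate_phrase_tables(in_table, out_table, weight):
    """Linearly mix two phrase tables: weight * p_in + (1 - weight) * p_out.

    Each table maps (source_phrase, target_phrase) to a probability.
    Pairs missing from one table contribute zero from that table.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    combined = {}
    for key in set(in_table) | set(out_table):
        combined[key] = (weight * in_table.get(key, 0.0)
                         + (1.0 - weight) * out_table.get(key, 0.0))
    return combined
```

Under this reading, the interpolation weight would be one of the quantities set when the combined system is tuned on an in-domain tuning corpus, as in claim 14.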

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Axelrod, Amittai; Gao, Jianfeng; He, Xiaodong
Primary Examiner(s): Lerner, Martin
Application Number: US13/022,633
Publication Number: US 20120203539A1
Time in Patent Office: 1,316 Days
Field of Search: 704/2, 704/8, 704/256.3, 704/277
US Class Current: 704/2
CPC Class Codes: G06F 40/42 (Data-driven translation)
