Semi-supervised training for statistical word alignment

US 8,433,556 B2
Filed: 11/02/2006
Issued: 04/30/2013
Est. Priority Date: 11/02/2006
Status: Active Grant

Abstract

A system and method for aligning words in parallel segments is provided. A first probability distribution of word alignments within a first corpus comprising unaligned word-level parallel segments according to a model estimate is calculated. The model estimate is modified according to the first probability distribution. One or more sub-models associated with the modified model estimate are discriminatively re-ranked according to word-level annotated parallel segments. A second probability distribution of the word alignments within the first corpus is calculated according to the re-ranked sub-models associated with the modified model estimate.
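The alternation described in the abstract can be sketched as a simple training loop. This is a minimal illustration of the control flow only, not the patented implementation: the callables `estimate`, `modify`, `rerank`, and `error` are hypothetical placeholders for the four steps, and the stopping rule (an iteration cap plus an error-convergence check) is an assumption loosely modeled on the dependent claims.

```python
def train(model, estimate, modify, rerank, error, max_iters=10, tol=1e-9):
    """Alternate unsupervised re-estimation with supervised re-ranking.

    All four callables are hypothetical stand-ins:
      estimate(model)      -> distribution over alignments of the unaligned corpus
      modify(model, dist)  -> model updated from that distribution
      rerank(model)        -> model with sub-models re-weighted on annotated data
      error(model)         -> scalar error against the manual alignments
    """
    prev_err = error(model)
    dist = estimate(model)               # first probability distribution
    for _ in range(max_iters):
        model = modify(model, dist)      # update the estimate from the unaligned corpus
        model = rerank(model)            # discriminative step on annotated segments
        dist = estimate(model)           # next distribution, from the re-ranked model
        err = error(model)
        if abs(prev_err - err) <= tol:   # assumed convergence test
            break
        prev_err = err
    return model, dist
```

Passing the steps in as functions keeps the sketch independent of any particular alignment model; a real system would supply its own estimation and re-ranking routines.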

341 Citations

14 Claims

1. A method for aligning words in parallel segments, the method comprising:
- calculating a first probability distribution, utilizing a processor and a memory, according to a model estimate of word alignments within a first corpus comprising word-level unaligned parallel segments, the model estimate comprising an N-best list of one or more sub-models;
  
  modifying the model estimate according to the first probability distribution;
  
  discriminatively re-ranking one or more sub-models associated with the modified model estimate according to word-level annotated parallel segments; and
  
  calculating a second probability distribution of the word alignments within the first corpus according to the re-ranked sub-models associated with the modified model estimate;
  
  wherein discriminatively re-ranking one or more sub-models within the modified model estimate according to manual alignments further comprises:
  
  adding manual alignments to hypothesized alignments within the first corpus;
  
  comparing the manual alignments to the hypothesized alignments; and
  
  weighting the one or more sub-models according to the comparison; and
  
  wherein the comparing of the manual alignments to the hypothesized alignments comprises:
  
  comparing an updated weighting factor for each sub-model derived using the first corpus to randomly generated weighting factors; and
  
  selecting one of the updated weighting factor and the randomly generated weighting factor that generates a least amount of error.
- Dependent claims (2, 3, 4, 5, 6, 7, 8):
  - 2. The method recited in claim 1, wherein the word-level annotated parallel segments comprise annotations indicating manual alignments.
  - 3. The method recited in claim 1, further comprising determining whether a first error associated with the re-ranked modified model estimate converges with a second error associated with the model estimate.
  - 4. The method recited in claim 1, further comprising determining a number of iterations to perform, the iterations comprising the steps of:
    - calculating a third probability distribution according to the re-ranked modified model estimate within the first corpus;
      
      further modifying the re-ranked modified model estimate according to the third probability distribution; and
      
      further discriminatively re-ranking one or more sub-models associated with the re-ranked modified model estimate according to the word-level annotated parallel segments.
  - 5. The method recited in claim 1, wherein the first corpus is larger than a second corpus.
  - 6. The method recited in claim 1, further comprising initializing the model estimate.
  - 7. The method recited in claim 1, wherein:
    - the weighting of the one or more sub-models according to the comparison is according to at least one weighting factor; and
      
      the discriminative re-ranking of the one or more sub-models within the modified model estimate according to manual alignments further comprises refining at least one of the at least one weighting factors using a one-dimensional error minimization until there is no further error reduction.
  - 8. The method recited in claim 7, wherein the refining of the at least one weighting factor further comprises calculating a piecewise constant function that evaluates an error of the word alignments selected by a best word alignment equation keeping the at least one weighting factor for each of the one or more sub-models constant except for one of the at least one weighting factor for the sub-model being evaluated.
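
The weighting steps recited in claims 1, 7, and 8 can be read as a coordinate-wise line search over sub-model weights. The sketch below is one possible reading, not the patented implementation. It assumes each candidate alignment is scored by a weighted sum of sub-model scores (the claims do not fix the scoring form), that every sentence comes with an N-best list of `(features, error)` pairs, and that the names `total_error`, `line_search`, `refine`, and `best_of_restarts` are purely illustrative. With all weights but one held fixed, each candidate's score is a line in the free weight, so the error of the top-ranked candidates is piecewise constant and only needs to be probed once per interval between line crossings.

```python
import random
from itertools import combinations

def total_error(weights, nbest_lists):
    """Sum the error of the top-scoring candidate in every N-best list."""
    total = 0.0
    for nbest in nbest_lists:
        best = max(nbest, key=lambda c: sum(w * f for w, f in zip(weights, c[0])))
        total += best[1]
    return total

def offset(features, weights, k):
    """Score contribution of every sub-model except sub-model k."""
    return sum(w * f for j, (w, f) in enumerate(zip(weights, features)) if j != k)

def line_search(weights, k, nbest_lists):
    """Minimize the error over weights[k] alone, other weights held constant."""
    points = set()
    for nbest in nbest_lists:
        for (f1, _), (f2, _) in combinations(nbest, 2):
            slope = f1[k] - f2[k]
            if slope == 0:
                continue  # parallel lines never swap rank
            # Value of weights[k] at which the two candidates score equally.
            points.add((offset(f2, weights, k) - offset(f1, weights, k)) / slope)
    points = sorted(points)
    probes = []
    if points:
        # One probe inside every interval of the piecewise constant error.
        probes = [points[0] - 1.0]
        probes += [(a + b) / 2 for a, b in zip(points, points[1:])]
        probes.append(points[-1] + 1.0)
    best_val, best_err = weights[k], total_error(weights, nbest_lists)
    for x in probes:
        trial = weights[:k] + [x] + weights[k + 1:]
        err = total_error(trial, nbest_lists)
        if err < best_err:
            best_val, best_err = x, err
    return best_val, best_err

def refine(weights, nbest_lists):
    """Repeat one-dimensional searches until no weight reduces the error."""
    weights = list(weights)
    err = total_error(weights, nbest_lists)
    improved = True
    while improved:
        improved = False
        for k in range(len(weights)):
            val, new_err = line_search(weights, k, nbest_lists)
            if new_err < err:
                weights[k], err, improved = val, new_err, True
    return weights, err

def best_of_restarts(weights, nbest_lists, restarts=5, seed=0):
    """Keep whichever start (current or random) ends with the least error."""
    rng = random.Random(seed)
    starts = [list(weights)]
    starts += [[rng.uniform(-1.0, 1.0) for _ in weights] for _ in range(restarts)]
    return min((refine(s, nbest_lists) for s in starts), key=lambda r: r[1])
```

The pairwise crossing computation is quadratic in the N-best size, which keeps the sketch short; the random starting range of -1 to 1 and the number of restarts are arbitrary choices made for illustration.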

9. A computer program embodied on a non-transitory computer readable medium having instructions for aligning words in parallel segments comprising:
- calculating a first probability distribution of word alignments within a first corpus comprising unaligned parallel segments according to a model estimate, the model estimate comprising an N-best list of one or more sub-models;
  
  modifying the model estimate according to the probability distribution;
  
  discriminatively re-ranking one or more sub-models within the modified model estimate according to annotated parallel segments; and
  
  calculating a second probability distribution of the word alignments within the first corpus according to the re-ranked modified model estimate;
  
  wherein discriminatively re-ranking one or more sub-models within the modified model estimate according to manual alignments further comprises:
  
  adding manual alignments to hypothesized alignments within the first corpus;
  
  comparing the manual alignments to the hypothesized alignments; and
  
  weighting the one or more sub-models according to the comparison;
  
  wherein:
  
  the weighting of the one or more sub-models according to the comparison is according to at least one weighting factor; and
  
  the discriminative re-ranking of the one or more sub-models within the modified model estimate according to manual alignments further comprises refining at least one of the at least one weighting factors using a one-dimensional error minimization until there is no further error reduction; and
  
  wherein the refining of the at least one weighting factor further comprises calculating a piecewise constant function that evaluates an error of the word alignments selected by a best word alignment equation keeping the at least one weighting factor for each of the one or more sub-models constant except for one of the at least one weighting factor for the sub-model being evaluated.
- Dependent claims (10, 11, 12, 13, 14):
  - 10. The computer program recited in claim 9, wherein the annotated parallel segments comprise annotations indicating manual alignments.
  - 11. The computer program recited in claim 9, further comprising an instruction for determining whether a first error due to the re-ranked modified model estimate converges with a second error due to the model estimate.
  - 12. The computer program recited in claim 9, further comprising an instruction for determining a number of iterations to perform, the iterations comprising the steps of:
    - calculating a third probability distribution according to the re-ranked modified model estimate within the first corpus;
      
      further modifying the re-ranked modified model estimate according to the third probability distribution; and
      
      further discriminatively re-ranking one or more sub-models associated with the re-ranked modified model estimate according to the word-level annotated parallel segments.
  - 13. The computer program recited in claim 9, wherein the first corpus is larger than a second corpus.
  - 14. The computer program recited in claim 9, further comprising an instruction for initializing the model estimate.

Current Assignee: University of Southern California
Original Assignee: University of Southern California
Inventors: Fraser, Alexander; Marcu, Daniel
Primary Examiner(s): Desir, Pierre-Louis
Assistant Examiner(s): Baker, Matthew H
Application Number: US11/592,450
Publication Number: US 20080109209A1
Time in Patent Office: 2,371 Days
Field of Search: 704/1-10
US Class Current: 704/4
CPC Class Codes: G06F 40/45 Example-based machine trans...
