Parallel fragment extraction from noisy parallel corpora

US 8,504,354 B2
Filed: 06/02/2008
Issued: 08/06/2013
Est. Priority Date: 06/02/2008
Status: Active Grant

Abstract

Machine translation algorithms for translating between a first language and a second language are often trained using parallel fragments, each comprising a first language corpus and a second language corpus that is an element-for-element translation of the first language corpus. Such training may involve large training sets that may be extracted from large bodies of similar sources, such as databases of news articles written in the first and second languages describing similar events; however, extracted fragments may be comparatively “noisy,” with extra elements inserted in each corpus. Extraction techniques may be devised that can differentiate between “bilingual” elements represented in both corpora and “monolingual” elements represented in only one corpus, and that can extract cleaner parallel fragments of bilingual elements. Such techniques may involve conditional probability determinations on one corpus with respect to the other corpus, or joint probability determinations that concurrently evaluate both corpora for bilingual elements.

20 Claims

1. A method of extracting parallel fragments from a first corpus in a first language and a second corpus in a second language on a computer having a processor, the method comprising:
- executing on the processor instructions configured to:
  
  for respective elements of the first corpus, calculate:
  
  a monolingual probability of the element with respect to preceding elements of the first corpus, and
  
  a bilingual probability of the element with respect to an aligned element of the second corpus;
  
  for respective elements of the first corpus, identify candidate fragments of the first corpus comprising respective elements of the first corpus having a greater bilingual probability of the element with aligned elements of the second corpus than only the monolingual probability of the element with respect to preceding elements of the first corpus to align elements of the first corpus with elements of the second corpus; and
  
  extract parallel fragments respectively comprising:
  
  the first corpus elements of a candidate fragment, and
  
  the second corpus elements aligned with the first corpus elements of the candidate fragment.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
  - 2. The method of claim 1:
    - the method comprising:
      
      preparing a bilingual coincidence data set associating elements of the first language with elements of the second language according to a bilingual coincidence; and
      
      the aligning further comprising:
      
      identifying maximally coincident elements of the second corpus according to the bilingual coincidence data set for the first corpus element and the second corpus element.
  - 3. The method of claim 1, the aligning further comprising:
    - upon identifying in the first corpus a structural element of the first language that is not translatable into the second language, aligning the structural element with a null element.
  - 4. The method of claim 3, the calculating for an element aligned with the null element comprising:
    - calculating the monolingual probability of the element with respect to preceding elements of the first corpus.
  - 5. The method of claim 1, the identifying comprising:
    - generating a hidden Markov model representing transition probabilities between a bilingual generation mode and a monolingual generation mode;
      
      setting a transition parameter to one of a bilingual generation mode and a monolingual generation mode; and
      
      for respective elements of the first corpus, updating the transition parameter comprising one of the bilingual generation mode and the monolingual generation mode based on the generation state of a preceding element and the hidden Markov model.
  - 6. The method of claim 5, the monolingual probability calculating comprising:
    - selecting the transition probability between the monolingual generation mode of the element and the generation mode of a preceding element.
  - 7. The method of claim 6, the bilingual probability computed according to the mathematical formula:
  - 8. The method of claim 1, the calculating performed according to at least one of:
    - a state search, a dynamic programming search, and a pathfinding search.
  - 9. The method of claim 1, the extracting comprising:
    - extracting parallel fragments respectively comprising:
      
      the first corpus elements of a candidate fragment, and
      
      the second corpus elements aligned with the first corpus elements of the candidate fragment,
      
      where the parallel fragment satisfies at least one parallel fragment condition.
  - 10. The method of claim 9, the at least one parallel fragment condition comprising at least one of:
    - a parallel fragment length of at least three first corpus elements;
      
      fewer than 30% of the first corpus elements and the second corpus elements aligned with the null element; and
      
      fewer than 70% of the first corpus elements and the second corpus elements comprising a structural element.
  - 11. The method of claim 9, the at least one parallel fragment condition comprising at least one of:
    - a maximum fragment size of twelve elements; and
      
      a fragment size of one fragment with respect to the other fragment having a ratio between 0.5 and 2.0.
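
The selection rule of claim 1 and the fragment conditions of claims 9 to 11 can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `Element`, `candidate_spans`, `passes_conditions`, and `extract_fragments` are hypothetical, the per-element probabilities are assumed to be supplied by some external model, and the second-corpus side of a fragment is assumed to be the contiguous span covering the aligned elements. Claims 10 and 11 recite "at least one of" their conditions; the sketch applies all of them at once as a stricter simplification, and it checks the structural-element share on the first corpus only.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Element:
    """One first-corpus element with precomputed scores (hypothetical container)."""
    token: str
    aligned: Optional[int]    # index into the second corpus; None stands for the null element
    p_mono: float             # monolingual probability given preceding elements
    p_bi: float               # bilingual probability given the aligned element
    structural: bool = False  # assumed flag marking a "structural element"


def candidate_spans(elems: List[Element]) -> List[Tuple[int, int]]:
    """Return half-open index ranges of consecutive elements with p_bi > p_mono."""
    spans, start = [], None
    for i, e in enumerate(elems):
        bilingual = e.p_bi > e.p_mono
        if bilingual and start is None:
            start = i
        elif not bilingual and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(elems)))
    return spans


def passes_conditions(elems: List[Element], span: Tuple[int, int]) -> bool:
    """Apply the thresholds recited in claims 10 and 11 (all at once, by assumption)."""
    frag = elems[span[0]:span[1]]
    targets = sorted({e.aligned for e in frag if e.aligned is not None})
    if not targets:
        return False
    n_src = len(frag)
    n_tgt = targets[-1] - targets[0] + 1        # assumed contiguous second-corpus span
    null_src = sum(e.aligned is None for e in frag)
    null_tgt = n_tgt - len(targets)             # second-corpus elements left unaligned
    null_frac = (null_src + null_tgt) / (n_src + n_tgt)
    structural_frac = sum(e.structural for e in frag) / n_src  # first corpus only
    return (
        n_src >= 3
        and max(n_src, n_tgt) <= 12
        and 0.5 <= n_src / n_tgt <= 2.0
        and null_frac < 0.30
        and structural_frac < 0.70
    )


def extract_fragments(elems: List[Element], second: List[str]):
    """Pair each surviving candidate span with its aligned second-corpus span."""
    out = []
    for span in candidate_spans(elems):
        if passes_conditions(elems, span):
            frag = elems[span[0]:span[1]]
            idx = [e.aligned for e in frag if e.aligned is not None]
            out.append(([e.token for e in frag], second[min(idx):max(idx) + 1]))
    return out
```

Keeping the tagging step separate from the filtering step mirrors the split between claim 1 (identify candidates) and claim 9 (keep only fragments that satisfy a condition), so either step can be swapped out independently.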

12. A method of extracting parallel fragments from a first corpus in a first language and a second corpus in a second language on a computer having a processor, the method comprising:
- executing on the processor instructions configured to:
  
  for at least one first corpus element and at least one second corpus element, calculate:
  
  a first corpus monolingual probability of the at least one first corpus element with respect to preceding elements of the first corpus, and
  
  a second corpus monolingual probability of the at least one second corpus element with respect to preceding elements of the second corpus, and
  
  a bilingual probability of the at least one first corpus elements and the at least one second corpus elements with respect to one another;
  
  align the first corpus elements and the second corpus elements to identify candidate fragments comprising:
  
  a sequence of first corpus elements having a greater bilingual probability than only a first corpus monolingual property, and
  
  a sequence of second corpus elements aligned with the first corpus elements having a greater bilingual probability than only a second corpus monolingual property; and
  
  extract parallel fragments respectively comprising first corpus elements of a candidate fragment and aligned second corpus elements of the candidate fragment.
- Dependent claims (13, 14, 15, 16, 17, 18, 19)
  - 13. The method of claim 12, comprising:
    - preparing a bilingual coincidence data set associating elements of the first language with elements of the second language according to a bilingual coincidence.
  - 14. The method of claim 12, the bilingual probability calculating comprising:
    - retrieving the bilingual coincidence from the bilingual coincidence data set for the at least one first corpus element and the at least one second corpus element.
  - 15. The method of claim 12, the calculating comprising:
    - iteratively calculating a fragment probability for candidate fragments comprising at least one fragment element of at least one of the first corpus and the second corpus.
  - 16. The method of claim 15, the identifying comprising:
    - identifying candidate fragments comprising a sequence of elements having a maximal total of fragment probabilities for the first corpus and the second corpus.
  - 17. The method of claim 16, the fragment sequence probability calculated according to the mathematical formula:
    - $$\delta[j,l] = \max_{0 \le i \le j,\ 0 \le k \le l} \{\, \delta[i,l] \cdot A[i,j],\ \delta[j,k] \cdot B[k,l],\ \delta[i,k] \cdot E[i,j,k,l] \,\}$$
      
      wherein:
      
      $\delta[j,l]$ represents the probability of the candidate sequence of fragments beginning at element 0 of the first corpus and 0 of the second corpus and ending at element j of the first corpus and element l of the second corpus;
      
      $A[i,j]$ represents the monolingual probability of monolingually generating elements i through j of the first corpus, calculated according to the mathematical formula;
  - 18. The method of claim 16, the calculating performed according to at least one of:
    - a state search, a dynamic programming search, and a pathfinding search.
  - 19. The method of claim 12, the extracting comprising:
    - extracting parallel fragments respectively comprising:
      
      the first corpus elements of a candidate fragment, and
      
      the second corpus elements aligned with the first corpus elements of the candidate fragment,
      
      where the parallel fragment satisfies at least one parallel fragment condition.
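
The recurrence in claim 17 lends itself to a dynamic programming search of the kind named in claim 18. The following Python sketch is one possible reading of it, under stated assumptions. The function name `best_segmentation` is hypothetical. The scoring functions `A`, `B`, and `E` are passed in by the caller because their formulas are not reproduced in this listing; `B` and `E` are assumed to be the second-corpus monolingual score and the bilingual score by analogy with `A`. Spans are treated as half-open Python ranges, and each step is required to consume at least one element so the recurrence does not refer to itself.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int, int, int]  # (i, j, k, l): first[i:j] paired with second[k:l]


def best_segmentation(
    n: int,
    m: int,
    A: Callable[[int, int], float],            # monolingual score of first[i:j]
    B: Callable[[int, int], float],            # assumed: monolingual score of second[k:l]
    E: Callable[[int, int, int, int], float],  # assumed: bilingual score of the paired spans
) -> Tuple[float, List[Span]]:
    """Fill delta[j][l] per the recurrence and backtrace the bilingual spans."""
    delta = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    delta[0][0] = 1.0  # empty prefix of both corpora

    for j in range(n + 1):
        for l in range(m + 1):
            if j == 0 and l == 0:
                continue
            best, arg = 0.0, None
            for i in range(j):                 # monolingual step in the first corpus
                s = delta[i][l] * A(i, j)
                if s > best:
                    best, arg = s, ("A", i, l)
            for k in range(l):                 # monolingual step in the second corpus
                s = delta[j][k] * B(k, l)
                if s > best:
                    best, arg = s, ("B", j, k)
            for i in range(j):                 # bilingual step covering both corpora
                for k in range(l):
                    s = delta[i][k] * E(i, j, k, l)
                    if s > best:
                        best, arg = s, ("E", i, k)
            delta[j][l], back[j][l] = best, arg

    # Walk the back-pointers from the end, collecting only the bilingual steps.
    fragments: List[Span] = []
    j, l = n, m
    while back[j][l] is not None:
        kind, pj, pl = back[j][l]
        if kind == "E":
            fragments.append((pj, j, pl, l))
        j, l = pj, pl
    fragments.reverse()
    return delta[n][m], fragments
```

The table has (n+1)(m+1) cells and the bilingual step scans all earlier cells, so this naive version runs in O(n²m²) time; a practical system would presumably cap span lengths and work in log space to avoid underflow on long inputs.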

20. A method of extracting parallel fragments from a first corpus in a first language and a second corpus in a second language on a computer having a processor, the method comprising:
- executing on the processor instructions configured to:
  
  prepare a bilingual coincidence data set associating elements of the first language with elements of the second language according to a bilingual coincidence;
  
  generate a hidden Markov model representing transition probabilities between a bilingual generation mode and a monolingual generation mode;
  
  set a transition parameter to one of a bilingual generation mode and a monolingual generation mode; and
  
  align elements of the first corpus with elements of the second corpus by:
  
  for respective elements of the first corpus, identifying maximally coincident elements of the second corpus according to the bilingual coincidence data set for the first corpus element and the second corpus element, and
  
  upon identifying in the first corpus a structural element of the first language that is not translatable into the second language, aligning the structural element with a null element;
  
  for respective elements of the first corpus, calculating:
  
  a monolingual probability of the element with respect to preceding elements of the first corpus, calculated with respect to the transition probability between the monolingual generation mode of the element and the generation mode of a preceding element;
  
  for elements not aligned with the null element of the second corpus, a bilingual probability of the element with respect to an aligned element of the second corpus based on the bilingual coincidence, and calculated according to the mathematical formula;
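
The hidden Markov model recited in claims 5 and 20 switches between a bilingual and a monolingual generation mode. A rough Python sketch of how such a two-state model could be decoded is given below. It uses the standard Viterbi algorithm, which the claims do not name, so it should be read as one plausible realization and not as the claimed procedure. The function name `decode_modes`, the state labels `"B"` and `"M"`, and the dictionary layout of the transition and start probabilities are all assumptions; the per-element bilingual and monolingual probabilities are taken as inputs because the formula for the bilingual probability is not reproduced in this listing.

```python
import math
from typing import Dict, List, Sequence


def _log(x: float) -> float:
    """Logarithm that maps zero probability to negative infinity."""
    return math.log(x) if x > 0 else float("-inf")


def decode_modes(
    p_bi: Sequence[float],               # bilingual probability per first-corpus element
    p_mono: Sequence[float],             # monolingual probability per first-corpus element
    trans: Dict[str, Dict[str, float]],  # trans[prev][cur], e.g. trans["B"]["M"]
    start: Dict[str, float],             # initial mode probabilities
) -> List[str]:
    """Return the most probable mode ("B" or "M") for each element."""
    if not p_bi:
        return []
    states = ("B", "M")
    emit = {"B": p_bi, "M": p_mono}

    # Best log-score of any mode sequence ending in each state at position 0.
    score = {s: _log(start[s]) + _log(emit[s][0]) for s in states}
    back: List[Dict[str, str]] = []

    for t in range(1, len(p_bi)):
        new_score, ptr = {}, {}
        for s in states:
            # Pick the predecessor mode that maximizes the score of arriving in s.
            prev = max(states, key=lambda r: score[r] + _log(trans[r][s]))
            new_score[s] = score[prev] + _log(trans[prev][s]) + _log(emit[s][t])
            ptr[s] = prev
        score = new_score
        back.append(ptr)

    # Backtrace from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path
```

Runs of consecutive `"B"` labels in the returned list would then correspond to candidate fragments. Sticky transition probabilities (a high chance of staying in the current mode) discourage the decoder from flipping modes on a single noisy element.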


Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Quirk, Christopher B.; Udupa, Raghavendra U.
Primary Examiner(s): PULLIAS, JESSE SCOTT
Application Number: US12/131,144
Publication Number: US 20090299729A1
Time in Patent Office: 1,891 Days
Field of Search: None
US Class Current: 704/8
CPC Class Codes:
G06F 40/44 Statistical methods, e.g. p...
G06F 40/45 Example-based machine trans...
