Example based machine translation system

US 7,353,165 B2
Filed: 06/28/2002
Issued: 04/01/2008
Est. Priority Date: 06/28/2002
Status: Expired due to Fees

Abstract

The present invention performs machine translation by matching fragments of a source language sentence to be translated to source language portions of an example in an example base. When all relevant examples have been identified in the example base, the examples are subjected to phrase alignment in which fragments of the target language sentence in each example are aligned against the matched fragments of the source language sentence in the same example. A translation component then substitutes the aligned target language phrases from the matched examples for the matched fragments in the source language sentence.

21 Claims

1. A method of performing machine translation of a source language (SL) input to a translation output in a target language (TL), comprising:
- matching fragments of the SL input to SL fragments of examples in an example base;
  
  identifying all matched blocks in the SL input as blocks of terms in the SL input that are matched by one or more SL fragments in an example;
  
  selecting block combinations of the matched blocks to cover one or more fragments of the SL input;
  
  for each block in the selected block combinations, identifying an example associated with the block;
  
  aligning TL portions of the identified example with SL portions of the identified example that match the one or more fragments of the SL input; and
  
  providing the translation output based on the aligned portions, wherein identifying an example associated with a block comprises:
  
  calculating a block score corresponding to each example containing the block by calculating the block score as follows:
  
  $similarity_{j} = \sum_{k = 1}^{K} TFIDF_{kj}$
  
  where TFIDF is term frequency/inverse document frequency;
  
  K = a total number of common terms included both in example j and the SL input;
  
  TFIDF_kj = term k's TF/IDF weight in example j; and
  
  similarity_j = matching weight between the example j and the SL input; and
  
  identifying the example associated with the block based on the block score.
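
The block score of claim 1 can be illustrated with a minimal sketch that sums the TF/IDF weights of the terms an example shares with the SL input. The weighting variant (term count times log(N/df)), the whitespace tokenization, and the names `tfidf_weights`, `similarity`, and `best_example` are assumptions made for illustration; the claim does not fix any of them, and the sketch scores every example rather than only those containing a given block.

```python
import math
from collections import Counter

def tfidf_weights(examples):
    """Return one {term: weight} dict per example (assumed weighting: tf * log(N / df))."""
    n = len(examples)
    df = Counter(term for ex in examples for term in set(ex))
    return [
        {term: tf * math.log(n / df[term]) for term, tf in Counter(ex).items()}
        for ex in examples
    ]

def similarity(weights_j, sl_input):
    """similarity_j: sum of TFIDF_kj over the K terms common to example j and the input."""
    common = set(weights_j) & set(sl_input)
    return sum(weights_j[t] for t in common)

def best_example(examples, sl_input):
    """Return (index of the highest-scoring example, all scores); ties go to the first."""
    weights = tfidf_weights(examples)
    scores = [similarity(w, sl_input) for w in weights]
    return max(range(len(examples)), key=scores.__getitem__), scores

# Hypothetical toy example base (SL side only).
examples = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "a cat chased the dog".split(),
]
index, scores = best_example(examples, "the cat sat".split())
```

Here the first example shares both "cat" and "sat" with the input, so it receives the highest score; "the" occurs in every example and therefore contributes a weight of zero under the assumed weighting.
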
2. The method of claim 1 wherein providing the translation output comprises:
- outputting a plurality of possible translation outputs.
3. The method of claim 2 and further comprising:
- calculating a confidence measure for each translation output.
4. The method of claim 3 wherein calculating comprises:
- calculating the confidence measure as a translation confidence level as follows;
5. The method of claim 3 and further comprising:
- identifying portions of the translation output that require a user's attention.
6. The method of claim 1 wherein matching fragments of the SL input to fragments of examples comprises:
- identifying bi-terms in the SL input; and
  
  accessing a bi-term index of the example base that includes example identifiers identifying examples that contain indexed bi-terms.
7. The method of claim 6 wherein accessing a bi-term index comprises:
- accessing a bi-term index of the example base that includes word position information indicative of a word position in the example where the bi-term resides.
8. The method of claim 7 wherein accessing a bi-term index comprises:
- accessing a bi-term index of the example base that includes a score indicative of a term frequency/inverse document frequency (TF/IDF) score for the bi-term in the example.
9. The method of claim 8 wherein accessing a bi-term index comprises:
- accessing a bi-term index of the example base that includes a corpus score indicative of a representative TF/IDF score for the bi-term across the example base.
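
Claims 6 through 9 describe a bi-term index that stores, per bi-term, example identifiers, word positions, a per-example TF/IDF score, and a corpus-level score. A rough sketch of such a structure follows. Treating a bi-term as two adjacent words, the TF/IDF variant, and using the mean as the "representative" corpus score are all assumptions, as are the names `bi_terms`, `build_index`, and `candidate_examples`.

```python
import math
from collections import defaultdict

def bi_terms(tokens):
    """Adjacent word pairs; treating a bi-term as two consecutive words is an assumption."""
    return list(zip(tokens, tokens[1:]))

def build_index(examples):
    """Map each bi-term to (postings, corpus_score).

    postings: list of (example_id, word_position, tfidf) tuples.
    corpus_score: mean tfidf over the postings (one assumed 'representative' score).
    """
    n = len(examples)
    occurrences = defaultdict(list)  # bi-term -> [(example_id, position)]
    for ex_id, tokens in enumerate(examples):
        for pos, bt in enumerate(bi_terms(tokens)):
            occurrences[bt].append((ex_id, pos))
    index = {}
    for bt, occ in occurrences.items():
        ids = [ex_id for ex_id, _ in occ]
        idf = math.log(n / len(set(ids)))
        postings = [(ex_id, pos, ids.count(ex_id) * idf) for ex_id, pos in occ]
        corpus_score = sum(p[2] for p in postings) / len(postings)
        index[bt] = (postings, corpus_score)
    return index

def candidate_examples(index, sl_tokens):
    """Example ids sharing at least one bi-term with the SL input."""
    hits = set()
    for bt in bi_terms(sl_tokens):
        postings, _ = index.get(bt, ([], 0.0))
        hits.update(ex_id for ex_id, _, _ in postings)
    return hits

# Hypothetical toy example base (SL side only).
examples = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "a cat chased the dog".split(),
]
bi_index = build_index(examples)
hits = candidate_examples(bi_index, "the dog sat".split())
```

Looking up the bi-terms of the input returns only the examples worth scoring, and the stored word positions let a matcher extend adjacent bi-term hits into longer matched blocks.
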
10. The method of claim 1 wherein aligning TL portions of the example with the SL portions comprises:
- performing word alignment to identify anchor alignment points between the SL portion and the TL portion of the example;
  
  finding all continuous alignments between the TL portion and the SL portion based on the anchor alignment points; and
  
  finding all non-continuous alignments between the TL portion and the SL portion based on the anchor alignment points.
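
To make the "continuous alignments" step of claim 10 concrete, the sketch below enumerates SL spans whose anchored TL words form a span that is not anchored to anything outside the SL span. This consistency test is borrowed from the phrase-extraction heuristic used in phrase-based statistical MT and is an assumption here, not the procedure the patent specifies; non-continuous alignments are not handled. The name `continuous_alignments` is hypothetical.

```python
def continuous_alignments(anchors, sl_len):
    """Return a list of ((sl_start, sl_end), (tl_start, tl_end)) pairs, inclusive indices.

    anchors: set of (sl_index, tl_index) word alignment points.
    A pair is kept when every anchor whose TL word falls inside the TL span
    has its SL word inside the SL span.
    """
    results = []
    for start in range(sl_len):
        for end in range(start, sl_len):
            tl = [j for i, j in anchors if start <= i <= end]
            if not tl:
                continue  # no anchor point inside this SL span
            lo, hi = min(tl), max(tl)
            # Reject if a TL word in [lo, hi] is anchored to an SL word outside the span.
            if all(start <= i <= end for i, j in anchors if lo <= j <= hi):
                results.append(((start, end), (lo, hi)))
    return results

# Hypothetical anchors: SL word 0 -> TL word 0, and SL words 1, 2 swap order in the TL.
anchors = {(0, 0), (1, 2), (2, 1)}
spans = continuous_alignments(anchors, sl_len=3)
```

With these anchors, the SL span 0..1 is rejected because TL word 1, which lies between its anchored TL words, belongs to SL word 2.
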
11. The method of claim 1 wherein selecting block combinations comprises:
- calculating a block combination score for different combinations of the identified blocks; and
  
  identifying N best block combinations based on the block combination scores.
12. The method of claim 11 wherein calculating a block combination score comprises:
- $EdgeLen_{i} = \begin{cases} \dfrac{1}{\sum_{k = m}^{n} TFIDF_{k}}, & \text{if } n > m \\ 10, & \text{if } n = m \end{cases}$
  
  where i = an "edge" (block) index number in the SL input;
  
  m = a word indexing number of the starting point of "edge" i;
  
  n = a word indexing number of the ending point of "edge" i;
  
  k = a word indexing number of each term of "edge" i;
  
  TFIDF_k = term k's average TF/IDF weight in the example base; and
  
  EdgeLen_i = a weight of block i.
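
The EdgeLen weight of claim 12 and the N-best selection of claim 11 can be sketched as follows. `edge_len` follows the formula directly; scoring a combination as the sum of its block weights and treating a smaller total as better are assumptions (they fit the fact that longer, more informative blocks get smaller weights), and the names `combination_score` and `n_best` are hypothetical.

```python
def edge_len(tfidf, m, n):
    """EdgeLen_i for a block spanning word indices m..n (inclusive) of the SL input.

    tfidf[k] is term k's average TF/IDF weight in the example base.
    """
    if n == m:
        return 10.0  # single-word block: fixed weight from claim 12
    return 1.0 / sum(tfidf[m:n + 1])

def combination_score(tfidf, blocks):
    """Assumed scoring: sum of EdgeLen over the blocks; lower is treated as better."""
    return sum(edge_len(tfidf, m, n) for m, n in blocks)

def n_best(tfidf, combinations, n):
    """Return the n combinations with the smallest assumed score."""
    return sorted(combinations, key=lambda blocks: combination_score(tfidf, blocks))[:n]

# Hypothetical average TF/IDF weights for a four-word SL input.
tfidf = [0.5, 1.5, 2.0, 1.0]
combinations = [
    [(0, 0), (1, 3)],  # a single word followed by a three-word block
    [(0, 1), (2, 3)],  # two two-word blocks
    [(0, 3)],          # one block covering the whole input
]
top_two = n_best(tfidf, combinations, 2)
```

The fixed weight of 10 for single-word blocks makes any combination that falls back to word-by-word coverage score far worse than one built from longer blocks.
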

13. A method of performing machine translation of a source language (SL) input to a translation output in a target language (TL), comprising:
- matching fragments of the SL input to SL fragments of examples in an example base;
  
  identifying all matched blocks in the SL input as blocks of terms in the SL input that are matched by one or more SL fragments in an example;
  
  selecting block combinations of the matched blocks to cover one or more fragments of the SL input;
  
  for each block in the selected block combinations calculating a block score corresponding to each example containing the block, and, identifying an example associated with the block based on the block score;
  
  aligning TL portions of the identified example with SL portions of the identified example that match the one or more fragments of the SL input;
  
  providing the translation output as a plurality of possible translation outputs based on the aligned portions; and
  
  calculating a confidence measure for each translation output, as a translation confidence level, as follows:
  
  $ConL = c_{1} \times \log(AlignCon \times 10) + c_{2} \times \log(TransPercent \times 10) + c_{3} \times \log(10 / Example\_num) + c_{4} \times \log(10 / Valid\_block\_num)$
  
  $AlignCon = \sum_{w_{i} \in PhrSL,\ w_{j} \in PhrTL,\ i \text{ and } j \text{ connected}} Conf(C_{ij}) / |PhrTL|$
  
  $(0 \leq AlignCon \leq 1,\ 0 \leq TransPercent \leq 1,\ \sum_{i = 1}^{4} c_{i} = 1)$
  
  where ConL is the translation confidence level;
  
  c_1, c_2, ..., c_4 are constants;
  
  AlignCon is an alignment confidence level;
  
  TransPercent is a weighted translation percentage;
  
  Example_num is an employed example number identifying the identified example;
  
  Valid_block_num is a fragment number in a possible TL translation under consideration;
  
  PhrSL is a SL phrase that relates to a given input string;
  
  PhrTL is a TL correspondence in the possible translation of the SL input;
  
  |PhrTL| is a word number of PhrTL;
  
  C_ij is a connection between SL word i and TL word j; and
  
  Conf(C_ij) is the translation confidence level of word alignment.
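
The confidence level of claim 13 is straightforward to compute once its inputs are known. The sketch below is a direct transcription of the formula under two stated assumptions: the logarithm is taken base 10 (the claim does not name a base) and the constants c_1 to c_4 default to 0.25 each. The function names and the dictionary layout for word connections are hypothetical.

```python
import math

def align_con(connections, phr_tl_len):
    """AlignCon: summed Conf(C_ij) over connected word pairs, divided by |PhrTL|.

    connections: {(sl_index, tl_index): confidence} for connected word pairs.
    """
    return sum(connections.values()) / phr_tl_len

def confidence_level(align, trans_percent, example_num, valid_block_num,
                     c=(0.25, 0.25, 0.25, 0.25)):
    """ConL per the formula in claim 13; base-10 log and equal c_i are assumptions.

    align and trans_percent must be in (0, 1]; a value of 0 has no defined logarithm.
    """
    if not math.isclose(sum(c), 1.0):
        raise ValueError("c_1..c_4 must sum to 1")
    c1, c2, c3, c4 = c
    return (c1 * math.log10(align * 10)
            + c2 * math.log10(trans_percent * 10)
            + c3 * math.log10(10 / example_num)
            + c4 * math.log10(10 / valid_block_num))

# Hypothetical word connections for a three-word TL phrase.
connections = {(0, 0): 0.9, (1, 2): 0.8, (2, 1): 0.7}
align = align_con(connections, phr_tl_len=3)
con_l = confidence_level(align, trans_percent=1.0, example_num=1, valid_block_num=1)
```

Under these assumptions a fully aligned, fully translated output built from a single example and a single block scores exactly 1, and the score drops as more examples or fragments are needed.
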
14. The method of claim 13 and further comprising:
- identifying portions of the translation output that require a user's attention.
15. The method of claim 13 wherein matching fragments of the SL input to fragments of examples comprises:
- identifying bi-terms in the SL input; and
  
  accessing a bi-term index of the example base that includes example identifiers identifying examples that contain indexed bi-terms.
16. The method of claim 15 wherein accessing a bi-term index comprises:
- accessing a bi-term index of the example base that includes word position information indicative of a word position in the example where the bi-term resides.
17. The method of claim 16 wherein accessing a bi-term index comprises:
- accessing a bi-term index of the example base that includes a score indicative of a term frequency/inverse document frequency (TF/IDF) score for the bi-term in the example.
18. The method of claim 17 wherein accessing a bi-term index comprises:
- accessing a bi-term index of the example base that includes a corpus score indicative of a representative TF/IDF score for the bi-term across the example base.
19. The method of claim 13 wherein aligning TL portions of the example with the SL portions comprises:
- performing word alignment to identify anchor alignment points between the SL portion and the TL portion of the example;
  
  finding all continuous alignments between the TL portion and the SL portion based on the anchor alignment points; and
  
  finding all non-continuous alignments between the TL portion and the SL portion based on the anchor alignment points.
20. The method of claim 13 wherein selecting block combinations comprises:
- calculating a block combination score for different combinations of the identified blocks; and
  
  identifying N best block combinations based on the block combination scores.
21. The method of claim 20 wherein calculating a block combination score comprises:

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Corporation
Inventors
Huang, Chang Ning (Tom), Huang, Jin-Xia, Wang, Wei, Zhou, Ming
Primary Examiner(s)
Edouard; Patrick N.
Assistant Examiner(s)
Wozniak; James S.

Application Number

US10/185,376
Publication Number

US 20040002848A1
Time in Patent Office

2,104 Days
Field of Search

704/2, 704/1, 704/4-5
US Class Current

704/5
CPC Class Codes

G06F 40/45 Example-based machine trans...
