Apparatus and methods for aligning words in bilingual sentences

US 20060190241A1
Filed: 05/26/2005
Published: 08/24/2006
Est. Priority Date: 02/22/2005
Status: Active Grant

Abstract

Methods are disclosed for performing proper word alignment that satisfy constraints of coverage and transitive closure. Initially, a translation matrix which defines word association measures between source and target words of a corpus of bilingual translations of source and target sentences is computed. Subsequently, in a first method, the association measures in the translation matrix are factorized and orthogonalized to produce cepts for the source and target words, which resulting matrix factors may then be, optionally, multiplied to produce an alignment matrix. In a second method, the association measures in the translation matrix are thresholded, and then closed by transitivity, to produce an alignment matrix, which may then be, optionally, factorized to produce cepts. The resulting cepts or alignment matrices may then be used by any number of natural language applications for identifying words that are properly aligned.
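The second method in the abstract (threshold, then close by transitivity) can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the threshold `tau`, and the toy matrix are hypothetical and do not come from the patent, and the closure is computed by a simple fixed-point loop rather than any particular procedure the specification may describe.

```python
def threshold(M, tau):
    """Binarize the translation matrix: keep a link when m_ij >= tau (tau is an assumed parameter)."""
    return [[1 if m >= tau else 0 for m in row] for row in M]


def close_by_transitivity(A):
    """Add links until: f_i~e_j, f_i~e_l and f_k~e_l together imply f_k~e_j."""
    n_src, n_tgt = len(A), len(A[0])
    A = [row[:] for row in A]  # work on a copy
    changed = True
    while changed:
        changed = False
        for i in range(n_src):
            for k in range(n_src):
                # Source words i and k share at least one target word...
                if i != k and any(A[i][j] and A[k][j] for j in range(n_tgt)):
                    # ...so k must be linked to every target word linked to i.
                    for j in range(n_tgt):
                        if A[i][j] and not A[k][j]:
                            A[k][j] = 1
                            changed = True
    return A


# Toy translation matrix (rows: source words, columns: target words).
M = [[0.9, 0.1, 0.0],
     [0.6, 0.7, 0.0],
     [0.0, 0.1, 0.8]]
A = close_by_transitivity(threshold(M, 0.5))
print(A)  # [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
```

Thresholding alone does not guarantee coverage: a word whose association measures all fall below `tau` ends up with no link, and this sketch does not handle that case.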


21 Claims

1. A method for aligning words of natural language sentences, comprising:
- receiving a corpus of aligned source sentences f = f₁ … f_i … f_I composed of source words f₁, …, f_I, and target sentences e = e₁ … e_j … e_J composed of target words e₁, …, e_J;
  
  the source sentences being in a first natural language and the target sentences being in a second natural language;
  
  producing a translation matrix M with association measures m_ij;
  
  each association measure m_ij in the translation matrix providing a valuation of association strength between each source word f_i and each target word e_j;
  
  producing one or more of an alignment matrix A and cepts that link aligned source and target words;
  
  the alignment matrix and cepts defining a proper N:M alignment between source and target words by satisfying coverage and transitive closure;
  
  wherein coverage is satisfied when each source word is aligned with at least one target word and each target word is aligned to at least one source word; and
  
  wherein transitive closure is satisfied if, when source word f_i is aligned to target words e_j and e_l, and source word f_k is aligned to target word e_l, then source word f_k is also aligned to target word e_j.
- - 2. The method according to claim 1, wherein said producing the cepts further comprises:
    - factorizing the translation matrix into factor matrices; and
      
      orthogonalizing the factor matrices.
  - 3. The method according to claim 2, wherein said producing the alignment matrix further comprises computing the alignment matrix by multiplying the orthogonalized factor matrices.
  - 4. The method according to claim 3, wherein said producing the translation matrix further comprises:
    - training a first statistical machine translation model to translate from the source language to the target language;
      
      training a second statistical machine translation model to translate from the target language to the source language;
      
      generating a set of translations by computing hypothetical translation paths between translation pairs in the sentence aligned corpus using the first and second translation models;
      
      summing the occurrence of each word-pair that appears in each translation path generated using the first and second translation models to define the association measures m_ij in the translation matrix.
  - 5. The method according to claim 4, further comprising outputting the one or more of the alignment matrix and the cepts to a natural language application.
  - 6. The method according to claim 5, wherein the natural language application comprises one or more of a combination of:
    - multilingual information retrieval, bilingual terminology extraction, machine translation, projection of linguistic features, annotation, cross-lingual categorization, and multilingual lexical extraction.
  - 7. The method according to claim 1, wherein said producing the alignment matrix further comprises:
    - thresholding the translation matrix; and
      
      closing the thresholded matrix by transitivity.
  - 8. The method according to claim 7, wherein said producing the alignment matrix further comprises computing word-to-cept alignments from the proper alignment matrix by factorizing the alignment matrix into word-to-cept alignments.
  - 9. The method according to claim 8, wherein said factorizing the alignment matrix further comprises:
    - determining how many cepts correspond to the alignment matrix; and
      
      finding which words align with which cept.
  - 10. The method according to claim 9, wherein said producing the translation matrix further comprises:
    - training a first statistical machine translation model to translate from the source language to the target language;
      
      training a second statistical machine translation model to translate from the target language to the source language;
      
      generating a set of hypothetical translation paths between translation pairs in the sentence aligned corpus using the first and second translation models;
      
      summing the occurrence of each word-pair that appears in each translation path generated using the first and second translation models to define the association measures m_ij in the translation matrix.
  - 11. The method according to claim 10, further comprising outputting the one or more of the alignment matrix and the cepts to a natural language application.
  - 12. The method according to claim 11, wherein the natural language application comprises one or more of a combination of:
    - multilingual information retrieval, bilingual terminology extraction, machine translation, projection of linguistic features, annotation, cross-lingual categorization, and multilingual lexical extraction.
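Claims 2 and 3 describe orthogonalizing factor matrices and multiplying them to obtain the alignment matrix. The sketch below shows one possible reading, with assumptions labeled: the non-negative factors `F` (source words by cepts) and `E` (target words by cepts) are taken as given, since the factorization step itself is not shown, and "orthogonalizing" is interpreted here as keeping only each word's strongest cept. All names and values are hypothetical.

```python
def orthogonalize(W):
    """Assumed interpretation: keep, for each word (row), only its strongest cept (column)."""
    out = []
    for row in W:
        best = max(range(len(row)), key=row.__getitem__)  # index of the largest weight
        out.append([1 if c == best else 0 for c in range(len(row))])
    return out


def multiply(F, E):
    """A = F x E^T: a_ij is 1 exactly when source word i and target word j share a cept."""
    return [[sum(f * e for f, e in zip(f_row, e_row)) for e_row in E] for f_row in F]


# Hypothetical factor matrices for 3 source words, 3 target words, and 2 cepts.
F = [[0.8, 0.2], [0.7, 0.3], [0.1, 0.9]]
E = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]

A = multiply(orthogonalize(F), orthogonalize(E))
print(A)  # [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
```

Because every word belongs to exactly one cept after this step, the product has a block structure and satisfies transitive closure by construction. Coverage additionally requires that every cept contain at least one source word and one target word, which this sketch does not enforce.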

13. An apparatus for aligning words of natural language sentences, comprising:
- a memory for storing natural language processing instructions of the apparatus; and
  
  a processor coupled to the memory for executing the natural language processing instructions of the apparatus;
  
  the processor in executing the natural language processing instructions;
  
  receiving a corpus of aligned source sentences f = f₁ … f_i … f_I composed of source words f₁, …, f_I, and target sentences e = e₁ … e_j … e_J composed of target words e₁, …, e_J;
  
  the source sentences being in a first natural language and the target sentences being in a second natural language;
  
  producing a translation matrix M with association measures m_ij;
  
  each association measure m_ij in the translation matrix providing a valuation of association strength between each source word f_i and each target word e_j;
  
  producing one or more of an alignment matrix A and cepts that link aligned source and target words;
  
  the alignment matrix and cepts defining a proper N:M alignment between source and target words by satisfying coverage and transitive closure;
  
  wherein coverage is satisfied when each source word is aligned with at least one target word and each target word is aligned to at least one source word; and
  
  wherein transitive closure is satisfied if, when source word f_i is aligned to target words e_j and e_l, and source word f_k is aligned to target word e_l, then source word f_k is also aligned to target word e_j.
- - 14. The apparatus according to claim 13, wherein the processor in executing the natural language processing instructions further comprises outputting the one or more of the alignment matrix and the cepts to a natural language application.
  - 15. The apparatus according to claim 14, wherein the natural language application comprises one or more of a combination of:
    - multilingual information retrieval, bilingual terminology extraction, machine translation, projection of linguistic features, annotation, cross-lingual categorization, and multilingual lexical extraction.
  - 16. The apparatus according to claim 13, wherein the processor in executing the natural language processing instructions when producing the cepts further comprises:
    - factorizing the translation matrix into factor matrices; and
      
      orthogonalizing the factor matrices.
  - 17. The apparatus according to claim 16, wherein the processor in executing the natural language processing instructions when producing the alignment matrix further comprises computing the alignment matrix by multiplying the orthogonalized factor matrices.
  - 18. The apparatus according to claim 13, wherein the processor in executing the natural language processing instructions when producing the alignment matrix further comprises:
    - thresholding the translation matrix; and
      
      closing the thresholded matrix by transitivity.
  - 19. The apparatus according to claim 18, wherein the processor in executing the natural language processing instructions when producing the alignment matrix further comprises computing word-to-cept alignments from the proper alignment matrix by factorizing the alignment matrix into word-to-cept alignments.
  - 20. The apparatus according to claim 19, wherein the processor in executing the natural language processing instructions when factorizing the alignment matrix further comprises:
    - determining how many cepts correspond to the alignment matrix; and
      
      finding which words align with which cept.
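Claims 8-9 and 19-20 describe factorizing a proper alignment matrix into word-to-cept alignments by determining how many cepts there are and which words belong to each. A minimal sketch, assuming the input matrix is already proper (so source words with overlapping links have identical rows), is to group rows by their link pattern. The function name and the output layout are hypothetical.

```python
def cepts_from_alignment(A):
    """Group words into cepts: source words with identical link patterns share a cept.

    Assumes A is a proper 0/1 alignment matrix (coverage and transitive closure hold).
    Returns a list of (source_indices, target_indices) pairs, one per cept.
    """
    groups = {}  # link pattern (tuple of target indices) -> list of source indices
    for i, row in enumerate(A):
        pattern = tuple(j for j, a in enumerate(row) if a)
        groups.setdefault(pattern, []).append(i)
    return [(sources, list(pattern)) for pattern, sources in groups.items()]


A = [[1, 1, 0],
     [1, 1, 0],
     [0, 0, 1]]
cepts = cepts_from_alignment(A)
print(len(cepts))  # 2
print(cepts)       # [([0, 1], [0, 1]), ([2], [2])]
```

The number of cepts is the number of distinct row patterns, and each pair lists the source and target word indices attached to one cept.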

21. An article of manufacture for use in a machine, comprising:
- a memory;
  
  instructions stored in the memory for performing a method for aligning words of natural language sentences, the method comprising:
  
  receiving a corpus of aligned source sentences f = f₁ … f_i … f_I composed of source words f₁, …, f_I, and target sentences e = e₁ … e_j … e_J composed of target words e₁, …, e_J;
  
  the source sentences being in a first natural language and the target sentences being in a second natural language;
  
  producing a translation matrix M with association measures m_ij;
  
  each association measure m_ij in the translation matrix providing a valuation of association strength between each source word f_i and each target word e_j;
  
  producing one or more of an alignment matrix A and cepts that link aligned source and target words;
  
  the alignment matrix and cepts defining a proper N:M alignment between source and target words by satisfying coverage and transitive closure;
  
  wherein coverage is satisfied when each source word is aligned with at least one target word and each target word is aligned to at least one source word; and
  
  wherein transitive closure is satisfied if, when source word f_i is aligned to target words e_j and e_l, and source word f_k is aligned to target word e_l, then source word f_k is also aligned to target word e_j.
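The two conditions that define a proper N:M alignment in the independent claims translate directly into checks on a 0/1 alignment matrix. The following is a sketch with hypothetical function names, where rows stand for source words and columns for target words.

```python
def satisfies_coverage(A):
    """Every source word (row) and every target word (column) has at least one link."""
    rows_ok = all(any(row) for row in A)
    cols_ok = all(any(row[j] for row in A) for j in range(len(A[0])))
    return rows_ok and cols_ok


def satisfies_transitive_closure(A):
    """If f_i is aligned to e_j and e_l, and f_k is aligned to e_l, then f_k is aligned to e_j."""
    n_src, n_tgt = len(A), len(A[0])
    for i in range(n_src):
        for k in range(n_src):
            for l in range(n_tgt):
                if A[i][l] and A[k][l]:
                    # f_i and f_k share e_l, so f_k needs every link that f_i has.
                    for j in range(n_tgt):
                        if A[i][j] and not A[k][j]:
                            return False
    return True


def is_proper(A):
    return satisfies_coverage(A) and satisfies_transitive_closure(A)


print(is_proper([[1, 1, 0], [1, 1, 0], [0, 0, 1]]))  # True
print(is_proper([[1, 0, 0], [1, 1, 0], [0, 0, 1]]))  # False: rows 0 and 1 share e_0 but not e_1
```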


Current Assignee
Xerox Corporation (Xerox Holdings Corp.)
Original Assignee
Xerox Corporation (Xerox Holdings Corp.)
Inventors
Yamada, Kenji; Goutte, Cyril; Simard, Michel; Gaussier, Eric; Mauser, Arne

Granted Patent

US 7,672,830 B2
Field of Search
US Class Current

704/2
CPC Class Codes

G06F 40/45 Example-based machine trans...
