METHOD FOR ALIGNING SENTENCES AT THE WORD LEVEL ENFORCING SELECTIVE CONTIGUITY CONSTRAINTS

US 20080300857A1
Filed: 06/01/2007
Published: 12/04/2008
Est. Priority Date: 05/10/2006
Status: Active Grant

Abstract

An alignment method includes, for a source sentence in a source language, identifying whether the sentence includes at least one candidate term comprising a contiguous subsequence of words of the source sentence. A target sentence in a target language is aligned with the source sentence. This includes developing a probabilistic model which models conditional probability distributions for alignments between words of the source sentence and words of the target sentence and generating an optimal alignment based on the probabilistic model, including, where the source sentence includes the at least one candidate term, enforcing a contiguity constraint which requires that all the words of the target sentence which are aligned with an identified candidate term form a contiguous subsequence of the target sentence.
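The contiguity constraint described in the abstract can be illustrated with a small check. This is a minimal sketch under assumed conventions, not the patented implementation: the alignment is encoded as a list in which entry `j` holds the index of the source word aligned with target word `j`, and a candidate term is an inclusive `(start, end)` range of source indices. The function names and this encoding are hypothetical.

```python
from typing import List, Tuple


def aligned_target_positions(alignment: List[int], term_span: Tuple[int, int]) -> List[int]:
    """Return the target positions aligned with any source word inside term_span.

    alignment[j] is the source index aligned with target word j (assumed encoding).
    term_span is an inclusive (start, end) range of source indices.
    """
    start, end = term_span
    return [j for j, i in enumerate(alignment) if start <= i <= end]


def satisfies_contiguity(alignment: List[int], term_span: Tuple[int, int]) -> bool:
    """True if the target words aligned with the term form one unbroken run."""
    positions = aligned_target_positions(alignment, term_span)
    if not positions:
        # No target word is aligned with the term: contiguity holds vacuously.
        return True
    # Positions are in ascending order, so a run has no gaps exactly when
    # its width equals the number of positions.
    return positions[-1] - positions[0] + 1 == len(positions)
```

For instance, with a term covering source indices 1 to 2, the alignment `[0, 1, 2, 1, 3]` passes the check, while `[1, 0, 2, 3]` fails because target positions 0 and 2 are separated by a word aligned outside the term.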

149 Citations

23 Claims

1. An alignment method comprising:
- for a source sentence in a source language, identifying whether the sentence includes at least one candidate term comprising a contiguous subsequence of words of the source sentence;
  
  aligning a target sentence in a target language with the source sentence, comprising:
  
  developing a probabilistic model which models conditional probability distributions for alignments between words of the source sentence and words of the target sentence; and
  
  generating an optimal alignment based on the probabilistic model, including, where the source sentence includes the at least one candidate term, enforcing a contiguity constraint which requires that all the words of the target sentence which are aligned with an identified candidate term form a contiguous subsequence of the target sentence.
- - 2. The alignment method of claim 1, wherein the optimal alignment accepts at least some reordering of the words in the target sentence which are aligned to the words in the source sentence.
  - 3. The alignment method of claim 1, wherein the identifying of the at least one candidate term includes determining whether a contiguous sequence of words meets criteria defining a class of terms.
  - 4. The alignment method of claim 1, wherein the at least one candidate term is a noun phrase.
  - 5. The alignment method of claim 1, wherein the probabilistic model is a Hidden Markov Model.
  - 6. The alignment method of claim 5, wherein the Hidden Markov Model models a combination of emission probabilities and transition probabilities, the emission probabilities each being based on the probability that a word of the target sentence is a translation of a word of the source sentence with which it is aligned, the transition probabilities each being based on the probability that a word of the target sentence will be aligned with a word of the source sentence given that the previous word of the target sentence is aligned with a specific word of the source sentence.
  - 7. The method of claim 6, wherein in modeling the transition probabilities, the model takes into account whether a word of the target language is being aligned with one of the at least one candidate terms or with a word outside the at least one candidate term.
  - 8. The alignment method of claim 5, wherein the model favors alignments between words which maintain a monotonic alignment between the source and target sentences.
  - 9. The method of claim 1, wherein the generating of the optimal alignment includes generating a first automaton based on the probabilistic model and applying the first automaton in combination with a second automaton which enforces the contiguity constraint.
  - 10. The method of claim 9, wherein the first and second automata are weighted finite-state transducers.
  - 11. The method of claim 9, wherein the generating of the optimal alignment includes repeating the following steps until an alignment respecting the contiguity constraint is produced:
    - applying a first automaton to produce an alignment between the target sentence and the source sentence;
      
      determining whether the alignment meets the contiguity constraint, and, where the alignment does not meet the contiguity constraint, identifying those of the at least one candidate terms for which the contiguity constraint is not met; and
      
      generating a second automaton to enforce the contiguity constraint for those candidate terms for which the contiguity constraint is not met.
  - 12. The method of claim 1, wherein the generating of the optimal alignment includes:
    - applying a constraint which requires that in the alignment, for each of the identified candidate terms, at least one word of the target sentence is aligned to the candidate term.
  - 13. The method of claim 12, wherein the generating of the optimal alignment includes:
    - applying a constraint which requires that in the alignment, for each of the identified candidate terms, the number of words of the target sentence aligned to the term at least equals the number of source words in the candidate term.
  - 14. The method of claim 12, further comprising, prior to the aligning of the target sentence with the source sentence, identifying compound words in at least one of the source and target sentences and decomposing the identified compound words.
  - 15. The method of claim 1, wherein the generating of the optimal alignment includes:
    - applying a constraint which limits reordering to local reordering.
  - 16. The method of claim 1, further comprising outputting the words of the target sentence which are aligned with the at least one candidate term.
  - 17. The method of claim 16, further comprising adding the candidate term and output words of the target sentence to a bilingual terminological lexicon.
  - 18. The method of claim 1, wherein the method operates without identification of terms in the target sentence.
  - 19. A computer program product which, when executed on a computer, performs the method of claim 1.
  - 20. A system which includes processing components which are configured for performing the method of claim 1.
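Claims 5 through 8 describe a Hidden Markov Model whose hidden states are source positions and whose observations are target words. The sketch below shows plain Viterbi decoding for such a model under stated assumptions: the tables `emission` (keyed by target word and source word) and `jump_prob` (keyed by the difference between consecutive source positions), the uniform initial distribution, and the `floor` smoothing value are all hypothetical. It does not enforce the contiguity constraint, which the claims add on top of the model (for example through a second automaton in claims 9 to 11).

```python
import math
from typing import Dict, List, Tuple


def viterbi_align(
    source: List[str],
    target: List[str],
    emission: Dict[Tuple[str, str], float],
    jump_prob: Dict[int, float],
    floor: float = 1e-6,
) -> List[int]:
    """Return, for each target word, the index of its most likely source word.

    emission[(t, s)] approximates P(target word t | source word s).
    jump_prob[d] approximates P(next source index - previous source index = d).
    Missing entries fall back to `floor` (an assumed smoothing choice).
    """
    n = len(source)

    def emit(j: int, i: int) -> float:
        return math.log(emission.get((target[j], source[i]), floor))

    def trans(prev: int, cur: int) -> float:
        return math.log(jump_prob.get(cur - prev, floor))

    # Uniform initial distribution over source positions (assumption).
    score = [emit(0, i) - math.log(n) for i in range(n)]
    back: List[List[int]] = []

    for j in range(1, len(target)):
        new_score, pointers = [], []
        for i in range(n):
            best_prev = max(range(n), key=lambda p: score[p] + trans(p, i))
            new_score.append(score[best_prev] + trans(best_prev, i) + emit(j, i))
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)

    # Follow back-pointers from the best final state.
    state = max(range(n), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

With a jump table that gives the highest probability to a step of +1, the model leans toward monotonic alignments in the sense of claim 8, while still allowing the reordering mentioned in claim 2 when the emission scores support it.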

21. A system comprising:
- a sentence aligner which aligns sentences of a target document in a target language with respective sentences of a source document in a source language;
  
  a source term tagger which tags terms of each source sentence which meet criteria for at least one class of candidate terms, each of the candidate terms comprising a contiguous subsequence of words of the source sentence;
  
  a word aligner which, for a pair of sentences aligned by the sentence aligner, generates an alignment between the words of a target sentence and the words of the source sentence, the word aligner using a probabilistic model which models conditional probability distributions for alignments between words of the source sentence and words of the target sentence and generating an optimal alignment based on the probabilistic model, the word aligner enforcing a contiguity constraint which requires that all the words of the target sentence which are aligned with one of the candidate terms identified by the term tagger form a contiguous subsequence of the target sentence.
- - 22. The system of claim 21, further comprising:
    - a term extractor which extracts contiguous subsequences of the target sentences that are aligned to a common source candidate term; and
      
      a filter which filters the extracted contiguous subsequences to remove contiguous subsequences which are less probable translations of the common candidate term.

23. A method of generating a terminological lexicon comprising:
- providing a parallel corpus, the corpus comprising source sentences in a source language and target sentences in a target language;
  
  providing for the identifying of noun phrases in the source sentences but not in the target sentences;
  
  for each of a plurality of source sentences, generating an alignment in which words of a respective target sentence are aligned with words of the source sentence, whereby words of the target sentence which are aligned with a selected noun phrase are identified, wherein in generating the alignment, a contiguity constraint is enforced which requires that all the words of the target sentence which are aligned with the selected noun phrase form a contiguous subsequence of words of the target sentence;
  
  optionally, where a plurality of contiguous subsequences of aligned target sentences are aligned with a common noun phrase, filtering the contiguous subsequences to remove contiguous subsequences which are less probable translations of the noun phrase; and
  
  incorporating the noun phrase together with at least one identified contiguous sequence which has been aligned with the noun phrase in a terminological lexicon.
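The lexicon-building steps of claim 23 can be sketched as follows. This is an illustrative outline only: the claim does not say how "less probable" translations are identified, so the relative-frequency threshold `min_share` is an assumption, as are the function names and the alignment encoding (entry `j` of `alignment` is the source index aligned with target word `j`; a noun phrase is an inclusive range of source indices).

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple

Example = Tuple[List[str], List[str], List[int], Tuple[int, int]]


def extract_term_translation(
    target: List[str], alignment: List[int], term_span: Tuple[int, int]
) -> Optional[str]:
    """Return the contiguous target words aligned with the term, or None."""
    start, end = term_span
    positions = [j for j, i in enumerate(alignment) if start <= i <= end]
    if not positions or positions[-1] - positions[0] + 1 != len(positions):
        return None  # nothing aligned, or the aligned words are not contiguous
    return " ".join(target[positions[0]:positions[-1] + 1])


def build_lexicon(examples: List[Example], min_share: float = 0.5) -> Dict[str, List[str]]:
    """Collect candidate translations per source term and keep the frequent ones.

    A translation is kept when it accounts for at least `min_share` of all
    translations observed for its term (assumed filtering criterion).
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for source, target, alignment, term_span in examples:
        term = " ".join(source[term_span[0]:term_span[1] + 1])
        translation = extract_term_translation(target, alignment, term_span)
        if translation is not None:
            counts[term][translation] += 1

    lexicon: Dict[str, List[str]] = {}
    for term, counter in counts.items():
        total = sum(counter.values())
        kept = [t for t, c in counter.most_common() if c / total >= min_share]
        if kept:
            lexicon[term] = kept
    return lexicon
```

Because noun phrases are identified only on the source side, the target-side term boundaries come entirely from the alignment, which is why the contiguity constraint matters for producing usable lexicon entries.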

Current Assignee: Xerox Corporation (Xerox Holdings Corp.)
Original Assignee: Xerox Corporation (Xerox Holdings Corp.)
Inventors: Cancedda, Nicola; Dance, Christopher R.; Gaussier, Eric; Fazekas, Szilard Zsolt; Barbaiani, Madalina; Gaal, Tamas

Granted Patent: US 9,020,804 B2
Field of Search
US Class Current: 704/4
CPC Class Codes: G06F 40/45 Example-based machine trans...
