STATISTICAL MACHINE TRANSLATION SYSTEM AND METHOD FOR TRANSLATION OF TEXT INTO LANGUAGES WHICH PRODUCE CLOSED COMPOUND WORDS

US 20110178791A1
Filed: 01/20/2010
Published: 07/21/2011
Est. Priority Date: 01/20/2010
Status: Active Grant

Abstract

A translation system and method for translating source text from a first language to target text in a second language are disclosed. A library of bi-phrases is accessed to retrieve bi-phrases which each match a part of the source text. Each of the bi-phrases includes respective text fragments from the first and second language. Words of some (or all) of the bi-phrases are tagged with restricted part of speech (RPOS) tags. At least one of the RPOS tags is configured for identifying a word from the second language as being one which also forms a part of a closed compound word in the library. At least one target hypothesis is generated from the bi-phrases, which includes text fragments in the second language. The target hypothesis or hypotheses are evaluated, based at least in part on combinations of the restricted part of speech tags. Based on the evaluation, one of the at least one target hypothesis is output as the optimal hypothesis for forming the translation.
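The retrieval and hypothesis-generation steps described in the abstract can be pictured with a minimal sketch. Everything in it is an illustrative assumption rather than the patented implementation: the `LIBRARY` contents, the Swedish example words, the function names, and the simplification that source fragments are combined in their original order.

```python
from itertools import product

# Hypothetical bi-phrase library: source fragment -> candidate target fragments.
# Every target word carries an RPOS tag:
#   NP = word seen as a non-head part of a known closed noun compound
#   N  = compound head or any other noun
#   X  = any other word
LIBRARY = {
    "traffic": [[("trafik", "NP")], [("trafik", "N")]],
    "accident": [[("olycka", "N")]],
}

def hypotheses(source_fragments):
    """Yield target hypotheses as flat lists of (word, tag) pairs.

    Assumes monotone combination: one candidate per source fragment,
    kept in source order.
    """
    options = [LIBRARY[fragment] for fragment in source_fragments]
    for combo in product(*options):
        yield [pair for fragment in combo for pair in fragment]

def tag_sequence(hypothesis):
    """Return only the RPOS tags of a hypothesis."""
    return [tag for _, tag in hypothesis]

hyps = list(hypotheses(["traffic", "accident"]))
```

Each hypothesis keeps its tag sequence available for the evaluation step, which scores tag combinations rather than surface words.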

25 Claims

1. A machine translation method for translating source text from a first language to target text in a second language, comprising:
- receiving the source text in the first language;
  
  accessing a library of bi-phrases, each of the bi-phrases including a text fragment from the first language and a text fragment from the second language, at least some of the bi-phrases comprising words tagged with restricted part of speech tags, at least one of the restricted part of speech tags configured for identifying a word from the second language as being one which also forms a part of a known closed compound word;
  
  retrieving text fragments in the second language from the library corresponding to text fragments in the source text;
  
  generating at least one target hypothesis, each of the target hypotheses comprising text fragments selected from the retrieved text fragments in the second language; and
  
  evaluating the target hypothesis based at least in part on combinations of restricted part of speech tags; and
  
  based on the evaluation, outputting one of the at least one target hypothesis as the optimal hypothesis for forming the translation.
- Dependent claims 2 to 22:
  - 2. The method of claim 1, wherein a first of the restricted part of speech tags identifies a) a word which appears in a known closed compound word in other than a head position and a second of the restricted part of speech tags identifies at least one of b) a word which appears in a known closed compound word in the head position and c) another word, where the words a), and at least one of b) and c) identified by the first and second tags are all the same part of speech.
  - 3. The method of claim 1, wherein the restricted part of speech tags comprise at least one of the following sets:
    - a) a first set comprising a first tag (NP) which identifies a word which appears in a known closed noun compound word in other than a head position and a second tag (N) which identifies at least one of a word which appears in a known closed noun compound word in the head position and another noun;
      
      b) a second set comprising a first tag (VP) which identifies a word which appears in a known closed verb compound word in other than a head position and a second tag (V) which identifies at least one of a word which appears in a known closed verb compound word in the head position and another verb;
      
      c) a third set comprising a first tag (AP) which identifies a word which appears in a known closed adjective compound word in other than a head position and a second tag (A) which identifies at least one of a word which appears in a known closed adjective compound word in the head position and another adjective; and
      
      d) a fourth set comprising a first tag (AdP) which identifies a word which appears in a known closed adverb compound word in other than a head position and a second tag (Ad) which identifies at least one of a word which appears in a known closed adverb compound word in the head position and another adverb.
  - 4. The method of claim 3, wherein the restricted part of speech tags further comprise at least one of a tag which identifies all other words than those in a selected one of the four sets and a tag which denotes the end of a sentence.
  - 5. The method of claim 3, wherein the evaluation includes counting occurrences of a combination of at least one first tag followed by a second tag of the same set.
  - 6. The method of claim 5, wherein the evaluation further includes counting occurrences of a combination of two first tags of the same set.
  - 7. The method of claim 4, wherein the restricted part of speech tags comprise a tag which denotes the end of a sentence.
  - 8. The method of claim 7, wherein the evaluation includes counting occurrences of a combination of a first tag with the tag which denotes the end of a sentence.
  - 9. The method of claim 1, wherein the evaluating of the target hypothesis comprises evaluating the target hypothesis with a translation scoring function which scores the target hypothesis according to a plurality of feature functions, at least one of the feature functions taking into account the restricted part of speech tags.
  - 10. The method of claim 9, wherein the at least one feature function which takes into account the restricted part of speech tags scores at least one specified combination of consecutive restricted part of speech tags in the target hypothesis differently from another combination of consecutive restricted part of speech tags.
  - 11. The method of claim 9, wherein the evaluation of the hypothesis includes, for each of the at least one specified combination of consecutive restricted part of speech tags, identifying occurrences of the specified combination in the hypothesis and computing a score for the hypothesis based on the number of occurrences.
  - 12. The method of claim 9, wherein at least one feature function is a boost function which favors occurrences of at least one specified combination of consecutive restricted part of speech tags which enables generation of a closed compound.
  - 13. The method of claim 12, wherein the boost feature function scores a target hypothesis based on:
    - a) a number of occurrences of a combination NP-N, and
      
      b) a number of occurrences of a combination NP-NP,
      
      where:
      
      NP represents a word which appears in a known closed noun compound word in other than a head position, and
      
      N represents at least one of a word which appears in a known closed noun compound word in the head position and another noun.
  - 14. The method of claim 9, wherein at least one feature function is a punish function which penalizes occurrences of at least one specified combination of consecutive restricted part of speech tags which limits generation of a closed compound.
  - 15. The method of claim 14, wherein the punish feature function scores a target hypothesis based on at least one of:
    - a) a number of occurrences of a combination NP-X, and
      
      b) a number of occurrences of a combination NP-<\s>,
      
      where:
      
      NP represents a word which appears in a known closed noun compound word in other than a head position,
      
      X represents a word which is not a noun, and
      
      <\s> represents the end of a sentence.
  - 16. The method of claim 9, wherein the translation scoring function comprises a log-linear translation scoring function in which weights are assigned to each of the feature functions and wherein the evaluation of the at least one hypothesis includes selecting a hypothesis from a plurality of hypotheses which optimizes the log-linear translation scoring function.
  - 17. The method of claim 16, wherein the log-linear translation scoring function outputs a probability of a target sentence e and a hidden alignment variable a given a source sentence f of the general form:
  - 18. The method of claim 16, wherein the translation scoring function outputs a translation for which the log-linear scoring function is optimized.
  - 19. The method of claim 16, wherein one of the feature functions comprises a language model which treats a target hypothesis as a sequence of restricted part of speech tags and computes a probability of the target hypothesis based on the conditional probabilities of subsequences of its restricted part of speech tags, the conditional probabilities being determined for subsequences of a fixed length from a training corpus of text in the second language.
  - 20. The method of claim 1, wherein the only restricted part of speech tags used in the evaluation are:
    - N, NP, and X, or
      
      N, NP, X, and <\s>.
  - 21. A machine translation system for translating source text from a first language to target text in a second language, comprising:
    - memory which stores instructions for performing the method of claim 1; and
      
      a processor which executes the instructions.
  - 22. A computer program product comprising tangible media encoding instructions which, when executed by a computer, perform the method of claim 1.
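The boost and punish features of claims 12 to 15 and the weighted scoring of claim 16 can be sketched as follows. This is a minimal illustration under stated assumptions: the weights `w_boost` and `w_punish` are invented for the example, the function names are hypothetical, and a real scoring function would combine these counts with other feature functions not modelled here.

```python
END = "<\\s>"  # end-of-sentence tag, written <\s> in the claims

def tag_bigrams(tags):
    """Consecutive tag pairs, with the end-of-sentence tag appended."""
    seq = list(tags) + [END]
    return list(zip(seq, seq[1:]))

def boost_count(tags):
    """Count NP-N and NP-NP pairs, which allow a closed compound to form."""
    return sum(1 for a, b in tag_bigrams(tags) if a == "NP" and b in ("N", "NP"))

def punish_count(tags):
    """Count NP-X and NP-<\\s> pairs, which leave a compound part stranded."""
    return sum(1 for a, b in tag_bigrams(tags) if a == "NP" and b in ("X", END))

def score(tags, w_boost=1.0, w_punish=-1.0):
    """Weighted sum of the two counts; the weights are assumed, not tuned."""
    return w_boost * boost_count(tags) + w_punish * punish_count(tags)

def best_hypothesis(tag_sequences):
    """Pick the tag sequence with the highest score."""
    return max(tag_sequences, key=score)
```

For instance, `best_hypothesis([["NP", "X", "N"], ["NP", "N", "X"]])` prefers the second sequence, because its NP is immediately followed by N.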

23. A machine translation system for translating source text from a first language to target text in a second language, comprising:
- memory which stores a library of bi-phrases, each of the bi-phrases including a text fragment from the first language and a text fragment from the second language, words of at least some of the bi-phrases being tagged with restricted part of speech tags, at least one of the restricted part of speech tags configured for identifying a word of a text fragment from the second language as being one which also forms a part of a known closed compound word; and
  
  a processor which executes instructions stored in memory for retrieving text fragments in the second language from the library which correspond to text fragments in the source text, generating at least one target hypothesis, each of the target hypotheses comprising text fragments selected from the retrieved fragments in the second language, evaluating each of the target hypotheses with a translation scoring function which scores the hypothesis according to a plurality of features, at least one of the features comprising a feature which favors hypotheses comprising consecutive text fragments with restricted part of speech tags which indicate that the consecutive text fragments are ordered for forming a closed compound word, and, based on the evaluation, outputting a translation based on one of the target hypotheses.
- Dependent claim 24:
  - 24. The system of claim 23, further comprising a display for displaying the output translation.

25. A machine translation method for translating source text from a first language to target text in a second language, comprising:
- receiving the source text in the first language;
  
  accessing a library of bi-phrases, each of the bi-phrases including a text fragment from the first language and a text fragment from the second language, at least some of the bi-phrases being tagged with restricted part of speech tags, the restricted part of speech tags including an NP tag which identifies a text fragment from the second language as being one which also forms a part of a known closed noun compound word other than in a head position of the closed noun compound word and an N tag which identifies at least one of a text fragment which appears in a closed noun compound word in the head position and another noun;
  
  retrieving text fragments from the second language from the library corresponding to text fragments in the source text;
  
  generating at least one target hypothesis, each of said target hypotheses comprising text fragments selected from the second language; and
  
  evaluating the target hypothesis based at least in part on combinations of restricted part of speech tags, the evaluating including at least one of:
  
  a) counting at least one of i) occurrences of combinations of NP-N and NP-NP which favor formation of closed compound words and ii) occurrences of NP immediately followed by a restricted part of speech tag other than N or NP, which disfavor formation of closed compound words,
  
  b) retrieving conditional probabilities of occurrence for subsequences of restricted part of speech tags in the target hypothesis and computing a combined probability based thereon; and
  
  based on the evaluation, outputting one of the at least one target hypothesis as the optimal hypothesis for forming the translation.
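Option b) of claim 25, like claim 19, treats a hypothesis as a sequence of restricted part of speech tags and combines conditional probabilities of fixed-length subsequences. A minimal sketch with subsequences of length two is given below. The start tag `<s>`, the add-alpha smoothing, the tag inventory size, and the tiny training set are all assumptions made for illustration; the claims do not specify them.

```python
from collections import Counter
import math

START, END = "<s>", "<\\s>"  # START is an assumed sentence-start marker

def train_bigram_lm(tag_sentences):
    """Count tag bigrams and their left contexts in a tagged target corpus."""
    context, pairs = Counter(), Counter()
    for tags in tag_sentences:
        seq = [START] + list(tags) + [END]
        for a, b in zip(seq, seq[1:]):
            context[a] += 1
            pairs[(a, b)] += 1
    return context, pairs

def log_prob(tags, model, vocab_size=4, alpha=1.0):
    """Sum of log conditional probabilities P(tag_i | tag_{i-1}).

    Uses add-alpha smoothing over an assumed inventory of four
    outcomes (N, NP, X, end of sentence).
    """
    context, pairs = model
    seq = [START] + list(tags) + [END]
    total = 0.0
    for a, b in zip(seq, seq[1:]):
        p = (pairs[(a, b)] + alpha) / (context[a] + alpha * vocab_size)
        total += math.log(p)
    return total

# Toy training corpus of RPOS tag sequences (invented for the example).
model = train_bigram_lm([["NP", "N", "X"], ["X", "NP", "N"], ["X", "N"]])
```

Because NP is always followed by N in the toy corpus, `log_prob(["NP", "N"], model)` comes out higher than `log_prob(["NP", "X"], model)`, which is the behaviour the tag-sequence model is meant to provide.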

Current Assignee: Xerox Corporation (Xerox Holdings Corp.)
Original Assignee: Xerox Corporation (Xerox Holdings Corp.)
Inventors: Sara Stymne, Tamás Gaál, Nicola Cancedda
Granted Patent: US 8,548,796 B2
US Class Current: 704/2
CPC Class Codes:
G06F 40/268 Morphological analysis
G06F 40/44 Statistical methods, e.g. p...
G06F 40/45 Example-based machine trans...
