Learning translation relationships among words
Abstract
A parallel bilingual training corpus is parsed into its content words. Word association scores are calculated for each pair of content words consisting of a word of language L1 that occurs in a sentence aligned in the bilingual corpus to a sentence of language L2 in which the other word occurs. A pair of words is considered “linked” in a pair of aligned sentences if one of the words is the most highly associated, of all the words in its sentence, with the other word. The occurrence of compounds is hypothesized in the training data by identifying maximal, connected sets of linked words in each pair of aligned sentences in the processed and scored training data. Whenever one of these maximal, connected sets contains more than one word in either or both of the languages, the subset of the words in that language is hypothesized as a compound.
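The linking rule and the compound hypothesis described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names `link_pairs` and `hypothesize_compounds`, the toy sentence pair, and the score values are all hypothetical. The sketch also assumes that each word occurs at most once per sentence, that both sentences are non-empty, and that unscored pairs count as zero association.

```python
from collections import defaultdict


def link_pairs(src_words, tgt_words, score):
    """Return the (src, tgt) pairs that are "linked" in one aligned sentence pair.

    A pair is linked if one word is the most highly associated, of all the
    words in its own sentence, with the other word. Pairs missing from
    `score` are treated as having zero association (an assumption).
    """
    links = set()
    for s in src_words:
        best = max(tgt_words, key=lambda t: score.get((s, t), 0.0))
        if score.get((s, best), 0.0) > 0.0:
            links.add((s, best))
    for t in tgt_words:
        best = max(src_words, key=lambda s: score.get((s, t), 0.0))
        if score.get((best, t), 0.0) > 0.0:
            links.add((best, t))
    return links


def hypothesize_compounds(src_words, tgt_words, links):
    """Find maximal connected sets of linked words and report every side of
    such a set that holds more than one word as a hypothesized compound."""
    adj = defaultdict(set)
    for s, t in links:
        adj[("src", s)].add(("tgt", t))
        adj[("tgt", t)].add(("src", s))

    seen, compounds = set(), []
    for start in list(adj):
        if start in seen:
            continue
        # Depth-first search collects one maximal connected set.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adj[node] - seen)
        # Keep sentence order when listing the members of each side.
        for side, sentence in (("src", src_words), ("tgt", tgt_words)):
            members = [w for w in sentence if (side, w) in component]
            if len(members) > 1:
                compounds.append((side, tuple(members)))
    return compounds


# Toy data: both "ice" and "cream" associate most strongly with "glace".
src = ["ice", "cream", "melts"]
tgt = ["glace", "fond"]
toy_scores = {
    ("ice", "glace"): 0.9,
    ("cream", "glace"): 0.8,
    ("melts", "fond"): 0.95,
    ("ice", "fond"): 0.1,
}
links = link_pairs(src, tgt, toy_scores)
compounds = hypothesize_compounds(src, tgt, links)
```

In this toy case the connected set containing "glace" holds two source-side words, so "ice cream" is hypothesized as a compound, while "melts" and "fond" form a one-to-one set and yield none.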
20 Claims
1. A method of calculating translation relationships among words, comprising:
calculating word association scores for word pairs based on co-occurrences of words in each of a plurality of sets of aligned, bilingual units in a corpus;
identifying hypothesized compounds in the units based on the word association scores;
re-calculating the word association scores, given the hypothesized compounds; and
obtaining translation relationships based on the re-calculated word association scores.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
15. A method of training a machine translation system, comprising:
obtaining a corpus of aligned, bilingual multi-word units;
calculating word association scores for word pairs in the corpus based on co-occurrence of words in the aligned units;
identifying hypothesized compounds based on an absence of one-to-one correspondence between words in the aligned units; and
training the machine translation system based on the word association scores and the hypothesized compounds.
Dependent claims: 16, 17, 18, 19
20. A computer-readable medium comprising computer-executable instructions which, when executed by a computer, configure the computer to:
calculate word association scores for word pairs based on co-occurrences of words in each of a plurality of sets of aligned, bilingual units in a corpus;
identify hypothesized compounds in the units based on word association scores that indicate a lack of one-to-one correspondence between words in the first unit and words in the second unit;
re-calculate the word association scores based on co-occurrences of words and hypothesized compounds; and
obtain translation relationships based on the re-calculated word association scores.
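The calculate, re-calculate, and obtain steps recited in the independent claims can be sketched as follows. The claims above do not name an association measure, so this sketch substitutes the Dice coefficient over unit-level co-occurrence counts as a stand-in. The helper names `association_scores` and `fuse`, the toy corpus, the already-identified compound, and the 0.9 threshold are all hypothetical. Fusing only contiguous occurrences of a compound is a simplification made here and is not required by the claims.

```python
from collections import Counter
from itertools import product


def association_scores(corpus):
    """Score every co-occurring (src, tgt) pair in a corpus of aligned units.

    `corpus` is a list of (src_tokens, tgt_tokens) pairs. The Dice
    coefficient used here is a stand-in measure, not the patent's.
    """
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in corpus:
        src_set, tgt_set = set(src), set(tgt)  # count each word once per unit
        src_count.update(src_set)
        tgt_count.update(tgt_set)
        pair_count.update(product(src_set, tgt_set))
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in pair_count.items()
    }


def fuse(tokens, compound):
    """Replace each contiguous occurrence of `compound` with one fused token."""
    out, i, n = [], 0, len(compound)
    while i < len(tokens):
        if tuple(tokens[i:i + n]) == compound:
            out.append("_".join(compound))
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out


corpus = [
    (["ice", "cream", "melts"], ["glace", "fond"]),
    (["ice", "cream", "tastes", "good"], ["glace", "bonne"]),
    (["snow", "melts"], ["neige", "fond"]),
]

# Step 1: initial word association scores.
scores = association_scores(corpus)

# Step 2: a hypothesized compound, taken as given for this sketch.
compound = ("ice", "cream")

# Step 3: re-calculate the scores with the compound treated as a single token.
fused_corpus = [(fuse(src, compound), tgt) for src, tgt in corpus]
rescored = association_scores(fused_corpus)

# Step 4: keep high-scoring pairs as translation relationships
# (the 0.9 cut-off is an arbitrary, illustrative value).
translations = {pair for pair, value in rescored.items() if value >= 0.9}
```

After re-calculation the separate entries for "ice" and "cream" disappear, and the fused token "ice_cream" is scored against "glace" as a unit, which is the effect the re-calculation step is meant to achieve.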
Specification