Learning translation relationships among words
Abstract
A parallel bilingual training corpus is parsed into its content words. Word association scores are computed for each pair of content words consisting of a word of language L1 that occurs in a sentence aligned in the bilingual corpus to a sentence of language L2 in which the other word occurs. A pair of words is considered “linked” in a pair of aligned sentences if one of the words is the most highly associated, of all the words in its sentence, with the other word. The occurrence of compounds is hypothesized in the training data by identifying maximal, connected sets of linked words in each pair of aligned sentences in the processed and scored training data. Whenever one of these maximal, connected sets contains more than one word in either or both of the languages, the subset of the words in that language is hypothesized as a compound.
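The linking and compound-hypothesis steps described in the abstract can be illustrated with a minimal sketch. The abstract does not name the association measure, so the Dice coefficient over sentence-pair co-occurrence is used here purely as an assumption; the function names (`association_scores`, `linked_pairs`, `hypothesized_compounds`), the `"src"`/`"tgt"` labels, and the toy corpus are likewise hypothetical and not taken from the patent.

```python
from collections import Counter
from itertools import product


def association_scores(corpus):
    """Score each (source word, target word) pair by co-occurrence in aligned
    sentence pairs. Dice coefficient is an assumed stand-in measure."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in corpus:
        s, t = set(src), set(tgt)
        src_count.update(s)
        tgt_count.update(t)
        pair_count.update(product(s, t))
    return {(u, v): 2 * c / (src_count[u] + tgt_count[v])
            for (u, v), c in pair_count.items()}


def linked_pairs(src, tgt, scores):
    """A pair is linked if one word is the most highly associated, of all the
    words in its sentence, with the other word (ties go to the earlier word)."""
    if not src or not tgt:
        return set()
    links = set()
    for u in src:
        links.add((u, max(tgt, key=lambda v: scores.get((u, v), 0.0))))
    for v in tgt:
        links.add((max(src, key=lambda u: scores.get((u, v), 0.0)), v))
    return links


def hypothesized_compounds(src, tgt, scores):
    """Find maximal connected sets of linked words; a set holding more than
    one word of a language yields a hypothesized compound in that language."""
    adj = {}
    for u, v in linked_pairs(src, tgt, scores):
        adj.setdefault(("src", u), set()).add(("tgt", v))
        adj.setdefault(("tgt", v), set()).add(("src", u))
    seen, compounds = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first walk of one connected component
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adj[node] - component)
        seen |= component
        for side, sentence in (("src", src), ("tgt", tgt)):
            words = tuple(w for w in sentence if (side, w) in component)
            if len(words) > 1:
                compounds.append((side, words))
    return compounds


# Toy aligned corpus (hypothetical data): English source, French target.
corpus = [
    (["ice", "cream", "melts"], ["glace", "fond"]),
    (["ice", "cream"], ["glace"]),
    (["snow", "melts"], ["neige", "fond"]),
]
scores = association_scores(corpus)
print(hypothesized_compounds(*corpus[0], scores))
# [('src', ('ice', 'cream'))]
```

In the first sentence pair both "ice" and "cream" link to "glace", so the connected set contains two source words and they are hypothesized as a compound, while "melts" and "fond" form a one-to-one set and are left alone.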
179 Citations
19 Claims
1. A method of calculating translation relationships among words, comprising:
- calculating word association scores for word pairs based on co-occurrences of words in each of a plurality of sets of aligned, bilingual units in a corpus;
identifying hypothesized compounds in the units based on the word association scores;
re-calculating the word association scores, given the hypothesized compounds;
ranking the word pairs based on the re-calculated word association scores;
generating transfer mappings that map from words and hypothesized compounds in one language to words and hypothesized compounds in another language, based on the ranking of the re-calculated word association scores; and
obtaining translation relationships based on the transfer mappings.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
15. A method of training a machine translation system, comprising:
- obtaining a corpus of aligned, bilingual multi-word units;
calculating word association scores for word pairs in the corpus based on co-occurrence of words in the aligned units;
identifying hypothesized compounds based on an absence of one-to-one correspondence between words in the aligned units;
providing a rewritten corpus in which the hypothesized compounds have been replaced by fused tokens;
re-calculating the word association scores using the rewritten corpus;
generating transfer mappings that map from words and fused tokens in one language to words and fused tokens in another language, based on the selected translation relationships;
filtering the transfer mappings based on at least one of: frequency of appearance; completeness of parses of the multi-word units; or completeness of alignment of multi-word units; and
training the machine translation system based on the filtered transfer mappings.
Dependent claims: 16, 17, 18
19. A computer-readable medium comprising computer-executable instructions which, when executed by a computer, configure the computer to:
- calculate word association scores for word pairs based on co-occurrences of words in each of a plurality of sets of aligned, bilingual units in a corpus;
identify hypothesized compounds in the units based on word association scores that indicate a lack of one-to-one correspondence between words in the first unit and words in the second unit;
re-calculate the word association scores based on co-occurrences of words and hypothesized compounds;
rank the word pairs based on the re-calculated word association scores; and
generate transfer mappings that map from words and hypothesized compounds in one language to words and hypothesized compounds in another language, based on the ranking of the re-calculated word association scores.
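The corpus-rewriting step recited in claim 15, in which hypothesized compounds are replaced by fused tokens before the scores are re-calculated, might look roughly like the sketch below. The function name `fuse_compounds`, the underscore joiner, and the choice to place the fused token at the position of the compound's first word are all assumptions made for illustration; the claims do not specify them. The sketch also treats words as types, so a word repeated within one sentence is not handled specially.

```python
def fuse_compounds(sentence, compounds, joiner="_"):
    """Return a copy of `sentence` in which the words of each hypothesized
    compound are replaced by a single fused token (hypothetical scheme)."""
    fused_token = {}  # member word -> fused token that replaces it
    for compound in compounds:
        token = joiner.join(compound)
        for word in compound:
            fused_token[word] = token

    rewritten, emitted = [], set()
    for word in sentence:
        token = fused_token.get(word)
        if token is None:
            rewritten.append(word)      # ordinary word, kept as-is
        elif token not in emitted:
            rewritten.append(token)     # first member: emit the fused token
            emitted.add(token)
        # later members of an already-emitted compound are dropped
    return rewritten


# Hypothetical example: "ice" and "cream" were hypothesized as a compound.
print(fuse_compounds(["ice", "cream", "melts"], [("ice", "cream")]))
# ['ice_cream', 'melts']
```

Re-calculating the association scores then amounts to running the same scoring procedure over the rewritten sentence pairs, so that a fused token such as `ice_cream` is counted as a single unit.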
Specification