WORD ALIGNMENT METHOD AND SYSTEM FOR IMPROVED VOCABULARY COVERAGE IN STATISTICAL MACHINE TRANSLATION
Abstract
A system and method for generating word alignments from pairs of aligned text strings are provided. A corpus of text strings provides pairs of text strings, primarily sentences, in source and target languages. A first alignment between a text string pair creates links therebetween. Each link links a single token of the first text string to a single token of the second text string. A second alignment also creates links between the text string pair. In some cases, these links may correspond to bi-phrases. A modified first alignment is generated by selectively modifying links in the first alignment which include a word which is infrequent in the corpus, based on links generated in the second alignment. This results in removing at least some of the links for the infrequent words, allowing more compact and better quality bi-phrases, with higher vocabulary coverage, to be extracted for use in a machine translation system.
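The abstract depends on deciding which words are "infrequent in the corpus". A minimal sketch of one way to do that is given here, assuming whitespace tokenization and a hypothetical count threshold `max_count`; neither is specified in the text above, and the function name is illustrative only.

```python
from collections import Counter


def infrequent_words(sentences, max_count=3):
    """Return the words that occur at most `max_count` times in `sentences`.

    Whitespace tokenization and the threshold value are assumptions made
    for illustration, not details taken from the patent text.
    """
    counts = Counter(tok for sentence in sentences for tok in sentence.split())
    return {word for word, count in counts.items() if count <= max_count}


# Toy corpus: "the" and "sleeps" are frequent, the animal names are not.
corpus = ["the cat sleeps", "the dog sleeps", "the platypus sleeps"]
rare = infrequent_words(corpus, max_count=1)
```

In practice the threshold would be tuned to the corpus size; a fixed count is used here only to keep the example small.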
22 Claims
1. A method for generating word alignments from pairs of aligned text strings comprising:
- from a corpus of text strings, receiving a pair of text strings comprising a first text string in a first language and a second text string in a second language;
- with a first alignment tool, generating a first alignment between the first and second text strings which creates links between the first and second text string, each link linking a single token of the first text string to a single token of the second text string, the tokens of the first and second text strings including words;
- with a second alignment tool, generating a second alignment between the first and second text strings which creates links between the first and second text strings, each link linking at least one token of the first text string to at least one token of the second text string;
- generating a modified first alignment by selectively modifying links in the first alignment which include a word which is infrequent in the corpus, based on links generated in the second alignment.
Dependent claims: 2-20.
21. A system for generating word alignments from word aligned text strings comprising:
- instructions stored in memory for receiving a source sentence in a source language and a target sentence in a target language from a text corpus, the target sentence having been automatically identified as being a translation of the source sentence;
- instructions for generating a first alignment between the source sentence and the target sentence by forming links, including some links that each link a source word with a target word;
- instructions for generating a second alignment between the source sentence and the target sentence by forming links, including some links that each link at least one source word with at least one target word, the instructions for generating the second alignment generating alignments for sentence pairs in the corpus which include fewer links, on average, than the instructions for generating a first alignment;
- instructions for identifying links in the second alignment which comprise infrequent words and, based on at least some of these identified links, modifying the first alignment to remove links between the infrequent words present in the second alignment links and other words of the first alignment which do not form a part of one of the identified second alignment links.
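The requirement that the second alignment contain "fewer links, on average" than the first can be checked with a simple count. The sketch below assumes a hypothetical representation that the claim does not prescribe: a first alignment is a set of `(source_index, target_index)` pairs, and a second alignment is a list of `(source_indices, target_indices)` pairs, each counted as one link.

```python
def average_links(alignments):
    """Mean number of links per sentence pair (0.0 for an empty list)."""
    if not alignments:
        return 0.0
    return sum(len(alignment) for alignment in alignments) / len(alignments)


# Hypothetical alignments for two sentence pairs.
# First alignment: one link per (source word, target word) pair.
first_alignments = [
    {(0, 0), (1, 1), (2, 1), (2, 2)},
    {(0, 0), (1, 0), (1, 1)},
]
# Second alignment: each link may group several words on either side.
second_alignments = [
    [({0}, {0}), ({1, 2}, {1, 2})],
    [({0, 1}, {0, 1})],
]

avg_first = average_links(first_alignments)
avg_second = average_links(second_alignments)
```

With these toy values the second alignment averages fewer links per sentence pair than the first, which is the property the claim describes.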
22. A method for generating word alignments from aligned sentences comprising:
- receiving a source sentence in a source language and a target sentence in a target language from a text corpus, the target sentence having been automatically identified as being a translation of the source sentence;
- with a processor, generating a word alignment between the source sentence and the target sentence by forming links, including some links that each link a source word with a target word;
- generating a second alignment between the source sentence and the target sentence by a method which generates alignments for sentence pairs in the corpus which include fewer links, on average, than the method for generating the first alignment, the second alignment including some links that each link at least one source word with at least one target word;
- identifying links in the second alignment which comprise infrequent words and based on at least some of these identified links, modifying the first alignment to remove links between the infrequent words present in the second alignment links and other words of the first alignment which do not form a part of one of the identified second alignment links.
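The final step of claims 21 and 22 can be read as a pruning rule over the first alignment. The following is a rough sketch of that reading, not the patented implementation. It assumes a hypothetical data layout (word links as index pairs, second-alignment links as pairs of index sets) and precomputed sets of infrequent words, and it applies every identified link, whereas the claims only require "at least some" of them.

```python
def prune_first_alignment(first, second, src, tgt, rare_src, rare_tgt):
    """Remove first-alignment links that tie an infrequent word to a word
    outside the second-alignment link containing that infrequent word.

    first:    set of (i, j) links between src[i] and tgt[j]
    second:   iterable of (source_indices, target_indices) links
    rare_src, rare_tgt: sets of infrequent source / target words (assumed given)
    """
    pruned = set(first)
    for s_pos, t_pos in second:
        rare_s = {i for i in s_pos if src[i] in rare_src}
        rare_t = {j for j in t_pos if tgt[j] in rare_tgt}
        if not rare_s and not rare_t:
            continue  # this link contains no infrequent word
        pruned = {
            (i, j)
            for (i, j) in pruned
            # drop links from an infrequent word to a word outside this link
            if not ((i in rare_s and j not in t_pos)
                    or (j in rare_t and i not in s_pos))
        }
    return pruned


src = ["the", "platypus", "sleeps"]
tgt = ["l'", "ornithorynque", "dort"]
# The first alignment links "platypus" to both "ornithorynque" and "dort".
first = {(0, 0), (1, 1), (1, 2), (2, 2)}
# The second alignment links "platypus" only to "ornithorynque".
second = [({1}, {1})]

modified = prune_first_alignment(
    first, second, src, tgt,
    rare_src={"platypus"}, rare_tgt={"ornithorynque"},
)
```

Here the spurious link between "platypus" and "dort" is dropped, while links that involve only frequent words are left unchanged.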
Specification