Word alignment method and system for improved vocabulary coverage in statistical machine translation
Abstract
A system and method for generating word alignments from pairs of aligned text strings are provided. A corpus of text strings provides pairs of text strings, primarily sentences, in source and target languages. A first alignment between a text string pair creates links therebetween. Each link links a single token of the first text string to a single token of the second text string. A second alignment also creates links between the text string pair. In some cases, these links may correspond to bi-phrases. A modified first alignment is generated by selectively modifying links in the first alignment which include a word which is infrequent in the corpus, based on links generated in the second alignment. This results in removing at least some of the links for the infrequent words, allowing more compact and better quality bi-phrases, with higher vocabulary coverage, to be extracted for use in a machine translation system.
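The abstract relies on knowing which words are "infrequent in the corpus" but this excerpt does not say how that is decided. As a minimal sketch only, one could count token occurrences over one side of the corpus and flag tokens at or below a cutoff. The function name `find_infrequent_words` and the cutoff `max_count` are hypothetical and do not come from the patent text.

```python
from collections import Counter


def find_infrequent_words(corpus, max_count=2):
    """Return the tokens that occur at most `max_count` times in `corpus`.

    `corpus` is an iterable of token lists (one list per text string).
    `max_count` is an assumed cutoff, not a value taken from the patent.
    """
    counts = Counter(token for sentence in corpus for token in sentence)
    return {token for token, n in counts.items() if n <= max_count}


# Toy source-side corpus, for illustration only.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "aardvark", "slept"],
]
rare = find_infrequent_words(corpus, max_count=1)
```

In this toy corpus "the" and "sat" occur more than once, so only the remaining words are flagged. In practice the same count would presumably be run separately for the source and target languages.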
21 Claims
1. A method for generating word alignments from pairs of aligned text strings comprising:
from a corpus of text strings, receiving a pair of text strings comprising a first text string in a first language and a second text string in a second language;
with a first alignment tool, generating a first alignment between the first and second text strings which creates links between the first and second text string, each link linking a single token of the first text string to a single token of the second text string, the tokens of the first and second text strings including words;
with a second alignment tool, generating a second alignment between the first and second text strings which creates links between the first and second text strings, each link linking at least one token of the first text string to at least one token of the second text string, and
generating a modified first alignment by selectively modifying links in the first alignment which include a word which is infrequent in the corpus, based on links generated in the second alignment, the selective modification of the links comprising identifying links in the first alignment to be retained which include the infrequent word and a linked target word where there is a corresponding link present in the second alignment which includes the infrequent word and the same linked target word and identifying for removal, at least a portion of the links in the first alignment which include the infrequent word and a linked target word for which there is no corresponding link between the infrequent word and the linked target word in the second alignment,
wherein the generation of at least one of the first, second, and modified alignments is performed with a computer processor.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 13, 14, 15, 16, 17, 18, 19
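The retain/remove rule in claim 1 can be illustrated with a small sketch. Everything here is an assumption made for illustration: first-alignment links are modeled as `(source_index, target_index)` pairs, second-alignment links as pairs of index sets, infrequent words are checked on the source side only, and every uncorroborated link is removed (the claim only requires removing "at least a portion"). The name `modify_first_alignment` is hypothetical.

```python
def modify_first_alignment(first_links, second_links, src_tokens, infrequent):
    """Return a modified copy of `first_links`.

    first_links:  set of (i, j) pairs, one source token to one target token.
    second_links: iterable of (src_indices, tgt_indices) pairs, each a set,
                  so a link may cover several tokens on either side.
    src_tokens:   list of source tokens, indexed by i.
    infrequent:   set of source words treated as infrequent (assumed given).
    """
    # Every (i, j) pair that some second-alignment link covers.
    corroborated = {
        (i, j)
        for src_set, tgt_set in second_links
        for i in src_set
        for j in tgt_set
    }
    modified = set()
    for i, j in first_links:
        if src_tokens[i] in infrequent and (i, j) not in corroborated:
            continue  # infrequent word, no matching second-alignment link: drop
        modified.add((i, j))  # frequent word, or corroborated link: keep
    return modified


src = ["the", "quokka", "slept"]
first = {(0, 0), (1, 0), (1, 1), (2, 2)}   # "quokka" is linked to two targets
second = [({1}, {1}), ({2}, {2})]          # sparser second alignment
result = modify_first_alignment(first, second, src, infrequent={"quokka"})
```

Here the link `(1, 0)` is dropped because no second-alignment link joins "quokka" to target position 0, while `(1, 1)` is retained. Links that do not involve an infrequent word are left untouched.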
11. A method for generating word alignments from pairs of aligned text strings comprising:
from a corpus of text strings, receiving a pair of text strings comprising a first text string in a first language and a second text string in a second language;
with a first software alignment tool, generating a first alignment between the first and second text strings which creates links between the first and second text string, each link linking a single token of the first text string to a single token of the second text string, the tokens of the first and second text strings including words;
with a second software alignment tool, which outputs word alignments or aligned bi-phrases, generating a second alignment between the first and second text strings which creates links between the first and second text strings, each link linking at least one token of the first text string to at least one token of the second text string, and
generating a modified first alignment by selectively modifying links in the first alignment which include a word which is infrequent in the corpus, based on links generated in the second alignment, the selective modification being conditional on there being a bi-phrase identified in the second alignment to be used as a basis for the modification which has at least a threshold frequency k in the corpus or in a set of sub-corpora generated by sampling the corpus,
wherein the generation of at least one of the first, second, and modified alignments is performed with a computer processor.
Dependent claims: 12
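The threshold condition in claim 11 can be sketched as a simple frequency gate. This is an illustration under assumptions: bi-phrases are modeled as `(source_phrase, target_phrase)` string pairs, frequency is counted over the whole corpus, and the sampled sub-corpora variant named in the claim is not shown because the excerpt does not say how counts are combined across samples. The names `biphrase_counts` and `may_modify` are hypothetical.

```python
from collections import Counter


def biphrase_counts(aligned_corpus):
    """Count occurrences of each bi-phrase across all sentence pairs.

    `aligned_corpus` is a list with one entry per sentence pair; each entry
    is the list of (source_phrase, target_phrase) pairs the second aligner
    produced for that pair.
    """
    return Counter(bp for biphrases in aligned_corpus for bp in biphrases)


def may_modify(biphrase, counts, k):
    """True if `biphrase` is frequent enough (at least k) to justify an edit."""
    return counts[biphrase] >= k  # Counter yields 0 for unseen bi-phrases


aligned_corpus = [
    [("quokka", "quokka"), ("slept", "dormait")],
    [("quokka", "quokka")],
    [("slept", "a dormi")],
]
counts = biphrase_counts(aligned_corpus)
```

With `k = 2`, `may_modify(("quokka", "quokka"), counts, 2)` holds and `may_modify(("slept", "dormait"), counts, 2)` does not, so only the first bi-phrase would be allowed to drive a change to the first alignment.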
20. A system for generating word alignments from word aligned text strings comprising:
instructions stored in memory for receiving a source sentence in a source language and a target sentence in a target language from a text corpus, the target sentence having been automatically identified as being a translation of the source sentence;
instructions for generating a first alignment between the source sentence and the target sentence by forming links, including some links that each link a source word with a target word;
instructions for generating a second alignment between the source sentence and the target sentence by forming links, including some links that each link at least one source word with at least one target word, the instructions for generating the second alignment generating alignments for sentence pairs in the corpus which include fewer links, on average, than the instructions for generating a first alignment;
instructions for identifying of links in the second alignment which comprise infrequent words and based on at least some of these identified links, modifying the first alignment to remove links between the infrequent words present in the second alignment links and other words of the first alignment which do not form a part of one of the identified second alignment links and for identifying links in the first alignment to be retained which include an infrequent word and a linked target word where there is a corresponding link present in the second alignment which includes the infrequent word and the same linked target word.
21. A method for generating word alignments from aligned sentences comprising:
receiving a source sentence in a source language and a target sentence in a target language from a text corpus, the target sentence having been automatically identified as being a translation of the source sentence;
with a processor, generating a first, word alignment between the source sentence and the target sentence by forming links, including some links that each link a source word with a target word;
generating a second alignment between the source sentence and the target sentence by a method which generates alignments for sentence pairs in the corpus which include fewer links, on average, than the method for generating the first alignment, the second alignment including some links that each link at least one source word with at least one target word;
identifying links in the second alignment which comprise infrequent words and based on at least some of these identified links, modifying the first alignment to remove links between the infrequent words present in the second alignment links and other words of the first alignment which do not form a part of one of the identified second alignment links and retaining links in first alignment which include an infrequent word and a linked target word where there is a corresponding link present in the second alignment which includes the infrequent word and the same linked target word.
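Claims 20 and 21 require that the second method produce fewer links per sentence pair, on average, than the first. A minimal sketch of how that property might be checked on a sample of alignments follows; the helper name `average_links` and the toy data are hypothetical, and a link is counted once regardless of how many tokens it covers.

```python
def average_links(alignments):
    """Mean number of links per sentence pair; 0.0 for an empty sample."""
    alignments = list(alignments)
    if not alignments:
        return 0.0
    return sum(len(links) for links in alignments) / len(alignments)


# One entry per sentence pair. First-alignment links are (i, j) pairs;
# second-alignment links are (source index set, target index set) pairs.
first_alignments = [
    {(0, 0), (1, 0), (1, 1), (2, 2)},
    {(0, 0), (1, 1)},
]
second_alignments = [
    [({1}, {1}), ({2}, {2})],
    [({0}, {0})],
]

first_avg = average_links(first_alignments)
second_avg = average_links(second_alignments)
second_is_sparser = second_avg < first_avg
```

On this toy sample the first aligner averages 3.0 links per pair and the second 1.5, so the sparsity condition holds.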
Specification