Extracting sentence translations from translated documents
Abstract
A system extracts translations from translated texts, such as sentence translations from translated versions of documents. A first and a second text are accessed and divided into a plurality of textual elements. From these textual elements, a sequence of pairs of text portions is formed, and a pair score is calculated for each pair, using weighted features. Then, an alignment score of the sequence is calculated using the pair scores, and the sequence is systematically varied to identify a sequence that optimizes the alignment score. The invention allows for fast, reliable and robust alignment of sentences within large translated documents. Further, it allows a broad variety of existing knowledge sources to be exploited in a flexible way, without a performance penalty. In addition, a general implementation of dynamic programming search with online memory allocation and garbage collection allows very long documents to be processed with a limited memory footprint.
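The "weighted features" pair score is spelled out in claim 1 as a sum, over the features occurring in a pair, of the smaller of the two occurrence counts multiplied by the feature's weight. The following is a minimal sketch of that calculation, not the patented implementation. The function name `pair_score`, the representation of a text portion as a flat list of feature labels, and the feature labels and weights in the example are all assumptions made for illustration.

```python
from collections import Counter


def pair_score(src_features, tgt_features, weights):
    """Sum of min(count in source, count in target) * weight over features.

    Assumption: each text portion has already been reduced to a list of
    feature labels; how features are extracted is outside this sketch.
    """
    src_counts = Counter(src_features)
    tgt_counts = Counter(tgt_features)
    score = 0.0
    # A feature present on only one side has a minimum count of zero,
    # so iterating over the shared features gives the same sum.
    for feature in src_counts.keys() & tgt_counts.keys():
        shared = min(src_counts[feature], tgt_counts[feature])
        score += shared * weights.get(feature, 0.0)
    return score


# Hypothetical feature labels and weights, for illustration only.
example_weights = {"NUM:42": 2.0, "NAME:Xerox": 1.5, "LEN:short": 0.5}
example_src = ["NUM:42", "NUM:42", "NAME:Xerox", "LEN:short"]
example_tgt = ["NUM:42", "NAME:Xerox", "NAME:Xerox"]
example_score = pair_score(example_src, example_tgt, example_weights)
# min(2, 1) * 2.0 + min(1, 2) * 1.5 = 3.5
```

Taking the minimum count means a feature only contributes as often as it can be matched on both sides, so an unmatched surplus on one side adds nothing to the score.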
20 Claims
1. Method of extracting translations from translated texts, the method comprising the steps of:
accessing a first text in a first language;
accessing a second text in a second language, the second language being different from the first language, the second text being a translation of the first text;
dividing the first text and the second text each into a plurality of textual elements;
forming a sequence of pairs of text portions from said plurality of textual elements, each pair comprising a text portion of the first text and a text portion of the second text, each text portion comprising zero or more adjacent textual elements, each textual element of the first and the second text being comprised in a text portion of the sequence;
calculating a pair score of each pair in the sequence using a number of occurrences of each of a plurality of features in the text portions of the respective pair and using a plurality of weights, each weight being assigned to one feature of said plurality of features, wherein the pair scores are calculated by taking, for each feature occurring in the pair, a minimum number of the number of occurrences of the respective feature in the paired text portions, taking a product of the minimum number and the weight assigned to the respective feature, and summing up all the products of all features;
calculating an alignment score of the sequence using said pair scores, said alignment score indicating the translation quality of the sequence; and
optimizing said alignment score by systematically searching through the space of alternatives and combining optimal alignments for subsequences into optimal alignments for longer sequences.
(Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18)
19. A computer readable storage medium storing instructions for performing a method comprising the steps of:
accessing a first text in a first language;
accessing a second text in a second language, the second language being different from the first language, the second text being a translation of the first text;
dividing the first text and the second text each into a plurality of textual elements;
forming a sequence of pairs of text portions from said plurality of textual elements, each pair comprising a text portion of the first text and a text portion of the second text, each text portion comprising zero or more adjacent textual elements, each textual element of the first and the second text being comprised in a text portion of the sequence;
calculating a pair score of each pair in the sequence using a number of occurrences of each of a plurality of features in the text portions of the respective pair and using a plurality of weights, each weight being assigned to one feature of said plurality of features, wherein the pair scores are calculated by taking, for each feature occurring in the pair, a minimum number of the number of occurrences of the respective feature in the paired text portions, taking a product of the minimum number and the weight assigned to the respective feature, and summing up all the products of all features;
calculating an alignment score of the sequence using said pair scores, said alignment score indicating the translation quality of the sequence; and
optimizing said alignment score by systematically searching through the space of alternatives and combining optimal alignments for subsequences into optimal alignments for longer sequences.
20. A system for extracting translations from translated texts, comprising:
a pre-processor for accessing a first text in a first language, accessing a second text in a second language, the second language being different from the first language, the second text being a translation of the first text, and dividing the first and the second text each into a plurality of textual elements; and
a processor for forming a sequence of pairs of text portions from said pluralities of textual elements, each pair comprising a text portion of the first text and a text portion of the second text, each text portion comprising zero or more adjacent textual elements, each textual element of the first and the second text being comprised in a text portion of the sequence, the processor being further arranged for calculating a pair score of each pair in the sequence using a number of occurrences of each of a plurality of features in the text portions of the respective pair and using a plurality of weights, each weight being assigned to one feature of said plurality of features, calculating an alignment score of the sequence using said pair scores, said alignment score indicating the translation quality of the sequence, and optimizing said alignment score by repeating said forming and calculating steps, wherein the pair scores are calculated by taking, for each feature occurring in the pair, a minimum number of the number of occurrences of the respective feature in the paired text portions, taking a product of the minimum number and the weight assigned to the respective feature, and summing up all the products of all features.
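The optimizing step of the claims, in which optimal alignments for subsequences are combined into optimal alignments for longer sequences, can be illustrated with a small dynamic programming sketch. This is one possible reading under stated assumptions, not the claimed implementation: the alignment score is assumed to be the sum of the pair scores, text portions are limited to at most two textual elements, and the `MOVES` table with its penalties for pairs that are not one-to-one is a hypothetical addition to keep the example well behaved. Each textual element is represented as a list of feature labels, which is also an assumption.

```python
from collections import Counter

# Hypothetical move set: (source elements, target elements) consumed by one
# pair, mapped to an assumed penalty that discourages non 1:1 pairs.
MOVES = {(1, 1): 0.0, (1, 0): 1.0, (0, 1): 1.0, (2, 1): 0.5, (1, 2): 0.5}


def pair_score(src_portion, tgt_portion, weights):
    """Sum of min(occurrence counts) * weight over the features of a pair."""
    src_counts = Counter(f for element in src_portion for f in element)
    tgt_counts = Counter(f for element in tgt_portion for f in element)
    return sum(min(n, tgt_counts[f]) * weights.get(f, 0.0)
               for f, n in src_counts.items())


def align(src, tgt, weights):
    """Return (best alignment score, list of ((s0, s1), (t0, t1)) pairs)."""
    n, m = len(src), len(tgt)
    neg_inf = float("-inf")
    best = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            for (di, dj), penalty in MOVES.items():
                pi, pj = i - di, j - dj
                if pi < 0 or pj < 0 or best[pi][pj] == neg_inf:
                    continue
                # Extend the best alignment of the prefixes by one more pair.
                cand = (best[pi][pj]
                        + pair_score(src[pi:i], tgt[pj:j], weights)
                        - penalty)
                if cand > best[i][j]:
                    best[i][j] = cand
                    back[i][j] = (di, dj)
    # Follow the back-pointers from the end to recover the chosen pairs.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        pairs.append(((i - di, i), (j - dj, j)))
        i, j = i - di, j - dj
    pairs.reverse()
    return best[n][m], pairs


# Toy input: three source elements, the last two merged in the translation.
toy_weights = {"a": 1.0, "b": 1.0, "c": 1.0}
toy_src = [["a"], ["b"], ["c"]]
toy_tgt = [["a"], ["b", "c"]]
toy_score, toy_pairs = align(toy_src, toy_tgt, toy_weights)
# toy_pairs pairs source element 0 with target element 0, and source
# elements 1..2 with target element 1.
```

This sketch fills the complete table, which costs memory proportional to the product of the two text lengths. It does not attempt the online memory allocation and garbage collection that the abstract mentions for very long documents.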
Specification