Parallel fragment extraction from noisy parallel corpora
Abstract
Machine translation algorithms for translating between a first language and a second language are often trained using parallel fragments, comprising a first language corpus and a second language corpus comprising an element-for-element translation of the first language corpus. Such training may involve large training sets that may be extracted from large bodies of similar sources, such as databases of news articles written in the first and second languages describing similar events; however, extracted fragments may be comparatively “noisy,” with extra elements inserted in each corpus. Extraction techniques may be devised that can differentiate between “bilingual” elements represented in both corpora and “monolingual” elements represented in only one corpus, and that can extract cleaner parallel fragments of bilingual elements. Such techniques may involve conditional probability determinations on one corpus with respect to the other corpus, or joint probability determinations that concurrently evaluate both corpora for bilingual elements.
20 Claims
1. A method of extracting parallel fragments from a first corpus in a first language and a second corpus in a second language on a computer having a processor, the method comprising:
executing on the processor instructions configured to:
for respective elements of the first corpus, calculate:
a monolingual probability of the element with respect to preceding elements of the first corpus, and
a bilingual probability of the element with respect to an aligned element of the second corpus;
for respective elements of the first corpus, identify candidate fragments of the first corpus comprising respective elements of the first corpus having a greater bilingual probability of the element with aligned elements of the second corpus than only the monolingual probability of the element with respect to preceding elements of the first corpus to align elements of the first corpus with elements of the second corpus; and
extract parallel fragments respectively comprising:
the first corpus elements of a candidate fragment, and
the second corpus elements aligned with the first corpus elements of the candidate fragment.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
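The comparison recited in claim 1 can be illustrated with a minimal sketch. It is one possible reading for illustration only, not the patented implementation. The function name extract_fragments, the callables mono_prob and bi_prob, the precomputed alignment list, and the min_len threshold are all assumptions that do not appear in the claim.

```python
from typing import Callable, List, Optional, Tuple


def extract_fragments(
    src: List[str],
    tgt: List[str],
    alignment: List[Optional[int]],  # alignment[i] = index into tgt, or None (assumed given)
    mono_prob: Callable[[str, List[str]], float],  # hypothetical: P(word | preceding source words)
    bi_prob: Callable[[str, str], float],  # hypothetical: P(word | aligned target word)
    min_len: int = 2,  # assumed minimum fragment length, not from the claim
) -> List[Tuple[List[str], List[str]]]:
    # Flag a source element as "bilingual" when its aligned target element
    # explains it better than the preceding source context does.
    bilingual = []
    for i, word in enumerate(src):
        j = alignment[i]
        p_mono = mono_prob(word, src[:i])
        p_bi = bi_prob(word, tgt[j]) if j is not None else 0.0
        bilingual.append(p_bi > p_mono)

    # Collect maximal runs of bilingual elements as candidate fragments.
    fragments = []
    start = None
    for i, flag in enumerate(bilingual + [False]):  # sentinel closes the last run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_len:
                tgt_idx = sorted({alignment[k] for k in range(start, i)})
                fragments.append((src[start:i], [tgt[j] for j in tgt_idx]))
            start = None
    return fragments
```

Unaligned elements receive a bilingual probability of zero in this sketch, so they can never be flagged as bilingual and always end a run.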
12. A method of extracting parallel fragments from a first corpus in a first language and a second corpus in a second language on a computer having a processor, the method comprising:
executing on the processor instructions configured to:
for at least one first corpus element and at least one second corpus element, calculate:
a first corpus monolingual probability of the at least one first corpus element with respect to preceding elements of the first corpus, and
a second corpus monolingual probability of the at least one second corpus element with respect to preceding elements of the second corpus, and
a bilingual probability of the at least one first corpus elements and the at least one second corpus elements with respect to one another;
align the first corpus elements and the second corpus elements to identify candidate fragments comprising:
a sequence of first corpus elements having a greater bilingual probability than only a first corpus monolingual property, and
a sequence of second corpus elements aligned with the first corpus elements having a greater bilingual probability than only a second corpus monolingual property; and
extract parallel fragments respectively comprising first corpus elements of a candidate fragment and aligned second corpus elements of the candidate fragment.
Dependent claims: 13, 14, 15, 16, 17, 18, 19
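Claim 12 evaluates both corpora together. The sketch below is a hedged illustration of one literal reading: an alignment link is kept only when its bilingual probability exceeds both the source-side and the target-side monolingual probability. The names joint_fragments, src_mono, tgt_mono and joint are hypothetical. The alignment links are assumed to be given, sorted by source position, with at most one link per source element.

```python
from itertools import groupby
from typing import Callable, List, Tuple


def joint_fragments(
    src: List[str],
    tgt: List[str],
    pairs: List[Tuple[int, int]],  # assumed alignment links (i, j), sorted by i, one per source element
    src_mono: Callable[[str, List[str]], float],  # hypothetical source-side monolingual model
    tgt_mono: Callable[[str, List[str]], float],  # hypothetical target-side monolingual model
    joint: Callable[[str, str], float],  # hypothetical bilingual probability of a word pair
) -> List[Tuple[List[str], List[str]]]:
    # Keep a link only if the bilingual explanation beats both monolingual ones.
    kept = []
    for i, j in pairs:
        p_bi = joint(src[i], tgt[j])
        if p_bi > src_mono(src[i], src[:i]) and p_bi > tgt_mono(tgt[j], tgt[:j]):
            kept.append((i, j))

    # Consecutive source positions share the same (index - rank) key,
    # so groupby splits the kept links into contiguous source runs.
    fragments = []
    for _, run in groupby(enumerate(kept), key=lambda item: item[1][0] - item[0]):
        links = [link for _, link in run]
        src_part = [src[i] for i, _ in links]
        tgt_part = [tgt[j] for j in sorted({j for _, j in links})]
        fragments.append((src_part, tgt_part))
    return fragments
```

Unlike the sketch of a one-directional comparison, a target element that is well predicted by its own preceding context can veto a link here, which is the practical effect of scoring both corpora concurrently.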
20. A method of extracting parallel fragments from a first corpus in a first language and a second corpus in a second language on a computer having a processor, the method comprising:
executing on the processor instructions configured to:
prepare a bilingual coincidence data set associating elements of the first language with elements of the second language according to a bilingual coincidence;
generate a hidden Markov model representing transition probabilities between a bilingual generation mode and a monolingual generation mode;
set a transition parameter to one of a bilingual generation mode and a monolingual generation mode; and
align elements of the first corpus with elements of the second corpus by:
for respective elements of the first corpus, identifying maximally coincident elements of the second corpus according to the bilingual coincidence data set for the first corpus element and the second corpus element, and
upon identifying in the first corpus a structural element of the first language that is not translatable into the second language, aligning the structural element with a null element;
for respective elements of the first corpus, calculating:
a monolingual probability of the element with respect to preceding elements of the first corpus, calculated with respect to the transition probability between the monolingual generation mode of the element and the generation mode of a preceding element;
for elements not aligned with the null element of the second corpus, a bilingual probability of the element with respect to an aligned element of the second corpus based on the bilingual coincidence, and calculated according to the mathematical formula:
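The formula recited in claim 20 is not reproduced in this listing, and the following sketch does not attempt to reconstruct it. It only illustrates the general idea of a two-state hidden Markov model that switches between a bilingual generation mode ("B") and a monolingual generation mode ("M"), decoded with the standard Viterbi algorithm. The function name viterbi_modes, the single stay probability, the uniform initial distribution, and the precomputed per-element probabilities are all assumptions.

```python
import math
from typing import List


def viterbi_modes(mono: List[float], bi: List[float], stay: float = 0.9) -> List[str]:
    """Return the most likely mode ("B" or "M") for each element.

    mono[t] and bi[t] are assumed, precomputed per-element probabilities;
    bi[t] should be 0.0 for elements aligned with the null element.
    stay is an assumed probability of remaining in the same mode.
    """
    if not mono:
        return []
    states = ("B", "M")

    def log(p: float) -> float:
        return math.log(p) if p > 0 else float("-inf")

    emit = {"B": bi, "M": mono}
    trans = {(a, b): log(stay if a == b else 1.0 - stay) for a in states for b in states}

    # Assumed uniform initial distribution over the two modes.
    score = {s: log(0.5) + log(emit[s][0]) for s in states}
    back = []
    for t in range(1, len(mono)):
        new_score, pointers = {}, {}
        for s in states:
            # Best predecessor mode for entering mode s at position t.
            prev = max(states, key=lambda r: score[r] + trans[(r, s)])
            new_score[s] = score[prev] + trans[(prev, s)] + log(emit[s][t])
            pointers[s] = prev
        score = new_score
        back.append(pointers)

    # Trace the best path backwards from the best final mode.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

Runs of "B" in the returned path would correspond to candidate bilingual fragments. The stay parameter discourages rapid switching between modes, so an isolated weak element inside a strong bilingual run need not break the fragment.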
Specification