PARALLEL FRAGMENT EXTRACTION FROM NOISY PARALLEL CORPORA
Abstract
Machine translation algorithms for translating between a first language and a second language are often trained using parallel fragments, comprising a first language corpus and a second language corpus comprising an element-for-element translation of the first language corpus. Such training may involve large training sets that may be extracted from large bodies of similar sources, such as databases of news articles written in the first and second languages describing similar events; however, extracted fragments may be comparatively “noisy,” with extra elements inserted in each corpus. Extraction techniques may be devised that can differentiate between “bilingual” elements represented in both corpora and “monolingual” elements represented in only one corpus, and that can extract cleaner parallel fragments of bilingual elements. Such techniques may involve conditional probability determinations on one corpus with respect to the other corpus, or joint probability determinations that concurrently evaluate both corpora for bilingual elements.
30 Citations
20 Claims
1. A method of extracting parallel fragments from a first corpus in a first language and a second corpus in a second language, comprising:
- for respective elements of the first corpus, calculating:
  a monolingual probability of the element with respect to preceding elements of the first corpus, and
  a bilingual probability of the element with respect to an aligned element of the second corpus;
- aligning elements of the first corpus with elements of the second corpus to identify candidate fragments comprising a sequence of first corpus elements having a greater bilingual probability than a monolingual probability; and
- extracting parallel fragments respectively comprising:
  the first corpus elements of a candidate fragment, and
  the second corpus elements aligned with the first corpus elements of the candidate fragment.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 19
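The comparison at the heart of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `extract_fragments`, the bigram table `mono`, the lexical table `bi`, and the `floor` fallback are all hypothetical, and "aligning" is simplified to picking the second-corpus element that gives each first-corpus element its highest bilingual probability.

```python
from typing import Dict, List, Tuple

Fragment = Tuple[List[str], List[str]]


def extract_fragments(
    target: List[str],
    source: List[str],
    mono: Dict[Tuple[str, str], float],  # hypothetical table: (previous, current) -> probability
    bi: Dict[Tuple[str, str], float],    # hypothetical table: (current, aligned source) -> probability
    floor: float = 1e-6,                 # assumed fallback for unseen pairs
) -> List[Fragment]:
    """Collect maximal runs of first-corpus elements whose bilingual
    probability exceeds their monolingual probability."""
    if not source:
        return []
    fragments: List[Fragment] = []
    run_t: List[str] = []
    run_s: List[str] = []
    prev = "<s>"  # assumed start-of-corpus marker
    for tok in target:
        p_mono = mono.get((prev, tok), floor)
        # Simplified alignment: the source element that best explains this token.
        best_src = max(source, key=lambda s: bi.get((tok, s), floor))
        p_bi = bi.get((tok, best_src), floor)
        if p_bi > p_mono:
            run_t.append(tok)
            run_s.append(best_src)
        else:
            if run_t:
                fragments.append((run_t, run_s))
            run_t, run_s = [], []
        prev = tok
    if run_t:
        fragments.append((run_t, run_s))
    return fragments
```

Each returned pair holds the first-corpus elements of one candidate fragment and the second-corpus elements aligned with them. A greedy per-element comparison like this ignores the transitions between generation modes that claim 20 models explicitly.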
11. A method of extracting parallel fragments from a first corpus in a first language and a second corpus in a second language, comprising:
- for at least one first corpus element and at least one second corpus element, calculating:
  a first corpus monolingual probability of the at least one first corpus element with respect to preceding elements of the first corpus,
  a second corpus monolingual probability of the at least one second corpus element with respect to preceding elements of the second corpus, and
  a bilingual probability of the at least one first corpus element and the at least one second corpus element with respect to one another;
- aligning the first corpus elements and the second corpus elements to identify candidate fragments comprising:
  a sequence of first corpus elements having a greater bilingual probability than a first corpus monolingual probability, and
  a sequence of second corpus elements aligned with the first corpus elements having a greater bilingual probability than a second corpus monolingual probability; and
- extracting parallel fragments respectively comprising first corpus elements of a candidate fragment and aligned second corpus elements of the candidate fragment.
Dependent claims: 12, 13, 14, 15, 16, 17, 18
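The two-sided test in claim 11 can be sketched in the same spirit. The snippet below is illustrative only and assumes the alignment and all three probabilities have already been computed; `joint_candidate_runs`, `mono_t`, `mono_s`, and `bilingual` are hypothetical names, and a pair missing from `bilingual` is treated as having probability zero.

```python
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[int, int]  # (first-corpus index, second-corpus index)


def joint_candidate_runs(
    alignment: Sequence[Pair],
    mono_t: Sequence[float],       # first-corpus monolingual probability per index
    mono_s: Sequence[float],       # second-corpus monolingual probability per index
    bilingual: Dict[Pair, float],  # bilingual probability per aligned pair
) -> List[List[Pair]]:
    """Group consecutive aligned pairs whose bilingual probability exceeds
    the monolingual probability on both sides."""
    runs: List[List[Pair]] = []
    current: List[Pair] = []
    for i, j in alignment:
        p_bi = bilingual.get((i, j), 0.0)
        if p_bi > mono_t[i] and p_bi > mono_s[j]:
            current.append((i, j))
        else:
            if current:
                runs.append(current)
            current = []
    if current:
        runs.append(current)
    return runs
```

Requiring the bilingual probability to win on both sides is what separates this from the one-directional check in claim 1: an element that looks well translated from the first corpus but is better explained monolingually in the second corpus ends the run.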
20. A method of extracting parallel fragments from a first corpus in a first language and a second corpus in a second language, comprising:
- preparing a bilingual coincidence data set associating elements of the first language with elements of the second language according to a bilingual coincidence;
- generating a hidden Markov model representing transition probabilities between a bilingual generation mode and a monolingual generation mode;
- setting a transition parameter to one of a bilingual generation mode and a monolingual generation mode; and
- aligning elements of the first corpus with elements of the second corpus by:
  for respective elements of the first corpus, identifying maximally coincident elements of the second corpus according to the bilingual coincidence data set for the first corpus element and the second corpus element, and
  upon identifying in the first corpus a structural element of the first language that is not translatable into the second language, aligning the structural element with a null element;
- for respective elements of the first corpus, calculating:
  a monolingual probability of the element with respect to preceding elements of the first corpus, calculated with respect to the transition probability between the monolingual generation mode of the element and the generation mode of a preceding element;
  for elements not aligned with the null element of the second corpus, a bilingual probability of the element with respect to an aligned element of the second corpus based on the bilingual coincidence, and calculated according to the mathematical formula [not reproduced in this text], wherein:
    $t$ represents the first corpus;
    $t_x$ represents the element of the first corpus at position $x$;
    $t_x^y$ represents the elements of the first corpus between positions $x$ and $y$;
    $n$ represents the size of the first corpus;
    $s$ represents the second corpus;
    $s_x$ represents the element of the second corpus at position $x$;
    $s_x^y$ represents the elements of the second corpus between positions $x$ and $y$;
    $m$ represents the size of the second corpus;
    $a_x$ represents alignment of element $x$ of the first corpus with at least zero elements of the second corpus, wherein: an alignment of $-1$ indicates a monolingual generation mode, an alignment of 0 indicates an alignment with the null element of the second corpus, and an alignment greater than 0 indicates an alignment with element $x$ of the second corpus;
    $a_x^y$ represents alignments of elements $x$ through $y$ of the first corpus with elements of the second corpus;
    $\Pr(t_j \mid a_1^j, t_1^{j-1}, s_1^m)$ represents a probability of element $t_j$ of the first corpus in view of the first $j-1$ elements of the first corpus aligned with the elements of the second corpus, and is computed according to the mathematical formula [not reproduced in this text], wherein:
      $e(t_j \mid t_{j-1})$ represents a monolingual probability of generating first corpus element $t_j$ in view of first corpus elements $t_1^{j-1}$, and
      $e(t_j \mid s_{a_j})$ represents a bilingual probability of generating first corpus element $t_j$ in view of at least zero second corpus elements $s_{a_j}$; and
    $\Pr(a_j \mid a_1^{j-1}, t_1^{j-1}, s_1^m)$ represents a probability of alignment of element $j$ of the first corpus with at least zero elements of the second corpus in view of the first $j-1$ elements of the first corpus aligned with the elements of the second corpus, and is computed according to the mathematical formula:
      $\Pr(a_j \mid a_1^{j-1}, t_1^{j-1}, s_1^m) = d(a_j \mid a_{j-1})$
    wherein:
      $d(a_j \mid a_{j-1})$ represents a probability of jumping to a target position $a_j$ at source position $j$ of the other corpus if element $j-1$ of the first corpus is chosen for alignment with the element of the second corpus at position $a_{j-1}$; and
  the calculating performed according to at least one of: a state search, a dynamic programming search, and a pathfinding search;
- identifying candidate fragments comprising a sequence of first corpus elements having a greater bilingual probability than a monolingual probability, comprising: for respective elements of the first corpus, updating the transition parameter comprising one of the bilingual generation mode and the monolingual generation mode based on the generation state of a preceding element and the hidden Markov model; and
- extracting parallel fragments respectively comprising:
  the first corpus elements of a candidate fragment, and
  the second corpus elements aligned with the first corpus elements of the candidate fragment,
  and where the parallel fragment satisfies parallel fragment conditions comprising at least one of:
    a parallel fragment length of at least three first corpus elements;
    fewer than 30% of the first corpus elements and the second corpus elements aligned with the null element; and
    fewer than 70% of the first corpus elements and the second corpus elements comprising a structural element.
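Two of the formulas that claim 20 refers to are not reproduced in this text. Going only by the symbol definitions in the claim, one plausible reading is the usual hidden Markov factorization over alignment and generation, shown below. The left-hand side of the first equation and the case split in the second are assumptions inferred from those definitions, not quotations from the claim; only the third equation appears in the claim as written.

```latex
% Assumed overall factorization (inferred from the claim's definitions)
\Pr(t_1^n, a_1^n \mid s_1^m)
  = \prod_{j=1}^{n}
    \Pr(a_j \mid a_1^{j-1}, t_1^{j-1}, s_1^m)\,
    \Pr(t_j \mid a_1^{j}, t_1^{j-1}, s_1^m)

% Assumed generation term: monolingual when a_j = -1, bilingual otherwise
\Pr(t_j \mid a_1^{j}, t_1^{j-1}, s_1^m) =
  \begin{cases}
    e(t_j \mid t_{j-1}) & \text{if } a_j = -1 \\
    e(t_j \mid s_{a_j}) & \text{otherwise}
  \end{cases}

% Alignment term as stated in the claim
\Pr(a_j \mid a_1^{j-1}, t_1^{j-1}, s_1^m) = d(a_j \mid a_{j-1})
```

Under this reading, an element is generated either from its predecessor in the first corpus or from its aligned element in the second corpus, and the jump distribution $d$ governs how the alignment, including the switch into or out of the monolingual mode, moves from one element to the next.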
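Claim 20 allows the calculation to be carried out by a dynamic programming search. The sketch below shows one such search, a standard Viterbi pass in log space, under stated assumptions: the alignment states are $-1$ (monolingual), 0 (null), and 1 through $m$; the callables `emit_mono`, `emit_bi`, and `jump` are hypothetical stand-ins for $e(t_j \mid t_{j-1})$, $e(t_j \mid s_{a_j})$, and $d(a_j \mid a_{j-1})$; and `jump(a, None)` is assumed to give the initial-state probability. The helper `bilingual_spans` applies two of the claim's fragment conditions together (length of at least three, fewer than 30% null alignments), counted on the first corpus only, which is a simplification.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

MONO = -1  # alignment value for the monolingual generation mode
NULL = 0   # alignment value for the null element of the second corpus


def _log(p: float) -> float:
    return math.log(p) if p > 0 else float("-inf")


def viterbi_align(
    target: Sequence[str],
    source: Sequence[str],
    emit_mono: Callable[[str, str], float],       # hypothetical e(t_j | t_{j-1})
    emit_bi: Callable[[str, str], float],         # hypothetical e(t_j | s_{a_j})
    jump: Callable[[int, Optional[int]], float],  # hypothetical d(a_j | a_{j-1})
) -> List[int]:
    """Return the most probable alignment sequence a_1..a_n."""
    if not target:
        return []
    states = [MONO, NULL] + list(range(1, len(source) + 1))

    def emit(j: int, a: int) -> float:
        if a == MONO:
            prev = target[j - 1] if j > 0 else "<s>"  # assumed start marker
            return emit_mono(target[j], prev)
        return emit_bi(target[j], "<null>" if a == NULL else source[a - 1])

    # score[a]: best log-probability of any path ending in state a
    score = {a: _log(jump(a, None)) + _log(emit(0, a)) for a in states}
    back = []
    for j in range(1, len(target)):
        new_score, pointers = {}, {}
        for a in states:
            best = max(states, key=lambda p: score[p] + _log(jump(a, p)))
            pointers[a] = best
            new_score[a] = score[best] + _log(jump(a, best)) + _log(emit(j, a))
        score = new_score
        back.append(pointers)

    # Follow back-pointers from the best final state.
    a = max(states, key=lambda s: score[s])
    path = [a]
    for pointers in reversed(back):
        a = pointers[a]
        path.append(a)
    return path[::-1]


def bilingual_spans(
    path: Sequence[int], min_len: int = 3, max_null: float = 0.3
) -> List[Tuple[int, int]]:
    """Half-open (start, end) spans of non-monolingual positions that pass
    the length and null-fraction conditions."""
    spans, start = [], None
    for j, a in enumerate(list(path) + [MONO]):  # sentinel closes a trailing run
        if a != MONO and start is None:
            start = j
        elif a == MONO and start is not None:
            nulls = sum(1 for x in path[start:j] if x == NULL)
            if j - start >= min_len and nulls / (j - start) < max_null:
                spans.append((start, j))
            start = None
    return spans
```

The returned path uses the claim's alignment convention, so each span from `bilingual_spans` identifies a run of first-corpus positions generated in the bilingual mode, and the second-corpus side of the fragment can be read off the path values in that span. The search visits every pair of states at each position, so its cost grows with $n \cdot (m + 2)^2$.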
Specification