Mining multi-lingual data
Abstract
Technology is disclosed for mining training data to create machine translation engines. Training data can be mined as translation pairs from single content items that contain multiple languages; multiple content items in different languages that are related to the same or similar target; or multiple content items that are generated by the same author in different languages. Locating content items can include identifying potential sources of translation pairs that fall into these categories and applying filtering techniques to quickly gather those that are good candidates for being actual translation pairs. When actual translation pairs are located, they can be used to retrain a machine translation engine as in-domain for social media content items.
19 Claims
1. A method, performed by a computing device, for mining translation pairs for training in-domain machine translation engines, comprising:
obtaining one or more sources of potential translation pairs comprising one or more content items, wherein the one or more sources of potential translation pairs are in an identified domain for which a machine translation engine is to be trained;
generating one or more potential translation pairs from the obtained one or more sources of potential translation pairs by applying one or more automated filtering techniques to the obtained one or more sources of potential translation pairs, wherein one of the one or more automated filtering techniques applied to a selected obtained source of potential translation pairs is configured based on a type of the selected obtained source of potential translation pairs, and wherein each of the one or more potential translation pairs comprises at least two language snippets;
selecting at least one actual translation pair from the generated one or more potential translation pairs, said selecting comprising:
extracting characteristics from each of the two language snippets of at least one of the one or more potential translation pairs;
determining that the two language snippets of the at least one of the one or more potential translation pairs are translations of each other by comparing the extracted characteristics; and
training the machine translation engine using the selected at least one actual translation pair.
2. The method of claim 1, wherein:
the obtained one or more sources of potential translation pairs comprise single content items that each contain multiple languages;
each of the at least two language snippets for each potential translation pair is a portion of one of the single content items;
each of the at least two language snippets for each potential translation pair comprises two or more consecutive words for which a particular language has been identified; and
the identified domain for which the machine translation engine is to be trained is a social media domain.
3. The method of claim 2, wherein applying the one of the one or more automated filtering techniques comprises eliminating from consideration an unlikely potential translation pair of the one or more potential translation pairs by:
determining a first count of terms in a first of the at least two language snippets of the unlikely potential translation pair;
determining a second count of terms in a second of the at least two language snippets of the unlikely potential translation pair;
computing that a ratio of terms between the first count of terms and the second count of terms is beyond a specified threshold value; and
in response to the computing that the ratio of terms is beyond the specified threshold value, eliminating from consideration the unlikely potential translation pair.
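As an illustration only, and not part of the claims, the term-count ratio filter recited in claim 3 might be sketched in Python as follows. The whitespace tokenizer, the function names, and the default threshold of 2.0 are assumptions made for this sketch; the claim specifies none of them.

```python
def term_count(snippet: str) -> int:
    """Count terms in a snippet (assumption: terms are whitespace-delimited)."""
    return len(snippet.split())


def is_unlikely_pair(first: str, second: str, max_ratio: float = 2.0) -> bool:
    """Return True when the ratio of term counts is beyond the threshold."""
    a, b = term_count(first), term_count(second)
    if a == 0 or b == 0:
        # Assumption: an empty snippet is never a usable translation.
        return True
    # Use larger/smaller so the test is symmetric in the two snippets.
    return max(a, b) / min(a, b) > max_ratio


def filter_pairs(pairs, max_ratio: float = 2.0):
    """Eliminate unlikely pairs; keep the remaining candidates."""
    return [p for p in pairs if not is_unlikely_pair(p[0], p[1], max_ratio)]
```

A cheap check of this kind lets most mismatched snippets be discarded before any more expensive comparison is attempted.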
4. The method of claim 1:
wherein each of the obtained one or more sources of potential translation pairs comprise multiple content items in different languages;
wherein the multiple content items in different languages of each individual obtained one or more sources of potential translation pairs are related to the same target;
wherein the at least two language snippets for each potential translation pair are from different ones of the multiple content items of one of the obtained one or more sources of potential translation pairs and are in different languages; and
wherein the identified domain for which the machine translation engine is to be trained is a social media domain.
5. The method of claim 4, wherein the obtained one or more sources of potential translation pairs comprise multiple content items that are linked to the same social graph node.
6. The method of claim 4, wherein the obtained one or more sources of potential translation pairs comprise multiple content items that contain the same URL target.
7. The method of claim 4, wherein applying the one of the one or more automated filtering techniques comprises eliminating from consideration an unlikely potential translation pair by:
determining, for each of the multiple content items of the unlikely potential translation pair, a corresponding time indicator specifying when that content item was created or published;
computing that the determined time indicators are not within a specified time window threshold; and
in response to computing that the time indicators are not within the specified time window threshold, eliminating from consideration the unlikely potential translation pair.
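As an illustration only, and not part of the claims, the time-window filter recited in claim 7 might look like the following Python sketch. The `ContentItem` type, its field names, and the 24-hour default window are hypothetical; the claim requires only a time indicator per content item and a specified threshold.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ContentItem:
    """Hypothetical content item: its text plus a creation/publication time."""
    text: str
    created: datetime


def outside_time_window(items, window: timedelta) -> bool:
    """Return True when the items' time indicators span more than the window."""
    times = [item.created for item in items]
    return max(times) - min(times) > window


def filter_by_time_window(pairs, window: timedelta = timedelta(hours=24)):
    """Eliminate candidate pairs whose content items are too far apart in time."""
    return [pair for pair in pairs if not outside_time_window(pair, window)]
```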
8. The method of claim 4, wherein applying the one of the one or more automated filtering techniques comprises:
dividing a first content item of the multiple content items into a first group of sentences;
dividing a second content item of the multiple content items into a second group of sentences;
receiving an identification of a particular segment length;
dividing each sentence of the first group of sentences into a third group of consecutive term segments each segment of length no greater than the particular segment length;
dividing each sentence of the second group of sentences into a fourth group of consecutive term segments each segment of length no greater than the identified segment length;
finding at least one segment match between a particular segment of the third group of consecutive term segments and a particular segment of the fourth group of consecutive term segments by determining that a specified threshold number of terms between the particular segment of the third group and the particular segment of the fourth group are translations of each other; and
in response to the finding of at least one segment match, generating as the one or more potential translation pairs each permutation of sentence pairs where one sentence of each sentence pair is selected from the first group of sentences and the other sentence of each sentence pair is selected from the second group of sentences.
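As an illustration only, and not part of the claims, the segment-matching steps of claim 8 might be sketched in Python as follows. Several choices here are assumptions: sentences are split on terminal punctuation, segments are sliding windows of consecutive terms (the claim says only that each segment is no longer than the given length), and "translations of each other" is decided with a hypothetical one-to-one bilingual `lexicon` dictionary.

```python
import re
from itertools import product


def split_sentences(text: str):
    """Split a content item into sentences (assumption: ., ! or ? ends one)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def segments(sentence: str, max_len: int = 3):
    """Return consecutive-term segments, each of length at most max_len."""
    terms = re.findall(r"\w+", sentence.lower())
    if len(terms) <= max_len:
        return [tuple(terms)] if terms else []
    return [tuple(terms[i:i + max_len]) for i in range(len(terms) - max_len + 1)]


def segments_match(seg_a, seg_b, lexicon, min_terms: int = 2) -> bool:
    """True when at least min_terms terms of seg_a translate to terms of seg_b."""
    translated = sum(1 for term in seg_a if lexicon.get(term) in seg_b)
    return translated >= min_terms


def candidate_sentence_pairs(item_a, item_b, lexicon, max_len=3, min_terms=2):
    """If any segment pair matches, pair every sentence of A with every sentence of B."""
    sents_a, sents_b = split_sentences(item_a), split_sentences(item_b)
    segs_a = [seg for sent in sents_a for seg in segments(sent, max_len)]
    segs_b = [seg for sent in sents_b for seg in segments(sent, max_len)]
    if any(segments_match(a, b, lexicon, min_terms) for a in segs_a for b in segs_b):
        return list(product(sents_a, sents_b))
    return []
```

Note that a single segment match admits every cross-item sentence pairing as a potential translation pair; deciding which pairings are actual translations is left to the later selection step.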
9. The method of claim 8, wherein the received identification of the particular segment length identifies a segment length of three terms.
10. The method of claim 1:
wherein the obtained one or more sources of potential translation pairs comprise multiple content items that are generated by the same author;
wherein the at least two language snippets for each potential translation pair are from different ones of the multiple content items, are in different languages, and were published within a time window of each other; and
wherein the identified domain for which the machine translation engine is to be trained is a social media domain.
11. The method of claim 1, wherein applying the one of the one or more automated filtering techniques comprises applying smoothing to at least one of the obtained one or more sources of potential translation pairs by:
identifying one or more language classifications for at least one term in the one or more obtained sources of potential translation pairs as a mistaken classification; and
changing the classification for the at least one term to a language classification of an adjacent term.
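As an illustration only, and not part of the claims, the smoothing of claim 11 might be sketched in Python as follows. The rule used to flag a classification as mistaken (a term whose two neighbors agree with each other but not with it) is an assumption of this sketch; the claim does not say how a mistaken classification is identified.

```python
def smooth_language_tags(tags):
    """Reassign isolated per-term language tags to the neighbors' language.

    tags: one language code per term, e.g. ["en", "en", "es", "en"].
    Assumption: a tag is "mistaken" when both adjacent tags agree with each
    other and differ from it. Neighbors are read from the original tags.
    """
    smoothed = list(tags)
    for i in range(1, len(tags) - 1):
        prev_tag, next_tag = tags[i - 1], tags[i + 1]
        if prev_tag == next_tag and tags[i] != prev_tag:
            smoothed[i] = prev_tag
    return smoothed
```

Smoothing of this kind keeps a single misclassified word, such as a name or loanword, from splitting one language snippet into several.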
12. The method of claim 1, wherein applying the one of the one or more automated filtering techniques comprises:
receiving an identification of one or more desired languages;
for at least one selected language snippet of the at least two language snippets of each potential translation pair, identifying a language for the at least one selected language snippet; and
determining that the identified language for the at least one selected language snippet is one of the one or more desired languages.
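As an illustration only, and not part of the claims, the desired-language check of claim 12 might be sketched in Python as follows. The stopword-based `identify_language` function and its tiny word lists are toy stand-ins for a real language identifier, and requiring every snippet of a pair to be in a desired language is a choice made for this sketch (the claim requires the check for at least one selected snippet).

```python
# Toy stopword lists; a real system would use a trained language identifier.
STOPWORDS = {
    "en": {"the", "is", "and", "of"},
    "es": {"el", "es", "y", "los"},
    "fr": {"le", "est", "et", "les"},
}


def identify_language(snippet: str):
    """Return the language whose stopwords overlap the snippet most, or None."""
    terms = set(snippet.lower().split())
    scores = {lang: len(terms & words) for lang, words in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def in_desired_languages(pair, desired) -> bool:
    """True when every snippet of the pair is identified as a desired language."""
    return all(identify_language(snippet) in desired for snippet in pair)
```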
13. The method of claim 1, wherein the extracted characteristics comprise data to compute, between the two language snippets, two or more of:
a ratio of a number of words;
an IBM score, maximum fertility, a number of covered words, a length of a longest sequence of covered words, a length of a longest sequence of not-covered words;
a set of three top fertility values;
a maximal number of consequent source words which have corresponding consequent target words; or
a maximum number of consequent not-covered words.
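As an illustration only, and not part of the claims, a few of the characteristics listed in claim 13 might be computed as in the following Python sketch. It treats a source word as "covered" when a hypothetical bilingual `lexicon` maps it to a word present in the target snippet; that definition is an assumption. The IBM score and fertility values depend on a word-alignment model and are left out.

```python
def longest_run(flags, value: bool) -> int:
    """Length of the longest run of consecutive entries equal to value."""
    best = run = 0
    for flag in flags:
        run = run + 1 if flag == value else 0
        best = max(best, run)
    return best


def extract_characteristics(source: str, target: str, lexicon: dict) -> dict:
    """Compute simple coverage-based characteristics for a snippet pair."""
    src_terms = source.lower().split()
    tgt_terms = target.lower().split()
    tgt_set = set(tgt_terms)
    # Assumption: a source word is covered if its lexicon entry is in the target.
    covered = [lexicon.get(word) in tgt_set for word in src_terms]
    return {
        "word_ratio": len(src_terms) / max(len(tgt_terms), 1),
        "covered_words": sum(covered),
        "longest_covered_run": longest_run(covered, True),
        "longest_not_covered_run": longest_run(covered, False),
    }
```

Values such as these could then be compared against thresholds, or passed to a trained classifier, to decide whether the two snippets are translations of each other.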
14. The method of claim 1, wherein training the machine translation engine comprises assigning to an in-domain machine translation engine a classification according to a type for sources of translation pairs used to train that machine translation engine.
15. A non-transitory computer-readable medium storing instructions that, when executed by a computing system, cause the computing system to perform operations for mining translation pairs for training in-domain machine translation engines, the operations comprising:
obtaining one or more sources of potential translation pairs comprising one or more content items, wherein the one or more sources of potential translation pairs are in an identified domain for which a machine translation engine is to be trained;
generating one or more potential translation pairs from the obtained one or more sources of potential translation pairs by applying one or more automated filtering techniques to the obtained one or more sources of potential translation pairs, wherein one of the one or more automated filtering techniques applied to a selected obtained source of potential translation pairs is configured based on a type of the selected obtained source of potential translation pairs, and wherein each of the one or more potential translation pairs comprises at least two language snippets;
selecting at least one actual translation pair from the generated one or more potential translation pairs, said selecting comprising:
extracting characteristics from each of the two language snippets of at least one of the one or more potential translation pairs;
determining that the two language snippets of the at least one of the one or more potential translation pairs are translations of each other by comparing the extracted characteristics; and
training the machine translation engine using the selected at least one actual translation pair.
16. The computer-readable medium of claim 15, wherein:
the obtained one or more sources of potential translation pairs comprise single content items that contain multiple languages;
each of the at least two language snippets for each potential translation pair is a portion of one of the single content items;
each of the at least two language snippets for each potential translation pair comprises two or more consecutive words for which a particular language has been identified; and
applying the one of the one or more automated filtering techniques comprises eliminating from consideration an unlikely potential translation pair of the one or more potential translation pairs by:
determining a first count of terms in a first of the at least two language snippets of the unlikely potential translation pair;
determining a second count of terms in a second of the at least two language snippets of the unlikely potential translation pair;
computing that a ratio of terms between the first count of terms and the second count of terms is beyond a specified threshold value; and
in response to the computing that the ratio of terms is beyond the specified threshold value, eliminating from consideration the unlikely potential translation pair.
17. The computer-readable medium of claim 15:
wherein each of the obtained one or more sources of potential translation pairs comprise multiple content items in different languages;
wherein the multiple content items in different languages of each individual obtained one or more sources of potential translation pairs are related to the same target URL or social graph node; and
wherein the at least two language snippets for each potential translation pair are from different ones of the multiple content items of one of the obtained one or more sources of potential translation pairs and are in different languages.
18. The computer-readable medium of claim 17, wherein applying the one of the one or more automated filtering techniques comprises:
dividing a first content item of the multiple content items into a first group of sentences;
dividing a second content item of the multiple content items into a second group of sentences;
receiving an identification of a particular segment length;
dividing each sentence of the first group of sentences into a third group of consecutive term segments each segment of length no greater than the particular segment length;
dividing each sentence of the second group of sentences into a fourth group of consecutive term segments each segment of length no greater than the identified segment length;
finding at least one segment match between a particular segment of the third group of consecutive term segments and a particular segment of the fourth group of consecutive term segments by determining that a specified threshold number of terms between the particular segment of the third group and the particular segment of the fourth group are translations of each other; and
in response to the finding of at least one segment match, generating as the one or more potential translation pairs each permutation of sentence pairs where one sentence of each sentence pair is selected from the first group of sentences and the other sentence of each sentence pair is selected from the second group of sentences.
19. A computing system for mining in-domain translation pairs comprising:
one or more processors;
a memory;
a potential translation pair finder configured to:
obtain one or more sources of potential translation pairs comprising one or more content items, wherein the one or more sources of potential translation pairs are in an identified domain for which a machine translation engine is to be trained; and
generate one or more potential translation pairs from the obtained one or more sources of potential translation pairs by applying one or more automated filtering techniques to the obtained one or more sources of potential translation pairs, wherein one of the one or more automated filtering techniques applied to a selected obtained source of potential translation pairs is configured based on a type of the selected obtained source of potential translation pairs, and wherein the one or more potential translation pairs each comprise at least two language snippets; and
an actual pair analyzer configured to select at least one actual translation pair from the generated one or more potential translation pairs by extracting characteristics from each of the two language snippets of at least one of the one or more potential translation pairs; and determining that the two language snippets of the at least one of the one or more potential translation pairs are translations of each other by comparing the extracted characteristics.
Specification