METHODS FOR AUTOMATIC GENERATION OF PARALLEL CORPORA
First Claim
1. A computer implemented method comprising:
receiving sets of item listings in a first language and sets of item listings in a second language, each of the item listings in the sets of item listings comprising one or more descriptions and metadata;
collecting the metadata from the sets of item listings and aligning the sets of item listings using the metadata;
mapping the aligned sets of item listings from the first language to the second language for each of the sets of item listings;
fetching the descriptions of the mapped aligned sets of item listings and measuring the structural similarity of the fetched descriptions of the mapped aligned sets of item listings to assess whether mapped aligned sets of item listings are likely to be translations of each other, and
for pairs of mapped aligned sets of item listings having structurally similar descriptions, forming the descriptions of the mapped aligned sets of item listings into respective sentences in the first language and in the second language.
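The claim does not say how structural similarity is measured. As a minimal sketch only, the Python below assumes two language-independent cues: the ratio of non-empty line counts and the overlap of numeric tokens between the two descriptions. The function names, the equal weighting, and the 0.8 threshold are hypothetical choices for illustration, not taken from the patent.

```python
import re

NUMBER = re.compile(r"\d+")


def structural_similarity(desc_a: str, desc_b: str) -> float:
    """Return a score in [0, 1] from two assumed structural cues.

    Cue 1: ratio of non-empty line counts (shorter / longer).
    Cue 2: Jaccard overlap of the numeric tokens in each description.
    Both cues and their equal weighting are illustrative assumptions.
    """
    lines_a = [line for line in desc_a.splitlines() if line.strip()]
    lines_b = [line for line in desc_b.splitlines() if line.strip()]
    if not lines_a or not lines_b:
        return 0.0
    line_ratio = min(len(lines_a), len(lines_b)) / max(len(lines_a), len(lines_b))

    nums_a = set(NUMBER.findall(desc_a))
    nums_b = set(NUMBER.findall(desc_b))
    union = nums_a | nums_b
    # If neither description contains numbers, treat this cue as neutral.
    number_overlap = len(nums_a & nums_b) / len(union) if union else 1.0

    return 0.5 * line_ratio + 0.5 * number_overlap


def likely_translations(desc_a: str, desc_b: str, threshold: float = 0.8) -> bool:
    """Apply a hypothetical cut-off to the similarity score."""
    return structural_similarity(desc_a, desc_b) >= threshold


english = "Laptop 15 inch\n8 GB RAM\n256 GB SSD"
german = "Laptop 15 Zoll\n8 GB Arbeitsspeicher\n256 GB SSD"
unrelated = "Red cotton shirt, size 42"
print(likely_translations(english, german))
print(likely_translations(english, unrelated))
```

Cues like these never look at the words themselves, so the check can run before any translation resource exists. That is the situation in which a parallel corpus still has to be built.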
Abstract
A method of forming parallel corpora comprises receiving sets of items in a first language and in a second language, each of the sets having one or more associated descriptions and metadata. The metadata is collected from the two sets of items, and the sets are aligned using the metadata. The aligned sets are mapped from the first language to the second language. The descriptions of two mapped items are fetched, and the structural similarity of the descriptions is measured to assess whether the two items are likely to be translations of each other. For mapped items with structurally similar descriptions, the descriptions are formed into respective sentences in the first language and in the second language. These sentences form parallel corpora, which may be used to translate an item from the first language to the second language and to train a machine translation system.
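The flow described in the abstract can be sketched end to end in Python. This is an illustration under stated assumptions, not the patented implementation: the `Listing` class, the `product_id` metadata key, and the "same number of non-empty lines" structural check are all hypothetical. The alignment and mapping steps are also collapsed into a single dictionary lookup.

```python
from dataclasses import dataclass


@dataclass
class Listing:
    language: str
    description: str
    metadata: dict  # e.g. {"product_id": "1"}; the key name is an assumption


def align_by_metadata(first, second, key="product_id"):
    """Pair listings across the two languages that share a metadata value."""
    index = {item.metadata[key]: item for item in second if key in item.metadata}
    return [
        (item, index[item.metadata[key]])
        for item in first
        if item.metadata.get(key) in index
    ]


def split_lines(text):
    """Return the non-empty, stripped lines of a description."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def build_parallel_corpus(first, second):
    """Return (first-language, second-language) sentence pairs."""
    corpus = []
    for a, b in align_by_metadata(first, second):
        lines_a = split_lines(a.description)
        lines_b = split_lines(b.description)
        # Stand-in structural check (assumption): equal non-empty line counts.
        if lines_a and len(lines_a) == len(lines_b):
            corpus.extend(zip(lines_a, lines_b))
    return corpus


english = [
    Listing("en", "Blue mug\nHolds 300 ml", {"product_id": "1"}),
    Listing("en", "Red shirt", {"product_id": "2"}),
]
german = [
    Listing("de", "Blaue Tasse\nFasst 300 ml", {"product_id": "1"}),
    Listing("de", "Rotes Hemd\nBaumwolle\nMaschinenwaschbar", {"product_id": "2"}),
]
corpus = build_parallel_corpus(english, german)
print(corpus)
```

The second pair of listings shares a metadata value but fails the structural check, so it contributes nothing to the corpus. This matches the abstract's point that metadata alignment alone does not establish that two descriptions are translations.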
20 Claims
1. A computer implemented method comprising:
receiving sets of item listings in a first language and sets of item listings in a second language, each of the item listings in the sets of item listings comprising one or more descriptions and metadata;
collecting the metadata from the sets of item listings and aligning the sets of item listings using the metadata;
mapping the aligned sets of item listings from the first language to the second language for each of the sets of item listings;
fetching the descriptions of the mapped aligned sets of item listings and measuring the structural similarity of the fetched descriptions of the mapped aligned sets of item listings to assess whether mapped aligned sets of item listings are likely to be translations of each other, and
for pairs of mapped aligned sets of item listings having structurally similar descriptions, forming the descriptions of the mapped aligned sets of item listings into respective sentences in the first language and in the second language.
Dependent claims: 2, 3, 4, 5, 6, 7
8. One or more computer-readable hardware storage devices having embedded therein a set of instructions which, when executed by one or more processors of a computer, cause the computer to execute operations comprising:
receiving sets of item listings in a first language and sets of item listings in a second language, each of the item listings in the sets of item listings comprising one or more descriptions and metadata;
collecting the metadata from the sets of item listings and aligning the sets of item listings using the metadata;
mapping the aligned sets of item listings from the first language to the second language for each of the sets of item listings;
fetching the descriptions of the mapped aligned sets of item listings and measuring the structural similarity of the fetched descriptions of the mapped aligned sets of item listings to assess whether mapped aligned sets of item listings are likely to be translations of each other; and
for pairs of mapped aligned sets of item listings having structurally similar descriptions, forming the descriptions of the mapped aligned sets of item listings into respective sentences in the first language and in the second language.
Dependent claims: 9, 10, 11, 12, 13, 14
15. One or more hardware processors configured to include:
a receiver module to receive sets of item listings in a first language and sets of item listings in a second language, each of the item listings in the sets of item listings comprising one or more descriptions and metadata;
a metadata collection module to collect the metadata from the sets of item listings and align the sets of item listings using the metadata;
a mapping module to map the aligned sets of item listings from the first language to the second language for each of the sets of item listings;
a description fetch module to fetch the descriptions of the mapped aligned sets of item listings and measure the structural similarity of the fetched descriptions of the mapped aligned sets of item listings to assess whether mapped aligned sets of item listings are likely to be translations of each other; and
a sentence formation module to form, for pairs of mapped aligned sets of item listings having structurally similar descriptions, the descriptions of the mapped aligned sets of item listings into respective sentences in the first language and in the second language.
Dependent claims: 16, 17, 18, 19, 20
Specification