Methods for automatic generation of parallel corpora
Abstract
A method of forming parallel corpora comprises receiving sets of items in a first language and a second language, each of the sets having one or more associated descriptions and metadata. The metadata is collected from the two sets of items, and the sets are aligned using the metadata. The aligned metadata are mapped from the first language to the second language for each of the sets. The descriptions of two items are fetched, and the structural similarity of the descriptions is measured to assess whether the two items are likely to be translations of each other. For mapped items with structurally similar descriptions, the mapped item descriptions are formed into respective sentences in the first language and in the second language. The sentences are parallel corpora, which may be used to translate an item from the first language to the second language and to train a machine translation system.
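As an illustration of the metadata-based alignment step, the following minimal sketch pairs listings from two language sets that share a language-independent metadata value. The `Listing` record, its field names, and the `product_key` field are assumptions made for this example; the abstract does not specify which metadata is used or how it is stored.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Listing:
    """Hypothetical listing record; all field names are illustrative only."""
    listing_id: str
    language: str
    product_key: str       # assumed language-independent metadata, e.g. a catalog ID
    description_html: str


def align_by_metadata(first, second):
    """Pair listings from two language sets that share the same product_key."""
    by_key = defaultdict(list)
    for listing in second:
        by_key[listing.product_key].append(listing)
    pairs = []
    for listing in first:
        for candidate in by_key.get(listing.product_key, []):
            pairs.append((listing, candidate))
    return pairs


english = [
    Listing("e1", "en", "SKU-1", "<ul><li>Red mug</li></ul>"),
    Listing("e2", "en", "SKU-2", "<p>Blue shirt</p>"),
]
german = [Listing("d1", "de", "SKU-1", "<ul><li>Rote Tasse</li></ul>")]

aligned = align_by_metadata(english, german)
```

Here only the two `SKU-1` listings are paired. In the method described above, each aligned pair would then be passed to the structural similarity check before any sentences are formed.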
20 Claims
1. A computer implemented method comprising:
obtaining a first item listing that is posted by a first seller in a first language and that is related to selling a particular item that is a product, a service, or a combination of a product and service;
obtaining a second item listing that is posted by a second seller in a second language and that is also related to selling the particular item;
aligning the first item listing with the second item listing in response to the first item listing and the second item listing both being related to selling the same item;
identifying a first organizational structure with respect to first hierarchal relationships between first hypertext markup language (HTML) tags of first HTML code of a first description of the first item listing;
identifying a second organizational structure with respect to second hierarchal relationships between second HTML tags of second HTML code of a second description of the second item listing;
measuring, based on the aligning of the first item listing with the second item listing, an organizational structural similarity of the first HTML code with respect to the second HTML code by comparing the first organizational structure against the second organizational structure, the comparing including comparing the first hierarchal relationships against the second hierarchal relationships by comparing first nodes and first edges of a first tree that represents the first hierarchal relationships against second nodes and second edges of a second tree that represents the second hierarchal relationships; and
in response to the first HTML code and the second HTML code being determined as being organizationally structurally similar based on the measured organizational structural similarity, forming the first description into a first sentence in the first language as a translation of the second description into the first language.
8. A system comprising:
one or more processors;
a memory to store instructions that, in response to being executed by the one or more processors, cause the system to perform operations comprising:
obtaining a first item listing that is posted by a first seller in a first language and that is related to selling a particular item that is a product, a service, or a combination of a product and service;
obtaining a second item listing that is posted by a second seller in a second language and that is also related to selling the particular item;
aligning the first item listing with the second item listing in response to the first item listing and the second item listing both being related to selling the same item;
measuring, based on the aligning of the first item listing with the second item listing, an organizational structural similarity of first hypertext markup language (HTML) code of a first description of the first item listing with respect to second HTML code of a second description of the second item listing, the measuring of the organizational structural similarity including comparing first hierarchal connections between first HTML tags of the first HTML code against second hierarchal connections between second HTML tags of the second HTML code;
in response to the first HTML code and the second HTML code being determined as being organizationally structurally similar based on the measured organizational structural similarity, forming the first description into a first sentence in the first language and forming the second description into a second sentence in the second language in which the first sentence and the second sentence are parallel corpora; and
using the parallel corpora to perform one or more operations selected from a group of operations consisting of:
translating another item listing from the first language to the second language; and
training a machine translation system.
15. One or more non-transitory computer-readable media embodying instructions that, in response to being executed by one or more processors of a system, cause the system to perform operations comprising:
obtaining a first item listing that is posted by a first seller in a first language and that is related to selling a particular item that is a product, a service, or a combination of a product and service;
obtaining a second item listing that is posted by a second seller in a second language and that is also related to selling the particular item;
measuring an organizational structural similarity of first hypertext markup language (HTML) code of a first description of the first item listing with respect to second HTML code of a second description of the second item listing in response to the first item listing and the second item listing both being related to selling the same item, the measuring of the organizational structural similarity including comparing first nodes and first edges of a first tree that represents the first HTML code against second nodes and second edges of a second tree that represents the second HTML code; and
in response to the first HTML code and the second HTML code being determined as being organizationally structurally similar based on the measured organizational structural similarity, forming the first description into a first sentence in the first language as a translation of the second description into the first language.
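The node-and-edge comparison of HTML tag trees recited in the claims can be illustrated with a short sketch. This is one possible reading under stated assumptions, not the patented implementation: nodes are modeled as (tag, depth) pairs, edges as (parent tag, child tag) pairs, and similarity is the average of two multiset Jaccard scores. The names `TagTreeCollector`, `tag_tree`, `multiset_jaccard`, and `structural_similarity` are hypothetical, and the claims do not specify a particular similarity measure or threshold.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never take a closing tag, so they are not pushed on the open-tag stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagTreeCollector(HTMLParser):
    """Collects tag-tree nodes (tag, depth) and edges (parent tag, child tag)."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.nodes = Counter()
        self.edges = Counter()

    def handle_starttag(self, tag, attrs):
        parent = self.stack[-1] if self.stack else "#root"
        self.nodes[(tag, len(self.stack))] += 1
        self.edges[(parent, tag)] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray closers.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_tree(html):
    """Return (nodes, edges) multisets for the tag hierarchy of an HTML string."""
    collector = TagTreeCollector()
    collector.feed(html)
    collector.close()
    return collector.nodes, collector.edges


def multiset_jaccard(a, b):
    """Jaccard similarity on Counters: |intersection| / |union| (1.0 if both empty)."""
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 1.0


def structural_similarity(first_html, second_html):
    """Average of node similarity and edge similarity (an assumed scoring rule)."""
    nodes_a, edges_a = tag_tree(first_html)
    nodes_b, edges_b = tag_tree(second_html)
    return 0.5 * (multiset_jaccard(nodes_a, nodes_b)
                  + multiset_jaccard(edges_a, edges_b))


en = "<div><h2>Mug</h2><ul><li>Red</li><li>Ceramic</li></ul></div>"
de = "<div><h2>Tasse</h2><ul><li>Rot</li><li>Keramik</li></ul></div>"
other = "<table><tr><td>Shirt</td></tr></table>"

same_layout = structural_similarity(en, de)
different_layout = structural_similarity(en, other)
```

Because the text content is ignored, the English and German descriptions with identical markup score 1.0, while the table-based description shares no nodes or edges with the list-based one and scores 0.0. A deployment would compare the score against a threshold chosen for its data; any such threshold is an assumption outside the claim text.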
Specification