Discovery of parallel text portions in comparable collections of corpora and training using comparable texts
Abstract
A translation training device extracts a set of parallel sentences from two nonparallel corpora. The system computes parameters relating different sentences or phrases in order to identify which of them are parallel. The parallel sentences are then used to train a data-driven machine translation system. The process can be applied repeatedly until sufficient data is collected or until the performance of the translation system stops improving.
29 Claims
1. A method, comprising:
obtaining a collection of texts which are not parallel texts;
determining sentence portions within the collection of texts, whose meaning is substantially the same, by comparing a plurality of sentence portions within the collection of texts, and determining at least one parameter indicative of a sentence portion in the first document and a sentence portion in the second document, and using said at least one parameter to determine sentence portions which have similar meanings; and
using said sentence portions which have similar meanings to create training data for a machine translation system.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
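As an illustration of the kind of comparison claim 1 describes, the following minimal Python sketch computes two parameters for each candidate sentence pair and keeps the pairs that pass a threshold. The specific parameters (dictionary coverage and length ratio), the threshold values, the toy `LEXICON`, and the function names `pair_parameters` and `similar_pairs` are all assumptions made for illustration; the claim itself only requires "at least one parameter" and does not specify these.

```python
from typing import Dict, List, Tuple

# Hypothetical toy dictionary: source word -> {target word: probability}.
LEXICON: Dict[str, Dict[str, float]] = {
    "maison": {"house": 0.8, "home": 0.2},
    "rouge": {"red": 0.9},
    "la": {"the": 0.9},
    "est": {"is": 0.9},
    "chat": {"cat": 0.9},
}


def pair_parameters(src: List[str], tgt: List[str]) -> Tuple[float, float]:
    """Return (coverage, length_ratio) for one candidate sentence pair.

    coverage: fraction of source words with a dictionary translation
    that appears in the target sentence.
    length_ratio: shorter length divided by longer length.
    """
    if not src or not tgt:
        return 0.0, 0.0
    tgt_set = set(tgt)
    covered = sum(
        1 for w in src if any(t in tgt_set for t in LEXICON.get(w, {}))
    )
    coverage = covered / len(src)
    length_ratio = min(len(src), len(tgt)) / max(len(src), len(tgt))
    return coverage, length_ratio


def similar_pairs(
    src_sents: List[str],
    tgt_sents: List[str],
    min_coverage: float = 0.5,  # assumed threshold
    min_ratio: float = 0.5,     # assumed threshold
) -> List[Tuple[str, str]]:
    """Compare every source sentence with every target sentence and
    keep the pairs whose parameters pass both thresholds."""
    pairs = []
    for s in src_sents:
        for t in tgt_sents:
            cov, ratio = pair_parameters(s.split(), t.split())
            if cov >= min_coverage and ratio >= min_ratio:
                pairs.append((s, t))
    return pairs
```

For example, calling `similar_pairs(["la maison est rouge", "le chat dort"], ["the house is red", "stocks fell sharply today"])` keeps only the first source sentence paired with the first target sentence, and that pair could then be added to the training data.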
14. A method, comprising:
obtaining a first amount of parallel training data for a learning component of a machine translation system;
using the learning component of the machine translation system trained using said parallel data to determine translation parameters, including at least one probabilistic word dictionary;
using said translation parameters to extract parallel sentences from a second corpus of nonparallel data, where said second corpus is larger than a database of said parallel training data;
using said parallel sentences to create training data for said learning component of said machine translation system;
training said learning component using said training data, and iteratively re-analyzing said comparable corpus using the system thus trained;
continuing said iterative re-analyzing when training reaches a specified level.
Dependent claims: 15, 16, 17
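The iterative procedure in claim 14 can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the patented implementation: `train_dictionary` uses normalized co-occurrence counts as a crude stand-in for a real word-alignment model, `score` and its threshold are hypothetical, and the stopping rule (no new pairs found, or a maximum number of rounds) is an assumed substitute for the "specified level" named in the claim.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[str, str]
Lexicon = Dict[str, Dict[str, float]]


def train_dictionary(pairs: List[Pair]) -> Lexicon:
    """Estimate p(target word | source word) from co-occurrence counts.
    Assumption: a crude stand-in for a real alignment model."""
    counts = defaultdict(lambda: defaultdict(float))
    for src, tgt in pairs:
        for s in src.split():
            for t in tgt.split():
                counts[s][t] += 1.0
    return {
        s: {t: c / sum(row.values()) for t, c in row.items()}
        for s, row in counts.items()
    }


def score(src: str, tgt: str, lexicon: Lexicon) -> float:
    """Average over source words of the best translation probability
    into any word of the target sentence (hypothetical parameter)."""
    s_words, t_words = src.split(), tgt.split()
    if not s_words or not t_words:
        return 0.0
    total = sum(
        max(lexicon.get(s, {}).get(t, 0.0) for t in t_words)
        for s in s_words
    )
    return total / len(s_words)


def bootstrap(
    seed: List[Pair],
    src_sents: List[str],
    tgt_sents: List[str],
    threshold: float = 0.3,  # assumed value
    max_rounds: int = 5,     # assumed value
) -> List[Pair]:
    """Train on the seed data, extract pairs from the nonparallel
    corpus, retrain on the enlarged data, and repeat."""
    training = list(seed)
    for _ in range(max_rounds):
        lexicon = train_dictionary(training)
        known = set(training)
        new = [
            (s, t)
            for s in src_sents
            for t in tgt_sents
            if (s, t) not in known and score(s, t, lexicon) >= threshold
        ]
        if not new:  # assumed stopping rule: nothing more to extract
            break
        training.extend(new)
    return training
```

Each round retrains the dictionary on the enlarged training set before re-analyzing the nonparallel corpus, which is the feedback loop the claim describes. A realistic system would replace the co-occurrence estimate with a proper alignment model and would judge convergence on translation quality instead of on the count of new pairs.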
18. A computer system, comprising:
a database, storing a first collection of texts in a first language, and a second collection of texts, which are not parallel to said first collection of texts, that are in a second language;
a training processor, that processes said texts to determine portions in the first collection of texts whose meaning is substantially the same as portions within the second collection of texts, by comparing a plurality of sentences within the collection of texts, and determining at least one parameter indicative of a first portion within the first collection and a second portion within the second collection, and using said at least one parameter to determine portions which have similar meanings; and
a translation processor, using training data based on said portions which have similar meanings to translate input text between said first and second languages.
Dependent claims: 19, 20, 21, 22, 23, 24
25. A system, comprising:
a database including a first amount of parallel training data for a learning component of a machine translation system, and a second corpus of nonparallel data, said second amount greater than said first amount;
a learning processor, forming at least one probabilistic word dictionary using said parallel data and also forming training data, and using said probabilistic word dictionary to determine portions within said second corpus which have comparable meanings, and to refine said training data to form refined training data, based on said second corpus, and to re-train again, based on said second corpus and said refined training data.
Dependent claims: 26, 27, 28, 29
Specification