Systems and methods for identifying parallel documents and sentence fragments in multilingual document collections
Abstract
Systems, computer programs, and methods for identifying parallel documents and/or fragments in a bilingual collection are provided. The method for identifying parallel sub-sentential fragments in a bilingual collection comprises translating a source document from a bilingual collection. The method further includes querying a target library associated with the bilingual collection using the translated source document, and identifying one or more target documents based on the query. Subsequently, a source sentence associated with the source document is aligned to one or more target sentences associated with the one or more target documents. Finally, the method includes determining whether a source fragment associated with the source sentence comprises a parallel translation of a target fragment associated with the one or more target sentences.
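The translate-then-query stage described in the abstract can be sketched as follows. This is only an illustration under stated assumptions: the word-for-word lexicon lookup stands in for the translation step, cosine similarity over word counts stands in for the query engine's ranking (the abstract does not name a similarity measure), and every function and parameter name here is hypothetical.

```python
from collections import Counter
from math import sqrt


def translate_words(tokens, lexicon):
    """Gloss each source word with its lexicon entry (assumed stand-in for translation).

    Words missing from the lexicon are dropped from the query.
    """
    return [lexicon[t] for t in tokens if t in lexicon]


def cosine(a, b):
    """Cosine similarity between two bags of words held as Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_candidates(source_tokens, target_library, lexicon, n=2):
    """Return ids of up to n target documents most similar to the translated source."""
    query = Counter(translate_words(source_tokens, lexicon))
    scored = [
        (cosine(query, Counter(doc_tokens)), doc_id)
        for doc_id, doc_tokens in target_library.items()
    ]
    # Highest score first; ties broken by document id for a stable result.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored[:n] if score > 0]
```

Keeping only a small, fixed number of candidates per source document limits how many sentence alignments the later stages have to attempt.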
19 Claims
1. A method for identifying parallel sub-sentential fragments in a bilingual collection of documents, the method comprising:
translating a source document in the bilingual collection of documents using a processor configured to perform statistical machine translation;
querying a target library associated with the bilingual collection of documents using the translated source document using a query engine of the processor;
identifying a plurality of target documents in the target library that are most similar to the translated source document, based on the query, using the query engine;
aligning a source sentence associated with the source document to one or more target sentences associated with each of the plurality of identified target documents to generate one or more aligned sentence pairs, using a document selector engine of the processor;
discarding an aligned sentence pair based on a number of words in the sentence pair that are translations of each other, using the document selector engine; and
determining for each of the aligned sentence pairs that have not been discarded whether a source fragment in the source sentence comprises a parallel translation of a target fragment in the target sentence based on a number of words in the source fragment that are translations of words in the target fragment, the determining performed using a parallel document engine of the processor and comprising;
determining a percentage of words that are aligned in each sentence pair of the target document using greedily linking,
determining a number of sentence pairs in the target document that are parallel sentence pairs based on the percentage of words that are aligned in each sentence pair, and
selecting the target document as being a parallel translation of the source document according to a number of parallel sentence pairs in the target document.
Dependent claims: 2, 3, 4, 5
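The last three determining steps of claim 1 (percentage of aligned words by greedy linking, count of parallel sentence pairs, document selection) can be illustrated with a short sketch. It assumes one common reading of greedy linking: candidate word pairs are taken in descending score order and a pair is linked only if neither word is already linked. The pair-scored lexicon, the way the aligned fraction is normalized, both thresholds, and all function names are assumptions made for illustration; the claim does not fix any of them.

```python
def greedy_link(src, tgt, lexicon):
    """Greedily link words one-to-one, best-scoring candidate pairs first.

    lexicon maps (source_word, target_word) to a translation score (assumed).
    Returns a list of (source_index, target_index) links.
    """
    candidates = [
        (lexicon[(s, t)], i, j)
        for i, s in enumerate(src)
        for j, t in enumerate(tgt)
        if (s, t) in lexicon
    ]
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))
    used_src, used_tgt, links = set(), set(), []
    for _score, i, j in candidates:
        if i not in used_src and j not in used_tgt:
            used_src.add(i)
            used_tgt.add(j)
            links.append((i, j))
    return links


def aligned_fraction(src, tgt, lexicon):
    """Fraction of all words in the sentence pair that take part in a link."""
    total = len(src) + len(tgt)
    if total == 0:
        return 0.0
    return 2 * len(greedy_link(src, tgt, lexicon)) / total


def count_parallel_pairs(sentence_pairs, lexicon, min_fraction=0.5):
    """Count sentence pairs whose aligned fraction reaches min_fraction (assumed threshold)."""
    return sum(
        1 for src, tgt in sentence_pairs
        if aligned_fraction(src, tgt, lexicon) >= min_fraction
    )


def is_parallel_document(sentence_pairs, lexicon, min_fraction=0.5, min_pairs=2):
    """Select the target document when enough of its sentence pairs are parallel."""
    return count_parallel_pairs(sentence_pairs, lexicon, min_fraction) >= min_pairs
```

Because each word can be linked at most once, a high-scoring link such as ("chat", "cat") blocks a weaker competing link such as ("chat", "the"), which keeps the aligned fraction from being inflated by many-to-one matches.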
6. A computer program embodied on a non-transitory computer readable medium having instructions for identifying parallel sub-sentential fragments in a bilingual collection, the instructions comprising the steps:
translating a source document from a bilingual collection;
querying a target library associated with the bilingual collection using the translated source document;
identifying one or more target documents based on the query;
aligning a source sentence associated with the source document to one or more target sentences associated with the one or more target documents; and
determining whether a source fragment associated with the source sentence comprises a parallel translation of a target fragment associated with the one or more target sentences, the determination comprising;
determining a percentage of words that are aligned in each sentence pair of the target document using greedily linking,
determining a number of sentence pairs in the target document that are parallel sentence pairs based on the percentage of words that are aligned in each sentence pair, and
selecting the target document as being a parallel translation of the source document according to a number of parallel sentence pairs in the target document.
Dependent claims: 7, 8, 9, 10
11. A method for identifying parallel documents in a bilingual collection of documents, the method comprising:
translating a source document from a bilingual collection of documents using a processor configured to perform statistical machine translation;
querying a target library associated with the bilingual collection of documents using the translated source document using a query engine of the processor;
identifying a predetermined limited number one or more target documents in the target library that are most similar to the translated source document, based on the query using the query engine;
aligning one or more source sentence associated with the source document to one or more target sentences associated with each of the one or more identified target documents to generate one or more aligned sentence pairs, the aligning performed using a document selector engine of the processor;
discarding each of the one or more identified target documents that does not have a specified number of sentence pairs that can be aligned, the discarding performed using the document selector engine;
determining for each of the one or more identified target documents a number of aligned sentence pairs that are translations of each other, the determining performed using the document selector engine; and
determining whether the source document comprises a parallel translation of one of the one or more target documents, the determining performed using a parallel document engine of the processor and comprising;
determining a percentage of words that are aligned in each sentence pair of the target document using greedily linking,
determining a number of sentence pairs in the target document that are parallel sentence pairs based on the percentage of words that are aligned in each sentence pair, and
selecting the target document as being a parallel translation of the source document according to a number of parallel sentence pairs in the target document.
Dependent claims: 12, 13, 14
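The aligning and discarding steps of claim 11 can be pictured with the sketch below. It is a rough illustration, not the claimed method: pairing each source sentence with the target sentence that shares the most dictionary translations is an assumed alignment heuristic, and `min_overlap`, `min_pairs`, and the function names are hypothetical.

```python
def translated_overlap(src, tgt, lexicon):
    """Count source words whose lexicon translation occurs in the target sentence."""
    tgt_words = set(tgt)
    return sum(1 for word in src if lexicon.get(word) in tgt_words)


def align_sentences(src_sents, tgt_sents, lexicon, min_overlap=2):
    """Pair each source sentence with its best-overlapping target sentence.

    A pair is kept only if at least min_overlap words are translations (assumed rule).
    """
    pairs = []
    if not tgt_sents:
        return pairs
    for src in src_sents:
        best = max(tgt_sents, key=lambda tgt: translated_overlap(src, tgt, lexicon))
        if translated_overlap(src, best, lexicon) >= min_overlap:
            pairs.append((src, best))
    return pairs


def keep_candidates(src_sents, candidates, lexicon, min_pairs=2, min_overlap=2):
    """Discard candidate documents with fewer than min_pairs alignable sentence pairs."""
    kept = {}
    for doc_id, tgt_sents in candidates.items():
        pairs = align_sentences(src_sents, tgt_sents, lexicon, min_overlap)
        if len(pairs) >= min_pairs:
            kept[doc_id] = pairs
    return kept
```

Filtering at the document level first means the more expensive word-level linking only runs on candidates that already show some sentence-level correspondence.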
15. A system for identifying parallel documents in a bilingual collection, the system comprising:
a word translator engine stored in a memory, executable by a processor, and configured to translate a source document from a bilingual collection;
a query engine configured to query a target library associated with the bilingual collection using the translated source document and identify one or more target documents based on the query;
a document selector engine configured to align a source sentence associated with the source document to one or more target sentences associated with the one or more target documents; and
a parallel document engine configured to determine whether the source document comprises a parallel translation of one of the one or more the target documents, the parallel document engine further configured to;
determining a percentage of words that are aligned in each sentence pair of the target document using greedily linking,
determining a number of sentence pairs in the target document that are parallel sentence pairs based on the percentage of words that are aligned in each sentence pair, and
selecting the target document as being a parallel translation of the source document according to a number of parallel sentence pairs in the target document.
Dependent claims: 16, 17, 18, 19
Specification