Systems and methods for identifying parallel documents and sentence fragments in multilingual document collections

US 8,943,080 B2
Filed: 12/05/2006
Issued: 01/27/2015
Est. Priority Date: 04/07/2006
Status: Active Grant

Abstract

Systems, computer programs, and methods for identifying parallel documents and/or fragments in a bilingual collection are provided. The method for identifying parallel sub-sentential fragments in a bilingual collection comprises translating a source document from a bilingual collection. The method further includes querying a target library associated with the bilingual collection using the translated source document, and identifying one or more target documents based on the query. Subsequently, a source sentence associated with the source document is aligned to one or more target sentences associated with the one or more target documents. Finally, the method includes determining whether a source fragment associated with the source sentence comprises a parallel translation of a target fragment associated with the one or more target sentences.
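
The translate-then-query steps in the abstract can be illustrated with a minimal sketch. It assumes a word-for-word dictionary gloss in place of the statistical machine translation the claims recite, and it uses cosine similarity over word counts as the query score; the function names, the toy lexicon, and the `top_n` parameter are all hypothetical and not taken from the patent.

```python
from collections import Counter
import math

def translate_words(tokens, lexicon):
    """Word-for-word gloss: keep the first listed translation of each known token.
    This is a stand-in for a real statistical translation step."""
    return [lexicon[t][0] for t in tokens if t in lexicon]

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def most_similar(query_tokens, library, top_n=2):
    """Return ids of the top_n target documents most similar to the translated query.
    Documents with zero similarity are never returned."""
    q = Counter(query_tokens)
    scored = [(cosine(q, Counter(doc)), doc_id) for doc_id, doc in library.items()]
    scored.sort(key=lambda x: (-x[0], x[1]))  # best score first, ties broken by id
    return [doc_id for score, doc_id in scored[:top_n] if score > 0]

# Toy data (hypothetical): a French source document and an English target library.
lexicon = {"chat": ["cat"], "noir": ["black"], "dort": ["sleeps"]}
library = {
    "d1": ["the", "black", "cat", "sleeps"],
    "d2": ["stock", "markets", "fell"],
    "d3": ["a", "cat", "runs"],
}
translated = translate_words(["le", "chat", "noir", "dort"], lexicon)
candidates = most_similar(translated, library, top_n=2)
```

Limiting the result to a small `top_n` keeps the later sentence-alignment step cheap, since only a handful of candidate target documents are examined per source document.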

19 Claims

1. A method for identifying parallel sub-sentential fragments in a bilingual collection of documents, the method comprising:
- translating a source document in the bilingual collection of documents using a processor configured to perform statistical machine translation;
  
  querying a target library associated with the bilingual collection of documents using the translated source document using a query engine of the processor;
  
  identifying a plurality of target documents in the target library that are most similar to the translated source document, based on the query, using the query engine;
  
  aligning a source sentence associated with the source document to one or more target sentences associated with each of the plurality of identified target documents to generate one or more aligned sentence pairs, using a document selector engine of the processor;
  
  discarding an aligned sentence pair based on a number of words in the sentence pair that are translations of each other, using the document selector engine; and
  
  determining for each of the aligned sentence pairs that have not been discarded whether a source fragment in the source sentence comprises a parallel translation of a target fragment in the target sentence based on a number of words in the source fragment that are translations of words in the target fragment, the determining performed using a parallel document engine of the processor and comprising:
  
    determining a percentage of words that are aligned in each sentence pair of the target document using greedily linking,
  
    determining a number of sentence pairs in the target document that are parallel sentence pairs based on the percentage of words that are aligned in each sentence pair, and
  
    selecting the target document as being a parallel translation of the source document according to a number of parallel sentence pairs in the target document.
- Dependent claims (2, 3, 4, 5):
  - 2. The method recited in claim 1 further comprising discarding a sentence pair based on a number of words that are translations of each other determined according to a coarse lexicon.
  - 3. The method recited in claim 2 further comprising assigning a translation probability to words greedily linked in the aligned source sentence, and based on a numerical value retrieved from a fine lexicon.
  - 4. The method recited in claim 3 further comprising detecting a parallel fragment based on a threshold associated with a number of essentially continuous words assigned a positive translation probability.
  - 5. The method recited in claim 1 further comprising determining whether the source document comprises a parallel translation of the target document based on a number of sentences that can be aligned in the source document and the target document.
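
The greedy linking and discarding steps of claims 1 and 2 might look roughly like the sketch below. It is an illustration under stated assumptions, not the patented procedure: the coarse lexicon is modeled as a plain mapping from a source word to a set of target words, the aligned percentage is computed over the words of both sentences, and the `min_fraction` cutoff is a hypothetical value.

```python
def greedy_link(src, tgt, lexicon):
    """Link each source word to the first still-unused target word that the
    lexicon lists as one of its translations. Returns (src_idx, tgt_idx) pairs."""
    used = set()
    links = []
    for i, s in enumerate(src):
        for j, t in enumerate(tgt):
            if j not in used and t in lexicon.get(s, ()):
                links.append((i, j))
                used.add(j)
                break  # greedy: take the first match and move on
    return links

def aligned_fraction(src, tgt, lexicon):
    """Fraction of words, counted over both sentences, that take part in a link."""
    total = len(src) + len(tgt)
    return 2 * len(greedy_link(src, tgt, lexicon)) / total if total else 0.0

def keep_pair(src, tgt, lexicon, min_fraction=0.5):
    """Discard (return False for) pairs with too few mutually translated words."""
    return aligned_fraction(src, tgt, lexicon) >= min_fraction

# Toy coarse lexicon (hypothetical).
coarse = {"chat": {"cat"}, "noir": {"black"}, "dort": {"sleeps", "sleep"}}
src = ["le", "chat", "noir", "dort"]
good = ["the", "black", "cat", "sleeps"]
bad = ["stock", "markets", "fell"]
links = greedy_link(src, good, coarse)
```

Greedy linking never revisits a link once made, so it can miss the best overall alignment, but it runs in time proportional to the product of the two sentence lengths and is adequate as a cheap filter.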

6. A computer program embodied on a non-transitory computer readable medium having instructions for identifying parallel sub-sentential fragments in a bilingual collection, the instructions comprising the steps:
- translating a source document from a bilingual collection;
  
  querying a target library associated with the bilingual collection using the translated source document;
  
  identifying one or more target documents based on the query;
  
  aligning a source sentence associated with the source document to one or more target sentences associated with the one or more target documents; and
  
  determining whether a source fragment associated with the source sentence comprises a parallel translation of a target fragment associated with the one or more target sentences, the determination comprising:
  
    determining a percentage of words that are aligned in each sentence pair of the target document using greedily linking,
  
    determining a number of sentence pairs in the target document that are parallel sentence pairs based on the percentage of words that are aligned in each sentence pair, and
  
    selecting the target document as being a parallel translation of the source document according to a number of parallel sentence pairs in the target document.
- Dependent claims (7, 8, 9, 10):
  - 7. The computer program recited in claim 6 further comprising an instruction for discarding the one or more target sentences based on a coarse lexicon.
  - 8. The computer program recited in claim 7 further comprising an instruction for assigning a translation probability to words in the aligned source sentence based on a fine lexicon.
  - 9. The computer program recited in claim 8 further comprising an instruction for detecting a parallel fragment based on a threshold associated with a number of essentially continuous words assigned a positive translation probability.
  - 10. The computer program recited in claim 6 further comprising an instruction for determining whether the source document comprises a parallel translation of the target document.
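
Claims 8 and 9 (like claims 3 and 4) describe scoring words with a fine lexicon and detecting a fragment from a run of essentially continuous positive scores. One possible reading is sketched below. The assumptions are significant: the fine lexicon is modeled as nested dictionaries of probabilities, a word with no translation in the target sentence gets a score of -1.0, "essentially continuous" is interpreted as tolerating up to `max_gap` non-positive words inside a run, and run length is measured as the span including such gaps. None of these details or names come from the patent.

```python
def word_scores(src, tgt, fine_lexicon):
    """Score each source word with its best translation probability into any
    word of the target sentence, or -1.0 when it has no translation there."""
    tgt_set = set(tgt)
    scores = []
    for s in src:
        probs = [p for t, p in fine_lexicon.get(s, {}).items() if t in tgt_set]
        scores.append(max(probs) if probs else -1.0)
    return scores

def find_fragments(scores, min_len=3, max_gap=1):
    """Return (start, end) spans, end exclusive, that begin and end on a
    positive score, contain no gap longer than max_gap, and span >= min_len words."""
    fragments = []
    start = None     # index where the current run began
    last_pos = None  # index of the most recent positive score in the run
    for i, sc in enumerate(scores):
        if sc > 0:
            if start is None:
                start = i
            last_pos = i
        elif start is not None and i - last_pos > max_gap:
            # The gap grew too long: close the run at its last positive word.
            if last_pos - start + 1 >= min_len:
                fragments.append((start, last_pos + 1))
            start = None
    if start is not None and last_pos - start + 1 >= min_len:
        fragments.append((start, last_pos + 1))
    return fragments

# Toy fine lexicon (hypothetical probabilities).
fine = {"chat": {"cat": 0.9}, "noir": {"black": 0.8}}
scores = word_scores(["le", "chat", "noir"], ["the", "black", "cat"], fine)
spans = find_fragments(scores, min_len=2)
```

The same span search could be run over the target sentence with the lexicon direction reversed, so that a fragment is reported on each side of the pair.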

11. A method for identifying parallel documents in a bilingual collection of documents, the method comprising:
- translating a source document from a bilingual collection of documents using a processor configured to perform statistical machine translation;
  
  querying a target library associated with the bilingual collection of documents using the translated source document using a query engine of the processor;
  
  identifying a predetermined limited number of one or more target documents in the target library that are most similar to the translated source document, based on the query using the query engine;
  
  aligning one or more source sentences associated with the source document to one or more target sentences associated with each of the one or more identified target documents to generate one or more aligned sentence pairs, the aligning performed using a document selector engine of the processor;
  
  discarding each of the one or more identified target documents that does not have a specified number of sentence pairs that can be aligned, the discarding performed using the document selector engine;
  
  determining for each of the one or more identified target documents a number of aligned sentence pairs that are translations of each other, the determining performed using the document selector engine; and
  
  determining whether the source document comprises a parallel translation of one of the one or more target documents, the determining performed using a parallel document engine of the processor and comprising:
  
    determining a percentage of words that are aligned in each sentence pair of the target document using greedily linking,
  
    determining a number of sentence pairs in the target document that are parallel sentence pairs based on the percentage of words that are aligned in each sentence pair, and
  
    selecting the target document as being a parallel translation of the source document according to a number of parallel sentence pairs in the target document.
- Dependent claims (12, 13, 14):
  - 12. The method recited in claim 11 wherein determining whether the source document comprises a parallel translation of one of the one or more target documents further comprises determining whether the number of the source sentences aligned to the one or more target sentences satisfies a threshold associated with noisy sentences.
  - 13. The method recited in claim 11 wherein determining whether the source document comprises a parallel translation of the target document further comprises determining whether the number of the source sentences aligned to the one or more target sentences satisfies a threshold associated with monotone sentences.
  - 14. The method recited in claim 11 further comprising determining whether a source fragment associated with the source sentence comprises a parallel translation of a target fragment associated with one of the one or more target sentences.
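
The document-level selection in claims 11 to 13 can be sketched as follows. The claims do not define the "noisy" and "monotone" thresholds, so the interpretation here is an assumption: the noisy check requires a minimum share of source sentences to have a parallel partner, and the monotone check requires most aligned pairs to appear in the same order in both documents. All names and threshold values are hypothetical.

```python
def monotone_fraction(pairs):
    """pairs: (src_idx, tgt_idx) tuples sorted by src_idx. Returns the share of
    consecutive pairs whose target index does not decrease."""
    if len(pairs) < 2:
        return 1.0
    ok = sum(1 for (_, a), (_, b) in zip(pairs, pairs[1:]) if b >= a)
    return ok / (len(pairs) - 1)

def select_parallel_document(candidates, n_source_sentences, min_fraction=0.5,
                             noisy_threshold=0.3, monotone_threshold=0.9):
    """candidates maps doc_id -> [(src_idx, tgt_idx, aligned_fraction), ...].
    Returns the id of the candidate with the most parallel sentence pairs among
    those passing both threshold checks, or None if no candidate passes."""
    if n_source_sentences == 0:
        return None
    best_id, best_count = None, 0
    for doc_id, pairs in sorted(candidates.items()):
        # A sentence pair counts as parallel when enough of its words are aligned.
        parallel = sorted((s, t) for s, t, f in pairs if f >= min_fraction)
        if len(parallel) / n_source_sentences < noisy_threshold:
            continue  # too few source sentences have a parallel partner
        if monotone_fraction(parallel) < monotone_threshold:
            continue  # sentence order differs too much between the documents
        if len(parallel) > best_count:
            best_id, best_count = doc_id, len(parallel)
    return best_id

# Toy candidates (hypothetical) for a four-sentence source document.
candidates = {
    "d1": [(0, 0, 0.8), (1, 1, 0.7), (2, 2, 0.9), (3, 3, 0.2)],
    "d2": [(0, 2, 0.8), (1, 0, 0.9), (2, 1, 0.6)],
    "d3": [(0, 0, 0.9)],
}
chosen = select_parallel_document(candidates, n_source_sentences=4)
```

In this toy run `d2` is rejected by the monotone check and `d3` by the noisy check, leaving `d1` as the selected document.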

15. A system for identifying parallel documents in a bilingual collection, the system comprising:
- a word translator engine stored in a memory, executable by a processor, and configured to translate a source document from a bilingual collection;
  
  a query engine configured to query a target library associated with the bilingual collection using the translated source document and identify one or more target documents based on the query;
  
  a document selector engine configured to align a source sentence associated with the source document to one or more target sentences associated with the one or more target documents; and
  
  a parallel document engine configured to determine whether the source document comprises a parallel translation of one of the one or more target documents, the parallel document engine further configured to:
  
    determine a percentage of words that are aligned in each sentence pair of the target document using greedily linking,
  
    determine a number of sentence pairs in the target document that are parallel sentence pairs based on the percentage of words that are aligned in each sentence pair, and
  
    select the target document as being a parallel translation of the source document according to a number of parallel sentence pairs in the target document.
- Dependent claims (16, 17, 18, 19):
  - 16. The system recited in claim 15 wherein the parallel document engine further comprises a sentence analysis module configured to select the one of the one or more target documents according to a number of the source sentences within the source document aligned to the one or more target sentences.
  - 17. The system recited in claim 16 wherein the parallel document engine further comprises a document classification module configured to determine whether the number of the source sentences aligned to the one or more target sentences satisfies a threshold associated with noisy sentences.
  - 18. The system recited in claim 16 wherein the parallel document engine further comprises a document classification module configured to determine whether the number of the source sentences aligned to the one or more target sentences satisfies a threshold associated with monotone sentences.
  - 19. The system recited in claim 15 further comprising a parallel fragment engine configured to determine whether a source fragment associated with the source sentence comprises a parallel translation of a target fragment associated with the one or more target sentences.

Current Assignee: University of Southern California
Original Assignee: University of Southern California
Inventors: Marcu, Daniel; Munteanu, Dragos Stefan
Primary Examiner(s): Bhatia, Ajay
Assistant Examiner(s): Mina, Fatima P
Application Number: US11/635,248
Publication Number: US 20070250306A1
Time in Patent Office: 2,975 Days
Field of Search: 707/761, 707/708, 707/728, 707/748, 707/749, 707/758
US Class Current: 707/758
CPC Class Codes: G06F 40/45 Example-based machine trans...
