Systems and methods for identifying parallel documents and sentence fragments in multilingual document collections

  • US 8,943,080 B2
  • Filed: 12/05/2006
  • Issued: 01/27/2015
  • Est. Priority Date: 04/07/2006
  • Status: Active Grant
First Claim

1. A method for identifying parallel sub-sentential fragments in a bilingual collection of documents, the method comprising:

    translating a source document in the bilingual collection of documents using a processor configured to perform statistical machine translation;

    querying a target library associated with the bilingual collection of documents using the translated source document using a query engine of the processor;

    identifying a plurality of target documents in the target library that are most similar to the translated source document, based on the query, using the query engine;

    aligning a source sentence associated with the source document to one or more target sentences associated with each of the plurality of identified target documents to generate one or more aligned sentence pairs, using a document selector engine of the processor;

    discarding an aligned sentence pair based on a number of words in the sentence pair that are translations of each other, using the document selector engine; and

    determining for each of the aligned sentence pairs that have not been discarded whether a source fragment in the source sentence comprises a parallel translation of a target fragment in the target sentence based on a number of words in the source fragment that are translations of words in the target fragment, the determining performed using a parallel document engine of the processor and comprising:

        determining a percentage of words that are aligned in each sentence pair of the target document using greedily linking,

        determining a number of sentence pairs in the target document that are parallel sentence pairs based on the percentage of words that are aligned in each sentence pair, and

        selecting the target document as being a parallel translation of the source document according to a number of parallel sentence pairs in the target document.
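
The final three sub-steps of the claim (greedy word linking, counting parallel sentence pairs, and selecting a target document) can be illustrated with a minimal Python sketch. It is not the patented implementation. The toy lexicon, the function names, and both thresholds (`min_aligned`, `min_parallel_pairs`) are hypothetical choices made for illustration; the claim does not specify any of them, and a real system would presumably draw word translations from a statistical translation model rather than a hand-written dictionary.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy lexicon: source word -> set of acceptable target translations.
Lexicon = Dict[str, Set[str]]
SentencePair = Tuple[List[str], List[str]]


def greedy_link(source: List[str], target: List[str],
                lexicon: Lexicon) -> List[Tuple[int, int]]:
    """Greedily link each source word to the first unused target word
    that the lexicon lists as one of its translations."""
    used: Set[int] = set()
    links: List[Tuple[int, int]] = []
    for i, src_word in enumerate(source):
        candidates = lexicon.get(src_word, set())
        for j, tgt_word in enumerate(target):
            if j not in used and tgt_word in candidates:
                links.append((i, j))
                used.add(j)
                break  # greedy: keep the first match, never revisit it
    return links


def aligned_fraction(source: List[str], target: List[str],
                     lexicon: Lexicon) -> float:
    """Fraction of words, counted over both sentences, that are linked."""
    total = len(source) + len(target)
    if total == 0:
        return 0.0
    # Each link covers one source word and one target word.
    return 2 * len(greedy_link(source, target, lexicon)) / total


def count_parallel_pairs(pairs: List[SentencePair], lexicon: Lexicon,
                         min_aligned: float = 0.5) -> int:
    """Count sentence pairs whose aligned fraction reaches min_aligned
    (an assumed threshold, not taken from the claim)."""
    return sum(1 for src, tgt in pairs
               if aligned_fraction(src, tgt, lexicon) >= min_aligned)


def is_parallel_document(pairs: List[SentencePair], lexicon: Lexicon,
                         min_aligned: float = 0.5,
                         min_parallel_pairs: int = 1) -> bool:
    """Select the target document if it has enough parallel sentence pairs
    (min_parallel_pairs is likewise an assumed threshold)."""
    return count_parallel_pairs(pairs, lexicon, min_aligned) >= min_parallel_pairs


# Toy French-to-English data, invented for this example.
lexicon = {"le": {"the"}, "chat": {"cat"}, "dort": {"sleeps"}}
pairs = [
    (["le", "chat", "dort"], ["the", "cat", "sleeps"]),  # fully linked
    (["le", "chien"], ["a", "bird", "sings"]),           # no links
]
print(aligned_fraction(*pairs[0], lexicon))
print(count_parallel_pairs(pairs, lexicon))
print(is_parallel_document(pairs, lexicon, min_parallel_pairs=1))
```

In this toy data the first pair is fully linked and the second has no links, so one pair counts as parallel and the document is selected when `min_parallel_pairs` is 1. Greedy linking trades accuracy for speed: each source word takes the first available match and is never reconsidered, which is cheap enough to run over every candidate sentence pair.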
