Method for building parallel corpora
First Claim
1. A method for identifying documents for enriching a statistical translation tool, comprising:
retrieving at least one source document which is responsive to a source language query;
for each retrieved source document:
extracting a set of text segments from the retrieved source document;
translating the extracted text segments into target language segments with a statistical translation tool to be enriched;
formulating target language queries based on the target language segments;
for each of a plurality of the target language queries, retrieving a set of target documents responsive to the target language query;
filtering the sets of retrieved target documents that are responsive to the target language queries, the filtering including identifying candidate documents which meet a selection criterion that is based on co-occurrence of a target document in a plurality of the sets; and
comparing the candidate documents with the retrieved source document for determining whether any of the candidate documents match the source document.
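The filtering step of claim 1 can be illustrated with a minimal sketch. It assumes each target language query returns a list of document identifiers, and it reads "co-occurrence of a target document in a plurality of the sets" as "the document appears in at least `min_sets` of the result sets". The function name `filter_by_cooccurrence` and the parameter `min_sets` are hypothetical and are not taken from the patent.

```python
from collections import Counter
from typing import Hashable, Iterable, List


def filter_by_cooccurrence(
    result_sets: Iterable[Iterable[Hashable]],
    min_sets: int = 2,
) -> List[Hashable]:
    """Return documents that occur in at least `min_sets` of the result sets.

    Each element of `result_sets` holds the documents retrieved for one
    target language query. `min_sets` is an assumed threshold.
    """
    counts: Counter = Counter()
    for results in result_sets:
        # De-duplicate within a set so a document counts once per query.
        counts.update(set(results))
    # Order by descending co-occurrence count, then by identifier.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0])))
    return [doc for doc, n in ranked if n >= min_sets]


if __name__ == "__main__":
    sets = [["d1", "d2", "d3"], ["d2", "d4"], ["d2", "d3", "d5"]]
    print(filter_by_cooccurrence(sets, min_sets=2))  # ['d2', 'd3']
```

Only the surviving candidates would then go on to the comparison with the source document, which keeps the more expensive matching step limited to a small number of documents.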
Abstract
A method for identifying documents for enriching a statistical translation tool includes retrieving a source document which is responsive to a source language query that may be specific to a selected domain. A set of text segments is extracted from the retrieved source document and translated into corresponding target language segments with the statistical translation tool to be enriched. Target language queries are formulated based on the target language segments. Sets of target documents responsive to the target language queries are retrieved. The sets of retrieved target documents are filtered, including identifying any candidate documents which meet a selection criterion that is based on co-occurrence of a document in a plurality of the sets. The candidate documents, where found, are compared with the retrieved source document to determine whether any of the candidate documents match the source document. Matching documents can then be stored and used, in turn, in a training phase for enriching the translation tool.
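The flow described in the abstract can be sketched end to end as follows. This is an illustration under stated assumptions, not the patented implementation: the retrieval, segmentation, translation, and matching components are passed in as hypothetical callables (`search_source`, `extract_segments`, `translate`, `search_target`, `is_match`), one target query is formed per translated segment, and the threshold `min_sets` is an assumed parameter.

```python
from collections import Counter
from typing import Callable, Dict, List


def find_matching_documents(
    source_query: str,
    search_source: Callable[[str], List[str]],
    extract_segments: Callable[[str], List[str]],
    translate: Callable[[str], str],
    search_target: Callable[[str], List[str]],
    is_match: Callable[[str, str], bool],
    min_sets: int = 2,
) -> Dict[str, List[str]]:
    """Map each retrieved source document to its matching target documents.

    All callables are hypothetical stand-ins for real components.
    """
    matches: Dict[str, List[str]] = {}
    for source in search_source(source_query):
        # Extract text segments and translate them with the tool to be enriched.
        segments = extract_segments(source)
        # Assumption: each translated segment is used directly as one query.
        queries = [translate(segment) for segment in segments]
        # Count in how many result sets each target document appears.
        counts: Counter = Counter()
        for query in queries:
            counts.update(set(search_target(query)))
        # Keep candidates that co-occur in at least `min_sets` result sets.
        candidates = [doc for doc, n in counts.items() if n >= min_sets]
        # Compare the remaining candidates with the source document.
        matches[source] = sorted(d for d in candidates if is_match(source, d))
    return matches
```

With stub components, a call such as `find_matching_documents("query", search_source, extract_segments, translate, search_target, is_match)` returns a dictionary keyed by source document. The matched pairs it contains are the material that would be stored for the later training phase.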
19 Claims
1. A method for identifying documents for enriching a statistical translation tool, comprising:
retrieving at least one source document which is responsive to a source language query;
for each retrieved source document:
extracting a set of text segments from the retrieved source document;
translating the extracted text segments into target language segments with a statistical translation tool to be enriched;
formulating target language queries based on the target language segments;
for each of a plurality of the target language queries, retrieving a set of target documents responsive to the target language query;
filtering the sets of retrieved target documents that are responsive to the target language queries, the filtering including identifying candidate documents which meet a selection criterion that is based on co-occurrence of a target document in a plurality of the sets; and
comparing the candidate documents with the retrieved source document for determining whether any of the candidate documents match the source document.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17
18. A system for enriching a statistical translation tool comprising:
memory which stores instructions for:
retrieving at least one source document which is responsive to a source language query;
for each retrieved source document, extracting a set of text segments from the retrieved source document;
translating the extracted text segments into target language segments with a statistical translation tool;
formulating target language queries based on the target language segments;
retrieving target documents responsive to the target language queries;
filtering the retrieved target documents that are responsive to the target language queries, the filtering including identifying candidate documents from the target documents that are responsive to a preselected minimum amount of queries;
comparing the candidate documents with the retrieved source document for identifying whether any of the candidate documents match the source document; and
enriching the translation tool with aligned text fragments from matching source and target documents; and
a processor which executes the instructions.
19. A method for enriching a statistical translation tool comprising:
for each of a plurality of source documents in a target language:
extracting a set of text segments from the source document;
translating the extracted text segments into target language segments with a statistical translation tool;
formulating target language queries, each query being based on one of the target language segments;
retrieving target documents responsive to the target language queries; and
filtering the retrieved target documents that are responsive to the target language queries, the filtering including identifying candidate documents from among the retrieved target documents which meet a selection criterion, the selection criterion being based on a measure of the queries to which a document is retrieved as being responsive; and
comparing the candidate documents which meet the selection criterion with the retrieved source document for identifying whether any of the candidate documents match the source document; and
enriching the translation tool with aligned text fragments from the matched source and target documents.
Specification