Bi-phrase filtering for statistical machine translation
First Claim
1. A method for pruning a library of bi-phrases, comprising:
- partitioning a bi-phrase library for use in a machine translation system into a set of sub-libraries, the bi-phrase library including bi-phrases, each bi-phrase including a source phrase in a source language and a target phrase in a target language which is predicted to be a translation of the source phrase;
for each of at least a plurality of the sub-libraries and for each of a plurality of association score thresholds, with a computer processor, computing values of noise as a function of expected counts and observed counts, where observed counts corresponds to a number of the bi-phrases in the sub-library having an association score above the association score threshold, and expected counts corresponds to a number of bi-phrases in the sub-library having an association score above the association score threshold under an independence hypothesis;
wherein the association score of a bi-phrase represents a statistical dependency between source and target phrases of the bi-phrase and is computed as −log(p-value) of a contingency table for the bi-phrase, the contingency table including elements which represent a partition of N corpus bi-sentences into bi-sentences that contain both S and T, contain S but not T, contain T but not S, and contain neither S nor T, where N represents a number of bi-sentences in a parallel corpus from which the bi-phrases are extracted, S represents a source phrase of the bi-phrase, T represents a target phrase of the bi-phrase, and wherein the p-value is computed from the contingency table using Fisher's exact test; and
pruning the plurality of sub-libraries based on a common noise threshold on the noise values, including, for each sub-library, determining an association score pruning threshold corresponding to the common noise threshold, and pruning bi-phrases from the sub-library having an association score which exceeds the association score pruning threshold.
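The association score recited in the claim can be illustrated with a minimal sketch. It assumes a one-sided Fisher's exact test (the probability of seeing at least the observed number of co-occurrences under independence) and a natural logarithm; the claim fixes neither choice. The function and parameter names (`association_score`, `n_st`, `n_s`, `n_t`, `n`) are hypothetical and do not come from the patent.

```python
import math


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def association_score(n_st, n_s, n_t, n):
    """Return -log(p-value) for one bi-phrase (S, T).

    n_st: bi-sentences containing both S and T
    n_s:  bi-sentences containing S (with or without T)
    n_t:  bi-sentences containing T (with or without S)
    n:    total number of bi-sentences N in the parallel corpus

    The implied contingency table is
        [n_st,        n_s - n_st            ]
        [n_t - n_st,  n - n_s - n_t + n_st  ]
    """
    cells = (n_st, n_s - n_st, n_t - n_st, n - n_s - n_t + n_st)
    if min(cells) < 0:
        raise ValueError("counts do not form a valid contingency table")

    # Assumption: one-sided test, summing the hypergeometric probabilities
    # of n_st or more co-occurrences under the independence hypothesis.
    log_denominator = log_comb(n, n_t)
    log_terms = [
        log_comb(n_s, k) + log_comb(n - n_s, n_t - k) - log_denominator
        for k in range(n_st, min(n_s, n_t) + 1)
    ]

    # Log-sum-exp keeps very small p-values from underflowing to zero.
    largest = max(log_terms)
    log_p = largest + math.log(sum(math.exp(t - largest) for t in log_terms))
    return -log_p
```

Under these assumptions, a bi-phrase whose source and target phrases each occur once, in the same bi-sentence, has p-value 1/N, so its score equals log(N). A bi-phrase with no co-occurrences has p-value 1 and a score of 0.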
1 Assignment
0 Petitions
Abstract
A computer-implemented system and a method for pruning a library of bi-phrases, suitable for use in a machine translation system, are provided. The method includes partitioning a bi-phrase library into a set of sub-libraries. The sub-libraries may be of different complexity such that, when pruning bi-phrases from the plurality of sub-libraries is based on a common noise threshold, a complexity of bi-phrases is taken into account in pruning the bi-phrases.
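How a common noise threshold can map to a different association score threshold in each sub-library is sketched below. The sketch rests on several assumptions that the text above does not state: noise is taken to be the ratio of expected to observed counts, the expected counts are obtained from association scores computed under some independence simulation supplied by the caller, and the selected threshold is the smallest candidate whose noise does not exceed the common noise threshold. All names (`noise`, `pruning_threshold`, `thresholds_per_sub_library`, and the sub-library labels) are hypothetical. The sketch stops at selecting thresholds and does not decide which bi-phrases are discarded.

```python
def noise(observed_scores, null_scores, threshold):
    """Assumed noise measure: expected count / observed count at a threshold.

    observed_scores: association scores of the bi-phrases in one sub-library
    null_scores:     scores obtained under the independence hypothesis
    """
    observed = sum(1 for s in observed_scores if s > threshold)
    expected = sum(1 for s in null_scores if s > threshold)
    if observed == 0:
        # No bi-phrase lies above the threshold; treat noise as undefined.
        return float("inf")
    return expected / observed


def pruning_threshold(observed_scores, null_scores, candidates, max_noise):
    """Smallest candidate threshold whose noise is <= max_noise, else None."""
    for t in sorted(candidates):
        if noise(observed_scores, null_scores, t) <= max_noise:
            return t
    return None


def thresholds_per_sub_library(sub_libraries, candidates, max_noise):
    """Apply one common noise threshold to every sub-library."""
    return {
        name: pruning_threshold(obs, null, candidates, max_noise)
        for name, (obs, null) in sub_libraries.items()
    }


# Toy data: same observed scores, but the second sub-library has more
# high-scoring bi-phrases under the independence hypothesis.
sub_libraries = {
    "simple": ([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [0.5, 1.5, 2.5]),
    "complex": ([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [0.5, 1.5, 2.5, 3.5, 4.5]),
}
thresholds = thresholds_per_sub_library(
    sub_libraries, candidates=[0, 1, 2, 3, 4, 5], max_noise=0.25
)
print(thresholds)
```

With this toy data the same noise level of 0.25 yields a threshold of 2 for the "simple" sub-library and 5 for the "complex" one, which illustrates how a single noise threshold can lead to stricter score thresholds where chance associations are more frequent.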
48 Citations
16 Claims
Specification