Bi-phrase filtering for statistical machine translation

  • US 8,326,599 B2
  • Filed: 04/21/2009
  • Issued: 12/04/2012
  • Est. Priority Date: 04/21/2009
  • Status: Expired due to Fees
First Claim

1. A method for pruning a library of bi-phrases, comprising:

  • partitioning a bi-phrase library for use in a machine translation system into a set of sub-libraries, the bi-phrase library including bi-phrases, each bi-phrase including a source phrase in a source language and a target phrase in a target language which is predicted to be a translation of the source phrase;

    for each of at least a plurality of the sub-libraries and for each of a plurality of association score thresholds, with a computer processor, computing values of noise as a function of expected counts and observed counts, where observed counts corresponds to a number of the bi-phrases in the sub-library having an association score above the association score threshold, and expected counts corresponds to a number of bi-phrases in the sub-library having an association score above the association score threshold under an independence hypothesis;

    wherein the association score of a bi-phrase represents a statistical dependency between source and target phrases of the bi-phrase and is computed as −log(p-value) of a contingency table for the bi-phrase, the contingency table including elements which represent a partition of N corpus bi-sentences into bi-sentences that contain both S and T, contain S but not T, contain T but not S, and contain neither S nor T, where N represent a number of bi-sentences in a parallel corpus from which the bi-phrases are extracted, S represents a source phrase of the bi-phrase, T represents a target phrase of the bi-phrase and wherein the p-value is computed from the contingency table using Fisher's exact test; and

    pruning the plurality of sub-libraries based on a common noise threshold on the noise values, including, for each sub-library, determining an association score pruning threshold corresponding to the common noise threshold, and pruning bi-phrases from the sub-library having an association score which exceeds the association score pruning threshold.
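
The quantities named in the claim can be illustrated with a short Python sketch. It is not the patented implementation, and it rests on several assumptions that the claim text does not fix: the p-value is taken as the one-sided (right-tail) Fisher's exact test, noise is taken as the ratio of expected to observed counts, and the expected counts are derived from a list of scores (`null_scores`) assumed to have been obtained under the independence hypothesis by some means outside the sketch. All function and parameter names are hypothetical. The sketch stops at determining the per-sub-library association score threshold and leaves the final removal step out.

```python
import math


def association_score(n_st, n_s, n_t, n):
    """-log(p-value) of a one-sided Fisher's exact test (assumed variant).

    n_st: bi-sentences containing both S and T
    n_s:  bi-sentences containing S
    n_t:  bi-sentences containing T
    n:    total bi-sentences N in the parallel corpus
    The 2x2 table is (n_st, n_s - n_st, n_t - n_st, n - n_s - n_t + n_st).
    """
    # Right tail of the hypergeometric distribution: probability of seeing
    # at least n_st co-occurrences if S and T were independent.
    # Exact integer arithmetic avoids floating-point underflow.
    upper = min(n_s, n_t)
    tail = sum(
        math.comb(n_s, k) * math.comb(n - n_s, n_t - k)
        for k in range(n_st, upper + 1)
    )
    return math.log(math.comb(n, n_t)) - math.log(tail)


def noise(observed_scores, null_scores, threshold):
    """Expected / observed counts above the threshold (assumed definition)."""
    observed = sum(s > threshold for s in observed_scores)
    expected = sum(s > threshold for s in null_scores)
    # With nothing observed above the threshold the ratio is undefined;
    # this sketch reports 0.0 in that case.
    return expected / observed if observed else 0.0


def threshold_for_noise(observed_scores, null_scores, candidates, max_noise):
    """Smallest candidate threshold whose noise is at most max_noise."""
    for t in sorted(candidates):
        if noise(observed_scores, null_scores, t) <= max_noise:
            return t
    return None  # no candidate reaches the requested noise level


def thresholds_per_sub_library(sub_libraries, candidates, max_noise):
    """Apply one common noise threshold to every sub-library.

    sub_libraries maps a name to (observed_scores, null_scores).
    """
    return {
        name: threshold_for_noise(obs, null, candidates, max_noise)
        for name, (obs, null) in sub_libraries.items()
    }
```

Because the same `max_noise` is applied to every sub-library, each sub-library can end up with a different association score threshold, which is the point of pruning on a common noise level rather than on a common score.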
