Optimized statistical machine translation system with rapid adaptation capability

US 9,959,271 B1
Filed: 09/28/2015
Issued: 05/01/2018
Est. Priority Date: 09/28/2015
Status: Expired due to Fees

Abstract

Technologies are disclosed herein for statistical machine translation. In particular, the disclosed technologies include extensions to conventional machine translation pipelines: the use of multiple domain-specific and non-domain-specific dynamic language translation models and language models; cluster-based language models; and large-scale discriminative training. Incremental update technologies are also disclosed for use in updating a machine translation system in four areas: word alignment; translation modeling; language modeling; and parameter estimation. A mechanism is also disclosed for training and utilizing a runtime machine translation quality classifier for estimating the quality of machine translations without the benefit of reference translations. The runtime machine translation quality classifier is generated in a manner to offset imbalances in the number of training instances in various classes, and to assign a greater penalty to the misclassification of lower-quality translations as higher-quality translations than to misclassification of higher-quality translations as lower-quality translations.
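
The two training adjustments named at the end of the abstract (offsetting class imbalance and penalizing over-estimation of quality more than under-estimation) can be illustrated with a minimal sketch. The patent text here does not give formulas, so the inverse-frequency weighting, the penalty values, and all function names below are assumptions chosen for illustration; the class names are borrowed from the dependent claims.

```python
from collections import Counter

# Classes ordered from lowest to highest quality (names taken from the claims).
CLASSES = ["residual", "understandable", "perfect"]


def balanced_class_weights(labels):
    """Inverse-frequency weights: rare classes get larger weights (assumed scheme)."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}


def asymmetric_cost(true_cls, predicted_cls, over_penalty=3.0, under_penalty=1.0):
    """Cost of predicting predicted_cls when the correct class is true_cls.

    Rating a segment higher than it deserves costs more per class step than
    rating it lower. The penalty values are hypothetical.
    """
    gap = CLASSES.index(predicted_cls) - CLASSES.index(true_cls)
    if gap > 0:
        return over_penalty * gap      # lower-quality segment rated too high
    return under_penalty * -gap        # higher-quality segment rated too low (or correct)


def weighted_loss(true_labels, predicted_labels, class_weights):
    """Mean asymmetric cost, with each example scaled by the weight of its true class."""
    total = sum(
        class_weights[t] * asymmetric_cost(t, p)
        for t, p in zip(true_labels, predicted_labels)
    )
    return total / len(true_labels)
```

With these assumed penalties, labeling a residual segment as perfect costs three times as much as the reverse error, before class weights are applied.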

21 Claims

1. An apparatus, comprising:
- one or more processors; and
  
  one or more non-transitory computer-readable storage media having instructions stored thereupon which are executable by the one or more processors and which, when executed, cause the apparatus to:
  
  determine a number of out-of-vocabulary words in input text segments;
  
  generate an estimated difficulty feature score of a supervised machine learning model in translating the input text segments based, at least in part, on the number of out-of-vocabulary words;
  
  modify a misclassification cost associated with the supervised machine learning model, stored in a memory, to offset an imbalance between a plurality of classes of training data utilized to train a machine translation quality classifier to classify a quality of machine translated text segments, the training data comprising one or more feature scores including the estimated difficulty feature score;
  
  modify a loss function associated with the supervised machine learning model stored in the memory to penalize a misclassification of a lower-quality text segment as a higher-quality text segment more greatly than a misclassification of a higher-quality text segment as a lower-quality text segment;
  
  train the machine translation quality classifier utilizing the supervised machine learning model based, at least in part, on the misclassification cost and the loss function;
  
  cause the machine translation quality classifier to be deployed to a computer in a service provider network; and
  
  utilize the machine translation quality classifier to classify a quality of translated segments received from a machine translation system operating in the service provider network into one of the plurality of classes in real time.
  - 2. The apparatus of claim 1, wherein the plurality of classes comprises a perfect or near perfect class, an understandable class, and a residual class.
  - 3. The apparatus of claim 1, wherein the non-transitory computer-readable storage media has further instructions stored thereupon to:
    - aggregate a plurality of classifications for translated segments; and
      
      generate one or more document-level or corpus-level distribution statistics based upon the aggregated classifications.
  - 4. The apparatus of claim 1, wherein the one or more feature scores are associated with machine translated text segments in a target language and correct class labels for the machine translated text segments in the target language.
  - 5. The apparatus of claim 4, wherein the correct class labels for the machine translated text segments in the target language are generated, at least in part, based upon a translation edit rate (“TER”) between the machine translated text segments in the target language and associated reference translations.
  - 6. The apparatus of claim 4, wherein the one or more feature scores associated with the machine translated text segments in the target language comprise one or more of:
    - a fluency of the machine translated text segments,
    - a level of ambiguity experienced by the machine translation system in translating the input text segments,
    - a difference in length or punctuation between the input text segments and the machine translated text segments, or
    - one or more statistical confidence measures generated by the machine translation system for the machine translated text segments.
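
Claim 1 derives an estimated difficulty feature score from the number of out-of-vocabulary (OOV) words in the input segments. A minimal sketch of one way to do this follows. Whitespace tokenization, lowercasing, and normalizing the count by segment length are assumptions, as are the function names; the claim only requires that the score depend, at least in part, on the OOV count.

```python
def count_oov(segment, vocabulary):
    """Return (number of OOV tokens, total tokens) for a source segment.

    Tokenization by whitespace and lowercasing are simplifying assumptions.
    """
    tokens = segment.lower().split()
    oov = sum(1 for tok in tokens if tok not in vocabulary)
    return oov, len(tokens)


def difficulty_score(segment, vocabulary):
    """Fraction of tokens that are OOV, in [0, 1]; higher suggests a harder segment."""
    oov, n_tokens = count_oov(segment, vocabulary)
    return oov / n_tokens if n_tokens else 0.0
```

A score like this would be one entry in the feature vector, alongside the fluency, ambiguity, length-difference, and confidence features listed in claim 6.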

7. A computer-implemented method for classifying a quality of translated segments generated by a machine translation system, the method comprising:
- generating an estimated difficulty feature score of a supervised machine learning model in translating input text segments based, at least in part, on a number of out-of-vocabulary words in the input text segments;
  
  training a machine translation quality classifier stored in a memory to classify the quality of the translated segments utilizing the supervised machine learning model configured with:
    - a misclassification cost configured to offset an imbalance between a plurality of classes of training data, the training data comprising one or more feature scores associated with machine translated segments of a target language and correct class labels for the machine translated segments in the target language, the one or more feature scores including the estimated difficulty feature score; and
    - a loss function configured to penalize a misclassification of a lower-quality translated segment as a higher-quality translated segment more greatly than a misclassification of a higher-quality translated segment as a lower-quality translated segment; and
  
  utilizing the machine translation quality classifier at a computer in a service provider network to classify the quality of the translated segments generated by the machine translation system into the plurality of classes.
  - 8. The computer-implemented method of claim 7, wherein the correct class labels for the machine translated segments in the target language are generated, at least in part, based upon a translation edit rate (“TER”) between the machine translated segments in the target language and associated reference translations.
  - 9. The computer-implemented method of claim 7, wherein the one or more feature scores associated with the machine translated segments in the target language comprise one or more of:
    - a fluency of the machine translated segments,
    - a level of ambiguity experienced by the machine translation system in translating the input text segments,
    - a difference in length or punctuation between the input text segments and the machine translated text segments, or
    - one or more statistical confidence measures generated by the machine translation system for the machine translated text segments.
  - 10. The computer-implemented method of claim 7, further comprising:
    - aggregating a plurality of classifications for translated segments; and
      
      generating one or more document-level or corpus-level distribution statistics based upon the aggregated classifications.
  - 11. The computer-implemented method of claim 10, further comprising initiating one or more actions based, at least in part, on the document-level or corpus-level distribution statistics.
  - 12. The computer-implemented method of claim 7, wherein the plurality of classes comprises a perfect or near perfect class, an understandable class, and a residual class.
  - 13. The computer-implemented method of claim 12, further comprising providing translated segments in the understandable class to an editor for post-editing.
  - 14. The computer-implemented method of claim 12, further comprising:
    - discarding translated segments in the residual class; and
      
      providing input segments associated with translated segments in the residual class to an editor for translation.
  - 15. The computer-implemented method of claim 12, further comprising:
    - discarding translated segments in the residual class; and
      
      retranslating input segments associated with translated segments in the residual class using a dedicated cluster of instances of a statistical machine translation system configured to examine broad segments of text that are compute intensive and statistically complex.
  - 16. The computer-implemented method of claim 12, further comprising:
    - calculating a compute cost associated with retranslating one or more translated segments in the residual class using a dedicated cluster of instances of a statistical machine translation system configured to examine broad segments of text that are compute intensive and statistically complex;
      
      determining that the compute cost associated with translating the one or more translated segments in the residual class exceeds a cost associated with retranslating the one or more translated segments using a human post-editor; and
      
      causing the translated segments in the residual class to be provided to the dedicated cluster of instances or the human post-editor for retranslation based, at least in part, upon the determination that the compute cost exceeds the cost associated with retranslating the one or more translated segments using the human post-editor.
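
Claims 10 through 16 describe what happens after classification: per-segment classes are aggregated into distribution statistics, and each segment is routed according to its class. The sketch below is one possible reading under stated assumptions. The action names, the `"publish"` action for the perfect class, and the rule of sending residual segments to whichever retranslation path is cheaper are all hypothetical; the claims do not spell out a specific decision rule or data format.

```python
from collections import Counter


def class_distribution(classifications):
    """Share of segments per quality class for a document or corpus."""
    total = len(classifications)
    if total == 0:
        return {}
    counts = Counter(classifications)
    return {cls: n / total for cls, n in counts.items()}


def route_segment(quality_class, cluster_cost=None, human_cost=None):
    """Pick a follow-up action for one translated segment (hypothetical actions)."""
    if quality_class == "perfect":
        return "publish"                 # assumption: used without further editing
    if quality_class == "understandable":
        return "post_edit"               # claim 13: send to an editor for post-editing
    # Residual class: the machine output is discarded and the source is retranslated.
    if cluster_cost is not None and human_cost is not None and cluster_cost <= human_cost:
        return "retranslate_on_cluster"  # dedicated, compute-intensive SMT cluster
    return "human_translation"           # fall back to a human translator
```

For example, `class_distribution(["perfect", "perfect", "residual", "understandable"])` gives the per-class shares that claim 11 could use to trigger document-level actions.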

17. A non-transitory computer-readable storage media having instructions stored thereupon which are executable by one or more processors and which, when executed, cause the one or more processors to:
- generate an estimated difficulty feature score of a supervised machine learning model in translating input text segments based, at least in part, on a number of out-of-vocabulary words in the input text segments;
  
  train a machine translation quality classifier stored in a memory to classify a quality of the translated segments utilizing the supervised machine learning model configured with:
    - a misclassification cost configured to offset an imbalance between a plurality of classes of training data, the training data comprising feature scores associated with machine translated segments in a target language, the feature scores including the estimated difficulty feature score; and
    - a loss function configured to penalize a misclassification of a lower-quality translated segment as a higher-quality translated segment more greatly than a misclassification of a higher-quality translated segment as a lower-quality translated segment; and
  
  utilize the machine translation quality classifier at a computer operating in a service provider network to classify the quality of the translated segments generated by the machine translation system into the plurality of classes.
  - 18. The non-transitory computer-readable storage media of claim 17, wherein the machine translation quality classifier is further configured to receive an indication as to whether a term dictionary was utilized to translate at least a portion of the translated segments, and to utilize the indication as to whether a term dictionary was utilized to translate at least a portion of the translated segments, at least in part, to classify the quality of the translated segments generated by the machine translation system into the plurality of classes.
  - 19. The non-transitory computer-readable storage media of claim 17, wherein the machine translation quality classifier is trained using training data comprising correct class labels for machine translated segments in a target language that have been generated, at least in part, based upon a translation edit rate (“TER”) between the machine translated segments in the target language and associated reference translations.
  - 20. The non-transitory computer-readable storage media of claim 19, wherein the feature scores further comprise one or more of:
    - a fluency of the machine translated segments,
    - a level of ambiguity experienced by the machine translation system in translating the input text segments,
    - a difference in length or punctuation between the input text segments and the machine translated text segments, or
    - one or more statistical confidence measures generated by the machine translation system for the machine translated text segments.
  - 21. The non-transitory computer-readable storage media of claim 17, having further instructions stored thereupon to retrain the machine translation quality classifier in conjunction with the retraining of a statistical machine translation system.
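
Claims 5, 8, and 19 generate the correct class labels for training from a translation edit rate (TER) between machine output and reference translations. The sketch below approximates this with a word-level edit distance divided by the reference length. This is a simplification: full TER also counts block shifts as single edits, which is omitted here. The thresholds `perfect_max` and `understandable_max` are hypothetical values, not taken from the patent.

```python
def word_edit_distance(hyp_tokens, ref_tokens):
    """Levenshtein distance over words (insertions, deletions, substitutions)."""
    prev = list(range(len(ref_tokens) + 1))
    for i, h in enumerate(hyp_tokens, start=1):
        curr = [i]
        for j, r in enumerate(ref_tokens, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a hypothesis word
                curr[j - 1] + 1,         # insert a reference word
                prev[j - 1] + (h != r),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def approximate_ter(hypothesis, reference):
    """Edits per reference word; shift operations of full TER are not modeled."""
    ref_tokens = reference.split()
    if not ref_tokens:
        raise ValueError("reference translation must not be empty")
    return word_edit_distance(hypothesis.split(), ref_tokens) / len(ref_tokens)


def label_from_ter(ter, perfect_max=0.1, understandable_max=0.4):
    """Map an edit rate to one of the three quality classes (assumed thresholds)."""
    if ter <= perfect_max:
        return "perfect"
    if ter <= understandable_max:
        return "understandable"
    return "residual"
```

Labels produced this way require reference translations, so they are available only at training time; at runtime the classifier predicts the class from feature scores alone.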

Current Assignee
Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee
Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors
Goyal, Kartik; Lavie, Alon; Denkowski, Michael; Hanneman, Gregory Alan; Fiorillo, Matthew Ryan; Olszewski, Robert Thomas; Hershkovich, Ehud; Kaper, William Joseph; Klementiev, Alexandre Alexandrovich; Jewell, Gavin R.
Primary Examiner(s)
ORTIZ SANCHEZ, MICHAEL

Application Number

US14/868,083
Time in Patent Office

946 Days
CPC Class Codes

G06F 40/44 Statistical methods, e.g. p...

G06F 40/51 Translation evaluation
