Domain adaptation for query translation

US 8,543,563 B1
Filed: 05/24/2012
Issued: 09/24/2013
Est. Priority Date: 05/24/2012
Status: Expired due to Fees

Abstract

A translation system and method suited to use in Cross Language Information Retrieval employ a retrieval-based scoring function for reranking candidate translations. The method includes translating an input source language query to generate a set of candidate translations in a target language. The candidate translations are scored with the scoring function, which allows them to be reranked and one or more optimal candidates to be selected for use in querying a domain-specific collection of documents in the target language. The scoring function applies weights to features extracted from the candidate translations. The weights have been learned on features extracted from translated queries, each generated by translation of an original query into the target language, and on a measure of information retrieval performance of each of the translated queries. One or more of the features used is a domain-specific feature which relies on a corpus of documents in the domain of interest.
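
The abstract refers to "a measure of information retrieval performance" without fixing a particular metric. As a minimal sketch, assuming average precision is the chosen measure and that relevance annotations are available as a set of document ids (the function name, document ids, and data below are hypothetical), the measure for one translated query could be computed as:

```python
from typing import Sequence, Set


def average_precision(ranked_doc_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """Average precision of one ranked result list against annotated relevant documents."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            # Precision at this rank, counted only at ranks holding a relevant document.
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)


# Hypothetical documents retrieved in response to one translated query.
retrieved = ["d3", "d7", "d1", "d9"]
# Hypothetical documents annotated as relevant to the original query.
relevant = {"d3", "d1"}
ap = average_precision(retrieved, relevant)
```

Any other retrieval metric computed from relevance annotations could be substituted; the choice here is an assumption made only for illustration.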

82 Citations

26 Claims

1. A translation method comprising:
- receiving an input query in a source language;
  
  translating the input query with a phrase-based statistical machine translation system to generate a set of candidate translations of the input query in a target language;
  
  extracting a set of features from each of the candidate translations in the set, the set of features including at least one domain specific feature which is based on a comparison of at least one term in the candidate translation with words in a domain-specific corpus of documents;
  
  scoring each of the candidate translations with a scoring function in which the extracted features are weighted with respective weights, the weights having been learned on:
  
  features extracted from translated queries, each of the translated queries having been generated by translation of an original query from a set of original queries into the target language with a machine translation system, and
  
  a measure of information retrieval performance of each of the translated queries, for each original query in the set of original queries, the information retrieval performance of each translated query being based on a relevance score, with respect to the respective original query, for documents in a set of documents that have been retrieved in response to the translated query; and
  
  outputting a target query in the target language based on the scores of the candidate translations.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
- - 2. The method of claim 1, wherein at least one of the translating, extracting, and scoring is performed with a computer processor.
  - 3. The method of claim 1, wherein the outputting comprises outputting the target query to a search engine and retrieving information based on the target query.
  - 4. The method of claim 1, wherein the target query is based on at least one of the candidate translations.
  - 5. The method of claim 1, wherein the outputting comprises:
    - ranking the candidate translations based on scores of the candidate translations; and
      
      selecting at least one of the more highly ranked candidate translations to form the target query.
  - 6. The method of claim 1, wherein the scoring comprises computing a translation score for each of the candidate translations as a weighted linear combination of its extracted features.
  - 7. The method of claim 1, wherein the method includes outputting, as the target query, a candidate translation which satisfies:
    - $$S_{\hat{t}}(\lambda) = \arg\max_{t \in GEN(q)} \lambda \cdot F(t) \qquad (1)$$
      
      where $t \in GEN(q)$ represents a candidate translation generated from a source query $q$, and $\lambda$ represents a set of feature weights learned in training, one weight for each of the features in $F(t)$.
  - 8. The method of claim 1, wherein at least one of the extracted features is based on parts of speech for the candidate translations.
  - 9. The method of claim 8, wherein at least one of the part of speech features is based on a part of speech for an element of the candidate translation and a corresponding part of speech of an element of the input query with which the element of the candidate translation is aligned in the translation, wherein each of the elements comprises at least one word.
  - 10. The method of claim 9, wherein the at least one of the part of speech features is also based on a frequency of the element in the candidate translation as a translation of the element of the input query in a training corpus of bi-sentences, each bi-sentence including a sentence in the source language and a sentence in the target language.
  - 11. The method of claim 1, wherein the translating of the input query with the machine translation system comprises:
    - retrieving a set of biphrases, each biphrase comprising at least one word of the input query in the source language and at least one corresponding word in the target language; and
      
      with a translation scoring model, computing a set of the retrieved biphrases to cover the input query, for each of the set of candidate translations, each candidate translation comprising the corresponding words in the target language forming the set of retrieved biphrases.
  - 12. The method of claim 1, wherein the information retrieval performance of each translated query is determined with respect to at least one of the original query and a reference translation thereof in the target language.
  - 13. The method of claim 1, wherein the method further comprises, prior to receiving the input query, learning feature weights for the features in the set.
  - 14. The method of claim 13, wherein the learning of the feature weights is performed with the margin infused relaxed algorithm.
  - 15. The method of claim 1, wherein at least one of the features is a feature which is not used in generating the translation of the input query.
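
Claims 5 to 7 describe scoring each candidate as a weighted linear combination of its features and selecting a highly ranked candidate. The following is a minimal sketch of that reranking step, not the patented implementation; the feature names (`mt_score`, `domain_lm`, `oov_count`), the weight values, and the candidate strings are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

Features = Dict[str, float]


def score(features: Features, weights: Dict[str, float]) -> float:
    """Weighted linear combination of feature values (dot product of weights and features)."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def rerank(candidates: List[Tuple[str, Features]],
           weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (candidate text, score) pairs sorted from highest to lowest score."""
    scored = [(text, score(feats, weights)) for text, feats in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical learned weights, one per feature.
weights = {"mt_score": 1.0, "domain_lm": 0.5, "oov_count": -2.0}

# Hypothetical candidate translations of one query with their extracted features.
candidates = [
    ("bank of the river", {"mt_score": -1.0, "domain_lm": -4.0, "oov_count": 1.0}),
    ("river bank", {"mt_score": -1.5, "domain_lm": -2.0, "oov_count": 0.0}),
    ("shore bank", {"mt_score": -1.2, "domain_lm": -3.0, "oov_count": 0.0}),
]

ranked = rerank(candidates, weights)
target_query = ranked[0][0]  # the argmax candidate
```

In this toy data the candidate with the best translation-model score is not selected, because the domain-specific features outweigh it; this is the kind of reordering the reranking step is meant to allow.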

16. A domain-specific translation method comprising:
- receiving an input query in a source language;
  
  with a machine translation system that is not adapted to a specific domain, translating the query to generate a set of candidate translations of the query in a target language;
  
  extracting a set of features from each of the candidate translations in the set, the set of features including at least one domain specific feature which is based on a comparison of at least one term in the candidate translation with words in an associated domain-specific corpus of documents in the target language;
  
  scoring each of the candidate translations with a scoring function in which the extracted features are weighted with respective weights, the weights having been learned on features extracted from translated queries, each generated by translation of an original query into the target language, and a measure of information retrieval performance of each of the translated queries for each original query in a set of original queries, the information retrieval performance being assessed on a domain-specific target document collection in which documents in the collection are annotated based on relevance to the original queries; and
  
  outputting a target query based on the scores of the candidate translations, wherein the at least one domain-specific feature is selected from the group consisting of:
  
  a) a language model feature;
  
  b) an out of vocabulary word feature;
  
  c) a query performance predictor which is computed with an equation that correlates with the measure of information retrieval performance; and
  
  combinations thereof.
- Dependent claims (17, 18)
- - 17. The method of claim 16, wherein the at least one domain-specific feature comprises a language model feature which, for a candidate translation, is based on a frequency of occurrence of an n-gram of terms of the candidate translation within the domain-specific corpus, where n is at least one and each of the terms comprises a word of the candidate translation or a lemma form thereof.
  - 18. The method of claim 16, wherein the at least one domain-specific feature comprises an out of vocabulary word feature which, for a candidate translation, is based on a number of terms in the candidate translation that are not present within the domain-specific corpus, where each of the terms comprises a word of the candidate translation or a lemma form thereof.
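
Claims 17 and 18 describe two domain-specific features: a language model feature based on n-gram frequencies in the domain-specific corpus, and an out-of-vocabulary feature counting terms absent from that corpus. A minimal sketch of both follows. It assumes whitespace tokenization, lowercasing, n = 1 (unigrams), and add-one smoothing; none of these choices comes from the claims, and the function names and toy corpus are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable


def build_unigram_counts(corpus_docs: Iterable[str]) -> Counter:
    """Count term occurrences over the domain-specific corpus."""
    counts: Counter = Counter()
    for doc in corpus_docs:
        counts.update(doc.lower().split())
    return counts


def oov_feature(candidate: str, counts: Counter) -> int:
    """Number of candidate terms that never occur in the domain-specific corpus."""
    return sum(1 for term in candidate.lower().split() if term not in counts)


def unigram_lm_feature(candidate: str, counts: Counter) -> float:
    """Log-probability of the candidate under an add-one smoothed unigram model."""
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot reserves mass for unseen terms
    return sum(
        math.log((counts[term] + 1) / (total + vocab_size))
        for term in candidate.lower().split()
    )


# Hypothetical domain-specific corpus.
domain_corpus = ["river bank erosion", "bank erosion control"]
counts = build_unigram_counts(domain_corpus)

oov_in_domain = oov_feature("river bank", counts)
oov_out_of_domain = oov_feature("shore bank", counts)
lm_in_domain = unigram_lm_feature("river bank", counts)
lm_out_of_domain = unigram_lm_feature("shore bank", counts)
```

A candidate whose terms are frequent in the domain corpus receives a higher language model value and a lower out-of-vocabulary count, which the learned weights can then reward. Lemmatization, which the claims allow as an alternative to surface words, is omitted here.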

19. A translation method comprising:
- receiving an input query in a source language;
  
  with a machine translation system, translating the query to generate a set of candidate translations of the query in a target language;
  
  extracting a set of features from each of the candidate translations in the set, the set of features including at least one domain specific feature which is based on a comparison of at least one term in the candidate translation with words in a domain-specific corpus of documents;
  
  scoring each of the candidate translations with a scoring function in which the extracted features are weighted with respective weights, the weights having been learned on features extracted from translated queries, each generated by translation of an original query into the target language, and a measure of information retrieval performance of each of the translated queries, for each original query in a set of original queries; and
  
  outputting a target query based on the scores of the candidate translations, wherein the at least one domain-specific feature comprises a query performance predictor which, for a candidate translation, is based on at least one of:
  
  a) Average Inverse Document frequency, computed according to the expression;

20. A computer program product comprising a non-transitory computer-readable recording medium which stores instructions for performing a translation method, comprising:
- receiving an input query in a source language;
  
  translating the input query with a phrase-based statistical machine translation system to generate a set of candidate translations of the input query in a target language;
  
  extracting a set of features from each of the candidate translations in the set, the set of features including at least one domain specific feature which is based on a comparison of at least one term in the candidate translation with words in a domain-specific corpus of documents;
  
  scoring each of the candidate translations with a scoring function in which the extracted features are weighted with respective weights, the weights having been learned on:
  
  features extracted from translated queries, each of the translated queries having been generated by translation of an original query from a set of original queries into the target language with a machine translation system, and
  
  a measure of information retrieval performance of each of the translated queries, for each original query in the set of original queries, the information retrieval performance of each translated query being based on a relevance score, with respect to the respective original query, for documents in a set of documents that have been retrieved in response to the translated query; and
  
  outputting a target query in the target language based on the scores of the candidate translations.

21. A translation system comprising non-transitory memory which stores instructions for translating an input source language query to generate a set of the candidate translations in a target language and a processor in communication with the memory for executing the instructions, comprising:
- receiving an input query in a source language;
  
  translating the input query with a phrase-based statistical machine translation system to generate a set of candidate translations of the input query in a target language;
  
  extracting a set of features from each of the candidate translations in the set, the set of features including at least one domain specific feature which is based on a comparison of at least one term in the candidate translation with words in a domain-specific corpus of documents;
  
  scoring each of the candidate translations with a scoring function in which the extracted features are weighted with respective weights, the weights having been learned on:
  
  features extracted from translated queries, each of the translated queries having been generated by translation of an original query from a set of original queries into the target language with a machine translation system, and
  
  a measure of information retrieval performance of each of the translated queries, for each original query in the set of original queries, the information retrieval performance of each translated query being based on a relevance score, with respect to the respective original query, for documents in a set of documents that have been retrieved in response to the translated query; and
  
  outputting a target query in the target language based on the scores of the candidate translations.

22. A query translation system comprising:
- a statistical machine translation system including a decoder which receives a source query in a source language and outputs a set of candidate translations in a target language using biphrases extracted from a biphrase library, each of the candidate translations being a translation of the same source query; and
  
  a reranking component which outputs a target query in the target language based on at least one of the candidate translations, the reranking component extracting features of each of the candidate translations and computing a function in which the extracted features are weighted by feature weights, the weights having been learned on features of each of a set of translated queries generated by translation of an original query into the target language and a measure of information retrieval performance of each of the translated queries from a collection of domain-specific documents in which documents in the collection are annotated based on relevance to original queries, for each original query in a set of original queries, at least one of the features comprising a domain-specific feature; and
  
  a processor which implements the reranking component.
- Dependent claims (23, 24, 25)
- - 23. The query translation system of claim 22, wherein the at least one domain-specific feature is based on a comparison of at least one word in the candidate translation with words in a domain-specific corpus of documents.
  - 24. The query translation system of claim 22, wherein the decoder comprises a phrase-based statistical machine translation system.
  - 25. The query translation system of claim 22, wherein the system has access to at least one of:
    - a domain-specific document corpus for computing the features, and statistics extracted from the domain specific document corpus.

26. A method for training a translation system for domain-adapted translation of queries, comprising:
- for each of a set of original queries in a source language:
  
  translating the query to generate a set of translations in a target language;
  
  for each translation in the set of translations, extracting values of features for each of a finite set of features, at least one of the features comprising a domain-specific feature which relies on a domain-specific corpus; and
  
  obtaining a measure of retrieval performance for each translation based on annotations of documents retrieved from a domain-specific corpus with the translation, the document annotations being based on a relevance of each document to original queries in the set of original queries;
  
  learning feature weights for each of the features based on the extracted values of the features and the respective measure of retrieval performance of each translation; and
  
  storing the feature weights for use in translating a new query, different from each of the original queries, from the source language to the target language, whereby candidate translations of the new query are ranked based on their respective extracted values of features and the stored feature weights.
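
The training loop of claim 26 can be sketched as follows. Claim 14 names the margin infused relaxed algorithm for learning the weights; the code below uses a simplified, single-constraint update in that spirit (one oracle translation against the current top-scored translation per query) and should be read as an illustrative stand-in rather than the procedure actually used. The feature names, performance values, and the `C` and `epochs` parameters are hypothetical.

```python
from typing import Dict, List, Tuple

Features = Dict[str, float]
# One entry per original query: (feature values, retrieval performance) for each translation.
QueryTranslations = List[Tuple[Features, float]]


def dot(weights: Dict[str, float], features: Features) -> float:
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def train(examples: List[QueryTranslations], epochs: int = 5, C: float = 1.0) -> Dict[str, float]:
    """Learn feature weights so that translations with better retrieval performance score higher."""
    weights: Dict[str, float] = {}
    for _ in range(epochs):
        for translations in examples:
            # Oracle: the translation with the best measured retrieval performance.
            oracle_feats, oracle_perf = max(translations, key=lambda t: t[1])
            # Prediction: the translation the current weights rank first.
            pred_feats, pred_perf = max(translations, key=lambda t: dot(weights, t[0]))
            loss = oracle_perf - pred_perf
            if loss <= 0:
                continue  # the model already prefers a best-performing translation
            names = set(oracle_feats) | set(pred_feats)
            diff = {n: oracle_feats.get(n, 0.0) - pred_feats.get(n, 0.0) for n in names}
            norm_sq = sum(v * v for v in diff.values())
            if norm_sq == 0:
                continue
            # Smallest step (capped at C) that makes the score margin at least the loss.
            tau = min(C, max(0.0, (loss - dot(weights, diff)) / norm_sq))
            for name, value in diff.items():
                weights[name] = weights.get(name, 0.0) + tau * value
    return weights


# Hypothetical training data: two original queries, two translations each.
examples = [
    [({"mt_score": -1.0, "oov_count": 1.0}, 0.2), ({"mt_score": -1.5, "oov_count": 0.0}, 0.8)],
    [({"mt_score": -0.5, "oov_count": 2.0}, 0.1), ({"mt_score": -0.9, "oov_count": 0.0}, 0.6)],
]
weights = train(examples)
```

The returned dictionary corresponds to the stored feature weights, which would later be applied to candidate translations of new queries.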

Current Assignee: Xerox Corporation (Xerox Holdings Corp.)
Original Assignee: Xerox Corporation (Xerox Holdings Corp.)
Inventors: Nikoulina, Vassilina; Lagos, Nikolaos; Clinchant, Stephane
Primary Examiner(s): Corrielus, Jean M
Application Number: US13/479,648
Time in Patent Office: 488 Days
Field of Search: 707/706, 707/760, 704/4, 704/10, 704/2, 704/7, 704/9
US Class Current: 707/706
CPC Class Codes:
G06F 16/243 Natural language query form...
G06F 40/42 Data-driven translation
