Named entity translation

US 7,249,013 B2
Filed: 03/11/2003
Issued: 07/24/2007
Est. Priority Date: 03/11/2002
Status: Active Grant

Abstract

Translating named entities from a source language to a target language. In general, in one implementation, the technique includes: generating potential translations of a named entity from a source language to a target language using a pronunciation-based and spelling-based transliteration model, searching a monolingual resource in the target language for information relating to usage frequency, and providing output including at least one of the potential translations based on the usage frequency information.

116 Citations

27 Claims

1. A method comprising:
- obtaining a named entity from text input of a source language;
  
  generating potential translations of the named entity from the source language to a target language using a pronunciation-based and spelling-based transliteration model using a first probabilistic model to generate words in the target language and first transliteration scores for the words based on language pronunciation characteristics, using a second probabilistic model to generate second transliteration scores for the words based on a mapping of letter sequences from the target language into the source language, and combining the first transliteration scores and the second transliteration scores into third transliteration scores for the words;
  
  searching a monolingual resource in the target language for information relating to usage frequency; and
  
  providing output comprising at least one of the potential translations based on the usage frequency information.
- - 2. The method of claim 1, wherein:
    - using the first probabilistic model comprises generating at least a portion of the words according to unigram probabilities P(w), generating phoneme sequences corresponding to the words with pronunciation probabilities P(e|w) and converting the phoneme sequences into the source language with conversion probabilities P(a|e), the first transliteration scores being governed by $P_{p}(w | a) \cong \sum_{\forall e} P(w) P(e | w) P(a | e)$; and
    - using the second probabilistic model comprises generating letters in the source language for the words using the letter sequences mapping with probabilities P(a|w), and generating at least a portion of the words according to a letter trigram model with extended probabilities P(w), the second transliteration scores being governed by $P_{s}(w | a) \cong \sum_{\forall e} P(w) P(a | w)$.
  - 3. The method of claim 2, wherein combining the first transliteration scores and the second transliteration scores comprises calculating a linear combination, the third transliteration scores being governed by $P(w | a) = \lambda P_{s}(w | a) + (1 - \lambda) P_{p}(w | a)$.
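
As a rough illustration of the scoring recited in claims 2 and 3, the sketch below computes a pronunciation-based score by summing over phoneme sequences, a spelling-based score, and their linear combination. All tables, candidate words, the romanized source string `klyntwn`, and the default λ are invented for illustration and do not come from the patent. Because the spelling-based summand does not depend on the phoneme sequence, the sketch evaluates it as the plain product P(w)P(a|w), which is an assumption about how that expression is meant to be read.

```python
# Toy probability tables. Every word, string, and number below is hypothetical.
P_UNIGRAM = {"clinton": 0.6, "klinten": 0.4}        # P(w) from a word unigram model
P_LETTER_LM = {"clinton": 0.5, "klinten": 0.5}      # P(w) from a letter trigram model (stand-in values)
P_PHONEMES = {                                       # P(e | w): pronunciations of each word
    "clinton": {"K L IH N T AH N": 1.0},
    "klinten": {"K L IH N T EH N": 0.5, "K L IH N T AH N": 0.5},
}
P_SOURCE_GIVEN_PHONEMES = {                          # P(a | e): phonemes to source string
    "K L IH N T AH N": {"klyntwn": 0.2},
    "K L IH N T EH N": {"klyntwn": 0.1},
}
P_SOURCE_GIVEN_WORD = {                              # P(a | w): letter-sequence mapping
    "clinton": {"klyntwn": 0.3},
    "klinten": {"klyntwn": 0.1},
}


def pronunciation_score(w, a):
    """P_p(w|a): sum over phoneme sequences e of P(w) * P(e|w) * P(a|e)."""
    return sum(
        P_UNIGRAM[w] * p_e * P_SOURCE_GIVEN_PHONEMES.get(e, {}).get(a, 0.0)
        for e, p_e in P_PHONEMES[w].items()
    )


def spelling_score(w, a):
    """P_s(w|a), evaluated here as P(w) * P(a|w)."""
    return P_LETTER_LM[w] * P_SOURCE_GIVEN_WORD[w].get(a, 0.0)


def combined_score(w, a, lam=0.5):
    """P(w|a) = lam * P_s(w|a) + (1 - lam) * P_p(w|a)."""
    return lam * spelling_score(w, a) + (1.0 - lam) * pronunciation_score(w, a)


if __name__ == "__main__":
    for word in P_UNIGRAM:
        print(word, combined_score(word, "klyntwn"))
```

Setting `lam` closer to 1 weights the spelling-based model more heavily, and closer to 0 favors the pronunciation-based model.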

4. A method comprising:
- obtaining a named entity from text input of a source language by obtaining phrase boundaries of the named entity and by obtaining a category of the named entity;
  
  generating potential translations of the named entity from the source language to a target language using a pronunciation-based and spelling-based transliteration model, and selectively using a bilingual resource based on the category of the named entity;
  
  searching a monolingual resource in the target language for information relating to usage frequency; and
  
  providing output comprising at least one of the potential translations based on the usage frequency information.
- - 5. The method of claim 4, wherein selectively using the bilingual resource comprises:
    - if the category comprises an organization or location name, translating one or more words in the named entity using a bilingual dictionary, transliterating the one or more words in the named entity using the pronunciation-based and spelling-based transliteration model, combining the translated one or more words with the transliterated one or more words into a regular expression defining available permutations of the translated one or more words and the transliterated one or more words, and matching the regular expression against a monolingual resource in the target language.
  - 6. The method of claim 5, wherein combining the translated one or more words with the transliterated one or more words comprises combining the translated one or more words with n-best transliterations of the transliterated one or more words.
  - 7. The method of claim 5, wherein matching the regular expression against the monolingual resource comprises generating scores for the potential translations according to:
    - $\begin{aligned} P(e | f) &= \alpha \sum_{\forall a} P(e, a | f) \\ &= \alpha \sum_{a_{1} = 0}^{l} \dots \sum_{a_{m} = 0}^{l} \prod_{j = 0}^{m} t(e_{a_{j}} | f_{j}) \end{aligned}$ where f is a phrase from the potential translations, e is a given word from the translated and transliterated words, l is the length of e, m is the length of f, α is a scaling factor based on a number of found matches for e, $a_{j}$ is an index of the target language word aligned with $f_{j}$ according to an alignment a, and probability $t(e_{a_{j}} | f_{j})$ is a linear combination of a transliteration score and a translation score, where the translation score is a uniform probability over all dictionary entries for $f_{j}$.
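
The steps in claims 5 to 7 can be sketched as follows, under several assumptions. `permutation_regex` builds one regular expression covering every ordering of the per-word alternatives (translations plus transliterations). `alignment_score` evaluates the sum over alignments by brute force, and `alignment_score_factored` uses the fact that such a sum of products factors into a product of per-word sums. The sketch reads e and f as word lists, uses plain 0-based indexing with no empty-word position, and takes an equal-weight mix for t; the function names, the toy dictionary, the romanized source words, and all numbers are hypothetical.

```python
import itertools
import re


def permutation_regex(word_options):
    """word_options holds one list of alternatives per source word."""
    groups = ["(?:" + "|".join(re.escape(o) for o in opts) + ")" for opts in word_options]
    orders = [r"\s+".join(p) for p in itertools.permutations(groups)]
    return re.compile(r"\b(?:" + "|".join(orders) + r")\b", re.IGNORECASE)


def make_t(dictionary, translit, mix=0.5):
    """t(e|f): linear mix of a transliteration score and a uniform dictionary score."""
    def t(e, f):
        entries = dictionary.get(f, [])
        translation = 1.0 / len(entries) if e in entries else 0.0
        return mix * translit.get((e, f), 0.0) + (1.0 - mix) * translation
    return t


def alignment_score(e_words, f_words, t, alpha=1.0):
    """alpha * sum over all alignments a of prod_j t(e[a_j] | f[j])."""
    total = 0.0
    for alignment in itertools.product(range(len(e_words)), repeat=len(f_words)):
        prod = 1.0
        for j, i in enumerate(alignment):
            prod *= t(e_words[i], f_words[j])
        total += prod
    return alpha * total


def alignment_score_factored(e_words, f_words, t, alpha=1.0):
    """Same value, computed as alpha * prod_j sum_i t(e[i] | f[j])."""
    result = alpha
    for f in f_words:
        result *= sum(t(e, f) for e in e_words)
    return result


if __name__ == "__main__":
    pattern = permutation_regex([["bank"], ["riyadh", "riyad"]])
    print(pattern.search("The Riyadh Bank opened a branch.").group(0))

    t = make_t({"bnk": ["bank", "shore"]},
               {("riyadh", "alryad"): 0.8, ("bank", "bnk"): 0.2})
    print(alignment_score(["riyadh", "bank"], ["bnk", "alryad"], t))
```

The brute-force version enumerates every alignment, so the factored form is the practical choice for anything beyond very short phrases. The permutation regex also grows factorially with the number of words, which is tolerable only because named entities are short.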

8. A method comprising:
- obtaining a named entity from text input of a source language;
  
  generating potential translations of the named entity from the source language to a target language using a pronunciation-based and spelling-based transliteration model;
  
  searching a monolingual resource in the target language for information relating to usage frequency; and
  
  providing output comprising at least one of the potential translations based on the usage frequency information and adjusting probability scores of the potential translations based on the usage frequency, wherein adjusting the probability scores comprises comparing the named entity with other named entities of a common type in the text input and, if the named entity is a sub-phrase of one of the other named entities, adjusting the probability scores based on normalized full-phrase hit counts corresponding to the one other named entity.
- - 9. The method of claim 8, wherein providing the output further comprises selecting a translation of the named entity from the potential translations based on the adjusted probability scores.
  - 10. The method of claim 8, wherein providing the output further comprises selecting a list of likely translations of the named entity from the potential translations based on the adjusted probability scores and a threshold.
  - 11. The method of claim 8, wherein the usage frequency information comprises normalized full-phrase hit counts for the potential translations in the monolingual resource, and adjusting the probability scores comprises multiplying the probability scores by the normalized full-phrase hit counts for the potential translations.
  - 12. The method of claim 8, further comprising identifying contextual information in the text input, and wherein searching the monolingual resource comprises searching multiple documents for the potential translations in conjunction with the contextual information to obtain the usage frequency information.
  - 13. The method of claim 8, wherein searching the monolingual resource comprises searching multiple documents available over a communications network.
  - 14. The method of claim 13, wherein the multiple documents comprise news stories in the target language.
  - 15. The method of claim 14, wherein the target language is English.
  - 16. The method of claim 15, wherein the source language is Arabic.
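
A minimal sketch of the re-scoring in claims 9 to 11 follows. It assumes that "normalized" means dividing each candidate's full-phrase hit count by the total across candidates, which the claims do not spell out. The function names, candidate spellings, scores, and hit counts are hypothetical.

```python
def normalize(counts):
    """Scale hit counts so they sum to 1 (all zeros if nothing was found)."""
    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


def rescore_by_hit_counts(scores, hit_counts):
    """Multiply each candidate's probability score by its normalized hit count."""
    norm = normalize({c: hit_counts.get(c, 0) for c in scores})
    return {c: scores[c] * norm[c] for c in scores}


def select_best(adjusted):
    """Claim 9 style: pick the single highest-scoring candidate."""
    return max(adjusted, key=adjusted.get)


def select_likely(adjusted, threshold):
    """Claim 10 style: keep every candidate at or above the threshold, best first."""
    ranked = sorted(adjusted.items(), key=lambda kv: kv[1], reverse=True)
    return [c for c, s in ranked if s >= threshold]


if __name__ == "__main__":
    scores = {"Clinton": 0.4, "Klinton": 0.5, "Clintun": 0.1}   # transliteration scores
    hits = {"Clinton": 900, "Klinton": 100}                     # full-phrase hit counts
    adjusted = rescore_by_hit_counts(scores, hits)
    print(select_best(adjusted), select_likely(adjusted, 0.04))
```

In this toy input the transliteration model alone prefers one spelling, while the usage frequency in the monolingual resource pushes another spelling to the top.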

17. A method comprising:
- obtaining a named entity from text input of a source language;
  
  identifying contextual information in the text input;
  
  generating potential translations of the named entity from the source language to a target language using a pronunciation-based and spelling-based transliteration model;
  
  by discovering documents in the target language that include the contextual information, identifying named entities in the documents, generating transliteration scores for the named entities in the documents, in relation to the named entity in the text input, using a probabilistic model that uses language pronunciation characteristics and a mapping of letter sequences from the target language into the source language, and adding the scored named entities to the potential translations;
  
  searching a monolingual resource in the target language for information relating to usage frequency; and
  
  providing output comprising at least one of the potential translations based on the usage frequency information.

18. A method comprising:
- obtaining a named entity from text input of a source language;
  
  generating potential translations of the named entity from the source language to a target language using a pronunciation-based and spelling-based transliteration model by generating phrases in the target language and corresponding transliteration scores with a probabilistic model that uses language pronunciation characteristics and a mapping of letter sequences from the target language into the source language, the potential translations comprising the scored phrases, identifying sub-phrases in the generated phrases, discovering documents in the target language using the sub-phrases, identifying, in the discovered documents, named entities that include one or more of the sub-phrases, generating transliteration scores for the identified named entities in the discovered documents using the probabilistic model, and adding the scored named entities to the potential translations;
  
  searching a monolingual resource in the target language for information relating to usage frequency; and
  
  providing output comprising at least one of the potential translations based on the usage frequency information.
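
The candidate expansion recited in claim 18 might look roughly like the sketch below. It replaces document discovery with a scan over an in-memory list, assumes named entities in each document have already been identified upstream, and takes the transliteration scorer as a caller-supplied function. `sub_phrases`, `contains_phrase`, `expand_candidates`, and the sample data are all hypothetical names and values.

```python
import re


def _words(text):
    """Lowercased word tokens, punctuation dropped."""
    return re.findall(r"\w+", text.lower())


def sub_phrases(phrase):
    """All contiguous word sequences of a phrase, including the phrase itself."""
    words = phrase.split()
    return {
        " ".join(words[i:j])
        for i in range(len(words))
        for j in range(i + 1, len(words) + 1)
    }


def contains_phrase(text, phrase):
    """True if the words of phrase occur contiguously in text."""
    t, p = _words(text), _words(phrase)
    return any(t[i:i + len(p)] == p for i in range(len(t) - len(p) + 1))


def expand_candidates(candidates, documents, score_fn):
    """candidates maps phrase -> score; documents is a list of (text, entities)."""
    probes = set()
    for phrase in candidates:
        probes |= sub_phrases(phrase)
    expanded = dict(candidates)
    for text, entities in documents:
        # "Discover" a document only if it mentions at least one sub-phrase.
        if not any(contains_phrase(text, p) for p in probes):
            continue
        for entity in entities:
            if entity not in expanded and any(contains_phrase(entity, p) for p in probes):
                expanded[entity] = score_fn(entity)
    return expanded


if __name__ == "__main__":
    docs = [
        ("Reporter John Smith spoke with Smith Barney analysts.", ["John Smith", "Smith Barney"]),
        ("Weather is mild today.", ["Weather Service"]),
    ]
    print(expand_candidates({"Jon Smith": 0.3}, docs, score_fn=lambda entity: 0.1))
```

The same loop could serve the context-driven variant in claim 17 by using contextual terms from the text input as the probes instead of sub-phrases, though that substitution is an assumption of this sketch.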

19. A system comprising:
- an input/output (I/O) system comprising a network interface configured to provide access to a monolingual resource;
  
  a potential translations generator coupled with the I/O system, the potential translations generator incorporating a combined pronunciation-based and spelling-based transliteration model used to generate translation candidates for a named entity;
  
  a re-ranker module configured to adjust scores of the translation candidates based on usage frequency information discovered in the monolingual resource using the network interface; and
  
  a bilingual resource, wherein the potential translations generator selectively uses the bilingual resource based on a category of the named entity.
- - 20. The system of claim 19, wherein the potential translations generator comprises:
    - a person entity handling module;
      
      a location and organization entity handling module that accesses the bilingual resource; and
      
      a re-matcher module that accesses a news corpus to generate scores for translation candidates generated by the location and organization entity handling module.
  - 21. The system of claim 19, wherein the re-ranker module incorporates multiple separate re-scoring modules that apply different re-scoring factors.
  - 22. The system of claim 19, wherein the re-ranker module adjusts scores of the translation candidates based at least in part on context information corresponding to the named entity.
  - 23. The system of claim 19, wherein the potential translations generator generates the translation candidates based at least in part on context information corresponding to the named entity.
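
One way to picture the system of claims 19 to 21 is sketched below: a generator that routes a named entity to a handler by category, and a re-ranker that chains several independent re-scoring modules. The class names, the category labels, and the callable-based interfaces are assumptions made for this sketch, not the patent's design.

```python
from typing import Callable, Dict, List

Scores = Dict[str, float]
Handler = Callable[[str], Scores]
Rescorer = Callable[[Scores], Scores]


class PotentialTranslationsGenerator:
    """Routes an entity to a category-specific handler."""

    def __init__(self, person_handler: Handler, loc_org_handler: Handler):
        self.person_handler = person_handler
        self.loc_org_handler = loc_org_handler  # the handler that would consult the bilingual resource

    def generate(self, entity: str, category: str) -> Scores:
        if category in ("LOCATION", "ORGANIZATION"):
            return self.loc_org_handler(entity)
        return self.person_handler(entity)


class ReRanker:
    """Applies each re-scoring module in turn, then sorts candidates best first."""

    def __init__(self, rescorers: List[Rescorer]):
        self.rescorers = rescorers

    def rerank(self, scores: Scores) -> List[str]:
        for rescorer in self.rescorers:
            scores = rescorer(scores)
        return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    generator = PotentialTranslationsGenerator(
        person_handler=lambda e: {e.title(): 0.6, e.upper(): 0.4},
        loc_org_handler=lambda e: {e.title() + " Bank": 1.0},
    )
    candidates = generator.generate("clinton", "PERSON")
    boost_upper = lambda s: {k: v * (2.0 if k.isupper() else 1.0) for k, v in s.items()}
    print(ReRanker([boost_upper]).rerank(candidates))
```

Keeping each re-scoring factor as its own callable makes it easy to add, remove, or reorder factors without touching the generator.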

24. A system comprising:
- an input/output (I/O) system; and
  
  a potential translations generator coupled with the I/O system, the potential translations generator incorporating a combined pronunciation-based and spelling-based transliteration model used to generate translation candidates for a named entity based at least in part on sub-phrases identified in an initial set of translation candidates.
- - 25. The system of claim 24, wherein the potential translations generator generates the translation candidates based at least in part on context information corresponding to the named entity.

26. A system comprising:
- means for generating potential translations of a named entity from a source language to a target language using spelling-based transliteration, the means for generating comprising means for selectively using a bilingual dictionary and a news corpus; and
  
  means for adjusting probability scores of the generated potential translations based on usage frequency information discovered in a monolingual resource.
- - 27. The system of claim 26, wherein the means for adjusting comprises means for re-ranking the potential translations based on context information and identified sub-phrases of the potential translations.

Current Assignee: University of Southern California
Original Assignee: University of Southern California
Inventors: Al-Onaizan, Yaser; Knight, Kevin
Primary Examiner(s): Hudspeth; David
Assistant Examiner(s): Albertalli; Brian
Application Number: US10/387,032
Publication Number: US 20030191626A1
Time in Patent Office: 1,596 Days
Field of Search: None
US Class Current: 704/9
CPC Class Codes:
G06F 40/129   Handling non-Latin characte...
G06F 40/295   Named entity recognition
G06F 40/44   Statistical methods, e.g. p...
G06F 40/45   Example-based machine trans...
G06F 40/49   using very large corpora, e...
G06F 40/53   Processing of non-Latin tex...
