Building A Translation Lexicon From Comparable, Non-Parallel Corpora

US 20100042398A1
Filed: 10/08/2009
Published: 02/18/2010
Est. Priority Date: 03/26/2002
Status: Active Grant

2 Assignments

0 Petitions

Abstract

A machine translation system may use non-parallel monolingual corpora to generate a translation lexicon. The system may identify identically spelled words in the two corpora, and use them as a seed lexicon. The system may use various clues, e.g., context and frequency, to identify and score other possible translation pairs, using the seed lexicon as a basis. An alternative system may use a small bilingual lexicon in addition to non-parallel corpora to learn translations of unknown words and to generate a parallel corpus.
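
The seed step described in the abstract can be sketched as a set intersection over the two vocabularies. This is a minimal illustration, not the patented implementation: the function names, the regular-expression tokenizer, and the `min_length` filter (meant to skip short words that may coincide by accident) are all assumptions introduced here.

```python
import re

def vocabulary(text, min_length=1):
    """Return the set of lowercased alphabetic tokens in a corpus.

    The tokenizer is an assumption; the source does not specify one.
    """
    words = re.findall(r"[^\W\d_]+", text.lower())
    return {w for w in words if len(w) >= min_length}

def seed_lexicon(corpus_a, corpus_b, min_length=4):
    """Pair each word with itself when it occurs in both corpora.

    min_length is a hypothetical filter, not part of the claims.
    """
    shared = vocabulary(corpus_a, min_length) & vocabulary(corpus_b, min_length)
    return {w: w for w in sorted(shared)}

# Two tiny, non-parallel example "corpora" (invented for illustration).
english = "The new computer system uses a modern internet protocol."
german = "Das neue System nutzt ein modernes Internet Protokoll mit Computer."

seed = seed_lexicon(english, german)
print(seed)
```

On these two sentences the seed contains only `computer`, `internet`, and `system`; near matches such as `protocol`/`Protokoll` are left for the later expansion step.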

139 Citations

28 Claims

1. A method for building a translation lexicon from non-parallel corpora by a machine translation system, the method comprising:
- identifying identically spelled words in a first corpus and a second corpus, the first corpus including words in a first language and the second corpus including words in a second language, wherein the first corpus and the second corpus are non-parallel and are accessed by the machine translation system;
  
  generating a seed lexicon by the machine translation system, the seed lexicon including identically spelled words; and
  
  expanding the seed lexicon by the machine translation system by identifying possible translations of words in the first and second corpora using one or more clues.
  - 2. The method of claim 1 wherein said expanding comprises using the identically spelled words in the seed lexicon as accurate translations.
  - 3. The method of claim 1, further comprising:
    - identifying substantially identical words in the first and second corpora; and
      
      adding said substantially identical words to the seed lexicon.
  - 4. The method of claim 3, wherein said identifying substantially identical words comprises:
    - applying transformation rules to words in the first corpora to form transformed words; and
      
      comparing said transformed words to words in the second corpora.
  - 5. The method of claim 1, wherein said one or more clues includes similar spelling.
  - 6. The method of claim 1, wherein said identifying comprises identifying cognates.
  - 7. The method of claim 1, wherein said identifying comprises identifying word pairs having a minimum longest common subsequence ratio.
  - 8. The method of claim 1, wherein said one or more clues includes similar context.
  - 9. The method of claim 1, wherein said identifying comprises:
    - identifying a plurality of context words; and
      
      identifying a frequency of context words in an n-word window around a target word.
  - 10. The method of claim 9, further comprising generating a context vector.
  - 11. The method of claim 1, wherein said identifying comprises identifying frequencies of occurrence of words in the first and second corpora.
  - 12. The method of claim 1, further comprising:
    - generating matching scores for each of a plurality of clues.
  - 13. The method of claim 12, further comprising adding the matching scores.
  - 14. The method of claim 13, further comprising weighting the matching scores.
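
Claims 7 and 12 to 14 can be read together as: compute a matching score per clue, weight the scores, and add them. The sketch below assumes the commonly used definition of the longest common subsequence ratio (LCS length divided by the length of the longer word), which the claims themselves do not spell out. The clue names, the weights, and the context and frequency scores are placeholder values chosen for illustration only.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcsr(a, b):
    """Longest common subsequence ratio: LCS length over the longer word's length."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))

def combined_score(clue_scores, weights):
    """Weighted sum of per-clue matching scores (clue names are hypothetical)."""
    return sum(weights[name] * score for name, score in clue_scores.items())

# Spelling clue computed from the words; the other two scores are made-up placeholders.
clue_scores = {
    "spelling": lcsr("haus", "house"),
    "context": 0.5,
    "frequency": 0.8,
}
weights = {"spelling": 0.5, "context": 0.3, "frequency": 0.2}

total = combined_score(clue_scores, weights)
print(round(clue_scores["spelling"], 2), round(total, 2))
```

A minimum-ratio test in the sense of claim 7 would then be a comparison such as `lcsr(a, b) >= threshold`, with the threshold left as a tuning choice.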

15. A computer readable medium having embodied thereon a program, the program being executable by a processor for performing a method for building a translation lexicon from non-parallel corpora, the method comprising:
- identifying identically spelled words in a first corpus and a second corpus, the first corpus including words in a first language and the second corpus including words in a second language, wherein the first corpus and the second corpus are non-parallel and are accessed by the machine translation system;
  
  generating a seed lexicon by the machine translation system, the seed lexicon including identically spelled words; and
  
  expanding the seed lexicon by the machine translation system by identifying possible translations of words in the first and second corpora using one or more clues.
  - 16. The computer readable medium of claim 15 wherein said expanding comprises using the identically spelled words in the seed lexicon as accurate translations.
  - 17. The computer readable medium of claim 15, further comprising:
    - identifying substantially identical words in the first and second corpora; and
      
      adding said substantially identical words to the seed lexicon.
  - 18. The computer readable medium of claim 17, wherein said identifying substantially identical words comprises:
    - applying transformation rules to words in the first corpora to form transformed words; and
      
      comparing said transformed words to words in the second corpora.
  - 19. The computer readable medium of claim 15, wherein said one or more clues includes similar spelling.
  - 20. The computer readable medium of claim 15, wherein said identifying comprises identifying cognates.
  - 21. The computer readable medium of claim 15, wherein said identifying comprises identifying word pairs having a minimum longest common subsequence ratio.
  - 22. The computer readable medium of claim 15, wherein said one or more clues includes similar context.
  - 23. The computer readable medium of claim 15, wherein said identifying comprises:
    - identifying a plurality of context words; and
      
      identifying a frequency of context words in an n-word window around a target word.
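
The context clue in claims 9, 10, 22 and 23 (counting context words in an n-word window around a target word, then forming a context vector) can be sketched as follows. The function name, the whitespace tokenization, and the example sentence are assumptions for illustration; one plausible reading of the abstract is that the context words would be drawn from the seed lexicon so that vectors from the two languages share dimensions, but the claims do not require that.

```python
from collections import Counter

def context_vector(tokens, target, context_words, window=2):
    """Count how often each context word occurs within `window` tokens of `target`.

    Returns the counts in the order given by `context_words`.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        # Tokens to the left and right of this occurrence, excluding the target itself.
        neighbours = tokens[lo:i] + tokens[i + 1:i + 1 + window]
        counts.update(w for w in neighbours if w in context_words)
    return [counts[w] for w in context_words]

# Invented example sentence, split on whitespace.
tokens = "the dog chased the cat and the dog barked at the cat".split()
context_words = ["the", "cat", "barked"]

vec = context_vector(tokens, "dog", context_words, window=2)
print(vec)
```

Vectors built this way for a word in each corpus could then be compared with any vector similarity measure to produce the context matching score.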

24. An apparatus comprising:
- a word comparator operative to be executed to identify identically spelled words in a first corpus and a second corpus and build a seed lexicon including said identically spelled words, the first corpus including words in a first language and the second corpus including words in a second language, the first corpus and the second corpus are not parallel; and
  
  a lexicon builder operative to be executed to expand the seed lexicon by identifying possible translations of words in the first and second corpora using one or more clues.
  - 25. The apparatus of claim 24, wherein the lexicon builder is configured to use the identically spelled words in the seed lexicon as accurate translations.

26. The apparatus of claim 24, further comprising a matching module operative to be executed to match strings in the two non-parallel corpora to generate a parallel corpus including the matched strings as translation pairs.

27. The apparatus of claim 26, the apparatus comprising:
- an alignment module operative to be executed to align text segments in two non-parallel corpora, the corpora including a source language corpus and a target language corpus; and
  - 28. The apparatus of claim 27, wherein the aligning module is operative to build a Bilingual Suffix Tree from a text segment from one of said two non-parallel corpora.

Current Assignee
University of Southern California
Original Assignee
University of Southern California
Inventors
Koehn, Philipp; Knight, Kevin; Munteanu, Dragos Stefan; Marcu, Daniel

Granted Patent

US 8,234,106 B2
US Class Current

704/2
CPC Class Codes

G06F 40/242 Dictionaries
