Building A Translation Lexicon From Comparable, Non-Parallel Corpora
Abstract
A machine translation system may use non-parallel monolingual corpora to generate a translation lexicon. The system may identify identically spelled words in the two corpora and use them as a seed lexicon. The system may use various clues, e.g., context and frequency, to identify and score other possible translation pairs, using the seed lexicon as a basis. An alternative system may use a small bilingual lexicon in addition to non-parallel corpora to learn translations of unknown words and to generate a parallel corpus.
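The seed-lexicon step described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names `tokenize` and `build_seed_lexicon`, the regular-expression tokenizer, and the `min_length` filter (used here to drop short, accidental matches such as "in") are all assumptions made for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens (hypothetical tokenizer)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def build_seed_lexicon(first_corpus, second_corpus, min_length=4):
    """Pair each word with itself when it is spelled identically in both corpora.

    min_length is an illustrative assumption: very short identical strings
    are often coincidental rather than true translations.
    """
    first_vocab = set(tokenize(first_corpus))
    second_vocab = set(tokenize(second_corpus))
    shared = first_vocab & second_vocab
    return {word: word for word in shared if len(word) >= min_length}


# Toy, non-parallel German and English texts (invented for this example).
german = "Das Hotel liegt neben dem Restaurant in der Stadt"
english = "The hotel is next to the restaurant in the city"
seed = build_seed_lexicon(german, english)
```

On this toy input the seed lexicon contains "hotel" and "restaurant"; the shared word "in" is excluded only because of the assumed length filter.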
28 Claims
1. A method for building a translation lexicon from non-parallel corpora by a machine translation system, the method comprising:
identifying identically spelled words in a first corpus and a second corpus, the first corpus including words in a first language and the second corpus including words in a second language, wherein the first corpus and the second corpus are non-parallel and are accessed by the machine translation system;
generating a seed lexicon by the machine translation system, the seed lexicon including identically spelled words; and
expanding the seed lexicon by the machine translation system by identifying possible translations of words in the first and second corpora using one or more clues.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
15. A computer readable medium having embodied thereon a program, the program being executable by a processor for performing a method for building a translation lexicon from non-parallel corpora, the method comprising:
identifying identically spelled words in a first corpus and a second corpus, the first corpus including words in a first language and the second corpus including words in a second language, wherein the first corpus and the second corpus are non-parallel and are accessed by the machine translation system;
generating a seed lexicon by the machine translation system, the seed lexicon including identically spelled words; and
expanding the seed lexicon by the machine translation system by identifying possible translations of words in the first and second corpora using one or more clues.
Dependent claims: 16, 17, 18, 19, 20, 21, 22, 23
24. An apparatus comprising:
a word comparator operative to be executed to identify identically spelled words in a first corpus and a second corpus and build a seed lexicon including said identically spelled words, the first corpus including words in a first language and the second corpus including words in a second language, the first corpus and the second corpus are not parallel; and
a lexicon builder operative to be executed to expand the seed lexicon by identifying possible translations of words in the first and second corpora using one or more clues.
Dependent claims: 25
26. The apparatus of claim 24, further comprising a matching module operative to be executed to match strings in the two non-parallel corpora to generate a parallel corpus including the matched strings as translation pairs.
27. The apparatus of claim 26, the apparatus comprising:
an alignment module operative to be executed to align text segments in two non-parallel corpora, the corpora including a source language corpus and a target language corpus; and
Dependent claims: 28
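The expanding step recited in claim 1 leaves the "one or more clues" open. As a hedged illustration only, the Python sketch below uses a single context clue: each unknown word is described by how often seed-lexicon words occur near it, and words from the two corpora are paired when those context profiles are similar. The names `context_vector`, `cosine`, and `expand_lexicon`, the window size, the cosine measure, and the `threshold` value are all assumptions, not details taken from the claims.

```python
import math
from collections import Counter


def context_vector(tokens, word, seed_words, window=2):
    """Count seed words occurring within `window` tokens of each occurrence of `word`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != word:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for neighbor in tokens[lo:i] + tokens[i + 1:hi]:
            if neighbor in seed_words:
                counts[neighbor] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def expand_lexicon(first_tokens, second_tokens, seed, threshold=0.5):
    """Add the best-scoring candidate pair for each unknown first-language word."""
    seed_first, seed_second = set(seed), set(seed.values())
    lexicon = dict(seed)
    for word in sorted(set(first_tokens) - seed_first):
        # Map the first-language context onto second-language seed words.
        raw = context_vector(first_tokens, word, seed_first)
        vec = Counter({seed[k]: v for k, v in raw.items()})
        best, best_score = None, threshold
        for cand in sorted(set(second_tokens) - seed_second):
            score = cosine(vec, context_vector(second_tokens, cand, seed_second))
            if score > best_score:
                best, best_score = cand, score
        if best is not None:
            lexicon[word] = best
    return lexicon


# Toy token streams and seed lexicon (invented for this example).
seed = {"hotel": "hotel", "taxi": "taxi"}
first = "hotel zimmer hotel taxi fahrer taxi".split()
second = "hotel room hotel taxi driver taxi".split()
lexicon = expand_lexicon(first, second, seed)
```

On this toy data "zimmer" pairs with "room" and "fahrer" with "driver", because each pair shares the same seed-word neighborhood. The sketch does not enforce one-to-one pairings and omits the frequency clue mentioned in the abstract; a fuller scorer would presumably combine several clues.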
Specification