Constructing a translation lexicon from comparable, non-parallel corpora
Abstract
A machine translation system may use non-parallel monolingual corpora to generate a translation lexicon. The system may identify identically spelled words in the two corpora, and use them as a seed lexicon. The system may use various clues, e.g., context and frequency, to identify and score other possible translation pairs, using the seed lexicon as a basis. An alternative system may use a small bilingual lexicon in addition to non-parallel corpora to learn translations of unknown words and to generate a parallel corpus.
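The seed step described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the `tokenize` helper, the `build_seed_lexicon` name, and the `min_length` filter (meant to screen out short accidental matches) are assumptions introduced here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens (a simplification)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def build_seed_lexicon(corpus_a, corpus_b, min_length=4):
    """Pair each word with itself when it is spelled identically in both corpora.

    `min_length` is a hypothetical filter, not taken from the source text.
    """
    vocab_a = Counter(tokenize(corpus_a))
    vocab_b = Counter(tokenize(corpus_b))
    shared = vocab_a.keys() & vocab_b.keys()
    return {word: word for word in shared if len(word) >= min_length}


german = "Das Hotel hat ein Restaurant und ein Taxi wartet am Hotel."
english = "The hotel has a restaurant and a taxi waits at the hotel."
seed = build_seed_lexicon(german, english)
```

With these two toy sentences, the shared spellings are the only entries that end up in `seed`; everything else is left for the clue-based expansion step.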
30 Claims
1. A method for building a translation lexicon from non-parallel corpora, the method comprising:
identifying identically spelled words in a first corpus and a second corpus, the first corpus including words in a first language and the second corpus including words in a second language;
generating a seed lexicon including identically spelled words; and
expanding the seed lexicon by identifying possible translations of words in the first and second corpora using one or more clues.
Dependent claims: 2-14.
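The expansion step in claim 1 can be illustrated with two of the clues the abstract names, context and frequency. The sketch below is a hedged example under stated assumptions: the function names, the fixed token window, the cosine measure, and the choice to multiply the two clue scores are all illustrative and are not taken from the claims.

```python
import math
from collections import Counter, defaultdict


def context_vectors(tokens, seed_words, window=2):
    """For each non-seed word, count the seed words seen within `window` tokens."""
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        if word in seed_words:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] in seed_words:
                vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def frequency_similarity(count_a, total_a, count_b, total_b):
    """Ratio of relative frequencies in (0, 1]; 1 means equally frequent."""
    fa, fb = count_a / total_a, count_b / total_b
    return min(fa, fb) / max(fa, fb)


def rank_candidates(tokens_a, tokens_b, seed_words, window=2):
    """Score every cross-language word pair; combining by product is an assumption."""
    vec_a = context_vectors(tokens_a, seed_words, window)
    vec_b = context_vectors(tokens_b, seed_words, window)
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    scored = []
    for wa, va in vec_a.items():
        for wb, vb in vec_b.items():
            score = cosine(va, vb) * frequency_similarity(
                freq_a[wa], len(tokens_a), freq_b[wb], len(tokens_b))
            if score > 0:
                scored.append((score, wa, wb))
    return sorted(scored, reverse=True)


ranked = rank_candidates("hotel zimmer und wagen taxi".split(),
                         "hotel room and car taxi".split(),
                         seed_words={"hotel", "taxi"}, window=1)
```

Because the seed words are spelled the same in both languages, they serve as shared dimensions, so context vectors built from the two corpora can be compared directly. High-scoring pairs could then be added to the lexicon and the process repeated.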
15. A method for generating parallel corpora from non-parallel corpora, the method comprising:
aligning text segments in two non-parallel corpora, the corpora including a source language corpus and a target language corpus;
matching strings in the two non-parallel corpora; and
generating a parallel corpus including the matched strings as translation pairs.
Dependent claims: 16-22.
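The matching and generation steps of claim 15 can be pictured with a deliberately simple sketch. It assumes the segments have already been aligned, and it relies on a small bilingual lexicon of the kind the abstract mentions. The word-for-word, same-order matching rule and the names `match_strings` and `build_parallel_corpus` are assumptions made for illustration; the claims do not specify how strings are matched.

```python
def match_strings(source_tokens, target_tokens, lexicon, min_len=2):
    """Find runs of source words whose lexicon translations appear
    contiguously, in the same order, in the target segment.

    `lexicon` maps a source word to a set of target words (hypothetical format).
    """
    pairs = []
    i = 0
    while i < len(source_tokens):
        best_len, best_j = 0, -1
        for j in range(len(target_tokens)):
            k = 0
            while (i + k < len(source_tokens) and j + k < len(target_tokens)
                   and target_tokens[j + k] in lexicon.get(source_tokens[i + k], ())):
                k += 1
            if k > best_len:
                best_len, best_j = k, j
        if best_len >= min_len:
            pairs.append((" ".join(source_tokens[i:i + best_len]),
                          " ".join(target_tokens[best_j:best_j + best_len])))
            i += best_len  # continue after the matched run
        else:
            i += 1
    return pairs


def build_parallel_corpus(aligned_segments, lexicon, min_len=2):
    """Collect matched strings from every aligned segment pair as translation pairs."""
    corpus = []
    for source, target in aligned_segments:
        corpus.extend(match_strings(source.split(), target.split(), lexicon, min_len))
    return corpus


lexicon = {"das": {"the"}, "neue": {"new"}, "hotel": {"hotel"}, "ist": {"is"}}
segments = [("das neue hotel ist teuer", "yesterday the new hotel opened")]
parallel = build_parallel_corpus(segments, lexicon)
```

The `min_len` threshold discards single-word matches, which would add nothing beyond the lexicon itself; a real system would also need to handle reordering and unknown words, which this sketch ignores.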
23. An apparatus comprising:
a word comparator operative to identify identically spelled words in a first corpus and a second corpus and build a seed lexicon including said identically spelled words, the first corpus including words in a first language and the second corpus including words in a second language; and
a lexicon builder operative to expand the seed lexicon by identifying possible translations of words in the first and second corpora using one or more clues.
Dependent claim: 24.
25. An apparatus for generating parallel corpora from non-parallel corpora, the apparatus comprising:
an alignment module operative to align text segments in two non-parallel corpora, the corpora including a source language corpus and a target language corpus; and
a matching module operative to match strings in the two non-parallel corpora and generate a parallel corpus including the matched strings as translation pairs.
Dependent claim: 26.
27. An article comprising a machine-readable medium including machine-executable instructions, the instructions operative to cause a machine to:
identify identically spelled words in a first corpus and a second corpus, the first corpus including words in a first language and the second corpus including words in a second language;
generate a seed lexicon including identically spelled words; and
expand the seed lexicon by identifying possible translations of words in the first and second corpora using one or more clues.
Dependent claim: 28.
29. An article comprising a machine-readable medium including machine-executable instructions, the instructions operative to cause a machine to:
align text segments in two non-parallel corpora, the corpora including a source language corpus and a target language corpus;
match strings in the two non-parallel corpora; and
generate a parallel corpus including the matched strings as translation pairs.
Dependent claim: 30.
Specification