Word breaker from cross-lingual phrase table
First Claim
1. A computer-implemented process, comprising:
- receiving a parallel corpus of a source language and a target language;
- applying a machine translation training process to the parallel corpus to generate a cross-lingual phrase table comprising a plurality of source language phrases, each source language phrase having at least one target language translation;
- applying a blocking operation to the cross-lingual phrase table to group phrases of the source language into blocks by searching the cross-lingual phrase table to find blocks of two or more source language phrases that share similar translations in the target language;
- searching each of the different source language phrases in each block to identify a stem of a word of the source language, the stem in each block comprising a same sequence of characters occurring in each of the different source language phrases of that block;
- searching each of the different source language phrases in each block to find a plurality of affixes of the stem of that block, each affix in each block comprising a sequence of characters preceding or following the characters comprising the stem in any of the different source language phrases in that block;
- generating a set of morphemes comprising the stems and affixes of words of the source language;
- in response to receipt of a user query in the source language, applying the set of morphemes to automatically create one or more different forms of one or more words of the user query; and
- performing an expanded query search using the automatically created different forms of the words of the user query.
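The stem, affix, and query-expansion steps recited above can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes a block is simply a list of single-word source phrases, takes the stem to be the longest character sequence shared by every phrase in the block, and expands a query word only by suffixes. The function names `longest_common_substring`, `split_block`, and `expand_query_word` are hypothetical, and the Turkish block is an illustrative example.

```python
def longest_common_substring(words):
    """Return the longest character sequence that occurs in every word."""
    shortest = min(words, key=len)
    # Try substrings of the shortest word, longest candidates first.
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in word for word in words):
                return candidate
    return ""


def split_block(block):
    """Identify the stem of a block and the affixes around it.

    Assumption: the stem is the longest substring common to all phrases;
    prefixes precede it and suffixes follow it in any phrase of the block.
    """
    stem = longest_common_substring(block)
    prefixes, suffixes = set(), set()
    if not stem:
        return stem, prefixes, suffixes
    for word in block:
        start = word.find(stem)
        before, after = word[:start], word[start + len(stem):]
        if before:
            prefixes.add(before)
        if after:
            suffixes.add(after)
    return stem, prefixes, suffixes


def expand_query_word(word, stem, suffixes):
    """Create other forms of a query word that begins with a known stem."""
    if not word.startswith(stem):
        return {word}
    return {word, stem} | {stem + suffix for suffix in suffixes}


# Illustrative block: Turkish forms of "ev" (house).
block = ["ev", "evler", "evde", "evlerde"]
stem, prefixes, suffixes = split_block(block)
expanded = expand_query_word("evde", stem, suffixes)
```

The resulting `expanded` set could then be passed to a search backend as alternative forms of the query term. A real system would also need to handle multi-word phrases and spurious matches, which this sketch ignores.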
Abstract
Automatically creating word breakers which segment words into morphemes is described, for example, to improve information retrieval, machine translation or speech systems. In embodiments, a cross-lingual phrase table is available, comprising source language (such as Turkish) phrases and potential translations in a target language (such as English) with associated probabilities. In various examples, blocks of source language phrases which have similar target language translations are created from the phrase table. In various examples, inference using the target language translations in a block enables stem and affix combinations to be found for source language words without the need for input from human judges or prior knowledge of source language linguistic rules or a source language lexicon.
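As a rough illustration of the blocking idea in the abstract, the Python sketch below groups source phrases whose translations look alike. The source does not say how similarity is measured, so the sketch assumes a deliberately crude test: two translations are similar when they reduce to the same key after lowercasing, dropping a few English function words, and stripping a plural "s". The names `translation_key` and `build_blocks`, the `min_prob` threshold, and all probabilities in the sample table are hypothetical.

```python
from collections import defaultdict

# Assumed (hypothetical) list of function words to ignore when comparing.
STOPWORDS = {"the", "a", "in", "at", "of", "to"}


def translation_key(translation):
    """Reduce a target language translation to a crude comparison key."""
    content = [w for w in translation.lower().split() if w not in STOPWORDS]
    # Strip a trailing plural "s" from longer words (a simplification).
    return " ".join(w[:-1] if w.endswith("s") and len(w) > 3 else w
                    for w in content)


def build_blocks(phrase_table, min_prob=0.1):
    """Group source phrases that share a similar translation.

    phrase_table maps a source phrase to {translation: probability}.
    Only groups with two or more source phrases are kept as blocks.
    """
    groups = defaultdict(set)
    for source, translations in phrase_table.items():
        for target, prob in translations.items():
            if prob >= min_prob:
                groups[translation_key(target)].add(source)
    return {key: sorted(members)
            for key, members in groups.items() if len(members) >= 2}


# Toy Turkish-English phrase table; the probabilities are made up.
phrase_table = {
    "ev": {"house": 0.7, "home": 0.2},
    "evler": {"houses": 0.8},
    "evde": {"in the house": 0.5, "at home": 0.4},
    "kitap": {"book": 0.9},
}
blocks = build_blocks(phrase_table)
```

Each resulting block, such as the one keyed by "house", holds source phrases that can then be compared character by character to look for a shared stem.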
19 Citations
14 Claims
11. A system, comprising:
- a general purpose computing device; and
- a computer program comprising program modules executable by the computing device, wherein the computing device is directed by the program modules of the computer program to:
  - receive a cross-lingual phrase table comprising a plurality of source language phrases, each source language phrase having at least one target language translation;
  - searching the cross-lingual phrase table to find blocks of two or more source language phrases which have similar target language translations;
  - apply the cross-lingual phrase table and blocks to infer and store, for source language words from the cross-lingual phrase table, a set of morphemes comprising stems and affixes of the words, the inference comprising:
    - searching each of the different source language phrases in each block to identify a stem of a word of the source language, the stem in each block comprising a same sequence of characters occurring in each of the different source language phrases of that block, and
    - searching each of the different source language phrases in each block to find a plurality of affixes of the stem of that block, each affix in each block comprising a sequence of characters preceding or following the sequence of characters comprising the stem in any of the different source language phrases in that block;
  - in response to receipt of a user query in the source language, apply the set of morphemes to automatically create one or more different forms of one or more words of the user query; and
  - perform an expanded query search using the automatically created different forms of the words of the user query.
12. The system of claim 11 further comprising applying probability values associated with the target language translations to calculate a probability associated with a stem and an affix.
13. A computer-readable memory having computer executable instructions stored therein, said instructions causing a computing device to execute a method comprising:
- accessing a cross-lingual phrase table comprising a plurality of source language phrases, each source language phrase having at least one target language translation;
- searching the cross-lingual phrase table to find blocks of two or more source language phrases which have similar target language translations;
- applying the cross-lingual phrase table and blocks to infer and store, for source language words from the cross-lingual phrase table, a set of morphemes comprising stems and affixes of the words, the inference comprising:
  - searching each of the different source language phrases in each block to identify a stem of a word of the source language, the stem in each block comprising a same sequence of characters occurring in each of the different source language phrases of that block, and
  - searching each of the different source language phrases in each block to find a plurality of affixes of the stem of that block, each affix in each block comprising a sequence of characters preceding or following the sequence of characters comprising the stem in any of the different source language phrases in that block;
- in response to receipt of a user query in the source language, applying the set of morphemes to automatically create one or more different forms of one or more words of the user query; and
- performing an expanded query search using the automatically created different forms of the words of the user query.
Specification